R: Generate non-repeating pairs in a data frame, skipping members of the same group

January 8, 2023

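The goal is to compare every ID with every other ID by computing a distance between them. Some IDs, however, are related because they belong to the same group, and related IDs do not need to be compared with each other.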

Consider the following data frame df:

ID AN     AW      Group
a  white  green   1
b  black  yellow  1
c  purple gray    2
d  white  gray    2

The following code produces all pairwise combinations (from the question "R Generate non repeating pairs in dataframe"):

ids <- combn(unique(df$ID), 2)
data.frame(df[match(ids[1,], df$ID), ], df[match(ids[2,], df$ID), ])

#ID   AN     AW    ID2   AN2    AW2
a   white  green   b   black  yellow
a   white  green   c   purple gray
a   white  green   d   white  gray
b   black  yellow  c   purple gray 
b   black  yellow  d   white  gray
c   purple gray    d   white  gray

I would like to know whether it is possible to skip some of these combinations and get this result instead:

#ID   AN     AW    Group   ID2   AN2    AW2   Group2
a   white  green     1      c   purple gray    2
a   white  green     1      d   white  gray    2
b   black  yellow    1      c   purple gray    2
b   black  yellow    1      d   white  gray    2

In other words, I want to avoid computing these pairs:

#ID   AN     AW    Group   ID2   AN2    AW2    Group2
a   white  green     1      b   black  yellow    1
c   purple gray      2      d   white  gray      2

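I could build every pair and then subset by comparing the groups, but the data frame is large, so this means a longer computation time: the number of combinations grows as n * (n - 1) / 2.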

Is this possible? Or do I have to generate all combinations and then filter out the comparisons within the same group?
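For reference, the generate-then-filter approach mentioned above can be sketched as follows. This is only an illustration on the four-row example; the names allPairs, g1, g2 and crossPairs are introduced here and do not come from the original question.

```r
# example data from the question
df <- data.frame(ID = letters[1:4],
                 AN = c("white", "black", "purple", "white"),
                 AW = c("green", "yellow", "gray", "gray"),
                 Group = rep(c(1, 2), each = 2),
                 stringsAsFactors = FALSE)

# all n * (n - 1) / 2 unordered ID pairs, one pair per column
allPairs <- combn(df$ID, 2)

# look up the group of each member of every pair
g1 <- df$Group[match(allPairs[1, ], df$ID)]
g2 <- df$Group[match(allPairs[2, ], df$ID)]

# keep only the pairs whose members belong to different groups
crossPairs <- allPairs[, g1 != g2, drop = FALSE]

ncol(allPairs)    # 6 pairs in total
ncol(crossPairs)  # 4 pairs across groups
```

This still materializes every pair before discarding the within-group ones, which is exactly the cost the question is trying to avoid.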

Best answer: Here is a fairly verbose base R solution that allows for more than two groups.

# create test data.frame
df <- data.frame(ID=letters[1:4], AN=c("white", "black", "purple", "white"),
                 AW=c("green", "yellow", "gray", "gray"),
                 Group=rep(c(1,2),each=2), stringsAsFactors=FALSE)

# split data.frame by group, subset df to needed variables
dfList <- split(df[, c("ID", "Group")], df$Group)
# use combn to get all group-pair combinations
groupPairs <- combn(unique(df$Group), 2)
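To see what these two objects hold, here is a quick inspection on a cut-down copy of the test data. Keeping only the ID and Group columns is an assumption made here for brevity; the calls themselves are the same as above.

```r
# cut-down test data: only the columns needed to build the pairs
df <- data.frame(ID = letters[1:4], Group = rep(c(1, 2), each = 2),
                 stringsAsFactors = FALSE)
dfList <- split(df[, c("ID", "Group")], df$Group)
groupPairs <- combn(unique(df$Group), 2)

names(dfList)     # "1" "2": split() names the pieces after the group labels
dfList[["1"]]$ID  # "a" "b"
dim(groupPairs)   # 2 1: two rows, one column per pair of groups
```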

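Next, we loop (via lapply) over all pairwise combinations of groups. For each combination, expand.grid builds a data.frame containing every pairing of an ID from the first group with an ID from the second. The IDs are extracted from the named list dfList with the [[ ]] operator, using the group labels in groupPairs[1,i] and groupPairs[2,i] as names.

# get a list of all ID combinations by group combination
# (lapply always returns a list, one data.frame per group pair)
myComparisonList <- lapply(seq_len(ncol(groupPairs)), function(i) {
                           # index dfList by name so that group labels need not be 1, 2, ...
                           expand.grid(dfList[[as.character(groupPairs[1,i])]]$ID,
                                       dfList[[as.character(groupPairs[2,i])]]$ID,
                                       stringsAsFactors=FALSE)
                           })
# stack the per-pair data.frames into one two-column data.frame of ID pairs
idsMat <- do.call(rbind, myComparisonList)

# bind comparison pairs together by column
dfDone <- cbind(df[match(idsMat[,1], df$ID), ], df[match(idsMat[,2], df$ID), ])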
# differentiate names
names(dfDone) <- paste0(names(dfDone), rep(c(".1", ".2"),
                        each=length(names(df))))
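
As a rough check that the idea carries over to more than two groups, the steps can be wrapped in a small function and tried on made-up data. The helper name crossGroupPairs, the toy data frame and the column names ID.1 and ID.2 are hypothetical and not part of the original answer; this is a sketch, not a tuned implementation.

```r
# hypothetical helper (not from the original answer): returns one row per
# pair of IDs that belong to different groups
crossGroupPairs <- function(df) {
  # IDs of each group, named by group label
  idsByGroup <- split(df$ID, df$Group)
  # every unordered pair of group labels, one pair per column
  groupPairs <- combn(names(idsByGroup), 2)
  # for each pair of groups, every ID of the first with every ID of the second
  pairList <- lapply(seq_len(ncol(groupPairs)), function(i) {
    expand.grid(ID.1 = idsByGroup[[groupPairs[1, i]]],
                ID.2 = idsByGroup[[groupPairs[2, i]]],
                stringsAsFactors = FALSE)
  })
  do.call(rbind, pairList)
}

# made-up data: three groups of sizes 2, 1 and 3
toy <- data.frame(ID = letters[1:6],
                  Group = c("x", "x", "y", "z", "z", "z"),
                  stringsAsFactors = FALSE)
xPairs <- crossGroupPairs(toy)

nrow(xPairs)  # 2*1 + 2*3 + 1*3 = 11
# no pair joins two IDs from the same group
g <- setNames(toy$Group, toy$ID)
any(g[xPairs$ID.1] == g[xPairs$ID.2])  # FALSE
```

Only cross-group pairs are ever built here, so the number of rows is the sum of the products of the group sizes rather than n * (n - 1) / 2.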