通过大数字位置向量非常快速地顺序对数字位置列上的大数据帧(或数据表)进行子集化

2024年1月2日 236次阅读

我想更快地做到这一点：

set.seed(100)
pos     <- sample(1:100000000, 10000000, replace=F)
df      <- data.table(pos, name="arbitrary_string")
query   <- sample(1:100000000, 10000, replace=F)
df_list <- lapply(query, function(x) subset(df, pos >= x - 10000 & pos <= x + 10000))
output  <- rbindlist(df_list)

所以基本上,我循环遍历位置X的向量并从数据帧中提取每一行,该数据帧在“pos”列中的数字落在定义为[X – 10000,X 10000]的范围之间.我希望在“输出”中多次表示一些行;这是可取的.排序不必与此问题中发布的代码返回的顺序相同.

这个玩具示例基于更大的数据集,我估计在上面编程的单核上运行大约需要10,000小时.因此,对这个问题有一个更快速的解决方案对我来说很有价值.我想要一个纯粹的R解决方案,但我对涉及其他语言的解决方案持开放态度.

最佳答案在这个解决方案中,我将假设行排序无关紧要(不幸的是,如果您需要原始帖子中的确切行排序,它将无法工作).我建议的方法是：

>使用cumsum确定您需要来自pos变量中每个范围的元素的次数.
>使用对cut函数的单个调用确定df $pos的每个元素落入的范围.
>抓住每一行适当的次数,仅将df子集化一次.

这种方法减少了扫描df并抓取子集的次数,这应该会产生显着的加速.让我们从一个可重现的例子开始：

library(data.table)
set.seed(144)
pos <- sample(1:100000000, 10000000, replace=F)
df <- data.table(pos, name="arbitrary_string")
query <- c(100000, 101000, 200000)

现在,让我们确定每个范围内需要行的范围和次数：

query.cut <- rbind(data.frame(x=query-10000, y=1), data.frame(x=query+10001, y=-1))
query.cut <- query.cut[order(query.cut$x),]
query.cut$y <- cumsum(query.cut$y)
query.cut
#        x y
# 1  90000 1
# 2  91000 2
# 4 110001 1
# 5 111001 0
# 3 190000 1
# 6 210001 0

我们将取pos值为90000-90999的行,pos值为91000-110000的行为两次,pos值为110001-111000的行为一次,行的值为190000-210000一次.

要确定元素属于哪个范围,我们可以使用cut函数,在query.cut表中查找相关的复制数：

num.rep <- query.cut$y[as.numeric(cut(df$pos, query.cut$x))]
num.rep[is.na(num.rep)] <- 0
table(num.rep)
# num.rep
#       0       1       2 
# 9995969    2137    1894

对于我们当前的小查询,几乎所有行都不会被采用.最后一步是抓住每一行适当的次数.

output <- df[rep(1:nrow(df), times=num.rep),]

即使使用相对较少的查询集(我这里300),我们也可以获得相当稳定的加速：

OP <- function(query) {
  df_list <- lapply(query, function(x) subset(df, pos >= x - 10000 & pos <= x + 10000))
  rbindlist(df_list)
}    
josilber <- function(query) {
  query.cut <- rbind(data.frame(x=query-10000, y=1), data.frame(x=query+10001, y=-1))
  query.cut <- query.cut[order(query.cut$x),]
  query.cut$y <- cumsum(query.cut$y)
  query.cut <- query.cut[!duplicated(query.cut$x, fromLast=T),]
  num.rep <- query.cut$y[as.numeric(cut(df$pos, query.cut$x))]
  num.rep[is.na(num.rep)] <- 0
  df[rep(1:nrow(df), times=num.rep),]
}

set.seed(144)
big.query <- sample(df$pos, 300)
system.time(OP(big.query))
#    user  system elapsed 
# 196.693  17.824 217.141 
system.time(josilber(big.query))
#    user  system elapsed 
#   3.418   0.124   3.673

随着查询集的大小增加,新方法的优势应该变得更大,因为它仍然只通过df $pos进行一次传递,而原始方法为查询中的每个元素进行一次传递(也就是新方法渐近更快) .