Check whether a value in one table (X) lies between the values of two columns in another table (Y) with R data.table

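Terrible title, I know, but here is what I am trying to achieve. For Table 1, I want to add a "BETWEEN" column that indicates whether "POSITION" falls between any pair of "START" and "STOP" values for the corresponding "BIN" in Table 2.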

Table 1: BIN names (character) and POSITION (numeric):

  BIN    POSITION
    1          12
    1          52
    1          86
    7           6
    7          22
    X         112
    X         139
   MT           3
   MT          26

Table 2: BIN names (character) and START and STOP positions (numeric):

  BIN    START    STOP
    1        2      64
    1       90     110
    7       20     100
    7      105     200
    X        1       5
   MT        1    1000

And the desired result, Table 1 with "BETWEEN" added:

  BIN    POSITION      BETWEEN
    1          12         TRUE
    1          52         TRUE
    1          86        FALSE
    7           6        FALSE
    7          22         TRUE
    X         112        FALSE
    X         139        FALSE
   MT           3         TRUE
   MT          26         TRUE

My Table 1 has about 4,000,000 rows and Table 2 has about 500,000 rows, and everything I have come up with is very slow.

As an example with larger tables, use the following:

library(data.table)

positions <- seq(1,100000,10)
bins <- c("A","B","C","D","E","F","G","H","I","J")

tab1 <- data.table(bin = rep(bins,1,each=length(positions)), pos = rep(positions,10))

tab2 <- data.table(bin = rep(bins,1,each=2000), start = seq(5,100000,50))
tab2[, stop := start+25]  # `start` is not visible as a column inside the data.table() call itself

The desired output is:

tab1
        bin   pos  between
     1:   A     1    FALSE
     2:   A    11     TRUE
     3:   A    21     TRUE
     4:   A    31    FALSE
     5:   A    41    FALSE

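Best answer:

The following approach requires that, within a given bin, the intervals are mutually exclusive. (For example, you cannot have a bin A with bounds 1-5 and another bin A with bounds 4-8.) I also modified your example.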

library(data.table)

positions <- seq(1,100000,10)
bins <- c("A","B","C","D","E","F","G","H","I","J")
tab1 <- data.table(bin = rep(bins,1,each=length(positions)), pos = rep(positions,10))
setkey(tab1,"bin","pos")

tab2 <- data.table(bin = rep(bins,1,each=2000), start = seq(5,100000,50))
tab2[, end := start+25]

# x: for each tab1 row, the interval whose start is the closest one at or before pos
tab2[,pos:=start]
setkey(tab2,"bin","pos")
x<-tab2[tab1, roll=TRUE, nomatch=0]

# y: for each tab1 row, the interval whose end is the closest one at or after pos
tab2[,pos:=end]
setkey(tab2,"bin","pos")
y<-tab2[tab1, roll=-Inf, nomatch=0]

# Inner join: keep only rows where both joins found the same interval (same start)
setkey(x,"bin","pos","start")
setkey(y,"bin","pos","start")
inBin<-x[y,nomatch=0]
inBin[, between:=TRUE]

# Join back onto tab1; rows without a match get between = FALSE
setkey(tab1,"bin","pos")
setkey(inBin,"bin","pos")

result<-inBin[,list(bin,pos,between)][tab1]
result[is.na(between), between:=FALSE]
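
The joins above rely on the requirement that intervals within a bin never overlap. The following is a minimal sketch for checking that requirement before running them; it is not part of the original answer. It uses Table 2 from the question with lower-case column names, and `overlap_check` is a name chosen here for illustration. It assumes boundaries are inclusive, so an interval that starts exactly at the previous interval's end counts as overlapping.

```r
library(data.table)

# Table 2 from the question, with lower-case column names (illustrative naming).
tab2 <- data.table(bin   = c("1", "1", "7", "7", "X", "MT"),
                   start = c(2, 90, 20, 105, 1, 1),
                   end   = c(64, 110, 100, 200, 5, 1000))

# Sort by bin and start, then flag any bin in which an interval starts
# at or before the end of the previous interval in that bin.
setorder(tab2, bin, start)
overlap_check <- tab2[, .(overlap = any(start <= shift(end), na.rm = TRUE)), by = bin]

print(overlap_check)
```

If any bin is flagged, the rolling joins may pair a position with the wrong interval, so overlapping intervals would need to be merged first.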

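I don't have time right now to explain my solution in depth. Instead, I'll take the cheap way out and refer you to the documentation of data.table's roll argument. The basic approach above is this: I join tab1 and tab2 so that each pos is matched to the nearest start boundary at or before it (roll=TRUE), then join them again so that each pos is matched to the nearest end boundary at or after it (roll=-Inf). An inner join of those two sets gives all rows in tab1 that fall within the boundaries of a bin. From that point on, it is just grunt work.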
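
To make the rolling idea concrete, here is a minimal sketch on the small tables from the question. It is a simplification, not the code above: it does a single rolling join on start and then compares pos with the matched end, which again assumes that intervals within a bin do not overlap and that boundaries are inclusive. The lower-case column names and the name `rolled` are chosen here for illustration.

```r
library(data.table)

# Small tables from the question, with lower-case column names (illustrative naming).
tab1 <- data.table(bin = c("1", "1", "1", "7", "7", "X", "X", "MT", "MT"),
                   pos = c(12, 52, 86, 6, 22, 112, 139, 3, 26))
tab2 <- data.table(bin   = c("1", "1", "7", "7", "X", "MT"),
                   start = c(2, 90, 20, 105, 1, 1),
                   end   = c(64, 110, 100, 200, 5, 1000))

# Key tab2 on (bin, pos) with pos set to the interval start, so that the
# rolling join can carry the most recent start forward to each query position.
tab2[, pos := start]
setkey(tab2, bin, pos)

# roll = TRUE: for each row of tab1, pick the interval in the same bin with
# the largest start <= pos. Rows with no such interval get NA for start/end.
rolled <- tab2[tab1, roll = TRUE]

# A position is inside an interval if a start was found and pos <= its end.
tab1[, between := !is.na(rolled$end) & pos <= rolled$end]

print(tab1)
```

The result keeps tab1's original row order, because the join returns one row per row of tab1 as long as tab2 has no duplicate (bin, start) pairs.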
