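Terrible title, I know, but here is what I am trying to achieve. For Table 1, I want to add a "BETWEEN" column that checks whether "POSITION" falls between any of the "START" and "STOP" values for the corresponding "BIN" in Table 2.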
Table 1: BIN name (character) and POSITION within the BIN (numeric):
BIN POSITION
1 12
1 52
1 86
7 6
7 22
X 112
X 139
MT 3
MT 26
Table 2: BIN name (character) and START and STOP positions (numeric):
BIN START STOP
1 2 64
1 90 110
7 20 100
7 105 200
X 1 5
MT 1 1000
And the desired result, Table 1 with "BETWEEN":
BIN POSITION BETWEEN
1 12 TRUE
1 52 TRUE
1 86 FALSE
7 6 FALSE
7 22 TRUE
X 112 FALSE
X 139 FALSE
MT 3 TRUE
MT 26 TRUE
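My Table 1 has about 4,000,000 rows and Table 2 has about 500,000 rows, and everything I have come up with is slow.
As an example of larger tables, use the following: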
library(data.table)
positions <- seq(1, 100000, 10)
bins <- c("A","B","C","D","E","F","G","H","I","J")
tab1 <- data.table(bin = rep(bins, each = length(positions)), pos = rep(positions, 10))
tab2 <- data.table(bin = rep(bins, each = 2000), start = rep(seq(5, 100000, 50), 10))
tab2[, stop := start + 25]  # stop cannot refer to start inside the data.table() call itself
The desired output is:
tab1
bin pos between
1: A 1 FALSE
2: A 11 TRUE
3: A 21 TRUE
4: A 31 FALSE
5: A 41 FALSE
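Best answer:
The following approach requires that, for a given bin, the intervals are mutually exclusive. (For example, you cannot have a bin A with bounds 1-5 and another bin A with bounds 4-8.) Also, I modified your example.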
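library(data.table)
positions <- seq(1, 100000, 10)
bins <- c("A","B","C","D","E","F","G","H","I","J")
tab1 <- data.table(bin = rep(bins, each = length(positions)), pos = rep(positions, 10))
setkey(tab1, "bin", "pos")
tab2 <- data.table(bin = rep(bins, each = 2000), start = rep(seq(5, 100000, 50), 10))
tab2[, end := start + 25]
# Join keyed on start: roll = TRUE matches each pos to the nearest start at or before it
tab2[, pos := start]
setkey(tab2, "bin", "pos")
x <- tab2[tab1, roll = TRUE, nomatch = 0]
# Join keyed on end: roll = -Inf matches each pos to the nearest end at or after it
tab2[, pos := end]
setkey(tab2, "bin", "pos")
y <- tab2[tab1, roll = -Inf, nomatch = 0]
# Keep only the rows where both joins found the same interval (same start)
setkey(x, "bin", "pos", "start")
setkey(y, "bin", "pos", "start")
inBin <- x[y, nomatch = 0]
inBin[, between := TRUE]
# Join back to tab1; positions with no enclosing interval come back as NA and are set to FALSE
setkey(tab1, "bin", "pos")
setkey(inBin, "bin", "pos")
result <- inBin[, list(bin, pos, between)][tab1, on = c("bin", "pos")]
result[is.na(between), between := FALSE]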
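The mutual-exclusivity requirement stated at the top of this answer can be checked before running the joins. The following is only a minimal sketch: `has_overlap` is a hypothetical helper name, not part of data.table, and it assumes the interval table has columns named `bin`, `start` and `end` as in `tab2` above. Because both bounds are treated as inclusive, two intervals that merely touch (one ends at 5, the next starts at 5) are counted as overlapping.

```r
library(data.table)

# Hypothetical helper: TRUE if any two intervals in the same bin overlap or touch.
# Assumes columns bin, start and end, with both bounds inclusive.
has_overlap <- function(dt) {
  d <- dt[order(bin, start)]                 # sorted copy; the input is left untouched
  d[, prev_end := shift(end), by = bin]      # end of the preceding interval in the same bin
  d[, any(!is.na(prev_end) & start <= prev_end)]
}

ok  <- data.table(bin = c("A", "A", "B"), start = c(1, 6, 1), end = c(5, 8, 9))
bad <- data.table(bin = c("A", "A"),      start = c(1, 4),    end = c(5, 8))

has_overlap(ok)
has_overlap(bad)
```

If the check reports an overlap, the rolling-join approach can silently miss positions that do lie inside an interval, so the intervals would need to be merged per bin first.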
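I don't have time right now to explain my solution in depth. Instead, I'll take the cheap way out and refer you to the roll argument of data.table. The basic approach above is that I join tab1 to tab2 keyed on the start boundaries with roll=TRUE, which matches each pos to the nearest start at or before it. Then I join tab1 to tab2 keyed on the end boundaries with roll=-Inf, which matches each pos to the nearest end at or after it. Then I do an inner join of these two sets on bin, pos and start, which gives me all rows of tab1 whose two matches are the same interval, that is, all rows that lie within the bounds of a bin. From there, it's just grunt work.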
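To make the two rolling joins concrete, here is a small sketch that applies the same idea to the small tables from the question. The names `t1`, `t2`, `lo`, `hi` and `inside` are chosen here for illustration only, and the final steps use `merge()` and an update join rather than the keyed joins in the answer, purely to keep the example short.

```r
library(data.table)

# Table 1 and Table 2 from the question (STOP is named end here)
t1 <- data.table(bin = c("1", "1", "1", "7", "7", "X", "X", "MT", "MT"),
                 pos = c(12, 52, 86, 6, 22, 112, 139, 3, 26))
t2 <- data.table(bin   = c("1", "1", "7", "7", "X", "MT"),
                 start = c(2, 90, 20, 105, 1, 1),
                 end   = c(64, 110, 100, 200, 5, 1000))

# Nearest start at or before each pos (the last start is carried forward)
t2[, pos := start]
setkey(t2, bin, pos)
lo <- t2[t1, roll = TRUE, nomatch = 0]

# Nearest end at or after each pos (the next end is carried backward)
t2[, pos := end]
setkey(t2, bin, pos)
hi <- t2[t1, roll = -Inf, nomatch = 0]

# A position is inside an interval only if both joins landed on the same interval
inside <- merge(lo, hi, by = c("bin", "pos", "start"))

# Flag the matching rows of t1 in place; everything else stays FALSE
t1[, between := FALSE]
t1[inside, on = c("bin", "pos"), between := TRUE]
t1
```

For example, position 86 in bin 1 rolls to the interval starting at 2 in the first join but to the interval ending at 110 (which starts at 90) in the second, so the two matches disagree and the row stays FALSE.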