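I wrote some code that fills an empty matrix with relevant data points selected from a dataset.
The code works, but it gets slower as the dataset grows (the full dataset has about 100k rows) because I use a lot of loops.
If anyone has tips on how to do this more efficiently, I would be grateful. I already use table()[] indexing and have tried many other things from the apply family, but this is the best I could come up with.
Let's say the dataset looks like this: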
data<-structure(c("concentration permitted by column 3", "concentration permitted under the national",
"concentration phenomena nonlinear dynamics", "concentration phosphorus concentrations phosphorus load",
"concentration plan in greek language", "concentration plan in political science",
"58", "104", "43", "114", "102", "58"), .Dim = c(6L, 2L), .Dimnames = list(
c("", "", "", "", "", ""), NULL))
And let's assume the matrix looks like this:
mat<-structure(c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), .Dim = c(4L,
4L), .Dimnames = list(c("phosphorus", "interest", "concentration", "phenomena"
), c("phosphorus", "interest", "concentration", "phenomena")))
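If the words given by the row name and the column name of a cell, e.g. mat[1,3], occur together in an entry of data[,1], I want to add the corresponding count from data[,2] to mat[1,3].
In other words, "phosphorus" and "concentration" occur together in the dataset (in data[4,]), and the count there is "114" (data[4,2]). That value should be written to mat[1,3].
So I expect this result: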
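For a single cell, the rule can be sketched as follows. This is only an illustration under the assumption that words are matched as whole whitespace-separated tokens; the names phrases, counts, tokens, both and cell_value are made up for this sketch and do not appear in the original code.

```r
# Phrases and counts copied from the example data above
phrases <- c("concentration permitted by column 3",
             "concentration permitted under the national",
             "concentration phenomena nonlinear dynamics",
             "concentration phosphorus concentrations phosphorus load",
             "concentration plan in greek language",
             "concentration plan in political science")
counts <- c(58, 104, 43, 114, 102, 58)

# Split each phrase into whitespace-separated words
tokens <- strsplit(phrases, "\\s+")

# TRUE for phrases that contain both words as whole tokens
both <- vapply(tokens,
               function(tok) all(c("phosphorus", "concentration") %in% tok),
               logical(1))

# Sum of the counts of all phrases containing both words:
# this is the value for mat["phosphorus", "concentration"]
cell_value <- sum(counts[both])
cell_value
```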
mat
              phosphorus interest concentration phenomena
phosphorus           114        0           114         0
interest               0        0             0         0
concentration        114        0           479        43
phenomena              0        0            43        43
This is how I do it at the moment:
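# Split every phrase in data[,1] into its whitespace-separated words
data_words <- vector("list", nrow(data))
for (i in seq_len(nrow(data))) {
  data_words[[i]] <- unlist(regmatches(data[i, 1], gregexpr("(\\S+)", data[i, 1], perl = TRUE)))
}
# For every cell of mat, add the count of each phrase that contains
# both the row word and the column word
for (i in seq_len(nrow(mat))) {
  for (j in seq_len(ncol(mat))) {
    for (k in seq_along(data_words)) {
      # table(word)[words_of_phrase] is non-NA only where the word occurs
      if (sum(table(rownames(mat)[i])[data_words[[k]]], na.rm = TRUE) > 0 &&
          sum(table(colnames(mat)[j])[data_words[[k]]], na.rm = TRUE) > 0) {
        mat[i, j] <- as.numeric(mat[i, j]) + as.numeric(data[k, 2])
      }
    }
  }
}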
Best answer
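# y[k, w] is TRUE when word w matches (as a regex, so also as a substring) in phrase k
y <- sapply(colnames(mat), function(x) grepl(x,data[,1]))
# z lists every (row, column) index pair of mat
z <- expand.grid(seq_along(colnames(mat)),seq_along(colnames(mat)))
# x has one row per pair and marks the two words of that pair (2 on the diagonal)
x <- matrix(0,dim(z)[1],length(colnames(mat)))
x[cbind(seq_along(z[,1]),z[,1])] <- 1
x[cbind(seq_along(z[,1]),z[,2])] <- x[cbind(seq_along(z[,1]),z[,2])] + 1
# A phrase scores above 1 only if it contains both words; sum the counts of those phrases
mat[as.matrix(z)] <- (x %*% t(y) > 1) %*% as.numeric(data[,2])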
> mat
              phosphorus interest concentration phenomena
phosphorus           114        0           114         0
interest               0        0             0         0
concentration        114        0           479        43
phenomena              0        0            43        43
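
One possible variation on the same idea is sketched below; it is not part of the answer above. It builds the word-by-phrase indicator matrix from whole tokens (grepl would also match "concentration" inside "concentrations") and then gets all pairwise sums in one step with crossprod(). The names words, counts, tokens and result are introduced here for illustration only, and the data is rebuilt without dimnames so the block runs on its own.

```r
# Example data copied from the question (dimnames omitted for brevity)
data <- structure(c("concentration permitted by column 3",
                    "concentration permitted under the national",
                    "concentration phenomena nonlinear dynamics",
                    "concentration phosphorus concentrations phosphorus load",
                    "concentration plan in greek language",
                    "concentration plan in political science",
                    "58", "104", "43", "114", "102", "58"),
                  .Dim = c(6L, 2L))
words  <- c("phosphorus", "interest", "concentration", "phenomena")
counts <- as.numeric(data[, 2])

# Split each phrase into whitespace-separated tokens
tokens <- strsplit(data[, 1], "\\s+")

# y[k, w] is 1 when word w occurs as a whole token in phrase k, else 0
y <- t(vapply(tokens, function(tok) words %in% tok, logical(length(words)))) * 1
colnames(y) <- words

# result[i, j] = sum over phrases k of y[k, i] * counts[k] * y[k, j],
# i.e. the summed counts of all phrases containing both word i and word j
result <- crossprod(y, y * counts)
result
```

On the example data this should give the same matrix as shown above. Whether it is fast enough for 100k rows depends on how many words are involved; for a large vocabulary, a sparse indicator matrix may be worth considering.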