I have an R function that generates k-skip-n-grams.
My complete function can be found on GitHub.
My code does generate the required k-skip-n-grams:
> kSkipNgram("Lorem ipsum dolor sit amet, consectetur adipiscing elit.", n=2, skip=1)
[1] "Lorem dolor" "Lorem ipsum" "ipsum sit"
[4] "ipsum dolor" "dolor amet" "dolor sit"
[7] "sit consectetur" "sit amet" "amet adipiscing"
[10] "amet consectetur" "consectetur elit" "consectetur adipiscing"
[13] "adipiscing elit"
However, I would like to generalize/simplify the following switch statement with its nested for loops:
# x    - should be text, a sentence
# n    - n-gram
# skip - number of skips
###################################
switch(as.character(n),
  "0" = {
    ngram <- c(ngram, paste(x[i]))
  },
  "1" = {
    for (j in skip:1) {
      if (i + j <= length(x)) {
        ngram <- c(ngram, paste(x[i], x[i + j]))
      }
    }
  },
  "2" = {
    for (j in skip:1) {
      for (k in skip:1) {
        if (i + j <= length(x) && i + j + k <= length(x)) {
          ngram <- c(ngram, paste(x[i], x[i + j], x[i + j + k]))
        }
      }
    }
  },
  "3" = {
    for (j in skip:1) {
      for (k in skip:1) {
        for (l in skip:1) {
          if (i + j <= length(x) && i + j + k <= length(x) && i + j + k + l <= length(x)) {
            ngram <- c(ngram, paste(x[i], x[i + j], x[i + j + k], x[i + j + k + l]))
          }
        }
      }
    }
  },
  "4" = {
    for (j in skip:1) {
      for (k in skip:1) {
        for (l in skip:1) {
          for (m in skip:1) {
            if (i + j <= length(x) && i + j + k <= length(x) && i + j + k + l <= length(x) && i + j + k + l + m <= length(x)) {
              ngram <- c(ngram, paste(x[i], x[i + j], x[i + j + k], x[i + j + k + l], x[i + j + k + l + m]))
            }
          }
        }
      }
    }
  }
)
}
}
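Every case in the switch above follows one pattern: pick one step between skip and 1 for each additional word, add the steps up, and keep the combination only if the last index is still inside x. As a rough sketch of how that pattern could be collapsed into a single loop, here it is in Python rather than R. The names skip_ngrams_per_gap, extra_words and max_step are hypothetical; the sketch assumes extra_words and max_step play the roles that n and skip have at the point of the switch, and that max_step is at least 1.

```python
from itertools import product


def skip_ngrams_per_gap(tokens, extra_words, max_step):
    """Hypothetical helper mirroring the nested loops of the switch.

    tokens      -- list of words (the role of x)
    extra_words -- number of words appended after the start word
                   (assumed to be the value of n at the switch)
    max_step    -- largest step between consecutive words
                   (assumed to be the value of skip at the switch)
    """
    grams = []
    for i in range(len(tokens)):
        # product(...) yields every (j, k, l, ...) tuple, in the same order
        # as extra_words nested "for (j in skip:1)" loops would.
        for steps in product(range(max_step, 0, -1), repeat=extra_words):
            positions = [i]
            for step in steps:
                positions.append(positions[-1] + step)
            # Steps are positive, so checking the last index is enough.
            if positions[-1] < len(tokens):
                grams.append(" ".join(tokens[p] for p in positions))
    return grams


if __name__ == "__main__":
    words = "Lorem ipsum dolor sit amet consectetur adipiscing elit".split()
    for gram in skip_ngrams_per_gap(words, extra_words=1, max_step=2):
        print(gram)
```

Called with extra_words=1 and max_step=2 on the eight words of the example sentence, this yields the same 13 pairs in the same order as the output shown earlier. Note that it limits each gap separately, which is what the nested loops do; it does not limit the total number of skipped words.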
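Best answer

I used a recursive solution for general k-skip-n-grams. I have included it in Python; I have no experience with R, but hopefully you can translate it. I used the definition from this paper: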
http://homepages.inf.ed.ac.uk/ballison/pdf/lrec_skipgrams.pdf
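If you are going to use it on long sentences, it should probably be optimized with some dynamic programming, because it currently does a large amount of redundant computation (the same sub-grams are computed repeatedly). I also have not tested it thoroughly, so there may be corner cases where it breaks.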
def kskipngrams(sentence,k,n):
    "Assumes the sentence is already tokenized into a list"
    if n == 0 or len(sentence) == 0:
        return None
    grams = []
    for i in range(len(sentence)-n+1):
        grams.extend(initial_kskipngrams(sentence[i:],k,n))
    return grams

def initial_kskipngrams(sentence,k,n):
    if n == 1:
        return [[sentence[0]]]
    grams = []
    for j in range(min(k+1,len(sentence)-1)):
        kmjskipnm1grams = initial_kskipngrams(sentence[j+1:],k-j,n-1)
        if kmjskipnm1grams is not None:
            for gram in kmjskipnm1grams:
                grams.append([sentence[0]]+gram)
    return grams
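
One possible way to cut down the redundant computation mentioned above is to cache sub-results. The following is only a sketch under my own assumptions: kskipngrams_memo and grams_from are hypothetical names, the recursion works on start indices instead of list slices so that its arguments are hashable, and it returns an empty list where the function above returns None.

```python
from functools import lru_cache


def kskipngrams_memo(sentence, k, n):
    """Hypothetical memoized variant; expects an already tokenized list."""
    tokens = tuple(sentence)
    if n <= 0 or not tokens:
        return []  # assumption: empty list instead of None

    @lru_cache(maxsize=None)
    def grams_from(start, skips_left, size):
        # All grams of `size` tokens that begin at tokens[start] and skip
        # at most `skips_left` tokens in total. Results are cached per
        # (start, skips_left, size), so repeated sub-problems are reused.
        if size == 1:
            return ((tokens[start],),)
        found = []
        # j is the number of tokens skipped right after tokens[start].
        for j in range(min(skips_left + 1, len(tokens) - start - 1)):
            for tail in grams_from(start + j + 1, skips_left - j, size - 1):
                found.append((tokens[start],) + tail)
        return tuple(found)

    result = []
    for i in range(len(tokens) - n + 1):
        result.extend(list(gram) for gram in grams_from(i, k, n))
    return result


if __name__ == "__main__":
    for gram in kskipngrams_memo("a b c d".split(), k=1, n=3):
        print(gram)
```

For the four tokens a, b, c, d with k=1 and n=3, this produces the grams a b c, a b d, a c d and b c d. Here k bounds the total number of skipped words per gram, following the paper's definition, which differs from limiting every gap separately.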