K-Skip-N-Gram: generalizing nested for loops in R

I have an R function that generates K-Skip-N-Grams.

My full function can be found on GitHub.

My code does produce the desired k-skip-n-grams:

> kSkipNgram("Lorem ipsum dolor sit amet, consectetur adipiscing elit.", n=2, skip=1)
 [1] "Lorem dolor"            "Lorem ipsum"            "ipsum sit"             
 [4] "ipsum dolor"            "dolor amet"             "dolor sit"             
 [7] "sit consectetur"        "sit amet"               "amet adipiscing"       
[10] "amet consectetur"       "consectetur elit"       "consectetur adipiscing"
[13] "adipiscing elit"       

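However, I would like to generalize/simplify the following switch statement, which hard-codes one level of nested for loops per value of n: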

# x - the input text (a sentence)
# n - n-gram order
# skip - number of skips
###################################
  switch(as.character(n),
         "0" = {ngram<-c(ngram, paste(x[i]))},
         "1" = {for(j in skip:1)
                  {
                    if (i+j <= length(x)) 
                      {ngram<-c(ngram, paste(x[i],x[i+j]))}
                  }
                },
         "2" = {for(j in skip:1)
                  {for (k in skip:1)
                    {
                      if (i+j <= length(x) && i+j+k <= length(x)) 
                        {ngram<-c(ngram, paste(x[i],x[i+j],x[i+j+k]))}
                    }
                  }
                },
         "3" = {for(j in skip:1)
                  {for (k in skip:1)
                    {for (l in skip:1)
                      {
                      if (i+j <= length(x) && i+j+k <= length(x) && i+j+k+l <= length(x)) 
                          {ngram<-c(ngram, paste(x[i],x[i+j],x[i+j+k],x[i+j+k+l]))}
                      }
                    }
                  }
                },
         "4" = {for(j in skip:1)
                  {for (k in skip:1)
                      {for (l in skip:1)
                        {for (m in skip:1)
                            {
                            if (i+j <= length(x) && i+j+k <= length(x) && i+j+k+l <= length(x) && i+j+k+l+m <= length(x)) 
                                  {ngram<-c(ngram, paste(x[i],x[i+j],x[i+j+k],x[i+j+k+l],x[i+j+k+l+m]))}
                            }
                        }
                      }
                    }
                  }
        )
  }
}
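
The pattern in the switch above (one extra loop per extra word, each loop running over skip:1) can be collapsed into a single loop over all combinations of step sizes. Below is a minimal sketch in Python rather than R. The names `nested_skip_grams`, `gaps` and `max_step` are hypothetical, and because the full function is not shown here, how the question's `n` and `skip` map onto them is an assumption.

```python
from itertools import product

def nested_skip_grams(tokens, gaps, max_step):
    """Emulate `gaps` nested loops, each running over max_step..1.

    tokens   -- list of words (assumed already tokenized)
    gaps     -- number of nested loops, i.e. words per gram minus one
    max_step -- largest distance allowed between neighbouring words
    """
    grams = []
    for i in range(len(tokens)):
        # product(...) stands in for the nested loops over j, k, l, m, ...
        # With gaps == 0 it yields one empty tuple, giving single words.
        for steps in product(range(max_step, 0, -1), repeat=gaps):
            positions = [i]
            for step in steps:
                positions.append(positions[-1] + step)
            # Steps are all positive, so checking the last index is enough.
            if positions[-1] < len(tokens):
                grams.append(" ".join(tokens[p] for p in positions))
    return grams

words = "Lorem ipsum dolor sit amet consectetur adipiscing elit".split()
print(nested_skip_grams(words, gaps=1, max_step=2))
```

With punctuation already stripped, `gaps=1` and `max_step=2` yield the same 13 pairs, in the same order, as the sample output above. The same idea should carry over to R by iterating over the rows of `expand.grid` instead of `product`.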

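Best answer: I wrote a recursive solution for general k-skip-n-grams. It is in Python; I have no experience with R, but hopefully you can translate it. I used the definition from this paper: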

http://homepages.inf.ed.ac.uk/ballison/pdf/lrec_skipgrams.pdf

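If you are going to use this on long sentences, it should probably be optimized with some dynamic programming, because it currently does a lot of redundant computation (the same sub-grams are computed repeatedly). I also have not tested it thoroughly, so there may be problems.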

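def kskipngrams(sentence, k, n):
    "Assumes the sentence is already tokenized into a list"
    # k is the total number of skipped tokens allowed within one gram.
    if n == 0 or len(sentence) == 0:
        return []
    grams = []
    # Collect the grams that start at each position of the sentence.
    for i in range(len(sentence) - n + 1):
        grams.extend(initial_kskipngrams(sentence[i:], k, n))
    return grams

def initial_kskipngrams(sentence, k, n):
    "Returns the k-skip-n-grams that begin with sentence[0]"
    if n == 1:
        return [[sentence[0]]]
    grams = []
    # Skip j tokens after the first word, then spend the remaining
    # skip budget (k - j) on the rest of the gram.
    for j in range(min(k + 1, len(sentence) - 1)):
        kmjskipnm1grams = initial_kskipngrams(sentence[j + 1:], k - j, n - 1)
        for gram in kmjskipnm1grams:
            grams.append([sentence[0]] + gram)
    return grams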
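
The dynamic-programming optimization mentioned above could look roughly like the sketch below, which caches the grams for each (start index, remaining skips, remaining length) triple so that each is computed only once. This is one possible approach, not a tested or benchmarked implementation; the names `kskipngrams_memo` and `starting_at` are hypothetical.

```python
from functools import lru_cache

def kskipngrams_memo(sentence, k, n):
    """k-skip-n-grams of a tokenized sentence, with cached sub-results."""
    tokens = tuple(sentence)
    if n == 0 or not tokens:
        return []

    @lru_cache(maxsize=None)
    def starting_at(start, budget, size):
        # All grams of `size` tokens that begin at tokens[start] and skip
        # at most `budget` tokens in total. Tuples keep results hashable.
        if size == 1:
            return ((tokens[start],),)
        grams = []
        remaining = len(tokens) - start
        for j in range(min(budget + 1, remaining - 1)):
            for tail in starting_at(start + j + 1, budget - j, size - 1):
                grams.append((tokens[start],) + tail)
        return tuple(grams)

    result = []
    for i in range(len(tokens) - n + 1):
        result.extend(list(gram) for gram in starting_at(i, k, n))
    return result

print(kskipngrams_memo("a b c d".split(), 1, 2))
```

Working with indices instead of slices avoids copying the sentence at every level of recursion, and the cache is local to each call, so it is discarded once the result has been built.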