simple_triplet_matrix error – unable to use RWeka to count phrases

August 4, 2019

Using tm, I compare a DocumentTermMatrix against a dictionary list to compute totals:

totals <- inspect(DocumentTermMatrix(x, list(dictionary = d)))

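This works for single words, but I would also like to include multi-word phrases (bigrams) and cannot figure out how to do it.

I tried RWeka: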

TrigramTokenizer <- function(x) NGramTokenizer(x, 
                                               Weka_control(min = 3, max = 3))
tdm <- TermDocumentMatrix(v.corpus, 
                          control = list(tokenize = TrigramTokenizer))

But I get the following error message:

Error in simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms),  : 
  'i, j, v' different lengths
In addition: Warning messages:
1: In parallel::mclapply(x, termFreq, control) :
  all scheduled cores encountered errors in user code
2: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
3: In simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms),  :
  NAs introduced by coercion.

Can you help me resolve this error message?

Thanks!!

Best answer

See my answer here.

There seem to be problems using RWeka together with the parallel package. I found a workaround.
The most important point is not to load the RWeka package, and instead to call it through its namespace inside an encapsulating function.

So your tokenizer should look like this:
BigramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 2))}
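
For context, here is a minimal sketch of how that tokenizer could be plugged into the dictionary-based count from the question. The two sample documents, the dictionary `d`, and the name `PhraseTokenizer` are made up for illustration. Using `min = 1, max = 2` so that single words and bigrams are counted together is an assumption about what the asker wants. Setting `options(mc.cores = 1)` is a commonly reported extra precaution for this error rather than something the answer above requires, and it may be unnecessary on newer tm versions.

```r
library(tm)  # note: RWeka is deliberately NOT attached with library(RWeka)

# Hypothetical sample data standing in for the question's v.corpus
docs <- c("the quick brown fox", "the quick red fox jumps")
v.corpus <- VCorpus(VectorSource(docs))

# Hypothetical dictionary mixing single words and two-word phrases
d <- c("quick", "quick brown", "red fox")

# Namespaced tokenizer as in the answer; min = 1 / max = 2 (an assumption)
# emits unigrams and bigrams so both kinds of dictionary entry can match
PhraseTokenizer <- function(x) {
  RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 1, max = 2))
}

# Assumption: forcing a single core avoids the parallel::mclapply code path
options(mc.cores = 1)

dtm <- DocumentTermMatrix(
  v.corpus,
  control = list(tokenize = PhraseTokenizer, dictionary = d)
)

inspect(dtm)
```

A `VCorpus` is used here because, as far as I know, recent tm versions ignore custom tokenizers on a `SimpleCorpus`.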