Stanford NLP CoreNLP does not split sentences for Chinese text

My environment:

> CoreNLP 3.5.1
> stanford-chinese-corenlp-2015-01-30-models
> The default properties file for Chinese: StanfordCoreNLP-chinese.properties

> annotators = segment,ssplit

My test text is "这是第一个句子.这是第二个句子."
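The pipeline and annotation are built along these lines (a minimal sketch, assuming the Chinese models jar and the StanfordCoreNLP-chinese.properties file above are on the classpath):

import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}

// Loads StanfordCoreNLP-chinese.properties (annotators = segment, ssplit) from the classpath.
val pipeline = new StanfordCoreNLP("StanfordCoreNLP-chinese.properties")
val annotation = new Annotation("这是第一个句子.这是第二个句子.")
pipeline.annotate(annotation)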
I then print each sentence:

import scala.collection.JavaConverters._
import edu.stanford.nlp.ling.CoreAnnotations.{SentencesAnnotation, TextAnnotation}

val sentences = annotation.get(classOf[SentencesAnnotation]).asScala
var count = 0
for (sent <- sentences) {
  count += 1
  println(s"sentence$count = " + sent.get(classOf[TextAnnotation]))
}

It always prints the whole test text as a single sentence, instead of the expected two sentences:

sentence1 = 這是第一個句子.這是第二個句子.

What I expected was:

expected sentence1 = 這是第一個句子.
expected sentence2 = 這是第二個句子.

Even if I add more properties, the result is the same:

ssplit.eolonly = false
ssplit.isOneSentence = false
ssplit.newlineIsSentenceBreak = always
ssplit.boundaryTokenRegex = [.]|[!?]+|[.]|[!?]+
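For reference, the same overrides can also be set programmatically when building the pipeline. A sketch, assuming the bundled Chinese defaults are loaded first so the segment annotator keeps its model and dictionary settings:

import java.util.Properties
import edu.stanford.nlp.pipeline.StanfordCoreNLP

val props = new Properties()
// Start from the bundled Chinese defaults, then override the ssplit options listed above.
props.load(Thread.currentThread.getContextClassLoader.getResourceAsStream("StanfordCoreNLP-chinese.properties"))
props.setProperty("ssplit.boundaryTokenRegex", "[.]|[!?]+|[.]|[!?]+")
val pipeline = new StanfordCoreNLP(props)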

The CoreNLP log is:

Registering annotator segment with class edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator
Adding annotator segment
Loading Segmentation Model [edu/stanford/nlp/models/segmenter/chinese/ctb.gz]...Loading classifier from edu/stanford/nlp/models/segmenter/chinese/ctb.gz ... Loading Chinese dictionaries from 1 files:
  edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz

loading dictionaries from edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz...Done. Unique words in ChineseDictionary is: 423200
done [56.9 sec].
done. Time elapsed: 57041 ms
Adding annotator ssplit
Adding Segmentation annotation...output: [null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null]
INFO: TagAffixDetector: useChPos=false | useCTBChar2=true | usePKChar2=false
INFO: TagAffixDetector: building TagAffixDetector from edu/stanford/nlp/models/segmenter/chinese/dict/character_list and edu/stanford/nlp/models/segmenter/chinese/dict/in.ctb
Loading character dictionary file from edu/stanford/nlp/models/segmenter/chinese/dict/character_list
Loading affix dictionary from edu/stanford/nlp/models/segmenter/chinese/dict/in.ctb
這是第一個句子.這是第二個句子.
--->
[這是, 第一, 個, 句子, ., 這是, 第二, 個, 句子, .]
done. Time elapsed: 419 ms

I have seen someone else get the following log line (with CoreNLP 3.5.0), but strangely I never get it:

Adding annotator ssplit edu.stanford.nlp.pipeline.AnnotatorImplementations:ssplit.boundaryTokenRegex=[.]|[!?]+|[.]|[!?]+

What is going wrong? Is there a workaround? If it cannot be fixed, I can split the text myself, but then I do not know how to integrate my own splitting into the CoreNLP pipeline.

Best answer: OK, I got a workaround working.

Define the ssplit annotator yourself.

For convenience I hard-code the parameters here, although the proper way would be to parse them from the props.

import java.util.Properties
import edu.stanford.nlp.pipeline.WordsToSentencesAnnotator

class MyWordsToSentencesAnnotator extends WordsToSentencesAnnotator(
  true,
  "[.]|[!?]+|[.]|[!?]+",
  null,
  null,
  "never") {
  // CoreNLP instantiates custom annotators through a (name, props) constructor.
  def this(name: String, props: Properties) { this() }
}

Then specify the class in the properties file:

customAnnotatorClass.myssplit = ...
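For example (the annotator name myssplit and the package com.example are only illustrative; the remaining entries of StanfordCoreNLP-chinese.properties stay unchanged):

customAnnotatorClass.myssplit = com.example.MyWordsToSentencesAnnotator
annotators = segment, myssplit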

So apparently, I guess, there is a bug in the default CoreNLP pipeline setup or code?
