scikit-learn: CountVectorizer raises an empty vocabulary error when the documents are cardinal numbers

July 4, 2023 · 333 views

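I ran into a problem with sklearn's CountVectorizer when a document consists of a single word, 'one'. As far as I can tell, the error occurs whenever the documents contain only words with the POS tag CD (cardinal number). Each of the following document lists triggers the empty vocabulary error:

['one', 'two']

['hundred']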

from sklearn.feature_extraction.text import CountVectorizer

ngram_code = 1
cv = CountVectorizer(stop_words='english', analyzer='word', lowercase=True,
                     token_pattern=r"[\w']+", ngram_range=(ngram_code, ngram_code))
cv_array = cv.fit_transform(['one', 'two'])

This raises:
ValueError: empty vocabulary; perhaps the documents only contain stop words

The following does not raise an error, because (I think) the cardinal words are mixed with another word:
['one', 'two', 'people']

Interestingly, in this case only 'people' is added to the vocabulary; 'one' and 'two' are left out:

cv_array = cv.fit_transform(['one', 'two', 'people'])
cv.vocabulary_
Out[143]: {'people': 0}

As another single-word example, ['hello'] works fine, presumably because it is not a cardinal number:

cv_array = cv.fit_transform(['hello'])
cv.vocabulary_
Out[147]: {'hello': 0}

Since I do not think of words like 'one' and 'two' as stop words, I expected CountVectorizer to process them. How can I get these words handled?

Also: I get the same error for the word 'system'. Why does this word fail?

cv_array = cv.fit_transform(['system'])

ValueError: empty vocabulary; perhaps the documents only contain stop words

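Best answer

You get an empty vocabulary because these words are in the English stop word list that sklearn uses when stop_words='english'. It has nothing to do with POS tags: every token in those documents is filtered out as a stop word, so nothing is left to build a vocabulary from. You can look at the list here, or test membership directly: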

>>> from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

>>> 'one' in ENGLISH_STOP_WORDS 
True

>>> 'two' in ENGLISH_STOP_WORDS 
True

>>> 'system' in ENGLISH_STOP_WORDS 
True
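The membership checks above can be folded into a small helper that shows which words of a document list would be filtered out. This is only a sketch: split_stop_words is a hypothetical name, not part of sklearn, and it lowercases each word on the assumption that the vectorizer runs with lowercase=True as in the question.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS


def split_stop_words(words):
    """Partition words into (stop words, kept words).

    Hypothetical helper, not an sklearn API. It checks each lowercased word
    against sklearn's built-in English stop word list.
    """
    stop = [w for w in words if w.lower() in ENGLISH_STOP_WORDS]
    kept = [w for w in words if w.lower() not in ENGLISH_STOP_WORDS]
    return stop, kept


stop, kept = split_stop_words(['one', 'two', 'system', 'people', 'hello'])
print(stop)  # ['one', 'two', 'system']
print(kept)  # ['people', 'hello']
```

If kept comes back empty for your whole corpus, fit_transform has nothing to build a vocabulary from and raises the ValueError shown in the question.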

If you want these words to be processed, initialize your CountVectorizer with stop word filtering turned off:

cv = CountVectorizer(stop_words=None, ...)
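As a rough, self-contained illustration of that suggestion, the sketch below fits a vectorizer with stop_words=None. It also shows an alternative the answer does not mention: keeping the built-in list but exempting a few words from it. The variable names (docs, keep, custom_stop_words) are made up for the example, and the token pattern is copied from the question.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

docs = ['one', 'two', 'system', 'people']

# Option 1: disable stop word filtering entirely.
cv_all = CountVectorizer(stop_words=None, token_pattern=r"[\w']+")
cv_all.fit(docs)
print(sorted(cv_all.vocabulary_))  # ['one', 'people', 'system', 'two']

# Option 2 (an assumption about what you may want, not part of the answer):
# keep the built-in English list but exempt the words you care about.
keep = {'one', 'two', 'system'}
custom_stop_words = sorted(ENGLISH_STOP_WORDS - keep)  # stop_words accepts a list
cv_custom = CountVectorizer(stop_words=custom_stop_words, token_pattern=r"[\w']+")
cv_custom.fit(['one', 'two', 'system', 'the', 'and'])
print(sorted(cv_custom.vocabulary_))  # ['one', 'system', 'two']
```

Option 1 keeps every token, including very common words such as 'the'. Option 2 is a middle ground if you still want most stop words removed.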