python – MultinomialNB – Theory vs. practice

January 31, 2024 · 712 views

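OK, I'm working through Andrew Ng's machine learning course. I'm currently reading
this chapter and want to try out Multinomial Naive Bayes (bottom of page 12) for myself using SKLearn and Python. Andrew proposes a method in which each email is encoded as follows: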

We let x_i denote the identity of the i-th word in the email. Thus, x_i is now an integer taking values in {1, ..., |V|}, where |V| is
the size of our vocabulary (dictionary). An email of n words is now
represented by a vector (x_1, x_2, ..., x_n) of length n; note that n
can vary for different documents. For instance, if an email starts
with "A NIPS ...," then x_1 = 1 ("a" is the first word in the
dictionary), and x_2 = 35000 (if "nips" is the 35000th word in the
dictionary).

See the highlighted part.

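So that is what I did in Python as well. I have a vocabulary, which is a list of 502 words, and I encode each "email" so that it is represented the way Andrew describes. For example, the message "this is sparta" is represented by [495, 296, 359] and "this is not sparta" by [495, 296, 415, 359].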

So here is the problem.

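Apparently, SKLearn's MultinomialNB requires inputs with a uniform shape. (I'm not sure about this, but at the moment I'm getting "ValueError: setting an array element with a sequence.", which I think is because the input vectors are not all the same size.)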

So my question is: how can I use MultinomialNB for messages of varying length? Is it possible? What am I missing?

Here is the code I'm using:

X = posts['wordsencoded'].values
y = posts['highview'].values
clf = MultinomialNB()
clf.fit(X, y)
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
print(clf.predict())

The input looks like this:

Stack trace:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-933-dea987cd8603> in <module>()
      3 y = posts['highview'].values
      4 clf = MultinomialNB()
----> 5 clf.fit(X, y)
      6 MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
      7 print(clf.predict())

/usr/local/lib/python3.4/dist-packages/sklearn/naive_bayes.py in fit(self, X, y, sample_weight)
    525             Returns self.
    526         """
--> 527         X, y = check_X_y(X, y, 'csr')
    528         _, n_features = X.shape
    529 

/usr/local/lib/python3.4/dist-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
    508     X = check_array(X, accept_sparse, dtype, order, copy, force_all_finite,
    509                     ensure_2d, allow_nd, ensure_min_samples,
--> 510                     ensure_min_features, warn_on_dtype, estimator)
    511     if multi_output:
    512         y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,

/usr/local/lib/python3.4/dist-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    371                                       force_all_finite)
    372     else:
--> 373         array = np.array(array, dtype=dtype, order=order, copy=copy)
    374 
    375         if ensure_2d:

ValueError: setting an array element with a sequence.
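The last frame of the trace is NumPy trying to build a numeric array from rows of unequal length. A minimal sketch that reproduces the same kind of failure outside sklearn, assuming the two encoded messages from the question stand in for the contents of posts['wordsencoded'] (the variable names here are illustrative, not from the original code):

```python
import numpy as np

# Two messages encoded as word-index sequences of different lengths,
# taken from the question.
encoded = [
    [495, 296, 359],       # "this is sparta"
    [495, 296, 415, 359],  # "this is not sparta"
]

# Rows of unequal length cannot form a rectangular numeric array,
# so the conversion raises ValueError.
try:
    np.array(encoded, dtype=float)
    conversion_failed = False
except ValueError as exc:
    conversion_failed = True
    print("ValueError:", exc)
```

The exact wording of the message may differ between NumPy versions, but the cause is the same: the rows do not share a common length.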

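Best answer: Yes, you are thinking along the right lines. You have to encode each mail as a fixed-length vector. For each email in the training set, this vector is called a word count vector, and in your case it has 502 dimensions.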

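Each word count vector contains the frequencies of the 502 dictionary words in the training file. As you may have guessed by now, most of the entries will be zero. For example, "this is not sparta is not this sparta" will be encoded as follows.
[0, 0, 0, 0, 0, ..., 0, 0, 2, 0, 0, 0, ..., 0, 0, 2, 0, 0, ..., 0, 0, 2, 0, 0, ..., 2, 0, 0, 0, 0, 0, 0]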

Here, the four 2s sit at indices 296, 359, 415 and 495 of the 502-length word count vector.
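The conversion from a variable-length index sequence to a fixed-length count vector can be sketched in a few lines of plain Python. This assumes the indices are 0-based positions in the 502-word vocabulary list and uses the word-to-index mapping implied by the question (this=495, is=296, not=415, sparta=359). The helper name to_count_vector is made up for illustration.

```python
from collections import Counter

VOCAB_SIZE = 502  # size of the vocabulary in the question


def to_count_vector(word_indices, vocab_size=VOCAB_SIZE):
    """Turn a variable-length list of word indices into a fixed-length count vector."""
    counts = Counter(word_indices)
    # Position j holds how often dictionary word j occurs in the message.
    return [counts.get(j, 0) for j in range(vocab_size)]


# "this is not sparta is not this sparta"
message = [495, 296, 415, 359, 296, 415, 495, 359]
vector = to_count_vector(message)

# Show only the non-zero positions instead of all 502 entries.
nonzero = {j: c for j, c in enumerate(vector) if c}
print(len(vector), nonzero)  # 502 {296: 2, 359: 2, 415: 2, 495: 2}
```

Word order is discarded by this encoding. Only the counts matter to the model, which is why messages of any length map to vectors of the same size.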

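This gives a feature vector matrix with one row per file in the training set and one column per word of the 502-word dictionary.
The value at index 'ij' is the number of occurrences of the j-th dictionary word in the i-th file.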

This encoding of the emails (the feature vector matrix) can then be passed to MultinomialNB for training.

Before predicting the class, you also have to generate the same kind of 502-length encoding for the test email.
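Putting the training and prediction steps together, a rough sketch might look like the following. The toy messages, the labels and the count_matrix helper are all made up for illustration, and the indices are again assumed to be 0-based positions in the 502-word vocabulary.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

VOCAB_SIZE = 502  # size of the vocabulary in the question


def count_matrix(encoded_messages, vocab_size=VOCAB_SIZE):
    """Build an (n_messages, vocab_size) matrix of word counts."""
    X = np.zeros((len(encoded_messages), vocab_size), dtype=np.int64)
    for i, word_indices in enumerate(encoded_messages):
        for j in word_indices:
            X[i, j] += 1  # one more occurrence of word j in message i
    return X


# Hypothetical training data: index sequences of different lengths.
train_encoded = [
    [495, 296, 359],
    [495, 296, 415, 359],
    [10, 20, 20, 30],
    [10, 30, 40],
]
y_train = np.array([1, 1, 0, 0])  # hypothetical class labels

X_train = count_matrix(train_encoded)
clf = MultinomialNB()
clf.fit(X_train, y_train)

# The test message is encoded with the same helper before predicting.
X_test = count_matrix([[495, 359, 359]])
pred = clf.predict(X_test)
print(X_train.shape, pred)
```

In the question's setup, train_encoded would correspond to the lists stored in posts['wordsencoded'] and y_train to posts['highview']. For larger vocabularies a sparse matrix would be the more economical choice, and MultinomialNB accepts sparse input as well.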

The following blog post shows how to build a spam filter classifier with MultinomialNB on the ling-spam dataset. It uses sklearn and Python for the implementation.

https://appliedmachinelearning.wordpress.com/2017/01/23/nlp-blog-post/