Latent Semantic Analysis (LSA) Tutorial 潜语义分析LSA介绍 四

WangBen 20110916 Beijing

Part 2 – Modify the Counts with TFIDF

计算TFIDF替代简单计数

In sophisticated Latent Semantic Analysis systems, the raw matrix countsare usually modified so that rare words are weighted more heavily than commonwords. For example, a word that occurs in only 5% of the documents shouldprobably be weighted more heavily than a word that occurs in 90% of thedocuments. The most popular weighting is TFIDF (Term Frequency – InverseDocument Frequency). Under this method, the count in each cell is replaced bythe following formula.

在复杂的LSA系统中,为了重要的词占据更重的权重,原始矩阵中的计数往往会被修改。例如,一个词仅在5%的文档中应该比那些出现在90%文档中的词占据更重的权重。最常用的权重计算方法就是TFIDF(词频-逆文档频率)。基于这种方法,我们把每个单元的数值进行修改:

TFIDFi,j = ( Ni,j / N*,j ) * log( D / Di) where

  • Ni,j = the number of times word i appears in document j (the original cell count).
  • N*,j = the number of total words in document j (just add the counts in column j).
  • D = the number of documents (the number of columns).
  • Di = the number of documents in which word i appears (the number of non-zero columns in row i).

Nij = 某个词i出现在文档j的次数(矩阵单元中的原始值)
N*j= 在文档j中所有词的个数(就是列j上所有数值的和)
D = 文档个数(也就是矩阵的列数)
Di= 包含词i的文档个数(也就是矩阵第i行非0列的个数)

In this formula, words that concentrate in certain documents areemphasized (by the Ni,j / N*,jratio) and words that onlyappear in a few documents are also emphasized (by the log( D / Di )term).

Since we have such a small example, we will skip this step and move on theheart of LSA, doing the singular value decomposition of our matrix of counts.However, if we did want to add TFIDF to our LSA class we could add the followingtwo lines at the beginning of our python file to import the log, asarray, andsum functions.

在这个公式里,在某个文档中密集出现的词被加强(通过Nij/N*j),那些仅在少数文档中出现的词也被加强(通过log(D/Di))

因为我们的例子过小,这里将跳过这一个步骤直接进入LSA的核心部分,对我们的计数矩阵做SVD。然而,如果我们需要增加TFIDF到这个LSA类中,我们需要加入以下两行代码。

from math importlog
from numpy import asarray, sum

Then we would add the following TFIDF method to our LSA class. WordsPerDoc(N*,j) just holds the sum of each column, which is the total numberof index words in each document. DocsPerWord (Di) uses asarray tocreate an array of what would be True and False values, depending on whetherthe cell value is greater than 0 or not, but the ‘i’ argument turns it into 1’sand 0’s instead. Then each row is summed up which tells us how many documentseach word appears in. Finally, we just step through each cell and apply theformula. We do have to change cols (which is the number of documents) into afloat to prevent integer division.

接下来需要增加下面这个TFIDF方法到我们的LSA类中。WordsPerDoc 就是矩阵每列的和,也就是每篇文档的词语总数。DocsPerWord 利用asarray方法创建一个0、1数组(也就是大于0的数值会被归一到1),然后每一行会被加起来,从而计算出每个词出现在了多少文档中。最后,我们对每一个矩阵单元计算TFIDF公式

def TFIDF(self):

    WordsPerDoc = sum(self.A, axis=0)       

    DocsPerWord = sum(asarray(self.A > 0,'i'), axis=1)

    rows, cols = self.A.shape

    for i in range(rows):

        for j in range(cols):

            self.A[i,j] = (self.A[i,j] /WordsPerDoc[j]) * log(float(cols) / DocsPerWord[i])
点赞