Naive Bayes Spam Classification: A MATLAB Implementation

November 4, 2019

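I used MATLAB because this is one of several major assignments for my data mining course and the assignment requires it; otherwise I wouldn't put myself through the pain of using MATLAB... (because I don't know MATLAB...)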

The principle behind naive Bayes is very simple; the most important part is the probability formula:

[Image: probability formula]
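
For reference, a standard way to write the rule is Bayes' theorem combined with the assumption that tokens are conditionally independent given the class. The notation here is assumed rather than taken from the original image: $c$ is the class and $x_1, \dots, x_n$ are the tokens of an email.

```latex
P(c \mid x_1, \dots, x_n)
  = \frac{P(c)\, P(x_1, \dots, x_n \mid c)}{P(x_1, \dots, x_n)}
  \propto P(c) \prod_{j=1}^{n} P(x_j \mid c)
```

The denominator is the same for every class, so a classifier only needs to compare the right-hand side for spam and non-spam.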

For the rest of the background, see: http://zh.wikipedia.org/wiki/%E6%9C%B4%E7%B4%A0%E8%B4%9D%E5%8F%B6%E6%96%AF%E5%88%86%E7%B1%BB%E5%99%A8

Below is the concrete MATLAB implementation:

Here is the readMatrix.m code as well:

function [matrix, tokenlist, category] = readMatrix(filename)

fid = fopen(filename);

%Read the header line
headerline = fgetl(fid);

%Read number of documents and tokens
rowscols = fscanf(fid, '%d %d\n', 2);

%Read the list of tokens - just a long string!
%blah = fscanf(fid, '%s', 1); % required for octave
tokenlist = fgetl(fid);

% Document word matrix
% Each row represents a document (mail)
% Each column represents a distinct token
% The (i,j)-th element represents the number of times token j appeared in
% document i
matrix = sparse(1, 1, 0, rowscols(2), rowscols(1)); % the transpose!

% Vector containing the categories corresponding to each row in the
% document word matrix
% The i-th component is 1 if the i-th document (row) in the document word
% matrix is SPAM, and 0 otherwise.
category = matrix(rowscols(1));

%Read in the matrix and the categories
for m = 1:rowscols(1) % as many rows as number of documents
  line = fgetl(fid);
  nums = sscanf(line, '%d');
  category(m) = nums(1);
  matrix(1 + cumsum(nums(2:2:end - 1)), m) = nums(3:2:end - 1);
end

matrix = matrix'; % flip it back

fclose(fid);
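
To make the row format parsed by readMatrix concrete, here is a minimal sketch that decodes one made-up line the same way the loop does. The sample line and its trailing -1 terminator are assumptions for illustration, not taken from the actual data file. The variable names `deltas`, `counts` and `cols` are also hypothetical.

```matlab
% One illustrative document line: category, then (index delta, count) pairs,
% then a final terminator that the indexing below drops.
line = '1 0 2 3 1 -1';
nums = sscanf(line, '%d');

docCategory = nums(1);          % 1 = spam, 0 = non-spam
deltas = nums(2:2:end - 1);     % gaps between successive token indices
counts = nums(3:2:end - 1);     % how often each of those tokens occurs
cols = 1 + cumsum(deltas);      % absolute (1-based) token columns

% Here cols is [1; 4] and counts is [2; 1]:
% token 1 appears twice and token 4 appears once.
```

Storing index gaps instead of absolute indices keeps the numbers in the file small, and `cumsum` turns them back into column positions.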

Training phase:

[spmatrix, tokenlist, trainCategory] = readMatrix('MATRIX.TRAIN');

trainMatrix = full(spmatrix);
numTrainDocs = size(trainMatrix, 1);
numTokens = size(trainMatrix, 2);

% trainMatrix is now a (numTrainDocs x numTokens) matrix.
% Each row represents a unique document (email).
% The j-th column of the row $i$ represents the number of times the j-th
% token appeared in email $i$. 

% tokenlist is a long string containing the list of all tokens (words).
% These tokens are easily known by position in the file TOKENS_LIST

% trainCategory is a (1 x numTrainDocs) vector containing the true 
% classifications for the documents just read in. The i-th entry gives the 
% correct class for the i-th email (which corresponds to the i-th row in 
% the document word matrix).

% Spam documents are indicated as class 1, and non-spam as class 0.
% Note that for the SVM, you would want to convert these to +1 and -1.


% YOUR CODE HERE
% Class priors: the fraction of spam (class 1) and non-spam (class 0) emails.
trainCategory = full(trainCategory);
positiveSize = nnz(trainCategory);
negativeSize = numTrainDocs - positiveSize;
p1 = positiveSize/numTrainDocs;
p0 = negativeSize/numTrainDocs;

% Per-class token counts: sum the rows of trainMatrix that belong to each class.
isSpam = (trainCategory == 1);
trainMatrixResult1 = sum(trainMatrix(isSpam, :), 1);
trainMatrixResult0 = sum(trainMatrix(~isSpam, :), 1);

% Turn the counts into per-class token frequencies. The factor 1000 rescales
% both classes equally, presumably to delay underflow in the products formed
% in the test phase.
class1sum = sum(trainMatrixResult1);
class0sum = sum(trainMatrixResult0);
trainMatrixResult1 = 1000*trainMatrixResult1/class1sum;
trainMatrixResult0 = 1000*trainMatrixResult0/class0sum;
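
One thing to watch with the frequencies above is that a token never seen in one class gets frequency 0, which zeroes the whole product for any email containing it. A common remedy, sketched below on made-up toy data, is add-one (Laplace) smoothing together with logarithms. This is an alternative under stated assumptions, not the method used in this post, and every name in it (`toyMatrix`, `logPhi1`, and so on) is hypothetical.

```matlab
% Toy data (illustrative only): 4 emails x 3 tokens; 1 = spam, 0 = non-spam.
toyMatrix = [2 0 1; 1 0 0; 0 3 0; 0 1 1];
toyCategory = [1 1 0 0];
toyIsSpam = (toyCategory == 1);

% Add-one smoothing: every token gets one pseudo-count in each class,
% so no token ends up with probability zero.
count1 = sum(toyMatrix(toyIsSpam, :), 1) + 1;
count0 = sum(toyMatrix(~toyIsSpam, :), 1) + 1;

% Log token probabilities and log class priors.
logPhi1 = log(count1 / sum(count1));
logPhi0 = log(count0 / sum(count0));
logPrior1 = log(mean(toyIsSpam));
logPrior0 = log(mean(~toyIsSpam));
```

Working in log space turns the long product into a sum, so no rescaling factor is needed to avoid underflow.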

Test phase:

[spmatrix, tokenlist, category] = readMatrix('MATRIX.TEST');

testMatrix = full(spmatrix);
numTestDocs = size(testMatrix, 1);
numTokens = size(testMatrix, 2);

% Assume nb_train.m has just been executed, and all the parameters computed/needed
% by your classifier are in memory through that execution. You can also assume 
% that the columns in the test set are arranged in exactly the same way as for the
% training set (i.e., the j-th column represents the same token in the test data 
% matrix as in the original training data matrix).

% Write code below to classify each document in the test set (ie, each row
% in the current document word matrix) as 1 for SPAM and 0 for NON-SPAM.

% Construct the (numTestDocs x 1) vector 'output' such that the i-th entry 
% of this vector is the predicted class (1/0) for the i-th  email (i-th row 
% in testMatrix) in the test set.
output = zeros(numTestDocs, 1);

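%---------------
% YOUR CODE HERE
%---------------
for i=1:numTestDocs
    % Running products of the scaled token frequencies for each class.
    % Note: the priors p1 and p0 from training are not applied here, and a
    % token contributes once however many times it occurs in the email.
    belongTo1 = 1;
    belongTo0 = 1;
    for j=1:numTokens
        if testMatrix(i,j) ~= 0
            belongTo1 = belongTo1*trainMatrixResult1(j);
            belongTo0 = belongTo0*trainMatrixResult0(j);
        end
    end
    % Predict spam (1) only when the spam product is strictly larger.
    if belongTo1>belongTo0
        output(i) = 1;
    else
        output(i) = 0;
    end
end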

% Compute the error on the test set
error=0;
for i=1:numTestDocs
  if (category(i) ~= output(i))
    error=error+1;
  end
end

%Print out the classification error on the test set
error/numTestDocs
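
For comparison, here is a minimal sketch of the same decision made in log space, with the class priors included and each token weighted by its count. The parameter values and test rows are made up for illustration, and all names (`logPhi1`, `toyTest`, `toyOutput`, and so on) are hypothetical and not part of the assignment code.

```matlab
% Hypothetical smoothed parameters for 3 tokens, for illustration only.
logPhi1 = log([4 1 2] / 7);    % log P(token | spam)
logPhi0 = log([1 5 2] / 8);    % log P(token | non-spam)
logPrior1 = log(0.5);
logPrior0 = log(0.5);

% Two illustrative test emails (rows) with their token counts.
toyTest = [1 0 2; 0 2 0];
toyTruth = [1; 0];

% Score = log prior + sum over tokens of count * log token probability.
score1 = toyTest * logPhi1' + logPrior1;
score0 = toyTest * logPhi0' + logPrior0;
toyOutput = double(score1 > score0);

% Error rate as the fraction of mismatched predictions.
toyError = mean(toyOutput ~= toyTruth);
```

The matrix product scores every test email at once, and sums of logs stay in a safe numeric range even for long emails.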

Classification results:

[Image: classification results]

Graphical display:

[Image: graphical display of the results]

Please don't just copy and paste this and hand it in~