MLA Review Part 3: Naive Bayes Classification


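Naive Bayes: Bayesian probability theory is a towering presence throughout statistical learning. The central idea running through the landmark book "Pattern Recognition and Machine Learning" is Bayesian probability theory. Put simply, a prior belief is combined with the likelihood of the observed data to obtain the posterior.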

Let us first give naive Bayes some theoretical grounding.

P(C1|X,Y)=P(C1)*P(X,Y|C1)/P(X,Y)

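Naive Bayes assumes that the features are conditionally independent given the class, so P(X,Y|C1)=P(X|C1)*P(Y|C1). The denominator P(X,Y) is the same for every class, so it can be dropped when classes are compared:

P(C1|X,Y)∝P(C1)*P(X|C1)*P(Y|C1)

which is easy to compute.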
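To make the formula concrete, here is a minimal sketch with two classes and two observed features. All numbers, and the names `prior`, `likelihood` and `posterior`, are made up for illustration and do not come from the email data.

```python
# Toy numbers, assumed for illustration only
prior = {"C1": 0.5, "C2": 0.5}
# P(X|C) and P(Y|C) for two observed features X and Y
likelihood = {
    "C1": {"X": 0.8, "Y": 0.6},
    "C2": {"X": 0.2, "Y": 0.3},
}

def posterior(prior, likelihood):
    # Unnormalized score per class: P(C) * P(X|C) * P(Y|C)
    scores = {c: prior[c] * likelihood[c]["X"] * likelihood[c]["Y"] for c in prior}
    # Under the independence assumption, P(X,Y) is the sum of these scores
    evidence = sum(scores.values())
    return {c: s / evidence for c, s in scores.items()}

post = posterior(prior, likelihood)
# The predicted class is the one with the largest posterior
predicted = max(post, key=post.get)
```

Dividing by the evidence only rescales the scores, so the predicted class would be the same if the normalization were skipped.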

With the background roughly covered, let us turn to the concrete problem:

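Email classifier: emails are classified as spam or normal mail (ham). A classification model is trained from the existing labeled email samples.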

Data: 25 spam emails and 25 normal emails. Of these 50 emails, 10 are picked at random as test data and the remaining 40 serve as training data.
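One possible way to draw such a split is sketched below. The helper `split_indices` and its `seed` parameter are hypothetical names introduced here; the listing in this article removes random items from the list in a loop instead.

```python
import random

def split_indices(n_docs=50, n_test=10, seed=None):
    # A dedicated generator keeps the split reproducible when a seed is given
    rng = random.Random(seed)
    # sample() draws without replacement, so no document is picked twice
    test_idx = sorted(rng.sample(range(n_docs), n_test))
    test_set = set(test_idx)
    train_idx = [i for i in range(n_docs) if i not in test_set]
    return train_idx, test_idx

train_idx, test_idx = split_indices(seed=0)
```

Working with indices rather than the documents themselves keeps the labels aligned, since the same index selects both a document and its label.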

Algorithm:

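From the given training emails, collect the array of all distinct words, denoted wordlist (the code below collects words from all 50 emails before the split)
Convert the training data and the test data into word vectors that follow the order of wordlist
Train a model on the word vectors with NaiveBayes
Evaluate the result on the test data set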
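The first two steps can be sketched on a toy corpus as follows. The documents and the helper names `build_vocab` and `to_count_vector` are invented for illustration. The vocabulary is sorted here only to make the column order deterministic.

```python
def build_vocab(docs):
    # Distinct words across all documents, sorted for a stable column order
    return sorted({w for doc in docs for w in doc})

def to_count_vector(doc, vocab_index):
    # One slot per vocabulary word, holding how often it occurs in doc
    vec = [0] * len(vocab_index)
    for w in doc:
        j = vocab_index.get(w)
        if j is not None:  # words outside the vocabulary are ignored
            vec[j] += 1
    return vec

docs = [["cheap", "pills", "cheap"], ["meeting", "tomorrow"]]
vocab = build_vocab(docs)
vocab_index = {w: j for j, w in enumerate(vocab)}
vectors = [to_count_vector(d, vocab_index) for d in docs]
```

Each row of `vectors` has one entry per vocabulary word, so documents of different lengths end up as vectors of equal length.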

The code is as follows:

Python code
 # -*- coding: UTF8 -*-  
 """ 
 author:luchi 
 date:16/2/18 
 desc: 
 Email classification with naive Bayes
 """  
   
 """ 
 Load the training and test texts and build the training and test sets
 """  
 import re  
 import random  
 from numpy import *  
   
   
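 def splitWords(text):
     # Split on runs of non-word characters. r'\W*' can match the empty string,
     # which makes Python 3.7+ split between every character.
     listTokens=re.split(r'\W+',text)
     return [token.lower() for token in listTokens  if len(token)>2 ]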
   
 def initDataset():  
   
     wordList=[]  
     docList=[]  
     labels=[]  
     for i in range(1,26):
         # spam is labeled 0
         with open('email/spam/%d.txt' % i) as fr:
             frStr=fr.read()
         l=splitWords(frStr)
         docList.append(l)
         labels.append(0)
         wordList.extend(l)
         # normal mail (ham) is labeled 1
         with open('email/ham/%d.txt' % i) as fr:
             frStr=fr.read()
         l=splitWords(frStr)
         docList.append(l)
         labels.append(1)
         wordList.extend(l)
     # print wordList  
     # print docList  
     # randomly pick 10 documents as the test set
     length=len(docList)  
     testList=[]  
     testLabels=[]  
     for i in range(10):  
         # randrange never returns len(docList), so the index is always valid
         randIndex=random.randrange(len(docList))
         testList.append(docList[randIndex])  
         testLabels.append(labels[randIndex])  
         del(docList[randIndex])  
         del(labels[randIndex])  
     return wordList,docList,labels,testList,testLabels,length  
   
 """ 
  
 Build the training and test vectors
  
 """  
   
 def getVecDataset(wordList,trainList,testList):  
     wordList=set(wordList)  
     wordvec=[token for token in wordList]  
     feature_num=len(wordvec)  
     print(len(wordvec))
     trainVec=zeros((len(trainList),feature_num))  
     testVec=zeros((len(testList),feature_num))  
   
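     # map each word to its column once instead of calling list.index per word
     wordIndex={word:j for j,word in enumerate(wordvec)}
     for i,l in enumerate(trainList):
         for word in l:
             if word in wordIndex:
                 trainVec[i][wordIndex[word]]+=1
     for i,l in enumerate(testList):
         for word in l:
             if word in wordIndex:
                 testVec[i][wordIndex[word]]+=1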
     return  trainVec,testVec  
   
   
   
   
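 def NaiveBayes(trainList,trainLabel):

     trainMat=array(trainList)
     labelMat=array(trainLabel)
     # start every word count at 1 and each total at 2.0 so that no word gets
     # probability 0, which would make log() return -inf
     class0=ones(len(trainMat[0]))
     sumClass0=2.0
     class1=ones(len(trainMat[0]))
     sumClass1=2.0
     m=len(trainMat)
     # number of class-0 (spam) training documents; test() turns it into a prior
     pclass0=0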
   
     for i in range(m):  
         if(trainLabel[i]==0):  
             class0+=trainMat[i]  
             sumClass0+=sum(trainMat[i])  
             pclass0+=1  
         elif trainLabel[i]==1:  
             class1+=trainMat[i]  
             sumClass1+=sum(trainMat[i])  
     # print class0  
     # print sumClass0  
     class0=class0/sumClass0  
     class1=class1/sumClass1  
     class0=log(class0)  
     class1=log(class1)  
     return class0,class1,pclass0  
   
 def testNaiveBayes(testVec,vec0,vec1,pclass0):
     # log score per class: sum of count * log P(word|class), plus log P(class)
     p0=sum(testVec*vec0)+log(pclass0)
     p1=sum(testVec*vec1)+log(1-pclass0)
     if(p0>p1):  
         return 0  
     else:  
         return 1  
   
   
   
 def test():  
     wordList,trainList,trainLabels,testList,testLabels,doc_num=initDataset()  
     trainVec,testVec=getVecDataset(wordList,trainList,testList)  
     class0Vec,class1Vec,pclass0=NaiveBayes(trainVec,trainLabels)  
     m=len(testVec)  
     err=0  
     pclass0=float(pclass0)/len(trainVec)  
     for i in range(m):  
         vec=testVec[i]  
         label=testLabels[i]  
         result=testNaiveBayes(array(vec),class0Vec,class1Vec,pclass0)  
         if result!=label:  
             err+=1  
     print ("error rate is %f" % (float(err)/m))  
   
   
 if __name__=="__main__":  
     test()
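
To see the training and scoring logic without the email files, the following pure-Python sketch repeats the same steps on four hand-made count vectors. The vocabulary, the vectors and the function names `train` and `classify` are assumptions made for this example. The smoothing (counts start at 1, totals at 2.0) simply mirrors the listing above. Standard Laplace smoothing would instead add the vocabulary size to each total.

```python
import math

def train(vectors, labels):
    n_features = len(vectors[0])
    # Counts start at 1 and totals at 2.0, mirroring the listing's smoothing
    counts = {0: [1.0] * n_features, 1: [1.0] * n_features}
    totals = {0: 2.0, 1: 2.0}
    for vec, label in zip(vectors, labels):
        for j, c in enumerate(vec):
            counts[label][j] += c
        totals[label] += sum(vec)
    # Log of the per-class word probabilities
    log_probs = {k: [math.log(c / totals[k]) for c in counts[k]] for k in counts}
    # Prior of class 0: fraction of training documents labeled 0
    p_class0 = labels.count(0) / len(labels)
    return log_probs, p_class0

def classify(vec, log_probs, p_class0):
    # Adding logs replaces multiplying many small probabilities
    s0 = sum(c * lp for c, lp in zip(vec, log_probs[0])) + math.log(p_class0)
    s1 = sum(c * lp for c, lp in zip(vec, log_probs[1])) + math.log(1 - p_class0)
    return 0 if s0 > s1 else 1

# Columns: "cheap", "meeting", "pills", "tomorrow" (made-up vocabulary)
train_vecs = [[2, 0, 1, 0], [1, 0, 2, 0], [0, 1, 0, 1], [0, 2, 0, 1]]
train_labels = [0, 0, 1, 1]  # 0 = spam, 1 = ham, the same encoding as the listing
log_probs, p0 = train(train_vecs, train_labels)
pred_spam = classify([1, 0, 1, 0], log_probs, p0)
pred_ham = classify([0, 1, 0, 1], log_probs, p0)
```

Working in log space matters on real emails: a product of hundreds of word probabilities can underflow to 0.0 in floating point, while the sum of their logarithms stays finite.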