This article covers the k-nearest neighbors algorithm (kNN), part 2: improving the matching results of a dating website (by 香蕉麦乐迪).
Reference book: Machine Learning in Action (《机器学习实战》)
Goal of the experiment: predict how attractive a dating candidate will be to the user.
Input data: each candidate is described by three attributes: frequent-flyer miles earned per year, percentage of time spent playing video games, and liters of ice cream consumed per week. (Personally, I read these three features as rough proxies for wealth, leisure habits, and eating habits.)
Sample set: data for 1000 dating candidates, each carrying one of three labels: not liked, moderately attractive, very attractive.
Procedure:
1. Use 90% of the samples as the training set and the remaining 10% as the test set, and measure the error rate of classify.py.
2. Let the user enter a candidate's three attribute values and output the predicted label as a recommendation.
Code files:
file2Matrix.py: the samples live in a txt file; this function loads them into memory and stores them as a NumPy array
plotDataSet.py: plots the sample data (only two of the three features can be shown in a single scatter plot)
autoNorm.py: normalizes the features, since their value ranges differ widely
datingClassTest.py: measures the classification error rate
classify.py: the prediction (classification) function
classifyPerson.py: takes one person's attribute values and prints the predicted class
knn.py: the main script
Dataset and source files: see the download link in the original post
Source files:
file2Matrix.py: the samples live in a txt file; this function loads them into memory and stores them as a NumPy array
__author__ = 'root'
import numpy as np

def file2Matrix(filename):
    # open file
    fileHandle = open(filename, mode='r')
    # read lines, here lines is a list
    lines = fileHandle.readlines()
    # for saving data
    i = 0
    datingDataSet = np.zeros((len(lines), 3))
    labels = []
    # traverse all lines, save to matrix
    for line in lines:
        line = line.strip()
        listFromLine = line.split('\t')
        datingDataSet[i, :] = listFromLine[0:3]
        labels.append(int(listFromLine[-1]))
        i += 1
    # return dataSet and labels
    return datingDataSet, labels
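From the code, each line of the text file is expected to be tab-separated, with the three feature values first and an integer class label (1, 2 or 3) last. A minimal usage sketch, assuming the data file datingTestSet2.txt sits in the working directory (the sample line in the comment is made up for illustration only):

# a hypothetical line of the tab-separated data file (values invented):
# 40920	8.32	0.95	3
import file2Matrix

datingDataSet, labels = file2Matrix.file2Matrix('datingTestSet2.txt')
print datingDataSet.shape    # expected (1000, 3) for the full sample set
print labels[0:5]            # the first few integer labels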
plotDataSet.py: plots the sample data (only two of the three features can be shown in a single scatter plot)
__author__ = 'root'
import numpy as np
import matplotlib.pyplot as plt

def plotDataSet(datingDataSet, labels):
    fig = plt.figure()
    ax = fig.add_subplot(111)
    # marker size and color both scale with the class label
    ax.scatter(datingDataSet[:, 0], datingDataSet[:, 1],
               15*np.array(labels[:]), 15*np.array(labels[:]))
    plt.show()
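Only the first two feature columns are plotted above. To inspect a different feature pair you can simply change the column indices; the small variation below (a sketch of my own, the helper name and axis labels are not from the original files) plots game time against ice cream consumption:

import numpy as np
import matplotlib.pyplot as plt

def plotGameVsIceCream(datingDataSet, labels):
    fig = plt.figure()
    ax = fig.add_subplot(111)
    # columns 1 and 2: time spent on video games vs liters of ice cream
    ax.scatter(datingDataSet[:, 1], datingDataSet[:, 2],
               15*np.array(labels), 15*np.array(labels))
    ax.set_xlabel('percentage of time playing video games')
    ax.set_ylabel('liters of ice cream per week')
    plt.show()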
autoNorm.py: normalizes the features, since their value ranges differ widely
__author__ = 'root'
import file2Matrix
import numpy as np

def autoNorm(datingDataSet):
    # get the minimum and maximum value of each feature
    dataSetMin = datingDataSet.min(axis=0)
    dataSetMinTiled = np.tile(dataSetMin, (datingDataSet.shape[0], 1))
    dataSetMax = datingDataSet.max(axis=0)
    dataSetMaxTiled = np.tile(dataSetMax, (datingDataSet.shape[0], 1))
    # normalized value = (value - min) / (max - min)
    datingDataSet = (datingDataSet - dataSetMinTiled) / (dataSetMaxTiled - dataSetMinTiled)
    return datingDataSet, dataSetMin, dataSetMax
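A quick sanity check of the (value - min) / (max - min) formula on a tiny made-up matrix (the numbers below are only for illustration):

import numpy as np
import autoNorm

toy = np.array([[0.0,  10.0, 100.0],
                [5.0,  20.0, 300.0],
                [10.0, 30.0, 500.0]])
normed, toyMin, toyMax = autoNorm.autoNorm(toy)
# every column is scaled to [0, 1]: first row 0.0, middle row 0.5, last row 1.0
print normed
print toyMin    # column-wise minimum: [  0.  10. 100.]
print toyMax    # column-wise maximum: [ 10.  30. 500.]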
datingClassTest.py: measures the classification error rate
__author__ = 'root'
import numpy as np
import classify

def datingClassTest(datingDataSet, labels):
    # settings: ratio of test data, k
    ratio = 0.1
    k = 4
    # number of test samples
    lenOfDataSet = datingDataSet.shape[0]
    numOfTest = int(ratio*lenOfDataSet)
    print numOfTest
    # variable: number of errors
    numOfError = 0
    # traverse all test data
    for i in range(numOfTest):
        # prepare input data
        inX = datingDataSet[i, :]
        label = labels[i]
        ans = classify.classify(inX, datingDataSet[numOfTest:lenOfDataSet, :],
                                labels[numOfTest:lenOfDataSet], k)
        if ans != label:
            numOfError += 1.0
            print 'predict error'
    return numOfError/numOfTest
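k is hard-coded to 4 above. As a sketch (a helper I am adding myself, not part of the original files), the same split can be reused to compare a few values of k, assuming datingDataSet and labels have already been loaded and normalized as in knn.py:

import classify

def errorRateForK(datingDataSet, labels, k, ratio=0.1):
    # same splitting scheme as datingClassTest: first 10% as test, rest as training
    lenOfDataSet = datingDataSet.shape[0]
    numOfTest = int(ratio*lenOfDataSet)
    numOfError = 0.0
    for i in range(numOfTest):
        ans = classify.classify(datingDataSet[i, :],
                                datingDataSet[numOfTest:lenOfDataSet, :],
                                labels[numOfTest:lenOfDataSet], k)
        if ans != labels[i]:
            numOfError += 1.0
    return numOfError/numOfTest

for k in [1, 3, 4, 5, 10]:
    print k, errorRateForK(datingDataSet, labels, k)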
classify.py: the prediction (classification) function
__author__ = 'root'
import numpy as np
import operator

def classify(inX, dataSet, labels, k):
    # calculate the euclidean distance between inX and every sample in dataSet
    dataSetSize = dataSet.shape[0]
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distance = sqDistances**0.5
    # sort distances from min to max and return the index list
    sortedDistIndicies = distance.argsort()
    # among the k nearest neighbours, count the votes for each class
    classCount = {}
    for i in range(k):
        className = labels[sortedDistIndicies[i]]
        # get(className, 0): if className is not in the dict yet, start counting from 0
        classCount[className] = classCount.get(className, 0) + 1
    # sort the (class, count) pairs by count; reverse=True means descending order
    sortedClassCount = sorted(classCount.iteritems(),
                              key=operator.itemgetter(1), reverse=True)
    # return the class with the most votes
    return sortedClassCount[0][0]
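A minimal toy example of classify on four hand-made 2-D points (the points and labels below are invented purely for illustration):

import classify
import numpy as np

toyData = np.array([[1.0, 1.1],
                    [1.0, 1.0],
                    [0.0, 0.0],
                    [0.0, 0.1]])
toyLabels = [1, 1, 2, 2]
# the query point (0.2, 0.1) is closest to the two class-2 points,
# so with k=3 the majority vote should return 2
print classify.classify([0.2, 0.1], toyData, toyLabels, 3)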
classifyPerson.py: takes one person's attribute values and prints the predicted class
__author__ = 'root'
import numpy as np
import classify

def classifyPerson(datingDataSet, dataSetMin, dataSetMax, labels):
    resultList = ['not at all', 'a little like', 'like very much']
    k = 3
    # input data
    flyMiles = float(raw_input('please input fly miles per year:'))
    percOfVedioGames = float(raw_input('please input percentage of time you spend playing video games:'))
    iceCream = float(raw_input('please input how much iceCream you eat every week:'))
    inX = [flyMiles, percOfVedioGames, iceCream]
    # normalize the input with the same min/max as the training data
    inX = (inX - dataSetMin) / (dataSetMax - dataSetMin)
    # predict
    ans = classify.classify(inX, datingDataSet, labels, k)
    ans = resultList[ans-1]
    # print result
    print 'you may feel this person:', ans
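classifyPerson relies on interactive raw_input. For scripting or testing, a non-interactive variant can be handy; the sketch below is my own addition (the example attribute values are made up), assuming datingDataSet, dataSetMin, dataSetMax and labels are available as in knn.py:

import classify

def classifyPersonDirect(flyMiles, percOfVedioGames, iceCream,
                         datingDataSet, dataSetMin, dataSetMax, labels, k=3):
    resultList = ['not at all', 'a little like', 'like very much']
    inX = [flyMiles, percOfVedioGames, iceCream]
    # normalize with the training set's min/max before classifying
    inX = (inX - dataSetMin) / (dataSetMax - dataSetMin)
    ans = classify.classify(inX, datingDataSet, labels, k)
    return resultList[ans-1]

# example call with made-up attribute values
print classifyPersonDirect(10000, 10.0, 0.5,
                           datingDataSet, dataSetMin, dataSetMax, labels)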
knn.py: the main script
__author__ = 'root'
import file2Matrix
import plotDataSet
import autoNorm
import datingClassTest
import classifyPerson
import numpy as np

# load the data into memory
datingDataSetOri, labels = file2Matrix.file2Matrix('datingTestSet2.txt')
print 'datingDataSetOri:\n', datingDataSetOri
print 'labels:\n', labels
# plot the data
plotDataSet.plotDataSet(datingDataSetOri, labels)
# normalize every feature to [0,1]
datingDataSet, dataSetMin, dataSetMax = autoNorm.autoNorm(datingDataSetOri)
print 'datingDataSet:\n', datingDataSet
# test the error rate
errorRate = datingClassTest.datingClassTest(datingDataSet, labels)
print 'errorRate:', errorRate
# predict for one person
classifyPerson.classifyPerson(datingDataSet, dataSetMin, dataSetMax, labels)
Summary:
Advantages of kNN: the algorithm is simple and easy to implement.
Drawbacks of kNN: (1) prediction time grows linearly with the number of samples, and the cost also grows linearly with the number of features; (2) there is no training phase, so the method cannot learn an abstract feature representation of the samples.