This article covers the k-nearest neighbors algorithm (kNN), part 2: improving the matching results of a dating website (by 香蕉麦乐迪).
Reference book: Machine Learning in Action (《机器学习实战》)
Goal of the experiment: predict how attractive a dating candidate will be to the user.
Input data: each candidate is described by three attributes: frequent-flyer miles earned per year, percentage of time spent playing video games, and liters of ice cream consumed per week. (Personally, I read these three features as rough proxies for wealth, leisure habits, and eating habits.)
Sample set: data for 1000 dating candidates, each carrying one of three labels: not liked, moderately attractive, very attractive.
Procedure:
1. Use 90% of the samples as the training set and the remaining 10% as the test set, and measure the error rate of classify.py.
2. Let the user enter a candidate's three attribute values and output the predicted label as a recommendation.
Code files:
file2Matrix.py: the samples live in a txt file; this function loads them into memory and stores them as a NumPy array
plotDataSet.py: plots the sample data (only two of the three features can be shown in a single scatter plot)
autoNorm.py: normalizes the features, since their value ranges differ widely
datingClassTest.py: measures the classification error rate
classify.py: the prediction (classification) function
classifyPerson.py: takes one person's attribute values and prints the predicted class
knn.py: the main script
Dataset and source files: see the download link in the original post
Source files:
file2Matrix.py: the samples live in a txt file; this function loads them into memory and stores them as a NumPy array
__author__ = 'root'
import numpy as np

def file2Matrix(filename):
    # open file
    fileHandle = open(filename, mode='r')
    # read lines, here lines is a list
    lines = fileHandle.readlines()
    # for saving data
    i = 0
    datingDataSet = np.zeros((len(lines), 3))
    labels = []
    # traverse all lines, save to matrix
    for line in lines:
        line = line.strip()
        listFromLine = line.split('\t')
        datingDataSet[i, :] = listFromLine[0:3]
        labels.append(int(listFromLine[-1]))
        i += 1
    # return dataSet and labels
    return datingDataSet, labels
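From the code, each line of the text file is expected to be tab-separated, with the three feature values first and an integer class label (1, 2 or 3) last. A minimal usage sketch, assuming the data file datingTestSet2.txt sits in the working directory (the sample line in the comment is made up for illustration only):

# a hypothetical line of the tab-separated data file (values invented):
# 40920	8.32	0.95	3
import file2Matrix

datingDataSet, labels = file2Matrix.file2Matrix('datingTestSet2.txt')
print datingDataSet.shape    # expected (1000, 3) for the full sample set
print labels[0:5]            # the first few integer labels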
plotDataSet.py: plots the sample data (only two of the three features can be shown in a single scatter plot)
__author__ = 'root'
import numpy as np
import matplotlib.pyplot as plt

def plotDataSet(datingDataSet, labels):
    fig = plt.figure()
    ax = fig.add_subplot(111)
    # marker size and color both scale with the class label
    ax.scatter(datingDataSet[:, 0], datingDataSet[:, 1],
               15*np.array(labels[:]), 15*np.array(labels[:]))
    plt.show()
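Only the first two feature columns are plotted above. To inspect a different feature pair you can simply change the column indices; the small variation below (a sketch of my own, the helper name and axis labels are not from the original files) plots game time against ice cream consumption:

import numpy as np
import matplotlib.pyplot as plt

def plotGameVsIceCream(datingDataSet, labels):
    fig = plt.figure()
    ax = fig.add_subplot(111)
    # columns 1 and 2: time spent on video games vs liters of ice cream
    ax.scatter(datingDataSet[:, 1], datingDataSet[:, 2],
               15*np.array(labels), 15*np.array(labels))
    ax.set_xlabel('percentage of time playing video games')
    ax.set_ylabel('liters of ice cream per week')
    plt.show()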
autoNorm.py: normalizes the features, since their value ranges differ widely
__author__ = 'root'
import file2Matrix
import numpy as np

def autoNorm(datingDataSet):
    # get the minimum and maximum value of each feature
    dataSetMin = datingDataSet.min(axis=0)
    dataSetMinTiled = np.tile(dataSetMin, (datingDataSet.shape[0], 1))
    dataSetMax = datingDataSet.max(axis=0)
    dataSetMaxTiled = np.tile(dataSetMax, (datingDataSet.shape[0], 1))
    # normalized value = (value - min) / (max - min)
    datingDataSet = (datingDataSet - dataSetMinTiled) / (dataSetMaxTiled - dataSetMinTiled)
    return datingDataSet, dataSetMin, dataSetMax
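A quick sanity check of the (value - min) / (max - min) formula on a tiny made-up matrix (the numbers below are only for illustration):

import numpy as np
import autoNorm

toy = np.array([[0.0,  10.0, 100.0],
                [5.0,  20.0, 300.0],
                [10.0, 30.0, 500.0]])
normed, toyMin, toyMax = autoNorm.autoNorm(toy)
# every column is scaled to [0, 1]: first row 0.0, middle row 0.5, last row 1.0
print normed
print toyMin    # column-wise minimum: [  0.  10. 100.]
print toyMax    # column-wise maximum: [ 10.  30. 500.]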
datingClassTest.py: measures the classification error rate
__author__ = 'root'
import numpy as np
import classify

def datingClassTest(datingDataSet, labels):
    # settings: ratio of test data, k
    ratio = 0.1
    k = 4
    # number of test samples
    lenOfDataSet = datingDataSet.shape[0]
    numOfTest = int(ratio*lenOfDataSet)
    print numOfTest
    # variable: number of errors
    numOfError = 0
    # traverse all test data
    for i in range(numOfTest):
        # prepare input data
        inX = datingDataSet[i, :]
        label = labels[i]
        ans = classify.classify(inX, datingDataSet[numOfTest:lenOfDataSet, :],
                                labels[numOfTest:lenOfDataSet], k)
        if ans != label:
            numOfError += 1.0
            print 'predict error'
    return numOfError/numOfTest
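k is hard-coded to 4 above. As a sketch (a helper I am adding myself, not part of the original files), the same split can be reused to compare a few values of k, assuming datingDataSet and labels have already been loaded and normalized as in knn.py:

import classify

def errorRateForK(datingDataSet, labels, k, ratio=0.1):
    # same splitting scheme as datingClassTest: first 10% as test, rest as training
    lenOfDataSet = datingDataSet.shape[0]
    numOfTest = int(ratio*lenOfDataSet)
    numOfError = 0.0
    for i in range(numOfTest):
        ans = classify.classify(datingDataSet[i, :],
                                datingDataSet[numOfTest:lenOfDataSet, :],
                                labels[numOfTest:lenOfDataSet], k)
        if ans != labels[i]:
            numOfError += 1.0
    return numOfError/numOfTest

for k in [1, 3, 4, 5, 10]:
    print k, errorRateForK(datingDataSet, labels, k)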
classify.py: the prediction (classification) function
__author__ = 'root'
import numpy as np
import operator

def classify(inX, dataSet, labels, k):
    # calculate the euclidean distance between inX and every sample in dataSet
    dataSetSize = dataSet.shape[0]
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distance = sqDistances**0.5
    # sort distances from min to max and return the index list
    sortedDistIndicies = distance.argsort()
    # among the k nearest neighbours, count the votes for each class
    classCount = {}
    for i in range(k):
        className = labels[sortedDistIndicies[i]]
        # get(className, 0): if className is not in the dict yet, start counting from 0
        classCount[className] = classCount.get(className, 0) + 1
    # sort the (class, count) pairs by count; reverse=True means descending order
    sortedClassCount = sorted(classCount.iteritems(),
                              key=operator.itemgetter(1), reverse=True)
    # return the class with the most votes
    return sortedClassCount[0][0]
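A minimal toy example of classify on four hand-made 2-D points (the points and labels below are invented purely for illustration):

import classify
import numpy as np

toyData = np.array([[1.0, 1.1],
                    [1.0, 1.0],
                    [0.0, 0.0],
                    [0.0, 0.1]])
toyLabels = [1, 1, 2, 2]
# the query point (0.2, 0.1) is closest to the two class-2 points,
# so with k=3 the majority vote should return 2
print classify.classify([0.2, 0.1], toyData, toyLabels, 3)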
classifyPerson.py: takes one person's attribute values and prints the predicted class
__author__ = 'root'
import numpy as np
import classify

def classifyPerson(datingDataSet, dataSetMin, dataSetMax, labels):
    resultList = ['not at all', 'a little like', 'like very much']
    k = 3
    # input data
    flyMiles = float(raw_input('please input fly miles per year:'))
    percOfVedioGames = float(raw_input('please input percentage of time you spend playing video games:'))
    iceCream = float(raw_input('please input how much iceCream you eat every week:'))
    inX = [flyMiles, percOfVedioGames, iceCream]
    # normalize the input with the same min/max as the training data
    inX = (inX - dataSetMin) / (dataSetMax - dataSetMin)
    # predict
    ans = classify.classify(inX, datingDataSet, labels, k)
    ans = resultList[ans-1]
    # print result
    print 'you may feel this person:', ans
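classifyPerson relies on interactive raw_input. For scripting or testing, a non-interactive variant can be handy; the sketch below is my own addition (the example attribute values are made up), assuming datingDataSet, dataSetMin, dataSetMax and labels are available as in knn.py:

import classify

def classifyPersonDirect(flyMiles, percOfVedioGames, iceCream,
                         datingDataSet, dataSetMin, dataSetMax, labels, k=3):
    resultList = ['not at all', 'a little like', 'like very much']
    inX = [flyMiles, percOfVedioGames, iceCream]
    # normalize with the training set's min/max before classifying
    inX = (inX - dataSetMin) / (dataSetMax - dataSetMin)
    ans = classify.classify(inX, datingDataSet, labels, k)
    return resultList[ans-1]

# example call with made-up attribute values
print classifyPersonDirect(10000, 10.0, 0.5,
                           datingDataSet, dataSetMin, dataSetMax, labels)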
knn.py: the main script
__author__ = 'root'
import file2Matrix
import plotDataSet
import autoNorm
import datingClassTest
import classifyPerson
import numpy as np

# load the data into memory
datingDataSetOri, labels = file2Matrix.file2Matrix('datingTestSet2.txt')
print 'datingDataSetOri:\n', datingDataSetOri
print 'labels:\n', labels
# plot the data
plotDataSet.plotDataSet(datingDataSetOri, labels)
# normalize every feature to [0,1]
datingDataSet, dataSetMin, dataSetMax = autoNorm.autoNorm(datingDataSetOri)
print 'datingDataSet:\n', datingDataSet
# test the error rate
errorRate = datingClassTest.datingClassTest(datingDataSet, labels)
print 'errorRate:', errorRate
# predict for one person
classifyPerson.classifyPerson(datingDataSet, dataSetMin, dataSetMax, labels)
Summary:
Advantages of kNN: the algorithm is simple and easy to implement.
Drawbacks of kNN: (1) prediction time grows linearly with the number of samples, and the cost also grows linearly with the number of features; (2) there is no training phase, so the method cannot learn an abstract feature representation of the samples.