The k-Nearest Neighbors Classification Algorithm (kNN)

2024-08-24 18:18
Article tags: algorithms, classification, knn, nearest neighbors

This article introduces the k-nearest neighbors (kNN) classification algorithm; we hope it serves as a useful reference for developers.

Note: portions of this article are taken from Wikipedia.

In pattern recognition, the k-Nearest Neighbors algorithm (or k-NN for short) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression:

  • In k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.
  • In k-NN regression, the output is the property value for the object. This value is the average of the values of its k nearest neighbors. (A minimal sketch of both modes follows this list.)
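To make the two modes concrete, here is a minimal, self-contained sketch (not from the original article) that performs k-NN classification by majority vote and k-NN regression by averaging, using plain NumPy; the function name knn_predict and the mode parameter are illustrative choices:

import numpy as np
from collections import Counter

def knn_predict(query, X, y, k, mode="classify"):
    # Euclidean distance from the query to every training point
    dists = np.sqrt(((X - np.asarray(query)) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]  # indices of the k closest samples
    if mode == "classify":
        # majority vote among the k nearest labels
        return Counter(y[i] for i in nearest).most_common(1)[0][0]
    # regression: average of the k nearest target values
    return float(np.mean([y[i] for i in nearest]))

For example, knn_predict([0, 0], np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]]), ['A', 'A', 'B', 'B'], k=3) returns 'B', matching the small dataset used later in this article.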

k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. The k-NN algorithm is among the simplest of all machine learning algorithms.

Both for classification and regression, it can be useful to weight the contributions of the neighbors, so that the nearer neighbors contribute more to the average than the more distant ones. For example, a common weighting scheme consists of giving each neighbor a weight of 1/d, where d is the distance to the neighbor.
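As a sketch of this 1/d weighting (an illustration, not code from the article; the small eps guard against zero distance is an implementation choice of ours):

import numpy as np
from collections import defaultdict

def weighted_knn_classify(query, X, y, k, eps=1e-8):
    dists = np.sqrt(((X - np.asarray(query)) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    votes = defaultdict(float)
    for i in nearest:
        votes[y[i]] += 1.0 / (dists[i] + eps)  # nearer neighbors cast heavier votes
    return max(votes, key=votes.get)  # class with the largest total weight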

The neighbors are taken from a set of objects for which the class (for k-NN classification) or the object property value (for k-NN regression) is known. This can be thought of as the training set for the algorithm, though no explicit training step is required.

A shortcoming of the k-NN algorithm is that it is sensitive to the local structure of the data.

The training examples are vectors in a multidimensional feature space, each with a class label. The training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples.

In the classification phase, k is a user-defined constant, and an unlabeled vector (a query or test point) is classified by assigning the label which is most frequent among the k training samples nearest to that query point.

A commonly used distance metric for continuous variables is Euclidean distance. For discrete variables, such as for text classification, another metric can be used, such as the overlap metric (or Hamming distance). Often, the classification accuracy of k-NN can be improved significantly if the distance metric is learned with specialized algorithms such as Large Margin Nearest Neighbor or Neighbourhood Components Analysis.
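Both metrics are easy to state in code; a small sketch (the function names are ours):

import numpy as np

def euclidean(a, b):
    # straight-line distance, suited to continuous features
    return float(np.sqrt(((np.asarray(a) - np.asarray(b)) ** 2).sum()))

def hamming(a, b):
    # overlap metric for discrete features: number of mismatched positions
    return sum(x != y for x, y in zip(a, b))

# euclidean([1.0, 2.0], [4.0, 6.0]) -> 5.0
# hamming("karin", "kathy")        -> 3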

A drawback of the basic "majority voting" classification occurs when the class distribution is skewed. That is, examples of a more frequent class tend to dominate the prediction of the new example, because they tend to be common among the k nearest neighbors due to their large number. One way to overcome this problem is to weight the classification, taking into account the distance from the test point to each of its k nearest neighbors. The class (or value, in regression problems) of each of the k nearest points is multiplied by a weight proportional to the inverse of the distance from that point to the test point. Another way to overcome skew is by abstraction in data representation. For example, in a self-organizing map (SOM), each node is a representative (a center) of a cluster of similar points, regardless of their density in the original training data. k-NN can then be applied to the SOM.
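If scikit-learn is available, this inverse-distance weighting is built in: passing weights='distance' to KNeighborsClassifier weights each vote by 1/d. A small sketch on deliberately skewed toy data (the data values are made up for illustration):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# toy data: class 'B' is twice as frequent as class 'A'
X = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1], [0.1, 0], [0.2, 0.1]])
y = np.array(['A', 'A', 'B', 'B', 'B', 'B'])

clf = KNeighborsClassifier(n_neighbors=5, weights='distance')
clf.fit(X, y)
print(clf.predict([[0.9, 1.0]]))  # -> ['A']: two nearby 'A' points outweigh three distant 'B' points

With uniform weights (the default, weights='uniform'), the same query would be voted 'B' simply because 'B' samples dominate the five nearest neighbors.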

[Figure: a query point at the center, surrounded by red triangles and blue squares; with k = 3 the vote favors the red triangles, with k = 5 the blue squares.]

As shown in the figure above, the point at the center is assigned to the red-triangle class under 3-NN, but to the blue-square class under 5-NN. This is the basic idea of kNN. Its main drawback is cost: for every point to be classified, kNN must compute the distance to every data point, which is computationally expensive.
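One standard remedy for this brute-force cost, not covered in this article, is a space-partitioning index such as a k-d tree, which makes each query roughly O(log n) instead of O(n). A minimal sketch using SciPy's cKDTree (assumes SciPy is installed; the random points are placeholder data):

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.random((10000, 2))   # placeholder training points
tree = cKDTree(points)            # build the index once, up front

# query the 3 nearest neighbors of a test point
dists, idx = tree.query([0.5, 0.5], k=3)
print(idx, dists)

Libraries such as scikit-learn pick between brute force, k-d trees, and ball trees automatically.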


Below, we demonstrate the kNN algorithm with a piece of Python code.

# coding:utf-8
from numpy import *
import operator
import os

# Create a small dataset for development and testing
def createDataSet():
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

# Core kNN classification function
def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]  # total number of rows in dataSet
    # Compute the Euclidean distance between the input vector and every sample
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    # Sort the distances; argsort returns the indices of the original positions
    sortedDistIndices = distances.argsort()
    # Tally the class labels among the k shortest distances
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndices[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

# Convert the data in a file into a matrix
def file2matrix(filename):
    fr = open(filename, 'r')
    arrayOLines = fr.readlines()
    numberOfLines = len(arrayOLines)
    returnMat = zeros((numberOfLines, 3))
    classLabelVector = []
    index = 0
    for line in arrayOLines:
        line = line.strip()
        listFromLine = line.split('\t')
        returnMat[index, :] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector

# Preprocessing step: normalize feature values
def autoNorm(dataSet):
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    normDataSet = zeros(shape(dataSet))
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))
    return normDataSet, ranges, minVals

# Test harness for the dating-site example
def datingClassTest():
    hoRatio = 0.010
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], 3)
        print("the classifier came back with: %d, the real answer is: %d"
              % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))

# Dating-site prediction function
def classifyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']
    percentTats = float(input("percentage of time spent playing video games?"))
    ffMiles = float(input("frequent flier miles earned per year?"))
    iceCream = float(input("liters of ice cream consumed per year?"))
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = array([ffMiles, percentTats, iceCream])
    classifierResult = classify0((inArr - minVals) / ranges, normMat, datingLabels, 3)
    print("you will probably like this person:", resultList[classifierResult - 1])

# Convert a 32x32 image file into a 1x1024 vector
def img2vector(filename):
    returnVect = zeros((1, 1024))
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()
        for j in range(32):
            returnVect[0, 32 * i + j] = int(lineStr[j])
    return returnVect

# Test harness for the handwritten-digit recognition example
def handwritingClassTest():
    hwLabels = []
    trainingFileList = os.listdir('trainingDigits')
    m = len(trainingFileList)
    trainingMat = zeros((m, 1024))
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        hwLabels.append(classNumStr)
        trainingMat[i, :] = img2vector('trainingDigits/%s' % fileNameStr)
    testFileList = os.listdir('testDigits')
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('testDigits/%s' % fileNameStr)
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        print("the classifier came back with: %d, the real answer is: %d"
              % (classifierResult, classNumStr))
        if classifierResult != classNumStr:
            errorCount += 1.0
    print("\nthe total number of errors is: %d" % errorCount)
    print("\nthe total error rate is: %f" % (errorCount / float(mTest)))

# main
if __name__ == "__main__":
    # Test the kNN classifier on the small development dataset
    group, labels = createDataSet()
    a = classify0([0, 0], group, labels, 3)
    # print(a)
    # Run the dating-site test harness
    datingClassTest()
    # Run the dating-site prediction function
    classifyPerson()
    # Run the handwritten-digit recognition test
    handwritingClassTest()


That concludes this introduction to the k-nearest neighbors (kNN) classification algorithm; we hope it is helpful.


