[Notes] Advanced clustering methods (CURE, Chameleon, ROCK, Jarvis-Patrick)


The earlier version was a bit messy, so I reorganized it. 4/16/2014

I uploaded the PPT but it is still under review. I was going to say "see the PPT for the figures"; for now, please refer to the textbook and the papers instead...

First, some background on hierarchical clustering:

A hierarchical method creates a hierarchical decomposition of the given set of data objects.  Hierarchical methods suffer from the fact that once a step (merge or split) is done, it can never be undone.

CURE

Information on CURE is easy to find; the data mining textbook (Introduction to Data Mining) also covers it in detail. See http://wiki.madio.net/index.php?doc-view-996 and the paper: Sudipto Guha, Rajeev Rastogi, Kyuseok Shim, "CURE: An Efficient Clustering Algorithm for Large Databases".

CURE is a clustering algorithm that combines several techniques into one approach that can handle large data sets, outliers, and clusters with non-spherical shapes and non-uniform sizes.

Shrinking the representative points toward the center helps avoid problems with noise and outliers; a small sketch of this shrinking step follows the algorithm outline below.

Input: D=[p1,p2,...,pn] and the number of clusters k
Output: the set of clusters
1. Draw a random sample from the data set.
The CURE paper is notable for explicitly deriving a formula for how large this sample should be in order to guarantee, with high probability, that all clusters are represented by a minimum number of points.
2. Partition the sample into p equal-sized partitions.
3. Cluster the points in each partition into m/(pq) clusters using CURE's hierarchical clustering algorithm, to obtain a total of m/q clusters.
4. Use CURE's hierarchical clustering algorithm to cluster the m/q clusters found in the previous step until only k clusters remain.
5. Eliminate outliers. This is the second phase of outlier elimination.
6. Assign all remaining data points to the nearest cluster to obtain a complete clustering.
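
To make the representative-point idea concrete, here is a minimal sketch (not the full CURE pipeline) of the building block used inside the hierarchical clustering steps: picking well-scattered representative points for a cluster and shrinking them toward the centroid. The function name cure_representatives and the parameters n_reps and alpha are illustrative choices, not from the original implementation:

import numpy as np

def cure_representatives(points, n_reps=10, alpha=0.3):
    """Pick well-scattered representative points for one cluster and
    shrink them toward the centroid by a factor alpha (CURE-style)."""
    points = np.asarray(points, dtype=float)
    centroid = points.mean(axis=0)

    # Greedy farthest-point selection: start from the point farthest from
    # the centroid, then repeatedly add the point farthest from the
    # representatives chosen so far.
    reps = [points[np.argmax(np.linalg.norm(points - centroid, axis=1))]]
    while len(reps) < min(n_reps, len(points)):
        dists = np.min(
            [np.linalg.norm(points - r, axis=1) for r in reps], axis=0)
        reps.append(points[np.argmax(dists)])

    # Shrink each representative toward the centroid to dampen the
    # effect of noise and outliers.
    return np.array([r + alpha * (centroid - r) for r in reps])

# Example: representatives of a small 2-D cluster
cluster = np.random.RandomState(0).randn(50, 2)
print(cure_representatives(cluster, n_reps=5, alpha=0.3))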



From the earlier experiments you can see that CURE is better able to handle clusters of arbitrary shapes and sizes.

However, CURE cannot handle clusters of differing densities.

               

CURE does not handle categorical attributes.
ROCK, in contrast, is an agglomerative hierarchical clustering algorithm suited to categorical attributes. It measures the similarity of two clusters by comparing their aggregate interconnectivity against a user-specified interconnectivity model. The interconnectivity of two clusters C1 and C2 is defined as the number of cross links between the two clusters, where link(pi, pj) is the number of common neighbors of the two points pi and pj.
In other words, cluster similarity is based on the number of points from different clusters that have neighbors in common.
ROCK first constructs a sparse graph from the given data similarity matrix, using a similarity threshold and the notion of shared neighbors, and then runs a hierarchical clustering algorithm on this sparse graph.

ROCK (RObust Clustering using linKs)

Clustering algorithm for data with categorical and Boolean attributes
A pair of points is defined to be neighbors if their similarity is greater than some threshold
Use a hierarchical clustering scheme to cluster the data. 


Obtain a sample of points from the data set
Compute the link value for each set of points, i.e., transform the original similarities (computed by Jaccard coefficient) into similarities that reflect the number of shared neighbors between points
Perform an agglomerative hierarchical clustering on the data using the “number of shared neighbors” as similarity measure and maximizing “the shared neighbors” objective function
Assign the remaining points to the clusters that have been found
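
A rough sketch of ROCK's core similarity transformation as described above: Jaccard similarity on categorical records, a neighbor threshold theta, and link(pi, pj) counted as the number of neighbors the two records share. The function names and the example data are illustrative; a real implementation would also use the links inside ROCK's goodness measure when merging clusters:

import numpy as np

def jaccard(a, b):
    """Jaccard coefficient between two sets of categorical values."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def rock_links(records, theta=0.5):
    """Link matrix: link[i, j] = number of neighbors shared by records
    i and j, where two records are neighbors if their Jaccard
    similarity is at least theta."""
    n = len(records)
    adj = np.zeros((n, n), dtype=int)   # sparse neighbor graph
    for i in range(n):
        for j in range(i + 1, n):
            if jaccard(records[i], records[j]) >= theta:
                adj[i, j] = adj[j, i] = 1
    # (A @ A)[i, j] counts common neighbors; the diagonal holds degrees.
    return adj @ adj

# Example: small market-basket style categorical records
baskets = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "b", "e"}, {"x", "y"}]
print(rock_links(baskets, theta=0.4))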

The original ROCK paper is: Sudipto Guha, Rajeev Rastogi, Kyuseok Shim, "ROCK: A Robust Clustering Algorithm for Categorical Attributes".
ROCK clustering can be used from R; there is a good example here: http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/RockCluster

Both Chameleon and the Jarvis-Patrick (JP) algorithm are graph-based clustering methods.
Graph-Based clustering uses the proximity graph
Start with the proximity matrix
Consider each point as a node in a graph
Each edge between two nodes has a weight which is the proximity between the two points
Initially the proximity graph is fully connected 
MIN (single-link) and MAX (complete-link) can be viewed as starting with this graph


In the simplest case, clusters are connected components in the graph.
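
As a minimal illustration of this simplest case, the sketch below sparsifies the fully connected proximity graph with a distance threshold eps (an illustrative parameter) and returns connected components as clusters:

import numpy as np

def connected_component_clusters(points, eps=1.0):
    """Cluster points as connected components of the graph that links
    any two points whose Euclidean distance is below eps."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        # Breadth-first search over the thresholded proximity graph.
        queue = [start]
        labels[start] = current
        while queue:
            i = queue.pop()
            dists = np.linalg.norm(points - points[i], axis=1)
            for j in np.nonzero(dists < eps)[0]:
                if labels[j] == -1:
                    labels[j] = current
                    queue.append(j)
        current += 1
    return labels

# Two well-separated groups become two components
pts = np.array([[0, 0], [0.5, 0], [0.4, 0.3], [5, 5], [5.2, 5.1]])
print(connected_component_clusters(pts, eps=1.0))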
Chameleon: Clustering Using Dynamic Modeling

Sources: http://wiki.madio.net/index.php?doc-view-997 and George Karypis, Eui-Hong (Sam) Han, Vipin Kumar, "Chameleon: A Hierarchical Clustering Algorithm Using Dynamic Modeling".
Chameleon adapts to the characteristics of the data set to find the natural clusters.
Chameleon is a hierarchical clustering algorithm that uses dynamic modeling. During clustering, two clusters are merged if the interconnectivity and closeness between them are highly related to the internal interconnectivity and closeness of the objects within each cluster. This dynamic-model-based merging process favors the discovery of natural, homogeneous clusters, and it can be applied to any type of data as long as a similarity function is defined.
Use a dynamic model to measure the similarity between clusters
  • Main property is the relative closeness and relative inter-connectivity of the cluster
  • Two clusters are combined if the resulting cluster shares certain properties with the constituent clusters
  • The merging scheme preserves self-similarity
Self-similarity is important here; both the textbook and the paper above explain it in detail.
The Chameleon algorithm in detail:
  1. Build a k-nearest-neighbor graph
  2. Partition the graph using a multilevel graph partitioning algorithm
  3. Repeat
  4. Merge the clusters that best preserve the cluster self-similarity with respect to relative interconnectivity and relative closeness
  5. Until no more clusters can be merged
    Chameleon was motivated by the observed weaknesses of two hierarchical clustering algorithms, CURE and ROCK. CURE and related schemes ignore information about the interconnectivity of objects in two different clusters, while ROCK and related schemes emphasize object interconnectivity but ignore information about the closeness of objects.
How does Chameleon work? Chameleon first uses a graph partitioning algorithm to cluster the data objects into a large number of relatively small sub-clusters. It then uses an agglomerative hierarchical clustering algorithm to find the genuine clusters by repeatedly merging sub-clusters. To decide which sub-clusters are most similar, it takes into account both the interconnectivity and the closeness of the clusters, especially the internal characteristics of the clusters themselves. Thus it does not rely on a static, user-supplied model and can automatically adapt to the internal characteristics of the clusters being merged.
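
As a sketch of the merge criterion in step 4, the functions below compute relative interconnectivity (RI), relative closeness (RC), and the combined score RI * RC^alpha from the Chameleon paper. They assume the edge weights crossing between two sub-clusters and the weights of each sub-cluster's internal min-cut bisector are already available from the k-nearest-neighbor graph (computing those bisections, e.g. with a multilevel partitioner such as METIS, is not shown here); the function names are illustrative:

def relative_interconnectivity(cut_ij, bisect_i, bisect_j):
    """RI(Ci, Cj): absolute interconnectivity between the two clusters,
    normalized by the average internal interconnectivity (the weight of
    each cluster's min-cut bisector).  All arguments are lists of edge
    weights."""
    return sum(cut_ij) / (0.5 * (sum(bisect_i) + sum(bisect_j)))

def relative_closeness(cut_ij, bisect_i, bisect_j, size_i, size_j):
    """RC(Ci, Cj): average weight of the connecting edges, normalized by
    the size-weighted average of the internal bisector edge weights."""
    def avg(ws):
        return sum(ws) / len(ws) if ws else 0.0
    n = size_i + size_j
    return avg(cut_ij) / ((size_i / n) * avg(bisect_i)
                          + (size_j / n) * avg(bisect_j))

def merge_score(cut_ij, bisect_i, bisect_j, size_i, size_j, alpha=2.0):
    """Chameleon merges the pair of sub-clusters that maximizes
    RI * RC**alpha; alpha > 1 gives more weight to relative closeness."""
    ri = relative_interconnectivity(cut_ij, bisect_i, bisect_j)
    rc = relative_closeness(cut_ij, bisect_i, bisect_j, size_i, size_j)
    return ri * rc ** alpha

# Hypothetical edge-weight lists taken from a k-NN graph: edges crossing
# between the two sub-clusters, and edges cut by each sub-cluster's
# internal min-cut bisection.
print(merge_score(cut_ij=[0.8, 0.7, 0.9],
                  bisect_i=[0.9, 0.8], bisect_j=[0.85, 0.9],
                  size_i=40, size_j=35))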



Jarvis-Patrick Clustering  

To understand this algorithm, you first need to understand SNN (Shared Nearest Neighbor) similarity.
In some cases, clustering techniques that rely on standard approaches to similarity and density do not produce the desired clustering result. SNN introduces an indirect approach to similarity that is based on the following principle:
if two points are similar to many of the same points, then they are similar to one another, even if a direct measurement of similarity does not indicate this.
The SNN similarity computation is as follows:
Find the k-nearest neighbors of all points.
if two points, x and y, are not among the k-nearest neighbors of each other then
    similarity(x, y) <- 0
else
    similarity(x, y) <- number of shared neighbors
end if
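
A runnable sketch of this SNN similarity computation, assuming Euclidean distance and a brute-force k-nearest-neighbor search; the function name is illustrative:

import numpy as np

def snn_similarity(points, k=5):
    """Shared-nearest-neighbor similarity: sim[i, j] is the number of
    shared k-nearest neighbors if i and j are in each other's k-NN
    lists, and 0 otherwise."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    # k nearest neighbors of each point (excluding the point itself).
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    knn = [set(np.argsort(dists[i])[:k]) for i in range(n)]

    sim = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            # Only mutual k-nearest neighbors get a nonzero similarity.
            if j in knn[i] and i in knn[j]:
                sim[i, j] = sim[j, i] = len(knn[i] & knn[j])
    return sim

pts = np.random.RandomState(1).rand(12, 2)
print(snn_similarity(pts, k=4))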
"http://btluke.com/jpclust.html"
First, the k-nearest neighbors of all points are found. In graph terms this can be regarded as breaking all but the k strongest links from a point to the other points in the proximity graph.

A pair of points is put in the same cluster if
the two points share more than T neighbors and
the two points are in each other's k-nearest-neighbor lists.

For instance, we might choose a nearest-neighbor list of size 20 and put points in the same cluster if they share more than 10 near neighbors.
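Putting the pieces together, here is a minimal Jarvis-Patrick sketch under the same assumptions as the SNN sketch above (Euclidean distance, brute-force k-NN): two points are linked when they are in each other's k-nearest-neighbor lists and share more than T of those neighbors, and the connected components of the resulting graph are the clusters. The parameter names k and T are illustrative:

import numpy as np

def jp_cluster(points, k=20, T=10):
    """Jarvis-Patrick clustering: link two points if they are in each
    other's k-nearest-neighbor lists and share more than T of those
    neighbors, then return connected components as cluster labels."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    knn = [set(np.argsort(dists[i])[:k]) for i in range(n)]

    # Adjacency under the JP rule: mutual k-NN and > T shared neighbors.
    adj = [[j for j in range(n)
            if j != i and i in knn[j] and j in knn[i]
            and len(knn[i] & knn[j]) > T]
           for i in range(n)]

    # Connected components of the JP graph are the clusters.
    labels, current = [-1] * n, 0
    for start in range(n):
        if labels[start] != -1:
            continue
        stack = [start]
        labels[start] = current
        while stack:
            i = stack.pop()
            for j in adj[i]:
                if labels[j] == -1:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

pts = np.random.RandomState(2).rand(60, 2)
print(jp_cluster(pts, k=20, T=10))
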
Advantages:
(1) It is good at dealing with noise and outliers and can handle clusters of different sizes, shapes, and densities.
(2) It works well for high-dimensional data and is particularly good at finding tight clusters of strongly related objects.
Limitations:
It is somewhat brittle: it may split true clusters or join clusters that should be kept separate.
Not all objects are clustered.
Choosing the best values for the parameters can be difficult.
Basic complexity:
The storage requirement of the JP clustering algorithm is only O(km), since only the k-nearest-neighbor lists need to be kept rather than the full proximity matrix.
The basic time complexity of JP clustering is O(m²), because building the k-nearest-neighbor lists may require computing all pairwise proximities.
For low-dimensional Euclidean data, more efficient techniques (e.g., k-d trees) can be used to find the k-nearest neighbors, which reduces the time complexity from O(m²) to O(m log m).

