The Mahout Implementation of Fuzzy K-Means

2024-06-18 18:08
Tags: implementation, mahout, fuzzykmeans


I have to say, Google is the more reliable source, and the official website is more reliable still!

So make good use of Google and the official website!

https://mahout.apache.org/users/clustering/fuzzy-k-means.html

Fuzzy K-Means

Fuzzy K-Means (also called Fuzzy C-Means) is an extension of K-Means, the popular simple clustering technique. While K-Means discovers hard clusters (a point belongs to only one cluster), Fuzzy K-Means is a more statistically formalized method and discovers soft clusters, where a particular point can belong to more than one cluster with a certain probability.

Algorithm

Like K-Means, Fuzzy K-Means works on objects that can be represented in an n-dimensional vector space and for which a distance measure is defined. The algorithm is similar to k-means.

  • Initialize k clusters
  • Until converged
    • For every (point, cluster) pair, compute the probability that the point belongs to the cluster
    • Recompute the cluster centers using the above probability membership values of points to clusters
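The loop above can be sketched in a few lines of plain Python. This is a minimal sketch, not Mahout's code: it assumes the textbook fuzzy c-means formulas (membership u_j = 1 / Σ_k (d_j / d_k)^(2/(m-1)), centers as the u^m-weighted mean of the points), Euclidean distance, and "converged" meaning that no center moved more than `delta`. All function names are hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def memberships(point, centers, m=2.0, distance=euclidean):
    """Membership of one point in every cluster; the values sum to 1."""
    dists = [distance(point, c) for c in centers]
    # A point lying exactly on a center belongs entirely to that center
    # (shared equally if several centers coincide); this avoids dividing by 0.
    zeros = [i for i, d in enumerate(dists) if d == 0.0]
    if zeros:
        return [1.0 / len(zeros) if i in zeros else 0.0
                for i in range(len(centers))]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_j / d_k) ** exponent for d_k in dists)
            for d_j in dists]

def recompute_centers(points, centers, m=2.0):
    """New center j = sum_i(u_ij^m * x_i) / sum_i(u_ij^m)."""
    dim = len(points[0])
    sums = [[0.0] * dim for _ in centers]
    weights = [0.0] * len(centers)
    for p in points:
        for j, u in enumerate(memberships(p, centers, m)):
            w = u ** m
            weights[j] += w
            for d in range(dim):
                sums[j][d] += w * p[d]
    # Keep the old center if a cluster received no weight at all.
    return [[s / weights[j] for s in sums[j]] if weights[j] > 0 else list(centers[j])
            for j in range(len(centers))]

def fuzzy_kmeans(points, centers, m=2.0, delta=0.5, max_iter=10):
    """Iterate until no center moves more than delta, or max_iter is reached."""
    for _ in range(max_iter):
        new_centers = recompute_centers(points, centers, m)
        moved = max(euclidean(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if moved <= delta:
            break
    return centers
```

For example, `fuzzy_kmeans([(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)], [(0.0, 0.0), (10.0, 10.0)])` moves each center to roughly the middle of its nearby pair of points.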

Design Implementation

The design is similar to the K-Means implementation in Mahout. It accepts an input file containing vector points. The user can either provide the cluster centers as input or let the canopy algorithm run and create the initial clusters.

Similar to K-Means, the program doesn't modify the input directories. For every iteration, the cluster output is stored in a directory cluster-N. The code sets the number of reduce tasks equal to the number of map tasks, so that many part-0 files are created in the clusterN directory. The code uses driver/mapper/combiner/reducer as follows:

FuzzyKMeansDriver - This is similar to KMeansDriver. It iterates over input points and cluster points for a specified number of iterations or until it has converged. During every iteration i, a new cluster-i directory is created which contains the modified cluster centers obtained during that FuzzyKMeans iteration. These are fed as input clusters to the next iteration. Once Fuzzy KMeans has run for the specified number of iterations or has converged, a map task is run to output "the point and the cluster membership to each cluster" pair as final output to a directory named "points".

FuzzyKMeansMapper - reads the input clusters during its configure() method, then computes the cluster membership probability of a point to each cluster. Cluster membership is inversely proportional to the distance. Distance is computed using the user-supplied distance measure. Output key is the encoded clusterId. Output values are ClusterObservations containing observation statistics.

FuzzyKMeansCombiner - receives all key:value pairs from the mapper and produces partial sums of the cluster membership probability times input vectors for each cluster. Output key is: encoded cluster identifier. Output values are ClusterObservations containing observation statistics.

FuzzyKMeansReducer - Multiple reducers each receive certain keys and all values associated with those keys. The reducer sums the values to produce a new centroid for the cluster, which is output. Output key is: encoded cluster identifier (e.g. "C14"). Output value is: formatted cluster identifier (e.g. "C14"). The reducer encodes unconverged clusters with a 'Cn' clusterId and converged clusters with a 'Vn' clusterId.
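The data flow through mapper, combiner and reducer can be imitated in plain Python to see why partial sums are enough to rebuild a centroid. This is only an illustrative sketch: none of the names below are Mahout classes. It also assumes that each observation is weighted by u^m, as in the standard fuzzy c-means update, whereas the description above only says "cluster membership probability times input vectors". Each inner list in `splits` stands in for the input of one map task.

```python
import math
from collections import defaultdict

def memberships(point, centers, m):
    """Standard fuzzy c-means membership of one point in every cluster."""
    d = [math.dist(point, c) for c in centers]
    if 0.0 in d:  # point sits exactly on a center
        hits = d.count(0.0)
        return [1.0 / hits if x == 0.0 else 0.0 for x in d]
    e = 2.0 / (m - 1.0)
    return [1.0 / sum((dj / dk) ** e for dk in d) for dj in d]

def map_point(point, centers, m):
    """Mapper: emit (cluster_id, (weight, weighted_vector)) for every cluster."""
    for cid, u in enumerate(memberships(point, centers, m)):
        w = u ** m  # assumption: weight is u^m
        yield cid, (w, [w * x for x in point])

def combine(pairs):
    """Combiner: partial sums of weights and weighted vectors per cluster id."""
    acc = {}
    for cid, (w, vec) in pairs:
        if cid not in acc:
            acc[cid] = (w, list(vec))
        else:
            w0, v0 = acc[cid]
            acc[cid] = (w0 + w, [a + b for a, b in zip(v0, vec)])
    return acc

def reduce_cluster(partials):
    """Reducer: merge the partial sums of one cluster into its new centroid."""
    total_w = sum(w for w, _ in partials)
    dim = len(partials[0][1])
    return [sum(v[d] for _, v in partials) / total_w for d in range(dim)]

def one_iteration(splits, centers, m=2.0):
    """Run map -> combine per split, then reduce per cluster id."""
    per_cluster = defaultdict(list)
    for split in splits:
        pairs = (kv for p in split for kv in map_point(p, centers, m))
        for cid, partial in combine(pairs).items():
            per_cluster[cid].append(partial)
    return [reduce_cluster(per_cluster[cid]) for cid in sorted(per_cluster)]
```

Because the combiner only adds weights and weighted vectors, the result does not depend on how the points are divided among map tasks.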

Running Fuzzy k-Means Clustering

The Fuzzy k-Means clustering algorithm may be run using a command-line invocation on FuzzyKMeansDriver.main or by making a Java call to FuzzyKMeansDriver.run().

Invocation using the command line takes the form:

bin/mahout fkmeans \
    -i <input vectors directory> \
    -c <input clusters directory> \
    -o <output working directory> \
    -dm <DistanceMeasure> \
    -m <fuzziness argument >1> \
    -x <maximum number of iterations> \
    -k <optional number of initial clusters to sample from input vectors> \
    -cd <optional convergence delta. Default is 0.5> \
    -ow <overwrite output directory if present> \
    -cl <run input vector clustering after computing Clusters> \
    -e <emit vectors to most likely cluster during clustering> \
    -t <threshold to use for clustering if -e is false> \
    -xm <execution method: sequential or mapreduce>

Note: if the -k argument is supplied, any clusters in the -c directory will be overwritten and -k random points will be sampled from the input vectors to become the initial cluster centers.

Invocation using Java involves supplying the following arguments:

  1. input: a file path string to a directory containing the input data set, a SequenceFile(WritableComparable, VectorWritable). The sequence file key is not used.
  2. clustersIn: a file path string to a directory containing the initial clusters, a SequenceFile(key, SoftCluster | Cluster | Canopy). Fuzzy k-Means SoftClusters, k-Means Clusters and Canopy Canopies may be used for the initial clusters.
  3. output: a file path string to an empty directory which is used for all output from the algorithm.
  4. measure: the fully-qualified class name of an instance of DistanceMeasure which will be used for the clustering.
  5. convergence: a double value used to determine if the algorithm has converged (clusters have not moved more than the value in the last iteration)
  6. max-iterations: the maximum number of iterations to run, independent of the convergence specified
  7. m: the "fuzzyness" argument, a double > 1. For m equal to 2, this is equivalent to normalising the coefficients linearly to make their sum 1. When m is close to 1, the cluster center closest to the point is given much more weight than the others, and the algorithm is similar to k-means.
  8. runClustering: a boolean indicating, if true, that the clustering step is to be executed after clusters have been determined.
  9. emitMostLikely: a boolean indicating, if true, that the clustering step should only emit the most likely cluster for each clustered point.
  10. threshold: a double indicating, if emitMostLikely is false, the cluster probability threshold used for emitting multiple clusters for each point. A value of 0 will emit all clusters with their associated probabilities for each vector.
  11. runSequential: a boolean indicating, if true, that the algorithm is to use the sequential reference implementation running in memory.
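Two of the arguments above are easier to follow with numbers: the effect of m on the memberships, and how emitMostLikely and threshold select what is written for each point. The sketch below is a hedged illustration in plain Python, not Mahout's implementation. It assumes the standard fuzzy c-means membership formula, and it assumes that a cluster is emitted when its probability is greater than or equal to the threshold (the text above does not say whether the comparison is strict). Both function names are hypothetical.

```python
def memberships_from_distances(dists, m):
    """Membership of one point given its non-zero distances to each center."""
    e = 2.0 / (m - 1.0)
    return [1.0 / sum((dj / dk) ** e for dk in dists) for dj in dists]

def emit(probs, emit_most_likely=True, threshold=0.0):
    """Return the (cluster_id, probability) pairs to write for one point."""
    if emit_most_likely:
        best = max(range(len(probs)), key=probs.__getitem__)
        return [(best, probs[best])]
    # Assumption: ">=" so that a threshold of 0 emits every cluster.
    return [(cid, p) for cid, p in enumerate(probs) if p >= threshold]

# A point at distance 1 from one center and distance 3 from another:
for m in (1.1, 2.0, 5.0):
    print(m, memberships_from_distances([1.0, 3.0], m))
```

With m close to 1 nearly all of the membership goes to the nearest center, and larger values of m spread it more evenly across the clusters.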

After running the algorithm, the output directory will contain:

  1. clusters-N: directories containing SequenceFiles(Text, SoftCluster) produced by the algorithm for each iteration. The Text key is a cluster identifier string.
  2. clusteredPoints: (if runClustering enabled) a directory containing SequenceFile(IntWritable, WeightedVectorWritable). The IntWritable key is the clusterId. The WeightedVectorWritable value is a bean containing a double weight and a VectorWritable vector where the weights are computed as 1/(1+distance) where the distance is between the cluster center and the vector using the chosen DistanceMeasure.
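The weight formula quoted for clusteredPoints can be checked with a one-line function. This is a small sketch that assumes Euclidean distance as the chosen measure; `point_weight` is a hypothetical name, not a Mahout method.

```python
import math

def point_weight(center, vector, distance=math.dist):
    """Weight 1/(1+distance): 1.0 on the center, shrinking toward 0 far away."""
    return 1.0 / (1.0 + distance(center, vector))
```

A vector lying on the cluster center gets weight 1, and a vector at distance 5 gets weight 1/6.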

Examples

The following images illustrate Fuzzy k-Means clustering applied to a set of randomly-generated 2-d data points. The points are generated using a normal distribution centered at a mean location and with a constant standard deviation. See the README file at /examples/src/main/java/org/apache/mahout/clustering/display/README.txt for details on running similar examples.

The points are generated as follows:

  • 500 samples m=[1.0, 1.0] sd=3.0
  • 300 samples m=[1.0, 0.0] sd=0.5
  • 300 samples m=[0.0, 2.0] sd=0.1
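A comparable data set can be generated with the Python standard library. This sketch assumes the same standard deviation on both axes and independent coordinates. The fixed seed and the names `SPECS` and `generate` are hypothetical, and this is not the generator used by the Mahout display examples.

```python
import random

# (number of samples, mean, standard deviation), as listed above
SPECS = [
    (500, (1.0, 1.0), 3.0),
    (300, (1.0, 0.0), 0.5),
    (300, (0.0, 2.0), 0.1),
]

def generate(specs=SPECS, seed=0):
    """Draw 2-d points from one normal distribution per spec entry."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    points = []
    for n, mean, sd in specs:
        for _ in range(n):
            points.append(tuple(rng.gauss(mu, sd) for mu in mean))
    return points
```

The result is a list of 1100 (x, y) tuples that can be passed to any clustering routine that accepts 2-d points.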

In the first image, the points are plotted and the 3-sigma boundaries of their generator are superimposed.

[image: fuzzy]

In the second image, the resulting clusters (k=3) are shown superimposed upon the sample data. As Fuzzy k-Means is an iterative algorithm, the centers of the clusters in each recent iteration are shown using different colors. Bold red is the final clustering and previous iterations are shown in orange, yellow, green, blue, violet and gray. Although it misses a lot of the points and cannot capture the original, superimposed cluster centers, it does a decent job of clustering this data.

[image: fuzzy]

The third image shows the results of running Fuzzy k-Means on a different data set which is generated using asymmetrical standard deviations. Fuzzy k-Means does a fair job handling this data set as well.

[image: fuzzy]





http://www.chinasem.cn/article/1072722
