Mahout Implementation of Fuzzy K-Means

2024-06-18 18:08
Tags: implementation, mahout, fuzzykmeans

This article introduces the Mahout implementation of Fuzzy K-Means, as a reference for developers working with it.

I have to say, Google is the more reliable source, and even more reliable than Google is the official site!

So make good use of Google and the official website!

https://mahout.apache.org/users/clustering/fuzzy-k-means.html

Fuzzy K-Means

Fuzzy K-Means (also called Fuzzy C-Means) is an extension of K-Means, the popular simple clustering technique. While K-Means discovers hard clusters (a point belongs to only one cluster), Fuzzy K-Means is a more statistically formalized method and discovers soft clusters, where a particular point can belong to more than one cluster with a certain probability.

Algorithm

Like K-Means, Fuzzy K-Means works on objects that can be represented in an n-dimensional vector space and for which a distance measure is defined. The algorithm is similar to k-means.

  • Initialize k clusters
  • Until converged
    • Compute the probability that a point belongs to a cluster, for every (point, cluster) pair
    • Recompute the cluster centers using above probability membership values of points to clusters
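The loop above can be sketched in a few lines of plain Java. This is a minimal illustration on 1-D points, not Mahout's code: the class name `FuzzyKMeansSketch`, the method names, and the sample data are all hypothetical, and the membership update uses the standard fuzzy c-means formula as an assumption.

```java
import java.util.Arrays;

/** Hypothetical sketch of the fuzzy k-means loop on 1-D points; not Mahout's code. */
final class FuzzyKMeansSketch {

    /** Membership of a point in each cluster: u_j = 1 / sum_k (d_j / d_k)^(2/(m-1)). */
    static double[] memberships(double point, double[] centers, double m) {
        double[] dist = new double[centers.length];
        for (int j = 0; j < centers.length; j++) {
            dist[j] = Math.abs(point - centers[j]);
            if (dist[j] == 0.0) {
                // The point sits exactly on a center: give it full membership there.
                double[] exact = new double[centers.length];
                exact[j] = 1.0;
                return exact;
            }
        }
        double[] u = new double[centers.length];
        for (int j = 0; j < centers.length; j++) {
            double denom = 0.0;
            for (int k = 0; k < centers.length; k++) {
                denom += Math.pow(dist[j] / dist[k], 2.0 / (m - 1.0));
            }
            u[j] = 1.0 / denom;
        }
        return u;
    }

    /** Iterates until no center moves more than delta, or maxIterations is reached. */
    static double[] cluster(double[] points, double[] initialCenters,
                            double m, double delta, int maxIterations) {
        double[] centers = initialCenters.clone();
        for (int iter = 0; iter < maxIterations; iter++) {
            double[] weightedSum = new double[centers.length];
            double[] weightTotal = new double[centers.length];
            for (double p : points) {
                double[] u = memberships(p, centers, m);
                for (int j = 0; j < centers.length; j++) {
                    double w = Math.pow(u[j], m); // membership raised to the fuzziness m
                    weightedSum[j] += w * p;
                    weightTotal[j] += w;
                }
            }
            double maxShift = 0.0;
            for (int j = 0; j < centers.length; j++) {
                double updated = weightedSum[j] / weightTotal[j];
                maxShift = Math.max(maxShift, Math.abs(updated - centers[j]));
                centers[j] = updated;
            }
            if (maxShift <= delta) {
                break; // converged: no center moved more than delta
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        double[] points = {0.0, 0.2, 0.4, 9.6, 9.8, 10.0};
        double[] centers = cluster(points, new double[] {1.0, 8.0}, 2.0, 1e-6, 100);
        System.out.println(Arrays.toString(centers));
    }
}
```

With two well-separated groups of points, the two centers should settle near the middle of each group, while every point still carries a small nonzero membership in the other cluster.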

Design Implementation

The design is similar to the K-Means implementation present in Mahout. It accepts an input file containing vector points. The user can either provide the cluster centers as input or allow the canopy algorithm to run and create the initial clusters.

Similar to K-Means, the program doesn't modify the input directories. For every iteration, the cluster output is stored in a directory cluster-N. The code sets the number of reduce tasks equal to the number of map tasks, so that many part-0 files are created in the clusterN directory. The code uses driver/mapper/combiner/reducer as follows:

FuzzyKMeansDriver - This is similar to KMeansDriver. It iterates over the input points and cluster points for the specified number of iterations or until it converges. During every iteration i, a new cluster-i directory is created which contains the modified cluster centers obtained during that FuzzyKMeans iteration. These are fed in as the input clusters for the next iteration. Once Fuzzy KMeans has run for the specified number of iterations or has converged, a map task is run to output "the point and the cluster membership to each cluster" pairs as final output to a directory named "points".

FuzzyKMeansMapper - Reads the input clusters during its configure() method, then computes the cluster membership probability of a point to each cluster. Cluster membership is inversely proportional to the distance, which is computed using the user-supplied distance measure. Output key is the encoded clusterId. Output values are ClusterObservations containing observation statistics.
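One common way to make "inversely proportional to the distance" precise is the standard fuzzy c-means membership formula. It is given here as an assumption about what the mapper computes, since the text above does not state the exact expression. With $d_{ij}$ the distance from point $i$ to cluster center $j$, $K$ clusters, and fuzziness $m > 1$:

$$u_{ij} = \frac{1}{\sum_{k=1}^{K} \left( d_{ij} / d_{ik} \right)^{2/(m-1)}}$$

For $m = 2$ this reduces to $u_{ij} = d_{ij}^{-2} / \sum_{k} d_{ik}^{-2}$. As a worked case, a point at distances $1$ and $3$ from two centers gets $u_{i1} = 1 / (1 + 1/9) = 0.9$ and $u_{i2} = 1 / (9 + 1) = 0.1$, so the memberships sum to 1 and the closer center dominates.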

FuzzyKMeansCombiner - receives all key:value pairs from the mapper and produces partial sums of the cluster membership probability times input vectors for each cluster. Output key is: encoded cluster identifier. Output values are ClusterObservations containing observation statistics.

FuzzyKMeansReducer - Multiple reducers each receive certain keys and all values associated with those keys. The reducer sums the values to produce a new centroid for the cluster, which is output. Output key is: encoded cluster identifier (e.g. "C14"). Output value is: formatted cluster identifier (e.g. "C14"). The reducer encodes unconverged clusters with a 'Cn' clusterId and converged clusters with a 'Vn' clusterId.
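The combiner and reducer steps work because the new centroid is a ratio of two sums, and sums can be accumulated in any grouping. The sketch below illustrates that idea only. `PartialSum` is a hypothetical stand-in, not Mahout's ClusterObservations class, and weighting each point by its membership raised to the power m is an assumption taken from the standard algorithm.

```java
import java.util.Arrays;

/** Hypothetical stand-in for per-cluster observation statistics; not Mahout's class. */
final class PartialSum {
    final double weight;        // running sum of membership^m
    final double[] weightedSum; // running sum of membership^m * point

    PartialSum(double weight, double[] weightedSum) {
        this.weight = weight;
        this.weightedSum = weightedSum;
    }

    /** What a mapper would emit for one (point, cluster) pair. */
    static PartialSum of(double membership, double m, double[] point) {
        double w = Math.pow(membership, m);
        double[] scaled = new double[point.length];
        for (int i = 0; i < point.length; i++) {
            scaled[i] = w * point[i];
        }
        return new PartialSum(w, scaled);
    }

    /** What a combiner or reducer would do: add two partial sums for the same cluster. */
    PartialSum plus(PartialSum other) {
        double[] sum = new double[weightedSum.length];
        for (int i = 0; i < sum.length; i++) {
            sum[i] = weightedSum[i] + other.weightedSum[i];
        }
        return new PartialSum(weight + other.weight, sum);
    }

    /** The reducer's final step: weighted mean of all observations. */
    double[] centroid() {
        double[] c = new double[weightedSum.length];
        for (int i = 0; i < c.length; i++) {
            c[i] = weightedSum[i] / weight;
        }
        return c;
    }

    /** Identifier convention from the text: 'Cn' while unconverged, 'Vn' once converged. */
    static String identifier(int n, boolean converged) {
        return (converged ? "V" : "C") + n;
    }

    public static void main(String[] args) {
        PartialSum a = of(1.0, 2.0, new double[] {2.0, 4.0});
        PartialSum b = of(0.5, 2.0, new double[] {6.0, 8.0});
        System.out.println(identifier(14, false) + " -> " + Arrays.toString(a.plus(b).centroid()));
    }
}
```

Because `plus` is associative and commutative, it does not matter how many observations a combiner merges before the reducer sees them.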

Running Fuzzy k-Means Clustering

The Fuzzy k-Means clustering algorithm may be run using a command-line invocation on FuzzyKMeansDriver.main or by making a Java call to FuzzyKMeansDriver.run().

Invocation using the command line takes the form:

bin/mahout fkmeans \
    -i <input vectors directory> \
    -c <input clusters directory> \
    -o <output working directory> \
    -dm <DistanceMeasure> \
    -m <fuzziness argument >1> \
    -x <maximum number of iterations> \
    -k <optional number of initial clusters to sample from input vectors> \
    -cd <optional convergence delta. Default is 0.5> \
    -ow <overwrite output directory if present> \
    -cl <run input vector clustering after computing Clusters> \
    -e <emit vectors to most likely cluster during clustering> \
    -t <threshold to use for clustering if -e is false> \
    -xm <execution method: sequential or mapreduce>

Note: if the -k argument is supplied, any clusters in the -c directory will be overwritten and -k random points will be sampled from the input vectors to become the initial cluster centers.

Invocation using Java involves supplying the following arguments:

  1. input: a file path string to a directory containing the input data set, a SequenceFile(WritableComparable, VectorWritable). The sequence file key is not used.
  2. clustersIn: a file path string to a directory containing the initial clusters, a SequenceFile(key, SoftCluster | Cluster | Canopy). Fuzzy k-Means SoftClusters, k-Means Clusters and Canopy Canopies may be used for the initial clusters.
  3. output: a file path string to an empty directory which is used for all output from the algorithm.
  4. measure: the fully-qualified class name of an instance of DistanceMeasure which will be used for the clustering.
  5. convergence: a double value used to determine if the algorithm has converged (clusters have not moved more than the value in the last iteration)
  6. max-iterations: the maximum number of iterations to run, independent of the convergence specified
  7. m: the "fuzzyness" argument, a double > 1. For m equal to 2, this is equivalent to normalising the coefficients linearly to make their sum 1. When m is close to 1, the cluster center closest to the point is given much more weight than the others, and the algorithm is similar to k-means.
  8. runClustering: a boolean indicating, if true, that the clustering step is to be executed after clusters have been determined.
  9. emitMostLikely: a boolean indicating, if true, that the clustering step should only emit the most likely cluster for each clustered point.
  10. threshold: a double indicating, if emitMostLikely is false, the cluster probability threshold used for emitting multiple clusters for each point. A value of 0 will emit all clusters with their associated probabilities for each vector.
  11. runSequential: a boolean indicating, if true, that the algorithm is to use the sequential reference implementation running in memory.
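How emitMostLikely and threshold interact can be shown with a small sketch. This illustrates the semantics described in the list and is not Mahout's classification code: the class `EmitSketch` and its method are hypothetical, and the inclusive comparison against the threshold is an assumption chosen so that a threshold of 0 emits every cluster.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of the emit rules for one clustered point; not Mahout's code. */
final class EmitSketch {

    /** Returns the (cluster index, probability) entries that would be emitted. */
    static Map<Integer, Double> emit(double[] memberships, boolean emitMostLikely,
                                     double threshold) {
        Map<Integer, Double> out = new LinkedHashMap<>();
        if (emitMostLikely) {
            // Only the single most likely cluster is emitted; threshold is ignored.
            int best = 0;
            for (int j = 1; j < memberships.length; j++) {
                if (memberships[j] > memberships[best]) {
                    best = j;
                }
            }
            out.put(best, memberships[best]);
        } else {
            // Every cluster whose probability reaches the threshold is emitted
            // (inclusive comparison is an assumption).
            for (int j = 0; j < memberships.length; j++) {
                if (memberships[j] >= threshold) {
                    out.put(j, memberships[j]);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[] u = {0.7, 0.2, 0.1};
        System.out.println(emit(u, true, 0.0));
        System.out.println(emit(u, false, 0.15));
        System.out.println(emit(u, false, 0.0));
    }
}
```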

After running the algorithm, the output directory will contain:

  1. clusters-N: directories containing SequenceFiles(Text, SoftCluster) produced by the algorithm for each iteration. The Text key is a cluster identifier string.
  2. clusteredPoints: (if runClustering enabled) a directory containing SequenceFile(IntWritable, WeightedVectorWritable). The IntWritable key is the clusterId. The WeightedVectorWritable value is a bean containing a double weight and a VectorWritable vector where the weights are computed as 1/(1+distance) where the distance is between the cluster center and the vector using the chosen DistanceMeasure.

Examples

The following images illustrate Fuzzy k-Means clustering applied to a set of randomly-generated 2-d data points. The points are generated using a normal distribution centered at a mean location and with a constant standard deviation. See the README file at /examples/src/main/java/org/apache/mahout/clustering/display/README.txt for details on running similar examples.

The points are generated as follows:

  • 500 samples m=[1.0, 1.0] sd=3.0
  • 300 samples m=[1.0, 0.0] sd=0.5
  • 300 samples m=[0.0, 2.0] sd=0.1
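A data set with this shape can be reproduced with `java.util.Random.nextGaussian()`. The sketch below is not the code from Mahout's display examples: the class `SampleData`, its method, and the fixed seed are hypothetical choices for illustration.

```java
import java.util.Random;

/** Hypothetical generator for 2-d normally distributed samples; not Mahout's code. */
final class SampleData {

    /** Draws count points around (meanX, meanY) with the same sd on both axes. */
    static double[][] generate(int count, double meanX, double meanY, double sd, Random rng) {
        double[][] points = new double[count][2];
        for (int i = 0; i < count; i++) {
            points[i][0] = meanX + sd * rng.nextGaussian();
            points[i][1] = meanY + sd * rng.nextGaussian();
        }
        return points;
    }

    public static void main(String[] args) {
        Random rng = new Random(42L); // fixed seed (an arbitrary choice) for repeatable runs
        double[][] a = generate(500, 1.0, 1.0, 3.0, rng);
        double[][] b = generate(300, 1.0, 0.0, 0.5, rng);
        double[][] c = generate(300, 0.0, 2.0, 0.1, rng);
        System.out.println((a.length + b.length + c.length) + " points generated");
    }
}
```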

In the first image, the points are plotted and the 3-sigma boundaries of their generator are superimposed.

[Image: fuzzy]

In the second image, the resulting clusters (k=3) are shown superimposed upon the sample data. As Fuzzy k-Means is an iterative algorithm, the centers of the clusters in each recent iteration are shown using different colors. Bold red is the final clustering and previous iterations are shown in orange, yellow, green, blue, violet and gray. Although it misses a lot of the points and cannot capture the original, superimposed cluster centers, it does a decent job of clustering this data.

[Image: fuzzy]

The third image shows the results of running Fuzzy k-Means on a different data set which is generated using asymmetrical standard deviations. Fuzzy k-Means does a fair job handling this data set as well.

[Image: fuzzy]




