K-Means Clustering Algorithm

August 8, 2023 | Source: Clustering Algorithms

k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.
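
The "nearest mean" criterion above is usually written as minimizing the within-cluster sum of squares. The notation here is introduced only for illustration and does not appear in the original text: $S_1, \dots, S_k$ are the clusters and $\mu_i$ is the mean of the points in $S_i$.

```latex
\underset{S_1, \dots, S_k}{\arg\min} \; \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2,
\qquad
\mu_i = \frac{1}{|S_i|} \sum_{x \in S_i} x
```

Each observation contributes its squared Euclidean distance to the mean of the cluster it is assigned to, so a smaller total means tighter clusters.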

The problem is computationally difficult (NP-hard); however, there are efficient heuristic algorithms that are commonly employed and converge quickly to a local optimum. These are usually similar to the expectation-maximization algorithm for mixtures of Gaussian distributions via an iterative refinement approach employed by both algorithms. Additionally, they both use cluster centers to model the data; however, k-means clustering tends to find clusters of comparable spatial extent, while the expectation-maximization mechanism allows clusters to have different shapes.

The algorithm has a loose relationship to the k-nearest neighbor classifier, a popular machine learning technique for classification that is often confused with k-means because of the k in the name. One can apply the 1-nearest neighbor classifier to the cluster centers obtained by k-means to classify new data into the existing clusters. This is known as the nearest centroid classifier or the Rocchio algorithm.
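
As a rough illustration of the nearest centroid idea, the sketch below assigns a new point to whichever center is closest. The function name `nearest_centroid` and the sample centers are hypothetical; in practice the centers would come from an earlier k-means run.

```python
import math

def nearest_centroid(point, centers):
    """Return the index of the center closest to `point` (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

# Hypothetical centers, standing in for the output of a k-means run.
centers = [(0.0, 0.0), (10.0, 10.0)]

label_a = nearest_centroid((1.0, 2.0), centers)
label_b = nearest_centroid((8.0, 9.5), centers)
print(label_a, label_b)
```

This is the 1-nearest neighbor rule with the cluster centers playing the role of the training points.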

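What the algorithm does, in short: given many points on a screen, it groups each point with the center nearest to it, so that neighboring points end up in the same cluster.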

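The k-means algorithm is a clustering algorithm that divides n objects into k partitions according to their attributes, with k < n. It closely resembles the expectation-maximization algorithm for mixtures of normal distributions, because both try to find the centers of the natural clusters in the data.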

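Clustering is, at its core, the task of finding things that are closely related and separating them into groups. When there are only a few items, a person can do this by hand. With large-scale data, manual effort is no longer practical, so we rely on computers to find the relationships within massive data sets.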

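One quantity is central to clustering: the value that expresses how closely two items are related, known as the distance metric. The two earlier posts, on similarity measures and on the Pearson correlation coefficient, both describe ways of computing such a metric. Different situations call for different metrics, and the choice matters a great deal, because it directly determines whether the clustering result fits the actual need.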
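
To make the role of the metric concrete, here is a minimal sketch of two common choices: Euclidean distance and a distance derived from the Pearson correlation coefficient. The function names are hypothetical, and treating a constant vector as uncorrelated is an assumption made only to keep the example well defined.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson_distance(a, b):
    """1 minus the Pearson correlation; close to 0 when a and b rise and fall together."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        # Correlation is undefined for a constant vector;
        # returning 1.0 (uncorrelated) is an assumption of this sketch.
        return 1.0
    return 1.0 - cov / math.sqrt(var_a * var_b)

u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]
print(euclidean_distance(u, v))
print(pearson_distance(u, v))
```

The two vectors are far apart in Euclidean terms but perfectly correlated, so the two metrics disagree about how "close" they are. This is why the choice of metric changes which points end up clustered together.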

K-Means Clustering

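This is the classic clustering algorithm, and both its time and space costs are modest: one iteration takes time proportional to n × k × d for n points, k clusters, and d dimensions. The name already states the core idea: the data is clustered into K groups. The procedure is: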

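1. Randomly pick K points from the data set to serve as the initial center of each cluster (you may also choose the K points by some other method).
2. Using the distance metric, assign every point in the data set to the nearest of the K centers. The points assigned to one center form one cluster, which yields K clusters.
3. Compute the mean of each of the K clusters to obtain new centers, and replace the old centers with them. Different models may define the mean differently; in general the new center is the point that minimizes the sum of squared distances to all points in its cluster, which for Euclidean distance is the arithmetic mean.
4. Check whether the new set of K centers is the same as the previous set. If it differs, go back to step 2. If it is the same, clustering is complete.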
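
The four steps above can be sketched in plain Python as follows. This is only an illustrative version under some assumptions not stated in the original: points are tuples of numbers, the distance metric is Euclidean, an empty cluster keeps its previous center, and a `max_iter` cap guards against long runs. The names `kmeans` and `mean_point` are hypothetical.

```python
import math
import random

def mean_point(points):
    """Arithmetic mean of a non-empty list of equal-length tuples."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the initial centers.
    centers = rng.sample(points, k)
    clusters = []
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Step 3: recompute each center as the mean of its cluster.
        # An empty cluster keeps its old center (an assumption of this sketch).
        new_centers = [mean_point(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        # Step 4: stop once the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

data = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
        (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers, clusters = kmeans(data, k=2)
print(sorted(centers))
```

Because the initial centers are random, the result is a local optimum that can depend on `seed`. A common remedy is to run the procedure several times with different seeds and keep the run with the smallest total squared distance.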
