scikit-learn clustering algorithms: Hierarchical clustering

March 21, 2019 · Source: 聚类算法 (Clustering Algorithms)

Algorithm procedure

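Hierarchical clustering comes in two forms, divisive and agglomerative. The divisive approach works top-down and splits large clusters; the agglomerative approach works bottom-up and merges small clusters. In practice the bottom-up agglomerative approach is used far more often.
Only the agglomerative approach is described below; the divisive approach is analogous, run in the opposite direction:
1. Treat every sample in the dataset as its own cluster;
2. Compute the distance between every pair of clusters (governed by the linkage and affinity parameters below) and find the two closest clusters, c1 and c2;
3. Merge c1 and c2 into a single cluster;
Repeat steps 2 and 3 until the requested number of clusters is reached or a stopping condition is met (for example, a threshold on the distance between the two closest clusters).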
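
The three steps above can be sketched directly in NumPy. This is a deliberately naive illustration for small inputs, not how scikit-learn implements the algorithm; the function names naive_agglomerative and cluster_distance, and their parameters, are made up for this sketch, and only the "complete" and "average" criteria are covered.

```python
import numpy as np


def cluster_distance(a, b, linkage="average"):
    """Distance between two clusters given as (n_a, d) and (n_b, d) arrays."""
    # Euclidean distance between every point of a and every point of b.
    pairwise = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    if linkage == "complete":
        return pairwise.max()   # farthest pair of points
    if linkage == "average":
        return pairwise.mean()  # mean over all pairs of points
    raise ValueError("unsupported linkage: %r" % linkage)


def naive_agglomerative(X, n_clusters=None, distance_threshold=None,
                        linkage="average"):
    """Hypothetical helper: bottom-up clustering with one stopping rule."""
    if (n_clusters is None) == (distance_threshold is None):
        raise ValueError("set exactly one of n_clusters, distance_threshold")

    # Step 1: every sample starts as its own cluster (lists of row indices).
    clusters = [[i] for i in range(len(X))]
    target = 1 if n_clusters is None else n_clusters

    while len(clusters) > target:
        # Step 2: find the closest pair of clusters.
        best_d, best_i, best_j = np.inf, -1, -1
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(X[clusters[i]], X[clusters[j]], linkage)
                if d < best_d:
                    best_d, best_i, best_j = d, i, j

        # Alternative stopping rule: the closest pair is already too far apart.
        if distance_threshold is not None and best_d >= distance_threshold:
            break

        # Step 3: merge the pair into one cluster.
        clusters[best_i] = clusters[best_i] + clusters[best_j]
        del clusters[best_j]

    labels = np.empty(len(X), dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels


# Two tight groups of three points each, far apart from one another.
X_demo = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels_by_count = naive_agglomerative(X_demo, n_clusters=2)
labels_by_threshold = naive_agglomerative(X_demo, distance_threshold=1.0,
                                          linkage="complete")
print(labels_by_count, labels_by_threshold)
```

The sketch recomputes every pairwise cluster distance on each pass, which is far slower than a real implementation that updates distances incrementally. It is only meant to make the loop structure visible.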

Parameters in sklearn

[sklearn.cluster.AgglomerativeClustering]
n_clusters=2: int, the number of clusters to find;
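affinity='euclidean': string or callable, the metric used to compute distances. It can be 'euclidean' (also 'l2', Euclidean distance), 'manhattan' (also 'l1', Manhattan distance, which suits sparse features or sparse noise, for example text mining where occurrences of rare words are used as features and most entries are 0), 'cosine' (cosine distance), or 'precomputed' (a distance matrix computed in advance and passed to fit in place of the feature matrix). If linkage='ward', only 'euclidean' is allowed. The guideline for choosing a metric is to maximize the distance between samples of different classes while minimizing the distance between samples of the same class. Note that scikit-learn 1.2 deprecated this parameter in favor of metric, and 1.4 removed it;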
memory=None: None, str or object with the joblib.Memory interface. If a directory path is given, the computed tree is cached in that directory;
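connectivity=None: array-like or callable, a connectivity matrix of shape (n_samples, n_samples), usually sparse, whose entries are 0 or 1: 0 means the two samples are not neighbors and 1 means they are. The matrix adds connectivity constraints to the algorithm, so that only adjacent clusters can be merged. These constraints are useful for imposing local structure on the samples, and they also make the algorithm faster, especially when the number of samples is large. The matrix can be built with sklearn.neighbors.kneighbors_graph or sklearn.feature_extraction.image.grid_to_graph;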
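compute_full_tree='auto': bool or 'auto'. When False, construction of the tree stops early once n_clusters is reached, which saves computation time when the number of clusters is not small compared with the number of samples. According to the scikit-learn documentation, this option only has an effect when a connectivity matrix is specified;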
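linkage='ward': {'ward', 'complete', 'average'}, the criterion that defines the distance between clusters. 'ward' merges the pair of clusters that gives the smallest increase in total within-cluster variance; 'complete' uses the distance between the two farthest points of the two sets; 'average' uses the mean of all pairwise distances between points of the two sets. (scikit-learn 0.20 and later also accept 'single', the distance between the two closest points.) Agglomerative clustering shows a "rich get richer" behavior that leads to uneven cluster sizes. Of the three options listed here, 'complete' is the worst strategy in this respect and 'ward' gives the most regular sizes. However, 'ward' requires affinity to be 'euclidean', so 'average' is a good choice when affinity is not 'euclidean';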
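pooling_func=<function mean>: callable, used by FeatureAgglomeration to combine the values of agglomerated features into a single value. It has no effect on AgglomerativeClustering, where it was deprecated in scikit-learn 0.20 and removed in 0.22;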
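
The interaction between linkage and the distance metric is easier to see in a short run. The sketch below assumes a recent scikit-learn where the keyword may be called metric rather than affinity, so it looks the name up at run time; the variable names (metric_kw, X_blobs and so on) and the toy data are made up for illustration.

```python
import inspect

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score, pairwise_distances

# Toy data: three Gaussian blobs of 60 points each (an assumption of this sketch).
rng = np.random.RandomState(0)
offsets = np.array([[0.0, 0.0], [8.0, 0.0], [0.0, 8.0]])
X_blobs = np.vstack([rng.randn(60, 2) + offset for offset in offsets])

# Older releases call the keyword "affinity", newer ones "metric".
init_params = inspect.signature(AgglomerativeClustering).parameters
metric_kw = "metric" if "metric" in init_params else "affinity"

# A non-Euclidean metric works with "average" linkage.
manhattan_model = AgglomerativeClustering(
    n_clusters=3, linkage="average", **{metric_kw: "manhattan"})
manhattan_labels = manhattan_model.fit_predict(X_blobs)

# "precomputed" expects a distance matrix in place of the feature matrix.
D = pairwise_distances(X_blobs, metric="manhattan")
precomputed_model = AgglomerativeClustering(
    n_clusters=3, linkage="average", **{metric_kw: "precomputed"})
precomputed_labels = precomputed_model.fit_predict(D)
agreement = adjusted_rand_score(manhattan_labels, precomputed_labels)

# "ward" only accepts Euclidean distances, so this combination is rejected.
ward_rejected = False
try:
    AgglomerativeClustering(
        n_clusters=3, linkage="ward", **{metric_kw: "manhattan"}).fit(X_blobs)
except ValueError:
    ward_rejected = True

print("agreement between direct and precomputed runs:", agreement)
print("ward with manhattan rejected:", ward_rejected)
```

Passing the keyword through a dictionary is only a convenience for running the same snippet on different versions; in code pinned to one scikit-learn release, writing metric=... or affinity=... directly is clearer.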

Example code

Agglomerative clustering with and without structure

# Authors: Gael Varoquaux, Nelle Varoquaux
# License: BSD 3 clause

import time
import matplotlib.pyplot as plt
import numpy as np

from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import kneighbors_graph

# Generate sample data
n_samples = 1500
np.random.seed(0)
t = 1.5 * np.pi * (1 + 3 * np.random.rand(1, n_samples))
x = t * np.cos(t)
y = t * np.sin(t)


X = np.concatenate((x, y))
X += .7 * np.random.randn(2, n_samples)
X = X.T

# Create a graph capturing local connectivity. Larger number of neighbors
# will give more homogeneous clusters to the cost of computation
# time. A very large number of neighbors gives more evenly distributed
# cluster sizes, but may not impose the local manifold structure of
# the data
knn_graph = kneighbors_graph(X, 30, include_self=False)

for connectivity in (None, knn_graph):
    for n_clusters in (30, 3):
        plt.figure(figsize=(10, 4))
        for index, linkage in enumerate(('average', 'complete', 'ward')):
            plt.subplot(1, 3, index + 1)
            model = AgglomerativeClustering(linkage=linkage,
                                            connectivity=connectivity,
                                            n_clusters=n_clusters)
            t0 = time.time()
            model.fit(X)
            elapsed_time = time.time() - t0
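            # plt.cm.spectral was removed from matplotlib; nipy_spectral is
            # the same colormap under its current name.
            plt.scatter(X[:, 0], X[:, 1], c=model.labels_,
                        cmap=plt.cm.nipy_spectral)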
            plt.title('linkage=%s (time %.2fs)' % (linkage, elapsed_time),
                      fontdict=dict(verticalalignment='top'))
            plt.axis('equal')
            plt.axis('off')

            plt.subplots_adjust(bottom=0, top=.89, wspace=0,
                                left=0, right=1)
            plt.suptitle('n_cluster=%i, connectivity=%r' %
                         (n_clusters, connectivity is not None), size=17)


plt.show()
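
The plots are easier to compare when the cluster sizes are also printed as numbers. The following is a minimal sketch that rebuilds the same spiral data and tabulates the sizes for n_clusters=3, with and without the k-nearest-neighbors connectivity graph; the names size_table and structured are introduced here for illustration only, and no particular outcome is claimed.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import kneighbors_graph

# Same noisy spiral as in the example above.
n_samples = 1500
np.random.seed(0)
t = 1.5 * np.pi * (1 + 3 * np.random.rand(1, n_samples))
X = np.concatenate((t * np.cos(t), t * np.sin(t)))
X += .7 * np.random.randn(2, n_samples)
X = X.T

knn_graph = kneighbors_graph(X, 30, include_self=False)

# Map (linkage, structured) to the cluster sizes, largest first.
size_table = {}
for connectivity in (None, knn_graph):
    for linkage in ("average", "complete", "ward"):
        model = AgglomerativeClustering(linkage=linkage,
                                        connectivity=connectivity,
                                        n_clusters=3)
        labels = model.fit_predict(X)
        sizes = np.sort(np.bincount(labels, minlength=3))[::-1]
        size_table[(linkage, connectivity is not None)] = sizes.tolist()

for (linkage, structured), sizes in size_table.items():
    print("linkage=%-8s connectivity=%-5s sizes=%s"
          % (linkage, structured, sizes))
```

A row dominated by one very large cluster is the "rich get richer" effect mentioned in the parameter notes; how strongly it shows up depends on the data and on the number of neighbors in the graph.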

    Original author: 聚类算法 (Clustering Algorithms)
    Original post: https://blog.csdn.net/xiaoleiniu1314/article/details/80027610