Python [Minimal] Clustering Algorithms (KMeans + DBSCAN + MeanShift)
Link: https://blog.csdn.net/Yellow_python/article/details/81461056?utm_source=copy
1. Minimal clustering code
1.1 K-Means: based on Euclidean distance
1.2 DBSCAN: density-based
1.3 Mean Shift: mean shift (3-D visualization)
2. Cluster evaluation: Silhouette Coefficient
2.1 KMeans evaluation
2.2 DBSCAN evaluation
2.3 MeanShift evaluation
4. Appendix
4.1 Translation
4.2 Datasets
4.2.1 Dataset 1
4.2.2 Dataset 2
1. Minimal clustering code
1.1 K-Means: based on Euclidean distance
The time complexity of K-Means is O(nkt), which makes it suitable for mining large data sets (t can be read back from a fitted model, as sketched below), where
n: the number of objects in the data set
t: the number of iterations the algorithm runs
k: the number of clusters
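Of these three factors only t is not fixed up front. scikit-learn caps it with the max_iter parameter and records how many iterations a fit actually took in the n_iter_ attribute; a minimal sketch, with made-up data, for inspecting t:

import numpy as np
from sklearn.cluster import KMeans
X = np.random.rand(1000, 2)  # n = 1000 random 2-D points (illustrative data)
km = KMeans(n_clusters=3, max_iter=300).fit(X)  # k = 3, upper bound on t
print(km.n_iter_)  # t: iterations actually run (<= max_iter)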
Create data
import numpy as np
X = np.array([[3, 4], [6, 8], [1, 2], [6, 7], [3, 1], [5, 8], [2, 3], [8, 7], [2, 2], [4, 2], [8, 6], [7, 8], [5, 1]])
Clustering
from sklearn.cluster import KMeans
km = KMeans(n_clusters=2)  # create the KMeans estimator and set the number of clusters
km.fit(X)  # fit the data
labels = km.labels_  # clustering result (the cluster label of each sample)
print(labels)
centers = km.cluster_centers_  # cluster centers
print(centers)
Visualization
import matplotlib.pyplot as mp
for x, l in zip(X, labels):  # color each sample by its cluster label
    if l == 0:
        mp.scatter(x[0], x[1], c='r')
    else:
        mp.scatter(x[0], x[1], c='g')
for i in range(len(centers)):  # mark the cluster centers
    if i == 0:
        mp.scatter(centers[i][0], centers[i][1], c='r', marker='x', s=99)
    else:
        mp.scatter(centers[i][0], centers[i][1], c='g', marker='x', s=99)
mp.show()
1.2 DBSCAN: density-based
Density-Based Spatial Clustering of Applications with Noise
Advantages:
1. The number of clusters does not have to be known in advance
2. It can discover clusters of arbitrary shape
3. It can identify noise points (marked with the label -1; see the sketch after this list)
4. It is insensitive to the order of the samples, although a border sample between two clusters may end up in whichever of them happens to be explored first
Disadvantages:
1. It does not handle high-dimensional data well
2. It does not handle data sets whose density varies
3. Clustering quality is poor when the sample density is uneven or the gaps between clusters differ greatly
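A minimal sketch of point 3, using made-up data: any sample that is neither a core point nor reachable from one gets the label -1, so noise can be counted straight from labels_.

import numpy as np
from sklearn.cluster import DBSCAN
X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8], [50, 50]])  # last point is an outlier
labels = DBSCAN(eps=2, min_samples=2).fit(X).labels_
print(labels)  # the outlier is labelled -1
print('noise points:', np.sum(labels == -1))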
Create data
from sklearn.datasets import make_blobs  # sklearn.datasets.samples_generator was removed in newer scikit-learn
X, _ = make_blobs(n_samples=100, centers=[[1, 1], [9, 9], [7, 3]])
DBSCAN: density-based clustering
from sklearn.cluster import DBSCAN
labels = DBSCAN(eps=1, min_samples=3).fit(X).labels_
print(labels)
Visualization
import matplotlib.pyplot as mp
colors = ['red', 'blue', 'green', 'black']
for x, l in zip(X, labels):
    mp.scatter(x[0], x[1], c=colors[l])  # noise points (label -1) index the last color, black
mp.show()
1.3 Mean Shift: mean shift (3-D visualization)
Mean shift looks for the peaks of the kernel density estimate, takes them as cluster centroids, and then assigns each sample to its nearest centroid.
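A minimal sketch of what one shift step does, assuming a flat (uniform) kernel: the window centre moves to the mean of the points it currently covers, and repeating the step climbs towards a density peak. The data and bandwidth below are only illustrative.

import numpy as np

def shift_once(center, X, bandwidth):
    # flat kernel: take every point within `bandwidth` of the current centre
    neighbors = X[np.linalg.norm(X - center, axis=1) < bandwidth]
    return neighbors.mean(axis=0)  # new centre = mean of the window

X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [5.0, 5.0]])
center = np.array([0.0, 0.0])
for _ in range(10):  # a few iterations are enough to converge here
    center = shift_once(center, X, bandwidth=2.0)
print(center)  # ends up at the dense group around (1, 1)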
Read the data from the web
import requests, re, numpy as np
def download():
    url = 'https://blog.csdn.net/Yellow_python/article/details/81461056'
    header = {'User-Agent': 'Opera/8.0 (Windows NT 5.1; U; en)'}
    r = requests.get(url, headers=header)
    # Pull the second data block embedded in the blog page (Dataset 2 in the appendix).
    # The original regex was lost when the post was extracted; matching <pre> blocks is an assumption.
    data = re.findall(r'<pre>([\s\S]+?)</pre>', r.text)[1].strip()
    array = np.array([i.split(',') for i in data.split()]).astype(float)
    return array
X = download()
Mean shift
from sklearn.cluster import MeanShift
labels = MeanShift().fit(X).labels_
Visualization
import matplotlib.pyplot as mp
from mpl_toolkits import mplot3d  # registers the '3d' projection on older matplotlib
fig = mp.figure()
ax = fig.add_subplot(projection='3d')  # Axes3D(fig) no longer attaches itself to the figure on recent matplotlib
colors = ['red', 'blue', 'green', 'black']
for x, l in zip(X, labels):
    ax.scatter(x[0], x[1], x[2], c=colors[l], s=150, alpha=0.3)
mp.show()
2. Cluster evaluation: Silhouette Coefficient
a(i): the average distance from sample i to the other samples in its own cluster (intra-cluster dissimilarity)
b(i): the inter-cluster dissimilarity of sample i, i.e. the smallest average distance from i to the samples of any other cluster
s(i) = (b(i) - a(i)) / max(a(i), b(i))
s(i) close to 1: sample i is clustered reasonably
s(i) close to -1: sample i would fit better in another cluster
s(i) close to 0: sample i lies on the boundary between two clusters
from sklearn import metrics
score = metrics.silhouette_score(X, labels)
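To see where the score comes from, a minimal sketch that computes s(i) by hand for the first sample of a tiny two-cluster set and compares it with sklearn's silhouette_samples (the data are made up):

import numpy as np
from sklearn.metrics import silhouette_samples
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
labels = np.array([0, 0, 1, 1])
i = 0  # look at the first sample
same = X[labels == labels[i]]   # its own cluster (includes the sample itself)
other = X[labels != labels[i]]  # the only other cluster
a = np.mean([np.linalg.norm(X[i] - p) for p in same if not np.array_equal(p, X[i])])
b = np.mean([np.linalg.norm(X[i] - p) for p in other])
s = (b - a) / max(a, b)  # s(i) = (b(i) - a(i)) / max(a(i), b(i))
print(s, silhouette_samples(X, labels)[i])  # the two values agree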
2.1 KMeans evaluation
Read the data from the web
import requests, re, numpy as np
def download():
    url = 'https://blog.csdn.net/Yellow_python/article/details/81461056'
    header = {'User-Agent': 'Opera/8.0 (Windows NT 5.1; U; en)'}
    r = requests.get(url, headers=header)
    # Pull the first data block embedded in the blog page (Dataset 1 in the appendix).
    # The original regex was lost when the post was extracted; matching <pre> blocks is an assumption.
    data = re.findall(r'<pre>([\s\S]+?)</pre>', r.text)[0].strip()
    array = np.array([i.split(',') for i in data.split()]).astype(float)
    return array
X = download()
m, n = 2, 6  # range of cluster counts to try
for i in range(m, n):
    # KMeans clustering
    from sklearn.cluster import KMeans
    labels = KMeans(n_clusters=i).fit(X).labels_
    # Visualization
    import matplotlib.pyplot as mp
    mp.subplot(1, n - m, i - m + 1)
    colors = ['red', 'blue', 'green', 'purple', 'orange', 'cyan', 'gray', 'brown', 'yellow', 'pink', 'black']
    for x, l in zip(X, labels):
        mp.scatter(x[0], x[1], c=colors[l])
    # Cluster evaluation: Silhouette Coefficient
    from sklearn import metrics
    score = metrics.silhouette_score(X, labels)
    print('silhouette score for n_clusters = %d:' % i, score)
mp.tight_layout()
mp.show()
Output
silhouette score for n_clusters = 2: 0.6094103841500139
silhouette score for n_clusters = 3: 0.4249285827871494
silhouette score for n_clusters = 4: 0.3447569550742587
silhouette score for n_clusters = 5: 0.34076078057327047
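The scores above already point to n_clusters = 2; a minimal sketch of automating that choice, assuming the same X returned by download() above:

from sklearn.cluster import KMeans
from sklearn import metrics
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k).fit(X).labels_
    scores[k] = metrics.silhouette_score(X, labels)
best_k = max(scores, key=scores.get)  # the k with the highest silhouette score
print('best n_clusters:', best_k)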
2.2 DBSCAN evaluation
Create data
import numpy as np
X = np.array([[1, 4], [6, 8], [1, 2], [6, 7], [5, 3], [5, 8], [2, 3], [8, 7], [2, 2], [4, 2], [8, 6], [7, 8], [5, 1]])
radii = [1.414, 1.415, 2]
for i in range(3):
    # DBSCAN: density-based clustering
    from sklearn.cluster import DBSCAN
    labels = DBSCAN(eps=radii[i], min_samples=2).fit(X).labels_
    # Visualization
    import matplotlib.pyplot as mp
    mp.subplot(1, 3, i + 1)
    colors = ['red', 'blue', 'green', 'purple', 'orange', 'cyan', 'gray', 'brown', 'yellow', 'pink', 'black']
    for x, l in zip(X, labels):
        mp.scatter(x[0], x[1], c=colors[l])
    # Cluster evaluation: Silhouette Coefficient
    from sklearn import metrics
    score = metrics.silhouette_score(X, labels)
    print('silhouette score for eps = %.3f:' % radii[i], score)
mp.tight_layout()
mp.show()
Output
silhouette score for eps = 1.414: 0.36739772676132704
silhouette score for eps = 1.415: 0.6018738849706604
silhouette score for eps = 2.000: 0.6431136276704154
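The jump between eps = 1.414 and eps = 1.415 comes from diagonal neighbours in the data that sit exactly √2 ≈ 1.41421 apart, e.g. [1, 2] and [2, 3]: they are outside a 1.414 radius but inside a 1.415 one, so the density-reachable sets, and therefore the clusters, change. A minimal check:

import numpy as np
d = np.linalg.norm(np.array([1, 2]) - np.array([2, 3]))  # distance between the two diagonal neighbours
print(d)  # 1.41421..., greater than 1.414 but less than 1.415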
2.3 MeanShift evaluation
Create data
from sklearn.datasets import make_blobs  # sklearn.datasets.samples_generator was removed in newer scikit-learn
centers = [[0, 0, 0], [6, 4, 1], [9, 9, 9]]
X, _ = make_blobs(n_samples=100, centers=centers, cluster_std=2, random_state=0)
Mean shift
from sklearn.cluster import MeanShift, estimate_bandwidth
bandwidth = estimate_bandwidth(X, quantile=0.2, n_samples=50)  # estimate the bandwidth (quantile, number of samples used)
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit(X)
Cluster labels
labels = ms.labels_
Cluster centers
centers = ms.cluster_centers_
print(centers)
Cluster evaluation
from sklearn import metrics
score = metrics.silhouette_score(X, labels)
print('silhouette score: %.2f' % score)
Visualization
import matplotlib.pyplot as mp
from mpl_toolkits import mplot3d  # registers the '3d' projection on older matplotlib
fig = mp.figure()
ax = fig.add_subplot(projection='3d')  # Axes3D(fig) no longer attaches itself to the figure on recent matplotlib
colors = ['red', 'blue', 'green', 'purple', 'orange', 'cyan', 'gray', 'brown', 'yellow', 'pink', 'black']
Clustered samples
for x, l in zip(X, labels):
    ax.scatter(x[0], x[1], x[2], c=colors[l], s=120, alpha=0.2)
Cluster centers
for i in range(len(centers)):
    ax.scatter(centers[i][0], centers[i][1], centers[i][2], c=colors[i], s=200, marker='x')
mp.show()