Analyzing Customer Segment Value with K-Means Clustering

April 27, 2023. Source: 聚类算法

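K-Means is a typical distance-based, non-hierarchical clustering algorithm. It partitions the data into a predetermined number of clusters K by minimizing an error function, and it uses distance as the measure of similarity: the closer two objects are, the more similar they are considered to be.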

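The most common measures of similarity between samples are Euclidean distance, Manhattan distance, and Minkowski distance. (The KMeans implementation in Scikit-Learn supports only Euclidean distance, because other distances do not necessarily guarantee that the algorithm converges.)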
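The three distances named above are closely related: Manhattan and Euclidean distance are the Minkowski distance of order 1 and 2 respectively. A minimal sketch, assuming two made-up sample vectors and a hypothetical helper name `minkowski_distance`:

```python
import numpy as np

def minkowski_distance(x, y, p):
    """Minkowski distance of order p between two 1-D vectors (illustrative helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

a = [1.0, 2.0, 3.0]   # made-up sample vectors
b = [4.0, 6.0, 3.0]

manhattan = minkowski_distance(a, b, p=1)   # |3| + |4| + |0|
euclidean = minkowski_distance(a, b, p=2)   # sqrt(9 + 16 + 0)
print(manhattan, euclidean)
```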

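The objective function used to measure clustering quality is the sum of squared errors (SSE). Given two different clustering results for the same data and the same K, prefer the one with the smaller SSE.

Evaluating the result: the greater the similarity within each group and the greater the difference between groups, the better the clustering.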
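To make the SSE criterion concrete, the sketch below computes it for two candidate partitions of the same four points. The data, the two partitions, and the helper name `sse` are made up for illustration.

```python
import numpy as np

def sse(X, labels, centers):
    """Sum of squared Euclidean distances from each sample to its assigned center."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    diffs = X - centers[np.asarray(labels)]   # one center row per sample
    return float(np.sum(diffs ** 2))

# Four made-up points forming a long, thin rectangle
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

# Partition A: left pair vs. right pair
labels_a = np.array([0, 0, 1, 1])
centers_a = np.array([[0.0, 1.0], [10.0, 1.0]])

# Partition B: bottom pair vs. top pair
labels_b = np.array([0, 1, 0, 1])
centers_b = np.array([[5.0, 0.0], [5.0, 2.0]])

sse_a = sse(X, labels_a, centers_a)
sse_b = sse(X, labels_b, centers_b)
print(sse_a, sse_b)   # the partition with the smaller SSE is preferred
```

In scikit-learn, a fitted KMeans model exposes the same quantity as the `inertia_` attribute.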

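# Cluster consumption-behavior features with K-Means
import pandas as pd
from sklearn.cluster import KMeans

# Parameter initialization
inputfile = '../data/consumption_data.xls'
outputfile = '../out/data_type.xls'  # recent pandas versions can no longer write .xls; use an .xlsx path there

k = 3  # number of clusters
iteration = 500  # maximum number of iterations
data = pd.read_excel(inputfile, index_col='Id')
data_zs = 1.0 * (data - data.mean()) / data.std()  # standardize every column at once; no need to handle columns one by one

# n_jobs was removed from KMeans in scikit-learn 1.0, so it is not passed here
model = KMeans(n_clusters=k, max_iter=iteration)  # split into k clusters
model.fit(data_zs)  # run the clustering

# Print a brief summary
r1 = pd.Series(model.labels_).value_counts()  # number of samples in each cluster
r2 = pd.DataFrame(model.cluster_centers_)  # cluster centers
r = pd.concat([r2, r1], axis=1)  # join side by side (axis=0 would stack vertically); rows align on the cluster label
r.columns = list(data.columns) + ['cluster size']  # rename the header; data.columns holds the column labels of data
print(r)

# Write out the original data together with each sample's cluster
r = pd.concat([data, pd.Series(model.labels_, index=data.index)], axis=1)  # append each sample's cluster label
r.columns = list(data.columns) + ['cluster label']  # rename the header
r.to_excel(outputfile)  # save the result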
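The standardization step works on the whole table at once because `mean()` and `std()` return one value per column, which pandas aligns with the columns of the DataFrame. A small sketch on a toy table (the values and the column names R, F, M are assumptions, since the original Excel file is not shown here):

```python
import numpy as np
import pandas as pd

# Hypothetical toy table standing in for the consumption data
data = pd.DataFrame(
    {"R": [10.0, 20.0, 30.0], "F": [1.0, 2.0, 6.0], "M": [100.0, 300.0, 200.0]},
    index=pd.Index([1, 2, 3], name="Id"),
)

# Column-wise z-score: each column is shifted by its own mean
# and scaled by its own standard deviation.
data_zs = (data - data.mean()) / data.std()

print(data_zs.round(3))
```

Note that `DataFrame.std()` uses the sample standard deviation (ddof=1) by default, so the result can differ slightly from tools that divide by the population standard deviation.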

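sklearn.cluster.KMeans   # K-Means clustering

Parameters:

n_clusters: int, optional, default=8. The number of clusters to form, which is also the number of cluster centers to generate.

n_jobs: int. The number of parallel jobs used for the computation. This parameter belongs to older scikit-learn releases: it was deprecated in 0.23 and removed in 1.0, so passing it to a current KMeans raises an error.

n_jobs=-1: all CPUs are used.

n_jobs=1: no parallel computing code is used at all, which is useful for debugging.

max_iter: int, default=300. The maximum number of iterations of the K-Means algorithm for a single run.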

Attributes:

labels_: array, shape (n_samples,). The cluster label of each point.

cluster_centers_: array, shape (n_clusters, n_features). Coordinates of the cluster centers.
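The shapes of the two attributes can be checked on synthetic data. This is a sketch only: the three blobs, the random seed, and the `n_init` and `random_state` settings are assumptions chosen to make the run reproducible.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three synthetic, well-separated blobs of 20 points each, with 2 features
X = np.vstack([
    rng.normal(loc=center, scale=0.1, size=(20, 2))
    for center in ([0.0, 0.0], [5.0, 5.0], [0.0, 5.0])
])

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(model.labels_.shape)           # one label per sample
print(model.cluster_centers_.shape)  # one row per cluster, one column per feature
```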

Methods:

fit(X[, y]): Compute k-means clustering.

Parameters:

X : array-like or sparse matrix, shape=(n_samples, n_features)

Training instances to cluster.

y : Ignored

Methods:

pandas.read_excel: read an Excel sheet into a pandas DataFrame.

Parameters:

index_col: int or list of int, default None. Column (0-indexed) to use as the row labels of the DataFrame. (The script above passes the column name 'Id' rather than a position.)

io: string, path object (a file name or file path), file-like object, pandas ExcelFile, or xlrd workbook. The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. For instance, a local file could be file://localhost/path/to/workbook.xlsx

DataFrame.columns: the column labels of the DataFrame.

Returns:

parsed: DataFrame or dict of DataFrames.

Methods:

pandas.concat: Concatenate pandas objects along a particular axis with optional set logic along the other axes.

Parameters:

objs: a sequence or mapping of Series or DataFrame objects. (Older pandas versions also accepted Panel objects, which have since been removed.)

axis: The axis to concatenate along. {0/'index', 1/'columns'}, default 0. With 0 the objects are stacked vertically; with 1 they are joined side by side.

Returns:

concatenated: object, type of objs.

When concatenating all Series along the index (axis=0), a Series is returned. When objs contains at least one DataFrame, a DataFrame is returned. When concatenating along the columns (axis=1), a DataFrame is returned.
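The effect of `axis` and the return types described above can be seen on two small objects. The values and the names `centers` and `counts` are made up for illustration.

```python
import pandas as pd

centers = pd.DataFrame({"R": [0.1, -0.2], "F": [1.0, -1.0]})  # made-up cluster centers
counts = pd.Series([30, 70], name="count")                     # made-up cluster sizes

side_by_side = pd.concat([centers, counts], axis=1)  # join as columns; rows align on the index
stacked = pd.concat([centers, centers], axis=0)      # append rows
series_only = pd.concat([counts, counts])            # all Series along axis=0

print(side_by_side)
print(stacked.shape, type(series_only).__name__)
```

Because `axis=1` aligns rows by index label rather than by position, a Series produced by `value_counts()` still lines up with the right cluster even though it is sorted by count.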

Output:

Cluster centers: model.cluster_centers_

Cluster labels: model.labels_

References: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.columns.html#pandas.DataFrame.columns

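The main clustering methods implemented in Python include K-Means clustering, hierarchical clustering, FCM, and neural-network clustering.

The different models are used in much the same way: build the model with the corresponding function, train it with the .fit() method, and then either read the labels of the training samples from the .labels_ attribute or use the .predict() method to predict the label of new inputs.

Once clustering is done, plot the customer cluster chart from the cluster-center vectors: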
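The build, fit, read labels, predict sequence can be sketched with KMeans as follows. The six training points, the two new points, and the `n_init` and `random_state` settings are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two made-up, clearly separated groups of points
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [9.0, 9.0], [9.2, 8.9], [8.8, 9.1]])

model = KMeans(n_clusters=2, n_init=10, random_state=0)  # 1. build the model
model.fit(X)                                             # 2. train it

train_labels = model.labels_                             # 3. labels of the training samples

new_points = np.array([[0.1, 0.1], [9.1, 9.0]])
new_labels = model.predict(new_points)                   # 4. labels for unseen inputs

print(train_labels, new_labels)
```

The numeric label assigned to each group is arbitrary, so only the grouping itself should be compared between runs.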

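import numpy as np
import matplotlib.pyplot as plt

labels = data.columns  # axis labels, one per feature
k = len(labels)  # number of features, i.e. number of axes on the chart
plot_data = model.cluster_centers_
color = ['b', 'g', 'r', 'c', 'y']  # line colors, one per cluster

angles = np.linspace(0, 2*np.pi, k, endpoint=False)  # one angle per feature
plot_data = np.concatenate((plot_data, plot_data[:, [0]]), axis=1)  # repeat the first value to close each polygon
angles = np.concatenate((angles, [angles[0]]))  # close the angle sequence as well

fig = plt.figure()
ax = fig.add_subplot(111, polar=True)  # polar axes
for i in range(len(plot_data)):
    ax.plot(angles, plot_data[i], 'o--', color=color[i], label='Customer group ' + str(i), linewidth=2)  # draw one cluster

ax.set_rgrids(np.arange(0.01, 3.5, 0.5), np.arange(-1, 2.5, 0.5), fontproperties="SimHei")  # radial grid lines with hand-set tick labels
ax.set_thetagrids(angles[:-1] * 180/np.pi, labels, fontproperties="SimHei")  # drop the repeated closing angle so angles and labels match
plt.legend(loc=4)  # legend
plt.show()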
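The closing step of the radar chart can be checked without drawing anything: there is one angle per feature, and both the angles and each row of centers get their first element appended at the end so the polygon returns to its starting point. A sketch with made-up centers (2 clusters, 3 features):

```python
import numpy as np

centers = np.array([[0.5, -1.0, 2.0],
                    [-0.3, 0.8, 0.1]])   # made-up cluster centers
n_features = centers.shape[1]

# One angle per feature, evenly spread over the circle, without repeating 2*pi
angles = np.linspace(0, 2 * np.pi, n_features, endpoint=False)

# Append the first column / first angle to close each polygon
closed_centers = np.concatenate((centers, centers[:, [0]]), axis=1)
closed_angles = np.concatenate((angles, [angles[0]]))

print(closed_centers.shape, closed_angles.shape)
```

The angle array and each row of centers must have the same length after closing, otherwise the plot call fails with a shape mismatch.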

Methods:

numpy.linspace: numpy.linspace(start, stop, num=50, endpoint=True, retstep=False, dtype=None)

Return evenly spaced numbers over a specified interval.

Returns num evenly spaced samples, calculated over the interval [start, stop].

The endpoint of the interval can optionally be excluded.

Parameters:

start: the starting value of the sequence. stop: the end value of the sequence when endpoint=True.

num: the number of samples to generate.

endpoint: if True, stop is the last sample; if False, stop is not included.

retstep: if True, return (samples, step), where step is the spacing between samples.

dtype: the type of the output array. If not given, the data type is inferred from the inputs.

Returns:

samples: num samples between start and stop.

step: returned only when retstep=True.

Reference: https://docs.scipy.org/doc/numpy/reference/generated/numpy.linspace.html?highlight=linspace#numpy.linspace
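A short illustration of how `endpoint` and `retstep` change the result, using an arbitrary interval [0, 1] with 5 samples:

```python
import numpy as np

with_end = np.linspace(0, 1, 5)                      # stop is the last sample
without_end = np.linspace(0, 1, 5, endpoint=False)   # stop is left out, spacing shrinks
samples, step = np.linspace(0, 1, 5, retstep=True)   # also return the spacing

print(with_end)
print(without_end)
print(step)
```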

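Methods:

numpy.concatenate: numpy.concatenate((a1, a2, …), axis=0, out=None)

Join a sequence of arrays along an existing axis.

Parameters:

a1, a2, …: a sequence of arrays. They must have the same shape, except in the dimension corresponding to axis.

axis: int, the axis along which the arrays are joined. If axis=None, the arrays are flattened before use. default=0

out: the destination in which to place the result.

Returns:

res: the concatenated array.

Reference:

https://docs.scipy.org/doc/numpy/reference/generated/numpy.concatenate.html?highlight=concatenate#numpy.concatenate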
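The three settings of `axis` can be compared on two small, made-up arrays:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])

rows = np.concatenate((a, b), axis=0)     # add b as a new row
cols = np.concatenate((a, b.T), axis=1)   # add b (transposed) as a new column
flat = np.concatenate((a, b), axis=None)  # flatten both, then join

print(rows)
print(cols)
print(flat)
```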

Methods:

numpy.arange: numpy.arange([start, ]stop, [step, ]dtype=None)
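As a brief illustration, `numpy.arange` returns values from start up to, but not including, stop, spaced by step. The two float ranges below are the ones passed to `set_rgrids` in the plotting code; the integer range is an extra example.

```python
import numpy as np

ticks = np.arange(0.01, 3.5, 0.5)       # radial grid positions used in the plot
tick_labels = np.arange(-1, 2.5, 0.5)   # labels attached to those grid lines
ints = np.arange(5)                     # start defaults to 0, step to 1

print(len(ticks), len(tick_labels))
print(tick_labels)
print(ints)
```

With a non-integer step the number of elements depends on floating-point rounding, so `numpy.linspace` is often the more predictable choice when an exact count matters.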

    Original author: 聚类算法
    Original URL: https://blog.csdn.net/carolinedy/article/details/80773716