2 Distributions of Bivariate Datasets

April 21, 2019. Source: readilen

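The classic way to quantify the association between two variables is to compute the Pearson or Spearman correlation coefficient. The simplest way to look at it is seaborn's jointplot. This function is very capable: it draws a multi-panel figure that shows the relationship between two variables in detail, together with the distribution of each one. The setup below imports the libraries and draws 200 samples from a bivariate normal distribution to use as example data: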

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats, integrate

np.random.seed(sum(map(ord, 'distributions')))

mean, cov = [0, 1],[(1, .5),(.5, 1)]
data = np.random.multivariate_normal(mean, cov, 200)
df = pd.DataFrame(data, columns=['x', 'y'])
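As a numerical companion to the plots, the Pearson and Spearman coefficients mentioned above can be computed with scipy.stats. This is a minimal sketch that regenerates comparable data with its own random generator; the seed 0 and the variable names are assumptions made for illustration, not part of the original setup.

```python
import numpy as np
from scipy import stats

# Same generating process as the setup code, using a local Generator
# (the seed is an arbitrary choice for this sketch).
rng = np.random.default_rng(0)
mean, cov = [0, 1], [(1, .5), (.5, 1)]
sample = rng.multivariate_normal(mean, cov, 200)
x, y = sample[:, 0], sample[:, 1]

# Pearson measures linear association; Spearman measures monotonic
# association by working on ranks.
pearson_r, pearson_p = stats.pearsonr(x, y)
spearman_rho, spearman_p = stats.spearmanr(x, y)

print(f"Pearson r    = {pearson_r:.3f} (p = {pearson_p:.2g})")
print(f"Spearman rho = {spearman_rho:.3f} (p = {spearman_p:.2g})")
```

Because the covariance matrix sets the true correlation to 0.5, both coefficients should come out positive, though the exact values depend on the sample.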

Scatterplot

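The simplest way to look at two variables is a scatterplot, with the x values on the x-axis and the y values on the y-axis. plt.scatter would do the job too; here we use jointplot, whose default kind is a scatterplot with a histogram of each variable on the margins: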

sns.jointplot(x="x", y="y", data=df);

[Figure: scatter]

Hexbin plot

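A hexbin plot divides the plane into hexagonal bins and uses the colour of each bin to show how many observations fall inside it.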

sns.jointplot(x="x", y="y", data=df, kind='hex')  # the kind is named 'hex', not 'hexbin'

[Figure: hexbin]
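To see what the colours in a hexbin plot encode, the per-hexagon counts can be read back from matplotlib's own hexbin call. This is only a sketch: the gridsize of 15, the seed, and the variable names are assumptions, and seaborn may choose a different grid size internally.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for this sketch
pts = rng.multivariate_normal([0, 1], [(1, .5), (.5, 1)], 200)

fig, ax = plt.subplots()
# gridsize sets the number of hexagons along the x direction (assumed value).
hb = ax.hexbin(pts[:, 0], pts[:, 1], gridsize=15)
fig.colorbar(hb, ax=ax, label="count")

# One value per hexagon: the number of points that fell inside it.
counts = hb.get_array()
print("hexagons:", counts.size, "largest bin:", int(counts.max()))
```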

Kernel density estimation

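The distribution of two variables can also be shown with a kernel density estimate. Doesn't it look just like a contour map, haha?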

sns.jointplot(x="x", y="y", data=df, kind='kde')

[Figure: kde]

There is another way to draw the kernel density estimate, using kdeplot directly:

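f, ax = plt.subplots(figsize=(12, 8))
# Pass x and y as keywords; recent seaborn releases reject a second positional vector
sns.kdeplot(x=df.x, y=df.y, ax=ax)
sns.rugplot(x=df.x, color="g", ax=ax)
sns.rugplot(y=df.y, ax=ax)  # y= replaces the deprecated vertical=True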

[Figure: kdeplot with rugplot]

If you want the density to look more continuous, raise the number of contour levels and fill them:

f, ax = plt.subplots(figsize=(12, 8))
cmap = sns.cubehelix_palette(as_cmap=True, dark=0, light=1, reverse=True)
# levels= and fill= replace the deprecated n_levels= and shade=
sns.kdeplot(x=df.x, y=df.y, cmap=cmap, levels=60, fill=True, ax=ax)

[Figure: continuous kde]
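The surface behind these contour plots can also be computed without plotting anything. The sketch below uses scipy's gaussian_kde, a Gaussian-kernel estimator of the same family; the grid ranges, grid size, and seed are assumptions, and the bandwidth may differ from the one seaborn picks.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed for this sketch
pts = rng.multivariate_normal([0, 1], [(1, .5), (.5, 1)], 200)

# gaussian_kde expects an array of shape (n_dims, n_points), hence the transpose.
kde = stats.gaussian_kde(pts.T)

# Evaluate the estimated density on a regular 50 x 50 grid (assumed ranges).
xs = np.linspace(-3, 3, 50)
ys = np.linspace(-2, 4, 50)
xx, yy = np.meshgrid(xs, ys)
density = kde(np.vstack([xx.ravel(), yy.ravel()])).reshape(xx.shape)

# The grid point with the highest density should lie near the true mean (0, 1).
iy, ix = np.unravel_index(density.argmax(), density.shape)
print("density peak near:", xs[ix], ys[iy])
```

The contour lines in the figures are level sets of a surface like `density`: more levels give a smoother-looking picture of the same estimate.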

jointplot uses a JointGrid to manage the figure. It returns that JointGrid, so you can call its methods directly to add more layers, for example:

g = sns.jointplot(x="x", y="y", data=df, kind="kde", color="m")
g.plot_joint(plt.scatter, c="w", s=30, linewidth=1, marker="+")
g.ax_joint.collections[0].set_alpha(0)
g.set_axis_labels("$X$", "$Y$");

[Figure: JointGrid]

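Finally, here is a way to plot the pairwise relationships among several variables. pairplot builds a matrix of small plots, each showing the relationship between two variables; by default the diagonal shows the univariate distribution of each variable.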

iris = sns.load_dataset("iris")
sns.pairplot(iris);

[Figure: pairplot]

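jointplot and pairplot are very similar: jointplot uses a JointGrid to manage the figure, and pairplot uses a PairGrid. Working with the PairGrid directly gives you more flexibility: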

g = sns.PairGrid(iris)
g.map_diag(sns.kdeplot)
g.map_offdiag(sns.kdeplot, cmap="Blues_d", levels=6);  # levels= replaces the deprecated n_levels=

[Figure: PairGrid]

    Original author: readilen
    Original URL: https://www.jianshu.com/p/cb4c430c72f3