Skewness
1. Definition of the skewness of a random variable
The skewness $\gamma_1$ of a random variable $X$ is its third standardized moment, defined as:
$$\gamma_1 = E\left[\left(\frac{X-\mu}{\sigma}\right)^3\right] = \frac{\mu_3}{\sigma^3} = \frac{E\left[(X-\mu)^3\right]}{\left(E\left[(X-\mu)^2\right]\right)^{3/2}} = \frac{\kappa_3}{\kappa_2^{3/2}},$$
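where $\mu = E[X]$ is the mean of $X$, $\mu_3$ is its third central moment, $\sigma$ is its standard deviation, $E$ denotes expectation, $\kappa_3 = E\left[(X-\mu)^3\right]$ is the third cumulant of $X$, and $\kappa_2 = E\left[(X-\mu)^2\right]$ is its second cumulant.
Note: for a random variable $X$, the first cumulant equals the expectation $E(X)$, the second cumulant equals the variance $V(X)$, and the third cumulant equals the third central moment $S(X)$. Cumulants of order four and higher, however, are not equal to the central moments of the same order.
The skewness can also be written in terms of raw moments (moments about the origin):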
$$\begin{aligned}
\gamma_1 &= E\left[\left(\frac{X-\mu}{\sigma}\right)^3\right] = \frac{E[X^3] - 3\mu E[X^2] + 3\mu^2 E[X] - \mu^3}{\left(E\left[(X-\mu)^2\right]\right)^{3/2}} \\
&= \frac{E[X^3] - 3\mu E[X^2] + 3\mu \cdot \mu^2 - \mu^3}{\left(E\left[(X-\mu)^2\right]\right)^{3/2}} \\
&= \frac{E[X^3] - 3\mu E[X^2] + 2\mu^3}{\left(E\left[(X-\mu)^2\right]\right)^{3/2}} \\
&= \frac{E[X^3] - 3\mu\left(E[X^2] - \mu^2\right) - \mu^3}{\left(E\left[(X-\mu)^2\right]\right)^{3/2}} = \frac{E[X^3] - 3\mu\sigma^2 - \mu^3}{\sigma^3}.
\end{aligned}$$
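As a quick check of the raw-moment formula, the sketch below plugs in the known raw moments of two standard distributions. The function name `skewness_from_raw_moments` is made up for this illustration. The exponential distribution with rate 1 has $E[X]=1$, $E[X^2]=2$, $E[X^3]=6$, and for Bernoulli($p$) every raw moment equals $p$.

```python
import math


def skewness_from_raw_moments(m1: float, m2: float, m3: float) -> float:
    """Skewness from the raw moments m1 = E[X], m2 = E[X^2], m3 = E[X^3]."""
    mu = m1
    var = m2 - mu ** 2  # sigma^2 = E[X^2] - mu^2
    if var <= 0:
        raise ValueError("variance must be positive")
    sigma = math.sqrt(var)
    # Final form of the derivation: (E[X^3] - 3*mu*sigma^2 - mu^3) / sigma^3
    return (m3 - 3 * mu * var - mu ** 3) / sigma ** 3


# Exponential distribution with rate 1: raw moments 1, 2, 6.
exp_skew = skewness_from_raw_moments(1.0, 2.0, 6.0)

# Bernoulli(p): E[X] = E[X^2] = E[X^3] = p.
p = 0.2
bern_skew = skewness_from_raw_moments(p, p, p)

print(exp_skew, bern_skew)
```

The two values should agree, up to rounding, with the textbook results: 2 for the exponential distribution and $(1-2p)/\sqrt{p(1-p)} = 1.5$ for Bernoulli(0.2).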
2. Definition of sample skewness
For a sample of $n$ values ($n \geq 3$), the sample skewness is defined as:
$$b_1 = \frac{m_3}{s^3} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^3}{\left[\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2\right]^{3/2}},$$
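where $\bar{x}$ is the sample mean, $s$ is the sample standard deviation (computed with the divisor $n-1$), and $m_3$ is the third central moment of the sample (computed with the divisor $n$).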
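The definition of $b_1$ can be sketched in plain Python as follows. The function name `sample_skewness_b1` and the example data are illustrative assumptions, not part of any library.

```python
def sample_skewness_b1(xs):
    """b1 = m3 / s^3, with m3 using the divisor n and s^2 the divisor n - 1."""
    n = len(xs)
    if n < 3:
        raise ValueError("need at least 3 observations")
    mean = sum(xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n        # third central moment
    s2 = sum((x - mean) ** 2 for x in xs) / (n - 1)  # sample variance
    if s2 == 0:
        raise ValueError("sample variance is zero; skewness is undefined")
    return m3 / s2 ** 1.5


# One large value on the right gives a positive skewness.
right_tailed = sample_skewness_b1([1, 2, 3, 10])

# A sample that is symmetric about its mean has skewness 0.
symmetric = sample_skewness_b1([1, 2, 3, 4, 5])

print(right_tailed, symmetric)
```

The sign of the result indicates the direction of the longer tail: positive for a long right tail, negative for a long left tail.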
3. Estimating the population skewness
In practice, many references, especially those dealing with small samples, use the following estimator of the population skewness:
$$G_1 = \frac{k_3}{k_2^{3/2}} = \frac{n^2}{(n-1)(n-2)}\,\frac{m_3}{s^3} = \frac{\sqrt{n(n-1)}}{n-2}\,\frac{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^3}{\left[\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2\right]^{3/2}},$$
where $k_3$ is the unique symmetric unbiased estimator of the third cumulant and $k_2 = s^2$, the sample variance, is the symmetric unbiased estimator of the second cumulant. These sample estimators are written $k_2$ and $k_3$ to distinguish them from the population cumulants $\kappa_2$ and $\kappa_3$. Note that the factor $\frac{n^2}{(n-1)(n-2)}$ goes with the denominator $s^3$ (divisor $n-1$), while the factor $\frac{\sqrt{n(n-1)}}{n-2}$ goes with a second central moment computed with the divisor $n$.
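With this adjustment factor, the Fisher-Pearson standardized moment coefficient $G_1$ is the formula used by statistical software such as Excel, Minitab, SAS, and SPSS, as well as by the pandas library.
Excerpt from the pandas source code: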
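The two algebraic forms of $G_1$ can be compared numerically. The sketch below is a plain-Python illustration (the function name `adjusted_skewness_g1` is made up here): it computes $G_1$ once from $m_3/s^3$ and once from the central moments with divisor $n$, and the two results should agree up to rounding.

```python
import math


def adjusted_skewness_g1(xs):
    """Return G1 computed in two equivalent ways: (via s^3, via m2^(3/2))."""
    n = len(xs)
    if n < 3:
        raise ValueError("G1 needs at least 3 observations (n - 2 in the denominator)")
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n  # second central moment, divisor n
    m3 = sum((x - mean) ** 3 for x in xs) / n  # third central moment, divisor n
    if m2 == 0:
        raise ValueError("all values are equal; skewness is undefined")
    s2 = m2 * n / (n - 1)                      # sample variance, divisor n - 1
    via_s = n ** 2 / ((n - 1) * (n - 2)) * m3 / s2 ** 1.5
    via_m2 = math.sqrt(n * (n - 1)) / (n - 2) * m3 / m2 ** 1.5
    return via_s, via_m2


via_s, via_m2 = adjusted_skewness_g1([1, 2, 3, 10])
print(via_s, via_m2)
```

If pandas is installed, `pandas.Series([1, 2, 3, 10]).skew()` can be compared against these values as a further sanity check.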
def nanskew(values, axis=None, skipna=True, mask=None):
    """
    Compute the sample skewness.

    The statistic computed here is the adjusted Fisher-Pearson standardized
    moment coefficient G1. The algorithm computes this coefficient directly
    from the second and third central moment.
    """
    # ... (code omitted in this excerpt)
    mean = values.sum(axis, dtype=np.float64) / count
    if axis is not None:
        mean = np.expand_dims(mean, axis)

    adjusted = values - mean
    if skipna:
        np.putmask(adjusted, mask, 0)
    adjusted2 = adjusted ** 2
    adjusted3 = adjusted2 * adjusted
    m2 = adjusted2.sum(axis, dtype=np.float64)
    m3 = adjusted3.sum(axis, dtype=np.float64)

    # floating point error
    #
    # #18044 in _libs/windows.pyx calc_skew follow this behavior
    # to fix the fperr to treat m2 <1e-14 as zero
    m2 = _zero_out_fperr(m2)
    m3 = _zero_out_fperr(m3)

    with np.errstate(invalid='ignore', divide='ignore'):
        result = (count * (count - 1) ** 0.5 / (count - 2)) * (m3 / m2 ** 1.5)
    # ... (code omitted in this excerpt)
    return result
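In the excerpt, `m2` and `m3` are plain sums of squared and cubed deviations, not divided by `count`. Writing $M_2$ and $M_3$ for these sums, $\frac{M_3/n}{(M_2/n)^{3/2}} = \sqrt{n}\,\frac{M_3}{M_2^{3/2}}$, so the factor $\frac{\sqrt{n(n-1)}}{n-2}$ becomes $\frac{n\sqrt{n-1}}{n-2}$, which is the coefficient in the `result` line. The following plain-Python sketch checks this on a small list. It only mirrors the arithmetic for 1-D data without missing values; the helper names are hypothetical and it is not the pandas implementation.

```python
import math


def skew_from_sums(xs):
    """Mirror the excerpt's arithmetic: m2 and m3 are sums, not means."""
    count = len(xs)
    if count < 3:
        raise ValueError("need at least 3 observations")
    mean = sum(xs) / count
    adjusted = [x - mean for x in xs]
    m2 = sum(a ** 2 for a in adjusted)  # sum of squared deviations
    m3 = sum(a ** 3 for a in adjusted)  # sum of cubed deviations
    if m2 == 0:
        raise ValueError("all values are equal; skewness is undefined")
    return (count * (count - 1) ** 0.5 / (count - 2)) * (m3 / m2 ** 1.5)


def g1_from_moments(xs):
    """G1 from central moments with divisor n, for comparison."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return math.sqrt(n * (n - 1)) / (n - 2) * m3 / m2 ** 1.5


data = [2.0, 4.0, 4.0, 5.0, 9.0, 12.0]
from_sums = skew_from_sums(data)
from_moments = g1_from_moments(data)
print(from_sums, from_moments)
```

Working with the raw sums saves two divisions per call, at the cost of a coefficient that no longer looks like the textbook formula.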