PySpark Series: Statistics Basics

April 19, 2024 · Source: master苏

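Statistics Basics

1. Simple statistics
2. Random numbers
3. Rounding
4. Sampling
5. Descriptive statistics
6. Maximum and minimum
7. Mean and variance
8. Covariance and correlation coefficient
9. Cross tabulation (contingency table)
10. Frequent items
11. Other math functions

11.1. Math functions

12. Distinct count
13. Aggregate function grouping
14. Aggregate function grouping_id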

1. Simple statistics

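In data analysis, basic statistics cover roughly 95% of everyday needs. By basic statistics I mean the mean, variance, standard deviation, sampling, chi-square tests, correlation coefficients, covariance, hypothesis tests, and so on. If your needs go beyond that, you are probably doing fairly advanced work, or you are at a strong company or in a strong team, and in that case you need not worry about what Spark cannot do, because someone will take care of it for you.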

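The basic statistical functions for Spark DataFrames are in pyspark.sql.functions; the DataFrame itself also provides some statistical methods, for example describe() and DataFrame.stat.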

2. Random numbers

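# Generate random numbers with the same number of rows as an existing DataFrame
# (color_df is the DataFrame of colors built in section 4)
from pyspark.sql.functions import rand, randn  # uniform and standard normal distributions

color_df.select(rand(seed=10).alias("uniform"),
                randn(seed=27).alias("normal"))\
    .show()

# Or generate a DataFrame with a given number of rows
df = spark.range(0, 10).withColumn('rand1', rand(seed=10)) \
                       .withColumn('rand2', rand(seed=27))
df.show()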

3. Rounding

from pyspark.sql.functions import round  # note: shadows Python's built-in round
df = spark.createDataFrame([(2.5,)], ['a'])

# round uses HALF_UP rounding, so 2.5 becomes 3.0
df.select(round('a', 0).alias('r')).show()
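
The rounding mode matters for ties such as 2.5. Spark's round uses HALF_UP, while its sibling bround uses HALF_EVEN (banker's rounding). The sketch below is not Spark code; it reproduces the two modes with Python's decimal module, and the helper names round_half_up and round_half_even are made up for illustration.

```python
from decimal import Decimal, ROUND_HALF_UP, ROUND_HALF_EVEN

def round_half_up(value: str) -> Decimal:
    """Round to an integer; ties go away from zero (the mode Spark's round uses)."""
    return Decimal(value).quantize(Decimal("1"), rounding=ROUND_HALF_UP)

def round_half_even(value: str) -> Decimal:
    """Round to an integer; ties go to the nearest even digit (the mode Spark's bround uses)."""
    return Decimal(value).quantize(Decimal("1"), rounding=ROUND_HALF_EVEN)

# Compare both modes on values that sit exactly between two integers
for v in ["0.5", "1.5", "2.5", "3.5"]:
    print(v, round_half_up(v), round_half_even(v))
```

Only the ties differ between the two modes; every other value rounds the same way.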

4. Sampling

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName('my_first_app_name') \
    .getOrCreate()

# Generate test data in pandas, then convert it to a Spark DataFrame
colors = ['white','green','yellow','red','brown','pink']
color_pdf = pd.DataFrame(colors, columns=['color'])
color_pdf['length'] = color_pdf['color'].apply(len)
color_df = spark.createDataFrame(color_pdf)

# Sampling
sample1 = color_df.sample(
    withReplacement=False,  # sample without replacement
    fraction=0.6,
    seed=1000)
sample1.show()
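
One thing that is easy to miss: fraction is not an exact proportion. As far as I understand, sampling without replacement keeps each row independently with probability fraction, so the number of rows returned can vary. The sketch below illustrates that idea in plain Python. It is not Spark's sampler and uses a different random number generator, so it will not select the same rows as Spark; bernoulli_sample is a hypothetical helper name.

```python
import random

def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return [row for row in rows if rng.random() < fraction]

colors = ['white', 'green', 'yellow', 'red', 'brown', 'pink']

# Same seed, same sample; the size is not guaranteed to be 0.6 * 6
sample_a = bernoulli_sample(colors, fraction=0.6, seed=1000)
sample_b = bernoulli_sample(colors, fraction=0.6, seed=1000)
print(sample_a, len(sample_a))
```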

5. Descriptive statistics

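# The DataFrame itself also has basic statistics methods, similar to pandas
import numpy as np
import pandas as pd

# 1. Generate test data: random whole numbers 0-9, kept as floats so a NaN can be stored
# (named pandas_df so it does not overwrite the Spark df from section 2)
pandas_df = pd.DataFrame(np.floor(np.random.rand(5, 5) * 10),
                         columns=['a', 'b', 'c', 'd', 'e'])
pandas_df.iloc[2, 2] = np.nan

spark_df = spark.createDataFrame(pandas_df)
spark_df.show()

# 2. Descriptive statistics for all columns
spark_df.describe().show()

# 3. Descriptive statistics for a single column
spark_df.describe('a').show()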
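
For numeric columns, describe() reports count, mean, stddev, min, and max, where stddev is the sample standard deviation (dividing by n - 1). A minimal plain-Python sketch of the same five numbers, assuming a list with no missing values; describe_column is a hypothetical helper, not a Spark function:

```python
import statistics

def describe_column(values):
    """Compute the five summary statistics that describe() reports for a numeric column."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stddev": statistics.stdev(values),  # sample standard deviation (n - 1)
        "min": min(values),
        "max": max(values),
    }

summary = describe_column([1.0, 2.0, 3.0, 4.0])
print(summary)
```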

6. Maximum and minimum

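from pyspark.sql.functions import min, max, rand  # note: min and max shadow Python's built-ins
# Regenerate the 'uniform' column from section 2, then take its minimum and maximum
color_df.withColumn('uniform', rand(seed=10)) \
    .select(min('uniform'), max('uniform')).show()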

7. Mean and variance

Mean, variance, and standard deviation were mentioned earlier; here is a quick review.

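from pyspark.sql.functions import mean, variance, stddev, rand  # also in pyspark.sql.functions

# Regenerate the 'uniform' column from section 2, then aggregate it
color_df.withColumn('uniform', rand(seed=10)) \
    .select(mean('uniform').alias('mean'),
            variance('uniform').alias('variance'),
            stddev('uniform').alias('stddev'))\
    .show()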

8. Covariance and correlation coefficient

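from pyspark.sql.functions import covar_samp, rand

# Same DataFrame as in section 2, rebuilt so this snippet runs on its own
df = spark.range(0, 10).withColumn('rand1', rand(seed=10)) \
                       .withColumn('rand2', rand(seed=27))

# Sample covariance via DataFrame.stat
df.stat.cov('rand1', 'rand2')

# Sample covariance as an aggregate function (covar_pop gives the population version)
df.agg(covar_samp("rand1", "rand2").alias('new_col')).collect()

# Pearson correlation coefficient
df.stat.corr('rand1', 'rand2')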
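
To make the difference between sample and population covariance concrete, here is a small plain-Python sketch of the formulas. It does not use Spark, and the function names are my own. Sample covariance divides by n - 1, population covariance divides by n, and the Pearson coefficient is the covariance divided by the product of the two standard deviations.

```python
import math

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys, sample=True):
    """Sample covariance (divide by n - 1) or population covariance (divide by n)."""
    mx, my = mean(xs), mean(ys)
    total = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return total / (len(xs) - 1 if sample else len(xs))

def pearson(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # ys is exactly 2 * xs, so the correlation is 1
print(covariance(xs, ys), covariance(xs, ys, sample=False), pearson(xs, ys))
```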

9. Cross tabulation (contingency table)

# Cross tabulation
# Create a DataFrame with two columns (name, item)
names = ["Alice", "Bob", "Mike"]
items = ["milk", "bread", "butter", "apples", "oranges"]
df = spark.createDataFrame([(names[i % 3], items[i % 5]) for i in range(100)], ["name", "item"])
df.show(5)

df.stat.crosstab("name", "item").show()
# +---------+------+-----+------+----+-------+
# |name_item|apples|bread|butter|milk|oranges|
# +---------+------+-----+------+----+-------+
# |      Bob|     6|    7|     7|   6|      7|
# |     Mike|     7|    6|     7|   7|      6|
# |    Alice|     7|    7|     6|   7|      7|
# +---------+------+-----+------+----+-------+
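
A cross tabulation is just a count of each (name, item) pair. The following plain-Python sketch (no Spark involved) rebuilds the same 100 rows and counts the pairs with collections.Counter, which is a handy way to check the table above by hand.

```python
from collections import Counter

names = ["Alice", "Bob", "Mike"]
items = ["milk", "bread", "butter", "apples", "oranges"]
rows = [(names[i % 3], items[i % 5]) for i in range(100)]

# Count how often each (name, item) pair occurs
counts = Counter(rows)

# Print one line per name, with items in alphabetical order like the Spark output
for name in names:
    print(name, [counts[(name, item)] for item in sorted(items)])
```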

10. Frequent items

# Find the most frequently occurring elements (frequency distribution)
df = spark.createDataFrame([(1, 2, 3) if i % 2 == 0 else (i, 2 * i, i % 4) for i in range(100)],
                           ["a", "b", "c"])
df.show(10)

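# The code below finds, for each column, the items that occur in more than 40% of the rows
df.stat.freqItems(["a", "b", "c"], 0.4).show()
# +-----------+-----------+-----------+
# |a_freqItems|b_freqItems|c_freqItems|
# +-----------+-----------+-----------+
# |    [23, 1]|    [2, 46]|     [1, 3]|
# +-----------+-----------+-----------+
# "23" and "1" are reported as frequent values of column "a"
# (freqItems is approximate and its results may contain false positives)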
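
The Spark documentation describes freqItems as finding frequent items "possibly with false positives", which explains why values such as 23 and 46 show up even though each occurs only once. For comparison, here is an exact count in plain Python over the same 100 rows; this is a sketch for checking results on small data, not how Spark computes it.

```python
from collections import Counter

rows = [(1, 2, 3) if i % 2 == 0 else (i, 2 * i, i % 4) for i in range(100)]
columns = ["a", "b", "c"]
support = 0.4  # an item must occur in more than 40% of the rows

exact = {}
for idx, col in enumerate(columns):
    counts = Counter(row[idx] for row in rows)
    # Keep only the values whose exact count exceeds the support threshold
    exact[col] = sorted(v for v, n in counts.items() if n > support * len(rows))

print(exact)
```

With exact counting, each column has a single value above the 40% threshold.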

11. Other math functions

Browsing the pyspark.sql.functions module turns up many other commonly used, handy functions.

11.1. Math functions

| Function | Purpose |
|-----------|---------------|
| log | logarithm (natural log when called with one argument) |
| log2 | base-2 logarithm |
| factorial | factorial |

12. Distinct count

from pyspark.sql import functions as func

df = spark.createDataFrame([(1, 2, 3) if i % 2 == 0 else (i, 2 * i, i % 4) for i in range(10)],
                           ["a", "b", "c"])
# Note the use of agg
df.agg(func.countDistinct('a')).show()
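
To see what answer to expect, the same distinct count can be done in plain Python with a set. This is only a cross-check on the ten example rows, not a replacement for countDistinct on real data.

```python
rows = [(1, 2, 3) if i % 2 == 0 else (i, 2 * i, i % 4) for i in range(10)]

# Column "a" is the first element of each row; a set keeps each value once
distinct_a = len({row[0] for row in rows})
print(distinct_a)
```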

13. Aggregate function grouping

I did not understand this one; if you do, please tell me.

Aggregate function: indicates whether a specified column in a GROUP BY list is aggregated
or not, returns 1 for aggregated or 0 for not aggregated in the result set.

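from pyspark.sql import functions as func

# Example data matching the output below (columns: age, name)
df = spark.createDataFrame([(2, "Alice"), (5, "Bob")], ["age", "name"])
df.cube("name").agg(func.grouping("name"), func.sum("age")).orderBy("name").show()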

# +-----+--------------+--------+
# | name|grouping(name)|sum(age)|
# +-----+--------------+--------+
# | null|             1|       7|
# |Alice|             0|       2|
# |  Bob|             0|       5|
# +-----+--------------+--------+
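
One way to read this output: cube("name") produces one row per name plus an extra grand-total row in which name has been aggregated away, so it shows as null. grouping("name") is the flag that marks that extra row with 1. The plain-Python sketch below emulates this for the two example rows; it is my own illustration of the semantics, not Spark's implementation.

```python
from collections import defaultdict

rows = [(2, "Alice"), (5, "Bob")]  # (age, name)

# Per-name subtotals: "name" is a real grouping key here, so the flag is 0
per_name = defaultdict(int)
for age, name in rows:
    per_name[name] += age
result = [(name, 0, total) for name, total in sorted(per_name.items())]

# Grand total: "name" is aggregated away (shown as null), so the flag is 1
result.insert(0, (None, 1, sum(age for age, _ in rows)))

for row in result:
    print(row)  # (name, grouping(name), sum(age))
```

The flag is what lets you tell a subtotal row apart from a row whose name is genuinely null in the data.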

14. Aggregate function grouping_id

I did not understand this one either.

Aggregate function: returns the level of grouping, equals to

(grouping(c1) << (n-1)) + (grouping(c2) << (n-2)) + ... + grouping(cn)

Note: the list of columns should match the grouping columns exactly, or be empty (which means all the grouping columns).


# Uses the same df (age, name) and the func alias as in section 13
df.cube("name").agg(func.grouping_id(), func.sum("age")).orderBy("name").show()
# +-----+-------------+--------+
# | name|grouping_id()|sum(age)|
# +-----+-------------+--------+
# | null|            1|       7|
# |Alice|            0|       2|
# |  Bob|            0|       5|
# +-----+-------------+--------+
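
The formula above packs the individual grouping flags into one integer, with the first column as the highest bit. A small plain-Python sketch of that arithmetic, assuming each flag is 0 or 1; the function here is a stand-in for illustration, not the Spark function itself:

```python
def grouping_id(flags):
    """Combine grouping flags (c1, ..., cn) as (c1 << (n-1)) + ... + cn."""
    n = len(flags)
    return sum(flag << (n - 1 - i) for i, flag in enumerate(flags))

# With a single grouping column, grouping_id equals the grouping flag itself
print(grouping_id((0,)), grouping_id((1,)))

# With two grouping columns, the four combinations map to 0..3
for flags in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(flags, grouping_id(flags))
```

This is why the table above shows the same 0 and 1 values as the grouping example: with only one column in the cube, the two functions coincide.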

    Original author: master苏
    Original article: https://zhuanlan.zhihu.com/p/34901846