python - What is the most efficient way to accumulate DataFrames in PySpark?

January 2, 2023

I have a DataFrame (or possibly any RDD) containing several million rows in a well-known schema, like this:

Key | FeatureA | FeatureB
--------------------------
U1  |        0 |         1
U2  |        1 |         1

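I need to load a dozen or so other datasets from disk that contain different features for the same set of keys. Some of these datasets are up to a dozen columns wide. Imagine: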

Key | FeatureC | FeatureD |  FeatureE
-------------------------------------
U1  |        0 |        0 |         1

Key | FeatureF
--------------
U2  |        1

It feels like a fold or an accumulation: I just want to iterate over all the datasets and end up with something like this:

Key | FeatureA | FeatureB | FeatureC | FeatureD | FeatureE | FeatureF 
---------------------------------------------------------------------
U1  |        0 |        1 |        0 |        0 |        1 |        0
U2  |        1 |        1 |        0 |        0 |        0 |        1

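I have tried loading each DataFrame and then joining, but that takes forever once I get past a handful of datasets. Am I missing a common pattern or an efficient way of accomplishing this task?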

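Best answer

Assuming there is at most one row per key in each DataFrame and all keys are of primitive types, you can try a union followed by an aggregation. Let's start with some imports and dummy data: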

from itertools import chain
from functools import reduce
from pyspark.sql.types import StructType
# Note: this `max` is the Spark SQL aggregate and shadows Python's built-in max.
from pyspark.sql.functions import col, lit, max
from pyspark.sql import DataFrame

# `sc` is assumed to be an existing SparkContext (e.g. the one in the pyspark shell).
df1 = sc.parallelize([
    ("U1", 0, 1), ("U2", 1, 1)
]).toDF(["Key", "FeatureA", "FeatureB"])

df2 = sc.parallelize([
    ("U1", 0, 0, 1)
]).toDF(["Key", "FeatureC", "FeatureD", "FeatureE"])

df3 = sc.parallelize([("U2", 1)]).toDF(["Key", "FeatureF"])

dfs = [df1, df2, df3]

Next we can extract the common schema:

# The key column from the first DataFrame, followed by every feature column.
output_schema = StructType(
    [df1.schema.fields[0]] + list(chain(*[df.schema.fields[1:] for df in dfs]))
)

Then transform all the DataFrames:

# Select every output column; columns a DataFrame lacks become typed NULLs.
transformed_dfs = [df.select(*[
    lit(None).cast(c.dataType).alias(c.name) if c.name not in df.columns
    else col(c.name)
    for c in output_schema.fields
]) for df in dfs]
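To make the alignment step concrete without a Spark session, here is a rough plain-Python analogue. It is only a sketch: `tables`, `all_columns`, `pad`, and `padded` are hypothetical names introduced for illustration, and `None` stands in for the typed NULL columns that the `select` above produces.

```python
from itertools import chain

# Plain-Python stand-ins for the three DataFrames: (column names, rows).
tables = [
    (["Key", "FeatureA", "FeatureB"], [("U1", 0, 1), ("U2", 1, 1)]),
    (["Key", "FeatureC", "FeatureD", "FeatureE"], [("U1", 0, 0, 1)]),
    (["Key", "FeatureF"], [("U2", 1)]),
]

# Key column first, then every feature column in input order.
all_columns = ["Key"] + list(
    chain.from_iterable(cols[1:] for cols, _ in tables)
)


def pad(columns, row):
    """Reorder a row to all_columns, using None where a column is absent."""
    values = dict(zip(columns, row))
    return tuple(values.get(name) for name in all_columns)


# One list of padded rows per input table; all rows now share the same layout.
padded = [[pad(cols, row) for row in rows] for cols, rows in tables]
```

Once every input has the same columns in the same order, stacking them is a purely positional operation, which is what makes the union in the next step safe.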

And finally a union and a dummy aggregation:

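# union matches columns by position; unionAll is its deprecated alias since Spark 2.0.
combined = reduce(DataFrame.union, transformed_dfs)
# Each key has at most one non-NULL value per column, so max simply picks it.
exprs = [max(c).alias(c) for c in combined.columns[1:]]
result = combined.repartition(col("Key")).groupBy(col("Key")).agg(*exprs)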
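The aggregation works because SQL `MAX` skips NULLs: with at most one real value per key and column, the maximum is that value. The plain-Python model below sketches this on the example data. `unioned`, `max_ignoring_none`, `result`, and `filled` are hypothetical names used only for illustration, and `None` again models NULL. Note that features a key never had come out as NULL rather than the 0 shown in the question's target table, so the sketch fills them in a final step.

```python
from collections import defaultdict

# Rows as they would look after padding and union (None models SQL NULL).
# Column order: Key, FeatureA, FeatureB, FeatureC, FeatureD, FeatureE, FeatureF
unioned = [
    ("U1", 0, 1, None, None, None, None),
    ("U2", 1, 1, None, None, None, None),
    ("U1", None, None, 0, 0, 1, None),
    ("U2", None, None, None, None, None, 1),
]


def max_ignoring_none(values):
    """Mimic SQL MAX: skip NULLs, return NULL if nothing is left."""
    present = [v for v in values if v is not None]
    return max(present) if present else None


# Group the feature values of each row by key.
groups = defaultdict(list)
for key, *features in unioned:
    groups[key].append(features)

# zip(*rows) turns the rows of one key into per-column value tuples.
result = {
    key: [max_ignoring_none(column) for column in zip(*rows)]
    for key, rows in groups.items()
}

# Replace the remaining None values with 0, as in the question's target table.
filled = {
    key: [0 if value is None else value for value in row]
    for key, row in result.items()
}
```

On the actual Spark result, the same fill could probably be done with `result.na.fill(0)`, assuming 0 is the right default for every numeric feature column.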

If there is more than one row per key but the individual columns are still atomic, you can try replacing max with collect_list / collect_set followed by explode.