I have a DataFrame, and for each row I want to add new_col = max(some_column0), grouped by another column1:
from pyspark.sql.functions import max

maxs = df0.groupBy("catalog").agg(max("row_num").alias("max_num")).withColumnRenamed("catalog", "catalogid")
df0.join(maxs, df0.catalog == maxs.catalogid).take(4)
On the second statement I get an error:
AnalysisException: u'Detected cartesian product for INNER join between
logical plans\nProject … Use the CROSS JOIN syntax to allow
cartesian products between these relations.;'
What I don't understand is: why does Spark detect a cartesian product here?
Possible ways to get around this error: save the DF to a Hive table and then read the initial DF back from the table, or replace these two statements with a Hive query; either works. But I don't want to save the DF.
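One way to break the shared lineage without round-tripping through a Hive table might be to checkpoint the aggregated DataFrame, so the join sees an independent plan. A minimal sketch, assuming Spark 2.3+ (for localCheckpoint) and the df0 from the question:

from pyspark.sql import functions as F

# Materialize maxs so its plan no longer shares lineage with df0;
# localCheckpoint keeps the data on the executors instead of a Hive table.
maxs = (df0.groupBy("catalog")
           .agg(F.max("row_num").alias("max_num"))
           .withColumnRenamed("catalog", "catalogid")
           .localCheckpoint())

df0.join(maxs, df0.catalog == maxs.catalogid).take(4)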
Best answer: As explained in Why does spark think this is a cross/cartesian join, it can be caused by the following:
This happens because you join structures sharing the same lineage and this leads to a trivially equal condition.
As for how the cartesian product arises, you can refer to Identifying and Eliminating the Dreaded Cartesian Product.
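Note also that the per-group maximum from the question can be computed with a window function, which avoids the self-join (and therefore the cartesian-product check) entirely. A minimal sketch, assuming a SparkSession named spark and a toy stand-in for df0 with the question's columns:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("max-per-group").getOrCreate()

# Toy stand-in for the question's df0 (catalog + row_num columns).
df0 = spark.createDataFrame(
    [("a", 1), ("a", 3), ("b", 2)],
    ["catalog", "row_num"],
)

# max(row_num) over each catalog partition, attached to every row;
# no second DataFrame and no join, so no shared-lineage issue.
w = Window.partitionBy("catalog")
df0.withColumn("max_num", F.max("row_num").over(w)).show()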