A Spark SQL Gotcha

January 3, 2024. Source: 杨网瘾

The header image was picked at random and has nothing to do with the text.

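I have a DataFrame; call it dataFrame. I want to give every row of dataFrame an id that starts at 0 and increases monotonically. Very considerately, Spark SQL has a function for this, monotonically_increasing_id(), and it is used like this: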

import org.apache.spark.sql.functions
val newDataFrame = dataFrame.withColumn("id", functions.monotonically_increasing_id())

Then the code further down broke. Queries over a range of id values returned nothing, no matter what I tried.

I dug through the documentation and found the pitfall. The doc comment for this function says:

   * A column expression that generates monotonically increasing 64-bit integers.
   *
   * The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
   * The current implementation puts the partition ID in the upper 31 bits, and the record number
   * within each partition in the lower 33 bits. The assumption is that the data frame has
   * less than 1 billion partitions, and each partition has less than 8 billion records.
   *
   * As an example, consider a `DataFrame` with two partitions, each with 3 records.
   * This expression would return the following IDs:
   *
   * {{{
   * 0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594.
   * }}}
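
To make the quoted layout concrete, here is a minimal plain-Scala sketch that rebuilds the documentation example without Spark. The names IdLayoutSketch and sketchId are made up for illustration and are not part of Spark; the sketch only mirrors the bit layout described in the comment above.

```scala
object IdLayoutSketch {
  // Hypothetical helper, not a Spark API: puts the partition index in the
  // upper bits and the record number within the partition in the lower 33 bits.
  def sketchId(partitionIndex: Int, recordNumber: Int): Long =
    (partitionIndex.toLong << 33) + recordNumber.toLong

  // Two partitions with three records each, as in the documentation example.
  val ids: Seq[Long] =
    for {
      partition <- 0 until 2
      record <- 0 until 3
    } yield sketchId(partition, record)

  // A range query that assumes the six rows got ids 0 to 5:
  // rows 3 to 5 are looked up, but no generated id falls in that range.
  val idsThreeToFive: Seq[Long] = ids.filter(id => id >= 3L && id <= 5L)
}
```

Here ids comes out as 0, 1, 2, 8589934592, 8589934593, 8589934594, and idsThreeToFive is empty, which is the kind of range query that finds nothing.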

Well then, so the ids are not consecutive.
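
If consecutive ids starting at 0 are what is actually needed, one possible workaround (a sketch, not something the original post used) is to go through the underlying RDD and use zipWithIndex. The object and method names below are hypothetical; only df.rdd, zipWithIndex, Row.fromSeq and createDataFrame are Spark calls.

```scala
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

object ConsecutiveIdSketch {
  // Hypothetical helper: appends a Long column "id" holding 0, 1, 2, ...
  // in the order of the underlying RDD (partition by partition).
  def withConsecutiveId(df: DataFrame): DataFrame = {
    // Original schema plus the new id column at the end.
    val schema = StructType(df.schema.fields :+ StructField("id", LongType, nullable = false))
    // zipWithIndex pairs every row with its 0-based position in the RDD.
    val rows = df.rdd.zipWithIndex().map { case (row, index) =>
      Row.fromSeq(row.toSeq :+ index)
    }
    df.sparkSession.createDataFrame(rows, schema)
  }
}
```

Usage would look like val newDataFrame = ConsecutiveIdSketch.withConsecutiveId(dataFrame). Note that zipWithIndex has to count the records in each partition first, so on an RDD with more than one partition it runs an extra Spark job, and the resulting ids follow partition order rather than any sort order of the data.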

    Original author: 杨网瘾
    Original URL: https://zhuanlan.zhihu.com/p/39265974