A Spark SQL Gotcha

The header image was picked at random and has nothing to do with the post.

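I have a DataFrame; call it dataFrame. I want to give every row an id that starts at 0 and increases monotonically. Very thoughtfully, Spark SQL provides a function for this, monotonically_increasing_id(), which is used like so: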

import org.apache.spark.sql.functions
val newDataFrame = dataFrame.withColumn("id", functions.monotonically_increasing_id())

Then the code further down broke. Queries over a range of ids came back empty no matter what I tried.

I dug through the documentation and found the catch. The doc comment for this function reads:

   * A column expression that generates monotonically increasing 64-bit integers.
   *
   * The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
   * The current implementation puts the partition ID in the upper 31 bits, and the record number
   * within each partition in the lower 33 bits. The assumption is that the data frame has
   * less than 1 billion partitions, and each partition has less than 8 billion records.
   *
   * As an example, consider a `DataFrame` with two partitions, each with 3 records.
   * This expression would return the following IDs:
   *
   * {{{
   * 0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594.
   * }}}
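
To make the layout in that comment concrete, here is a small plain-Scala sketch that rebuilds the same numbers by hand. It does not call Spark at all; `IdLayoutSketch`, `sketchId` and `idsFor` are hypothetical names made up for illustration, and the arithmetic simply follows the "partition ID in the upper 31 bits, record number in the lower 33 bits" description quoted above.

```scala
// Illustration only: mimics the ID layout described in the doc comment.
// None of these names are part of Spark's API.
object IdLayoutSketch {
  // Partition ID goes above bit 33; the record number fills the lower 33 bits.
  def sketchId(partitionId: Int, recordNumber: Long): Long =
    (partitionId.toLong << 33) + recordNumber

  // IDs for a data set with the given number of records in each partition.
  def idsFor(recordsPerPartition: Seq[Int]): Seq[Long] =
    for {
      (count, partitionId) <- recordsPerPartition.zipWithIndex
      recordNumber <- 0 until count
    } yield sketchId(partitionId, recordNumber.toLong)

  def main(args: Array[String]): Unit = {
    // Two partitions with 3 records each, as in the documentation example.
    println(idsFor(Seq(3, 3)).mkString(", "))
  }
}
```

With two partitions of three records each, the fourth id jumps straight to 1L << 33, which is exactly the gap the documentation example shows.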

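So it turns out the ids are not consecutive. Only the first partition gets ids starting at 0; every later partition starts at its partition ID shifted left by 33 bits, so a range query that assumes the ids run from 0 to n-1 misses the rows in every other partition.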
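
If consecutive ids really are needed, one possible workaround (not something the original post used) is to go through the underlying RDD and use zipWithIndex, which numbers the elements 0, 1, 2, ... across partitions. The sketch below assumes Spark SQL is on the classpath; `ConsecutiveIdSketch` and `withConsecutiveId` are hypothetical names chosen for this example.

```scala
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Hypothetical helper, not part of Spark: appends a consecutive 0-based id column.
object ConsecutiveIdSketch {
  def withConsecutiveId(dataFrame: DataFrame, idColumn: String = "id"): DataFrame = {
    // zipWithIndex pairs every row with its index, ordered by partition
    // and then by position within the partition.
    val indexedRows = dataFrame.rdd.zipWithIndex().map {
      case (row, index) => Row.fromSeq(row.toSeq :+ index)
    }
    // Original schema plus one extra non-nullable Long column for the id.
    val schema = StructType(
      dataFrame.schema.fields :+ StructField(idColumn, LongType, nullable = false)
    )
    dataFrame.sparkSession.createDataFrame(indexedRows, schema)
  }
}
```

Calling ConsecutiveIdSketch.withConsecutiveId(dataFrame) should then give ids on which a range filter behaves as expected. The trade-off is that zipWithIndex has to run an extra Spark job to count the records per partition when the RDD has more than one partition, so it costs more than monotonically_increasing_id().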

    Original author: 杨网瘾
    Original URL: https://zhuanlan.zhihu.com/p/39265974