The cover image was picked at random and has nothing to do with the post.
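I have a DataFrame; call it dataFrame. I want to give every row of dataFrame an id that starts at 0 and increases monotonically. Very thoughtfully, Spark SQL has a function for this, monotonically_increasing_id(), and it is used like this: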
val newDataFrame = dataFrame.withColumn("id", functions.monotonically_increasing_id())
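And then the code further down broke. Querying by a range of ids returned nothing, no matter what I tried.
I dug through the docs and found the pitfall. The function's documentation says: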
* A column expression that generates monotonically increasing 64-bit integers.
*
* The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
* The current implementation puts the partition ID in the upper 31 bits, and the record number
* within each partition in the lower 33 bits. The assumption is that the data frame has
* less than 1 billion partitions, and each partition has less than 8 billion records.
*
* As an example, consider a `DataFrame` with two partitions, each with 3 records.
* This expression would return the following IDs:
*
* {{{
* 0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594.
* }}}
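To make the bit layout in the quoted documentation concrete, here is a minimal sketch in plain Scala (no Spark needed) that rebuilds the example IDs by hand. The object name MonotonicIdLayout and the helpers composeId and decomposeId are made up for illustration; they only mirror the layout the documentation describes and are not Spark's actual implementation.

```scala
object MonotonicIdLayout {
  // Number of low bits reserved for the record number, per the quoted documentation.
  val RecordBits: Int = 33

  // Hypothetical helper: build an ID from a partition index and a record number
  // within that partition (partition index in the upper bits).
  def composeId(partitionId: Int, recordNumber: Long): Long =
    (partitionId.toLong << RecordBits) + recordNumber

  // Hypothetical helper: split an ID back into (partition index, record number).
  def decomposeId(id: Long): (Int, Long) =
    ((id >>> RecordBits).toInt, id & ((1L << RecordBits) - 1))

  def main(args: Array[String]): Unit = {
    // Two partitions with three records each, as in the documentation example.
    val ids = for (p <- 0 until 2; r <- 0L until 3L) yield composeId(p, r)
    println(ids.mkString(", "))
    // prints: 0, 1, 2, 8589934592, 8589934593, 8589934594
  }
}
```

With these six rows, a range query such as id between 3 and 5 matches nothing, because the fourth row's id is already 8589934592. That is exactly the kind of empty result I was seeing.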
So, well, the ids are unique and increasing, but they are not consecutive.
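If consecutive ids starting at 0 are really what you need, one common workaround is to drop down to the RDD and use zipWithIndex. The following is a rough sketch, not the only way to do it: the object ConsecutiveId and the helper withConsecutiveId are names I made up, and it assumes a SparkSession is available. Note that zipWithIndex runs an extra Spark job when the RDD has more than one partition, and that the ids follow the partition order of the RDD.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

object ConsecutiveId {
  // Hypothetical helper: appends a consecutive, 0-based Long column to df.
  def withConsecutiveId(spark: SparkSession, df: DataFrame, colName: String = "id"): DataFrame = {
    // Original schema plus the new id column at the end.
    val schema = StructType(df.schema.fields :+ StructField(colName, LongType, nullable = false))
    // zipWithIndex pairs each row with its 0-based position across all partitions.
    val rows = df.rdd.zipWithIndex().map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) }
    spark.createDataFrame(rows, schema)
  }
}
```

Usage would look like val newDataFrame = ConsecutiveId.withConsecutiveId(spark, dataFrame), after which range queries on id behave the way I originally expected.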