我正在尝试编写一个ETL进程,在联合之前合并两个数据集我为每个数据集添加一个列,更新的数据集得到2,旧数据集得到1,然后如果行有重复的主键,我删除有一个列的行旧/新列中的1.我试过用几种方法写这个,最近做过:
orderBy(keys, desc(old/new)).dropDuplicates(keys)
但是在大型数据集上,我总是会收到大量减速消息,并显示:
16/09/21 20:31:45 INFO UnsafeExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (0 time so far)
16/09/21 20:32:00 INFO UnsafeExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (1 time so far)
16/09/21 20:32:16 INFO UnsafeExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (2 times so far)
16/09/21 20:32:31 INFO UnsafeExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (3 times so far)
16/09/21 20:32:47 INFO UnsafeExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (4 times so far)
16/09/21 20:33:02 INFO UnsafeExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (5 times so far)
16/09/21 20:33:18 INFO UnsafeExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (6 times so far)
16/09/21 20:33:33 INFO UnsafeExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (7 times so far)
16/09/21 20:33:49 INFO UnsafeExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (8 times so far)
16/09/21 20:34:04 INFO UnsafeExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (9 times so far)
16/09/21 20:34:19 INFO UnsafeExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (10 times so far)
16/09/21 20:34:35 INFO UnsafeExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (11 times so far)
16/09/21 20:34:50 INFO UnsafeExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (12 times so far)
16/09/21 20:35:06 INFO UnsafeExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (13 times so far)
16/09/21 20:35:21 INFO UnsafeExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (14 times so far)
16/09/21 20:35:37 INFO UnsafeExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (15 times so far)
16/09/21 20:35:52 INFO UnsafeExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (16 times so far)
16/09/21 20:36:07 INFO UnsafeExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (17 times so far)
16/09/21 20:36:23 INFO UnsafeExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (18 times so far)
16/09/21 20:36:38 INFO UnsafeExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (19 times so far)
16/09/21 20:36:53 INFO UnsafeExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (20 times so far)
16/09/21 20:37:09 INFO UnsafeExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (21 times so far)
16/09/21 20:37:24 INFO UnsafeExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (22 times so far)
16/09/21 20:37:40 INFO UnsafeExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (23 times so far)
16/09/21 20:37:55 INFO UnsafeExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (24 times so far)
16/09/21 20:38:10 INFO UnsafeExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (25 times so far)
16/09/21 20:38:25 INFO UnsafeExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (26 times so far)
16/09/21 20:38:41 INFO UnsafeExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (27 times so far)
16/09/21 20:38:56 INFO UnsafeExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (28 times so far)
16/09/21 20:39:25 INFO ShuffleExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (0 time so far)
16/09/21 20:39:45 INFO ShuffleExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (1 time so far)
16/09/21 20:40:05 INFO ShuffleExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (2 times so far)
16/09/21 20:40:26 INFO ShuffleExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (3 times so far)
16/09/21 20:40:46 INFO ShuffleExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (4 times so far)
16/09/21 20:41:07 INFO ShuffleExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (5 times so far)
16/09/21 20:41:27 INFO ShuffleExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (6 times so far)
16/09/21 20:41:47 INFO ShuffleExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (7 times so far)
16/09/21 20:42:07 INFO ShuffleExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (8 times so far)
16/09/21 20:42:28 INFO ShuffleExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (9 times so far)
16/09/21 20:42:49 INFO ShuffleExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (10 times so far)
16/09/21 20:43:09 INFO ShuffleExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (11 times so far)
16/09/21 20:43:30 INFO ShuffleExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (12 times so far)
16/09/21 20:43:50 INFO ShuffleExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (13 times so far)
16/09/21 20:44:11 INFO ShuffleExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (14 times so far)
16/09/21 20:44:31 INFO ShuffleExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (15 times so far)
16/09/21 20:44:52 INFO ShuffleExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (16 times so far)
16/09/21 20:45:13 INFO ShuffleExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (17 times so far)
16/09/21 20:45:33 INFO ShuffleExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (18 times so far)
16/09/21 20:45:53 INFO ShuffleExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (19 times so far)
16/09/21 20:46:14 INFO ShuffleExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (20 times so far)
16/09/21 20:46:34 INFO ShuffleExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (21 times so far)
16/09/21 20:46:54 INFO ShuffleExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (22 times so far)
16/09/21 20:47:14 INFO ShuffleExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (23 times so far)
16/09/21 20:47:34 INFO ShuffleExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (24 times so far)
16/09/21 20:47:54 INFO ShuffleExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (25 times so far)
16/09/21 20:48:14 INFO ShuffleExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (26 times so far)
16/09/21 20:48:34 INFO ShuffleExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (27 times so far)
16/09/21 20:48:54 INFO ShuffleExternalSorter: Thread 84 spilling sort data of 3.0 GB to disk (28 times so far)
在检查Spark UI时,只有一个线程正在加班,其余线程已经完成.
是否有可能在线程之间传播?
最佳答案 您可以通过设计放大与数据偏差相关的任何可能问题来解决此问题.由于您首先按键和指示符变量重新排序数据,因此您首先要对数据进行重新排序,否则可能会创建高度不平衡的分区.之后应用的任何减少都无法弥补这一点.
至少有两种方法可用于实现相同的结果,同时充分受益于地图侧面减少.我在回答SPARK DataFrame: select the first row of each group时解释了这两点,所以重申一下:
>您可以使用结构排序来选择每组的最小/最大行数.
>您可以使用静态类型的数据集和groupByKey,然后使用reduceGroups.