Hive Optimization: Cascading Sum

I. Requirement: compute cumulative visits from daily visit records

Input data:

Device ID    Date
10000004     20180501
10000005     20180501
10000004     20180502
10000005     20180502
10000006     20180502
10000007     20180502
10000007     20180503
10000008     20180503
10000009     20180503

Output data:

Date        Cumulative visits
20180501    2
20180502    6
20180503    9
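
As a reference for what the expected output means, here is a minimal Python sketch (not part of the original Hive solution) that computes the same running total directly from the sample rows. It assumes "cumulative visits" counts visit records up to and including each date, not distinct devices, which is what the numbers 2, 6, 9 imply. The function name `cumulative_visits` is made up for illustration.

```python
from collections import Counter
from itertools import accumulate

# Sample rows from the input table above: (dev_id, dt)
rows = [
    ("10000004", "20180501"), ("10000005", "20180501"),
    ("10000004", "20180502"), ("10000005", "20180502"),
    ("10000006", "20180502"), ("10000007", "20180502"),
    ("10000007", "20180503"), ("10000008", "20180503"),
    ("10000009", "20180503"),
]

def cumulative_visits(rows):
    """Return [(dt, number of visit records with date <= dt)], sorted by dt."""
    daily = Counter(dt for _, dt in rows)          # visits per day
    dates = sorted(daily)                          # yyyymmdd strings sort chronologically
    totals = accumulate(daily[d] for d in dates)   # running sum over the days
    return list(zip(dates, totals))

for dt, acc in cumulative_visits(rows):
    print(dt, acc)
```

On the sample data the daily counts are 2, 4 and 3, so the running totals are 2, 6 and 9.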

II. Prepare the table and data

1. Create the table

create table device(dev_id string,dt string) row format delimited fields terminated by '\t'; 

2. Prepare the data file data.txt (tab-separated)

10000004    20180501
10000005    20180501
10000004    20180502
10000005    20180502
10000006    20180502
10000007    20180502
10000007    20180503
10000008    20180503
10000009    20180503

3. Load the data into the table

load data local inpath 'data.txt' overwrite into table device

III. Cumulative detail: step-by-step walkthrough

Step 1: extract the distinct dates from table device

 select dt from device group by dt
dt
20180503
20180502
20180501

Step 2: use the step 1 query as a subquery on the left side and join it with table device to produce a Cartesian product

select t1.dt,t2.dt,t2.dev_id from (select dt from device group by dt) t1,device t2
t1.dt       t2.dt       t2.dev_id
20180503    20180503    10000009
20180503    20180503    10000008
20180503    20180503    10000007
20180503    20180502    10000007
20180503    20180502    10000006
20180503    20180502    10000005
20180503    20180502    10000004
20180503    20180501    10000005
20180503    20180501    10000004
20180502    20180503    10000009
20180502    20180503    10000008
20180502    20180503    10000007
20180502    20180502    10000007
20180502    20180502    10000006
20180502    20180502    10000005
20180502    20180502    10000004
20180502    20180501    10000005
20180502    20180501    10000004
20180501    20180503    10000009
20180501    20180503    10000008
20180501    20180503    10000007
20180501    20180502    10000007
20180501    20180502    10000006
20180501    20180502    10000005
20180501    20180502    10000004
20180501    20180501    10000005
20180501    20180501    10000004

Step 3: filter with t1.dt>=t2.dt to produce the cumulative detail

select t1.dt,t2.dt,t2.dev_id from (select dt from device group by dt) t1,device t2 where t1.dt>=t2.dt
t1.dt       t2.dt       t2.dev_id
20180503    20180503    10000009
20180503    20180503    10000008
20180503    20180503    10000007
20180503    20180502    10000007
20180503    20180502    10000006
20180503    20180502    10000005
20180503    20180502    10000004
20180503    20180501    10000005
20180503    20180501    10000004
20180502    20180502    10000007
20180502    20180502    10000006
20180502    20180502    10000005
20180502    20180502    10000004
20180502    20180501    10000005
20180502    20180501    10000004
20180501    20180501    10000005
20180501    20180501    10000004
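
To check the row counts in steps 2 and 3 without a Hive cluster, the same two queries can be run against the sample data in an in-memory SQLite database. This is only a sketch: SQLite stands in for Hive here (an assumption for illustration), and the column types are declared as `text` instead of Hive's `string`.

```python
import sqlite3

# Sample rows from data.txt: (dev_id, dt)
rows = [
    ("10000004", "20180501"), ("10000005", "20180501"),
    ("10000004", "20180502"), ("10000005", "20180502"),
    ("10000006", "20180502"), ("10000007", "20180502"),
    ("10000007", "20180503"), ("10000008", "20180503"),
    ("10000009", "20180503"),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table device (dev_id text, dt text)")
conn.executemany("insert into device values (?, ?)", rows)

# Step 2: Cartesian product of the distinct dates with every device row.
cross = conn.execute(
    "select t1.dt, t2.dt, t2.dev_id "
    "from (select dt from device group by dt) t1, device t2"
).fetchall()

# Step 3: keep only the pairs where the visit date is on or before t1.dt.
detail = conn.execute(
    "select t1.dt, t2.dt, t2.dev_id "
    "from (select dt from device group by dt) t1, device t2 "
    "where t1.dt >= t2.dt"
).fetchall()

print(len(cross), len(detail))
```

With 3 distinct dates and 9 device rows the product has 27 rows, and the filter keeps 2 + 6 + 9 = 17 of them, matching the two result listings above.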
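
Final step (most important): to avoid data skew when the data volume is large, use a map join. t1 is loaded entirely into memory and the join runs on the map side, so the second job has no reduce phase.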
 set hive.map.aggr=true;
 set hive.optimize.skewjoin=true;
 set hive.exec.parallel=true;
 set hive.auto.convert.join=true;
 
 select /*+ MAPJOIN(t1) */ t1.dt,t2.dt,t2.dev_id from (select dt from device group by dt) t1,device t2 where t1.dt>=t2.dt

MapReduce job information:

 Total jobs = 2
 Launching Job 1 out of 2
 Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
 2018-05-23 20:37:14,439 Stage-1 map = 0%,  reduce = 0%
 2018-05-23 20:37:20,680 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.22 sec
 2018-05-23 20:37:26,863 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.98 sec
 MapReduce Total cumulative CPU time: 3 seconds 980 msec
 Launching Job 2 out of 2
 Hadoop job information for Stage-4: number of mappers: 1; number of reducers: 0
 2018-05-23 20:37:38,493 Stage-4 map = 0%,  reduce = 0%
 2018-05-23 20:37:44,663 Stage-4 map = 100%,  reduce = 0%, Cumulative CPU 3.32 sec
 MapReduce Total cumulative CPU time: 3 seconds 320 msec
 MapReduce Jobs Launched: 
 Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.98 sec   HDFS Read: 5965 HDFS Write: 174 SUCCESS
 Stage-Stage-4: Map: 1   Cumulative CPU: 3.32 sec   HDFS Read: 6031 HDFS Write: 459 SUCCESS
 Total MapReduce CPU Time Spent: 7 seconds 300 msec
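
The idea behind the map join can be sketched conceptually in Python: the small side (the distinct dates, t1) is held in memory, and the large side (device, t2) is streamed through one record at a time, so no shuffle or reduce step is needed for the join itself. This is a simplified model for illustration, not Hive's actual implementation, and the function name `map_side_join` is made up.

```python
def map_side_join(small_dates, big_rows):
    """Yield (t1_dt, t2_dt, dev_id) for every pair with t1_dt >= t2_dt.

    small_dates plays the role of t1 and is kept fully in memory;
    big_rows plays the role of device (t2) and is consumed as a stream.
    """
    dates = sorted(small_dates)
    for dev_id, dt in big_rows:
        for t1_dt in dates:
            if t1_dt >= dt:  # same filter as the where clause
                yield (t1_dt, dt, dev_id)

# Sample rows from data.txt: (dev_id, dt)
big_rows = [
    ("10000004", "20180501"), ("10000005", "20180501"),
    ("10000004", "20180502"), ("10000005", "20180502"),
    ("10000006", "20180502"), ("10000007", "20180502"),
    ("10000007", "20180503"), ("10000008", "20180503"),
    ("10000009", "20180503"),
]
small_dates = {dt for _, dt in big_rows}  # result of the step 1 query

detail = list(map_side_join(small_dates, iter(big_rows)))
print(len(detail))
```

Each input record is handled independently of the others, which is why this work can be spread over many mappers; the cost is that every mapper needs its own copy of the small side.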

IV. Test results

The test ran against a table of 108 GB with 222,612,963 rows. The whole MapReduce run consumed a total CPU time of 1 days 14 hours 49 minutes 56 seconds 470 msec.

MapReduce job information for this run:

 Total jobs = 6
 Launching Job 1 out of 6
 Hadoop job information for Stage-1: number of mappers: 12; number of reducers: 1
 2018-05-23 20:13:16,253 Stage-1 map = 0%,  reduce = 0%
 2018-05-23 20:13:20,426 Stage-1 map = 8%,  reduce = 0%, Cumulative CPU 1.95 sec
 2018-05-23 20:13:21,454 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU 24.02 sec
 2018-05-23 20:13:22,481 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 31.04 sec
 2018-05-23 20:13:27,623 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 33.66 sec
 MapReduce Total cumulative CPU time: 33 seconds 660 msec
 Stage-13 is filtered out by condition resolver.
 Stage-14 is selected by condition resolver.
 Stage-2 is filtered out by condition resolver.
 2018-05-23 20:13:35    Starting to launch local task to process map join;  maximum memory = 508559360
 2018-05-23 20:13:36    End of local task; Time Taken: 0.771 sec.
 Execution completed successfully
 MapredLocal task succeeded
 Launching Job 3 out of 6
 Number of reduce tasks is set to 0 since there's no reduce operator
 Hadoop job information for Stage-11: number of mappers: 445; number of reducers: 0
 2018-05-23 20:13:47,631 Stage-11 map = 0%,  reduce = 0%
 2018-05-23 20:14:23,705 Stage-11 map = 1%,  reduce = 0%, Cumulative CPU 4601.59 sec
 2018-05-23 20:14:32,961 Stage-11 map = 2%,  reduce = 0%, Cumulative CPU 6296.4 sec
 2018-05-23 20:14:41,196 Stage-11 map = 3%,  reduce = 0%, Cumulative CPU 7643.17 sec
 2018-05-23 20:14:53,531 Stage-11 map = 4%,  reduce = 0%, Cumulative CPU 9436.65 sec
 2018-05-23 20:15:03,815 Stage-11 map = 5%,  reduce = 0%, Cumulative CPU 11603.96 sec
 2018-05-23 20:15:14,129 Stage-11 map = 6%,  reduce = 0%, Cumulative CPU 13287.69 sec
 2018-05-23 20:15:22,372 Stage-11 map = 7%,  reduce = 0%, Cumulative CPU 14857.01 sec
 2018-05-23 20:15:30,582 Stage-11 map = 8%,  reduce = 0%, Cumulative CPU 16324.26 sec
 2018-05-23 20:15:38,782 Stage-11 map = 9%,  reduce = 0%, Cumulative CPU 17903.47 sec
 2018-05-23 20:15:45,953 Stage-11 map = 10%,  reduce = 0%, Cumulative CPU 19114.98 sec
 2018-05-23 20:15:54,168 Stage-11 map = 11%,  reduce = 0%, Cumulative CPU 20462.52 sec
 2018-05-23 20:16:00,352 Stage-11 map = 12%,  reduce = 0%, Cumulative CPU 21664.48 sec
 2018-05-23 20:16:09,567 Stage-11 map = 13%,  reduce = 0%, Cumulative CPU 23182.39 sec
 2018-05-23 20:16:16,760 Stage-11 map = 14%,  reduce = 0%, Cumulative CPU 24985.56 sec
 2018-05-23 20:16:23,925 Stage-11 map = 15%,  reduce = 0%, Cumulative CPU 26222.6 sec
 2018-05-23 20:16:31,122 Stage-11 map = 16%,  reduce = 0%, Cumulative CPU 27689.36 sec
 2018-05-23 20:16:37,290 Stage-11 map = 17%,  reduce = 0%, Cumulative CPU 28790.46 sec
 2018-05-23 20:16:42,414 Stage-11 map = 18%,  reduce = 0%, Cumulative CPU 29595.99 sec
 2018-05-23 20:16:49,593 Stage-11 map = 19%,  reduce = 0%, Cumulative CPU 31000.63 sec
 2018-05-23 20:16:55,739 Stage-11 map = 20%,  reduce = 0%, Cumulative CPU 32066.58 sec
 2018-05-23 20:17:01,881 Stage-11 map = 21%,  reduce = 0%, Cumulative CPU 33455.96 sec
 2018-05-23 20:17:08,039 Stage-11 map = 22%,  reduce = 0%, Cumulative CPU 34469.7 sec
 2018-05-23 20:17:14,222 Stage-11 map = 23%,  reduce = 0%, Cumulative CPU 35659.33 sec
 2018-05-23 20:17:20,378 Stage-11 map = 24%,  reduce = 0%, Cumulative CPU 36752.24 sec
 2018-05-23 20:17:26,545 Stage-11 map = 25%,  reduce = 0%, Cumulative CPU 37894.69 sec
 2018-05-23 20:17:31,680 Stage-11 map = 26%,  reduce = 0%, Cumulative CPU 38814.68 sec
 2018-05-23 20:17:37,836 Stage-11 map = 27%,  reduce = 0%, Cumulative CPU 39917.47 sec
 2018-05-23 20:17:43,987 Stage-11 map = 28%,  reduce = 0%, Cumulative CPU 41044.08 sec
 2018-05-23 20:17:50,128 Stage-11 map = 29%,  reduce = 0%, Cumulative CPU 42133.94 sec
 2018-05-23 20:17:56,293 Stage-11 map = 30%,  reduce = 0%, Cumulative CPU 43210.77 sec
 2018-05-23 20:18:02,457 Stage-11 map = 31%,  reduce = 0%, Cumulative CPU 44302.57 sec
 2018-05-23 20:18:08,618 Stage-11 map = 32%,  reduce = 0%, Cumulative CPU 45353.76 sec
 2018-05-23 20:18:14,788 Stage-11 map = 33%,  reduce = 0%, Cumulative CPU 46470.85 sec
 2018-05-23 20:18:20,955 Stage-11 map = 34%,  reduce = 0%, Cumulative CPU 47600.43 sec
 2018-05-23 20:18:27,117 Stage-11 map = 35%,  reduce = 0%, Cumulative CPU 48682.28 sec
 2018-05-23 20:18:34,296 Stage-11 map = 36%,  reduce = 0%, Cumulative CPU 49934.44 sec
 2018-05-23 20:18:40,462 Stage-11 map = 37%,  reduce = 0%, Cumulative CPU 51019.56 sec
 2018-05-23 20:18:47,640 Stage-11 map = 38%,  reduce = 0%, Cumulative CPU 52278.9 sec
 2018-05-23 20:18:54,830 Stage-11 map = 39%,  reduce = 0%, Cumulative CPU 53561.27 sec
 2018-05-23 20:19:00,985 Stage-11 map = 40%,  reduce = 0%, Cumulative CPU 54670.49 sec
 2018-05-23 20:19:11,542 Stage-11 map = 41%,  reduce = 0%, Cumulative CPU 56161.62 sec
 2018-05-23 20:19:15,650 Stage-11 map = 42%,  reduce = 0%, Cumulative CPU 57228.98 sec
 2018-05-23 20:19:23,838 Stage-11 map = 43%,  reduce = 0%, Cumulative CPU 58538.29 sec
 2018-05-23 20:19:29,992 Stage-11 map = 44%,  reduce = 0%, Cumulative CPU 59752.35 sec
 2018-05-23 20:19:36,145 Stage-11 map = 45%,  reduce = 0%, Cumulative CPU 60833.42 sec
 2018-05-23 20:19:42,306 Stage-11 map = 46%,  reduce = 0%, Cumulative CPU 61952.92 sec
 2018-05-23 20:19:49,484 Stage-11 map = 47%,  reduce = 0%, Cumulative CPU 63240.14 sec
 2018-05-23 20:19:55,649 Stage-11 map = 48%,  reduce = 0%, Cumulative CPU 64412.93 sec
 2018-05-23 20:20:02,842 Stage-11 map = 49%,  reduce = 0%, Cumulative CPU 65536.58 sec
 2018-05-23 20:20:10,356 Stage-11 map = 50%,  reduce = 0%, Cumulative CPU 66877.52 sec
 2018-05-23 20:20:19,698 Stage-11 map = 51%,  reduce = 0%, Cumulative CPU 68598.9 sec
 2018-05-23 20:20:25,867 Stage-11 map = 52%,  reduce = 0%, Cumulative CPU 69693.92 sec
 2018-05-23 20:20:33,051 Stage-11 map = 53%,  reduce = 0%, Cumulative CPU 70867.07 sec
 2018-05-23 20:20:41,255 Stage-11 map = 54%,  reduce = 0%, Cumulative CPU 72342.41 sec
 2018-05-23 20:20:47,419 Stage-11 map = 55%,  reduce = 0%, Cumulative CPU 73482.16 sec
 2018-05-23 20:20:55,647 Stage-11 map = 56%,  reduce = 0%, Cumulative CPU 74751.77 sec
 2018-05-23 20:21:04,894 Stage-11 map = 57%,  reduce = 0%, Cumulative CPU 76294.79 sec
 2018-05-23 20:21:14,123 Stage-11 map = 58%,  reduce = 0%, Cumulative CPU 77849.29 sec
 2018-05-23 20:21:24,389 Stage-11 map = 59%,  reduce = 0%, Cumulative CPU 79616.28 sec
 2018-05-23 20:21:43,397 Stage-11 map = 61%,  reduce = 0%, Cumulative CPU 82838.05 sec
 2018-05-23 20:21:52,620 Stage-11 map = 62%,  reduce = 0%, Cumulative CPU 84412.33 sec
 2018-05-23 20:22:01,879 Stage-11 map = 63%,  reduce = 0%, Cumulative CPU 85991.96 sec
 2018-05-23 20:22:11,109 Stage-11 map = 64%,  reduce = 0%, Cumulative CPU 87430.72 sec
 2018-05-23 20:22:19,306 Stage-11 map = 65%,  reduce = 0%, Cumulative CPU 88764.07 sec
 2018-05-23 20:22:29,554 Stage-11 map = 66%,  reduce = 0%, Cumulative CPU 90620.16 sec
 2018-05-23 20:22:38,812 Stage-11 map = 67%,  reduce = 0%, Cumulative CPU 92066.45 sec
 2018-05-23 20:22:50,109 Stage-11 map = 68%,  reduce = 0%, Cumulative CPU 93883.67 sec
 2018-05-23 20:23:06,470 Stage-11 map = 69%,  reduce = 0%, Cumulative CPU 96169.2 sec
 2018-05-23 20:23:12,624 Stage-11 map = 70%,  reduce = 0%, Cumulative CPU 97121.51 sec
 2018-05-23 20:23:21,870 Stage-11 map = 71%,  reduce = 0%, Cumulative CPU 98447.23 sec
 2018-05-23 20:23:33,192 Stage-11 map = 72%,  reduce = 0%, Cumulative CPU 99957.59 sec
 2018-05-23 20:23:42,459 Stage-11 map = 73%,  reduce = 0%, Cumulative CPU 101274.39 sec
 2018-05-23 20:23:52,734 Stage-11 map = 74%,  reduce = 0%, Cumulative CPU 102580.07 sec
 2018-05-23 20:24:05,048 Stage-11 map = 75%,  reduce = 0%, Cumulative CPU 104293.63 sec
 2018-05-23 20:24:14,278 Stage-11 map = 76%,  reduce = 0%, Cumulative CPU 105529.64 sec
 2018-05-23 20:24:22,480 Stage-11 map = 77%,  reduce = 0%, Cumulative CPU 106748.48 sec
 2018-05-23 20:24:31,704 Stage-11 map = 78%,  reduce = 0%, Cumulative CPU 107946.75 sec
 2018-05-23 20:24:41,944 Stage-11 map = 79%,  reduce = 0%, Cumulative CPU 109133.42 sec
 2018-05-23 20:24:53,247 Stage-11 map = 80%,  reduce = 0%, Cumulative CPU 110501.49 sec
 2018-05-23 20:25:07,608 Stage-11 map = 81%,  reduce = 0%, Cumulative CPU 111875.66 sec
 2018-05-23 20:25:20,931 Stage-11 map = 82%,  reduce = 0%, Cumulative CPU 113264.94 sec
 2018-05-23 20:25:35,306 Stage-11 map = 83%,  reduce = 0%, Cumulative CPU 114595.5 sec
 2018-05-23 20:25:51,714 Stage-11 map = 84%,  reduce = 0%, Cumulative CPU 116126.83 sec
 2018-05-23 20:26:08,108 Stage-11 map = 85%,  reduce = 0%, Cumulative CPU 117537.44 sec
 2018-05-23 20:26:24,494 Stage-11 map = 86%,  reduce = 0%, Cumulative CPU 118963.82 sec
 2018-05-23 20:26:40,882 Stage-11 map = 87%,  reduce = 0%, Cumulative CPU 120285.9 sec
 2018-05-23 20:27:03,416 Stage-11 map = 88%,  reduce = 0%, Cumulative CPU 121891.57 sec
 2018-05-23 20:27:24,910 Stage-11 map = 89%,  reduce = 0%, Cumulative CPU 123494.62 sec
 2018-05-23 20:27:43,347 Stage-11 map = 90%,  reduce = 0%, Cumulative CPU 124845.46 sec
 2018-05-23 20:28:14,110 Stage-11 map = 91%,  reduce = 0%, Cumulative CPU 126327.84 sec
 2018-05-23 20:28:44,871 Stage-11 map = 92%,  reduce = 0%, Cumulative CPU 127912.78 sec
 2018-05-23 20:29:12,507 Stage-11 map = 93%,  reduce = 0%, Cumulative CPU 129345.83 sec
 2018-05-23 20:29:57,615 Stage-11 map = 94%,  reduce = 0%, Cumulative CPU 130868.17 sec
 2018-05-23 20:30:58,157 Stage-11 map = 95%,  reduce = 0%, Cumulative CPU 132488.85 sec
 2018-05-23 20:31:58,544 Stage-11 map = 95%,  reduce = 0%, Cumulative CPU 133836.96 sec
 2018-05-23 20:32:02,634 Stage-11 map = 96%,  reduce = 0%, Cumulative CPU 133997.82 sec
 2018-05-23 20:32:40,580 Stage-11 map = 97%,  reduce = 0%, Cumulative CPU 135165.46 sec
 2018-05-23 20:33:37,148 Stage-11 map = 98%,  reduce = 0%, Cumulative CPU 136322.97 sec
 2018-05-23 20:34:37,541 Stage-11 map = 98%,  reduce = 0%, Cumulative CPU 137363.62 sec
 2018-05-23 20:35:02,125 Stage-11 map = 99%,  reduce = 0%, Cumulative CPU 137742.39 sec
 2018-05-23 20:36:02,482 Stage-11 map = 99%,  reduce = 0%, Cumulative CPU 138493.75 sec
 2018-05-23 20:37:02,931 Stage-11 map = 99%,  reduce = 0%, Cumulative CPU 138966.65 sec
 2018-05-23 20:38:03,216 Stage-11 map = 99%,  reduce = 0%, Cumulative CPU 139377.26 sec
 2018-05-23 20:39:03,473 Stage-11 map = 99%,  reduce = 0%, Cumulative CPU 139657.54 sec
 2018-05-23 20:39:36,121 Stage-11 map = 100%,  reduce = 0%, Cumulative CPU 139762.81 sec
 MapReduce Total cumulative CPU time: 1 days 14 hours 49 minutes 22 seconds 810 msec
 Stage-5 is selected by condition resolver.
 Stage-4 is filtered out by condition resolver.
 Stage-6 is filtered out by condition resolver.
 MapReduce Jobs Launched: 
 Stage-Stage-1: Map: 12  Reduce: 1   Cumulative CPU: 33.66 sec   HDFS Read: 1894841 HDFS Write: 8548 SUCCESS
 Stage-Stage-11: Map: 445   Cumulative CPU: 139762.81 sec   HDFS Read: 116588319891 HDFS Write: 336399118813 SUCCESS
 Total MapReduce CPU Time Spent: 1 days 14 hours 49 minutes 56 seconds 470 msec
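
The totals in the log can be cross-checked with a little arithmetic. The sketch below sums the per-stage CPU seconds reported in the "MapReduce Jobs Launched" summary and formats the result the same way the log does; it also compares HDFS bytes written to bytes read for Stage-11. The helper name `format_cpu_time` is made up for illustration.

```python
def format_cpu_time(seconds):
    """Format CPU seconds as 'D days H hours M minutes S seconds N msec'."""
    total_ms = round(seconds * 1000)
    days, rem = divmod(total_ms, 86_400_000)
    hours, rem = divmod(rem, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{days} days {hours} hours {minutes} minutes {secs} seconds {ms} msec"

# Values copied from the job summary lines in the log above.
stage_cpu = {"Stage-1": 33.66, "Stage-11": 139762.81}
total_cpu = format_cpu_time(sum(stage_cpu.values()))
print(total_cpu)

# Stage-11 I/O: the cumulative detail is larger than the input it is built from.
hdfs_read, hdfs_write = 116588319891, 336399118813
write_read_ratio = hdfs_write / hdfs_read
print(round(hdfs_read / 2**30, 1), "GiB read;", round(write_read_ratio, 2), "x written")
```

The sum reproduces the "Total MapReduce CPU Time Spent" line, and almost all of it comes from the map-only Stage-11. The bytes read work out to about 108 GiB, consistent with the stated table size, while the detail output is close to three times that.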

V. Compute the result

Persist the cumulative detail into a table detail(dt string, dev_id string). Note: dt here is t1.dt.

 set hive.map.aggr=true;
 set hive.optimize.skewjoin=true;
 set hive.exec.parallel=true;
 set hive.auto.convert.join=true;
 
 insert overwrite table detail
 select /*+ MAPJOIN(t1) */ t1.dt,t2.dev_id from (select dt from device group by dt) t1,device t2 where t1.dt>=t2.dt

Aggregate

 set hive.map.aggr=true;
 set hive.optimize.skewjoin=true;
 set hive.exec.parallel=true;
 set hive.auto.convert.join=true;
 
 select dt,count(1) as acc from detail group by dt order by dt

Final result

dt          acc
20180501    2
20180502    6
20180503    9
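
The two statements of this section can be tried end to end on the sample data with an in-memory SQLite database. As before, SQLite is only a stand-in for Hive (an assumption for illustration), so the `set` options and the MAPJOIN hint are left out and `insert overwrite` becomes a plain `insert` into an empty table.

```python
import sqlite3

# Sample rows from data.txt: (dev_id, dt)
rows = [
    ("10000004", "20180501"), ("10000005", "20180501"),
    ("10000004", "20180502"), ("10000005", "20180502"),
    ("10000006", "20180502"), ("10000007", "20180502"),
    ("10000007", "20180503"), ("10000008", "20180503"),
    ("10000009", "20180503"),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table device (dev_id text, dt text)")
conn.executemany("insert into device values (?, ?)", rows)

# Persist the cumulative detail; dt is t1.dt, as noted above.
conn.execute("create table detail (dt text, dev_id text)")
conn.execute(
    "insert into detail "
    "select t1.dt, t2.dev_id "
    "from (select dt from device group by dt) t1, device t2 "
    "where t1.dt >= t2.dt"
)

# Aggregate: one row per date with the number of detail rows.
result = conn.execute(
    "select dt, count(1) as acc from detail group by dt order by dt"
).fetchall()
print(result)
```

Counting detail rows per t1.dt gives the cumulative visit count because each date was paired with every visit on or before it.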

    Original author: zhanghuang
    Original article: https://www.jianshu.com/p/af302643d586