I. Requirement: compute cumulative visits from daily visit records
Input data:
Device ID | Date |
---|---|
10000004 | 20180501 |
10000005 | 20180501 |
10000004 | 20180502 |
10000005 | 20180502 |
10000006 | 20180502 |
10000007 | 20180502 |
10000007 | 20180503 |
10000008 | 20180503 |
10000009 | 20180503 |
Output data:
Date | Cumulative |
---|---|
20180501 | 2 |
20180502 | 6 |
20180503 | 9 |
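
Note that the expected output counts visit records, not distinct devices: device 10000004 appears on both 20180501 and 20180502 and is counted twice. The following is a minimal Python sketch of that requirement, useful as a reference when checking the Hive results later. The function name `cumulative_visits` is hypothetical and is not part of the original post.

```python
from collections import Counter
from itertools import accumulate

# Sample rows from the input table above: (device ID, date).
rows = [
    ("10000004", "20180501"), ("10000005", "20180501"),
    ("10000004", "20180502"), ("10000005", "20180502"),
    ("10000006", "20180502"), ("10000007", "20180502"),
    ("10000007", "20180503"), ("10000008", "20180503"),
    ("10000009", "20180503"),
]


def cumulative_visits(rows):
    """Return [(date, cumulative record count)] in ascending date order."""
    daily = Counter(dt for _, dt in rows)         # records per day
    dates = sorted(daily)                         # yyyymmdd strings sort chronologically
    totals = accumulate(daily[d] for d in dates)  # running sum over the days
    return list(zip(dates, totals))


print(cumulative_visits(rows))
# [('20180501', 2), ('20180502', 6), ('20180503', 9)]
```

The daily counts are 2, 4 and 3, so the running totals are 2, 6 and 9, matching the output table.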
II. Prepare the table and data
1. Create the table
create table device(dev_id string,dt string) row format delimited fields terminated by '\t';
2. Prepare the data file data.txt (tab-separated)
10000004 20180501
10000005 20180501
10000004 20180502
10000005 20180502
10000006 20180502
10000007 20180502
10000007 20180503
10000008 20180503
10000009 20180503
3. Load the data into the table
load data local inpath 'data.txt' overwrite into table device;
III. Cumulative detail: the process step by step
Step 1: extract the distinct dates from table device
select dt from device group by dt
dt |
---|
20180503 |
20180502 |
20180501 |
Step 2: use the step 1 query as a subquery on the left side and join it with table device, which produces a Cartesian product
select t1.dt,t2.dt,t2.dev_id from (select dt from device group by dt) t1,device t2
t1.dt | t2.dt | t2.dev_id |
---|---|---|
20180503 | 20180503 | 10000009 |
20180503 | 20180503 | 10000008 |
20180503 | 20180503 | 10000007 |
20180503 | 20180502 | 10000007 |
20180503 | 20180502 | 10000006 |
20180503 | 20180502 | 10000005 |
20180503 | 20180502 | 10000004 |
20180503 | 20180501 | 10000005 |
20180503 | 20180501 | 10000004 |
20180502 | 20180503 | 10000009 |
20180502 | 20180503 | 10000008 |
20180502 | 20180503 | 10000007 |
20180502 | 20180502 | 10000007 |
20180502 | 20180502 | 10000006 |
20180502 | 20180502 | 10000005 |
20180502 | 20180502 | 10000004 |
20180502 | 20180501 | 10000005 |
20180502 | 20180501 | 10000004 |
20180501 | 20180503 | 10000009 |
20180501 | 20180503 | 10000008 |
20180501 | 20180503 | 10000007 |
20180501 | 20180502 | 10000007 |
20180501 | 20180502 | 10000006 |
20180501 | 20180502 | 10000005 |
20180501 | 20180502 | 10000004 |
20180501 | 20180501 | 10000005 |
20180501 | 20180501 | 10000004 |
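
The size of this intermediate result is the number of distinct dates times the number of rows in device. A small Python sketch of the same Cartesian product, using `itertools.product` as a stand-in for the Hive cross join (the variable names are illustrative only):

```python
from itertools import product

# Same sample rows as data.txt: (dev_id, dt).
rows = [
    ("10000004", "20180501"), ("10000005", "20180501"),
    ("10000004", "20180502"), ("10000005", "20180502"),
    ("10000006", "20180502"), ("10000007", "20180502"),
    ("10000007", "20180503"), ("10000008", "20180503"),
    ("10000009", "20180503"),
]

# Step 1: distinct dates (the t1 subquery).
dates = sorted({dt for _, dt in rows}, reverse=True)

# Step 2: pair every date with every row of device (t1 x t2).
cross = [(t1_dt, dt, dev_id) for t1_dt, (dev_id, dt) in product(dates, rows)]

print(len(dates), len(rows), len(cross))  # 3 9 27
```

With 3 dates and 9 rows the product has 27 rows, which is the row count of the table above.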
Step 3: filter with t1.dt>=t2.dt to produce the cumulative detail
select t1.dt,t2.dt,t2.dev_id from (select dt from device group by dt) t1,device t2 where t1.dt>=t2.dt
t1.dt | t2.dt | t2.dev_id |
---|---|---|
20180503 | 20180503 | 10000009 |
20180503 | 20180503 | 10000008 |
20180503 | 20180503 | 10000007 |
20180503 | 20180502 | 10000007 |
20180503 | 20180502 | 10000006 |
20180503 | 20180502 | 10000005 |
20180503 | 20180502 | 10000004 |
20180503 | 20180501 | 10000005 |
20180503 | 20180501 | 10000004 |
20180502 | 20180502 | 10000007 |
20180502 | 20180502 | 10000006 |
20180502 | 20180502 | 10000005 |
20180502 | 20180502 | 10000004 |
20180502 | 20180501 | 10000005 |
20180502 | 20180501 | 10000004 |
20180501 | 20180501 | 10000005 |
20180501 | 20180501 | 10000004 |
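
To check the filter locally without a cluster, the same query can be run against an in-memory SQLite database. This is only a sketch: SQLite stands in for Hive here (an assumption, the two engines differ in many ways), but the non-equi join semantics for this query are the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table device (dev_id text, dt text)")
conn.executemany(
    "insert into device values (?, ?)",
    [
        ("10000004", "20180501"), ("10000005", "20180501"),
        ("10000004", "20180502"), ("10000005", "20180502"),
        ("10000006", "20180502"), ("10000007", "20180502"),
        ("10000007", "20180503"), ("10000008", "20180503"),
        ("10000009", "20180503"),
    ],
)

# Same statement as step 3: keep only pairs where the row's date
# is on or before the reporting date t1.dt.
detail = conn.execute(
    "select t1.dt, t2.dt, t2.dev_id "
    "from (select dt from device group by dt) t1, device t2 "
    "where t1.dt >= t2.dt"
).fetchall()

print(len(detail))  # 17
```

The 17 detail rows are the sum of the cumulative counts (2 + 6 + 9). Each source row is copied once for every date on or after its own, so the detail grows much faster than the source table as the number of dates increases.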
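Final step (the most important): to avoid data skew on large data volumes, use a map join. t1 is loaded entirely into memory and the join is processed on the map side, so the second job has no reduce phase.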
set hive.map.aggr=true;
set hive.optimize.skewjoin=true;
set hive.exec.parallel=true;
set hive.auto.convert.join=true;
select /*+ MAPJOIN(t1) */ t1.dt,t2.dt,t2.dev_id from (select dt from device group by dt) t1,device t2 where t1.dt>=t2.dt
MapReduce job information
Total jobs = 2
Launching Job 1 out of 2
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-05-23 20:37:14,439 Stage-1 map = 0%, reduce = 0%
2018-05-23 20:37:20,680 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.22 sec
2018-05-23 20:37:26,863 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.98 sec
MapReduce Total cumulative CPU time: 3 seconds 980 msec
Launching Job 2 out of 2
Hadoop job information for Stage-4: number of mappers: 1; number of reducers: 0
2018-05-23 20:37:38,493 Stage-4 map = 0%, reduce = 0%
2018-05-23 20:37:44,663 Stage-4 map = 100%, reduce = 0%, Cumulative CPU 3.32 sec
MapReduce Total cumulative CPU time: 3 seconds 320 msec
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 3.98 sec HDFS Read: 5965 HDFS Write: 174 SUCCESS
Stage-Stage-4: Map: 1 Cumulative CPU: 3.32 sec HDFS Read: 6031 HDFS Write: 459 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 300 msec
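
The log shows Stage-4 running with 0 reducers, which is what the map join is meant to achieve. As a rough illustration of the idea (not Hive's actual implementation), each mapper holds the small table in memory and streams its own split of the big table, so no shuffle is needed. The function `map_side_join` and the way the rows are split below are hypothetical.

```python
# Big table rows: (dev_id, dt), same sample data as data.txt.
rows = [
    ("10000004", "20180501"), ("10000005", "20180501"),
    ("10000004", "20180502"), ("10000005", "20180502"),
    ("10000006", "20180502"), ("10000007", "20180502"),
    ("10000007", "20180503"), ("10000008", "20180503"),
    ("10000009", "20180503"),
]


def map_side_join(small_dates, big_split):
    """Join one split of the big table against the in-memory small table."""
    for dev_id, dt in big_split:
        for t1_dt in small_dates:
            if t1_dt >= dt:  # same predicate as the Hive query
                yield (t1_dt, dt, dev_id)


# The small side (t1): distinct dates, small enough to copy to every mapper.
small = sorted({dt for _, dt in rows})

# Pretend the big table is read by two mappers, each with its own split.
splits = [rows[:5], rows[5:]]
joined = [r for split in splits for r in map_side_join(small, split)]

print(len(joined))  # 17
```

Each split is joined independently and the outputs are simply concatenated, which is why the work can stay on the map side.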
IV. Test results
Tested on a table of 108 GB with 222,612,963 rows, the whole MapReduce process consumed the following CPU time: 1 days 14 hours 49 minutes 56 seconds 470 msec.
Its MapReduce job information
Total jobs = 6
Launching Job 1 out of 6
Hadoop job information for Stage-1: number of mappers: 12; number of reducers: 1
2018-05-23 20:13:16,253 Stage-1 map = 0%, reduce = 0%
2018-05-23 20:13:20,426 Stage-1 map = 8%, reduce = 0%, Cumulative CPU 1.95 sec
2018-05-23 20:13:21,454 Stage-1 map = 75%, reduce = 0%, Cumulative CPU 24.02 sec
2018-05-23 20:13:22,481 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 31.04 sec
2018-05-23 20:13:27,623 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 33.66 sec
MapReduce Total cumulative CPU time: 33 seconds 660 msec
Stage-13 is filtered out by condition resolver.
Stage-14 is selected by condition resolver.
Stage-2 is filtered out by condition resolver.
2018-05-23 20:13:35 Starting to launch local task to process map join; maximum memory = 508559360
2018-05-23 20:13:36 End of local task; Time Taken: 0.771 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 3 out of 6
Number of reduce tasks is set to 0 since there's no reduce operator
Hadoop job information for Stage-11: number of mappers: 445; number of reducers: 0
2018-05-23 20:13:47,631 Stage-11 map = 0%, reduce = 0%
2018-05-23 20:14:23,705 Stage-11 map = 1%, reduce = 0%, Cumulative CPU 4601.59 sec
2018-05-23 20:14:32,961 Stage-11 map = 2%, reduce = 0%, Cumulative CPU 6296.4 sec
2018-05-23 20:14:41,196 Stage-11 map = 3%, reduce = 0%, Cumulative CPU 7643.17 sec
2018-05-23 20:14:53,531 Stage-11 map = 4%, reduce = 0%, Cumulative CPU 9436.65 sec
2018-05-23 20:15:03,815 Stage-11 map = 5%, reduce = 0%, Cumulative CPU 11603.96 sec
2018-05-23 20:15:14,129 Stage-11 map = 6%, reduce = 0%, Cumulative CPU 13287.69 sec
2018-05-23 20:15:22,372 Stage-11 map = 7%, reduce = 0%, Cumulative CPU 14857.01 sec
2018-05-23 20:15:30,582 Stage-11 map = 8%, reduce = 0%, Cumulative CPU 16324.26 sec
2018-05-23 20:15:38,782 Stage-11 map = 9%, reduce = 0%, Cumulative CPU 17903.47 sec
2018-05-23 20:15:45,953 Stage-11 map = 10%, reduce = 0%, Cumulative CPU 19114.98 sec
2018-05-23 20:15:54,168 Stage-11 map = 11%, reduce = 0%, Cumulative CPU 20462.52 sec
2018-05-23 20:16:00,352 Stage-11 map = 12%, reduce = 0%, Cumulative CPU 21664.48 sec
2018-05-23 20:16:09,567 Stage-11 map = 13%, reduce = 0%, Cumulative CPU 23182.39 sec
2018-05-23 20:16:16,760 Stage-11 map = 14%, reduce = 0%, Cumulative CPU 24985.56 sec
2018-05-23 20:16:23,925 Stage-11 map = 15%, reduce = 0%, Cumulative CPU 26222.6 sec
2018-05-23 20:16:31,122 Stage-11 map = 16%, reduce = 0%, Cumulative CPU 27689.36 sec
2018-05-23 20:16:37,290 Stage-11 map = 17%, reduce = 0%, Cumulative CPU 28790.46 sec
2018-05-23 20:16:42,414 Stage-11 map = 18%, reduce = 0%, Cumulative CPU 29595.99 sec
2018-05-23 20:16:49,593 Stage-11 map = 19%, reduce = 0%, Cumulative CPU 31000.63 sec
2018-05-23 20:16:55,739 Stage-11 map = 20%, reduce = 0%, Cumulative CPU 32066.58 sec
2018-05-23 20:17:01,881 Stage-11 map = 21%, reduce = 0%, Cumulative CPU 33455.96 sec
2018-05-23 20:17:08,039 Stage-11 map = 22%, reduce = 0%, Cumulative CPU 34469.7 sec
2018-05-23 20:17:14,222 Stage-11 map = 23%, reduce = 0%, Cumulative CPU 35659.33 sec
2018-05-23 20:17:20,378 Stage-11 map = 24%, reduce = 0%, Cumulative CPU 36752.24 sec
2018-05-23 20:17:26,545 Stage-11 map = 25%, reduce = 0%, Cumulative CPU 37894.69 sec
2018-05-23 20:17:31,680 Stage-11 map = 26%, reduce = 0%, Cumulative CPU 38814.68 sec
2018-05-23 20:17:37,836 Stage-11 map = 27%, reduce = 0%, Cumulative CPU 39917.47 sec
2018-05-23 20:17:43,987 Stage-11 map = 28%, reduce = 0%, Cumulative CPU 41044.08 sec
2018-05-23 20:17:50,128 Stage-11 map = 29%, reduce = 0%, Cumulative CPU 42133.94 sec
2018-05-23 20:17:56,293 Stage-11 map = 30%, reduce = 0%, Cumulative CPU 43210.77 sec
2018-05-23 20:18:02,457 Stage-11 map = 31%, reduce = 0%, Cumulative CPU 44302.57 sec
2018-05-23 20:18:08,618 Stage-11 map = 32%, reduce = 0%, Cumulative CPU 45353.76 sec
2018-05-23 20:18:14,788 Stage-11 map = 33%, reduce = 0%, Cumulative CPU 46470.85 sec
2018-05-23 20:18:20,955 Stage-11 map = 34%, reduce = 0%, Cumulative CPU 47600.43 sec
2018-05-23 20:18:27,117 Stage-11 map = 35%, reduce = 0%, Cumulative CPU 48682.28 sec
2018-05-23 20:18:34,296 Stage-11 map = 36%, reduce = 0%, Cumulative CPU 49934.44 sec
2018-05-23 20:18:40,462 Stage-11 map = 37%, reduce = 0%, Cumulative CPU 51019.56 sec
2018-05-23 20:18:47,640 Stage-11 map = 38%, reduce = 0%, Cumulative CPU 52278.9 sec
2018-05-23 20:18:54,830 Stage-11 map = 39%, reduce = 0%, Cumulative CPU 53561.27 sec
2018-05-23 20:19:00,985 Stage-11 map = 40%, reduce = 0%, Cumulative CPU 54670.49 sec
2018-05-23 20:19:11,542 Stage-11 map = 41%, reduce = 0%, Cumulative CPU 56161.62 sec
2018-05-23 20:19:15,650 Stage-11 map = 42%, reduce = 0%, Cumulative CPU 57228.98 sec
2018-05-23 20:19:23,838 Stage-11 map = 43%, reduce = 0%, Cumulative CPU 58538.29 sec
2018-05-23 20:19:29,992 Stage-11 map = 44%, reduce = 0%, Cumulative CPU 59752.35 sec
2018-05-23 20:19:36,145 Stage-11 map = 45%, reduce = 0%, Cumulative CPU 60833.42 sec
2018-05-23 20:19:42,306 Stage-11 map = 46%, reduce = 0%, Cumulative CPU 61952.92 sec
2018-05-23 20:19:49,484 Stage-11 map = 47%, reduce = 0%, Cumulative CPU 63240.14 sec
2018-05-23 20:19:55,649 Stage-11 map = 48%, reduce = 0%, Cumulative CPU 64412.93 sec
2018-05-23 20:20:02,842 Stage-11 map = 49%, reduce = 0%, Cumulative CPU 65536.58 sec
2018-05-23 20:20:10,356 Stage-11 map = 50%, reduce = 0%, Cumulative CPU 66877.52 sec
2018-05-23 20:20:19,698 Stage-11 map = 51%, reduce = 0%, Cumulative CPU 68598.9 sec
2018-05-23 20:20:25,867 Stage-11 map = 52%, reduce = 0%, Cumulative CPU 69693.92 sec
2018-05-23 20:20:33,051 Stage-11 map = 53%, reduce = 0%, Cumulative CPU 70867.07 sec
2018-05-23 20:20:41,255 Stage-11 map = 54%, reduce = 0%, Cumulative CPU 72342.41 sec
2018-05-23 20:20:47,419 Stage-11 map = 55%, reduce = 0%, Cumulative CPU 73482.16 sec
2018-05-23 20:20:55,647 Stage-11 map = 56%, reduce = 0%, Cumulative CPU 74751.77 sec
2018-05-23 20:21:04,894 Stage-11 map = 57%, reduce = 0%, Cumulative CPU 76294.79 sec
2018-05-23 20:21:14,123 Stage-11 map = 58%, reduce = 0%, Cumulative CPU 77849.29 sec
2018-05-23 20:21:24,389 Stage-11 map = 59%, reduce = 0%, Cumulative CPU 79616.28 sec
2018-05-23 20:21:43,397 Stage-11 map = 61%, reduce = 0%, Cumulative CPU 82838.05 sec
2018-05-23 20:21:52,620 Stage-11 map = 62%, reduce = 0%, Cumulative CPU 84412.33 sec
2018-05-23 20:22:01,879 Stage-11 map = 63%, reduce = 0%, Cumulative CPU 85991.96 sec
2018-05-23 20:22:11,109 Stage-11 map = 64%, reduce = 0%, Cumulative CPU 87430.72 sec
2018-05-23 20:22:19,306 Stage-11 map = 65%, reduce = 0%, Cumulative CPU 88764.07 sec
2018-05-23 20:22:29,554 Stage-11 map = 66%, reduce = 0%, Cumulative CPU 90620.16 sec
2018-05-23 20:22:38,812 Stage-11 map = 67%, reduce = 0%, Cumulative CPU 92066.45 sec
2018-05-23 20:22:50,109 Stage-11 map = 68%, reduce = 0%, Cumulative CPU 93883.67 sec
2018-05-23 20:23:06,470 Stage-11 map = 69%, reduce = 0%, Cumulative CPU 96169.2 sec
2018-05-23 20:23:12,624 Stage-11 map = 70%, reduce = 0%, Cumulative CPU 97121.51 sec
2018-05-23 20:23:21,870 Stage-11 map = 71%, reduce = 0%, Cumulative CPU 98447.23 sec
2018-05-23 20:23:33,192 Stage-11 map = 72%, reduce = 0%, Cumulative CPU 99957.59 sec
2018-05-23 20:23:42,459 Stage-11 map = 73%, reduce = 0%, Cumulative CPU 101274.39 sec
2018-05-23 20:23:52,734 Stage-11 map = 74%, reduce = 0%, Cumulative CPU 102580.07 sec
2018-05-23 20:24:05,048 Stage-11 map = 75%, reduce = 0%, Cumulative CPU 104293.63 sec
2018-05-23 20:24:14,278 Stage-11 map = 76%, reduce = 0%, Cumulative CPU 105529.64 sec
2018-05-23 20:24:22,480 Stage-11 map = 77%, reduce = 0%, Cumulative CPU 106748.48 sec
2018-05-23 20:24:31,704 Stage-11 map = 78%, reduce = 0%, Cumulative CPU 107946.75 sec
2018-05-23 20:24:41,944 Stage-11 map = 79%, reduce = 0%, Cumulative CPU 109133.42 sec
2018-05-23 20:24:53,247 Stage-11 map = 80%, reduce = 0%, Cumulative CPU 110501.49 sec
2018-05-23 20:25:07,608 Stage-11 map = 81%, reduce = 0%, Cumulative CPU 111875.66 sec
2018-05-23 20:25:20,931 Stage-11 map = 82%, reduce = 0%, Cumulative CPU 113264.94 sec
2018-05-23 20:25:35,306 Stage-11 map = 83%, reduce = 0%, Cumulative CPU 114595.5 sec
2018-05-23 20:25:51,714 Stage-11 map = 84%, reduce = 0%, Cumulative CPU 116126.83 sec
2018-05-23 20:26:08,108 Stage-11 map = 85%, reduce = 0%, Cumulative CPU 117537.44 sec
2018-05-23 20:26:24,494 Stage-11 map = 86%, reduce = 0%, Cumulative CPU 118963.82 sec
2018-05-23 20:26:40,882 Stage-11 map = 87%, reduce = 0%, Cumulative CPU 120285.9 sec
2018-05-23 20:27:03,416 Stage-11 map = 88%, reduce = 0%, Cumulative CPU 121891.57 sec
2018-05-23 20:27:24,910 Stage-11 map = 89%, reduce = 0%, Cumulative CPU 123494.62 sec
2018-05-23 20:27:43,347 Stage-11 map = 90%, reduce = 0%, Cumulative CPU 124845.46 sec
2018-05-23 20:28:14,110 Stage-11 map = 91%, reduce = 0%, Cumulative CPU 126327.84 sec
2018-05-23 20:28:44,871 Stage-11 map = 92%, reduce = 0%, Cumulative CPU 127912.78 sec
2018-05-23 20:29:12,507 Stage-11 map = 93%, reduce = 0%, Cumulative CPU 129345.83 sec
2018-05-23 20:29:57,615 Stage-11 map = 94%, reduce = 0%, Cumulative CPU 130868.17 sec
2018-05-23 20:30:58,157 Stage-11 map = 95%, reduce = 0%, Cumulative CPU 132488.85 sec
2018-05-23 20:31:58,544 Stage-11 map = 95%, reduce = 0%, Cumulative CPU 133836.96 sec
2018-05-23 20:32:02,634 Stage-11 map = 96%, reduce = 0%, Cumulative CPU 133997.82 sec
2018-05-23 20:32:40,580 Stage-11 map = 97%, reduce = 0%, Cumulative CPU 135165.46 sec
2018-05-23 20:33:37,148 Stage-11 map = 98%, reduce = 0%, Cumulative CPU 136322.97 sec
2018-05-23 20:34:37,541 Stage-11 map = 98%, reduce = 0%, Cumulative CPU 137363.62 sec
2018-05-23 20:35:02,125 Stage-11 map = 99%, reduce = 0%, Cumulative CPU 137742.39 sec
2018-05-23 20:36:02,482 Stage-11 map = 99%, reduce = 0%, Cumulative CPU 138493.75 sec
2018-05-23 20:37:02,931 Stage-11 map = 99%, reduce = 0%, Cumulative CPU 138966.65 sec
2018-05-23 20:38:03,216 Stage-11 map = 99%, reduce = 0%, Cumulative CPU 139377.26 sec
2018-05-23 20:39:03,473 Stage-11 map = 99%, reduce = 0%, Cumulative CPU 139657.54 sec
2018-05-23 20:39:36,121 Stage-11 map = 100%, reduce = 0%, Cumulative CPU 139762.81 sec
MapReduce Total cumulative CPU time: 1 days 14 hours 49 minutes 22 seconds 810 msec
Stage-5 is selected by condition resolver.
Stage-4 is filtered out by condition resolver.
Stage-6 is filtered out by condition resolver.
MapReduce Jobs Launched:
Stage-Stage-1: Map: 12 Reduce: 1 Cumulative CPU: 33.66 sec HDFS Read: 1894841 HDFS Write: 8548 SUCCESS
Stage-Stage-11: Map: 445 Cumulative CPU: 139762.81 sec HDFS Read: 116588319891 HDFS Write: 336399118813 SUCCESS
Total MapReduce CPU Time Spent: 1 days 14 hours 49 minutes 56 seconds 470 msec
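
The figures in this log can be cross-checked with a little arithmetic. The sketch below converts the per-stage CPU seconds into the day/hour/minute form Hive prints and converts the HDFS byte counters of Stage-11 into GiB. The helper `format_cpu` is hypothetical; all numbers are copied from the log above.

```python
def format_cpu(seconds):
    """Format CPU seconds the way the Hive log prints them."""
    msec_total = round(seconds * 1000)
    days, rem = divmod(msec_total, 86_400_000)
    hours, rem = divmod(rem, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, msec = divmod(rem, 1000)
    return f"{days} days {hours} hours {minutes} minutes {secs} seconds {msec} msec"


stage1_cpu = 33.66        # Stage-1 cumulative CPU, seconds
stage11_cpu = 139762.81   # Stage-11 cumulative CPU, seconds

print(format_cpu(stage11_cpu))
# 1 days 14 hours 49 minutes 22 seconds 810 msec
print(format_cpu(stage1_cpu + stage11_cpu))
# 1 days 14 hours 49 minutes 56 seconds 470 msec

# HDFS counters of Stage-11, in bytes.
read_gib = 116588319891 / 2**30
write_gib = 336399118813 / 2**30
print(f"read {read_gib:.1f} GiB, wrote {write_gib:.1f} GiB")
```

The read volume is about 108.6 GiB, consistent with the 108 GB table size, while the write volume is roughly 313 GiB: the cumulative detail is close to three times the size of the source table in this test.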
V. Compute the result
Materialize the cumulative detail into table detail([dt string,dev_id string]). Note: dt is t1.dt
set hive.map.aggr=true;
set hive.optimize.skewjoin=true;
set hive.exec.parallel=true;
set hive.auto.convert.join=true;
insert overwrite table detail
select /*+ MAPJOIN(t1) */ t1.dt,t2.dev_id from (select dt from device group by dt) t1,device t2 where t1.dt>=t2.dt
Aggregate
set hive.map.aggr=true;
set hive.optimize.skewjoin=true;
set hive.exec.parallel=true;
set hive.auto.convert.join=true;
select dt,count(1) as acc from detail group by dt order by dt
Final result
dt | acc |
---|---|
20180501 | 2 |
20180502 | 6 |
20180503 | 9 |
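
The two statements of this section can be rehearsed end to end on the sample data. The sketch below again uses an in-memory SQLite database as a stand-in for Hive (an assumption). SQLite has no `insert overwrite table`, so a plain `insert into` on a freshly created table is used instead, and the `set` options and the MAPJOIN hint are omitted because they are Hive-specific.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table device (dev_id text, dt text)")
conn.execute("create table detail (dt text, dev_id text)")
conn.executemany(
    "insert into device values (?, ?)",
    [
        ("10000004", "20180501"), ("10000005", "20180501"),
        ("10000004", "20180502"), ("10000005", "20180502"),
        ("10000006", "20180502"), ("10000007", "20180502"),
        ("10000007", "20180503"), ("10000008", "20180503"),
        ("10000009", "20180503"),
    ],
)

# Materialize the cumulative detail; dt is t1.dt, the reporting date.
conn.execute(
    "insert into detail "
    "select t1.dt, t2.dev_id "
    "from (select dt from device group by dt) t1, device t2 "
    "where t1.dt >= t2.dt"
)

# Aggregate the detail per reporting date.
result = conn.execute(
    "select dt, count(1) as acc from detail group by dt order by dt"
).fetchall()

print(result)
# [('20180501', 2), ('20180502', 6), ('20180503', 9)]
```

The result matches the output table in the requirement section.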