hive中会对多个mapjoin做进一步的优化,即:将多个mapjoin合并为一个mapjoin,这样做的依据是:
- 一个mapjoin其实只是一个map
- 多个mapjoin其实是多个map,而多个map是可以合并为一个map的
要启用这个功能需要以下2个设置:
set hive.auto.convert.join.noconditionaltask=true;
set hive.auto.convert.join.noconditionaltask.size=比较大的值
我们以以下sql为例看看不同参数设置对执行计划的影响:
SELECT a.id,a.k2,b.mapping2_id
FROM (
SELECT id,coalesce(a2.k1, a1.k2) AS k2
FROM (
SELECT id,k2
FROM id_info
WHERE day = 20180822
) a1
LEFT JOIN (
SELECT k2,k1
FROM mapping1
) a2
ON a1.k2 = a2.k2
) a
JOIN (
SELECT k1,mapping2_id
FROM mapping2
) b
ON a.k2 = b.k1
1. 未启用合共多个mapjoin时
stage如下:
stage流如下:
12 --> 8 \ / 10 --> 5 \
7 -> 11 --> 6 --> 0
1(backup)--/ \ 2(backup)--- /
从上可以看出stage被分成了2块,分别对应2个join阶段, 而对应的2个backup stage则是为了预防,mapjoin不成功的时候,采用这个备用方案.(上面的一堆stage中只要有1条线走通,就不会走其他的线)
后面的一批stage有3条线是因为,后面的join
为inner join
,所以会有2个尝试:
- 用
inner join
的左表(上次join
的中间表,从这里我们可以看出中间结果表也是可以用来做mapjoin的)来做hashtable
- 用
inner join
的右表做hashtable
如果将这个inner join
改为left join
则会变成1个
stage-11就是使用中间结果表作为hashtable
具体情况如下:
Stage: Stage-11
Map Reduce Local Work
Alias -> Map Local Tables:
$INTNAME
Fetch Operator
limit: -1
Alias -> Map Local Operator Tree:
$INTNAME
TableScan
HashTable Sink Operator
keys:
0 _col16 (type: string)
1 _col0 (type: string)
2. 启用了合共多个mapjoin时
STAGE DEPENDENCIES:
Stage-7 is a root stage
Stage-5 depends on stages: Stage-7
Stage-0 depends on stages: Stage-5
只有3个stage(并且没有backup的stage),其中stage-7为做hashtable
的stage
Stage: Stage-7
Map Reduce Local Work
Alias -> Map Local Tables:
a:a2:mapping1
Fetch Operator
limit: -1
b:mapping2
Fetch Operator
limit: -1
Alias -> Map Local Operator Tree:
a:a2:mapping1
TableScan
alias: mapping1
Statistics: Num rows: 624369 Data size: 6877835 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: k2 (type: string), k1 (type: string)
outputColumnNames: _col0, _col1
Statistics: Num rows: 624369 Data size: 6877835 Basic stats: COMPLETE Column stats: NONE
HashTable Sink Operator
keys:
0 _col16 (type: string)
1 _col0 (type: string)
b:mapping2
TableScan
alias: mapping2
Statistics: Num rows: 27716 Data size: 5543330 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: k1 is not null (type: boolean)
Statistics: Num rows: 13858 Data size: 2771665 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: k1 (type: string), mapping2_id (type: string)
outputColumnNames: _col0, _col1
Statistics: Num rows: 13858 Data size: 2771665 Basic stats: COMPLETE Column stats: NONE
HashTable Sink Operator
keys:
0 _col16 (type: string)
1 _col0 (type: string)
从上可以看到读了2个表,来做hashtable
3. 优缺点
优点:
启用了多个mapjoin的合并,可以减少stage之间因为文件落地造成的io开销
缺点:
将hashtable将更占用内存,并且没有backup,任务不够健壮
ref: