Hive Multiple MapJoin优化

hive中会对多个mapjoin做进一步的优化,即:将多个mapjoin合并为一个mapjoin,这样做的依据是:

  1. 一个mapjoin其实只是一个map
  2. 多个mapjoin其实是多个map,而多个map是可以合并为一个map的

要启用这个功能需要以下2个设置:

set hive.auto.convert.join.noconditionaltask=true;
set hive.auto.convert.join.noconditionaltask.size=比较大的值

我们以以下sql为例看看不同参数设置对执行计划的影响:

SELECT a.id,a.k2,b.mapping2_id
FROM (
  SELECT id,coalesce(a2.k1, a1.k2) AS k2
  FROM (
    SELECT id,k2
    FROM id_info
    WHERE day = 20180822
  ) a1
  LEFT JOIN (
     SELECT k2,k1
     FROM mapping1
  ) a2
  ON a1.k2 = a2.k2
) a
JOIN (
  SELECT k1,mapping2_id
  FROM mapping2
) b
ON a.k2 = b.k1

1. 未启用合共多个mapjoin时

stage如下:


stage流如下:

12 --> 8   \      / 10 --> 5     \
            7 ->    11 --> 6  --> 0
1(backup)--/      \ 2(backup)--- /

从上可以看出stage被分成了2块,分别对应2个join阶段, 而对应的2个backup stage则是为了预防,mapjoin不成功的时候,采用这个备用方案.(上面的一堆stage中只要有1条线走通,就不会走其他的线)

后面的一批stage有3条线是因为,后面的joininner join,所以会有2个尝试:

  1. inner join的左表(上次join的中间表,从这里我们可以看出中间结果表也是可以用来做mapjoin的)来做hashtable
  2. inner join的右表做hashtable

如果将这个inner join改为left join则会变成1个

stage-11就是使用中间结果表作为hashtable具体情况如下:

Stage: Stage-11
  Map Reduce Local Work
    Alias -> Map Local Tables:
      $INTNAME 
        Fetch Operator
          limit: -1
    Alias -> Map Local Operator Tree:
      $INTNAME 
        TableScan
          HashTable Sink Operator
            keys:
              0 _col16 (type: string)
              1 _col0 (type: string)

2. 启用了合共多个mapjoin时

STAGE DEPENDENCIES:
  Stage-7 is a root stage
  Stage-5 depends on stages: Stage-7
  Stage-0 depends on stages: Stage-5

只有3个stage(并且没有backup的stage),其中stage-7为做hashtable的stage

Stage: Stage-7
  Map Reduce Local Work
    Alias -> Map Local Tables:
      a:a2:mapping1 
        Fetch Operator
          limit: -1
      b:mapping2 
        Fetch Operator
          limit: -1
    Alias -> Map Local Operator Tree:
      a:a2:mapping1 
        TableScan
          alias: mapping1
          Statistics: Num rows: 624369 Data size: 6877835 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: k2 (type: string), k1 (type: string)
            outputColumnNames: _col0, _col1
            Statistics: Num rows: 624369 Data size: 6877835 Basic stats: COMPLETE Column stats: NONE
            HashTable Sink Operator
              keys:
                0 _col16 (type: string)
                1 _col0 (type: string)
      b:mapping2 
        TableScan
          alias: mapping2
          Statistics: Num rows: 27716 Data size: 5543330 Basic stats: COMPLETE Column stats: NONE
          Filter Operator
            predicate: k1 is not null (type: boolean)
            Statistics: Num rows: 13858 Data size: 2771665 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: k1 (type: string), mapping2_id (type: string)
              outputColumnNames: _col0, _col1
              Statistics: Num rows: 13858 Data size: 2771665 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 _col16 (type: string)
                  1 _col0 (type: string)

从上可以看到读了2个表,来做hashtable

3. 优缺点

优点:

启用了多个mapjoin的合并,可以减少stage之间因为文件落地造成的io开销

缺点:

将hashtable将更占用内存,并且没有backup,任务不够健壮

ref:

  1. MapJoinOptimization
  2. https://cwiki.apache.org/confluence/download/attachments/27362054/Hive+Summit+2011-join.pdf
    原文作者:丧诗
    原文地址: https://www.jianshu.com/p/6415d6664839
    本文转自网络文章,转载此文章仅为分享知识,如有侵权,请联系博主进行删除。
点赞