Hive MapJoin Execution Plans

February 23, 2023. Source: 丧诗

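This article shows how the setting of the hive.mapjoin.smalltable.filesize parameter changes the execution plan, by comparing the plans for the same query with and without mapjoin.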

Test SQL:

SELECT id, clienttime
FROM (
  SELECT id, clienttime, key
  FROM log_table
  WHERE day = '20180801'
) a1
LEFT JOIN (SELECT key, field2 FROM key_mapping) a2 ON a1.key = a2.key
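
As a rough sketch of how the two plans below might be reproduced, the session settings can be changed before running `EXPLAIN` on the test query. The concrete values are assumptions, not taken from the original run: 25000000 bytes is Hive's documented default for `hive.mapjoin.smalltable.filesize`, and 30000000 is chosen only because it exceeds the 29158920-byte data size that the plans report for `key_mapping`. Hive checks the threshold against the table's file size, which may differ from that statistic.

```sql
-- Let Hive convert common (reduce-side) joins to map joins automatically.
SET hive.auto.convert.join=true;
-- Assumption: keep the conditional task and its backup stage in the plan
-- rather than letting Hive pick an unconditional map join.
SET hive.auto.convert.join.noconditionaltask=false;

-- Case 1: threshold below the small table's size, so a common join is expected.
SET hive.mapjoin.smalltable.filesize=25000000;
EXPLAIN
SELECT id, clienttime
FROM (
  SELECT id, clienttime, key
  FROM log_table
  WHERE day = '20180801'
) a1
LEFT JOIN (SELECT key, field2 FROM key_mapping) a2 ON a1.key = a2.key;

-- Case 2: threshold above the small table's size, so a map join is expected.
SET hive.mapjoin.smalltable.filesize=30000000;
EXPLAIN
SELECT id, clienttime
FROM (
  SELECT id, clienttime, key
  FROM log_table
  WHERE day = '20180801'
) a1
LEFT JOIN (SELECT key, field2 FROM key_mapping) a2 ON a1.key = a2.key;
```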

1. Without `mapjoin`

STAGE DEPENDENCIES:
  Stage-4 is a root stage , consists of Stage-1
  Stage-1
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-4
    Conditional Operator

  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: id_app_runtimes
            Statistics: Num rows: 3156646800 Data size: 1666709511947 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: id (type: string), clienttime (type: bigint), key (type: string)
              outputColumnNames: _col0, _col1, _col2
              Statistics: Num rows: 3156646800 Data size: 1666709511947 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: _col2 (type: string)
                sort order: +
                Map-reduce partition columns: _col2 (type: string)
                Statistics: Num rows: 3156646800 Data size: 1666709511947 Basic stats: COMPLETE Column stats: NONE
                value expressions: _col0 (type: string), _col1 (type: bigint)
          TableScan
            alias: key_mapping
            Statistics: Num rows: 145794 Data size: 29158920 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: key (type: string), field2 (type: string)
              outputColumnNames: _col0, _col1
              Statistics: Num rows: 145794 Data size: 29158920 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: _col0 (type: string)
                sort order: +
                Map-reduce partition columns: _col0 (type: string)
                Statistics: Num rows: 145794 Data size: 29158920 Basic stats: COMPLETE Column stats: NONE
                value expressions: _col1 (type: string)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Left Outer Join0 to 1
          keys:
            0 _col2 (type: string)
            1 _col0 (type: string)
          outputColumnNames: _col0, _col1
          Statistics: Num rows: 3472311555 Data size: 1833380502879 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: string), _col1 (type: bigint)
            Statistics: Num rows: 3472311555 Data size: 1833380502879 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 3472311555 Data size: 1833380502879 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

2. With `mapjoin`

STAGE DEPENDENCIES:
  Stage-4 is a root stage , consists of Stage-5, Stage-1
  Stage-5 has a backup stage: Stage-1
  Stage-3 depends on stages: Stage-5
  Stage-1
  Stage-0 depends on stages: Stage-3, Stage-1

STAGE PLANS:
  Stage: Stage-4
    Conditional Operator

  Stage: Stage-5
    Map Reduce Local Work
      Alias -> Map Local Tables:
        a2:key_mapping 
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        a2:key_mapping 
          TableScan
            alias: key_mapping
            Statistics: Num rows: 145794 Data size: 29158920 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: key (type: string), field2 (type: string)
              outputColumnNames: _col0, _col1
              Statistics: Num rows: 145794 Data size: 29158920 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 _col2 (type: string)
                  1 _col0 (type: string)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: id_app_runtimes
            Statistics: Num rows: 3156646800 Data size: 1666709511947 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: id (type: string), clienttime (type: bigint), key (type: string)
              outputColumnNames: _col0, _col1, _col2
              Statistics: Num rows: 3156646800 Data size: 1666709511947 Basic stats: COMPLETE Column stats: NONE
              Map Join Operator
                condition map:
                     Left Outer Join0 to 1
                keys:
                  0 _col2 (type: string)
                  1 _col0 (type: string)
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 3472311555 Data size: 1833380502879 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col0 (type: string), _col1 (type: bigint)
                  Statistics: Num rows: 3472311555 Data size: 1833380502879 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 3472311555 Data size: 1833380502879 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.TextInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: id_app_runtimes
            Statistics: Num rows: 3156646800 Data size: 1666709511947 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: id (type: string), clienttime (type: bigint), key (type: string)
              outputColumnNames: _col0, _col1, _col2
              Statistics: Num rows: 3156646800 Data size: 1666709511947 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: _col2 (type: string)
                sort order: +
                Map-reduce partition columns: _col2 (type: string)
                Statistics: Num rows: 3156646800 Data size: 1666709511947 Basic stats: COMPLETE Column stats: NONE
                value expressions: _col0 (type: string), _col1 (type: bigint)
          TableScan
            alias: key_mapping
            Statistics: Num rows: 145794 Data size: 29158920 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: key (type: string), field2 (type: string)
              outputColumnNames: _col0, _col1
              Statistics: Num rows: 145794 Data size: 29158920 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: _col0 (type: string)
                sort order: +
                Map-reduce partition columns: _col0 (type: string)
                Statistics: Num rows: 145794 Data size: 29158920 Basic stats: COMPLETE Column stats: NONE
                value expressions: _col1 (type: string)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Left Outer Join0 to 1
          keys:
            0 _col2 (type: string)
            1 _col0 (type: string)
          outputColumnNames: _col0, _col1
          Statistics: Num rows: 3472311555 Data size: 1833380502879 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: string), _col1 (type: bigint)
            Statistics: Num rows: 3472311555 Data size: 1833380502879 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 3472311555 Data size: 1833380502879 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

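In the execution plan above:

Stage-5 has Map Local Tables and a HashTable Sink Operator: a local task reads the small table key_mapping and builds a hash table from it.
Stage-3 has a Map Join Operator: the map tasks scanning the large table probe that hash table.

Together, these two stages carry out the mapjoin.

The extra Stage-1 that follows is a backup task. It is the ordinary reduce-side join, and it does not run if Stage-5 succeeds.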

3. Summary

Using mapjoin adds one stage, but it reduces the map-reduce job to a map-only job. The shuffle and the reduce-side join are eliminated, so the query usually runs faster.
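
Besides the size threshold, Hive also accepts an explicit `MAPJOIN` hint. The sketch below is an assumption about how the test query could be written with it, simplified to join the two tables directly. It was not part of the original test. By default Hive ignores the hint (`hive.ignore.mapjoin.hint` defaults to true), so that setting is switched off first.

```sql
-- By default the MAPJOIN hint is ignored; turn that off to honor it.
SET hive.ignore.mapjoin.hint=false;

-- Ask Hive to load a2 (the small table) into memory and join on the map side.
-- In a LEFT JOIN the hash-table side must be the right-hand table, as it is here.
SELECT /*+ MAPJOIN(a2) */ a1.id, a1.clienttime
FROM log_table a1
LEFT JOIN key_mapping a2 ON a1.key = a2.key
WHERE a1.day = '20180801';
```

With a hint there is no size check, so this is only reasonable when the hinted table is known to fit in memory.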

    Original author: 丧诗
    Original URL: https://www.jianshu.com/p/46e63958d070
