This article changes the hive.mapjoin.smalltable.filesize setting to compare the execution plans of the same query with and without mapjoin.
Test SQL:
SELECT id, clienttime
FROM (
SELECT id, clienttime, key
FROM log_table
WHERE day = '20180801'
) a1
LEFT JOIN (SELECT key, field2 FROM key_mapping) a2 ON a1.key = a2.key
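The two plans below can probably be reproduced by running EXPLAIN on the test SQL under two different values of hive.mapjoin.smalltable.filesize. The following is a minimal sketch, not the exact session used for this article: the two threshold values are illustrative assumptions, chosen to fall on either side of the 29158920 bytes reported for key_mapping in the plan statistics. Hive compares the threshold against the table's file size, which may differ from the Data size statistic, so the values may need adjusting.

```sql
-- Assumption: automatic conversion of common joins to mapjoin is enabled.
SET hive.auto.convert.join=true;

-- Case 1: threshold (in bytes) below the size of key_mapping,
-- so the common (reduce-side) join is expected to be kept.
-- 10000000 is an illustrative value, not taken from the original test.
SET hive.mapjoin.smalltable.filesize=10000000;
EXPLAIN
SELECT id, clienttime
FROM (
  SELECT id, clienttime, key
  FROM log_table
  WHERE day = '20180801'
) a1
LEFT JOIN (SELECT key, field2 FROM key_mapping) a2 ON a1.key = a2.key;

-- Case 2: threshold above the size of key_mapping,
-- so the join is expected to become a candidate for mapjoin.
-- 50000000 is likewise an illustrative value.
SET hive.mapjoin.smalltable.filesize=50000000;
EXPLAIN
SELECT id, clienttime
FROM (
  SELECT id, clienttime, key
  FROM log_table
  WHERE day = '20180801'
) a1
LEFT JOIN (SELECT key, field2 FROM key_mapping) a2 ON a1.key = a2.key;
```

Only the threshold changes between the two cases, so any difference between the resulting plans can be attributed to that setting.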
1. Without mapjoin
STAGE DEPENDENCIES:
Stage-4 is a root stage , consists of Stage-1
Stage-1
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-4
Conditional Operator
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: id_app_runtimes
Statistics: Num rows: 3156646800 Data size: 1666709511947 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: id (type: string), clienttime (type: bigint), key (type: string)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 3156646800 Data size: 1666709511947 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col2 (type: string)
sort order: +
Map-reduce partition columns: _col2 (type: string)
Statistics: Num rows: 3156646800 Data size: 1666709511947 Basic stats: COMPLETE Column stats: NONE
value expressions: _col0 (type: string), _col1 (type: bigint)
TableScan
alias: key_mapping
Statistics: Num rows: 145794 Data size: 29158920 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: key (type: string), field2 (type: string)
outputColumnNames: _col0, _col1
Statistics: Num rows: 145794 Data size: 29158920 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 145794 Data size: 29158920 Basic stats: COMPLETE Column stats: NONE
value expressions: _col1 (type: string)
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
keys:
0 _col2 (type: string)
1 _col0 (type: string)
outputColumnNames: _col0, _col1
Statistics: Num rows: 3472311555 Data size: 1833380502879 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: string), _col1 (type: bigint)
Statistics: Num rows: 3472311555 Data size: 1833380502879 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 3472311555 Data size: 1833380502879 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
2. With mapjoin
STAGE DEPENDENCIES:
Stage-4 is a root stage , consists of Stage-5, Stage-1
Stage-5 has a backup stage: Stage-1
Stage-3 depends on stages: Stage-5
Stage-1
Stage-0 depends on stages: Stage-3, Stage-1
STAGE PLANS:
Stage: Stage-4
Conditional Operator
Stage: Stage-5
Map Reduce Local Work
Alias -> Map Local Tables:
a2:key_mapping
Fetch Operator
limit: -1
Alias -> Map Local Operator Tree:
a2:key_mapping
TableScan
alias: key_mapping
Statistics: Num rows: 145794 Data size: 29158920 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: key (type: string), field2 (type: string)
outputColumnNames: _col0, _col1
Statistics: Num rows: 145794 Data size: 29158920 Basic stats: COMPLETE Column stats: NONE
HashTable Sink Operator
keys:
0 _col2 (type: string)
1 _col0 (type: string)
Stage: Stage-3
Map Reduce
Map Operator Tree:
TableScan
alias: id_app_runtimes
Statistics: Num rows: 3156646800 Data size: 1666709511947 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: id (type: string), clienttime (type: bigint), key (type: string)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 3156646800 Data size: 1666709511947 Basic stats: COMPLETE Column stats: NONE
Map Join Operator
condition map:
Left Outer Join0 to 1
keys:
0 _col2 (type: string)
1 _col0 (type: string)
outputColumnNames: _col0, _col1
Statistics: Num rows: 3472311555 Data size: 1833380502879 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: string), _col1 (type: bigint)
Statistics: Num rows: 3472311555 Data size: 1833380502879 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 3472311555 Data size: 1833380502879 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Local Work:
Map Reduce Local Work
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: id_app_runtimes
Statistics: Num rows: 3156646800 Data size: 1666709511947 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: id (type: string), clienttime (type: bigint), key (type: string)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 3156646800 Data size: 1666709511947 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col2 (type: string)
sort order: +
Map-reduce partition columns: _col2 (type: string)
Statistics: Num rows: 3156646800 Data size: 1666709511947 Basic stats: COMPLETE Column stats: NONE
value expressions: _col0 (type: string), _col1 (type: bigint)
TableScan
alias: key_mapping
Statistics: Num rows: 145794 Data size: 29158920 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: key (type: string), field2 (type: string)
outputColumnNames: _col0, _col1
Statistics: Num rows: 145794 Data size: 29158920 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 145794 Data size: 29158920 Basic stats: COMPLETE Column stats: NONE
value expressions: _col1 (type: string)
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
keys:
0 _col2 (type: string)
1 _col0 (type: string)
outputColumnNames: _col0, _col1
Statistics: Num rows: 3472311555 Data size: 1833380502879 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: string), _col1 (type: bigint)
Statistics: Num rows: 3472311555 Data size: 1833380502879 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 3472311555 Data size: 1833380502879 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
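In the execution plan above:
- Stage-5 has a Map Local Tables section and a HashTable Sink Operator
- Stage-3 has a Map Join Operator
Together these two stages carry out the mapjoin: Stage-5 reads key_mapping locally and builds a hash table from it, and the map tasks in Stage-3 probe that hash table while scanning the large table.
The extra Stage-1 that follows is a backup task: if Stage-5 succeeds, Stage-1 is not executed.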
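The Conditional Operator and the backup stage appear because Hive decides at run time whether the mapjoin is usable. As a hedged sketch, Hive also offers the hive.auto.convert.join.noconditionaltask and hive.auto.convert.join.noconditionaltask.size settings, which, as far as is known, allow the conversion to be decided at compile time when the small table is under the size limit. The value below is an illustrative assumption, and the result has not been verified against the cluster used in this article.

```sql
-- Assumption: these properties are available in the Hive version in use.
SET hive.auto.convert.join=true;
SET hive.auto.convert.join.noconditionaltask=true;

-- Illustrative limit in bytes, chosen to exceed the 29158920 bytes
-- reported for key_mapping; not taken from the original test.
SET hive.auto.convert.join.noconditionaltask.size=50000000;

EXPLAIN
SELECT id, clienttime
FROM (
  SELECT id, clienttime, key
  FROM log_table
  WHERE day = '20180801'
) a1
LEFT JOIN (SELECT key, field2 FROM key_mapping) a2 ON a1.key = a2.key;
```

If the conversion applies, the resulting plan would be expected to omit the conditional stage and the backup common join, at the cost of losing the fallback when the hash table does not fit in memory.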
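3. Summary
Using mapjoin adds one stage to the plan, but it reduces the map-reduce job to a map-only job. The shuffle and the reduce-side merge are eliminated, so the query generally runs faster, provided the small table's hash table fits in memory.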