1. IN
query plan
1. NOT IN
select * from id_full as a
where a.id not in (select id from id_incr where day=20180512)
STAGE DEPENDENCIES:
Stage-4 is a root stage
Stage-8 depends on stages: Stage-4 , consists of Stage-10, Stage-1
Stage-10 has a backup stage: Stage-1
Stage-7 depends on stages: Stage-10
Stage-6 depends on stages: Stage-1, Stage-7 , consists of Stage-9, Stage-2
Stage-9 has a backup stage: Stage-2
Stage-5 depends on stages: Stage-9
Stage-2
Stage-1
Stage-0 depends on stages: Stage-5, Stage-2
Three of these stages actually run as jobs; their details follow:
stage-4
Stage: Stage-4
Map Reduce
Map Operator Tree:
TableScan
alias: id_incr
filterExpr: ((day = 20180512) and id is null) (type: boolean)
Statistics: Num rows: 14933063 Data size: 15054692352 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: id is null (type: boolean)
Statistics: Num rows: 7466531 Data size: 7527345671 Basic stats: COMPLETE Column stats: NONE
Select Operator
Statistics: Num rows: 7466531 Data size: 7527345671 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: count()
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
sort order:
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
value expressions: _col0 (type: bigint)
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
mode: mergepartial
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (_col0 = 0) (type: boolean)
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: 0 (type: bigint)
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
Group By Operator
keys: _col0 (type: bigint)
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
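I am not entirely sure what this stage is for. From the plan, it counts the subquery rows whose id is NULL and emits a row only when that count is 0, which looks like the NULL check that NOT IN needs. Based on it, I wrote a SQL query that produces a similar stage: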
select * from (
-- count(*) rather than count(id): count(id) skips NULLs, so it would always be 0 here
select count(*) as cnt from id_incr where day=20180512 and id is null
) as a
where cnt = 0
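To see what such a gate query does, here is a minimal sketch that runs it with Python's built-in sqlite3 module instead of Hive (an assumption made only so the snippet runs anywhere; the helper name null_gate and the table contents are invented for illustration). It runs the gate once with count(*) and once with count(id), to show that only count(*) reacts to NULL ids.

```python
import sqlite3

GATE = (
    "select * from ("
    " select {agg} as cnt from id_incr where day = 20180512 and id is null"
    ") as a where cnt = 0"
)

def null_gate(ids):
    """Return (rows from the count(*) gate, rows from the count(id) gate)."""
    conn = sqlite3.connect(":memory:")  # SQLite stand-in, not Hive
    conn.execute("create table id_incr (id text, day integer)")
    conn.executemany(
        "insert into id_incr values (?, 20180512)", [(i,) for i in ids]
    )
    star = conn.execute(GATE.format(agg="count(*)")).fetchall()
    col = conn.execute(GATE.format(agg="count(id)")).fetchall()
    conn.close()
    return star, col

# No NULL id: both gates emit one row.
print(null_gate(["x", "y"]))
# One NULL id: only the count(*) gate closes (emits no row).
print(null_gate(["x", None]))
```

With a NULL id present, the count(*) gate returns no row, while the count(id) gate still returns one because count(id) ignores NULL values.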
stage-7
Stage: Stage-7
Map Reduce
Map Operator Tree:
TableScan
alias: a
Statistics: Num rows: 4025608842 Data size: 8412504677023 Basic stats: COMPLETE Column stats: NONE
Map Join Operator
condition map:
Left Semi Join 0 to 1
keys:
0
1
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19
Statistics: Num rows: 4428169822 Data size: 9253755345295 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
Local Work:
Map Reduce Local Work
This stage performs a map join and has no reduce phase.
Stage-2
Stage: Stage-2
Map Reduce
Map Operator Tree:
TableScan
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 4428169822 Data size: 9253755345295 Basic stats: COMPLETE Column stats: NONE
value expressions: _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string), _col9 (type: string), _col10 (type: string), _col11 (type: string), _col12 (type: string), _col13 (type: string), _col14 (type: string), _col15 (type: string), _col16 (type: string), _col17 (type: string), _col18 (type: string), _col19 (type: string)
TableScan
alias: id_incr
filterExpr: (day = 20180512) (type: boolean)
Statistics: Num rows: 14933063 Data size: 15054692244 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: id (type: string)
outputColumnNames: _col0
Statistics: Num rows: 14933063 Data size: 15054692244 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 14933063 Data size: 15054692244 Basic stats: COMPLETE Column stats: NONE
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
keys:
0 _col0 (type: string)
1 _col0 (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19, _col24
Statistics: Num rows: 4870986909 Data size: 10179131100451 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: _col24 is null (type: boolean)
Statistics: Num rows: 2435493454 Data size: 5089565549180 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string), _col9 (type: string), _col10 (type: string), _col11 (type: string), _col12 (type: string), _col13 (type: string), _col14 (type: string), _col15 (type: string), _col16 (type: string), _col17 (type: string), _col18 (type: string), _col19 (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19
Statistics: Num rows: 2435493454 Data size: 5089565549180 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 2435493454 Data size: 5089565549180 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
map: two TableScans read the data
reduce: joins the two tables, then applies predicate: _col24 is null (type: boolean)
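Putting the stages together, the NOT IN plan looks like a NULL-count gate combined with a left outer join and an is-null filter. The sketch below is my own hand-written emulation of that shape, not something Hive emits. It runs on SQLite through Python's sqlite3 module, with tiny invented tables and no NULL ids on the id_full side (all of these are assumptions).

```python
import sqlite3

NOT_IN = (
    "select id from id_full "
    "where id not in (select id from id_incr) order by id"
)

# Emulation of the plan shape: gate on the NULL count (stage-4),
# then left outer join + "b.id is null" filter (stage-2).
EMULATED = """
select a.id
from id_full as a
left join id_incr as b on a.id = b.id
where b.id is null
  and exists (
    select 1
    from (select count(*) as cnt from id_incr where id is null) as g
    where g.cnt = 0
  )
order by a.id
"""

def run_both(full_ids, incr_ids):
    """Return (NOT IN result, emulated-plan result) as two lists of ids."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table id_full (id text)")
    conn.execute("create table id_incr (id text)")
    conn.executemany("insert into id_full values (?)", [(i,) for i in full_ids])
    conn.executemany("insert into id_incr values (?)", [(i,) for i in incr_ids])
    result = tuple([row[0] for row in conn.execute(q)] for q in (NOT_IN, EMULATED))
    conn.close()
    return result

print(run_both(["1", "2", "3"], ["2"]))        # subquery without NULL
print(run_both(["1", "2", "3"], ["2", None]))  # subquery with a NULL
```

In both runs the two queries agree: without a NULL in id_incr they return the ids missing from it, and with a NULL the gate closes and both return nothing.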
2. IN
examples
select * from (
select stack(4,
'a', null, 'b', 'a'
) as key
) a
where a.key not in (select key from (select stack(2, 'a', null) as key) b);
output:
# nothing
select * from (
select stack(4,
'a', null, 'b', 'a'
) as key
) a
where a.key not in ('a', null);
output:
# nothing
If the list or subquery of a NOT IN contains a NULL, no rows are returned.
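The reason is SQL's three-valued logic: comparing with NULL yields NULL, so x NOT IN (..., NULL) is either false (x matched another element) or NULL (no match), never true, and WHERE keeps only rows whose predicate is true. A small sketch of the scalar evaluation, using Python's sqlite3 module as a stand-in for Hive (an assumption; the helper evaluate is invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in, not Hive

def evaluate(expr):
    """Evaluate a scalar SQL expression.

    sqlite3 returns SQL NULL as None and boolean results as 0 or 1.
    """
    return conn.execute("select " + expr).fetchone()[0]

for expr in (
    "'a' not in ('a', null)",   # 'a' matches an element: false
    "'b' not in ('a', null)",   # no match, but NULL is in the list: NULL
    "null not in ('a', null)",  # NULL on the left side: NULL
    "'b' not in ('a')",         # no NULL anywhere: true
):
    print(expr, "->", evaluate(expr))
```

Only the last expression evaluates to true, which is why both examples above return nothing.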
3. EXISTS
query plan
1. NOT EXISTS
select * from id_full as a
where not exists (select id from id_incr as b where day=20180512 and a.id =b.id)
STAGE DEPENDENCIES:
Stage-4 is a root stage , consists of Stage-1
Stage-1
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-4
Conditional Operator
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: a
Statistics: Num rows: 4025608842 Data size: 8412504677023 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: id (type: string)
sort order: +
Map-reduce partition columns: id (type: string)
Statistics: Num rows: 4025608842 Data size: 8412504677023 Basic stats: COMPLETE Column stats: NONE
value expressions: _col1 (type: string), ....
TableScan
alias: b
filterExpr: (day = 20180512) (type: boolean)
Statistics: Num rows: 14933063 Data size: 15054692244 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: id (type: string)
outputColumnNames: _col1
Statistics: Num rows: 14933063 Data size: 15054692244 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col1 (type: string)
sort order: +
Map-reduce partition columns: _col1 (type: string)
Statistics: Num rows: 14933063 Data size: 15054692244 Basic stats: COMPLETE Column stats: NONE
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
keys:
0 id (type: string)
1 _col1 (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19, _col25
Statistics: Num rows: 4428169822 Data size: 9253755345295 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: _col25 is null (type: boolean)
Statistics: Num rows: 2214084911 Data size: 4626877672647 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string), _col9 (type: string), _col10 (type: string), _col11 (type: string), _col12 (type: string), _col13 (type: string), _col14 (type: string), _col15 (type: string), _col16 (type: string), _col17 (type: string), _col18 (type: string), _col19 (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19
Statistics: Num rows: 2214084911 Data size: 4626877672647 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 2214084911 Data size: 4626877672647 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
map: two TableScans
reduce: joins the two tables, then applies predicate: _col25 is null (type: boolean)
4. EXISTS
examples
select * from (
select stack(4,
'a', null, 'b', 'a'
) as key
) a
where not exists (select key from (select stack(2, 'a', null) as key) b where a.key = b.key)
output:
NULL
b
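The NULL row survives here because the correlated condition a.key = b.key is never true when either side is NULL, so the subquery finds nothing for that row and NOT EXISTS is true. A minimal sketch that replays the example on SQLite via Python's sqlite3 module (a stand-in for Hive; the column is named k instead of key, and the helper not_exists_keys is invented for illustration):

```python
import sqlite3

def not_exists_keys(outer_keys, inner_keys):
    """Return the outer keys that have no equal key on the inner side."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table a (k text)")
    conn.execute("create table b (k text)")
    conn.executemany("insert into a values (?)", [(k,) for k in outer_keys])
    conn.executemany("insert into b values (?)", [(k,) for k in inner_keys])
    rows = conn.execute(
        "select a.k from a "
        "where not exists (select 1 from b where a.k = b.k) "
        "order by a.rowid"  # keep insertion order so the output is stable
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]

# Same data as stack(4, 'a', null, 'b', 'a') and stack(2, 'a', null).
print(not_exists_keys(["a", None, "b", "a"], ["a", None]))
```

The result contains None (SQL NULL) and 'b', matching the output shown above.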
5. LEFT JOIN
query plan
select * from id_full as a
left join (select id from id_incr as b where day=20180512) as b
on a.id = b.id
where b.id is null
STAGE DEPENDENCIES:
Stage-4 is a root stage , consists of Stage-1
Stage-1
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-4
Conditional Operator
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: b
filterExpr: (day = 20180512) (type: boolean)
Statistics: Num rows: 14933063 Data size: 15054692244 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: id (type: string)
outputColumnNames: _col0
Statistics: Num rows: 14933063 Data size: 15054692244 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 14933063 Data size: 15054692244 Basic stats: COMPLETE Column stats: NONE
TableScan
alias: a
Statistics: Num rows: 4025608842 Data size: 8412504677023 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: id (type: string)
sort order: +
Map-reduce partition columns: id (type: string)
Statistics: Num rows: 4025608842 Data size: 8412504677023 Basic stats: COMPLETE Column stats: NONE
value expressions: id (type: string), ...
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
keys:
0 id (type: string)
1 _col0 (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19, _col24
Statistics: Num rows: 4428169822 Data size: 9253755345295 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: _col24 is null (type: boolean)
Statistics: Num rows: 2214084911 Data size: 4626877672647 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string), _col9 (type: string), _col10 (type: string), _col11 (type: string), _col12 (type: string), _col13 (type: string), _col14 (type: string), _col15 (type: string), _col16 (type: string), _col17 (type: string), _col18 (type: string), _col19 (type: string), null (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19, _col20
Statistics: Num rows: 2214084911 Data size: 4626877672647 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 2214084911 Data size: 4626877672647 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
As you can see, this plan has the same stages as the NOT EXISTS plan: a single left outer join followed by an is-null filter.
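The two forms also return the same rows, which can be checked on a small sample. This is a sketch on SQLite through Python's sqlite3 module, not on Hive; the tables a and b, the column k, and the helper anti_join_results are invented for illustration.

```python
import sqlite3

QUERIES = {
    "not_exists": (
        "select a.k from a "
        "where not exists (select 1 from b where a.k = b.k) "
        "order by a.rowid"
    ),
    "left_join": (
        "select a.k from a left join b on a.k = b.k "
        "where b.k is null "
        "order by a.rowid"
    ),
}

def anti_join_results(outer_keys, inner_keys):
    """Run both anti-join forms on the same data and return their results."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table a (k text)")
    conn.execute("create table b (k text)")
    conn.executemany("insert into a values (?)", [(k,) for k in outer_keys])
    conn.executemany("insert into b values (?)", [(k,) for k in inner_keys])
    results = {
        name: [row[0] for row in conn.execute(sql)]
        for name, sql in QUERIES.items()
    }
    conn.close()
    return results

print(anti_join_results(["a", None, "b", "a"], ["a", None]))
```

Both queries keep the NULL row and 'b': a NULL on either side never satisfies the join condition, so the left join leaves b.k as NULL for exactly the rows that NOT EXISTS keeps.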
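6. Conclusion
For computing a set difference, NOT EXISTS and LEFT JOIN are equivalent in execution.
NOT IN has two cases:
- If the subquery result contains no NULL, it returns the same result as NOT EXISTS and LEFT JOIN, but its plan runs two more stages.
- If the subquery result contains a NULL, it returns an empty result, unlike NOT EXISTS and LEFT JOIN.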