Execution plan example:
hive> EXPLAIN INSERT OVERWRITE TABLE lpx SELECT t1.bar, t1.foo, t2.foo FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar);
OK
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF (TOK_TABNAME pokes) t1) (TOK_TABREF (TOK_TABNAME invites) t2) (= (. (TOK_TABLE_OR_COL t1) bar) (. (TOK_TABLE_OR_COL t2) bar)))) (TOK_INSERT (TOK_DESTINATION (TOK_TAB (TOK_TABNAME lpx))) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL t1) bar)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL t1) foo)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL t2) foo)))))
STAGE DEPENDENCIES:
Stage-1 is a root stage // the root stage
Stage-0 depends on stages: Stage-1 // Stage-0 depends on Stage-1
Stage-2 depends on stages: Stage-0 // Stage-2 depends on Stage-0
STAGE PLANS:
Stage: Stage-1
Map Reduce // this stage is a MapReduce job
Alias -> Map Operator Tree: // the map operator tree, i.e. the map phase
t1
TableScan // scans the table named in FROM; the description includes row counts, sizes, etc.
alias: t1 // table alias
Reduce Output Operator // describes the map output, i.e. the reduce input: key, partition, sort, etc.
key expressions: // the key information t1 sends to the reduce phase
expr: bar
type: string
sort order: + // one sort field (here the key, bar); multiple sort fields show multiple +'s
Map-reduce partition columns: // partition information; note that for a join Hive partitions on the columns of the ON clause, so rows with equal values in those columns land in the same reducer
expr: bar
type: string
tag: 0 // tag marking rows that come from t1
value expressions: // the value information t1 sends to the reduce phase
expr: foo
type: int
expr: bar
type: string
t2
TableScan
alias: t2
Reduce Output Operator
key expressions:
expr: bar
type: string
sort order: +
Map-reduce partition columns:
expr: bar
type: string
tag: 1
value expressions:
expr: foo
type: int
Reduce Operator Tree: // the reduce operator tree, i.e. the reduce phase
Join Operator
condition map:
Inner Join 0 to 1
condition expressions:
0 {VALUE._col0} {VALUE._col1} // corresponds to t1.bar, t1.foo above
1 {VALUE._col0} // corresponds to t2.foo above
handleSkewJoin: false
outputColumnNames: _col0, _col1, _col5
Select Operator // projects columns; the description lists column names, types, output types, sizes, etc.
expressions:
expr: _col1
type: string
expr: _col0
type: int
expr: _col5
type: int
outputColumnNames: _col0, _col1, _col2 // temporary column names generated for the intermediate result
File Output Operator // writes the result to a temporary file; the description gives the compression setting and the output file format
compressed: false
GlobalTableId: 1
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: default.lpx
Stage: Stage-0
Move Operator // Stage-0 simply moves the result from the temporary directory into the directory of table lpx
tables:
replace: true
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: default.lpx
Stage: Stage-2
Stats-Aggr Operator
========================================
========================================
From the header:
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
Stage-2 depends on stages: Stage-0
This shows the job structure of the plan: the whole task executes in three stages. The first is Stage-1;
Stage-0 runs next and must wait for Stage-1's result;
Stage-2 runs last and must wait for Stage-0's result.
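The stage list forms a small dependency DAG, and Hive's Driver runs the stages in topological order. A minimal Python sketch of resolving that order (illustrative only; this is not Hive's actual scheduler):

```python
# Sketch: resolve stage execution order from the dependency list above.
deps = {
    "Stage-1": [],           # root stage
    "Stage-0": ["Stage-1"],  # Stage-0 depends on Stage-1
    "Stage-2": ["Stage-0"],  # Stage-2 depends on Stage-0
}

def topo_order(deps):
    """Return stages in an order where every stage follows its dependencies."""
    order, done = [], set()
    def visit(stage):
        if stage in done:
            return
        for d in deps[stage]:
            visit(d)          # schedule dependencies first
        done.add(stage)
        order.append(stage)
    for stage in deps:
        visit(stage)
    return order

print(topo_order(deps))  # ['Stage-1', 'Stage-0', 'Stage-2']
```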
Stage-1 and Stage-0 are explained below. The SQL can be split into two steps:
(1) SELECT t1.bar, t1.foo, t2.foo FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar);
(2) INSERT OVERWRITE TABLE lpx;
Stage-1 corresponds to one complete MapReduce job, comprising a Map Operator Tree and a Reduce Operator Tree: the Map Operator Tree corresponds to the map task, the Reduce Operator Tree to the reduce task. The Map Operator Tree shows two parallel scans, t1 and t2, roughly SELECT t1.bar, t1.foo FROM t1 and SELECT t2.foo FROM t2; each map produces input for the reduce phase via a Reduce Output Operator. The Reduce Operator Tree then joins the map outputs on the join condition and, via the predefined output format, writes data in the storage format of default.lpx to HDFS. Since we did not specify a storage format when creating lpx, it defaults to text, read and written with TextInputFormat and TextOutputFormat:
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: default.lpx
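The tag/shuffle mechanism above can be sketched as a simplified Python model (not Hive code): each map side emits (join key, tag, value); rows with the same key meet in one reduce call, where the tag tells which table each row came from.

```python
# Simplified model of Hive's common (reduce-side) join.
# Map side: tag each row with its table of origin and emit by join key.
t1 = [("b1", 1), ("b2", 2)]    # pokes:   (bar, foo)
t2 = [("b1", 10), ("b3", 30)]  # invites: (bar, foo)

shuffle = {}  # join key -> list of (tag, value), as the shuffle groups them
for bar, foo in t1:
    shuffle.setdefault(bar, []).append((0, (bar, foo)))  # tag 0 = t1
for bar, foo in t2:
    shuffle.setdefault(bar, []).append((1, (foo,)))      # tag 1 = t2

# Reduce side: inner-join tag-0 rows with tag-1 rows sharing the same key.
result = []
for key, rows in shuffle.items():
    left  = [v for tag, v in rows if tag == 0]
    right = [v for tag, v in rows if tag == 1]
    for l in left:
        for r in right:
            result.append(l + r)  # (t1.bar, t1.foo, t2.foo)

print(result)  # [('b1', 1, 10)]
```

Only the key "b1" appears in both tables, so only its rows produce output; that is exactly why partitioning on the ON columns suffices for correctness.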
The input format value is org.apache.hadoop.mapred.TextInputFormat because the temporary output produced by the map phase is saved in TextOutputFormat, so the reduce side naturally reads it back with TextInputFormat. These details are handled by Hadoop's MapReduce machinery; Hive only needs to specify the formats.
The serde value is the org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe class. The values this object holds are _col0, _col1, _col2, i.e. the t1.bar, t1.foo, t2.foo we expect to query; the concrete serialized value is _col0 + lpx's field delimiter + _col1 + the delimiter + _col2. The output format value, org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, tells us that class handles the output.
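Conceptually, the serialized text row is just the column values joined by the table's field delimiter. A hedged illustration (not Hive code; \x01 is Hive's default delimiter, assuming lpx was created without a custom ROW FORMAT DELIMITED clause):

```python
# Illustration of how a LazySimpleSerDe-style text row is laid out.
FIELD_DELIM = "\x01"  # Hive's default field delimiter (Ctrl-A)

def serialize_row(cols):
    """Join column values with the field delimiter; None becomes empty."""
    return FIELD_DELIM.join("" if c is None else str(c) for c in cols)

# _col0, _col1, _col2 correspond to t1.bar, t1.foo, t2.foo.
row = serialize_row(["some_bar", 1, 10])
print(repr(row))  # 'some_bar\x011\x0110'
```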
Stage-0 corresponds to the second step above. The temporary file produced by Stage-1 (say, tmp) is processed into table lpx by Stage-0. The Move Operator indicates that this is not a MapReduce task: it just invokes MoveTask, which first checks that the input files match lpx's storage format.
Uses of the Hive execution plan:
Analyze how a job executes and optimize its flow to improve efficiency. For example, moving a data filter from the reduce side up to the map side effectively shrinks the data shuffled between map and reduce, speeding up the job.
Filter the data set early to avoid unnecessary reads. For example, a Hive join runs before the WHERE clause filters; putting partition conditions into the ON clause can significantly reduce the input data set.
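The effect of pushing a filter from the reduce side to the map side can be quantified with a toy model: filtering before the shuffle shrinks the data that crosses the map/reduce boundary, while the final answer stays the same.

```python
# Toy model: rows crossing the shuffle with and without map-side filtering.
# (Illustrative counts only; not Hive internals.)
rows = [{"key": "deliver_id_bucket_id", "dt": d} for d in range(100)] + \
       [{"key": "other", "dt": d} for d in range(900)]

# Filter applied on the reduce side: every row is shuffled first.
shuffled_late = rows
kept_late = [r for r in shuffled_late if r["key"] == "deliver_id_bucket_id"]

# Filter pushed to the map side: only matching rows are shuffled.
shuffled_early = [r for r in rows if r["key"] == "deliver_id_bucket_id"]

print(len(shuffled_late), len(shuffled_early))  # 1000 100
assert kept_late == shuffled_early              # same answer, 10x less shuffle
```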
A problematic HQL analyzed via its execution plan:
select a.*, b.cust_uid
from ods_ad_bid_deliver_info b join mds_ad_algo_feed_monitor_data_table a
where a.dt<=20140101 and a.dt<=20140108 and key='deliver_id_bucket_id' and a.dt=b.dt and a.key_slice=b.deliver_id
==========================================================================
==========================================================================
Execution plan:
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF (TOK_TABNAME ods_ad_bid_deliver_info) b) (TOK_TABREF (TOK_TABNAME mds_ad_algo_feed_monitor_data_table) a))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_ALLCOLREF (TOK_TABNAME a))) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) cust_uid))) (TOK_WHERE (and (and (and (and (<= (. (TOK_TABLE_OR_COL a) dt) 20140101) (<= (. (TOK_TABLE_OR_COL a) dt) 20140108)) (= (TOK_TABLE_OR_COL key) 'deliver_id_bucket_id')) (= (. (TOK_TABLE_OR_COL a) dt) (. (TOK_TABLE_OR_COL b) dt))) (= (. (TOK_TABLE_OR_COL a) key_slice) (. (TOK_TABLE_OR_COL b) deliver_id))))))
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
a
TableScan
alias: a
Filter Operator
predicate:
expr: (key = 'deliver_id_bucket_id') // filtered on the given key value in the map phase
type: boolean
Reduce Output Operator
sort order:
tag: 1
value expressions: // select * causes all columns to be sent to the reduce side
expr: key
type: string
expr: key_slice
type: string
expr: billing_mode_slice
type: string
expr: bucket_id
type: string
expr: ctr
type: string
expr: ecpm
type: string
expr: auc
type: string
expr: pctr
type: string
expr: pctr_ctr
type: string
expr: total_pv
type: string
expr: total_click
type: string
expr: dt
type: string
b
TableScan
alias: b
Reduce Output Operator
sort order:
tag: 0
value expressions:
expr: deliver_id
type: string
expr: cust_uid
type: string
expr: dt
type: string
Reduce Operator Tree:
Join Operator
condition map:
Inner Join 0 to 1
condition expressions:
0 {VALUE._col0} {VALUE._col6} {VALUE._col35}
1 {VALUE._col0} {VALUE._col1} {VALUE._col2} {VALUE._col3} {VALUE._col4} {VALUE._col5} {VALUE._col6} {VALUE._col7} {VALUE._col8} {VALUE._col9} {VALUE._col10} {VALUE._col11}
handleSkewJoin: false
outputColumnNames: _col0, _col6, _col35, _col38, _col39, _col40, _col41, _col42, _col43, _col44, _col45, _col46, _col47, _col48, _col49
Filter Operator
predicate:
expr: (((((_col49 <= 20140101) and (_col49 <= 20140108)) and (_col38 = 'deliver_id_bucket_id')) and (_col49 = _col35)) and (_col39 = _col0))
type: boolean
Select Operator
expressions:
expr: _col38
type: string
expr: _col39
type: string
expr: _col40
type: string
expr: _col41
type: string
expr: _col42
type: string
expr: _col43
type: string
expr: _col44
type: string
expr: _col45
type: string
expr: _col46
type: string
expr: _col47
type: string
expr: _col48
type: string
expr: _col49
type: string
expr: _col6
type: string
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1
Optimized HQL:
select a.*, b.cust_uid
from ods_ad_bid_deliver_info b
join mds_ad_algo_feed_monitor_data_table a
on (a.dt<=20140101 and a.dt<=20140108 and a.dt=b.dt and a.key_slice=b.deliver_id and a.key='deliver_id_bucket_id')
=================================================================
=================================================================
Execution plan:
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
a
TableScan
alias: a
Filter Operator
predicate:
expr: (key = 'deliver_id_bucket_id')
type: boolean
Filter Operator
predicate:
expr: (dt <= 20140101) // partition filter now applied on the map side
type: boolean
Filter Operator
predicate:
expr: (dt <= 20140108) // partition filter now applied on the map side
type: boolean
Filter Operator
predicate:
expr: (key = 'deliver_id_bucket_id')
type: boolean
Reduce Output Operator
key expressions:
expr: dt
type: string
expr: key_slice
type: string
sort order: ++
Map-reduce partition columns:
expr: dt
type: string
expr: key_slice
type: string
tag: 1
value expressions:
expr: key
type: string
expr: key_slice
type: string
expr: billing_mode_slice
type: string
expr: bucket_id
type: string
expr: ctr
type: string
expr: ecpm
type: string
expr: auc
type: string
expr: pctr
type: string
expr: pctr_ctr
type: string
expr: total_pv
type: string
expr: total_click
type: string
expr: dt
type: string
b
TableScan
alias: b
Reduce Output Operator
key expressions:
expr: dt
type: string
expr: deliver_id
type: string
sort order: ++
Map-reduce partition columns:
expr: dt
type: string
expr: deliver_id
type: string
tag: 0
value expressions:
expr: cust_uid
type: string
Reduce Operator Tree:
Join Operator
condition map:
Inner Join 0 to 1
condition expressions:
0 {VALUE._col6}
1 {VALUE._col0} {VALUE._col1} {VALUE._col2} {VALUE._col3} {VALUE._col4} {VALUE._col5} {VALUE._col6} {VALUE._col7} {VALUE._col8} {VALUE._col9} {VALUE._col10} {VALUE._col11}
handleSkewJoin: false
outputColumnNames: _col6, _col38, _col39, _col40, _col41, _col42, _col43, _col44, _col45, _col46, _col47, _col48, _col49
Select Operator
expressions:
expr: _col38
type: string
expr: _col39
type: string
expr: _col40
type: string
expr: _col41
type: string
expr: _col42
type: string
expr: _col43
type: string
expr: _col44
type: string
expr: _col45
type: string
expr: _col46
type: string
expr: _col47
type: string
expr: _col48
type: string
expr: _col49
type: string
expr: _col6
type: string
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1
Example:
select * from emp e
left join dept d on e.deptno=d.deptno
where d.dt='2018-06-04';
Time taken: 44.401 seconds, Fetched: 17 row(s)
Execution plan:
STAGE DEPENDENCIES:
Stage-4 is a root stage
Stage-3 depends on stages: Stage-4
Stage-0 depends on stages: Stage-3
STAGE PLANS:
Stage: Stage-4
Map Reduce Local Work // executed locally
Alias -> Map Local Tables:
d
Fetch Operator
limit: -1
Alias -> Map Local Operator Tree:
d
TableScan
alias: d
Statistics: Num rows: 1 Data size: 168 Basic stats: PARTIAL Column stats: PARTIAL
HashTable Sink Operator // builds the local in-memory hash table for the map join. (A Reduce Sink Operator, by contrast, serializes map-side fields into the reduce key/value and partition key; it can appear only in the map phase and marks the end of the map phase of a Hive-generated MapReduce program.)
keys:
0 deptno (type: string)
1 deptno (type: string)
Stage: Stage-3
Map Reduce
Map Operator Tree:
TableScan
alias: e
Statistics: Num rows: 1 Data size: 757 Basic stats: PARTIAL Column stats: PARTIAL
Map Join Operator
condition map:
Left Outer Join0 to 1
keys:
0 deptno (type: string)
1 deptno (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col12, _col13, _col14, _col15
Statistics: Num rows: 1 Data size: 832 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (_col15 = '2018-06-04') (type: boolean)
Statistics: Num rows: 1 Data size: 832 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string), _col12 (type: string), _col13 (type: string), _col14 (type: string), '2018-06-04' (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
Statistics: Num rows: 1 Data size: 832 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 832 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Local Work:
Map Reduce Local Work
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
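The Stage-4/Stage-3 pair above is a map join: the small table (dept) is loaded into an in-memory hash table locally (HashTable Sink Operator), and the big table (emp) is streamed through the map, probing that table; no shuffle or reduce phase is needed. A Python sketch (illustrative; the table contents are made up):

```python
# Sketch of a map (broadcast/hash) join, as in Stage-4 + Stage-3 above.
dept = [("10", "ACCOUNTING", "2018-06-04"), ("20", "RESEARCH", "2018-06-04")]
emp  = [("7839", "KING", "10"), ("7369", "SMITH", "20"), ("9999", "X", "99")]

# Stage-4 (local work): build a hash table on deptno from the small table.
hash_table = {}
for deptno, dname, dt in dept:
    hash_table.setdefault(deptno, []).append((dname, dt))

# Stage-3 (map only): stream the big table, probing the hash table.
# LEFT OUTER JOIN: unmatched emp rows are kept with NULL-extended columns.
result = []
for empno, ename, deptno in emp:
    for dname, dt in hash_table.get(deptno, [(None, None)]):
        result.append((empno, ename, deptno, dname, dt))

print(result[0])   # ('7839', 'KING', '10', 'ACCOUNTING', '2018-06-04')
print(result[-1])  # ('9999', 'X', '99', None, None)
```

Because the join completes inside the map task, the reducer-side sort and shuffle cost disappears entirely, which is where most of the speed-up comes from.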
select * from emp e
left join dept d on (e.deptno=d.deptno and d.dt='2018-06-04');
Time taken: 23.804 seconds, Fetched: 17 row(s)
STAGE DEPENDENCIES:
Stage-4 is a root stage
Stage-3 depends on stages: Stage-4
Stage-0 depends on stages: Stage-3
STAGE PLANS:
Stage: Stage-4
Map Reduce Local Work
Alias -> Map Local Tables:
d
Fetch Operator
limit: -1
Alias -> Map Local Operator Tree:
d
TableScan
alias: d
filterExpr: (dt = '2018-06-04') (type: boolean)
Statistics: Num rows: 1 Data size: 84 Basic stats: PARTIAL Column stats: PARTIAL
HashTable Sink Operator
keys:
0 deptno (type: string)
1 deptno (type: string)
Stage: Stage-3
Map Reduce
Map Operator Tree:
TableScan
alias: e
Statistics: Num rows: 1 Data size: 757 Basic stats: PARTIAL Column stats: PARTIAL
Map Join Operator
condition map:
Left Outer Join0 to 1
keys:
0 deptno (type: string)
1 deptno (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col12, _col13, _col14, _col15
Statistics: Num rows: 1 Data size: 832 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string), _col12 (type: string), _col13 (type: string), _col14 (type: string), _col15 (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
Statistics: Num rows: 1 Data size: 832 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 832 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Local Work:
Map Reduce Local Work
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
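A semantic footnote on this last pair of queries: with a LEFT JOIN, a filter on the right table in WHERE is applied after the join and discards NULL-extended rows (effectively turning it into an inner join), whereas the same condition in ON only restricts which right-side rows match, keeping unmatched left rows. Both queries happened to fetch 17 rows here, but they are not equivalent in general. A toy illustration (made-up data):

```python
emp  = [("KING", "10"), ("NOPE", "99")]              # (ename, deptno)
dept = [("10", "2018-06-04"), ("10", "2017-01-01")]  # (deptno, dt)

def left_join(emp, dept, on_pred):
    """LEFT JOIN emp with dept; on_pred filters dept rows inside ON."""
    out = []
    for ename, ed in emp:
        matches = [(dn, dt) for dn, dt in dept if dn == ed and on_pred(dt)]
        if matches:
            out.extend((ename, ed, dn, dt) for dn, dt in matches)
        else:
            out.append((ename, ed, None, None))  # NULL-extended row
    return out

# Filter inside ON: unmatched emp rows survive with NULLs.
on_version = left_join(emp, dept, lambda dt: dt == "2018-06-04")

# Filter in WHERE: applied after the join; NULL-extended rows are dropped.
where_version = [r for r in left_join(emp, dept, lambda dt: True)
                 if r[3] == "2018-06-04"]

print(on_version)     # [('KING', '10', '10', '2018-06-04'), ('NOPE', '99', None, None)]
print(where_version)  # [('KING', '10', '10', '2018-06-04')]
```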