Hive execution plan examples

Execution plan example:

insert overwrite TABLE lpx SELECT t1.bar, t1.foo, t2.foo FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) ;

OK

ABSTRACT SYNTAX TREE:

  (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF (TOK_TABNAME pokes) t1) (TOK_TABREF (TOK_TABNAME invites) t2) (= (. (TOK_TABLE_OR_COL t1) bar) (. (TOK_TABLE_OR_COL t2) bar)))) (TOK_INSERT (TOK_DESTINATION (TOK_TAB (TOK_TABNAME lpx))) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL t1) bar)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL t1) foo)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL t2) foo)))))

STAGE DEPENDENCIES:

  Stage-1 is a root stage   // the root: depends on no other stage

  Stage-0 depends on stages: Stage-1   // Stage-0 depends on Stage-1

  Stage-2 depends on stages: Stage-0   // Stage-2 depends on Stage-0

STAGE PLANS:

  Stage: Stage-1

    Map Reduce   // this stage is one MapReduce job

      Alias -> Map Operator Tree:   // map operator tree, i.e. the map phase

        t1

          TableScan   // scans the table named in FROM to load its data; the description can include row count, size, etc.

            alias: t1   // table alias

            Reduce Output Operator   // describes the map output, i.e. the reduce input: key, partitioning, sort order, etc.

              key expressions:   // key that t1 rows carry into the reduce phase

                    expr: bar

                    type: string

              sort order: +   // one sort column, here the key bar; several sort columns give several + signs

              Map-reduce partition columns:   // partitioning info; it shows that for a join Hive partitions on the columns of the ON clause, so that rows with the same value in that column reach the same reducer

                    expr: bar

                    type: string

              tag: 0   // tags the rows coming from t1

              value expressions:   // values that t1 rows carry into the reduce phase

                    expr: foo

                    type: int

                    expr: bar

                    type: string

        t2

          TableScan

            alias: t2

            Reduce Output Operator

              key expressions:

                    expr: bar

                    type: string

              sort order: +

              Map-reduce partition columns:

                    expr: bar

                    type: string

              tag: 1

              value expressions:

                    expr: foo

                    type: int

      Reduce Operator Tree:   // reduce operator tree, i.e. the reduce phase

        Join Operator

          condition map:

               Inner Join 0 to 1

          condition expressions:

            0 {VALUE._col0} {VALUE._col1}   // t1's value columns foo and bar, which supply t1.foo and t1.bar

            1 {VALUE._col0}   // t2's value column foo, which supplies t2.foo

          handleSkewJoin: false

          outputColumnNames: _col0, _col1, _col5

          Select Operator   // column projection; the description lists column names, types, output types, sizes, etc.

            expressions:

                  expr: _col1

                  type: string

                  expr: _col0

                  type: int

                  expr: _col5

                  type: int

            outputColumnNames: _col0, _col1, _col2   // temporary column names generated by rule for the intermediate result

            File Output Operator   // writes the result to temporary files; the description gives the compression setting and the output file format

              compressed: false

              GlobalTableId: 1

              table:

                  input format: org.apache.hadoop.mapred.TextInputFormat

                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

                  name: default.lpx

  Stage: Stage-0

    Move Operator   // Stage-0 simply moves the result from the temporary directory to the directory of table lpx

      tables:

          replace: true

          table:

              input format: org.apache.hadoop.mapred.TextInputFormat

              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

              name: default.lpx

  Stage: Stage-2

    Stats-Aggr Operator
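The Stage-1 plan can be mimicked in a few lines of Python. This is only a rough sketch of the idea (emit key, tag and values on the map side, partition on the join key, pair tag 0 with tag 1 on the reduce side), not Hive's actual implementation. The function names, the sample rows and the reducer count are all made up for illustration.

```python
from collections import defaultdict

def map_phase(rows, tag, key_col, value_cols):
    """Mimic Reduce Output Operator: emit (key, tag, values) for each input row."""
    for row in rows:
        yield row[key_col], tag, tuple(row[c] for c in value_cols)

def shuffle(records, num_reducers):
    """Partition on the join key so that equal keys meet in the same reducer."""
    partitions = [defaultdict(lambda: {0: [], 1: []}) for _ in range(num_reducers)]
    for key, tag, values in records:
        partitions[hash(key) % num_reducers][key][tag].append(values)
    return partitions

def reduce_phase(partitions):
    """Mimic 'Inner Join 0 to 1' followed by the Select Operator (bar, foo, foo)."""
    out = []
    for part in partitions:
        for by_tag in part.values():
            for (t1_foo, t1_bar) in by_tag[0]:      # rows tagged 0 come from t1
                for (t2_foo,) in by_tag[1]:         # rows tagged 1 come from t2
                    out.append((t1_bar, t1_foo, t2_foo))
    return sorted(out)

# Hypothetical sample data standing in for pokes (t1) and invites (t2).
pokes = [{"foo": 1, "bar": "a"}, {"foo": 2, "bar": "b"}]
invites = [{"foo": 10, "bar": "a"}, {"foo": 11, "bar": "a"}, {"foo": 12, "bar": "c"}]

records = list(map_phase(pokes, 0, "bar", ["foo", "bar"]))
records += list(map_phase(invites, 1, "bar", ["foo"]))
result = reduce_phase(shuffle(records, num_reducers=2))
print(result)  # [('a', 1, 10), ('a', 1, 11)]
```

Only the key "a" exists on both sides, so only its rows are paired; "b" and "c" are dropped, as an inner join requires.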

========================================

========================================

From the header of the plan:

STAGE DEPENDENCIES:

Stage-1 is a root stage

Stage-0 depends on stages: Stage-1

Stage-2 depends on stages: Stage-0

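This shows how the plan is structured: the work is split into three stages that run one after another. Stage-1 runs first, as a MapReduce job;

Stage-0 runs second and depends on the result of Stage-1;

Stage-2 runs third and depends on the result of Stage-0.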
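To make the dependency reading concrete, here is a small sketch that parses the STAGE DEPENDENCIES lines and derives an execution order. It assumes only the two line shapes seen in this plan ("is a root stage" and "depends on stages: ..."); the function names are invented for this example.

```python
import re

def parse_stage_dependencies(text):
    """Return {stage: [stages it depends on]} from STAGE DEPENDENCIES lines."""
    deps = {}
    for line in text.splitlines():
        line = line.strip()
        root = re.match(r"(Stage-\d+) is a root stage", line)
        if root:
            deps[root.group(1)] = []
            continue
        dep = re.match(r"(Stage-\d+) depends on stages: (.+)", line)
        if dep:
            deps[dep.group(1)] = [s.strip() for s in dep.group(2).split(",")]
    return deps

def execution_order(deps):
    """Order stages so that each one runs after everything it depends on."""
    order, done = [], set()
    pending = dict(deps)
    while pending:
        ready = sorted(s for s, d in pending.items() if all(x in done for x in d))
        if not ready:
            raise ValueError("cyclic or unresolved stage dependencies")
        for stage in ready:
            order.append(stage)
            done.add(stage)
            del pending[stage]
    return order

PLAN_HEADER = """
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
Stage-2 depends on stages: Stage-0
"""

print(execution_order(parse_stage_dependencies(PLAN_HEADER)))
# ['Stage-1', 'Stage-0', 'Stage-2']
```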

Stage-1 and Stage-0 are explained in turn below. The SQL can be split into two steps: (1) SELECT t1.bar, t1.foo, t2.foo FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar);

(2) insert overwrite TABLE lpx;

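Stage-1 is one complete MapReduce job made of two parts: the Map Operator Tree, which corresponds to the map tasks, and the Reduce Operator Tree, which corresponds to the reduce tasks. The Map Operator Tree contains two parallel branches, t1 and t2, roughly equivalent to SELECT t1.bar, t1.foo FROM t1; and SELECT t2.bar, t2.foo FROM t2; and each branch produces input for the reduce phase through its Reduce Output Operator. The Reduce Operator Tree joins the map outputs on the join condition and, using the predefined output format, writes data in the storage format of default.lpx to HDFS. No storage format was specified when lpx was created, so it defaults to text: the table is read with TextInputFormat and written with a text output format: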

table:

input format: org.apache.hadoop.mapred.TextInputFormat

output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

name: default.lpx

The input format is org.apache.hadoop.mapred.TextInputFormat: the files written for lpx are plain text, so any later read of the table goes through TextInputFormat. These read and write details are handled by Hadoop MapReduce; Hive only has to name the formats to use.

The serde is org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe. Each row it serializes holds _col0, _col1, _col2, that is, the t1.bar, t1.foo, t2.foo we set out to query, written as _col0 + the column delimiter of lpx + _col1 + the column delimiter of lpx + _col2. The output format line, org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, names the class that writes the output.
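As a rough picture of the "_col0 + delimiter + _col1 + delimiter + _col2" layout, the sketch below joins a row's columns into one text line. It assumes lpx uses Hive's default text settings (Ctrl-A, `\x01`, as field delimiter and `\N` for NULL); the function name and the sample values are hypothetical, and the real SerDe does considerably more than this.

```python
FIELD_DELIM = "\x01"  # assumed: Hive's default field delimiter (Ctrl-A) for text tables

def serialize_row(columns, delim=FIELD_DELIM):
    """Join column values into one text line, roughly what a text SerDe writes."""
    return delim.join("\\N" if c is None else str(c) for c in columns)

# Hypothetical row: (_col0, _col1, _col2) = (t1.bar, t1.foo, t2.foo)
line = serialize_row(["val_238", 238, 97])
print(repr(line))  # 'val_238\x01238\x0197'
```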

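Stage-0 is the second step mentioned above. The temporary files produced by Stage-1 (say, in a directory called tmp) have to be brought into table lpx by Stage-0. Move Operator means this is not a MapReduce job: it only needs to call MoveTask, which first checks that the input files match the storage format of table lpx.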

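What Hive execution plans are for

Analyzing how a job executes, optimizing its flow, and improving its efficiency. For example, moving a filter from the reduce side up to the map side reduces the amount of data shuffled between map and reduce and so speeds up the job;

Filtering the data set early to avoid unnecessary reads. For example, Hive performs the join before it applies the WHERE filter, so putting the partition condition into the ON clause can effectively shrink the input data set.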
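The effect of moving a filter to the map side can be pictured with a toy count of the rows that cross the map-to-reduce boundary. This is only an illustration with made-up rows and a made-up helper, not a measurement of Hive.

```python
def shuffled_records(rows, predicate, filter_on_map_side):
    """Count how many rows cross the map -> reduce boundary."""
    if filter_on_map_side:
        return sum(1 for r in rows if predicate(r))
    return len(rows)  # everything is shuffled; the filter runs on the reduce side

# Hypothetical sample rows.
rows = [
    {"dt": "20140101", "key": "deliver_id_bucket_id"},
    {"dt": "20140105", "key": "other"},
    {"dt": "20140109", "key": "deliver_id_bucket_id"},
    {"dt": "20140102", "key": "other"},
]

def wanted(row):
    return row["key"] == "deliver_id_bucket_id"

late = shuffled_records(rows, wanted, filter_on_map_side=False)
early = shuffled_records(rows, wanted, filter_on_map_side=True)
print(late, early)  # 4 2
```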

Using the execution plan to analyze a problematic HQL query

select a.*, b.cust_uid

from ods_ad_bid_deliver_info b join mds_ad_algo_feed_monitor_data_table a

where a.dt<=20140101 and a.dt<=20140108 and key='deliver_id_bucket_id' and a.dt=b.dt and a.key_slice=b.deliver_id

==========================================================================

==========================================================================

Execution plan:

Abstract syntax tree:

ABSTRACT SYNTAX TREE:

  (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF (TOK_TABNAME ods_ad_bid_deliver_info) b) (TOK_TABREF (TOK_TABNAME mds_ad_algo_feed_monitor_data_table) a))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_ALLCOLREF (TOK_TABNAME a))) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) cust_uid))) (TOK_WHERE (and (and (and (and (<= (. (TOK_TABLE_OR_COL a) dt) 20140101) (<= (. (TOK_TABLE_OR_COL a) dt) 20140108)) (= (TOK_TABLE_OR_COL key) 'deliver_id_bucket_id')) (= (. (TOK_TABLE_OR_COL a) dt) (. (TOK_TABLE_OR_COL b) dt))) (= (. (TOK_TABLE_OR_COL a) key_slice) (. (TOK_TABLE_OR_COL b) deliver_id))))))

STAGE DEPENDENCIES:

  Stage-1 is a root stage

  Stage-0 is a root stage

STAGE PLANS:

  Stage: Stage-1

    Map Reduce

      Alias -> Map Operator Tree:

        a

         TableScan

            alias: a

           Filter Operator

              predicate:

                  expr: (key = 'deliver_id_bucket_id')   // filter on the given key value, applied in the map phase

                  type: boolean

             Reduce Output Operator

                sort order:

                tag: 1

                value expressions:   // because of select *, every column is shipped to the reduce side

                     expr: key

                      type: string

                      expr: key_slice

                      type: string

                      expr: billing_mode_slice

                      type: string

                      expr: bucket_id

                      type: string

                      expr: ctr

                      type: string

                      expr: ecpm

                      type: string

                      expr: auc

                      type: string

                      expr: pctr

                      type: string

                      expr: pctr_ctr

                      type: string

                      expr: total_pv

                      type: string

                      expr: total_click

                      type: string

                      expr: dt

                      type: string

        b

         TableScan

            alias: b

           Reduce Output Operator

              sort order:

              tag: 0

              value expressions:

                    expr: deliver_id

                    type: string

                    expr: cust_uid

                    type: string

                    expr: dt

                    type: string

      Reduce Operator Tree:

       Join Operator

          condition map:

               Inner Join 0 to 1

          condition expressions:

            0 {VALUE._col0} {VALUE._col6} {VALUE._col35}

            1 {VALUE._col0} {VALUE._col1} {VALUE._col2} {VALUE._col3} {VALUE._col4} {VALUE._col5} {VALUE._col6} {VALUE._col7} {VALUE._col8} {VALUE._col9} {VALUE._col10} {VALUE._col11}

          handleSkewJoin: false

          outputColumnNames: _col0, _col6, _col35, _col38, _col39, _col40, _col41, _col42, _col43, _col44, _col45, _col46, _col47, _col48, _col49

         Filter Operator

            predicate:

               expr: (((((_col49 <= 20140101) and (_col49 <= 20140108)) and (_col38 = 'deliver_id_bucket_id')) and (_col49 = _col35)) and (_col39 = _col0))

                type: boolean

           Select Operator

              expressions:

                    expr: _col38

                    type: string

                    expr: _col39

                    type: string

                    expr: _col40

                    type: string

                    expr: _col41

                    type: string

                    expr: _col42

                    type: string

                    expr: _col43

                    type: string

                    expr: _col44

                    type: string

                    expr: _col45

                    type: string

                    expr: _col46

                    type: string

                    expr: _col47

                    type: string

                    expr: _col48

                    type: string

                    expr: _col49

                    type: string

                    expr: _col6

                    type: string

              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12

             File Output Operator

                compressed: false

                GlobalTableId: 0

                table:

                    input format: org.apache.hadoop.mapred.TextInputFormat

                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0

    Fetch Operator

      limit: -1

Optimized HQL:

select a.*, b.cust_uid

from ods_ad_bid_deliver_info b

join mds_ad_algo_feed_monitor_data_table a

on(a.dt<=20140101 and a.dt<=20140108 and a.dt=b.dt and a.key_slice=b.deliver_id and a.key='deliver_id_bucket_id')

=================================================================

=================================================================

Execution plan:

STAGE DEPENDENCIES:

  Stage-1 is a root stage

  Stage-0 is a root stage

STAGE PLANS:

  Stage: Stage-1

    Map Reduce

      Alias -> Map Operator Tree:

        a

          TableScan

            alias: a

           Filter Operator

              predicate:

                  expr: (key = 'deliver_id_bucket_id')

                  type: boolean

             Filter Operator

                predicate:

                    expr: (dt <= 20140101)   // the partition filter takes effect on the map side

                    type: boolean

               Filter Operator

                  predicate:

                      expr: (dt <= 20140108)   // the partition filter takes effect on the map side

                      type: boolean

                  Filter Operator

                    predicate:

                        expr: (key = 'deliver_id_bucket_id')

                        type: boolean

                    Reduce Output Operator

                      key expressions:

                            expr: dt

                            type: string

                            expr: key_slice

                            type: string

                      sort order: ++

                      Map-reduce partition columns:

                            expr: dt

                            type: string

                            expr: key_slice

                            type: string

                      tag: 1

                      value expressions:

                            expr: key

                            type: string

                            expr: key_slice

                            type: string

                            expr: billing_mode_slice

                            type: string

                            expr: bucket_id

                            type: string

                            expr: ctr

                            type: string

                            expr: ecpm

                            type: string

                            expr: auc

                            type: string

                            expr: pctr

                            type: string

                            expr: pctr_ctr

                            type: string

                            expr: total_pv

                            type: string

                            expr: total_click

                            type: string

                            expr: dt

                            type: string

        b

          TableScan

            alias: b

            Reduce Output Operator

              key expressions:

                    expr: dt

                    type: string

                    expr: deliver_id

                    type: string

              sort order: ++

              Map-reduce partition columns:

                    expr: dt

                    type: string

                    expr: deliver_id

                    type: string

              tag: 0

              value expressions:

                    expr: cust_uid

                    type: string

      Reduce Operator Tree:

        Join Operator

          condition map:

               Inner Join 0 to 1

          condition expressions:

            0 {VALUE._col6}

            1 {VALUE._col0} {VALUE._col1} {VALUE._col2} {VALUE._col3} {VALUE._col4} {VALUE._col5} {VALUE._col6} {VALUE._col7} {VALUE._col8} {VALUE._col9} {VALUE._col10} {VALUE._col11}

          handleSkewJoin: false

          outputColumnNames: _col6, _col38, _col39, _col40, _col41, _col42, _col43, _col44, _col45, _col46, _col47, _col48, _col49

          Select Operator

            expressions:

                  expr: _col38

                  type: string

                  expr: _col39

                  type: string

                  expr: _col40

                  type: string

                  expr: _col41

                  type: string

                  expr: _col42

                  type: string

                  expr: _col43

                 type: string

                  expr: _col44

                  type: string

                  expr: _col45

                  type: string

                  expr: _col46

                  type: string

                  expr: _col47

                  type: string

                  expr: _col48

                  type: string

                  expr: _col49

                  type: string

                  expr: _col6

                  type: string

            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12

            File Output Operator

              compressed: false

              GlobalTableId: 0

              table:

                  input format: org.apache.hadoop.mapred.TextInputFormat

                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0

    Fetch Operator

      limit: -1
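The difference between the two plans can be sketched as follows. In the first plan the Reduce Output Operators have an empty sort order and no partition columns, and all conditions are checked by a Filter Operator after the join; in the second the rows are keyed on (dt, key_slice) and (dt, deliver_id). The toy code below, with invented helper names and sample tuples, counts how many row pairs each approach has to look at while producing the same result.

```python
from collections import defaultdict

def join_without_keys(a_rows, b_rows, predicate):
    """First plan: no shuffle key, so every a row meets every b row, then a filter."""
    compared, out = 0, []
    for a in a_rows:
        for b in b_rows:
            compared += 1
            if predicate(a, b):
                out.append((a, b))
    return out, compared

def join_with_keys(a_rows, b_rows, a_key, b_key):
    """Second plan: rows are grouped by join key before being paired."""
    groups = defaultdict(list)
    for b in b_rows:
        groups[b_key(b)].append(b)
    compared, out = 0, []
    for a in a_rows:
        for b in groups.get(a_key(a), []):
            compared += 1
            out.append((a, b))
    return out, compared

# Hypothetical rows: a is (dt, key_slice), b is (dt, deliver_id).
a_rows = [("20140101", "d1"), ("20140101", "d2"), ("20140102", "d1")]
b_rows = [("20140101", "d1"), ("20140102", "d1"), ("20140103", "d9")]

slow, slow_pairs = join_without_keys(a_rows, b_rows, lambda a, b: a == b)
fast, fast_pairs = join_with_keys(a_rows, b_rows, lambda a: a, lambda b: b)
print(sorted(slow) == sorted(fast), slow_pairs, fast_pairs)  # True 9 2
```

With three rows on each side the unkeyed version examines all nine pairs, while the keyed version touches only the two that actually match; on real tables the gap grows with the product of the table sizes.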

Example:

select * from emp e

left join dept d on e.deptno=d.deptno

where d.dt='2018-06-04';

Time spent: Time taken: 44.401 seconds, Fetched: 17 row(s)

Execution plan:

STAGE DEPENDENCIES:

  Stage-4 is a root stage

  Stage-3 depends on stages: Stage-4

  Stage-0 depends on stages: Stage-3

STAGE PLANS:

  Stage: Stage-4

    Map Reduce Local Work   // executed locally

      Alias -> Map Local Tables:

        d

          Fetch Operator

            limit: -1

      Alias -> Map Local Operator Tree:

        d

          TableScan

            alias: d

            Statistics: Num rows: 1 Data size: 168 Basic stats: PARTIAL Column stats: PARTIAL

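            HashTable Sink Operator   // builds a hash table from the small table d, keyed on the join key, for the map join. For comparison, a ReduceSinkOperator serializes the map-side columns into the reduce key/value and the partition key; it can appear only in the map phase and marks the end of the map phase in the MapReduce program Hive generates.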

              keys:

                0 deptno (type: string)

                1 deptno (type: string)

  Stage: Stage-3

    Map Reduce

      Map Operator Tree:

          TableScan

            alias: e

            Statistics: Num rows: 1 Data size: 757 Basic stats: PARTIAL Column stats: PARTIAL

            Map Join Operator

              condition map:

                   Left Outer Join0 to 1

              keys:

                0 deptno (type: string)

                1 deptno (type: string)

              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col12, _col13, _col14, _col15

              Statistics: Num rows: 1 Data size: 832 Basic stats: COMPLETE Column stats: NONE

              Filter Operator

                predicate: (_col15 = '2018-06-04') (type: boolean)

                Statistics: Num rows: 1 Data size: 832 Basic stats: COMPLETE Column stats: NONE

                Select Operator

                  expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string), _col12 (type: string), _col13 (type: string), _col14 (type: string), '2018-06-04' (type: string)

                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12

                  Statistics: Num rows: 1 Data size: 832 Basic stats: COMPLETE Column stats: NONE

                  File Output Operator

                    compressed: false

                    Statistics: Num rows: 1 Data size: 832 Basic stats: COMPLETE Column stats: NONE

                    table:

                        input format: org.apache.hadoop.mapred.TextInputFormat

                        output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

      Local Work:

        Map Reduce Local Work

  Stage: Stage-0

    Fetch Operator

      limit: -1

      Processor Tree:

        ListSink
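Stage-4 and Stage-3 together form a map join: the small table is loaded into a hash table locally, and each mapper then probes it while scanning the big table, so no reduce phase is needed. A minimal sketch of that idea, with invented function names and sample rows (the real operators also handle serialization, memory limits and distribution of the hash table):

```python
def build_hash_table(small_rows, key):
    """Stage-4 (local work): load the small table into an in-memory hash table."""
    table = {}
    for row in small_rows:
        table.setdefault(row[key], []).append(row)
    return table

def map_join_left_outer(big_rows, hash_table, key):
    """Stage-3: each mapper probes the hash table; there is no shuffle or reduce."""
    for row in big_rows:
        matches = hash_table.get(row[key])
        if matches:
            for match in matches:
                yield row, match
        else:
            yield row, None  # a left outer join keeps unmatched left rows

# Hypothetical sample rows standing in for dept (small) and emp (big).
dept = [{"deptno": "10", "dname": "ACCOUNTING"}, {"deptno": "20", "dname": "RESEARCH"}]
emp = [
    {"ename": "KING", "deptno": "10"},
    {"ename": "SMITH", "deptno": "20"},
    {"ename": "NOBODY", "deptno": "99"},
]

joined = list(map_join_left_outer(emp, build_hash_table(dept, "deptno"), "deptno"))
for e, d in joined:
    print(e["ename"], d["dname"] if d else None)
```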

With the date condition moved into the ON clause:

select * from emp e

left join dept d on (e.deptno=d.deptno and  d.dt='2018-06-04');

Time spent: Time taken: 23.804 seconds, Fetched: 17 row(s)

STAGE DEPENDENCIES:

  Stage-4 is a root stage

  Stage-3 depends on stages: Stage-4

  Stage-0 depends on stages: Stage-3

STAGE PLANS:

  Stage: Stage-4

    Map Reduce Local Work

      Alias -> Map Local Tables:

        d

          Fetch Operator

            limit: -1

      Alias -> Map Local Operator Tree:

        d

          TableScan

            alias: d

            filterExpr: (dt = '2018-06-04') (type: boolean)

            Statistics: Num rows: 1 Data size: 84 Basic stats: PARTIAL Column stats: PARTIAL

            HashTable Sink Operator

              keys:

                0 deptno (type: string)

                1 deptno (type: string)

  Stage: Stage-3

    Map Reduce

      Map Operator Tree:

          TableScan

            alias: e

            Statistics: Num rows: 1 Data size: 757 Basic stats: PARTIAL Column stats: PARTIAL

            Map Join Operator

              condition map:

                   Left Outer Join0 to 1

              keys:

                0 deptno (type: string)

                1 deptno (type: string)

              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col12, _col13, _col14, _col15

              Statistics: Num rows: 1 Data size: 832 Basic stats: COMPLETE Column stats: NONE

              Select Operator

                expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string), _col12 (type: string), _col13 (type: string), _col14 (type: string), _col15 (type: string)

                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12

                Statistics: Num rows: 1 Data size: 832 Basic stats: COMPLETE Column stats: NONE

                File Output Operator

                  compressed: false

                  Statistics: Num rows: 1 Data size: 832 Basic stats: COMPLETE Column stats: NONE

                  table:

                      input format: org.apache.hadoop.mapred.TextInputFormat

                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

      Local Work:

        Map Reduce Local Work

  Stage: Stage-0

    Fetch Operator

      limit: -1

      Processor Tree:

        ListSink
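Both queries above returned 17 rows on this data, but for a left join the two forms are not equivalent in general. A condition on the right table in WHERE is applied after the join and drops left rows whose match fails it; the same condition in ON only restricts which right rows can match, and unmatched left rows are kept with NULLs. The sketch below shows this on made-up rows with a made-up `left_join` helper, so it is worth checking the semantics before moving a condition purely for speed.

```python
def left_join(left, right, on):
    """Plain left outer join: unmatched left rows are paired with None."""
    out = []
    for l in left:
        matches = [r for r in right if on(l, r)]
        if matches:
            out.extend((l, r) for r in matches)
        else:
            out.append((l, None))
    return out

# Hypothetical sample rows.
emp = [{"ename": "KING", "deptno": "10"}, {"ename": "SMITH", "deptno": "20"}]
dept = [{"deptno": "10", "dt": "2018-06-04"}, {"deptno": "20", "dt": "2018-06-05"}]

# Filter in WHERE: join first, then drop rows whose d.dt does not match.
where_version = [
    (e, d)
    for e, d in left_join(emp, dept, lambda e, d: e["deptno"] == d["deptno"])
    if d is not None and d["dt"] == "2018-06-04"
]

# Filter in ON: the date test is part of the match condition.
on_version = left_join(
    emp, dept, lambda e, d: e["deptno"] == d["deptno"] and d["dt"] == "2018-06-04"
)

print(len(where_version), len(on_version))  # 1 2
```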

    Original author: DobeWang
    Original URL: https://www.jianshu.com/p/a2d0c5beee69