[Repost] Hive Usage Tips

Automating dynamic table partitioning, and renaming Hive table columns

1. Automating dynamic table partitioning

set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table ods.fund2hundsunlg PARTITION(day)
select distinct fromHostIp, hundsunNodeIp, concat(substring(requestTime,0,10),' ',substring(requestTime,12,8)), httpStatus, responseTimes, urlpath, responseCharts, postBody,
concat(substring(requestTime,0,4),substring(requestTime,6,2), substring(requestTime,9,2)) as day
from ods.fund2hundsunlog ;
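After the insert completes, the dynamically created partitions can be listed to verify the load:

```sql
-- List the partitions Hive created from the computed day column.
show partitions ods.fund2hundsunlg;
```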

Notes:

1) set hive.exec.dynamic.partition.mode=nonstrict; allows partitions to be loaded dynamically (on older Hive versions you may also need the master switch set hive.exec.dynamic.partition=true;)

2) concat(substring(requestTime,0,4),substring(requestTime,6,2), substring(requestTime,9,2)) as day derives the partition value by slicing the existing timestamp; the dynamic-partition column must come last in the SELECT list

2. Quickly renaming Hive table columns

1) Drop and recreate the table (it is EXTERNAL, so dropping it leaves the HDFS data in place)

drop table ods.dratio;
create EXTERNAL table ods.dratio (
dratioId string comment "user ID: user ID 860010-2370010130, registered user ID 860010-2370010131",
cookieId string comment "mcookie",
sex string comment "sex: 1 male, 2 female",
age string comment "age: 1 0-18, 2 19-29, 3 30-39, 4 40 and above",
ppt string comment "ppt: 1 high purchasing power, 2 medium purchasing power, 3 low purchasing power",
degree string comment "degree: 1 below bachelor, 2 bachelor and above",
favor string comment "preference info (variable length)",
commercial string comment "commercial value info (variable length)"
)
comment "user behavior analysis"
partitioned by(day string comment "daily partition column")
STORED AS TEXTFILE
location '/dw/ods/dratio';
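If the goal is only to rename one column, Hive's built-in ALTER TABLE ... CHANGE also does it in place, without dropping and recreating the table (a sketch; the new name userId is illustrative):

```sql
-- Rename dratioId to userId; the column type must be restated.
alter table ods.dratio change dratioId userId string comment "renamed from dratioId";
```

The drop-and-recreate approach above is still handy when many columns change at once; because the table is EXTERNAL, dropping it does not delete the underlying HDFS files.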

2) Re-attach the partitions; no data movement is required
alter table ods.dratio add partition(day='20150507') location '/dw/ods/dratio/day=20150507';
alter table ods.dratio add partition(day='20150508') location '/dw/ods/dratio/day=20150508';
alter table ods.dratio add partition(day='20150509') location '/dw/ods/dratio/day=20150509';
alter table ods.dratio add partition(day='20150510') location '/dw/ods/dratio/day=20150510';
alter table ods.dratio add partition(day='20150511') location '/dw/ods/dratio/day=20150511';
alter table ods.dratio add partition(day='20150512') location '/dw/ods/dratio/day=20150512';
alter table ods.dratio add partition(day='20150513') location '/dw/ods/dratio/day=20150513';
alter table ods.dratio add partition(day='20150514') location '/dw/ods/dratio/day=20150514';
alter table ods.dratio add partition(day='20150515') location '/dw/ods/dratio/day=20150515';
alter table ods.dratio add partition(day='20150516') location '/dw/ods/dratio/day=20150516';
alter table ods.dratio add partition(day='20150517') location '/dw/ods/dratio/day=20150517';
alter table ods.dratio add partition(day='20150518') location '/dw/ods/dratio/day=20150518';
alter table ods.dratio add partition(day='20150519') location '/dw/ods/dratio/day=20150519';
alter table ods.dratio add partition(day='20150520') location '/dw/ods/dratio/day=20150520';
alter table ods.dratio add partition(day='20150521') location '/dw/ods/dratio/day=20150521';
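Because the partition directories already follow the standard day=YYYYMMDD layout under the table location, the repetitive ALTER statements above can usually be replaced by a single metastore repair command:

```sql
-- Discover partition directories on HDFS and register the missing ones.
msck repair table ods.dratio;
```

MSCK REPAIR only finds directories named in the partition=value convention; for partitions stored at non-standard locations, the explicit ADD PARTITION ... LOCATION form above is still required.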

Sharing intermediate result sets

Many Hive jobs consume intermediate result sets that are closely related: multiple jobs share the same input or output.
1. SQL before optimization


SELECT
    COUNT(*) pv
FROM
    (
        SELECT
            cookieid,
            userid,
            to_date(DATETIME) day1
        FROM
            ods.tracklog_5min
        WHERE
            DAY>='20151001'
        AND DAY<='20151031'
        AND lower(requesturl) IN ('http://chat.hexun.com/',
                                  'http://zhibo.hexun.com/'))t1 

INNER JOIN
    (
        SELECT
            cookieid,
            to_date(DATETIME) day2
        FROM
            ods.tracklog_5min
        WHERE
            DAY>='20151001'
        AND DAY<='20151031'
        AND ((
                    lower(requesturl) LIKE 'http://zhibo.hexun.com/%'
                OR  lower(requesturl) LIKE 'http://chat.hexun.com/%')
            AND requesturl LIKE '%/default.html%'))t2
ON
    t1.cookieid=t2.cookieid
AND t1.day1=t2.day2
INNER JOIN
    (
        SELECT
            cookieid,
            to_date(DATETIME) day3
        FROM
            ods.tracklog_5min
        WHERE
            DAY>='20151001'
        AND DAY<='20151031'
        AND ( (
                    lower(requesturl) LIKE 'http://px.hexun.com/%'
                AND lower(requesturl) LIKE '%/default.html%' )
            OR  (
                    lower(requesturl) LIKE 'http://px.hexun.com/pack/%'
                AND lower(requesturl) LIKE '%.html%' )
            OR  (
                    lower(requesturl) LIKE 'http://px.hexun.com/p/%'
                AND lower(requesturl) LIKE '%.html%' ) ))t3
ON
    t1.cookieid=t3.cookieid
AND t1.day1=t3.day3
LEFT JOIN
    stage.saleplatform_productvisitdetail_temp t4
ON
    t1.userid=t4.userid
WHERE
    t4.createtime>t1.day1
OR  t4.userid IS NULL;

As you can see, the SQL above queries the same source table three times, wasting cluster resources; the shared source can be computed once and reused.
2. SQL after optimization

Extract the common data:


create table default.tracklog_10month as   
select * from  ods.tracklog_5min
WHERE  DAY>='20151001' AND DAY<='20151031';
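One caveat: this staging table lands in the default database and has to be dropped manually when the job finishes. On Hive 0.14 and later, a TEMPORARY table (session-scoped) avoids the cleanup; a sketch under that assumption:

```sql
-- Dropped automatically when the Hive session ends (Hive 0.14+).
create temporary table tracklog_10month as
select * from ods.tracklog_5min
where DAY >= '20151001' and DAY <= '20151031';
```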

Use the staging table to replace the common subqueries in the original SQL:


SELECT
    COUNT(*) pv
FROM
    (
        SELECT
            cookieid,
            userid,
            to_date(DATETIME) day1
        FROM
            default.tracklog_10month 
        WHERE
             lower(requesturl) IN ('http://chat.hexun.com/',
                                  'http://zhibo.hexun.com/'))t1 

INNER JOIN
    (
       SELECT
            cookieid,
            to_date(DATETIME) day2
        FROM
            default.tracklog_10month 
        WHERE  (lower(requesturl) LIKE 'http://zhibo.hexun.com/%'
                OR  lower(requesturl) LIKE 'http://chat.hexun.com/%')
            AND requesturl LIKE '%/default.html%')t2
ON
    t1.cookieid=t2.cookieid
AND t1.day1=t2.day2
INNER JOIN
    (
        SELECT
            cookieid,
            to_date(DATETIME) day3
        FROM
            default.tracklog_10month
        WHERE        
        ( (
                    lower(requesturl) LIKE 'http://px.hexun.com/%'
                AND lower(requesturl) LIKE '%/default.html%' )
            OR  (
                    lower(requesturl) LIKE 'http://px.hexun.com/pack/%'
                AND lower(requesturl) LIKE '%.html%' )
            OR  (
                    lower(requesturl) LIKE 'http://px.hexun.com/p/%'
                AND lower(requesturl) LIKE '%.html%' ) ))t3
ON
    t1.cookieid=t3.cookieid
AND t1.day1=t3.day3
LEFT JOIN
    stage.saleplatform_productvisitdetail_temp t4
ON
    t1.userid=t4.userid
WHERE
    t4.createtime>t1.day1
OR  t4.userid IS NULL;

3. Why sharing intermediate results helps
In essence it reduces IO: the heavy disk reads and writes and the network IO of the MapReduce stages.

Using group by for distinct-count statistics

Common website metrics such as PV, UV, unique IPs, and logged-in users all involve deduplication. Over a full year, PV exceeds 10 billion records, so even a simple distinct count becomes very difficult.
1. The original distinct-count SQL:


select substr(day,1,4) year,count(*) PV,count(distinct cookieid) UV,count(distinct ip) IP,count(distinct userid) LOGIN 
from dms.tracklog_5min a  
where substr(day,1,4)='2015'
group by substr(day,1,4);

Three of the four metrics involve deduplication; the job ran for several hours without producing a result.
2. Deduplicating with group by


select "2015","PV",count(*) from dms.tracklog_5min
where day>='2015' and day<'2016'
union all 
select "2015","UV",count(*) from (
select  cookieid from dms.tracklog_5min
where day>='2015' and day<'2016'  group by cookieid ) a 
union all 
select "2015","IP",count(*) from (
select  ip from dms.tracklog_5min
where day>='2015' and day<'2016'  group by ip ) a 
union all 
select "2015","LOGIN",count(*) from (
select  userid from dms.tracklog_5min
where day>='2015' and day<'2016' group by userid) b;

Each metric (PV, UV, IP, LOGIN) is computed separately and the results are stitched together with union all; the job finished in under an hour.
3. Parameter tuning


SET mapred.reduce.tasks=50;
SET mapreduce.reduce.memory.mb=6000;
SET mapreduce.reduce.shuffle.memory.limit.percent=0.06;

When data skew is involved, it is mostly skew on the reduce side. It can be mitigated by setting the reducer parallelism, the reducer memory size (in MB), and the fraction of reducer memory at which shuffle data is spilled to disk, as shown above.

Using MapJoin to handle data skew

In a Hive MapJoin, the join is completed in the Map stage; if all the data needed is accessible during the map phase, no Reduce stage is required.
When a small table is joined with a very large table, data skew occurs easily. MapJoin loads the small table entirely into memory and performs the join on the map side, avoiding reducer processing.


select c.channel_name,count(t.requesturl) PV
 from ods.cms_channel c
 join
 (select host,requesturl from  dms.tracklog_5min where day='20151111' ) t
 on c.channel_name=t.host
 group by c.channel_name
 order by c.channel_name;

The above is a small-table-join-large-table operation, so MapJoin can process the small table in memory. The syntax is simple: just add the hint /*+ MAPJOIN(c) */ naming the table to be distributed into memory.


select /*+ MAPJOIN(c) */
c.channel_name,count(t.requesturl) PV
 from ods.cms_channel c
 join
 (select host,requesturl from  dms.tracklog_5min where day='20151111' ) t
 on c.channel_name=t.host
 group by c.channel_name
 order by c.channel_name;

This technique is frequently used when data skew appears.
Parameter notes:
1) Automatically convert to MapJoin for small tables:

set hive.auto.convert.join = true; -- default is false on older Hive versions

When this parameter is true, Hive automatically estimates the size of the left-hand table and, if it is small, loads it into memory, i.e. uses a MapJoin for the small table.

2) Threshold between small and large tables:

set hive.mapjoin.smalltable.filesize;  

The default is hive.mapjoin.smalltable.filesize=25000000, i.e. about 25 MB: tables below this size are treated as small tables.

3) How much memory a MapJoin followed by group by may use to hold its data locally; if the data is too large, it is not kept in memory:

set hive.mapjoin.followby.gby.localtask.max.memory.usage;  

Default: 0.55

4) Percentage of memory the local task may use:

set hive.mapjoin.localtask.max.memory.usage;  

Default: 0.90

    Original author: xiaodf
    Original article: https://www.jianshu.com/p/996883b2af1a
    This article is reposted from the web to share knowledge; if it infringes your rights, please contact the blogger for removal.