Automating dynamic table partitioning and renaming Hive table columns
1. Automated dynamic partition loading
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table ods.fund2hundsunlg PARTITION(day)
select distinct fromHostIp, hundsunNodeIp, concat(substring(requestTime,0,10),' ', substring(requestTime,12,8)), httpStatus, responseTimes, urlpath, responseCharts, postBody,
concat(substring(requestTime,0,4),substring(requestTime,6,2), substring(requestTime,9,2)) as day
from ods.fund2hundsunlog ;
Notes:
1) set hive.exec.dynamic.partition.mode=nonstrict; allows all partition columns to be resolved dynamically at insert time.
2) concat(substring(requestTime,0,4),substring(requestTime,6,2), substring(requestTime,9,2)) as day derives the partition value by slicing the existing timestamp column.
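Depending on the Hive version, dynamic partitioning itself may also need to be switched on explicitly, and the per-job partition caps raised when an insert creates many partitions. A hedged sketch of a typical session setup (the cap values are illustrative, not from the original job):

```sql
-- Enable dynamic partitioning (on by default in recent Hive versions)
set hive.exec.dynamic.partition=true;
-- Allow every partition column to be resolved dynamically
set hive.exec.dynamic.partition.mode=nonstrict;
-- Illustrative caps; raise them if the insert writes many partitions
set hive.exec.max.dynamic.partitions=1000;
set hive.exec.max.dynamic.partitions.pernode=100;
```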
2. Quickly renaming Hive table columns
1) Recreate the table
drop table ods.dratio;
create EXTERNAL table ods.dratio (
dratioId string comment 'user ID: 860010-2370010130; registered user ID: 860010-2370010131',
cookieId string comment 'mcookie',
sex string comment 'sex: 1 male, 2 female',
age string comment 'age: 1 0-18, 2 19-29, 3 30-39, 4 40 and above',
ppt string comment 'purchasing power: 1 high, 2 medium, 3 low',
degree string comment 'degree: 1 below bachelor, 2 bachelor and above',
favor string comment 'preference info (variable length)',
commercial string comment 'commercial value info (variable length)'
)
comment 'user behavior analysis'
partitioned by(day string comment 'daily partition column')
STORED AS TEXTFILE
location '/dw/ods/dratio';
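As an aside, when only a column name changes (type and position staying the same), the drop-and-recreate above can sometimes be avoided: Hive supports a metadata-only rename. A hedged sketch (the new name userId is hypothetical); since TEXTFILE columns are resolved by position rather than by name, existing data remains readable:

```sql
-- Hypothetical rename of dratioId to userId; metadata-only, no data rewrite
ALTER TABLE ods.dratio CHANGE COLUMN dratioId userId string;
```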
2) Re-attach the partitions to the new table; no data movement is required
alter table ods.dratio add partition(day='20150507') location '/dw/ods/dratio/day=20150507';
alter table ods.dratio add partition(day='20150508') location '/dw/ods/dratio/day=20150508';
alter table ods.dratio add partition(day='20150509') location '/dw/ods/dratio/day=20150509';
alter table ods.dratio add partition(day='20150510') location '/dw/ods/dratio/day=20150510';
alter table ods.dratio add partition(day='20150511') location '/dw/ods/dratio/day=20150511';
alter table ods.dratio add partition(day='20150512') location '/dw/ods/dratio/day=20150512';
alter table ods.dratio add partition(day='20150513') location '/dw/ods/dratio/day=20150513';
alter table ods.dratio add partition(day='20150514') location '/dw/ods/dratio/day=20150514';
alter table ods.dratio add partition(day='20150515') location '/dw/ods/dratio/day=20150515';
alter table ods.dratio add partition(day='20150516') location '/dw/ods/dratio/day=20150516';
alter table ods.dratio add partition(day='20150517') location '/dw/ods/dratio/day=20150517';
alter table ods.dratio add partition(day='20150518') location '/dw/ods/dratio/day=20150518';
alter table ods.dratio add partition(day='20150519') location '/dw/ods/dratio/day=20150519';
alter table ods.dratio add partition(day='20150520') location '/dw/ods/dratio/day=20150520';
alter table ods.dratio add partition(day='20150521') location '/dw/ods/dratio/day=20150521';
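Since the partition directories already follow the standard day=YYYYMMDD naming convention, on reasonably recent Hive versions the long list of ALTER TABLE ... ADD PARTITION statements can usually be replaced by a single repair command; a hedged sketch:

```sql
-- Scan the table location and register any day=... directories as partitions
MSCK REPAIR TABLE ods.dratio;
```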
Sharing intermediate result sets
Many Hive jobs are "related": their intermediate result sets overlap, with multiple jobs consuming a common input or output.
1. SQL before optimization
SELECT
COUNT(*) pv
FROM
(
SELECT
cookieid,
userid,
to_date(DATETIME) day1
FROM
ods.tracklog_5min
WHERE
DAY>='20151001'
AND DAY<='20151031'
AND lower(requesturl) IN ('http://chat.hexun.com/',
'http://zhibo.hexun.com/'))t1
INNER JOIN
(
SELECT
cookieid,
to_date(DATETIME) day2
FROM
ods.tracklog_5min
WHERE
DAY>='20151001'
AND DAY<='20151031'
AND ((
lower(requesturl) LIKE 'http://zhibo.hexun.com/%'
OR lower(requesturl) LIKE 'http://chat.hexun.com/%')
AND requesturl LIKE '%/default.html%'))t2
ON
t1.cookieid=t2.cookieid
AND t1.day1=t2.day2
INNER JOIN
(
SELECT
cookieid,
to_date(DATETIME) day3
FROM
ods.tracklog_5min
WHERE
DAY>='20151001'
AND DAY<='20151031'
AND ( (
lower(requesturl) LIKE 'http://px.hexun.com/%'
AND lower(requesturl) LIKE '%/default.html%' )
OR (
lower(requesturl) LIKE 'http://px.hexun.com/pack/%'
AND lower(requesturl) LIKE '%.html%' )
OR (
lower(requesturl) LIKE 'http://px.hexun.com/p/%'
AND lower(requesturl) LIKE '%.html%' ) ))t3
ON
t1.cookieid=t3.cookieid
AND t1.day1=t3.day3
LEFT JOIN
stage.saleplatform_productvisitdetail_temp t4
ON
t1.userid=t4.userid
WHERE
t4.createtime>t1.day1
OR t4.userid IS NULL;
As you can see, the SQL above queries the same source table three times, wasting cluster resources; the shared source data can perfectly well be reused.
2. SQL after optimization
Extract the common data:
create table default.tracklog_10month as
select * from ods.tracklog_5min
WHERE DAY>='20151001' AND DAY<='20151031';
Then use the temporary table to replace the common part of the original SQL:
SELECT
COUNT(*) pv
FROM
(
SELECT
cookieid,
userid,
to_date(DATETIME) day1
FROM
default.tracklog_10month
WHERE
lower(requesturl) IN ('http://chat.hexun.com/',
'http://zhibo.hexun.com/'))t1
INNER JOIN
(
SELECT
cookieid,
to_date(DATETIME) day2
FROM
default.tracklog_10month
WHERE (lower(requesturl) LIKE 'http://zhibo.hexun.com/%'
OR lower(requesturl) LIKE 'http://chat.hexun.com/%')
AND requesturl LIKE '%/default.html%')t2
ON
t1.cookieid=t2.cookieid
AND t1.day1=t2.day2
INNER JOIN
(
SELECT
cookieid,
to_date(DATETIME) day3
FROM
default.tracklog_10month
WHERE
( (
lower(requesturl) LIKE 'http://px.hexun.com/%'
AND lower(requesturl) LIKE '%/default.html%' )
OR (
lower(requesturl) LIKE 'http://px.hexun.com/pack/%'
AND lower(requesturl) LIKE '%.html%' )
OR (
lower(requesturl) LIKE 'http://px.hexun.com/p/%'
AND lower(requesturl) LIKE '%.html%' ) ))t3
ON
t1.cookieid=t3.cookieid
AND t1.day1=t3.day3
LEFT JOIN
stage.saleplatform_productvisitdetail_temp t4
ON
t1.userid=t4.userid
WHERE
t4.createtime>t1.day1
OR t4.userid IS NULL;
3. Why share intermediate result sets
The essence is cutting IO: it reduces the heavy disk reads/writes and network IO pressure of the MapReduce stages.
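On Hive 0.13+ the same sharing can also be written with a WITH clause (CTE) instead of a physical temporary table; be aware that Hive may simply inline the CTE into each reference, so whether it is actually materialized once depends on version and configuration (e.g. hive.optimize.cte.materialize.threshold in Hive 2.x). A simplified, illustrative sketch of the pattern, not the full original query:

```sql
-- Shared scan expressed as a CTE; the two subqueries both read it
WITH tracklog_oct AS (
  SELECT cookieid, userid, requesturl
  FROM ods.tracklog_5min
  WHERE day >= '20151001' AND day <= '20151031'
)
SELECT COUNT(*) pv
FROM (SELECT cookieid FROM tracklog_oct
      WHERE lower(requesturl) IN ('http://chat.hexun.com/',
                                  'http://zhibo.hexun.com/')) t1
JOIN (SELECT cookieid FROM tracklog_oct
      WHERE requesturl LIKE '%/default.html%') t2
ON t1.cookieid = t2.cookieid;
```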
Using GROUP BY for deduplicated counts
The usual site-traffic metrics (PV, UV, unique IPs, logged-in users, and so on) all involve deduplication. For a full-year report PV exceeds 10 billion rows, so even a simple distinct count becomes very difficult.
1. The original deduplicating SQL:
select substr(day,1,4) year,count(*) PV,count(distinct cookieid) UV,count(distinct ip) IP,count(distinct userid) LOGIN
from dms.tracklog_5min a
where substr(day,1,4)='2015'
group by substr(day,1,4);
Three of the four metrics involve deduplication, and the job ran for hours without producing a result: with a single group per year, each count(distinct ...) is forced through one reducer.
2. Deduplicating with GROUP BY
select "2015","PV",count(*) from dms.tracklog_5min
where day>='2015' and day<'2016'
union all
select "2015","UV",count(*) from (
select cookieid from dms.tracklog_5min
where day>='2015' and day<'2016' group by cookieid ) a
union all
select "2015","IP",count(*) from (
select ip from dms.tracklog_5min
where day>='2015' and day<'2016' group by ip ) a
union all
select "2015","LOGIN",count(*) from (
select userid from dms.tracklog_5min
where day>='2015' and day<'2016' group by userid) b;
Computing PV, UV, IP, and LOGIN separately and stitching the results together with UNION ALL, the job produced results in under an hour.
3. Parameter tuning
SET mapred.reduce.tasks=50;
SET mapreduce.reduce.memory.mb=6000;
SET mapreduce.reduce.shuffle.memory.limit.percent=0.06;
When data skew is involved, it is mostly skew on the reduce side. It can be addressed, as above, by setting the number of reducers, the reducer memory size (in MB), and the fraction of shuffle memory at which reducers spill to disk.
Using MapJoin to fight data skew
With Hive's MapJoin, the join is completed in the map phase; if all the data the join needs is reachable during the map, no reduce phase is required.
When a small table is joined to a very large one, data skew occurs easily. MapJoin loads the small table entirely into memory and performs the join on the map side, so no reducer has to process it.
select c.channel_name,count(t.requesturl) PV
from ods.cms_channel c
join
(select host,requesturl from dms.tracklog_5min where day='20151111' ) t
on c.channel_name=t.host
group by c.channel_name
order by c.channel_name;
The above is a small-table-join-large-table operation, so MapJoin can be used to process the small table in memory. The syntax is simply to add the hint /*+ MAPJOIN(c) */, naming the table (here by its alias c) that should be distributed into memory:
select /*+ MAPJOIN(c) */
c.channel_name,count(t.requesturl) PV
from ods.cms_channel c
join
(select host,requesturl from dms.tracklog_5min where day='20151111' ) t
on c.channel_name=t.host
group by c.channel_name
order by c.channel_name;
This technique is frequently used when data skew appears.
Parameter notes:
1) Automatically convert to MapJoin when one side is small:
set hive.auto.convert.join = true; # default is false
When this is true, Hive estimates the sizes of the input tables and, if one is small enough, loads it into memory, i.e. applies MapJoin to the small table.
2) Threshold between small and large tables:
set hive.mapjoin.smalltable.filesize;
hive.mapjoin.smalltable.filesize=25000000
The default is 25 MB.
3) How much memory the local task may use to hold data when a MapJoin is followed by GROUP BY; if the data is too large, it is not kept in memory:
set hive.mapjoin.followby.gby.localtask.max.memory.usage;
Default: 0.55
4) Fraction of memory the local task may use:
set hive.mapjoin.localtask.max.memory.usage;
Default: 0.90
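Putting the four parameters together, a typical session setup might look like the sketch below; the values shown are the defaults discussed above, and appropriate settings vary by Hive version and workload:

```sql
set hive.auto.convert.join = true;                 -- auto MapJoin for small tables
set hive.mapjoin.smalltable.filesize = 25000000;   -- ~25 MB small-table threshold
set hive.mapjoin.followby.gby.localtask.max.memory.usage = 0.55;  -- MapJoin + GROUP BY
set hive.mapjoin.localtask.max.memory.usage = 0.90;               -- local task memory cap
```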