Automating dynamic table partitioning and renaming Hive table columns
1. Automated dynamic partition loading
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table ods.fund2hundsunlg PARTITION(day)
select distinct fromHostIp, hundsunNodeIp, concat(substring(requestTime,0,10),' ', substring(requestTime,12,8)), httpStatus, responseTimes, urlpath, responseCharts, postBody,
concat(substring(requestTime,0,4),substring(requestTime,6,2), substring(requestTime,9,2)) as day
from ods.fund2hundsunlog ;
Notes:
1) set hive.exec.dynamic.partition.mode=nonstrict; allows all partition columns to be resolved dynamically at insert time.
2) concat(substring(requestTime,0,4),substring(requestTime,6,2), substring(requestTime,9,2)) as day derives the partition value by slicing the existing timestamp column.
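Depending on the Hive version, dynamic partitioning itself may also need to be switched on explicitly, and the per-job partition caps raised when an insert creates many partitions. A hedged sketch of a typical session setup (the cap values are illustrative, not from the original job):

```sql
-- Enable dynamic partitioning (on by default in recent Hive versions)
set hive.exec.dynamic.partition=true;
-- Allow every partition column to be resolved dynamically
set hive.exec.dynamic.partition.mode=nonstrict;
-- Illustrative caps; raise them if the insert writes many partitions
set hive.exec.max.dynamic.partitions=1000;
set hive.exec.max.dynamic.partitions.pernode=100;
```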
2. Quickly renaming Hive table columns
1) Recreate the table
drop table ods.dratio;
create EXTERNAL table ods.dratio (
dratioId string comment 'user ID: 860010-2370010130; registered user ID: 860010-2370010131',
cookieId string comment 'mcookie',
sex string comment 'sex: 1 male, 2 female',
age string comment 'age: 1 0-18, 2 19-29, 3 30-39, 4 40 and above',
ppt string comment 'purchasing power: 1 high, 2 medium, 3 low',
degree string comment 'degree: 1 below bachelor, 2 bachelor and above',
favor string comment 'preference info (variable length)',
commercial string comment 'commercial value info (variable length)'
)
comment 'user behavior analysis'
partitioned by(day string comment 'daily partition column')
STORED AS TEXTFILE
location '/dw/ods/dratio';
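As an aside, when only a column name changes (type and position staying the same), the drop-and-recreate above can sometimes be avoided: Hive supports a metadata-only rename. A hedged sketch (the new name userId is hypothetical); since TEXTFILE columns are resolved by position rather than by name, existing data remains readable:

```sql
-- Hypothetical rename of dratioId to userId; metadata-only, no data rewrite
ALTER TABLE ods.dratio CHANGE COLUMN dratioId userId string;
```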
2) Re-attach the partitions to the new table; no data movement is required
alter table ods.dratio add partition(day='20150507') location '/dw/ods/dratio/day=20150507';
alter table ods.dratio add partition(day='20150508') location '/dw/ods/dratio/day=20150508';
alter table ods.dratio add partition(day='20150509') location '/dw/ods/dratio/day=20150509';
alter table ods.dratio add partition(day='20150510') location '/dw/ods/dratio/day=20150510';
alter table ods.dratio add partition(day='20150511') location '/dw/ods/dratio/day=20150511';
alter table ods.dratio add partition(day='20150512') location '/dw/ods/dratio/day=20150512';
alter table ods.dratio add partition(day='20150513') location '/dw/ods/dratio/day=20150513';
alter table ods.dratio add partition(day='20150514') location '/dw/ods/dratio/day=20150514';
alter table ods.dratio add partition(day='20150515') location '/dw/ods/dratio/day=20150515';
alter table ods.dratio add partition(day='20150516') location '/dw/ods/dratio/day=20150516';
alter table ods.dratio add partition(day='20150517') location '/dw/ods/dratio/day=20150517';
alter table ods.dratio add partition(day='20150518') location '/dw/ods/dratio/day=20150518';
alter table ods.dratio add partition(day='20150519') location '/dw/ods/dratio/day=20150519';
alter table ods.dratio add partition(day='20150520') location '/dw/ods/dratio/day=20150520';
alter table ods.dratio add partition(day='20150521') location '/dw/ods/dratio/day=20150521';
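Since the partition directories already follow the standard day=YYYYMMDD naming convention, on reasonably recent Hive versions the long list of ALTER TABLE ... ADD PARTITION statements can usually be replaced by a single repair command; a hedged sketch:

```sql
-- Scan the table location and register any day=... directories as partitions
MSCK REPAIR TABLE ods.dratio;
```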
Sharing intermediate result sets
Many Hive jobs are "related": their intermediate result sets overlap, with multiple jobs consuming a common input or output.
1. SQL before optimization
SELECT
COUNT(*) pv
FROM
(
SELECT
cookieid,
userid,
to_date(DATETIME) day1
FROM
ods.tracklog_5min
WHERE
DAY>='20151001'
AND DAY<='20151031'
AND lower(requesturl) IN ('http://chat.hexun.com/',
'http://zhibo.hexun.com/'))t1
INNER JOIN
(
SELECT
cookieid,
to_date(DATETIME) day2
FROM
ods.tracklog_5min
WHERE
DAY>='20151001'
AND DAY<='20151031'
AND ((
lower(requesturl) LIKE 'http://zhibo.hexun.com/%'
OR lower(requesturl) LIKE 'http://chat.hexun.com/%')
AND requesturl LIKE '%/default.html%'))t2
ON
t1.cookieid=t2.cookieid
AND t1.day1=t2.day2
INNER JOIN
(
SELECT
cookieid,
to_date(DATETIME) day3
FROM
ods.tracklog_5min
WHERE
DAY>='20151001'
AND DAY<='20151031'
AND ( (
lower(requesturl) LIKE 'http://px.hexun.com/%'
AND lower(requesturl) LIKE '%/default.html%' )
OR (
lower(requesturl) LIKE 'http://px.hexun.com/pack/%'
AND lower(requesturl) LIKE '%.html%' )
OR (
lower(requesturl) LIKE 'http://px.hexun.com/p/%'
AND lower(requesturl) LIKE '%.html%' ) ))t3
ON
t1.cookieid=t3.cookieid
AND t1.day1=t3.day3
LEFT JOIN
stage.saleplatform_productvisitdetail_temp t4
ON
t1.userid=t4.userid
WHERE
t4.createtime>t1.day1
OR t4.userid IS NULL;
As you can see, the SQL above queries the same source table three times, wasting cluster resources; the shared source data can perfectly well be reused.
2. SQL after optimization
Extract the common data:
create table default.tracklog_10month as
select * from ods.tracklog_5min
WHERE DAY>='20151001' AND DAY<='20151031';
Then use the temporary table to replace the common part of the original SQL:
SELECT
COUNT(*) pv
FROM
(
SELECT
cookieid,
userid,
to_date(DATETIME) day1
FROM
default.tracklog_10month
WHERE
lower(requesturl) IN ('http://chat.hexun.com/',
'http://zhibo.hexun.com/'))t1
INNER JOIN
(
SELECT
cookieid,
to_date(DATETIME) day2
FROM
default.tracklog_10month
WHERE (lower(requesturl) LIKE 'http://zhibo.hexun.com/%'
OR lower(requesturl) LIKE 'http://chat.hexun.com/%')
AND requesturl LIKE '%/default.html%')t2
ON
t1.cookieid=t2.cookieid
AND t1.day1=t2.day2
INNER JOIN
(
SELECT
cookieid,
to_date(DATETIME) day3
FROM
default.tracklog_10month
WHERE
( (
lower(requesturl) LIKE 'http://px.hexun.com/%'
AND lower(requesturl) LIKE '%/default.html%' )
OR (
lower(requesturl) LIKE 'http://px.hexun.com/pack/%'
AND lower(requesturl) LIKE '%.html%' )
OR (
lower(requesturl) LIKE 'http://px.hexun.com/p/%'
AND lower(requesturl) LIKE '%.html%' ) ))t3
ON
t1.cookieid=t3.cookieid
AND t1.day1=t3.day3
LEFT JOIN
stage.saleplatform_productvisitdetail_temp t4
ON
t1.userid=t4.userid
WHERE
t4.createtime>t1.day1
OR t4.userid IS NULL;
3. Why share intermediate result sets
The essence is cutting IO: it reduces the heavy disk reads/writes and network IO pressure of the MapReduce stages.
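On Hive 0.13+ the same sharing can also be written with a WITH clause (CTE) instead of a physical temporary table; be aware that Hive may simply inline the CTE into each reference, so whether it is actually materialized once depends on version and configuration (e.g. hive.optimize.cte.materialize.threshold in Hive 2.x). A simplified, illustrative sketch of the pattern, not the full original query:

```sql
-- Shared scan expressed as a CTE; the two subqueries both read it
WITH tracklog_oct AS (
  SELECT cookieid, userid, requesturl
  FROM ods.tracklog_5min
  WHERE day >= '20151001' AND day <= '20151031'
)
SELECT COUNT(*) pv
FROM (SELECT cookieid FROM tracklog_oct
      WHERE lower(requesturl) IN ('http://chat.hexun.com/',
                                  'http://zhibo.hexun.com/')) t1
JOIN (SELECT cookieid FROM tracklog_oct
      WHERE requesturl LIKE '%/default.html%') t2
ON t1.cookieid = t2.cookieid;
```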
Using GROUP BY for deduplicated counts
The usual site-traffic metrics (PV, UV, unique IPs, logged-in users, and so on) all involve deduplication. For a full-year report PV exceeds 10 billion rows, so even a simple distinct count becomes very difficult.
1. The original deduplicating SQL:
select substr(day,1,4) year,count(*) PV,count(distinct cookieid) UV,count(distinct ip) IP,count(distinct userid) LOGIN
from dms.tracklog_5min a
where substr(day,1,4)='2015'
group by substr(day,1,4);
Three of the four metrics involve deduplication, and the job ran for hours without producing a result: with a single group per year, each count(distinct ...) is forced through one reducer.
2. Deduplicating with GROUP BY
select "2015","PV",count(*) from dms.tracklog_5min
where day>='2015' and day<'2016'
union all
select "2015","UV",count(*) from (
select cookieid from dms.tracklog_5min
where day>='2015' and day<'2016' group by cookieid ) a
union all
select "2015","IP",count(*) from (
select ip from dms.tracklog_5min
where day>='2015' and day<'2016' group by ip ) a
union all
select "2015","LOGIN",count(*) from (
select userid from dms.tracklog_5min
where day>='2015' and day<'2016' group by userid) b;
Computing PV, UV, IP, and LOGIN separately and stitching the results together with UNION ALL, the job produced results in under an hour.
3. Parameter tuning
SET mapred.reduce.tasks=50;
SET mapreduce.reduce.memory.mb=6000;
SET mapreduce.reduce.shuffle.memory.limit.percent=0.06;
When data skew is involved, it is mostly skew on the reduce side. It can be addressed, as above, by setting the number of reducers, the reducer memory size (in MB), and the fraction of shuffle memory at which reducers spill to disk.
Using MapJoin to fight data skew
With Hive's MapJoin, the join is completed in the map phase; if all the data the join needs is reachable during the map, no reduce phase is required.
When a small table is joined to a very large one, data skew occurs easily. MapJoin loads the small table entirely into memory and performs the join on the map side, so no reducer has to process it.
select c.channel_name,count(t.requesturl) PV
from ods.cms_channel c
join
(select host,requesturl from dms.tracklog_5min where day='20151111' ) t
on c.channel_name=t.host
group by c.channel_name
order by c.channel_name;
The above is a small-table-join-large-table operation, so MapJoin can be used to process the small table in memory. The syntax is simply to add the hint /*+ MAPJOIN(c) */, naming the table (here by its alias c) that should be distributed into memory:
select /*+ MAPJOIN(c) */
c.channel_name,count(t.requesturl) PV
from ods.cms_channel c
join
(select host,requesturl from dms.tracklog_5min where day='20151111' ) t
on c.channel_name=t.host
group by c.channel_name
order by c.channel_name;
This technique is frequently used when data skew appears.
Parameter notes:
1) Automatically convert to MapJoin when one side is small:
set hive.auto.convert.join = true; # default is false
When this is true, Hive estimates the sizes of the input tables and, if one is small enough, loads it into memory, i.e. applies MapJoin to the small table.
2) Threshold between small and large tables:
set hive.mapjoin.smalltable.filesize;
hive.mapjoin.smalltable.filesize=25000000
The default is 25 MB.
3) How much memory the local task may use to hold data when a MapJoin is followed by GROUP BY; if the data is too large, it is not kept in memory:
set hive.mapjoin.followby.gby.localtask.max.memory.usage;
Default: 0.55
4) Fraction of memory the local task may use:
set hive.mapjoin.localtask.max.memory.usage;
Default: 0.90
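Putting the four parameters together, a typical session setup might look like the sketch below; the values shown are the defaults discussed above, and appropriate settings vary by Hive version and workload:

```sql
set hive.auto.convert.join = true;                 -- auto MapJoin for small tables
set hive.mapjoin.smalltable.filesize = 25000000;   -- ~25 MB small-table threshold
set hive.mapjoin.followby.gby.localtask.max.memory.usage = 0.55;  -- MapJoin + GROUP BY
set hive.mapjoin.localtask.max.memory.usage = 0.90;               -- local task memory cap
```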