Hive窗口函数NTILE、ROW_NUMBER、RANK、DENSE_RANK入门
1. 数据说明
现有 hive 表 cookie2, 内容如下:
hive> select * from cookie2;
cookie1 2015-04-10 1
cookie1 2015-04-11 5
cookie1 2015-04-12 7
cookie1 2015-04-13 3
cookie1 2015-04-14 2
cookie1 2015-04-15 4
cookie1 2015-04-16 4
cookie2 2015-04-10 2
cookie2 2015-04-11 3
cookie2 2015-04-12 5
cookie2 2015-04-13 6
cookie2 2015-04-14 3
cookie2 2015-04-15 9
cookie2 2015-04-16 7
- 其中字段意义:
cookieid(string), createtime(string), pv(int) - 分别代表:
cookieid, 创建时间, 该cookieid当天的pv值
2. ntile()操作
ntile(n),用于将分组数据按照顺序切分成n片,返回当前切片值,ntile不支持rows between
,比如 ntile(2) over(partition by cookieid order by createtime rows between 3 preceding and current row)
,如果切片不均匀,默认增加第一个切片的分布。
(1) 将分组数组按照顺序切分成2份
select cookieid, createtime, pv,
ntile(2) over(partition by cookieid order by createtime) as nt1
from cookie2;
结果:因为每组有7条数据,所以默认第一个分片加1条数据
cookie1 2015-04-10 1 1
cookie1 2015-04-11 5 1
cookie1 2015-04-12 7 1
cookie1 2015-04-13 3 1
cookie1 2015-04-14 2 2
cookie1 2015-04-15 4 2
cookie1 2015-04-16 4 2
cookie2 2015-04-10 2 1
cookie2 2015-04-11 3 1
cookie2 2015-04-12 5 1
cookie2 2015-04-13 6 1
cookie2 2015-04-14 3 2
cookie2 2015-04-15 9 2
cookie2 2015-04-16 7 2
(2) 将分组数组按照顺序切分成n份
select cookieid, createtime, pv,
ntile(3) over(partition by cookieid order by createtime) as nt2,
ntile(4) over(partition by cookieid order by createtime) as nt3
from cookie2;
cookie1 2015-04-10 1 1 1
cookie1 2015-04-11 5 1 1
cookie1 2015-04-12 7 1 2
cookie1 2015-04-13 3 2 2
cookie1 2015-04-14 2 2 3
cookie1 2015-04-15 4 3 3
cookie1 2015-04-16 4 3 4
cookie2 2015-04-10 2 1 1
cookie2 2015-04-11 3 1 1
cookie2 2015-04-12 5 1 2
cookie2 2015-04-13 6 2 2
cookie2 2015-04-14 3 2 3
cookie2 2015-04-15 9 3 3
cookie2 2015-04-16 7 3 4
(3)应用场景:统计一个每个cookie的pv数最多的前1/3的天
select cookieid, createtime, pv,
ntile(3) over(partition by cookieid order by pv desc) as nt4
from cookie2;
结果:nt4 = 1 数据就是我们要求的结果
cookie1 2015-04-12 7 1
cookie1 2015-04-11 5 1
cookie1 2015-04-16 4 1
cookie1 2015-04-15 4 2
cookie1 2015-04-13 3 2
cookie1 2015-04-14 2 3
cookie1 2015-04-10 1 3
cookie2 2015-04-15 9 1
cookie2 2015-04-16 7 1
cookie2 2015-04-13 6 1
cookie2 2015-04-12 5 2
cookie2 2015-04-11 3 2
cookie2 2015-04-14 3 3
cookie2 2015-04-10 2 3
3. row_number()操作
row_number():从1开始,按照顺序,生成分组内记录的行号
(1)组内排序后生成行号
select cookieid, createtime, pv,
row_number() over(partition by cookieid order by pv desc) as index
from cookie2;
结果:
cookie1 2015-04-12 7 1
cookie1 2015-04-11 5 2
cookie1 2015-04-16 4 3
cookie1 2015-04-15 4 4
cookie1 2015-04-13 3 5
cookie1 2015-04-14 2 6
cookie1 2015-04-10 1 7
cookie2 2015-04-15 9 1
cookie2 2015-04-16 7 2
cookie2 2015-04-13 6 3
cookie2 2015-04-12 5 4
cookie2 2015-04-11 3 5
cookie2 2015-04-14 3 6
cookie2 2015-04-10 2 7
(2)常用场景:分组取TopN, 比如求每个cookie排名前三的pv
create table tmp as
select cookieid, createtime, pv,
row_number() over(partition by cookieid order by pv desc) as index
from cookie2;
select * from tmp where index <= 3;
结果:
cookie1 2015-04-12 7 1
cookie1 2015-04-11 5 2
cookie1 2015-04-16 4 3
cookie2 2015-04-15 9 1
cookie2 2015-04-16 7 2
cookie2 2015-04-13 6 3
4. rank()操作
rank:生成数据项在分组中的排名,相同的值编相同的号,排名相等会在名次中留下空位
select cookieid, createtime, pv,
rank() over(partition by cookieid order by pv desc) as rank
from cookie2
where cookieid = 'cookie1';
结果:
cookie1 2015-04-12 7 1
cookie1 2015-04-11 5 2
cookie1 2015-04-16 4 3
cookie1 2015-04-15 4 3
cookie1 2015-04-13 3 5
cookie1 2015-04-14 2 6
cookie1 2015-04-10 1 7
5. dense_rank()操作
dense_rank:按顺序编号,相同的值编相同的号,不留空位
select cookieid, createtime, pv,
dense_rank() over(partition by cookieid order by pv desc) as rank
from cookie2
where cookieid = 'cookie1';
结果:
cookie1 2015-04-12 7 1
cookie1 2015-04-11 5 2
cookie1 2015-04-16 4 3
cookie1 2015-04-15 4 3
cookie1 2015-04-13 3 4
cookie1 2015-04-14 2 5
cookie1 2015-04-10 1 6