Hive窗口函数02-NTILE、ROW_NUMBER、RANK、DENSE_RANK

Hive窗口函数NTILE、ROW_NUMBER、RANK、DENSE_RANK入门

1. 数据说明

现有 hive 表 cookie2, 内容如下:

hive> select * from cookie2;

cookie1 2015-04-10  1
cookie1 2015-04-11  5
cookie1 2015-04-12  7
cookie1 2015-04-13  3
cookie1 2015-04-14  2
cookie1 2015-04-15  4
cookie1 2015-04-16  4
cookie2 2015-04-10  2
cookie2 2015-04-11  3
cookie2 2015-04-12  5
cookie2 2015-04-13  6
cookie2 2015-04-14  3
cookie2 2015-04-15  9
cookie2 2015-04-16  7
  • 其中字段意义:
    cookieid(string), createtime(string), pv(int)
  • 分别代表:
    cookieid, 创建时间, 该cookieid当天的pv值

2. ntile()操作

ntile(n),用于将分组数据按照顺序切分成n片,返回当前切片值,ntile不支持rows between,比如 ntile(2) over(partition by cookieid order by createtime rows between 3 preceding and current row),如果切片不均匀,默认增加第一个切片的分布。

(1) 将分组数组按照顺序切分成2份

select cookieid, createtime, pv,
ntile(2) over(partition by cookieid order by createtime) as nt1
from cookie2;

结果:因为每组有7条数据,所以默认第一个分片加1条数据
cookie1 2015-04-10  1   1
cookie1 2015-04-11  5   1
cookie1 2015-04-12  7   1
cookie1 2015-04-13  3   1
cookie1 2015-04-14  2   2
cookie1 2015-04-15  4   2
cookie1 2015-04-16  4   2
cookie2 2015-04-10  2   1
cookie2 2015-04-11  3   1
cookie2 2015-04-12  5   1
cookie2 2015-04-13  6   1
cookie2 2015-04-14  3   2
cookie2 2015-04-15  9   2
cookie2 2015-04-16  7   2

(2) 将分组数组按照顺序切分成n份

select cookieid, createtime, pv,
ntile(3) over(partition by cookieid order by createtime) as nt2,
ntile(4) over(partition by cookieid order by createtime) as nt3
from cookie2;

cookie1 2015-04-10  1   1   1
cookie1 2015-04-11  5   1   1
cookie1 2015-04-12  7   1   2
cookie1 2015-04-13  3   2   2
cookie1 2015-04-14  2   2   3
cookie1 2015-04-15  4   3   3
cookie1 2015-04-16  4   3   4
cookie2 2015-04-10  2   1   1
cookie2 2015-04-11  3   1   1
cookie2 2015-04-12  5   1   2
cookie2 2015-04-13  6   2   2
cookie2 2015-04-14  3   2   3
cookie2 2015-04-15  9   3   3
cookie2 2015-04-16  7   3   4

(3)应用场景:统计一个每个cookie的pv数最多的前1/3的天

select cookieid, createtime, pv,
ntile(3) over(partition by cookieid order by pv desc) as nt4
from cookie2;

结果:nt4 = 1 数据就是我们要求的结果
cookie1 2015-04-12  7   1
cookie1 2015-04-11  5   1
cookie1 2015-04-16  4   1
cookie1 2015-04-15  4   2
cookie1 2015-04-13  3   2
cookie1 2015-04-14  2   3
cookie1 2015-04-10  1   3
cookie2 2015-04-15  9   1
cookie2 2015-04-16  7   1
cookie2 2015-04-13  6   1
cookie2 2015-04-12  5   2
cookie2 2015-04-11  3   2
cookie2 2015-04-14  3   3
cookie2 2015-04-10  2   3

3. row_number()操作

row_number():从1开始,按照顺序,生成分组内记录的行号

(1)组内排序后生成行号

select cookieid, createtime,  pv,
row_number() over(partition by cookieid order by pv desc) as index
from cookie2;

结果:
cookie1 2015-04-12  7   1
cookie1 2015-04-11  5   2
cookie1 2015-04-16  4   3
cookie1 2015-04-15  4   4
cookie1 2015-04-13  3   5
cookie1 2015-04-14  2   6
cookie1 2015-04-10  1   7
cookie2 2015-04-15  9   1
cookie2 2015-04-16  7   2
cookie2 2015-04-13  6   3
cookie2 2015-04-12  5   4
cookie2 2015-04-11  3   5
cookie2 2015-04-14  3   6
cookie2 2015-04-10  2   7

(2)常用场景:分组取TopN, 比如求每个cookie排名前三的pv

create table tmp as
select cookieid, createtime,  pv,
row_number() over(partition by cookieid order by pv desc) as index
from cookie2;

select * from tmp where index <= 3;

结果:
cookie1 2015-04-12  7   1
cookie1 2015-04-11  5   2
cookie1 2015-04-16  4   3
cookie2 2015-04-15  9   1
cookie2 2015-04-16  7   2
cookie2 2015-04-13  6   3

4. rank()操作

rank:生成数据项在分组中的排名,相同的值编相同的号,排名相等会在名次中留下空位

select cookieid, createtime, pv,
rank() over(partition by cookieid order by pv desc) as rank
from cookie2
where cookieid = 'cookie1';

结果:
cookie1 2015-04-12  7   1
cookie1 2015-04-11  5   2
cookie1 2015-04-16  4   3
cookie1 2015-04-15  4   3
cookie1 2015-04-13  3   5
cookie1 2015-04-14  2   6
cookie1 2015-04-10  1   7

5. dense_rank()操作

dense_rank:按顺序编号,相同的值编相同的号,不留空位

select cookieid, createtime, pv,
dense_rank() over(partition by cookieid order by pv desc) as rank
from cookie2
where cookieid = 'cookie1';

结果:
cookie1 2015-04-12  7   1
cookie1 2015-04-11  5   2
cookie1 2015-04-16  4   3
cookie1 2015-04-15  4   3
cookie1 2015-04-13  3   4
cookie1 2015-04-14  2   5
cookie1 2015-04-10  1   6

参考文献:Hive分析窗口函数(二) NTILE,ROW_NUMBER,RANK,DENSE_RANK

    原文作者:CoderJed
    原文地址: https://www.jianshu.com/p/59315c799e8f
    本文转自网络文章,转载此文章仅为分享知识,如有侵权,请联系博主进行删除。
点赞