Hive窗口函数LAG、LEAD、FIRST_VALUE、LAST_VALUE入门
1. 数据说明
现有 hive 表 cookie4, 内容如下:
hive> select * from cookie4;
cookie4.cookieid cookie4.createtime cookie4.url
cookie1 2015-04-10 10:00:02 url2
cookie1 2015-04-10 10:00:00 url1
cookie1 2015-04-10 10:03:04 1url3
cookie1 2015-04-10 10:50:05 url6
cookie1 2015-04-10 11:00:00 url7
cookie1 2015-04-10 10:10:00 url4
cookie1 2015-04-10 10:50:01 url5
cookie2 2015-04-10 10:00:02 url22
cookie2 2015-04-10 10:00:00 url11
cookie2 2015-04-10 10:03:04 1url33
cookie2 2015-04-10 10:50:05 url66
cookie2 2015-04-10 11:00:00 url77
cookie2 2015-04-10 10:10:00 url44
cookie2 2015-04-10 10:50:01 url55
- 其中字段意义:
cookieid(string), createtime(string), url(int) - 分别代表:
cookieid, 创建时间, 访问的url
2. lag()操作
LAG(col,n,DEFAULT)用于统计窗口内往上第n行值
第一个参数为列名
第二个参数为往上第n行(可选,默认为1)
第三个参数为默认值(当往上第n行为NULL时候,取默认值,如不指定,则为NULL)
--(1)
hive> SELECT cookieid, createtime, url,
> LAG(createtime, 2) OVER (PARTITION BY cookieid ORDER BY createtime) AS last_2_time
> FROM cookie4;
结果:因为没有设置默认值,当没有上两行时显示为NULL
cookieid createtime url last_2_time
cookie1 2015-04-10 10:00:00 url1 NULL
cookie1 2015-04-10 10:00:02 url2 NULL
cookie1 2015-04-10 10:03:04 1url3 2015-04-10 10:00:00
cookie1 2015-04-10 10:10:00 url4 2015-04-10 10:00:02
cookie1 2015-04-10 10:50:01 url5 2015-04-10 10:03:04
cookie1 2015-04-10 10:50:05 url6 2015-04-10 10:10:00
cookie1 2015-04-10 11:00:00 url7 2015-04-10 10:50:01
cookie2 2015-04-10 10:00:00 url11 NULL
cookie2 2015-04-10 10:00:02 url22 NULL
cookie2 2015-04-10 10:03:04 1url33 2015-04-10 10:00:00
cookie2 2015-04-10 10:10:00 url44 2015-04-10 10:00:02
cookie2 2015-04-10 10:50:01 url55 2015-04-10 10:03:04
cookie2 2015-04-10 10:50:05 url66 2015-04-10 10:10:00
cookie2 2015-04-10 11:00:00 url77 2015-04-10 10:50:01
--(2)
hive> SELECT cookieid, createtime, url,
> LAG(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS last_1_time
> FROM cookie4;
结果:
cookieid createtime url last_1_time
cookie1 2015-04-10 10:00:00 url1 1970-01-01 00:00:00 (显示默认值)
cookie1 2015-04-10 10:00:02 url2 2015-04-10 10:00:00
cookie1 2015-04-10 10:03:04 1url3 2015-04-10 10:00:02
cookie1 2015-04-10 10:10:00 url4 2015-04-10 10:03:04
cookie1 2015-04-10 10:50:01 url5 2015-04-10 10:10:00
cookie1 2015-04-10 10:50:05 url6 2015-04-10 10:50:01
cookie1 2015-04-10 11:00:00 url7 2015-04-10 10:50:05
cookie2 2015-04-10 10:00:00 url11 1970-01-01 00:00:00 (显示默认值)
cookie2 2015-04-10 10:00:02 url22 2015-04-10 10:00:00
cookie2 2015-04-10 10:03:04 1url33 2015-04-10 10:00:02
cookie2 2015-04-10 10:10:00 url44 2015-04-10 10:03:04
cookie2 2015-04-10 10:50:01 url55 2015-04-10 10:10:00
cookie2 2015-04-10 10:50:05 url66 2015-04-10 10:50:01
cookie2 2015-04-10 11:00:00 url77 2015-04-10 10:50:05
3. lead()操作
lead的作用与lag相反
LEAD(col,n,DEFAULT)用于统计窗口内往下第n行值
第一个参数为列名
第二个参数为往下第n行(可选,默认为1)
第三个参数为默认值(当往下第n行为NULL时候,取默认值,如不指定,则为NULL)
hive> SELECT cookieid, createtime, url,
> LEAD(createtime, 2) OVER(PARTITION BY cookieid ORDER BY createtime) AS next_2_time,
> LEAD(createtime, 1, '1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS next_1_time
> FROM cookie4;
结果:
cookieid createtime url next_2_time next_1_time
cookie1 2015-04-10 10:00:00 url1 2015-04-10 10:03:04 2015-04-10 10:00:02
cookie1 2015-04-10 10:00:02 url2 2015-04-10 10:10:00 2015-04-10 10:03:04
cookie1 2015-04-10 10:03:04 1url3 2015-04-10 10:50:01 2015-04-10 10:10:00
cookie1 2015-04-10 10:10:00 url4 2015-04-10 10:50:05 2015-04-10 10:50:01
cookie1 2015-04-10 10:50:01 url5 2015-04-10 11:00:00 2015-04-10 10:50:05
cookie1 2015-04-10 10:50:05 url6 NULL 2015-04-10 11:00:00
cookie1 2015-04-10 11:00:00 url7 NULL 1970-01-01 00:00:00
cookie2 2015-04-10 10:00:00 url11 2015-04-10 10:03:04 2015-04-10 10:00:02
cookie2 2015-04-10 10:00:02 url22 2015-04-10 10:10:00 2015-04-10 10:03:04
cookie2 2015-04-10 10:03:04 1url33 2015-04-10 10:50:01 2015-04-10 10:10:00
cookie2 2015-04-10 10:10:00 url44 2015-04-10 10:50:05 2015-04-10 10:50:01
cookie2 2015-04-10 10:50:01 url55 2015-04-10 11:00:00 2015-04-10 10:50:05
cookie2 2015-04-10 10:50:05 url66 NULL 2015-04-10 11:00:00
cookie2 2015-04-10 11:00:00 url77 NULL 1970-01-01 00:00:00
4.FIRST_VALUE()操作
取分组内排序后,截止到当前行,第一个值
hive> SELECT cookieid, createtime, url,
> FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS first
> FROM cookie4;
结果:
cookieid createtime url first
cookie1 2015-04-10 10:00:00 url1 url1
cookie1 2015-04-10 10:00:02 url2 url1
cookie1 2015-04-10 10:03:04 1url3 url1
cookie1 2015-04-10 10:10:00 url4 url1
cookie1 2015-04-10 10:50:01 url5 url1
cookie1 2015-04-10 10:50:05 url6 url1
cookie1 2015-04-10 11:00:00 url7 url1
cookie2 2015-04-10 10:00:00 url11 url11
cookie2 2015-04-10 10:00:02 url22 url11
cookie2 2015-04-10 10:03:04 1url33 url11
cookie2 2015-04-10 10:10:00 url44 url11
cookie2 2015-04-10 10:50:01 url55 url11
cookie2 2015-04-10 10:50:05 url66 url11
cookie2 2015-04-10 11:00:00 url77 url11
5.LAST_VALUE()操作
取分组内排序后,截止到当前行,最后一个值
hive> SELECT cookieid, createtime, url,
> LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last
> FROM cookie4;
结果:注意,是截止到当前行的最后一个值,其实就是它本身
cookieid createtime url last
cookie1 2015-04-10 10:00:00 url1 url1
cookie1 2015-04-10 10:00:02 url2 url2
cookie1 2015-04-10 10:03:04 1url3 1url3
cookie1 2015-04-10 10:10:00 url4 url4
cookie1 2015-04-10 10:50:01 url5 url5
cookie1 2015-04-10 10:50:05 url6 url6
cookie1 2015-04-10 11:00:00 url7 url7
cookie2 2015-04-10 10:00:00 url11 url11
cookie2 2015-04-10 10:00:02 url22 url22
cookie2 2015-04-10 10:03:04 1url33 1url33
cookie2 2015-04-10 10:10:00 url44 url44
cookie2 2015-04-10 10:50:01 url55 url55
cookie2 2015-04-10 10:50:05 url66 url66
cookie2 2015-04-10 11:00:00 url77 url77