Hive窗口函数04-LAG、LEAD、FIRST_VALUE、LAST_VALUE

Hive窗口函数LAG、LEAD、FIRST_VALUE、LAST_VALUE入门

1. 数据说明

现有 hive 表 cookie4, 内容如下:

hive> select * from cookie4;

cookie4.cookieid    cookie4.createtime          cookie4.url
cookie1             2015-04-10 10:00:02         url2
cookie1             2015-04-10 10:00:00         url1
cookie1             2015-04-10 10:03:04         1url3
cookie1             2015-04-10 10:50:05         url6
cookie1             2015-04-10 11:00:00         url7
cookie1             2015-04-10 10:10:00         url4
cookie1             2015-04-10 10:50:01         url5
cookie2             2015-04-10 10:00:02         url22
cookie2             2015-04-10 10:00:00         url11
cookie2             2015-04-10 10:03:04         1url33
cookie2             2015-04-10 10:50:05         url66
cookie2             2015-04-10 11:00:00         url77
cookie2             2015-04-10 10:10:00         url44
cookie2             2015-04-10 10:50:01         url55
  • 其中字段意义:
    cookieid(string), createtime(string), url(int)
  • 分别代表:
    cookieid, 创建时间, 访问的url

2. lag()操作

LAG(col,n,DEFAULT)用于统计窗口内往上第n行值
第一个参数为列名
第二个参数为往上第n行(可选,默认为1)
第三个参数为默认值(当往上第n行为NULL时候,取默认值,如不指定,则为NULL)

--(1)
hive> SELECT cookieid, createtime, url,
    > LAG(createtime, 2) OVER (PARTITION BY cookieid ORDER BY createtime) AS last_2_time 
    > FROM cookie4;
    
结果:因为没有设置默认值,当没有上两行时显示为NULL
cookieid    createtime          url     last_2_time
cookie1     2015-04-10 10:00:00 url1    NULL 
cookie1     2015-04-10 10:00:02 url2    NULL
cookie1     2015-04-10 10:03:04 1url3   2015-04-10 10:00:00
cookie1     2015-04-10 10:10:00 url4    2015-04-10 10:00:02
cookie1     2015-04-10 10:50:01 url5    2015-04-10 10:03:04
cookie1     2015-04-10 10:50:05 url6    2015-04-10 10:10:00
cookie1     2015-04-10 11:00:00 url7    2015-04-10 10:50:01
cookie2     2015-04-10 10:00:00 url11   NULL
cookie2     2015-04-10 10:00:02 url22   NULL
cookie2     2015-04-10 10:03:04 1url33  2015-04-10 10:00:00
cookie2     2015-04-10 10:10:00 url44   2015-04-10 10:00:02
cookie2     2015-04-10 10:50:01 url55   2015-04-10 10:03:04
cookie2     2015-04-10 10:50:05 url66   2015-04-10 10:10:00
cookie2     2015-04-10 11:00:00 url77   2015-04-10 10:50:01

--(2)
hive> SELECT cookieid, createtime, url,
    > LAG(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS last_1_time 
    > FROM cookie4;
    
结果:
cookieid    createtime          url     last_1_time
cookie1     2015-04-10 10:00:00 url1    1970-01-01 00:00:00 (显示默认值)
cookie1     2015-04-10 10:00:02 url2    2015-04-10 10:00:00
cookie1     2015-04-10 10:03:04 1url3   2015-04-10 10:00:02
cookie1     2015-04-10 10:10:00 url4    2015-04-10 10:03:04
cookie1     2015-04-10 10:50:01 url5    2015-04-10 10:10:00
cookie1     2015-04-10 10:50:05 url6    2015-04-10 10:50:01
cookie1     2015-04-10 11:00:00 url7    2015-04-10 10:50:05
cookie2     2015-04-10 10:00:00 url11   1970-01-01 00:00:00 (显示默认值)
cookie2     2015-04-10 10:00:02 url22   2015-04-10 10:00:00
cookie2     2015-04-10 10:03:04 1url33  2015-04-10 10:00:02
cookie2     2015-04-10 10:10:00 url44   2015-04-10 10:03:04
cookie2     2015-04-10 10:50:01 url55   2015-04-10 10:10:00
cookie2     2015-04-10 10:50:05 url66   2015-04-10 10:50:01
cookie2     2015-04-10 11:00:00 url77   2015-04-10 10:50:05

3. lead()操作

lead的作用与lag相反
LEAD(col,n,DEFAULT)用于统计窗口内往下第n行值
第一个参数为列名
第二个参数为往下第n行(可选,默认为1)
第三个参数为默认值(当往下第n行为NULL时候,取默认值,如不指定,则为NULL)

hive> SELECT cookieid, createtime, url,
    > LEAD(createtime, 2) OVER(PARTITION BY cookieid ORDER BY createtime) AS next_2_time,
    > LEAD(createtime, 1, '1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS next_1_time
    > FROM cookie4;
    
结果:
cookieid    createtime          url     next_2_time         next_1_time
cookie1     2015-04-10 10:00:00 url1    2015-04-10 10:03:04 2015-04-10 10:00:02
cookie1     2015-04-10 10:00:02 url2    2015-04-10 10:10:00 2015-04-10 10:03:04
cookie1     2015-04-10 10:03:04 1url3   2015-04-10 10:50:01 2015-04-10 10:10:00
cookie1     2015-04-10 10:10:00 url4    2015-04-10 10:50:05 2015-04-10 10:50:01
cookie1     2015-04-10 10:50:01 url5    2015-04-10 11:00:00 2015-04-10 10:50:05
cookie1     2015-04-10 10:50:05 url6    NULL                2015-04-10 11:00:00
cookie1     2015-04-10 11:00:00 url7    NULL                1970-01-01 00:00:00
cookie2     2015-04-10 10:00:00 url11   2015-04-10 10:03:04 2015-04-10 10:00:02
cookie2     2015-04-10 10:00:02 url22   2015-04-10 10:10:00 2015-04-10 10:03:04
cookie2     2015-04-10 10:03:04 1url33  2015-04-10 10:50:01 2015-04-10 10:10:00
cookie2     2015-04-10 10:10:00 url44   2015-04-10 10:50:05 2015-04-10 10:50:01
cookie2     2015-04-10 10:50:01 url55   2015-04-10 11:00:00 2015-04-10 10:50:05
cookie2     2015-04-10 10:50:05 url66   NULL                2015-04-10 11:00:00
cookie2     2015-04-10 11:00:00 url77   NULL                1970-01-01 00:00:00

4.FIRST_VALUE()操作

取分组内排序后,截止到当前行,第一个值

hive> SELECT cookieid, createtime, url,
    > FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS first 
    > FROM cookie4;
    
结果:
cookieid    createtime          url     first
cookie1     2015-04-10 10:00:00 url1    url1
cookie1     2015-04-10 10:00:02 url2    url1
cookie1     2015-04-10 10:03:04 1url3   url1
cookie1     2015-04-10 10:10:00 url4    url1
cookie1     2015-04-10 10:50:01 url5    url1
cookie1     2015-04-10 10:50:05 url6    url1
cookie1     2015-04-10 11:00:00 url7    url1
cookie2     2015-04-10 10:00:00 url11   url11
cookie2     2015-04-10 10:00:02 url22   url11
cookie2     2015-04-10 10:03:04 1url33  url11
cookie2     2015-04-10 10:10:00 url44   url11
cookie2     2015-04-10 10:50:01 url55   url11
cookie2     2015-04-10 10:50:05 url66   url11
cookie2     2015-04-10 11:00:00 url77   url11

5.LAST_VALUE()操作

取分组内排序后,截止到当前行,最后一个值

hive> SELECT cookieid, createtime, url,
    > LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last
    > FROM cookie4;
    
结果:注意,是截止到当前行的最后一个值,其实就是它本身
cookieid    createtime          url     last
cookie1     2015-04-10 10:00:00 url1    url1
cookie1     2015-04-10 10:00:02 url2    url2
cookie1     2015-04-10 10:03:04 1url3   1url3
cookie1     2015-04-10 10:10:00 url4    url4
cookie1     2015-04-10 10:50:01 url5    url5
cookie1     2015-04-10 10:50:05 url6    url6
cookie1     2015-04-10 11:00:00 url7    url7
cookie2     2015-04-10 10:00:00 url11   url11
cookie2     2015-04-10 10:00:02 url22   url22
cookie2     2015-04-10 10:03:04 1url33  1url33
cookie2     2015-04-10 10:10:00 url44   url44
cookie2     2015-04-10 10:50:01 url55   url55
cookie2     2015-04-10 10:50:05 url66   url66
cookie2     2015-04-10 11:00:00 url77   url77

参考文章:Hive分析窗口函数(四) LAG,LEAD,FIRST_VALUE,LAST_VALUE

    原文作者:CoderJed
    原文地址: https://www.jianshu.com/p/5a1abdc78f3b
    本文转自网络文章,转载此文章仅为分享知识,如有侵权,请联系博主进行删除。
点赞