Hive窗口函数01-SUM、MIN、MAX、AVG

Hive窗口函数sum、min、max、avg入门

1. 数据说明

现有 hive 表 cookie1, 内容如下:

hive> select * from cookie1;
cookie1 2015-04-10  1
cookie1 2015-04-11  5
cookie1 2015-04-12  7
cookie1 2015-04-13  3
cookie1 2015-04-14  2
cookie1 2015-04-15  4
cookie1 2015-04-16  4
  • 其中字段意义:
    cookieid(string), createtime(string), pv(int)
  • 分别代表:
    cookieid, 创建时间, 该cookieid当天的pv值

2. sum()操作

(1) 对组内的某个字段的全部数据进行累加,例如,求cookie1的总pv值

select cookieid, createtime, pv, 
sum(pv) over(partition by cookieid) as sum1 
from cookie1; 

结果:
cookie1 2015-04-16  4   26
cookie1 2015-04-15  4   26
cookie1 2015-04-14  2   26
cookie1 2015-04-13  3   26
cookie1 2015-04-12  7   26
cookie1 2015-04-11  5   26
cookie1 2015-04-10  1   26

partition by cookieid : 按照cookieid分组

(2) 从第一行开始累加到当前行

select cookieid, createtime, pv, 
sum(pv) over(partition by cookieid order by createtime) as sum2
from cookie1;

# 结果
cookie1 2015-04-10  1   1
cookie1 2015-04-11  5   6
cookie1 2015-04-12  7   13
cookie1 2015-04-13  3   16
cookie1 2015-04-14  2   18
cookie1 2015-04-15  4   22
cookie1 2015-04-16  4   26

以上SQL语句是下面这个SQL语句的简写:
select 
    cookieid,
    createtime,
    pv, 
    sum(pv) over(partition by cookieid order by createtime rows between unbounded preceding and current row) as sum3
from cookie1;

order by : 在同一个组内,先累加完相同createtime的pv,再累加其他createtime的pv, 比如 :
现在在表末尾加一条数据cookie1 2015-04-10 1 1,那么结果就是 :

cookie1 2015-04-10  1   1
cookie1 2015-04-10  1   2
cookie1 2015-04-11  5   7
cookie1 2015-04-12  7   14
cookie1 2015-04-13  3   17
cookie1 2015-04-14  2   19
cookie1 2015-04-15  4   23
cookie1 2015-04-16  4   27

cookie1 2015-04-10 1 1这两条数据是否相邻无关。

unbounded : 起点
preceding : 往前
following : 往后
between…end… : 指定操作的范围
current row : 当前行,计算到哪一行就是哪一行
between unbounded preceding and current row : 从起点到当前行,往前(对应表中的从上到下的顺序)累加
默认就是从起点到当前行往前累加,所以between unbounded preceding and current row这个条件可以不写。

(3) 计算分组内当前行到剩余所有行的和

select cookieid, createtime, pv, 
sum(pv) over(partition by cookieid order by createtime rows between current row and unbounded following) as sum4
from cookie1; 

结果分析 : 
cookie1 2015-04-10  1   26  (1+5+7+3+2+4+4=26)
cookie1 2015-04-11  5   25  (5+7+3+2+4+4=25)
cookie1 2015-04-12  7   20  (7+3+2+4+4=20)
cookie1 2015-04-13  3   13  (3+2+4+4=13)
cookie1 2015-04-14  2   10  (2+4+4=10)
cookie1 2015-04-15  4   8   (4+4=8)
cookie1 2015-04-16  4   4   (4)

(4) 指定计算的范围

select cookieid, createtime, pv, 
sum(pv) over(partition by cookieid order by createtime rows between 3 preceding and current row) as sum5
from cookie1; 

结果分析 : 
cookie1 2015-04-10  1   1   (这一行前面没有记录,只能取本身的值)
cookie1 2015-04-11  5   6   (这一行前面有1,不够取3行的值,只能取一行与本身相加,取5+1=6)
cookie1 2015-04-12  7   13  (这一行前面有1,5,不够取3行的值,只能取两行与本身相加,取7+5+1=13)
cookie1 2015-04-13  3   16  (这一行前面有3行可以取,依次往回取三行,取3+7+5+1=16)
cookie1 2015-04-14  2   17  (同理,取2+3+7+5=17)
cookie1 2015-04-15  4   16  (同理,取4+2+3+7=16)
cookie1 2015-04-16  4   13  (同理,取4+4+2+3=13)

select cookieid, createtime, pv, 
sum(pv) over(partition by cookieid order by createtime rows between 3 preceding and 1 following) as sum6
from cookie1; 

结果分析 : 
cookie1 2015-04-10  1   6   (这一行前面没有记录,后面可以取一条记录,1+5=6)
cookie1 2015-04-11  5   13  (这一行前面只能取一条记录,后面可以取一条记录,1+5+7=13)
cookie1 2015-04-12  7   16  (这一行前面只能取两条记录,后面可以取一条记录,1+5+7+3=16)
cookie1 2015-04-13  3   18  (这一行前面可以取三条记录,后面可以取一条记录,1+5+7+3+2=18)
cookie1 2015-04-14  2   21  (同理,5+7+3+2+4=21) 
cookie1 2015-04-15  4   20  (同理,7+3+2+4+4=20)
cookie1 2015-04-16  4   13  (这一行前面可以取三条记录,后面没有记录,3+2+4+4=13)

3. avg()操作

sum、avg、max、min这些窗口函数的语法都是一样的,以下用一个SQL语句来演示 :

select cookieid,createtime,pv,
avg(pv) over(partition by cookieid) as avg1,
avg(pv) over(partition by cookieid order by createtime) as avg2,
avg(pv) over(partition by cookieid order by createtime rows between unbounded preceding and current row) as avg3,
avg(pv) over(partition by cookieid order by createtime rows between current row and unbounded following) as avg4,
avg(pv) over(partition by cookieid order by createtime rows between 3 preceding and current row) as avg5,                
avg(pv) over(partition by cookieid order by createtime rows between 3 preceding and 1 following) as avg6
from cookie1;

结果 : 
cookie1 2015-04-10  1       3.7142857142857144      1.0                 1.0                 3.7142857142857144      1.0                 3.0
cookie1 2015-04-11  5       3.7142857142857144      3.0                 3.0                 4.166666666666667       3.0                 4.333333333333333
cookie1 2015-04-12  7       3.7142857142857144      4.333333333333333   4.333333333333333   4.0                     4.333333333333333   4.0
cookie1 2015-04-13  3       3.7142857142857144      4.0                 4.0                 3.25                    4.0                 3.6
cookie1 2015-04-14  2       3.7142857142857144      3.6                 3.6                 3.3333333333333335      4.25                4.2
cookie1 2015-04-15  4       3.7142857142857144      3.6666666666666665  3.6666666666666665  4.0                     4.0                 4.0
cookie1 2015-04-16  4       3.7142857142857144      3.7142857142857144  3.7142857142857144  4.0                     3.25                3.25

4. min()操作

select cookieid, createtime, pv,
min(pv) over(partition by cookieid) as min1,
min(pv) over(partition by cookieid order by createtime) as min2, 
min(pv) over(partition by cookieid order by createtime rows between unbounded preceding and current row) as min3,
min(pv) over(partition by cookieid order by createtime rows between current row and unbounded following) as min4,
min(pv) over(partition by cookieid order by createtime rows between 3 preceding and current row) as min5,
min(pv) over(partition by cookieid order by createtime rows between 3 preceding and 1 following) as min6
from cookie1;

结果 : 
cookie1 2015-04-10  1   1   1   1   1   1   1
cookie1 2015-04-11  5   1   1   1   2   1   1
cookie1 2015-04-12  7   1   1   1   2   1   1
cookie1 2015-04-13  3   1   1   1   2   1   1
cookie1 2015-04-14  2   1   1   1   2   2   2
cookie1 2015-04-15  4   1   1   1   4   2   2
cookie1 2015-04-16  4   1   1   1   4   2   2

5. max()操作

select cookieid,createtime,pv,
max(pv) over(partition by cookieid) as max1,
max(pv) over(partition by cookieid order by createtime) as max2,
max(pv) over(partition by cookieid order by createtime rows between unbounded preceding and current row) as max3,
max(pv) over(partition by cookieid order by createtime rows between current row and unbounded following) as max4,
max(pv) over(partition by cookieid order by createtime rows between 3 preceding and current row) as max5,
max(pv) over(partition by cookieid order by createtime rows between 3 preceding and 1 following) as max6
from cookie1;

结果 : 
cookie1 2015-04-10  1   7   1   1   7   1   5
cookie1 2015-04-11  5   7   5   5   7   5   7
cookie1 2015-04-12  7   7   7   7   7   7   7
cookie1 2015-04-13  3   7   7   7   4   7   7
cookie1 2015-04-14  2   7   7   7   4   7   7
cookie1 2015-04-15  4   7   7   7   4   7   7
cookie1 2015-04-16  4   7   7   7   4   4   4

参考文章 : Hive分析窗口函数(一) SUM,AVG,MIN,MAX

    原文作者:CoderJed
    原文地址: https://www.jianshu.com/p/fdb3c4ae0b84
    本文转自网络文章,转载此文章仅为分享知识,如有侵权,请联系博主进行删除。
点赞