Approaches to Computing Running Totals in Hive Queries
1. Requirements
The Hive table record contains the following rows:
hive> select * from record;
OK
A 2015-01 5
A 2015-01 15
B 2015-01 5
A 2015-01 8
B 2015-01 25
A 2015-01 5
A 2015-02 4
A 2015-02 6
B 2015-02 10
B 2015-02 5
A 2015-03 16
A 2015-03 22
B 2015-03 23
B 2015-03 10
B 2015-03 11
Columns:
userid(string) month(string) count(int)
Meaning:
user ID, month (yyyy-MM), number of visits recorded in that row (a user may have several rows per month)
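To reproduce the example, the table could be set up roughly as follows. This is only a sketch, not the original setup: the storage format is left at Hive's default, and INSERT ... VALUES assumes Hive 0.14 or later. The column names month and count mirror the table described above; if a particular Hive version rejects them, quoting them with backticks is one workaround.

```sql
-- Hypothetical setup for the sample data; names and types follow the column description.
CREATE TABLE IF NOT EXISTS record (
  userid STRING,
  month  STRING,  -- formatted as yyyy-MM
  count  INT      -- visits recorded in this row
);

-- Same 15 rows as the SELECT * output shown earlier.
INSERT INTO TABLE record VALUES
  ('A', '2015-01', 5),  ('A', '2015-01', 15), ('B', '2015-01', 5),
  ('A', '2015-01', 8),  ('B', '2015-01', 25), ('A', '2015-01', 5),
  ('A', '2015-02', 4),  ('A', '2015-02', 6),  ('B', '2015-02', 10),
  ('B', '2015-02', 5),  ('A', '2015-03', 16), ('A', '2015-03', 22),
  ('B', '2015-03', 23), ('B', '2015-03', 10), ('B', '2015-03', 11);
```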
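Goal:
For each user and each month, report the month's total visits, the cumulative visits up to and including that month, and the largest single-month total seen up to that month.
Expected result:
userid, month, visits this month, cumulative visits to date, max monthly visits to date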
A 2015-01 33 33 33
A 2015-02 10 43 33
A 2015-03 38 81 38
B 2015-01 30 30 30
B 2015-02 15 45 30
B 2015-03 44 89 44
2. Method 1: self-join
--(1)
-- First compute each user's total visits per month
CREATE TABLE record_2 AS
SELECT userid, month, sum(count) as count
FROM record
GROUP BY userid, month;
-- record_2 now contains:
A 2015-01 33
A 2015-02 10
A 2015-03 38
B 2015-01 30
B 2015-02 15
B 2015-03 44
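--(2)
-- Self-join record_2 on userid and keep the pairs where t2's month is not later than t1's,
-- so each (userid, month) row of t1 is matched with all of that user's months up to and including it.
-- month is a yyyy-MM string, so string comparison agrees with chronological order.
SELECT t1.userid, t1.month, t1.count, sum(t2.count) AS sum_count, max(t2.count) AS max_count
FROM record_2 t1 INNER JOIN record_2 t2
ON t1.userid = t2.userid
WHERE t1.month >= t2.month
GROUP BY t1.userid, t1.month, t1.count
ORDER BY t1.userid, t1.month;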
-- Final result:
A 2015-01 33 33 33
A 2015-02 10 43 33
A 2015-03 38 81 38
B 2015-01 30 30 30
B 2015-02 15 45 30
B 2015-03 44 89 44
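To see why the self-join yields running aggregates, it can help to look at the joined rows before they are grouped. The query below is an illustrative sketch restricted to user A; the aliases cur_month, prior_month and prior_count are made up for this example.

```sql
-- Show which record_2 rows each month of user A is paired with by the self-join.
SELECT t1.userid,
       t1.month AS cur_month,
       t2.month AS prior_month,
       t2.count AS prior_count
FROM record_2 t1 INNER JOIN record_2 t2
ON t1.userid = t2.userid
WHERE t1.userid = 'A'
  AND t1.month >= t2.month
ORDER BY cur_month, prior_month;
-- Pairs produced for A:
--   2015-01 with 2015-01 (33)
--   2015-02 with 2015-01 (33), 2015-02 (10)
--   2015-03 with 2015-01 (33), 2015-02 (10), 2015-03 (38)
```

Grouping by cur_month and applying sum() and max() to prior_count then gives 33/33, 43/33 and 81/38, which matches the rows for A in the final result. Note that the number of joined rows grows quadratically with the number of months per user.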
3. Method 2: Hive window functions max() and sum()
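-- With ORDER BY and no explicit frame, the window defaults to
-- RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, i.e. a running aggregate per user.
select userid, month, count,
sum(count) over(partition by userid order by month) as sum_count,
max(count) over(partition by userid order by month) as max_count
from record_2;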
Result:
A 2015-01 33 33 33
A 2015-02 10 43 33
A 2015-03 38 81 38
B 2015-01 30 30 30
B 2015-02 15 45 30
B 2015-03 44 89 44
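The window-function approach does not strictly need the intermediate record_2 table. As a sketch, the monthly aggregation can be done in a subquery and the running aggregates computed on top of it. The alias names m and month_count are assumptions made for this example, and the frame is spelled out explicitly even though it behaves the same as the default here, because each user has exactly one row per month after grouping.

```sql
-- Monthly totals in a subquery, running sum and running max on top of them.
SELECT userid,
       month,
       month_count,
       SUM(month_count) OVER (PARTITION BY userid ORDER BY month
                              ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS sum_count,
       MAX(month_count) OVER (PARTITION BY userid ORDER BY month
                              ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS max_count
FROM (
  SELECT userid, month, SUM(count) AS month_count
  FROM record
  GROUP BY userid, month
) m
ORDER BY userid, month;
```

The trailing ORDER BY is there because a window function's own ORDER BY only controls how rows are processed inside each partition; it does not guarantee the order in which the final rows are returned.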