Hive 的几个练习题

题目1

需求:
每个用户截止到每月为止的最大单月访问次数和累计到该月的总访问次数

三个字段的意思:
用户名,月份,访问次数

数据:
A,2015-01,5
A,2015-01,15
B,2015-01,5
A,2015-01,8
B,2015-01,25
A,2015-01,5
A,2015-02,4
A,2015-02,6
B,2015-02,10
B,2015-02,5
A,2015-03,16
A,2015-03,22
B,2015-03,23
B,2015-03,10
B,2015-03,11

最后结果展示:

用户 月份 最大访问次数 总访问次数 当月访问次数
A 2015-01 33 33 33
A 2015-02 33 43 10
A 2015-03 38 81 38
B 2015-01 30 30 30
B 2015-02 30 45 15
B 2015-03 44 89 44

解题思路

  1. 以username分和month分组,统计出每月访问次数,得到如下结果
    CREATE TABLE t01_s1 AS
    SELECT table01.username, table01. MONTH, sum(table01.count) sum FROM myhive.table01 GROUP BY table01.username, table01. MONTH;

  2. 进行自连接,选出tl.month>tr.month的字段,并用username和month分组就可以得到结果
    SELECT tl.username, tl. MONTH, max(tr.sum) maxvisit, sum(tr.sum) sumvisit, max(tl.sum) currentmonth
    FROM t01_s1 tl JOIN t01_s1 tr ON tl.username = tr.username
    WHERE tl. MONTH >= tr. MONTH
    GROUP BY tl.username, tl. MONTH;

进行自连接后的结果如下:

tl.username     tl.month        tl.sum  tr.username     tr.month        tr.sum
A       2015-01 33      A       2015-01 33
A       2015-02 10      A       2015-01 33
A       2015-03 38      A       2015-01 33
A       2015-01 33      A       2015-02 10
A       2015-02 10      A       2015-02 10
A       2015-03 38      A       2015-02 10
A       2015-01 33      A       2015-03 38
A       2015-02 10      A       2015-03 38
A       2015-03 38      A       2015-03 38
B       2015-01 30      B       2015-01 30
B       2015-02 15      B       2015-01 30
B       2015-03 44      B       2015-01 30
B       2015-01 30      B       2015-02 15
B       2015-02 15      B       2015-02 15
B       2015-03 44      B       2015-02 15
B       2015-01 30      B       2015-03 44
B       2015-02 15      B       2015-03 44
B       2015-03 44      B       2015-03 44

题目2

// 建表语句:
CREATE TABLE course (
id int(11) NOT NULL AUTO_INCREMENT PRIMARY KEY,
sid int(11) DEFAULT NULL,
course varchar(255) DEFAULT NULL,
score int(11) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

// 插入数据
// 字段解释:id, 学号, 课程, 成绩
INSERT INTO course VALUES (1, 1, ‘yuwen’, 43);
INSERT INTO course VALUES (2, 1, ‘shuxue’, 55);
INSERT INTO course VALUES (3, 2, ‘yuwen’, 77);
INSERT INTO course VALUES (4, 2, ‘shuxue’, 88);
INSERT INTO course VALUES (5, 3, ‘yuwen’, 98);
INSERT INTO course VALUES (6, 3, ‘shuxue’, 65);

求:所有数学课程成绩 大于 语文课程成绩的学生的学号

解答:

  1. 自连接的方式:
    select c1.* from course c1 join course c2 on c1.sid=c2.sid where c1.score>c2.score and c1.course=’shuxue’;

  2. 行列转换的方式
    select a.sid from (select sid,
    max(case when course=’yuwen’then score else 0 end) yuwen,
    max(case when course=’shuxue’then score else 0 end) shuxue
    from course group by sid having shuxue>yuwen) a;

题目3

数据:
2014010114
2014010216
2014010317
2014010410
2014010506
2012010609
2012010732
2012010812
2012010919
2012011023
2001010116
2001010212
2001010310
2001010411
2001010529
2013010619
2013010722
2013010812
2013010929
2013011023
2008010105
2008010216
2008010337
2008010414
2008010516
2007010619
2007010712
2007010812
2007010999
2007011023
2010010114
2010010216
2010010317
2010010410
2010010506
2015010649
2015010722
2015010812
2015010999
2015011023

要求: 求出一年中出现最高温度的那一天
输出以下数据:
20010105 29
20070109 99
20080103 37
20100103 17
20120107 32
20130109 29
20140103 17
20150109 99

解题

解法一:
SELECT * FROM exercise3
WHERE concat( substr(DATA, 1, 4), substr(DATA, 9, 2))
IN ( SELECT concat( substr(DATA, 1, 4), max(substr(DATA, 9, 2))) FROM exercise3 GROUP BY substr(DATA, 1, 4));
思路:通过年份分组求出最高温度的那一年和最高温度,把这些数据看成一个集合。再查出原始表中出现这些数据的那一行。

解法二:
select substring(b.line, 1, 8) as max_temp_date, a.max_temp
from exercise3 b join
(select substring(c.line, 1, 4) as year, max(substring(c.line, -2)) as max_temp
from exercise3 c group by substring(c.line, 1, 4)) a
on a.year = substring(b.line, 1, 4) and
a.max_temp = substring(b.line, -2);
思路:1. 求出以你那为分组,求出最最高温度和年份

  1. 用原始表和这个表进行连接,连接条件为年份相同且最高温度相同的条目

题目4

现有一份以下格式的数据:
表示有id为1,2,3的学生选修了课程a,b,c,d,e,f中其中几门:
数据:
id course
1,a
1,b
1,c
1,e
2,a
2,c
2,d
2,f
3,a
3,b
3,c
3,e

编写Hive的HQL语句来实现以下结果:
表中的1表示选修,表中的0表示未选修
id a b c d e f
1 1 1 1 0 1 0
2 1 0 1 1 0 1
3 1 1 1 0 1 0

解题要点行列转换。
解法1:
select id,
sum(case course when “a” then 1 else 0 end) as a,
sum(case course when “b” then 1 else 0 end) as b,
sum(case course when “c” then 1 else 0 end) as c,
sum(case course when “d” then 1 else 0 end) as d,
sum(case course when “e” then 1 else 0 end) as e,
sum(case course when “f” then 1 else 0 end) as f
from id_course group by id;

解法2:
先构造以下表
id id_courses courses
1 [“a”,”b”,”c”,”e”] [“a”,”b”,”c”,”d”,”e”,”f”]
2 [“a”,”c”,”d”,”f”] [“a”,”b”,”c”,”d”,”e”,”f”]
3 [“a”,”b”,”c”,”e”] [“a”,”b”,”c”,”d”,”e”,”f”]

1.左边:
select d.id as id, collect_set(d.course) as id_courses from id_course d group by d.id;

  1. 右边:
    select sort_array(collect_set(course)) as tt from id_course;

  2. 左右连接得到需要的表:
    create id_courses table as
    select a.id, a.id_courses, b.tt
    from
    (select d.id as id, collect_set(d.course) as id_courses from id_course d group by d.id) a
    join
    (select sort_array(collect_set(c.course)) as tt from id_course c) b ;

  3. 查询出最终结果
    使用if判断
    select
    id,
    if(array_contains(a.id_courses, courses[0]),1,0) as a,
    if(array_contains(a.id_courses, courses[1]),1,0) as b,
    if(array_contains(a.id_courses, courses[2]),1,0) as c,
    if(array_contains(a.id_courses, courses[3]),1,0) as d,
    if(array_contains(a.id_courses, courses[4]),1,0) as e,
    if(array_contains(a.id_courses, courses[5]),1,0) as f
    from id_courses a;

    原文作者:活英雄
    原文地址: https://www.jianshu.com/p/aa0b2ffd0368
    本文转自网络文章,转载此文章仅为分享知识,如有侵权,请联系博主进行删除。
点赞