Hive第三天

1. 后台启动HIVE的JDBC连接(就是可以把启动jdbc的页面可以关闭并且jdbc连接不会断开)
    0 表示标准输入
    1 表示标准输出
    2 表示标准错误输出
    nohup 表示挂起
    最后的 & 表示 后台启动
    nohup hive -service hiveserver2 > /usr/local/soft/hive-3.1.2/log/hiveserver2.log 2>&1 &

    jps 查看 Runjob 

    关闭后台的JDBC : kill -9 关闭 RunJob对应的PID

1.HIVE WORDCOUNT

CREATE TABLE learn3.wordcount(
word STRING COMMENT “单词”
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘\t’
STORED AS TEXTFILE;

INSERT INTO TABLE learn3.wordcount (word) VALUES (“hello,word”),(“hello,java”),(“hive,hello”);

— 统计WORDCOUNT

SELECT
T1.col
, sum(num)
FROM (
SELECT
col2 as col,
count(*) as num
FROM (

SELECT
split(word,”,”)[0] as col1,
split(word,”,”)[1] as col2
FROM learn3.wordcount

)T GROUP BY T.col2
UNION ALL
SELECT
col1 as col,
count(*) as num
FROM (

SELECT
split(word,”,”)[0] as col1,
split(word,”,”)[1] as col2
FROM learn3.wordcount

)T GROUP BY T.col1
)T1 GROUP BY T1.col

2. WITH AS 用法
格式:
WITH table1  AS (
 SELECT 查询语句1
)
, table2 AS (
 SELECT 查询语句2
)
[INSERT INTO TABLE] SELECT FROM 

–WORDCOUNT的 WITH AS 用法
WITH split_res AS (
SELECT
split(word,”,”)[0] as col1,
split(word,”,”)[1] as col2
FROM learn3.wordcount
)
, col1_count AS(
SELECT
col1 as col
,count(*) as num
FROM split_res
GROUP BY col1
)

, col2_count AS(
SELECT
col2 as col
,count(*) as num
FROM split_res
GROUP BY col2
)

SELECT
T.col,
sum(T.num)
FROM (
SELECT * FROM col1_count
UNION ALL
SELECT * FROM col2_count
)T GROUP BY T.col

3. 集合函数
    COLLECT_LIST(column2) 
        需要跟GROUP BY column1 配合使用,将column1中相同的组column2数据放至一个集合中
    
    COLLECT_SET()
        需要跟GROUP BY column1 配合使用,将column1中相同的组column2数据放至一个集合中,并对集合中的数据进行去重操作

–需求:
    将一列中相同的内容放至一组数据中
        将 word列中所有相同的单词对应的num 放至一个数组中

CREATE TABLE learn3.wordcount2(
word STRING COMMENT “单词”,
num int COMMENT “数量”
)
STORED AS TEXTFILE;

INSERT INTO TABLE learn3.wordcount2 (word,num) VALUES (“hello”,1),(“hello”,2),(“hive”,3);

SELECT 
word
,COLLECT_LIST(num)
FROM learn3.wordcount2
GROUP BY word
+——–+————+
|  word  |    _c1     |
+——–+————+
| hello  | [1,2,1,2]  |
| hive   | [3,3]      |

SELECT 
word
,COLLECT_SET(num)
FROM learn3.wordcount2
GROUP BY word

| hello  | [1,2]  |
| hive   | [3]    |

CREATE EXTERNAL TABLE IF NOT EXISTS learn3.student20(
id STRING COMMENT “学生ID”,
name STRING COMMENT “学生姓名”,
age int COMMENT “年龄”,
gender STRING COMMENT “性别”,
subject STRING COMMENT “学科”
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY “,”
STORED AS TEXTFILE;

load data local inpath “/usr/local/soft/hive-3.1.2/data/student_20.txt” INTO TABLE learn3.student20;

理科|男  1500100020|杭振凯,1500100019|娄曦之
① 将两列数据进行拼接

WITH concat_stu AS(
SELECT
CONCAT(subject,”|”,gender) as subject_gender
,CONCAT_WS(“|”,id,name) as id_name
FROM learn3.student20
)

SELECT 
subject_gender
,CONCAT_WS(“,”,COLLECT_LIST(id_name))
FROM concat_stu
GROUP BY subject_gender

| 文科|女            | [“1500100001|施笑槐”,”1500100007|尚孤风”,”1500100016|潘访烟”,”1500100018|骆怜雪”] |
| 文科|男            | [“1500100002|吕金鹏”,”1500100013|逯君昊”]                |
| 理科|女            | [“1500100003|单乐蕊”,”1500100005|宣谷芹”,”1500100008|符半双”,”1500100012|梁易槐”,”1500100015|宦怀绿”,”1500100017|高芷天”] |
| 理科|男            | [“1500100004|葛德曜”,”1500100006|边昂雄”,”1500100009|沈德昌”,”1500100010|羿彦昌”,”1500100011|宰运华”,”1500100014|羿旭炎”,”1500100019|娄曦之”,”1500100020|杭振凯”] |

| concat_ws(separator, [string | array(string)]+) – returns the concatenation of the strings separated by the separator. |

4.行列互换
+—————–+—————–+
| subject_gender  |     id_name     |
+—————–+—————–+
| 文科|女            | 1500100001|施笑槐  |
| 文科|男            | 1500100002|吕金鹏  |
| 理科|女            | 1500100003|单乐蕊  |
| 理科|男            | 1500100004|葛德曜  |
| 理科|女            | 1500100005|宣谷芹  |
| 理科|男            | 1500100006|边昂雄  |
| 文科|女            | 1500100007|尚孤风  |
| 理科|女            | 1500100008|符半双  |
| 理科|男            | 1500100009|沈德昌  |
| 理科|男            | 1500100010|羿彦昌  |
| 理科|男            | 1500100011|宰运华  |
| 理科|女            | 1500100012|梁易槐  |
| 文科|男            | 1500100013|逯君昊  |
| 理科|男            | 1500100014|羿旭炎  |
| 理科|女            | 1500100015|宦怀绿  |
| 文科|女            | 1500100016|潘访烟  |
| 理科|女            | 1500100017|高芷天  |
| 文科|女            | 1500100018|骆怜雪  |
| 理科|男            | 1500100019|娄曦之  |
| 理科|男            | 1500100020|杭振凯  |

| 文科|女            | 1500100001|施笑槐,1500100007|尚孤风,1500100016|潘访烟,1500100018|骆怜雪 |
| 文科|男            | 1500100002|吕金鹏,1500100013|逯君昊                      |
| 理科|女            | 1500100003|单乐蕊,1500100005|宣谷芹,1500100008|符半双,1500100012|梁易槐,1500100015|宦怀绿,1500100017|高芷天 |
| 理科|男            | 1500100004|葛德曜,1500100006|边昂雄,1500100009|沈德昌,1500100010|羿彦昌,1500100011|宰运华,1500100014|羿旭炎,1500100019|娄曦之,1500100020|杭振凯 |

行转列:
    将原先多行数据转成一行
    转换方式:
        通过COLLECT_SET() 或者 COLLECT_LIST() 和 GROUP BY 进行配合使用
        将GROUP BY 分组的数据进行存放于一个集合当中

列传行:
    将一行数据转换为多行数据
    转换:

案例:
| wordcount.word  |
+—————–+
| hello,word,hive |这里再去用切分之后聚合来实现行转列会使得逻辑过于复杂
| hello,java      |
| hello,hive      |
| hello,word      |
| hello,java      |
| hive,hello      |

结果:
+———-+
| _u1.col  |
+———-+
| hello    |
| hive     |
| java     |
| word     |
| hello    |
| hive     |
+———-+
INSERT INTO TABLE learn3.wordcount (word) VALUES (“hello,word,hello,java,hello,spark”);

EXPLODE() 函数
    将集合中的一行数据转换为多行

SELECT
EXPLODE(split(word,”,”)) as word
FROM learn3.wordcount

CREATE TABLE learn3.movie(
movie_name STRING COMMENT “电影名”,
types STRING COMMENT “电影类型”
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘,’
STORED AS TEXTFILE;

load data local inpath “/usr/local/soft/hive-3.1.2/data/moive.txt” INTO TABLE learn3.movie;

— 将types列中的电影类型进行分隔,并且与电影名进行对应
由  |肖申克的救赎            | 犯罪/剧情        |  
转换为:
  肖申克的救赎 犯罪
  肖申克的救赎 剧情

SELECT
movie_name,EXPLODE(split(types,”/”)) as type
FROM learn3.movie
这个语句是错误的

lateral view—行转列
lateral view用于和split, explode等UDTF—起使用,它能够将一行数据拆成多行数据,在此基础上可以对拆分后的数据进行聚合。
lateral view首先为原始表的每行调用UDTF会把一行分成一或者多行,lateral view再把结果组合,产生一个支持别名表的虚拟表。
explode函数参数仅接受array和map类型,不支持两个一起用。所以lateral view可以解决
SELECT name,new_num FROM laterv_tb LATERAL VIEW explode(weight) num AS new_num;

SELECT
movie_name,type
FROM learn3.movie LATERAL VIEW EXPLODE(split(types,”/”)) view as type

| movie.movie_name  | movie.types  |
+——————-+————–+
| 肖申克的救赎            | 犯罪/剧情        |
| 霸王别姬              | 剧情/爱情/同性     |
| 阿甘正传              | 剧情/爱情        |
| 泰坦尼克号             | 剧情/爱情/灾难     |
| 这个杀手不太冷           | 剧情/动作/犯罪     |

| movie_name  | type  |
+————-+——-+
| 肖申克的救赎      | 犯罪    |
| 肖申克的救赎      | 剧情    |
| 霸王别姬        | 剧情    |
| 霸王别姬        | 爱情    |
| 霸王别姬        | 同性    |
| 阿甘正传        | 剧情    |
| 阿甘正传        | 爱情    |
| 泰坦尼克号       | 剧情    |
| 泰坦尼克号       | 爱情    |
| 泰坦尼克号       | 灾难    |
| 这个杀手不太冷     | 剧情    |
| 这个杀手不太冷     | 动作    |
| 这个杀手不太冷     | 犯罪    |

LATERAL VIEW EXPLODE(split(types,”/”)) view as type
    解析:  
    ① 通过split方法将types中的字符串切分为数组
    ② 通过EXPLODE方法将数组由一行数据转换为多行
    ③ 通过LATERAL VIEW 将EXPLODE转换的结果包装成一个名为view的一个侧写表,并且列名为type
    ④ 通过全连接将侧写表中的数据与原表 learn3.movie 中的一行数据进行全连接 

5.

CREATE EXTERNAL TABLE IF NOT EXISTS learn3.student1(
id STRING COMMENT “学生ID”,
name STRING COMMENT “学生姓名”,
age int COMMENT “年龄”,
gender STRING COMMENT “性别”,
clazz STRING COMMENT “班级”
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY “,”;

load data local inpath “/usr/local/soft/hive-3.1.2/data/students.txt” INTO TABLE learn3.student1;

CREATE EXTERNAL TABLE IF NOT EXISTS learn3.score1(
id STRING COMMENT “学生ID”,
subject_id STRING COMMENT “科目ID”,
score int COMMENT “成绩”
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY “,”

load data local inpath “/usr/local/soft/hive-3.1.2/data/score.txt” INTO TABLE learn3.score1;

需求:
    1.统计各性别年龄前三
— HIVE 原先老版本 不支持这种写法  
SELECT
s1.*
FROM learn3.student1 as s1
WHERE 3 > (SELECT count(*) FROM learn3.student1 as s2
WHERE s1.gender = s2.gender and s1.age < s2.age 
)

    2.统计各班级学生总成绩前三名
①算出学生总成绩
WITH score_sum AS (
SELECT
id
,sum(score) as total_score
FROM learn3.score1
GROUP BY id
)

, studen_score AS (
SELECT
T1.name
,T2.total_score
,T1.clazz
FROM learn3.student1 T1
JOIN score_sum T2 ON T1.id = T2.id 
)

–② 排序获取前三名
排序后数据【WITH score_sum AS(
SELECT
id 
,sum(score) as total_score
FROM score1
GROUP BY id 
)
,student_score AS(
SELECT
t1.name
,t2.total_score
,t1.clazz
from student1 t1
JOIN score_sum t2 on t1.id=t2.id
)

SELECT 
t.name
,t.total_score
,t.clazz
,row_number()over(PARTITION BY t.clazz ORDER BY t.total_score DESC)AS pm
FROM student_score t;】

SELECT
TT.*
FROM (
SELECT
T.name
,T.total_score
,T.clazz
, ROW_NUMBER() OVER(PARTITION BY T.clazz ORDER BY T.total_score DESC) as row_pm
FROM studen_score T
) TT
WHERE TT.row_pm <= 3

SELECT
T.name
,T.total_score
,T.clazz
, ROW_NUMBER() OVER(PARTITION BY T.clazz ORDER BY T.total_score DESC) as row_pm
, dense_rank() OVER(PARTITION BY T.clazz ORDER BY T.total_score DESC) as dense_pm
, rank() OVER(PARTITION BY T.clazz ORDER BY T.total_score DESC) as rank_pm
, percent_rank() OVER(PARTITION BY T.clazz ORDER BY T.total_score DESC) as percent_rank_pm
, cume_dist() OVER(PARTITION BY T.clazz ORDER BY T.total_score DESC) as cume_dist_pm
FROM studen_score T

##### row_number:无并列排名

* 用法: select xxxx, row_number() over(partition by 分组字段 order by 排序字段 desc) as rn from tb group by xxxx

##### dense_rank:有并列排名,并且依次递增

* 用法: select xxxx, dense_rank() over(partition by 分组字段 order by 排序字段 desc) as rn from tb group by xxxx

##### rank:有并列排名,不依次递增

* 用法: select xxxx, rank() over(partition by 分区字段 order by 排序字段 desc) as rn from tb group by xxxx

##### percent_rank:(rank的结果-1)/(分区内数据的个数-1)

* 用法: select xxxx, percent_rank() over(partition by 分组字段 order by 排序字段 desc) as rn from tb group by xxxx

需求1:
    取每个班级总分最大的同学

CREATE TABLE learn3.student_score(
name STRING COMMENT “”,
total_score int COMMENT “”,
clazz STRING COMMENT “”
);

WITH score_sum AS (
SELECT
id
,sum(score) as total_score
FROM learn3.score1
GROUP BY id
)

INSERT INTO TABLE learn3.student_score
SELECT
T1.name
,T2.total_score
,T1.clazz
FROM learn3.student1 T1
JOIN score_sum T2 ON T1.id = T2.id 

— 通过max方法取出每个班级分区中的学生成绩最大值 
SELECT
TT.*
FROM (
SELECT
T1.*
,max(T1.total_score) OVER(PARTITION BY T1.clazz ORDER BY T1.total_score DESC) as max_score
FROM learn3.student_score T1
) TT WHERE TT.total_score = TT.max_score

— 如果max() OVER() 中的窗口over()没有给定分区,那么当前的窗口表示整个学校,得到的数据是整个学校的最高的分数

SELECT
TT.*
FROM (
SELECT
T1.*
,max(T1.total_score) OVER() as max_score
FROM learn3.student_score T1
) TT WHERE TT.total_score = TT.max_score

max
    用法:
            ① max(T1.total_score) OVER(PARTITION BY T1.clazz ORDER BY T1.total_score DESC) 
                基于每个partition分区内数据取最大值
            ② max(T1.total_score) OVER()
                基于整个数据集取最大值
min、avg、count、sum:与max方法使用一致

6. 《《《《   卡口流量需求分析  》》》》

CREATE TABLE learn3.veh_pass(
id STRING COMMENT “卡口编号”,
pass_time STRING COMMENT “进过时间”,
pass_num int COMMENT “过车数”
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY “,”
STORED AS TEXTFILE;

load data local inpath “/usr/local/soft/hive-3.1.2/data/veh_pass.txt” INTO TABLE learn3.veh_pass; 

需求1:查询当月的设备及其总数

— 写法1
SELECT
T.id
,count(*) OVER()
FROM (
SELECT
id 
,pass_time
FROM learn3.veh_pass
WHERE substr(pass_time,1,7) = substr(current_date,1,7)
) T GROUP BY T.id

— 错误写法
SELECT
DISTINCT id 
,count(*) OVER()
FROM (
SELECT
id 
,pass_time
FROM learn3.veh_pass
WHERE substr(pass_time,1,7) = substr(current_date,1,7)      这里substr 是对字符串的切分, CURRENT_DATE()函数返回当前日期。
)T 

— 写法2:
SELECT
T1.id
,count(*) OVER()
FROM (
SELECT
DISTINCT id 这里distinct的意思是去重的意思
FROM (
SELECT
id 
,pass_time
FROM learn3.veh_pass
WHERE substr(pass_time,1,7) = substr(current_date,1,7)
)T )T1

+———————+—————–+
|        t1.id        | count_window_0  |
+———————+—————–+
| 451000000000071117  | 5               |
| 451000000000071116  | 5               |
| 451000000000071115  | 5               |
| 451000000000071114  | 5               |
| 451000000000071113  | 5               |
+———————+—————–+

+———————+
|         id          |
+———————+
| 451000000000071113  |
| 451000000000071114  |
| 451000000000071115  |
| 451000000000071116  |
| 451000000000071117  |
+———————+

— 需求2:查询所有流量明细及所有设备月流量总额

SELECT
T1.id
,T1.pass_time
,T1.pass_num
,SUM(T1.pass_num) OVER(PARTITION BY SUBSTRING(T1.pass_time,1,7)) as total_pass     SUBSTRING 即substr 字符串截取
FROM learn3.veh_pass T1 

需求3:按设备编号日期顺序展示明细 并求
        1)从第一天开始到当前天数 对流量进行累加
        2)昨天与当前天流量累加
        3)当前天数的前一天与后一天流量累加
        4)当前天与下一天的累加和
        5)当前天数与之后所有天流量累加和

unbounded:无界限
preceding:从分区第一行头开始,则为 unbounded。 N为:相对当前行向前的偏移量
following :与preceding相反,到该分区结束,则为 unbounded。N为:相对当前行向后的偏移量
current row:顾名思义,当前行,偏移量为0

1)从第一天开始到当前天数 对流量进行累加
SELECT
T1.*
,SUM(T1.pass_num) OVER(ORDER BY T1.pass_time ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
FROM (
SELECT 
*
FROM learn3.veh_pass ORDER BY pass_time
) T1

2)昨天与当前天流量累加

SELECT
T1.*
,SUM(T1.pass_num) OVER(ORDER BY T1.pass_time ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
FROM (
SELECT 
*
FROM learn3.veh_pass ORDER BY pass_time
) T1

3)当前天数的前一天与后一天流量累加

SELECT
T1.*
,SUM(T1.pass_num) OVER(ORDER BY T1.pass_time ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
FROM (
SELECT 
*
FROM learn3.veh_pass ORDER BY pass_time
) T1

4)当前天与下一天的累加和
SELECT
T1.*
,SUM(T1.pass_num) OVER(ORDER BY T1.pass_time ROWS BETWEEN CURRENT ROW  AND 1 FOLLOWING)
FROM (
SELECT 
*
FROM learn3.veh_pass ORDER BY pass_time
) T1

5)当前天数与之后所有天流量累加和
SELECT
T1.*
,SUM(T1.pass_num) OVER(ORDER BY T1.pass_time ROWS BETWEEN CURRENT ROW  AND UNBOUNDED FOLLOWING)
FROM (
SELECT 
*
FROM learn3.veh_pass ORDER BY pass_timeun
) T1

需求4:查询每个设备编号上次有数据日期和下一次有数据日期

SELECT
T1.*
, LAG(T1.pass_time,1,”2022-01-01″) OVER(PARTITION BY T1.id ORDER BY T1.pass_time) as before_time
, LEAD(T1.pass_time,1,”2022-12-31″) OVER(PARTITION BY T1.id ORDER BY T1.pass_time) as after_time
FROM learn3.veh_pass T1

总结:
    OVER(): 会为每条数据都开启一个窗口,默认窗口大小就是当前数据集的大小
    OVER(PARTITION BY) 会按照指定的字段进行分区,在获取一条数据时,窗口大小为整个分区的大小,之后根据分区中的数据进行计算
    OVER(PARTITION BY … ORDER BY …) 根据给定的分区,在获取一条数据时,窗口大小为整个分区的大小,并且对分区中的数据进行排序

OVER中的取数据格式
(ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING

OVER():指定分析函数工作的数据窗口大小,这个数据窗口大小可能会随着行的改变而变化。
CURRENT ROW:当前行
n PRECEDING:往前n行数据
n FOLLOWING:往后n行数据
UNBOUNDED :起点,
    UNBOUNDED PRECEDING 表示从前面的起点, 
    UNBOUNDED FOLLOWING 表示到后面的终点
LAG(col,n,default_val):往前第n行数据
LEAD(col,n, default_val):往后第n行数据
NTILE(n):把有序窗口的行分发到指定数据的组中,各个组有编号,编号从1开始,对于每一行,NTILE返回此行所属的组的编号。

— 自定义UDF函数

    老版本UDF 不推荐使用:
    ① 创建自定义类继承UDF 注意 自定义函数名必须使用 evaluate 不然识别不到
public class MyUDFAddString extends UDF {

    /**
     * 定义函数名 evaluate
     * 实现将传入的String 增加后缀 ###
     *
     * @param col HIVE中使用函数时传入的数据
     * @return 一行数据
     */
    public String evaluate(String col) {
        return col + “###”;
    }

}
    ② 将代码打包,添加jar包至HIVE中
        add jar /usr/local/soft/test/HiveCode15-1.0-SNAPSHOT.jar;

    ③ 创建临时自定义函数
        CREATE TEMPORARY FUNCTION my_udf as “com.shujia.udf.MyUDFAddString”;

    原文作者:weixin_48508747
    原文地址: https://blog.csdn.net/weixin_48508747/article/details/124180734
    本文转自网络文章,转载此文章仅为分享知识,如有侵权,请联系博主进行删除。
点赞