Hive DML & Partitioned Tables

June 7, 2019. Source: 白面葫芦娃92

1.INSERT

Syntax from the official documentation:
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;

hive> create table ruozedata_emp4 like ruozedata_emp;
hive> INSERT OVERWRITE TABLE ruozedata_emp4 select * FROM ruozedata_emp;
hive> select * from ruozedata_emp4;


hive> INSERT INTO TABLE ruozedata_emp4 select * FROM ruozedata_emp;
hive> select * from ruozedata_emp4;

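The rows are appended to the existing data.

Both statements above use select *, so the source and target columns line up one to one. What happens if the select list has a different number of columns from the target table, or lists them in a different order?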

hive> INSERT OVERWRITE TABLE ruozedata_emp4
> SELECT empno,ename from ruozedata_emp;

The insert fails because the number of selected columns does not match the target table.

hive> INSERT INTO TABLE ruozedata_emp4
> select empno,job, ename,mgr, hiredate, salary, comm, deptno from ruozedata_emp;
(the column order differs from the table's: job is selected before ename)
hive> select * from ruozedata_emp4;

The data ends up scrambled, because Hive matches the selected columns to the target columns by position, not by name.
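The positional matching can be sketched in a few lines of Python. This is only a model of the behavior, not Hive code: the column order of the target table is assumed from the create table statements later in this post, and the sample row is hypothetical.

```python
# Assumed column order of the target table (taken from the CREATE TABLE
# statements later in this post; an assumption for ruozedata_emp4).
TARGET_COLUMNS = ["empno", "ename", "job", "mgr", "hiredate", "salary", "comm", "deptno"]


def insert_select(select_columns, source_row):
    """Model of INSERT ... SELECT: the i-th selected value lands in the i-th target column."""
    if len(select_columns) != len(TARGET_COLUMNS):
        # Corresponds to the failed insert with only two selected columns.
        raise ValueError("column count mismatch")
    values = [source_row[name] for name in select_columns]
    return dict(zip(TARGET_COLUMNS, values))


# Hypothetical sample row, for illustration only.
row = {"empno": 7369, "ename": "SMITH", "job": "CLERK", "mgr": 7902,
       "hiredate": "1980-12-17", "salary": 800.0, "comm": None, "deptno": 20}

# Same order as the second INSERT above: job is selected before ename.
swapped = insert_select(
    ["empno", "job", "ename", "mgr", "hiredate", "salary", "comm", "deptno"], row)
print(swapped["ename"], swapped["job"])  # CLERK SMITH
```

The job title ends up in the ename column and the name in the job column, which is the kind of scrambling seen in the query result.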

2.Writing data into the filesystem from queries

Syntax from the official documentation:
INSERT OVERWRITE [LOCAL] DIRECTORY directory1
[ROW FORMAT row_format] [STORED AS file_format] (Note: Only available starting with Hive 0.11.0)
SELECT ... FROM ...

3.The where clause

hive> select * from ruozedata_emp where deptno=10;
hive> select * from ruozedata_emp where empno>=8000;
hive> select * from ruozedata_emp where salary between 800 and 1500; (both bounds are inclusive)
hive> select * from ruozedata_emp limit 5;
hive> select * from ruozedata_emp where ename in ("SMITH","KING");
hive> select * from ruozedata_emp where ename not in ("SMITH","KING");
hive> select * from ruozedata_emp where comm is null;
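Two of the filters above are easy to get wrong: between includes both bounds, and a NULL test needs is null rather than = null. The sketch below checks both with Python's built-in sqlite3 module. SQLite is used here only as a stand-in for Hive (an assumption made so the example runs anywhere), and the three rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table emp (ename text, salary real, comm real)")
# Hypothetical rows: two sit exactly on the BETWEEN bounds, two have a NULL comm.
conn.executemany("insert into emp values (?, ?, ?)", [
    ("A", 800.0, None),
    ("B", 1500.0, 300.0),
    ("C", 1600.0, None),
])

# BETWEEN keeps rows equal to either bound.
between = [r[0] for r in conn.execute(
    "select ename from emp where salary between 800 and 1500 order by ename")]

# IS NULL finds the missing values.
is_null = [r[0] for r in conn.execute(
    "select ename from emp where comm is null order by ename")]

# "= null" never evaluates to true, so it matches nothing.
eq_null = [r[0] for r in conn.execute("select ename from emp where comm = null")]

print(between, is_null, eq_null)  # ['A', 'B'] ['A', 'C'] []
```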

4.Aggregate functions

max/min/count/sum/avg. Characteristic: many rows in, one value out.

hive> select count(1) from ruozedata_emp where deptno=10;

hive> select max(salary), min(salary), avg(salary), sum(salary) from ruozedata_emp;

5.Grouping with group by

1) Find the average salary of each department

hive> select deptno,avg(salary) from ruozedata_emp group by deptno;

hive> select ename,deptno,avg(salary) from ruozedata_emp group by deptno;
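FAILED: SemanticException [Error 10025]: Line 1:7 Expression not in GROUP BY key 'ename'

Cause of the error: every column in the select list that is not wrapped in an aggregate function must appear in the group by clause.

2) Find the highest salary (salary) for each department (deptno) and job (job)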

hive> select deptno,job,max(salary) from ruozedata_emp group by deptno,job;

3) Find the departments whose average salary is above 2000

hive> select deptno,avg(salary) from ruozedata_emp group by deptno having avg(salary)>2000;

where cannot be used here: filtering the grouped result requires having. where filters individual rows, and it must be written before group by.
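The difference between the two filters shows up clearly when both are run on the same data. The sketch below again uses Python's sqlite3 module as a stand-in for Hive (an assumption for runnability), with hypothetical salaries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table emp (deptno integer, salary real)")
# Hypothetical salaries for three departments.
conn.executemany("insert into emp values (?, ?)", [
    (10, 2450.0), (10, 5000.0), (10, 1300.0),
    (20, 800.0), (20, 1100.0),
    (30, 1600.0), (30, 2850.0),
])

# HAVING filters whole groups after the averages have been computed.
having_rows = conn.execute(
    "select deptno, avg(salary) from emp "
    "group by deptno having avg(salary) > 2000 order by deptno").fetchall()

# WHERE filters single rows before grouping, so the averages themselves change.
where_rows = conn.execute(
    "select deptno, avg(salary) from emp "
    "where salary > 2000 group by deptno order by deptno").fetchall()

print([deptno for deptno, _ in having_rows])  # [10, 30]
print(where_rows)  # [(10, 3725.0), (30, 2850.0)]
```

Both queries return departments 10 and 30 for this data, but only the having version reports the true department averages; the where version averages just the rows above 2000.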

4) case when then

select ename,salary,
case
when salary>1 and salary<=1000 then 'lower'
when salary>1000 and salary<=2000 then 'middle'
when salary>2000 and salary<=4000 then 'high'
else 'highest'
end
from ruozedata_emp;
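The branch logic of this case expression can be traced with a small Python function. It is a sketch of the semantics, not Hive code, and the function name salary_level is made up for illustration. One detail worth noticing: any salary that matches no when branch, including a salary of 1 or less and a NULL salary, falls through to else and is labeled 'highest'.

```python
def salary_level(salary):
    """Mirror of the CASE expression above; None plays the role of SQL NULL."""
    # With a NULL salary every comparison is unknown, so no WHEN branch
    # matches and the ELSE branch is taken.
    if salary is None:
        return "highest"
    if 1 < salary <= 1000:
        return "lower"
    if 1000 < salary <= 2000:
        return "middle"
    if 2000 < salary <= 4000:
        return "high"
    # Everything else, including salary <= 1, ends up here.
    return "highest"


for value in (800, 1500, 3000, 5000, 1, None):
    print(value, salary_level(value))
```

If low or missing salaries should not be reported as 'highest', the expression needs an explicit branch for them.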

5) join

hive> select * from a join b;

Without a join condition this produces a Cartesian product, which is very expensive and should be avoided.

inner join is the same as join

outer join: left join, right join, full join

hive> select a.id,a.name,b.age from a join b on a.id=b.id;

hive> select a.id,a.name,b.age from a left join b on a.id=b.id;

hive> select a.id,a.name,b.age from a full join b on a.id=b.id;
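To make the three results easier to compare, here is a plain Python model of the joins on id. The contents of tables a and b are hypothetical, and the join function is written only for this illustration.

```python
# Hypothetical table contents, for illustration only.
a = [(1, "zhangsan"), (2, "lisi"), (3, "wangwu")]  # (id, name)
b = [(1, 30), (2, 29), (4, 21)]                    # (id, age)


def join(left, right, how="inner"):
    """Return (a.id, a.name, b.age) rows for an inner, left or full join on id."""
    ages_by_id = {}
    for rid, age in right:
        ages_by_id.setdefault(rid, []).append(age)

    out, matched = [], set()
    for lid, name in left:
        ages = ages_by_id.get(lid)
        if ages:
            matched.add(lid)
            out.extend((lid, name, age) for age in ages)
        elif how in ("left", "full"):
            # No partner in b: keep the row from a, with NULL for b.age.
            out.append((lid, name, None))
    if how == "full":
        # Rows of b without a partner in a: a.id and a.name are NULL.
        out.extend((None, None, age) for rid, age in right if rid not in matched)
    return out


print(join(a, b, "inner"))  # [(1, 'zhangsan', 30), (2, 'lisi', 29)]
print(join(a, b, "left"))   # adds (3, 'wangwu', None)
print(join(a, b, "full"))   # adds (None, None, 21) as well
print(len(a) * len(b))      # 9 rows for the join without a condition
```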

6.Partitioned tables

1) Static partitions

hive> create table order_partition(
> ordernumber string,
> eventtime string
> )
> partitioned by (event_month string)
> row format delimited fields terminated by '\t';

hive> LOAD DATA LOCAL INPATH '/home/hadoop/data/order.txt'
> OVERWRITE INTO TABLE order_partition
> PARTITION(event_month='2014-05');

hive> select * from order_partition;
OK
10703007267488 2014-05-01 2014-05
10101043505096 2014-05-01 2014-05
10103043509747 2014-05-01 2014-05
10103043501575 2014-05-01 2014-05
10104043514061 2014-05-01 2014-05

hive> desc formatted order_partition;

[hadoop@hadoop001 data]$ hdfs dfs -ls hdfs://192.168.137.141:9000/user/hive/warehouse/ruozedata.db/order_partition

[hadoop@hadoop001 data]$ hdfs dfs -ls hdfs://192.168.137.141:9000/user/hive/warehouse/ruozedata.db/order_partition/event_month=2014-05

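hive> select * from order_partition where event_month='2014-05';
When querying a partitioned table, always filter on the partition column so that Hive reads only the matching partition directory.

Ways to add a partition

a) Above, the partition was created through Hive with LOAD DATA ... PARTITION, and HDFS shows a matching directory /order_partition/event_month=2014-05. Can a partition also be added by simply creating a new partition directory in HDFS?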

[hadoop@hadoop001 data]$ hdfs dfs -mkdir /user/hive/warehouse/ruozedata.db/order_partition/event_month=2014-06
[hadoop@hadoop001 data]$ hdfs dfs -put order.txt /user/hive/warehouse/ruozedata.db/order_partition/event_month=2014-06

A new directory /event_month=2014-06 has been created and order.txt has been put into it.

[hadoop@hadoop001 data]$ hdfs dfs -ls hdfs://192.168.137.141:9000/user/hive/warehouse/ruozedata.db/order_partition

hive> select * from order_partition where event_month='2014-06';

Querying the event_month=2014-06 partition in Hive returns no rows.

Check the metadata in MySQL:

mysql> show databases;
mysql> use ruozedata_basic03;
mysql> show tables;

mysql> select * from partitions;

mysql> select * from partition_keys;

mysql> select * from partition_key_vals;

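The metadata contains only one partition, event_month=2014-05. Why?

The official documentation explains it: the metastore is not aware of partition directories that are added directly to HDFS. Running the msck command repairs the metadata stored in MySQL, and only then can Hive return the data.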

hive> MSCK REPAIR TABLE order_partition;
OK
Partitions not in metastore: order_partition:event_month=2014-06
Repair: Added partition to metastore order_partition:event_month=2014-06

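Checking the metadata in MySQL again shows that the partition information has been added.

hive> select * from order_partition where event_month='2014-06';

However, this command rescans the partition information of the whole table. It is a blunt tool and is not recommended for routine use.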
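The gap between what is on disk and what the metastore knows can be modeled in a few lines of Python. This is a conceptual sketch only: a local temporary directory stands in for the HDFS table directory, a Python set stands in for the metastore tables in MySQL, and the function names are invented for the illustration.

```python
import os
import tempfile


def partitions_on_disk(table_dir):
    """Partition directories are named key=value under the table directory."""
    return {
        name for name in os.listdir(table_dir)
        if "=" in name and os.path.isdir(os.path.join(table_dir, name))
    }


def msck_repair(table_dir, metastore_partitions):
    """Scan the whole table directory and register every unknown partition."""
    missing = partitions_on_disk(table_dir) - metastore_partitions
    metastore_partitions |= missing  # in-place update of the caller's set
    return sorted(missing)


# Stand-in for the table directory, with one partition registered through Hive.
table_dir = tempfile.mkdtemp()
os.mkdir(os.path.join(table_dir, "event_month=2014-05"))
metastore = {"event_month=2014-05"}

# A directory created by hand (like hdfs dfs -mkdir) does not touch the metastore.
os.mkdir(os.path.join(table_dir, "event_month=2014-06"))
visible_before = sorted(metastore)

added = msck_repair(table_dir, metastore)
visible_after = sorted(metastore)
print(visible_before, added, visible_after)
```

Because the repair lists every directory under the table, its cost grows with the total number of partitions, not with the number of new ones.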

b) Use Add Partitions instead

ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec [LOCATION 'location'][, PARTITION partition_spec [LOCATION 'location'], ...];
partition_spec:
: (partition_column = partition_col_value, partition_column = partition_col_value, ...)

Try it again:

[hadoop@hadoop001 data]$ hdfs dfs -mkdir /user/hive/warehouse/ruozedata.db/order_partition/event_month=2014-07
[hadoop@hadoop001 data]$ hdfs dfs -put order.txt /user/hive/warehouse/ruozedata.db/order_partition/event_month=2014-07
[hadoop@hadoop001 data]$ hdfs dfs -ls /user/hive/warehouse/ruozedata.db/order_partition/event_month=2014-07
Found 1 items
-rw-r--r-- 1 hadoop supergroup 217 2018-06-17 21:11 /user/hive/warehouse/ruozedata.db/order_partition/event_month=2014-07/order.txt
hive> select * from order_partition where event_month='2014-07';
OK (no rows returned)

Now run the Add Partitions command:

hive> ALTER TABLE order_partition ADD IF NOT EXISTS
> PARTITION (event_month='2014-07') ;
hive> select * from order_partition where event_month='2014-07';

c) Another method is to insert from a query:

hive> create table order_4_partition(
> ordernumber string,
> eventtime string
> )
> row format delimited fields terminated by '\t';

hive> load data local inpath '/home/hadoop/data/order.txt' overwrite into table order_4_partition;

hive> insert overwrite table order_partition
> partition(event_month='2014-08')
> select * from order_4_partition;

hive> select * from order_partition where event_month='2014-08';

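To list the partitions a table currently has in Hive:

hive> show partitions order_partition;

A single-level static partition is the simplest kind of partition. The partition value is independent of the contents of the table's columns, and a column used as a partition column cannot also be declared as a regular column of the table.

d) Multi-level partitions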

hive> create table order_mulit_partition(
> ordernumber string,
> eventtime string
> )
> partitioned by (event_month string,event_day string)
> row format delimited fields terminated by '\t';
hive> desc formatted order_mulit_partition;

hive> LOAD DATA LOCAL INPATH '/home/hadoop/data/order.txt'
> OVERWRITE INTO TABLE order_mulit_partition
> PARTITION(event_month='2014-05', event_day='01');
hive> select * from order_mulit_partition where event_month='2014-05' and event_day='01';

In HDFS the directories are nested, one level per partition column: order_mulit_partition/event_month=2014-05/event_day=01

2) Dynamic partitions

hive> create table ruozedata_static_emp
> (empno int, ename string, job string, mgr int, hiredate string, salary double, comm double,deptno string)
> PARTITIONED by(deptno string)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\t';
FAILED: SemanticException [Error 10035]: Column repeated in partitioning columns (error: a partition column must not also be declared as a regular column)

hive> create table ruozedata_static_emp
> (empno int, ename string, job string, mgr int, hiredate string, salary double, comm double)
> PARTITIONED by(deptno string)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\t';

hive> select * from ruozedata_static_emp;
OK (the table is empty)

Write the rows of ruozedata_emp with deptno=10 into the deptno='10' partition of ruozedata_static_emp:

hive> insert into table ruozedata_static_emp partition(deptno='10')
> select empno,ename,job,mgr,hiredate,salary,comm from ruozedata_emp
> where deptno=10;

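hive> select * from ruozedata_static_emp;

The above is still static partitioning. When many partitions are needed (say 1000), writing one statement per partition costs too much time and effort.

That is where dynamic partitioning comes in.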

hive> create table ruozedata_dynamic_emp
> (empno int, ename string, job string, mgr int, hiredate string, salary double, comm double)
> PARTITIONED by(deptno string)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\t' ;

Dynamic partitioning has one firm requirement: the partition column must come last in the select list.

hive> insert into table ruozedata_dynamic_emp partition(deptno)
> select empno,ename,job,mgr,hiredate,salary,comm,deptno from ruozedata_emp ;

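In strict mode, which is the mode in effect in this environment, the statement is rejected, because strict mode requires at least one static partition column. Switch to nonstrict mode:

hive> set hive.exec.dynamic.partition.mode=nonstrict; (this only lasts for the current session; after re-entering hive the mode is strict again)
[set hive.exec.dynamic.partition.mode=nonstrict;
is an instance of the key=value form commonly used for settings in hive.
Syntax:
set key=value; sets a value
set key; shows the current value]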

hive> insert into table ruozedata_dynamic_emp partition(deptno)
> select empno,ename,job,mgr,hiredate,salary,comm,deptno from ruozedata_emp ;


hive> show partitions ruozedata_dynamic_emp;
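What the dynamic insert does with each row can be sketched in Python: the last selected value is taken as the partition value, and the remaining values are stored under that partition. This is a model of the idea, not of Hive's implementation; the function name and the shortened sample rows are hypothetical.

```python
from collections import defaultdict


def dynamic_partition_insert(rows, partition_column="deptno"):
    """Route each row to a partition named after its LAST value."""
    partitions = defaultdict(list)
    for row in rows:
        *data, part_value = row  # the partition column comes last in the select list
        partitions[f"{partition_column}={part_value}"].append(tuple(data))
    return dict(partitions)


# Hypothetical rows shaped as (empno, ename, deptno).
rows = [
    (7369, "SMITH", 20),
    (7499, "ALLEN", 30),
    (7782, "CLARK", 10),
    (7839, "KING", 10),
]

result = dynamic_partition_insert(rows)
print(sorted(result))        # ['deptno=10', 'deptno=20', 'deptno=30']
print(result["deptno=10"])   # [(7782, 'CLARK'), (7839, 'KING')]
```

One statement therefore creates as many partitions as there are distinct values in the last column, which is why the position of the partition column in the select list matters.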

    Original author: 白面葫芦娃92
    Original URL: https://www.jianshu.com/p/d453f70f59ff