Hive Series: Partitioned Tables and Buckets

April 19, 2023. Source: wujustin

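To speed up queries and writes, Hive provides partitioned tables. Each table can declare one or more partition keys, and these keys determine how the data is laid out in storage. For example, if table T has a date partition column ds, its rows are stored under <table HDFS path>/ds=<date>. A query with a filter such as ds='2008-09-01' then reads only the data under that one directory instead of scanning the whole table, which is where the performance gain comes from.
Hive supports two ways of partitioning: static and dynamic. The main difference shows up when loading data: with dynamic partitioning you do not specify the partition key value. Hive routes each row to a partition based on the value of the corresponding column, and if the partition directory for that value does not exist yet, it is created automatically before the data is written. The walkthrough below demonstrates both: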
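The pruning described above can be sketched as follows. The table name T and the column ds come from the text; the columns id and payload, and the use of STRING for ds, are assumptions made only for illustration.

```sql
-- Hypothetical table T, partitioned by ds (columns id/payload are made up)
CREATE TABLE T (id INT, payload STRING)
PARTITIONED BY (ds STRING);

-- The filter on the partition column lets Hive read only the
-- directory <table path>/ds=2008-09-01
SELECT id, payload
FROM T
WHERE ds = '2008-09-01';

-- Without a filter on ds, every partition directory is scanned
SELECT count(*) FROM T;
```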

1. Static partitions
Create the partitioned table

hive>create table teacher(id INT, name string, tno string)
partitioned by (work_date string)
clustered by (id) sorted by (name) into 2 buckets
row format delimited fields terminated by ',' stored as textfile;

Load data into a static partition

hive>load data local inpath '/home/warehouse/user.txt' overwrite into table teacher partition(work_date='2016-07-12');

Contents of user.txt:
1, t1, 01
2, t2, 02
3, t3, 03
4, t4, 04

After the partition is created, check the HDFS directory

hive>dfs -ls /user/hive/warehouse/crawl_db.db/teacher/
drwxrwxrwx - warehouse supergroup 0 2016-07-12 09:03 /user/hive/warehouse/crawl_db.db/teacher/work_date=2016-07-12

The partition directory work_date=2016-07-12 has been created.
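Besides listing the HDFS directory, the partitions can also be inspected from the metastore side. A minimal sketch, assuming the teacher table and the partition loaded above:

```sql
-- List all partitions registered for the table
SHOW PARTITIONS teacher;

-- Show location and storage details of a single partition
DESCRIBE FORMATTED teacher PARTITION (work_date = '2016-07-12');
```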

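2. Dynamic partitions

First, set the parameters related to dynamic partitioning:

-- Enable dynamic partitioning (check the current value with: set hive.exec.dynamic.partition;)
set hive.exec.dynamic.partition=true;
-- strict requires at least one static partition; nonstrict allows all partitions to be dynamic
set hive.exec.dynamic.partition.mode=nonstrict;
-- Upper limit on the total number of dynamic partitions; exceeding it raises an error
set hive.exec.max.dynamic.partitions=100000;
-- Upper limit on the number of dynamic partitions created by each mapper/reducer node
set hive.exec.max.dynamic.partitions.pernode=100000;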

Create a staging table to load the raw data into, then insert its rows into the partitioned table.

hive>create table tmp (
id int, name string, tno string, work_date string)
row format delimited fields terminated by ',' stored as textfile;

Data in the local file

$ cat user1.txt
1,root,01,2016-07-11
2,sys,02,2016-07-11
3,user01,03,2016-07-11
4,user02,04,2016-07-11
5,user03,05,2016-07-11
6,user04,06,2016-06-11
7,user05,07,2016-06-11
8,user06,08,2016-06-11
9,user07,09,2016-06-11
10,user08,10,2016-05-11
11,user09,11,2016-05-11
12,user10,12,2016-05-11

Load the data into the staging table

hive>load data local inpath '/home/warehouse/user1.txt' overwrite into table tmp;

Load the data from the staging table into the partitioned table

hive>insert into table teacher partition(work_date) select id, name, tno, work_date from tmp;
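The teacher table is declared with 2 buckets sorted by name. On Hive releases before 2.0, an insert only distributes rows into the declared buckets when bucketing enforcement is switched on; from 2.0 onward these settings were removed and enforcement is always on. A minimal sketch for older releases, to be run before the insert above (the original walkthrough does not say which Hive version it used, so treat this as an assumption about the environment):

```sql
-- Only needed on Hive versions before 2.0
SET hive.enforce.bucketing = true;  -- write one file per declared bucket
SET hive.enforce.sorting = true;    -- honor SORTED BY within each bucket
```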

Check the partitions in HDFS again

hive>dfs -ls /user/hive/warehouse/crawl_db.db/teacher/
drwxrwxrwx - warehouse supergroup 0 2016-07-12 16:43 /user/hive/warehouse/crawl_db.db/teacher/work_date=2016-05-11
drwxrwxrwx - warehouse supergroup 0 2016-07-12 16:43 /user/hive/warehouse/crawl_db.db/teacher/work_date=2016-06-11
drwxrwxrwx - warehouse supergroup 0 2016-07-12 16:43 /user/hive/warehouse/crawl_db.db/teacher/work_date=2016-07-11

The work_date column of the staging table tmp contains three distinct values: 2016-05-11, 2016-06-11, and 2016-07-11. When the rows are inserted into teacher, which is partitioned by work_date, Hive detects these three values and creates one directory for each. With static partitioning you would instead need one insert statement per partition value, like this:

insert into table teacher partition(work_date='2016-05-11') select id, name, tno from tmp where work_date='2016-05-11';

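Points to note:

When one insert mixes static and dynamic partition columns, the static ones must come before the dynamic ones in the partition clause.
When choosing a partition key, guard against data skew, where rows are distributed very unevenly across partitions.
With dynamic partitioning, the values of the partition column should be predictable and enumerable. Too many directories that each hold very little data will seriously hurt performance.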
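The ordering rule for mixed static and dynamic partition columns can be sketched as follows. The table teacher_by_country and its country partition column are hypothetical and exist only for this illustration; tmp is the staging table from section 2.

```sql
-- Hypothetical table with two partition columns
CREATE TABLE teacher_by_country (id INT, name STRING, tno STRING)
PARTITIONED BY (country STRING, work_date STRING);

-- Accepted: the static column (country) precedes the dynamic one (work_date).
-- The dynamic partition column is supplied as the last column of the SELECT.
INSERT INTO TABLE teacher_by_country PARTITION (country = 'CN', work_date)
SELECT id, name, tno, work_date FROM tmp;

-- Rejected by Hive: a dynamic column (country) placed before a static one
-- INSERT INTO TABLE teacher_by_country PARTITION (country, work_date = '2016-07-11')
-- SELECT id, name, tno, 'CN' FROM tmp;
```

Because the first statement has one static partition column, it should also be accepted when hive.exec.dynamic.partition.mode is left at strict.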

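3. Buckets
Within each table or partition, Hive can organize the data further into buckets. Bucketing is also based on a column: Hive hashes the column value and takes the remainder after dividing by the number of buckets to decide which bucket a record goes into. Buckets have two main benefits: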

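1. Data sampling
2. More efficient execution of certain queries, such as map-side joins
When two tables are joined on a shared column and both are bucketed on that column, only the buckets holding the same column values need to be joined with each other, which can greatly reduce the amount of data the JOIN has to process.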
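A join of two bucketed tables might look like the sketch below. The course table is hypothetical; it is bucketed on the same column and into the same number of buckets as teacher. Whether the optimizer actually chooses a bucket map join depends on the Hive version, the execution engine, and table sizes, so this is a starting point under those assumptions and not a guaranteed plan.

```sql
-- Hypothetical second table, bucketed on the join key like teacher
CREATE TABLE course (id INT, course_name STRING)
CLUSTERED BY (id) INTO 2 BUCKETS;

-- Allow map-side joins and let Hive exploit matching buckets
SET hive.auto.convert.join = true;
SET hive.optimize.bucketmapjoin = true;

SELECT t.id, t.name, c.course_name
FROM teacher t
JOIN course c ON t.id = c.id;
```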

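A Hive table can be split into partitions, and a table or partition can be divided further into buckets with CLUSTERED BY; the rows within each bucket can be ordered with SORTED BY. In the create table statement in section 1, the data is split into 2 buckets by the id column, and the rows in each bucket are sorted by name.

Here is how the files are stored once a partition is divided into buckets: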
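To make the hash-and-remainder rule concrete, the query below approximates the bucket index for each row of the staging table tmp with Hive's built-in hash() and pmod() functions. This is an illustration only: it assumes the bucket index is the non-negative hash modulo the bucket count, and the hash Hive uses internally for bucketing differs between versions, so the result may not match the file a row actually lands in.

```sql
-- Approximate bucket index for a table declared with
-- CLUSTERED BY (id) INTO 2 BUCKETS (illustration, not Hive's exact internals)
SELECT id,
       name,
       pmod(hash(id), 2) AS approx_bucket
FROM tmp;
```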

hive>dfs -ls /user/hive/warehouse/crawl_db.db/teacher/work_date=2016-06-11;
-rwxrwxrwx 3 warehouse supergroup 24 2016-07-12 16:43 /user/hive/warehouse/crawl_db.db/teacher/work_date=2016-06-11/000000_0
-rwxrwxrwx 3 warehouse supergroup 24 2016-07-12 16:43 /user/hive/warehouse/crawl_db.db/teacher/work_date=2016-06-11/000001_0

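Each partition's data has been split into two buckets, one file per bucket.

Next, here is how buckets are used for data sampling.
tablesample is the sampling clause. Syntax: TABLESAMPLE(BUCKET x OUT OF y [ON colname]), where the rows are divided into y sample buckets and bucket number x is returned.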

select * from teacher tablesample(bucket 1 out of 2 on id);
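A few variations on the query above, as a hedged sketch. Sample buckets are numbered from 1, and rows are assigned to them by hashing the ON expression. The alias t and the choice of 4 sample buckets are illustrative.

```sql
-- Roughly a quarter of the rows: sample bucket 1 of 4, hashed on id
SELECT * FROM teacher TABLESAMPLE(BUCKET 1 OUT OF 4 ON id) t;

-- Sample on rand() instead of a column, so the sample changes per run
-- and does not depend on how the table is bucketed
SELECT * FROM teacher TABLESAMPLE(BUCKET 1 OUT OF 2 ON rand()) t;

-- Combine sampling with a partition filter
SELECT * FROM teacher TABLESAMPLE(BUCKET 2 OUT OF 2 ON id) t
WHERE t.work_date = '2016-06-11';
```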

    Original author: wujustin
    Original URL: https://www.jianshu.com/p/440408571bb7