Notes on Installing and Using Hive on CentOS 7.2

Hive, open-sourced by Facebook, is a data warehouse built on top of Hadoop. It maps structured data into tables and supports SQL-like queries, known in Hive as HQL. Hive cannot update data in real time; it only supports appending data to existing tables.

[Figure: Hive architecture diagram (Hive原理图.png)]

Common Hive File Formats

TEXTFILE

  • The default file format; the delimiter must be specified explicitly when creating a table
  • Storage layout: row-oriented

SequenceFile

  • A binary key-value pair serialized file format
  • Storage layout: row-oriented

Columnar storage formats

  • RCFILE/ORC (see the sketch below)
  • Storage layout: column-oriented
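
The format is chosen per table via the STORED AS clause; a minimal sketch of creating a columnar ORC table (hypothetical table name and columns):

create table page_views_orc(
url string,
visits int
)
stored as orc;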

Common Data Types

1. Integer types

  • TINYINT, SMALLINT, INT, BIGINT
  • Occupying 1, 2, 4, and 8 bytes respectively

2. Floating-point types

  • FLOAT, DOUBLE
  • 32-bit and 64-bit floating-point numbers respectively

3. Boolean type: BOOLEAN

  • Stores true and false

4. String type: STRING

  • Stores variable-length strings; the type imposes no length limit

5. Timestamp type: TIMESTAMP

  • Stores timestamps with nanosecond precision

Complex Data Types

1. ARRAY

  • Stores elements of a single type; elements are accessed by index
  • Definition: ARRAY<STRING>
  • Query: array[index]

2. MAP

  • Stores key-value pairs; all keys share one type and all values share one type. A value is retrieved by its key.
  • Definition: MAP<STRING,INT>
  • Query: map['key']

3. STRUCT

  • Can hold fields of different types; once the structure is declared, the position of each field cannot change (see the sketch after this list).
  • Definition: STRUCT<city:STRING, address:STRING, door_num:STRING>
  • Query: struct.fieldname
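
A compact sketch tying the three together (hypothetical table and field names; a full worked example appears in section 10 of part II):

create table contact_demo(
name string,
phones array<string>,
scores map<string,int>,
addr struct<city:string,street:string>
);
-- index into the array, look up a map key, dot into the struct
select phones[0], scores['math'], addr.city from contact_demo;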

I. Installing Hive

1. Download the Hive package and extract it

[hadoop@hadoop01 apps]$ tar -zxvf apache-hive-1.2.2-bin.tar.gz 

2. Create a symlink as the root user

[root@hadoop01 apps]# ln -s /home/hadoop/apps/apache-hive-1.2.2-bin /usr/local/hive

3. Assign the Hive directory to the hadoop user and group

[root@hadoop01 apps]# chown -R hadoop:hadoop /usr/local/hive

4. Add Hive to the system environment variables and apply them

[root@hadoop01 apps]# vim /etc/profile

Append the following:

export HIVE_HOME=/usr/local/hive
export PATH=$PATH:${JAVA_HOME}/bin:${ZOOKEEPER_HOME}/bin:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin:${HIVE_HOME}/bin

Apply the changes:

[root@hadoop01 apps]# source /etc/profile
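
A quick sanity check that the variables are in effect (output reflects the paths configured above):

[hadoop@hadoop01 ~]$ echo $HIVE_HOME
/usr/local/hive
[hadoop@hadoop01 ~]$ which hive
/usr/local/hive/bin/hive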

5. Configure Hive's metastore

Edit hive-site.xml in Hive's configuration directory as follows:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
   Licensed to the Apache Software Foundation (ASF) under one or more
   contributor license agreements.  See the NOTICE file distributed with
   this work for additional information regarding copyright ownership.
   The ASF licenses this file to You under the Apache License, Version 2.0
   (the "License"); you may not use this file except in compliance with
   the License.  You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
-->
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://192.168.43.50:3306/hive?createDatabaseIfNotExist=true&amp;characterEncoding=UTF-8</value>
        <description>JDBC connect string for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
        <description>Driver class name for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>hadoop</value>
        <description>username to use against metastore database</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>xxxx</value>
        <description>password to use against metastore database</description>
    </property>
</configuration>

Note: the connection URL points at a MySQL instance; the configured user must be allowed to connect to the database from other hosts, and ConnectionPassword should be set to your own MySQL password.
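
Hive does not ship with the MySQL JDBC driver, so the connector jar also has to be placed in Hive's lib directory before the first start (the jar version below is an assumption; use whichever you downloaded):

[root@hadoop01 apps]# cp mysql-connector-java-5.1.44.jar /usr/local/hive/lib/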

II. Using Hive

1. Run hive

[hadoop@hadoop01 ~]$ hive

Logging initialized using configuration in jar:file:/home/hadoop/apps/apache-hive-1.2.2-bin/lib/hive-common-1.2.2.jar!/hive-log4j.properties

2. Show databases

hive> show databases;
OK
default
Time taken: 0.99 seconds, Fetched: 1 row(s)

3. Create a user table: user_info

Field layout:

Field         Type
user_id       string
area_id       string
age           int
occupation    string

create table user_info(
user_id string,
area_id string,
age int,
occupation string
)
row format delimited fields terminated by '\t' 
lines terminated by '\n'
stored as textfile; 
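
For reference, a hypothetical fragment of the tab-separated data file this table expects (columns in declaration order, one record per line):

1   11  20  teacher
2   31  21  driver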

4. Show tables

List the tables in the default database; the newly created user_info table is there:

hive> show tables;
OK
user_info
Time taken: 0.04 seconds, Fetched: 1 row(s)

Check the corresponding directory on HDFS:

[hadoop@hadoop01 root]$ hadoop fs -ls /user/hive/warehouse
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2018-01-14 19:48 /user/hive/warehouse/user_info

5. Drop a table

Dropping user_info also deletes the table's directory on HDFS:

hive> drop table user_info;
OK
Time taken: 0.935 seconds
hive> show tables;
OK
Time taken: 0.041 seconds

Check the warehouse directory; the table's folder is gone:

[hadoop@hadoop01 root]$ hadoop fs -ls /user/hive/warehouse
[hadoop@hadoop01 root]$ 

6. Create a database for dimension data

hive> create database rel;
OK
Time taken: 0.098 seconds
hive> show databases;
OK
default
rel
Time taken: 0.025 seconds, Fetched: 2 row(s)

Check the corresponding directory:

[hadoop@hadoop01 root]$ hadoop fs -ls /user/hive/warehouse
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2018-01-14 19:55 /user/hive/warehouse/rel.db

7. Create an internal (managed) table

Create a student information table in the rel database with fields: student ID, name, age, and origin. Switch to the rel database first:

use rel;
create table student_info(
student_id string comment '学号',
name string comment '姓名',
age int comment '年龄',
origin string comment '地域'
)
comment '学生信息表'
row format delimited 
fields terminated by '\t' 
lines terminated by '\n'
stored as textfile;

Check the corresponding directory:

[hadoop@hadoop01 root]$ hadoop fs -ls /user/hive/warehouse/rel.db
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2018-01-14 19:59 /user/hive/warehouse/rel.db/student_info

8. Load data from the local filesystem with load

Use load to load a local file into student_info:

hive> load data local inpath '/home/hadoop/apps/hive_test_data/student_info_data.txt' into table student_info;
Loading data to table rel.student_info
Table rel.student_info stats: [numFiles=1, totalSize=341]
OK
Time taken: 1.144 seconds

Inspect the student_info table and its file path:

hive> select * from student_info;
OK
1   xiaoming    20  11
2   xiaobai 21  31
3   zhangfei    22  44
4   likui   19  44
5   zhaoyun 21  13
6   zhangsan    20  11
7   lisi    19  11
8   wangwu  23  31
9   zhaofei 19  21
10  zhangyan    20  21
11  lihe    20  22
12  caoyang 17  32
13  lihao   19  32
14  zhaoming    21  50
15  zhouhong    18  51
16  yangshuo    23  33
17  xiaofei 24  13
18  liman   23  13
19  qianbao 20  13
20  sunce   21  41
Time taken: 0.767 seconds, Fetched: 20 row(s)

Check the corresponding folder on HDFS:

[hadoop@hadoop01 hive_test_data]$ hadoop fs -ls /user/hive/warehouse/rel.db/student_info
Found 1 items
-rwxr-xr-x   3 hadoop supergroup        341 2018-01-14 20:09 /user/hive/warehouse/rel.db/student_info/student_info_data.txt

9. Load data into student_info from HDFS

First remove the existing data file:

[hadoop@hadoop01 hive_test_data]$ hadoop fs -rm -f /user/hive/warehouse/rel.db/student_info/student_info_data.txt
18/01/14 20:15:31 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /user/hive/warehouse/rel.db/student_info/student_info_data.txt

Upload the local file to the HDFS root directory:

[hadoop@hadoop01 hive_test_data]$ hadoop fs -put /home/hadoop/apps/hive_test_data/student_info_data.txt /
[hadoop@hadoop01 hive_test_data]$ hadoop fs -ls /
Found 6 items
drwxr-xr-x   - hadoop supergroup          0 2018-01-14 16:23 /addata
drwxr-xr-x   - hadoop supergroup          0 2017-12-23 20:20 /data
-rw-r--r--   3 hadoop supergroup        341 2018-01-14 20:16 /student_info_data.txt
drwxrwx---   - hadoop supergroup          0 2018-01-14 17:26 /tmp
drwxr-xr-x   - hadoop supergroup          0 2018-01-14 19:48 /user
drwxr-xr-x   - hadoop supergroup          0 2018-01-13 16:26 /wordcount

Use load to load the HDFS file into student_info:

hive> load data inpath '/student_info_data.txt' into table student_info;
Loading data to table rel.student_info
Table rel.student_info stats: [numFiles=1, totalSize=341]
OK
Time taken: 0.602 seconds
hive> select * from student_info;
OK
1   xiaoming    20  11
2   xiaobai 21  31
3   zhangfei    22  44
4   likui   19  44
5   zhaoyun 21  13
6   zhangsan    20  11
7   lisi    19  11
8   wangwu  23  31
9   zhaofei 19  21
10  zhangyan    20  21
11  lihe    20  22
12  caoyang 17  32
13  lihao   19  32
14  zhaoming    21  50
15  zhouhong    18  51
16  yangshuo    23  33
17  xiaofei 24  13
18  liman   23  13
19  qianbao 20  13
20  sunce   21  41
Time taken: 0.143 seconds, Fetched: 20 row(s)

Reload the file into student_info in overwrite mode:

  • A load from an HDFS path moves the source file rather than copying it: the student_info_data.txt that was under the HDFS root has been moved into the table's directory /user/hive/warehouse/rel.db/student_info, so it must be uploaded to the root again before running the load below (a load with local inpath copies the file and leaves the source in place).
hive> load data inpath '/student_info_data.txt' overwrite into table student_info;
Loading data to table rel.student_info
Table rel.student_info stats: [numFiles=1, numRows=0, totalSize=341, rawDataSize=0]
OK
Time taken: 0.41 seconds

10. Hive data types in practice

Field            Type                                                                          Comment
user_id          string                                                                        user ID
salary           int                                                                           salary
worked_citys     array<string>                                                                 cities worked in
social_security  map<string,float>                                                             social security contributions (pension, medical)
welfare          struct<meal_allowance:float,if_regular:boolean,commercial_insurance:float>    benefits: meal allowance (float), regularized (boolean), commercial insurance (float)

Create the employee table:

hive> create table rel.employee(
    > user_id string,
    > salary int,
    > worked_citys array<string>,
    > social_security map<string,float>,
    > welfare struct<meal_allowance:float,if_regular:boolean,commercial_insurance:float>
    > )
    > row format delimited fields terminated by '\t' 
    > collection items terminated by ','
    > map keys terminated by ':'
    > lines terminated by '\n'
    > stored as textfile;
OK
Time taken: 0.212 seconds
hive> show tables;
OK
employee
student_info
Time taken: 0.057 seconds, Fetched: 2 row(s)

Load local data into employee:

hive> load data local inpath '/home/hadoop/apps/hive_test_data/employee_data.txt' into table employee;
Loading data to table rel.employee
Table rel.employee stats: [numFiles=1, totalSize=206]
OK
Time taken: 0.388 seconds
hive> select * from employee;
OK
zhangsan    10800   ["beijing","shanghai"]  {"养老":1000.0,"医疗":600.0}    {"meal_allowance":2000.0,"if_regular":true,"commercial_insurance":500.0}
lisi    20000   ["beijing","nanjing"]   {"养老":2000.0,"医疗":1200.0}   {"meal_allowance":2000.0,"if_regular":false,"commercial_insurance":500.0}
wangwu  17000   ["shanghai","nanjing"]  {"养老":1800.0,"医疗":1100.0}   {"meal_allowance":2000.0,"if_regular":true,"commercial_insurance":500.0}
Time taken: 0.127 seconds, Fetched: 3 row(s)

Query the regularized employees' ID, salary, first city worked in, pension contribution, and meal allowance:

select user_id,
salary,
worked_citys[0],
social_security['养老'],
welfare.meal_allowance 
from rel.employee
where welfare.if_regular=true;
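
Given the three rows loaded above, lisi is filtered out (if_regular is false), so the query should return something like:

zhangsan    10800   beijing 1000.0  2000.0
wangwu  17000   shanghai    1800.0  2000.0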

11. Create an external table [commonly used]

Field         Type     Comment
student_id    string   student ID
name          string   student name
institute_id  string   institute ID
major_id      string   major ID
school_year   string   enrollment year

You can create the HDFS path in advance:

hadoop fs -mkdir -p /user/hive/warehouse/data/student_school_info

If you don't create it beforehand, it is created automatically at the specified location when the external table is created.

create external table rel.student_school_info(
student_id string,
name string,
institute_id string,
major_id string,
school_year string
)
row format delimited 
fields terminated by '\t' 
lines terminated by '\n'
stored as textfile
location '/user/hive/warehouse/data/student_school_info';

Check the corresponding directory:

[hadoop@hadoop01 root]$ hadoop fs -ls /user/hive/warehouse/data/
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2018-01-15 14:08 /user/hive/warehouse/data/student_school_info

Upload the local data file to HDFS:

[hadoop@hadoop01 root]$ hadoop fs -put /home/hadoop/apps/hive_test_data/student_school_info_external_data.txt /user/hive/warehouse/data/student_school_info/
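
Because the table's location already contains the file, the data is queryable right away with no load step; a quick sanity check (assuming the uploaded file matches the table layout):

hive> select * from rel.student_school_info limit 3;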

12. Create an internal partitioned table

Field         Type     Comment
student_id    string   student ID
name          string   name
institute_id  string   institute ID
major_id      string   major ID

create table student_school_info_partition(
student_id string,
name string,
institute_id string,
major_id string
)
partitioned by(school_year string) 
row format delimited
fields terminated by '\t' 
lines terminated by '\n'
stored as textfile; 

Use insert into to load the records of students enrolled in 2017 from student_school_info into the partitioned table student_school_info_partition:

insert into table student_school_info_partition partition(school_year='2017')
select t1.student_id,t1.name,t1.institute_id,t1.major_id
from student_school_info t1
where t1.school_year='2017';

13. Show partitions

hive> show partitions student_school_info_partition;
OK
school_year=2017
Time taken: 0.191 seconds, Fetched: 1 row(s)

Check the HDFS path:

[hadoop@hadoop01 root]$ hadoop fs -ls /user/hive/warehouse/rel.db/student_school_info_partition/
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2018-01-16 13:20 /user/hive/warehouse/rel.db/student_school_info_partition/school_year=2017

Query student_school_info_partition:

hive> select * from student_school_info_partition where school_year='2017';
OK
1   xiaoming    information software    2017
2   xiaobai information computer    2017
3   zhangfei    information computer    2017
4   likui   information bigdata 2017
5   zhaoyun information bigdata 2017
6   zhangsan    information software    2017
7   lisi    information bigdata 2017
8   wangwu  information computer    2017
Time taken: 0.226 seconds, Fetched: 8 row(s)

14. Drop a partition

hive> alter table student_school_info_partition drop partition (school_year='2017');
Dropped the partition school_year=2017
OK
Time taken: 0.71 seconds

15. Load data with dynamic partitioning

-- enable dynamic partitioning; nonstrict mode allows every partition column to be dynamic
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- the partition column (school_year) must come last in the select list
insert overwrite table student_school_info_partition partition(school_year)
select t1.student_id,t1.name,t1.institute_id,t1.major_id,t1.school_year
from student_school_info t1;

Show partitions:

hive> show partitions student_school_info_partition;
OK
school_year=2017
Time taken: 0.12 seconds, Fetched: 1 row(s)

Check the HDFS path:

[hadoop@hadoop01 root]$ hadoop fs -ls /user/hive/warehouse/rel.db/student_school_info_partition/
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2018-01-16 13:27 /user/hive/warehouse/rel.db/student_school_info_partition/school_year=2017

16. Create an external partitioned table [commonly used]

create external table rel.student_school_info_external_partition(
student_id string,
name string,
institute_id string,
major_id string
)
partitioned by(school_year string) 
row format delimited 
fields terminated by '\t' 
lines terminated by '\n'
stored as textfile
location '/user/hive/warehouse/data/student_school_info_external_partition';

Create a school_year=2017 directory under the table's HDFS path:

hadoop fs -mkdir /user/hive/warehouse/data/student_school_info_external_partition/school_year=2017

Upload student_school_external_partition_data.txt into the school_year=2017 folder:

hadoop fs -put student_school_external_partition_data.txt /user/hive/warehouse/data/student_school_info_external_partition/school_year=2017

Although the data now sits in the partition's HDFS path, the table has no partition registered yet, so a query returns no rows.

Add the partition manually:

hive> alter table student_school_info_external_partition add partition(school_year='2017');
OK
Time taken: 0.111 seconds
hive> select * from student_school_info_external_partition;
OK
1   xiaoming    information software    2017
2   xiaobai information computer    2017
3   zhangfei    information computer    2017
4   likui   information bigdata 2017
5   zhaoyun information bigdata 2017
6   zhangsan    information software    2017
7   lisi    information bigdata 2017
8   wangwu  information computer    2017
Time taken: 0.127 seconds, Fetched: 8 row(s)
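
Instead of registering each directory by hand, you can also have Hive scan the table's location and add any missing partitions in one step:

hive> msck repair table student_school_info_external_partition;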

Drop the partition:

hive> alter table student_school_info_external_partition drop partition(school_year='2017');
Dropped the partition school_year=2017
OK
Time taken: 0.19 seconds

Show partitions; the partition has been dropped:

hive> show partitions student_school_info_external_partition;
OK
Time taken: 0.168 seconds

Check the partition data on HDFS; because this is an external table, the data files are still there:

[hadoop@hadoop01 hive_test_data]$ hadoop fs -ls /user/hive/warehouse/data/student_school_info_external_partition/school_year=2017
Found 1 items
-rw-r--r--   3 hadoop supergroup        250 2018-01-16 13:33 /user/hive/warehouse/data/student_school_info_external_partition/school_year=2017/student_school_external_partition_data.txt
Original author: 溯水心生
Original source: https://www.jianshu.com/p/0b1da0f7885c