1. Hive Basic Structure
Hive includes the following main data models: Database, Table, External Table, Partition, and Bucket.
Data in Hive is stored on HDFS.
Hive metadata (meta/schema) is stored in an RDBMS.
Hive can contain multiple databases. The default database is default, corresponding to the HDFS directory /user/hive/warehouse; this location is configured by the hive.metastore.warehouse.dir parameter in ${HIVE_HOME}/conf/hive-site.xml (you can verify the current value as shown below).
Every Hive table lives inside a database, and each table corresponds to a directory on HDFS: /user/hive/warehouse/[databasename.db]/table
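A quick way to check the warehouse location actually in effect is to print the parameter from the Hive CLI; on a default installation this shows hive.metastore.warehouse.dir=/user/hive/warehouse:
hive> set hive.metastore.warehouse.dir;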
2. Basic Hive Syntax
List databases: hive> show databases;
List tables: hive> show tables;
3. Creating a Database
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
[COMMENT database_comment]  -- description
[LOCATION hdfs_path]  -- HDFS storage location
[WITH DBPROPERTIES (property_name=property_value, ...)];  -- properties
hive> create database hive;
OK
Time taken: 0.852 seconds
hive> show databases;
OK
default
hive
Time taken: 0.071 seconds, Fetched: 2 row(s)
Check the HDFS directory:
[hadoop@hadoop001 ~]$ hadoop fs -ls /user/hive/warehouse
drwxr-xr-x - hadoop supergroup 0 2018-06-06 15:31 /user/hive/warehouse/hive.db
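For reference, a sketch of a CREATE DATABASE statement that uses the optional clauses; the database name, comment, location, and property below are made up for illustration:
hive> CREATE DATABASE IF NOT EXISTS hive_demo
    > COMMENT 'demo database for testing'
    > LOCATION '/user/hive/warehouse/hive_demo.db'
    > WITH DBPROPERTIES ('creator'='hadoop');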
4. Dropping a Database
DROP DATABASE [IF EXISTS] database_name [RESTRICT|CASCADE];
The default is RESTRICT: the drop fails with an error if the database still contains tables.
CASCADE forces the drop, removing the tables inside the database as well.
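For example, to force-drop a database that still contains tables (using the hypothetical hive_demo database sketched above):
hive> drop database if exists hive_demo cascade;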
5. Viewing Database Information
hive> desc database hive;
OK
hive hdfs://192.168.137.130:9000/user/hive/warehouse/hive.db hadoop USER
Time taken: 0.084 seconds, Fetched: 1 row(s)
To also display the database properties (DBPROPERTIES), add the EXTENDED keyword.
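For example, on the hive database created above:
hive> desc database extended hive;
Any key/value pairs set through DBPROPERTIES are then shown as well.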
6. Creating a Table
First switch to the database: use hive;
create table ruozedata_emp
(empno int, ename string, job string, mgr int, hiredate string, salary double, comm double, deptno int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
View the table structure: desc formatted ruozedata_emp;
hive> desc formatted ruozedata_emp;
OK
# col_name data_type comment
empno int
ename string
job string
mgr int
hiredate string
salary double
comm double
deptno int
# Detailed Table Information
Database: hive
Owner: hadoop
CreateTime: Wed Jun 06 15:52:43 CST 2018
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://192.168.137.130:9000/user/hive/warehouse/hive.db/ruozedata_emp
Table Type: MANAGED_TABLE
Table Parameters:
transient_lastDdlTime 1528271563
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
field.delim \t
serialization.format \t
Time taken: 0.603 seconds, Fetched: 34 row(s)
Query the data: select * from ruozedata_emp; returns no rows, because the table is still empty.
Check HDFS: [hadoop@hadoop001 ~]$ hadoop fs -ls /user/hive/warehouse/hive.db
Found 1 items
drwxr-xr-x - hadoop supergroup 0 2018-06-06 15:52 /user/hive/warehouse/hive.db/ruozedata_emp
7. Loading Data
hive> LOAD DATA LOCAL INPATH '/home/hadoop/data/emp.txt' OVERWRITE INTO TABLE ruozedata_emp;
Loading data to table hive.ruozedata_emp
Table hive.ruozedata_emp stats: [numFiles=1, numRows=0, totalSize=700, rawDataSize=0]
OK
Time taken: 8.16 seconds
LOCAL: load data from the local file system into the Hive table.
Without LOCAL: load data from HDFS into the Hive table.
OVERWRITE: the loaded data replaces any existing data in the table.
Without OVERWRITE: the loaded data is appended to the table.
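For example, to load a file that is already on HDFS and append it to the table (the HDFS path here is hypothetical):
hive> LOAD DATA INPATH '/user/hadoop/data/emp.txt' INTO TABLE ruozedata_emp;
Note that without LOCAL the source file is moved, not copied, from its original HDFS location into the table directory.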
Query again:
hive> select * from ruozedata_emp;
OK
7369 SMITH CLERK 7902 1980-12-17 800.0 NULL 20
7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
7566 JONES MANAGER 7839 1981-4-2 2975.0 NULL 20
7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
7698 BLAKE MANAGER 7839 1981-5-1 2850.0 NULL 30
7782 CLARK MANAGER 7839 1981-6-9 2450.0 NULL 10
7788 SCOTT ANALYST 7566 1987-4-19 3000.0 NULL 20
7839 KING PRESIDENT NULL 1981-11-17 5000.0 NULL 10
7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
7876 ADAMS CLERK 7788 1987-5-23 1100.0 NULL 20
7900 JAMES CLERK 7698 1981-12-3 950.0 NULL 30
7902 FORD ANALYST 7566 1981-12-3 3000.0 NULL 20
7934 MILLER CLERK 7782 1982-1-23 1300.0 NULL 10
8888 HIVE PROGRAM 7839 1988-1-23 10300.0 NULL NULL
Time taken: 0.208 seconds, Fetched: 15 row(s)
8. Creating a Table with CTAS (CREATE TABLE AS SELECT)
hive> create table ruozedata_emp2 as select * from ruozedata_emp;
CTAS runs a MapReduce job:
Query ID = hadoop_20180606155656_d5ba6272-9987-4adc-b6df-62dbd83c67ec
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1528254562006_0001, Tracking URL = http://hadoop001:8088/proxy/application_1528254562006_0001/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1528254562006_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2018-06-07 13:43:15,789 Stage-1 map = 0%, reduce = 0%
2018-06-07 13:43:44,466 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.8 sec
MapReduce Total cumulative CPU time: 2 seconds 800 msec
Ended Job = job_1528254562006_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://192.168.137.130:9000/user/hive/warehouse/hive.db/.hive-staging_hive_2018-06-07_13-41-33_336_3165899760491697186-1/-ext-10001
Moving data to: hdfs://192.168.137.130:9000/user/hive/warehouse/hive.db/ruozedata_emp2
Table hive.ruozedata_emp2 stats: [numFiles=1, numRows=15, totalSize=708, rawDataSize=693]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 2.8 sec HDFS Read: 4083 HDFS Write: 784 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 800 msec
OK
Time taken: 148.678 seconds
hive> select * from ruozedata_emp2;
OK
7369 SMITH CLERK 7902 1980-12-17 800.0 NULL 20
7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
7566 JONES MANAGER 7839 1981-4-2 2975.0 NULL 20
7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
7698 BLAKE MANAGER 7839 1981-5-1 2850.0 NULL 30
7782 CLARK MANAGER 7839 1981-6-9 2450.0 NULL 10
7788 SCOTT ANALYST 7566 1987-4-19 3000.0 NULL 20
7839 KING PRESIDENT NULL 1981-11-17 5000.0 NULL 10
7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
7876 ADAMS CLERK 7788 1987-5-23 1100.0 NULL 20
7900 JAMES CLERK 7698 1981-12-3 950.0 NULL 30
7902 FORD ANALYST 7566 1981-12-3 3000.0 NULL 20
7934 MILLER CLERK 7782 1982-1-23 1300.0 NULL 10
8888 HIVE PROGRAM 7839 1988-1-23 10300.0 NULL NULL
Time taken: 0.183 seconds, Fetched: 15 row(s)
CTAS creates the new table ruozedata_emp2 and copies both the structure and the data of ruozedata_emp into it.
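CTAS can also copy just a subset of columns or rows. A sketch (the target table name is made up for illustration):
hive> create table ruozedata_emp_dept20 as
    > select empno, ename, job, salary from ruozedata_emp where deptno=20;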
9. Copying Only the Table Structure, Without the Data
CREATE table ruozedata_emp3 like ruozedata_emp;
hive> CREATE table ruozedata_emp3 like ruozedata_emp;
OK
Time taken: 0.409 seconds
hive> select * from ruozedata_emp3;
OK
Time taken: 0.2 seconds
hive> desc ruozedata_emp3;
OK
empno int
ename string
job string
mgr int
hiredate string
salary double
comm double
deptno int
Time taken: 0.301 seconds, Fetched: 8 row(s)
10. Renaming a Table
alter table tablename rename to newname;
hive> alter table ruozedata_emp3 rename to ruozedata_emp3_new;
OK
Time taken: 1.212 seconds
hive> show tables;
OK
ruozedata_emp
ruozedata_emp2
ruozedata_emp3_new
Time taken: 0.165 seconds, Fetched: 3 row(s)
11. Inserting Data
The ruozedata_emp3 created earlier only copied the table structure, so it contains no data yet.
Use INSERT OVERWRITE TABLE (replace existing data) or INSERT INTO TABLE (append) to insert data:
hive> insert overwrite table ruozedata_emp3 select * from ruozedata_emp;
This also runs a MapReduce job.
hive> select * from ruozedata_emp3;
OK
7369 SMITH CLERK 7902 1980-12-17 800.0 NULL 20
7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
7566 JONES MANAGER 7839 1981-4-2 2975.0 NULL 20
7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
7698 BLAKE MANAGER 7839 1981-5-1 2850.0 NULL 30
7782 CLARK MANAGER 7839 1981-6-9 2450.0 NULL 10
7788 SCOTT ANALYST 7566 1987-4-19 3000.0 NULL 20
7839 KING PRESIDENT NULL 1981-11-17 5000.0 NULL 10
7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
7876 ADAMS CLERK 7788 1987-5-23 1100.0 NULL 20
7900 JAMES CLERK 7698 1981-12-3 950.0 NULL 30
7902 FORD ANALYST 7566 1981-12-3 3000.0 NULL 20
7934 MILLER CLERK 7782 1982-1-23 1300.0 NULL 10
8888 HIVE PROGRAM 7839 1988-1-23 10300.0 NULL NULL
Time taken: 0.742 seconds, Fetched: 15 row(s)
12. Managed (Internal) Tables and External Tables
All tables created so far are managed (internal) tables, the default when the EXTERNAL keyword is not used; a table created with EXTERNAL is an external table.
Dropping a managed table deletes both the metadata and the data stored on HDFS; dropping an external table deletes only the metadata.
In production, external tables are used almost exclusively, because the underlying data on HDFS is preserved even if the table is dropped.
Syntax for creating an external table:
hive> create EXTERNAL table ruozedata_emp_external
    > (empno int, ename string, job string, mgr int, hiredate string, salary double, comm double, deptno int)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY '\t'
    > LOCATION "/ruozedata/external/emp";
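To see the difference in drop behavior, a quick sketch (assuming data has already been placed under /ruozedata/external/emp):
hive> drop table ruozedata_emp_external;
[hadoop@hadoop001 ~]$ hadoop fs -ls /ruozedata/external/emp
Dropping the external table removes only its metadata; the hadoop fs -ls command above should still list the data files on HDFS.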