Hive Advanced
1) Background
2) Deployment
3) DDL
4) DML
5) JOIN
6) function: built-in & UDF
7) Sqoop
Hive clients: hive CLI / Web UI / beeline / Java API
HiveServer2 service
beeline / Java API connect to HS2: the client submits the SQL
!connect jdbc:hive2://localhost:10000 hadoop
beeline -u jdbc:hive2://localhost:14000 -n hadoop
Well-known big data ports:
50070   HDFS NameNode web UI
8088    YARN ResourceManager web UI
4040    Spark application web UI
2181    ZooKeeper client port
8020    HDFS NameNode RPC
7077    Spark standalone Master
60010   HBase Master web UI
10000   HiveServer2 (Thrift)
19888   MapReduce JobHistory Server web UI
hiveserver2 --hiveconf hive.server2.thrift.port=14000
Operating Hive via the Java API
1) Add the following dependency to pom.xml
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-jdbc</artifactId>
    <version>${hive.version}</version>
</dependency>
2) JDBC code (a sketch follows the note below)
3) The example on the official website has errors; use it with care.
note: the exception below usually means HiveServer2 is not reachable on that host/port:
java.sql.SQLException: Could not open client transport with JDBC Uri: jdbc:hive2://hadoop000:14000/default: java.net.ConnectException: Connection refused: connect
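A minimal JDBC sketch against HS2, reusing the hadoop000:14000 URI from the note above; the class name and the query against ruozedata_emp are illustrative:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcClient {
    public static void main(String[] args) throws Exception {
        // host/port and user follow the beeline example; adjust to your environment
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://hadoop000:14000/default", "hadoop", "");
             Statement stmt = conn.createStatement();
             // ruozedata_emp is the emp table used later in the JOIN examples
             ResultSet rs = stmt.executeQuery("select empno, ename from ruozedata_emp limit 10")) {
            while (rs.next()) {
                System.out.println(rs.getInt(1) + "\t" + rs.getString(2));
            }
        }
    }
}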
WEB UI
Hive's built-in web UI is not recommended
HUE: Hadoop User Experience
http://github.com/cloudera/hue
Zeppelin
Basic data types
Complex data types: array, map, struct
* create the table with the complex data type you need
* query the data
1) array
create table hive_array(name string, work_locations array<string>)
row format delimited fields terminated by '\t'
COLLECTION ITEMS TERMINATED BY ',';
load data local inpath '/home/hadoop/data/hive_array.txt' overwrite into table hive_array;
Access elements with array[index].
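A short usage sketch; the sample row shown in the comment is assumed, not taken from the course data:

-- assumed line in hive_array.txt: zhangsan<TAB>beijing,shanghai,tianjin
select name, work_locations[0] as first_location, size(work_locations) as num_locations
from hive_array;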
2) map, e.g. Map('a'#1, 'b'#2): entries separated by ',' (item , item), key and value separated by '#' (key#value)
create table ruoze_map(id int, name string, family map<string,string>, age int)
row format delimited fields terminated by ','
COLLECTION ITEMS TERMINATED BY '#'
MAP KEYS TERMINATED BY ':';
load data local inpath '/home/hadoop/data/hive_map.txt' overwrite into table ruoze_map;
Access values with map['key'].
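A short usage sketch; the sample row shown in the comment is assumed:

-- assumed line in hive_map.txt: 1,zhangsan,father:xiaoming#mother:xiaohuang,28
select name, family['father'] as father, map_keys(family) as relations
from ruoze_map;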
3) struct
Sample data (ip#userinfo): 192.168.1.1#zhangsan:40:xxx:bbb:aaa
create table ruoze_struct(ip string, userinfo struct<name:string,age:int>)
row format delimited fields terminated by '#'
COLLECTION ITEMS TERMINATED BY ':';
Access fields with struct.property.
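A short usage sketch, assuming the name/age fields from the DDL above:

select ip, userinfo.name, userinfo.age from ruoze_struct;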
metadata (metastore tables)
VERSION: stores the metastore schema version and must contain exactly one record
Hive works at the process level
TODO… insert one extra record into VERSION ==> 2 records (and see what happens)
DBS
DATABASE_PARAMS
TBLS and DBS are linked via DB_ID
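A quick way to inspect these tables, assuming a MySQL metastore (run against the metastore database, not in Hive):

select * from VERSION;
select d.NAME as db_name, t.TBL_NAME
from TBLS t join DBS d on t.DB_ID = d.DB_ID;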
JOIN
Execution plan
explain sql
ABSTRACT SYNTAX TREE  <== shown with explain extended
STAGE DEPENDENCIES
STAGE PLANS
SQL on Hadoop
common join / shuffle join / reduce join: involves a shuffle
map join / broadcast join: no shuffle; usually performs better than the common join, but only under certain prerequisites
explain select e.empno, e.ename, d.dname from ruozedata_emp e join ruoze_dept d on e.deptno = d.deptno;
map: read ruozedata_emp (empno, ename, deptno) and ruoze_dept (dname, deptno)
shuffle: rows with the same key are sent to the same reduce task; the join itself actually happens in the reduce stage
map join: the join happens in the map stage, no shuffle
Prerequisite: a big table joined with a small table
How it works: the small table is loaded into the distributed cache; while scanning the big table,
each row is matched directly against the cached data; matches are kept, non-matches are dropped
select /*+ MAPJOIN(d) */ e.empno, e.ename, d.dname from emp e join ruoze_dept d on e.deptno = d.deptno;
hint
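In newer Hive versions the map join is usually triggered automatically rather than via the hint; a sketch of the relevant settings (the threshold shown is the common default):

set hive.auto.convert.join=true;                 -- convert eligible common joins to map joins
set hive.mapjoin.smalltable.filesize=25000000;   -- tables below this size (bytes) are treated as "small"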
Compression!!! The killer technique; a guaranteed interview/exam topic
Big data ==> HDFS <== compression
Benefits:
1) disk: less disk space and disk I/O
2) file: smaller files to move around (network transfer)
3) shuffle: less data to shuffle
Drawback: extra CPU cost for compressing/decompressing
Compression techniques:
lossy
lossless
Common places to apply compression (see the settings sketch below):
map input
map output
reduce output
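A sketch of typical settings for these three scenarios (codec choices are illustrative):

-- map output (intermediate) compression
set hive.exec.compress.intermediate=true;
set mapreduce.map.output.compress=true;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
-- reduce/final output compression
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
-- map input: store source files in a compressed, preferably splittable, format (e.g. bzip2)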