Happy Big Data, Lesson 5: Hive

(1) How Hive works
The entry point for Hive execution is the Driver. A SQL statement is first submitted to the Driver, which invokes the compiler to compile it, and it is ultimately turned into MapReduce jobs for execution. Hive cannot update data in real time; it can only append data to existing tables.

(2) Data model
Partitioning avoids unnecessary full-table scans. A table is partitioned by one or more columns, and each partition appears on HDFS as a separate directory under the table path. To avoid producing too many small files, it is recommended to partition only on discrete, low-cardinality columns such as date, region, or type.
Each table or partition can be further organised into buckets, a finer-grained division of the data: hashcode(col_value) % num_buckets decides the bucket. Bucketing is used to obtain higher query-processing efficiency and to make sampling more efficient.

(3) Common file formats
The default format is TEXTFILE; there are also SequenceFile and columnar storage formats.

(4) Primitive data types: integers, floating point, boolean, string, timestamp

(5) Complex data types: Array, Map, Struct

Hands-on: installing MySQL 5.7.19
1. Check for an existing MySQL installation and remove it
rpm -qa | grep mysql
yum -y remove mysql-libs-5.1.66-2.el6_3.x86_64
find / -name "*mysql*" | xargs rm -rf
2. Create the mysql group and the mysql user
groupadd mysql
useradd mysql -g mysql
cd /usr/local
3. Upload mysql-5.7.19 and fix owner and permissions
rz
tar -zxvf mysql-5.7.19-linux-glibc2.12-x86_64.tar.gz
rm mysql-5.7.19-linux-glibc2.12-x86_64.tar.gz
mv mysql-5.7.19-linux-glibc2.12-x86_64 mysql
# Create the data directory; mysqld needs a datadir, otherwise it will not start
mkdir /usr/local/mysql/data
cd /usr/local/
chown -R mysql:mysql mysql/
chmod -R 755 mysql/
mkdir /var/lib/mysql/
# Create the socket file under /var/lib and give mysql ownership of it
cd /var/lib/
chown mysql:mysql mysql/
chmod 777 mysql/
cd /var/lib/mysql/
touch mysql.sock
chown mysql:mysql mysql.sock
chmod 777 mysql.sock
4. Edit the configuration file so MySQL can start
cd /etc
touch my.cnf
vim /etc/my.cnf
The configuration file is as follows:
[mysql]
# default character set for the mysql client
default-character-set=utf8
socket=/var/lib/mysql/mysql.sock
[mysqld]
skip-name-resolve
# listen on port 3306
port=3306
socket=/var/lib/mysql/mysql.sock
# MySQL installation directory
basedir=/usr/local/mysql
datadir=/usr/local/mysql/data
# maximum number of connections
max_connections=200
# server-side character set; the default is the 8-bit latin1
character-set-server=utf8
# default storage engine for new tables
default-storage-engine=INNODB
lower_case_table_names=1
max_allowed_packet=16M
5. Start MySQL
/usr/local/mysql/bin/mysqld --initialize --user=mysql --datadir=/usr/local/mysql/data --basedir=/usr/local/mysql
Note the temporary password printed on the last line.
/usr/local/mysql/support-files/mysql.server start
# Add a symlink to make starting easier
ln -s /usr/local/mysql/support-files/mysql.server /etc/init.d/mysql
# Put the MySQL binaries on the system PATH
echo "export PATH=$PATH:/usr/local/mysql/bin" >> /etc/profile
source /etc/profile
service mysql restart
# Symlink the mysql client into /usr/bin
ln -s /usr/local/mysql/bin/mysql /usr/bin
mysql -u root -p   (enter the temporary password from above)
6. MySQL setup
mysql> alter user 'root'@'localhost' identified by '123456';
mysql> use mysql;
mysql> update user set user.Host='%' where user.User='root';
mysql> GRANT ALL PRIVILEGES ON *.* TO 'hadoop'@'%' IDENTIFIED BY 'hadoop' WITH GRANT OPTION;
(The hadoop user can then connect with username hadoop and password hadoop: mysql -uhadoop -phadoop)
mysql> create user 'hive'@'%' identified by 'hive123';
mysql> GRANT ALL PRIVILEGES ON *.* TO 'hive'@'%' IDENTIFIED BY 'hive123' WITH GRANT OPTION;
# This must match the connection settings in hive-site.xml later.
mysql> flush privileges;
mysql> show databases;
mysql> quit
Open port 3306:
# iptables -I INPUT -p tcp -m state --state NEW -m tcp --dport 3306 -j ACCEPT
# iptables -L -n
# service iptables save
(After the VM is rebooted, start MySQL first, as root: service mysql start; then connect with: mysql -u root -p123456 or mysql -uhadoop -phadoop)

Configuring the Hive data warehouse
Create the warehouse directories and make them writable for all users, on node02:
hadoop fs -mkdir -p /user/hive/warehouse
hadoop fs -chmod a+w /user/hive/warehouse
hadoop fs -mkdir -p /temp
hadoop fs -chmod a+w /temp
cd ~/apps
rz
tar -zxvf apache-hive-1.2.2-bin.tar.gz
mv apache-hive-1.2.2-bin hive-1.2.2
su root
ln -s /home/hadoop/apps/hive-1.2.2 /usr/local/hive
chown -R hadoop:hadoop /usr/local/hive
Add the environment variables (already added).
cd /home/hadoop/apps/hive-1.2.2/conf
rz   (upload hive-site.xml; adjust the MySQL connection address to your own machine)
cd /usr/local/hive/lib
rz   (upload the MySQL JDBC driver jar, which must be on Hive's classpath)
(First, as root, start MySQL: service mysql start)
su hadoop
Start hive:
cd /usr/local/hive/bin
hive
# Creating and using internal tables
hive> show databases;
use default;
show tables;
create table user_info(
user_id string,
area_id string,
age int,
occupation string
)
row format delimited fields terminated by '\t'
lines terminated by '\n'
stored as textfile;
show tables;
desc user_info;
create database rel;
use rel;
show tables;
create table student_info(
student_id string comment 'student id',
name string comment 'name',
age int comment 'age',
origin string comment 'home region'
)
comment 'student information table'
row format delimited fields terminated by '\t'
lines terminated by '\n'
stored as textfile;
# This only creates the table definition (a header, with no data yet).
Open a new window: in Xshell, open a new node02 window.
hadoop fs -ls /user/hive/warehouse/
hadoop fs -ls /user/hive/warehouse/rel.db
mkdir /home/hadoop/apps/hive_test_data
Move all the course data files into hive_test_data.
cat student_info_data.txt
hadoop fs -ls /user/hive/warehouse/
In hive, run:
load data local inpath '/home/hadoop/apps/hive_test_data/student_info_data.txt' into table student_info;
# This loads the file's contents into the table.
Look back at the HDFS path: switch to the new node02 window.
hadoop fs -ls /user/hive/warehouse/rel.db/student_info
Then, back in hive:
select * from student_info;
Now upload the file to the HDFS root directory first, then load it from there (unless stated otherwise, shell commands below run in the newly opened window).
In /home/hadoop/apps/hive_test_data, upload it to the root directory:
hadoop fs -put student_info_data.txt /
Back in hive:
# Use load to move the HDFS file into the student_info table
load data inpath '/student_info_data.txt' into table student_info;
Back in the shell window:
hadoop fs -ls /user/hive/warehouse/rel.db/student_info
(You will see another copy of the data file, named student_info_data_copy_1.txt.)
Upload student_info_data.txt to the HDFS root again, then back in hive:
load data inpath '/student_info_data.txt' overwrite into table student_info;
(overwrite first removes the existing data files in the table directory and then moves in the new file)
Back in the shell window:
hadoop fs -ls /user/hive/warehouse/rel.db/student_info
Back in hive:
create table rel.employee(
user_id string,
salary int,
worked_citys array<string>,
social_security map<string,float>,
welfare struct<meal_allowance:float,if_regular:boolean,commercial_insurance:float>
)
row format delimited fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':'
lines terminated by '\n'
stored as textfile;
load data local inpath '/home/hadoop/apps/hive_test_data/employee_data.txt' into table employee;
select * from employee;
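Since the employee table uses Hive's complex types, it may help to see how individual elements are addressed. The following is a small illustrative query, assuming the column definitions shown in the DDL above; a specific map value would be read as social_security['key'] and a struct field with dot notation (for example welfare.meal_allowance), using whatever keys and field names actually exist in your data file:
# array element by index, number of elements, map keys, and the whole struct
select user_id,
       worked_citys[0] as first_city,
       size(worked_citys) as city_count,
       map_keys(social_security) as social_security_keys,
       welfare
from rel.employee;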
# Check it on HDFS
hadoop fs -ls /user/hive/warehouse/rel.db/employee
# Creating and using external tables
create external table rel.student_school_info(
student_id string,
name string,
institute_id string,
major_id string,
school_year string
)
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile
location '/user/hive/warehouse/data/student_school_info';
Upload the local data file to HDFS:
hadoop fs -put /home/hadoop/apps/hive_test_data/student_school_info_external_data.txt /user/hive/warehouse/data/student_school_info/
Note: there is no need to create the path in advance with hadoop fs -mkdir; if it does not exist, it is created automatically from the location specified when the external table is created.
# Create an internal partitioned table (static partitioning)
create table student_school_info_partition(
student_id string,
name string,
institute_id string,
major_id string
)
partitioned by(school_year string)
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile;
Use insert into to load the 2017 enrollment records from student_school_info into the student_school_info_partition partitioned table:
insert into table student_school_info_partition partition(school_year='2017')
select t1.student_id,t1.name,t1.institute_id,t1.major_id
from student_school_info t1
where t1.school_year='2017';
View the partitions
show partitions student_school_info_partition;
select * from student_school_info_partition;
In the hadoop window, check the loaded data
hadoop fs -ls /user/hive/warehouse/rel.db/student_school_info_partition/   (a school_year=2017 directory has been added)
Drop the partition
alter table student_school_info_partition drop partition (school_year='2017');
Check the partitions again
show partitions student_school_info_partition;
Then look at the corresponding data on HDFS again:
hadoop fs -ls /user/hive/warehouse/rel.db/student_school_info_partition/
You will find that the partition directory and its data are gone as well (this is an internal table).
# Create an internal partitioned table (dynamic partitioning)
# Enable dynamic partitioning; nonstrict mode allows every partition column to be derived from the query
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table student_school_info_partition partition(school_year)
select t1.student_id,t1.name,t1.institute_id,t1.major_id,t1.school_year
from student_school_info t1;
show partitions student_school_info_partition;
select * from student_school_info_partition where school_year='2017';
# Create an external partitioned table (create the external table first, then add partitions)
create external table rel.student_school_info_external_partition(
student_id string,
name string,
institute_id string,
major_id string
)
partitioned by(school_year string)
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile
location '/user/hive/warehouse/data/student_school_info_external_partition';
Manually create the partition directory on HDFS:
hadoop fs -mkdir /user/hive/warehouse/data/student_school_info_external_partition/school_year=2017
Back in hive, the partition must be added to the table manually before Hive recognizes it (it is not picked up automatically); at this point show partitions student_school_info_external_partition; still returns nothing.
alter table student_school_info_external_partition add partition(school_year='2017');
show partitions student_school_info_external_partition;   now lists the new partition.
Now add some data.
Back in the shell:
cd /home/hadoop/apps/hive_test_data/
hadoop fs -put student_school_external_partition_data.txt /user/hive/warehouse/data/student_school_info_external_partition/school_year=2017
Back in hive:
select * from student_school_info_external_partition where school_year='2017';
Drop the external partition
alter table student_school_info_external_partition drop partition(school_year='2017');
Verify again
show partitions student_school_info_external_partition;
But on HDFS, hadoop fs -ls /user/hive/warehouse/data/student_school_info_external_partition/ shows the directory still exists and the data is still there: dropping a partition of an external table removes only the metadata, not the HDFS data.
#*************** Creating tables with LIKE and AS, renaming tables, adding/modifying/dropping columns *************
1. Use the like keyword to create a new table whose structure is identical to an existing table
create table student_info2 like student_info;
2. Use the as keyword to create a table whose columns match a query result, and insert that result into the new table
create table student_info3 as select * from student_info;
A table with only the student_id and name columns:
create table student_info4 as select student_id,name from student_info;
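The heading above also mentions renaming tables and adding, modifying, and dropping columns, which these notes do not show; here is a hedged sketch of those statements reusing the tables just created (the new column names are purely illustrative):
alter table student_info2 rename to student_info2_bak;
alter table student_info4 add columns (phone string comment 'illustrative new column');
alter table student_info4 change column phone phone_no string;
Hive has no direct "drop column"; to remove a column, use replace columns and list only the columns to keep, for example:
alter table student_info4 replace columns (student_id string, name string);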
# Bucketed tables
*************** Creating a bucketed table *************
1. Rows are assigned to buckets by the hash value of a chosen column
Create a bucketed student information table
Columns: student_id, name, age, origin
Bucketing column: student_id; 4 buckets; rows sorted by student_id ascending within each bucket
create table rel.student_info_bucket(
student_id string,
name string,
age int,
origin string
)
clustered by (student_id) sorted by (student_id asc) into 4 buckets
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile;
2. Insert data into the student_info_bucket bucketed table
set hive.enforce.bucketing = true;
set mapreduce.job.reduces=4;
insert overwrite table student_info_bucket
select student_id,name,age,origin
from student_info
cluster by(student_id);
View the bucket files on HDFS
hadoop fs -ls /user/hive/warehouse/rel.db/student_info_bucket
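One of the reasons for bucketing listed at the top of these notes is more efficient sampling; as a quick illustrative check on the table just populated, tablesample reads whole buckets (here the first of the four) instead of scanning the full table:
select * from student_info_bucket tablesample(bucket 1 out of 4 on student_id);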
Data is normally not loaded into a bucketed table with load: load only copies the file into the table's storage directory, and Hive does not analyse the data during a load and split it by the bucketing column, so the whole file ends up in the table without being bucketed. Instead, data is usually inserted into a bucketed table with insert, selecting from another table.
Creating a bucketed table only defines the table's model; when inserting, the following is required:
Every time a bucketed insert is run, set hive.enforce.bucketing = true in the current session to declare that this execution is a bucketing operation.
Set the number of reducers equal to the number of buckets with set mapreduce.job.reduces=4, so that exactly as many files are produced as there are buckets.
If the table is defined as sorted by the bucketing column, the data must be sorted on that column in the query feeding the insert; the bucketed table itself does not sort the data inside each bucket.
When the sort column and the bucketing column are the same, use cluster by (col); cluster by sorts ascending within each bucket by default. For descending order within a bucket, use the combination distribute by (col) sort by (col desc).
For example:
set hive.enforce.bucketing = true;
set mapreduce.job.reduces=4;
insert overwrite table student_info_bucket
select student_id,name,age,origin
from student_info
distribute by (student_id) sort by (student_id desc);
# Exporting data
Use insert to export the student_info table to a specified local path
insert overwrite local directory '/home/hadoop/apps/hive_test_data/export_data'
row format delimited fields terminated by '\t' select * from student_info;
A second way to export data to the local file system:
In another window: hive -e "select * from rel.student_info" > ./student_info_data.txt
ll
# Joins
create table rel.a(
id int,
name string
)
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile;
create table rel.b(
id int,
name string
)
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile;
load data local inpath '/home/hadoop/apps/hive_test_data/a_join_data' into table a;
load data local inpath '/home/hadoop/apps/hive_test_data/b_join_data' into table b;
**** join or inner join
The two tables are joined on id and only rows whose id values match are returned; join gives the same result as inner join.
select * from a join b on a.id=b.id;
Equivalent to:
select * from a inner join b on a.id=b.id;
**** full outer join or full join
Join the two tables on id and return all rows from both tables.
select * from a full outer join b on a.id=b.id;
(This query failed at first; running the following in another window to raise the number of reducers to 2000 and increase reducer memory made it succeed:
set mapred.reduce.tasks = 2000;
set mapreduce.reduce.memory.mb=16384;
set mapreduce.reduce.java.opts=-Xmx16384m;)
****left join
In a left join, every row of the left table is kept; right-table columns are NULL where there is no match.
select * from a left join b on a.id=b.id;
****right join
In a right join, every row of the right table is kept; left-table columns are NULL where there is no match.
select * from a right join b on a.id=b.id;
****left semi join
A left semi join implements IN/EXISTS-like query semantics and returns only the matching rows of the left table (columns of the right table may only be referenced in the on clause).
Hive did not originally support the IN ... EXISTS subqueries found in relational databases, and it does not support a right semi join.
For example:
select a.id, a.name from a where a.id in (select b.id from b);
The corresponding Hive statement is:
select a.id,a.name from a left semi join b on a.id = b.id;
****map side join
A map-side join uses the distributed cache to load the small table into every map task, so the join is completed on the map side; the data does not have to be shipped to a reduce stage for the join, which reduces the amount of data transferred between nodes. It is one technique for mitigating data skew in multi-table joins. In a multi-table join where only one table is large and all the others are small, the join is converted into a map-only job, and the job log will show "Number of reduce tasks is set to 0 since there's no reduce operator",
i.e. there is no reduce phase.
For example:
select /*+ mapjoin(b) */ a.id, a.name from a join b on a.id = b.id;
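Besides the explicit hint, Hive can convert a common join into a map join automatically when the smaller table is under a size threshold (this auto-conversion is enabled by default in Hive 1.2). An illustrative pair of session settings, with the usual default values shown:
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=25000000;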
# Built-in functions
Create a user score table
create table rel.user_core_info(
user_id string,
age int,
gender string,
core int
)
row format delimited fields terminated by '\t'
lines terminated by '\n'
stored as textfile;
load data local inpath '/home/hadoop/apps/hive_test_data/user_core.txt' into table rel.user_core_info;
1. Conditional function: case when
Syntax 1: CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END
Meaning: if a equals b, return c; if a equals d, return e; otherwise return f
For example:
hive> select case 1 when 2 then 'two' when 1 then 'one' else 'zero' end;
one
Syntax 2: CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END
Meaning: if a is TRUE, return b; if c is TRUE, return d; otherwise return e
For example:
hive> select case when 1=2 then 'two' when 1=1 then 'one' else 'zero' end;
one
Query the user score table for the maximum score per gender and age band
select gender,
case when age<=20 then 'p0' when age>20 and age<=50 then 'p1' when age>50 then 'p3' else 'p0' end,
max(core) max_core
from rel.user_core_info
group by gender,
case when age<=20 then 'p0' when age>20 and age<=50 then 'p1' when age>50 then 'p3' else 'p0' end;
# User-defined functions (UDF)
When Hive's built-in functions cannot meet a business requirement, consider a user-defined function (UDF: user-defined function).
A UDF operates on a single row and produces a single row as output.
Steps:
1. Develop a Java class that extends UDF and overloads the evaluate method (a sketch is given after the final query below)
2. Package it as a jar and upload it to the server
3. When using it, add the jar to Hive's classpath
hive>add jar /home/hadoop/apps/hive_test_data/HiveUdfPro-1.0-SNAPSHOT.jar;
4. Create a temporary function associated with the Java class
hive>create temporary function age_partition as 'cn.chinahadoop.udf.AgePartitionFunction';
5. The custom function can then be used in HQL
select gender,
age_partition(age),
max(core) max_core
from rel.user_core_info
group by gender,
age_partition(age);
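The source of the jar used above is not included in these notes; as a hypothetical sketch of what step 1 could look like, here is an AgePartitionFunction that reproduces the p0/p1/p3 age bands from the earlier case-when query (the real class inside HiveUdfPro-1.0-SNAPSHOT.jar may differ):

package cn.chinahadoop.udf;

import org.apache.hadoop.hive.ql.exec.UDF;

// Hypothetical UDF sketch: maps an age to an age-band label.
// Hive calls evaluate() once per input row.
public class AgePartitionFunction extends UDF {
    public String evaluate(Integer age) {
        if (age == null || age <= 20) {
            return "p0";
        } else if (age <= 50) {
            return "p1";
        } else {
            return "p3";
        }
    }
}

Compiled against the hive-exec dependency and packaged as a jar (step 2), it would behave like the case-when expression shown in the built-in function section.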