Transferring Data Between HDFS and MySQL with Sqoop
1. View the help
[root@repo bin]# ./sqoop help
Available commands:
codegen Generate code to interact with database records
create-hive-table Import a table definition into Hive
eval Evaluate a SQL statement and display the results
export Export an HDFS directory to a database table
help List available commands
import Import a table from a database to HDFS
import-all-tables Import tables from a database to HDFS
import-mainframe Import datasets from a mainframe server to HDFS
job Work with saved jobs
list-databases List available databases on a server
list-tables List available tables in a database
merge Merge results of incremental imports
metastore Run a standalone Sqoop metastore
version Display version information
See 'sqoop help COMMAND' for information on a specific command.
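To see the full option list of a single tool, pass its name to help, for example:
[root@repo bin]# ./sqoop help import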
2. List the databases in MySQL
[root@repo bin]# ./sqoop list-databases \
--connect jdbc:mysql://192.168.9.100:3306/ \
--username root \
--password 123456
# Result:
information_schema
hive_more_users
hive_single_user
mysql
test
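list-tables (listed in the help output above) works the same way and is a quick check that the source table exists; a sketch, assuming the tables used below live in the test database:
[root@repo bin]# ./sqoop list-tables \
--connect jdbc:mysql://192.168.9.100:3306/test \
--username root \
--password 123456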
3. Import data into HDFS
(1) Overview of common arguments
Common arguments:
--connect <jdbc-uri> Specify JDBC connect string
--connection-manager <class-name> Specify connection manager class name
--connection-param-file <properties-file> Specify connection parameters file
--driver <class-name> Manually specify JDBC driver class to use
--hadoop-home <hdir> Override $HADOOP_MAPRED_HOME_ARG
--hadoop-mapred-home <dir> Override $HADOOP_MAPRED_HOME_ARG
--help Print usage instructions
...
(2) Prepare the MySQL table
mysql> select * from Ownerinfo;
+--------------------+--------+------+
| Ownerid | Name | Age |
+--------------------+--------+------+
| 110101200001010001 | 刘备 | 53 |
| 110101200001010002 | 关羽 | 42 |
| 110101200001010003 | 张飞 | 35 |
| 110101200001010004 | 赵云 | 33 |
| 110101200001010005 | 马超 | 38 |
| 110101200001010006 | 黄忠 | 70 |
+--------------------+--------+------+
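The table definition itself is not shown; a minimal DDL sketch that would match this data, assuming (consistent with the later note about --split-by) that no primary key is declared:
mysql> create table Ownerinfo (
    ->   Ownerid char(18),
    ->   Name    varchar(32),
    ->   Age     int
    -> );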
(3) Import the MySQL table into the default HDFS path
[root@repo bin]# ./sqoop import \
--connect jdbc:mysql://192.168.9.100:3306/test \
--username root \
--password 123456 \
--table Ownerinfo \
--split-by Ownerid
# Result:
[root@repo bin]# hdfs dfs -cat /user/root/Ownerinfo/*
110101200001010001,刘备,53
110101200001010002,关羽,42
110101200001010003,张飞,35
110101200001010004,赵云,33
110101200001010005,马超,38
110101200001010006,黄忠,70
Notes:
- If no HDFS target path is specified, the data is written directly to /user/user_name/table_name/.
- If the table has no primary key, a split column must be specified with --split-by (or the import forced onto a single map task; see the sketch after these notes).
- By default, 4 map tasks run the import in parallel.
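When no column is suitable as a split column, forcing a single map task removes the need for --split-by altogether; a minimal sketch (no --target-dir, so the default /user/root/Ownerinfo/ path is used):
[root@repo bin]# ./sqoop import \
--connect jdbc:mysql://192.168.9.100:3306/test \
--username root \
--password 123456 \
--table Ownerinfo \
--num-mappers 1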
(4) Import the MySQL table into a specified HDFS path with a given number of map tasks
[root@repo bin]# ./sqoop import \
--connect jdbc:mysql://192.168.9.100:3306/test \
--username root \
--password 123456 \
--table Ownerinfo \
--target-dir /user/root/SQOOP/import/Ownerinfo_common \
--num-mappers 1 \
--split-by Ownerid
Result:
[root@repo bin]# hdfs dfs -cat SQOOP/import/Ownerinfo_common/*
110101200001010001,刘备,53
110101200001010002,关羽,42
110101200001010003,张飞,35
110101200001010004,赵云,33
110101200001010005,马超,38
110101200001010006,黄忠,70
(5) Import the MySQL table into HDFS in Parquet format
Command:
[root@repo bin]# ./sqoop import \
--connect jdbc:mysql://192.168.9.100:3306/test \
--username root \
--password 123456 \
--table Ownerinfo \
--target-dir /user/root/SQOOP/import/Ownerinfo_parquet \
--num-mappers 1 \
--as-parquetfile \
--split-by Ownerid
Sqoop can write files to HDFS in any of the following three formats (a SequenceFile variant is sketched after this list):
- textfile (default)
- sequencefile
- parquetfile
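The format is chosen with the matching flag (--as-textfile, --as-sequencefile, --as-parquetfile). For example, a SequenceFile variant of the command above; the target directory name is only an illustration:
[root@repo bin]# ./sqoop import \
--connect jdbc:mysql://192.168.9.100:3306/test \
--username root \
--password 123456 \
--table Ownerinfo \
--target-dir /user/root/SQOOP/import/Ownerinfo_sequence \
--num-mappers 1 \
--as-sequencefile \
--split-by Ownerid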
Create a Hive table to read the parquetfile just imported:
drop table if exists hive_Ownerinfo;
create table hive_Ownerinfo (Ownerid string,Name string,Age int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS PARQUET;
load data inpath '/user/root/SQOOP/import/Ownerinfo_parquet/3283c8d3-a239-44b9-80dd-0df4bdcebea1.parquet' into table hive_Ownerinfo;
hive> select * from hive_ownerinfo;
110101200001010001 刘备 53
110101200001010002 关羽 42
110101200001010003 张飞 35
110101200001010004 赵云 33
110101200001010005 马超 38
110101200001010006 黄忠 70
(6) Filter the MySQL data before importing into HDFS
--<1> Specify the columns to import
[root@repo bin]# ./sqoop import \
--connect jdbc:mysql://192.168.9.100:3306/test \
--username root \
--password 123456 \
--table Ownerinfo \
--target-dir /user/root/SQOOP/import/Ownerinfo_column \
--num-mappers 1 \
--columns Ownerid,Age \
--split-by Ownerid
--<2> Specify a query statement
[root@repo bin]# ./sqoop import \
--connect jdbc:mysql://192.168.9.100:3306/test \
--username root \
--password 123456 \
--query 'select Ownerid, Age from Ownerinfo where $CONDITIONS' \
--target-dir /user/root/SQOOP/import/Ownerinfo_select \
--num-mappers 1 \
--split-by Ownerid
# Result:
[root@repo bin]# hdfs dfs -cat SQOOP/import/Ownerinfo_select/*
110101200001010001,53
110101200001010002,42
110101200001010003,35
110101200001010004,33
110101200001010005,38
110101200001010006,70
Note:
The query must contain a WHERE clause. Even when no filter is needed, "WHERE $CONDITIONS" must still be written so that Sqoop has a placeholder into which it injects its split conditions (combining a real filter with the placeholder is sketched below).
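A real filter is simply combined with the placeholder using AND; a sketch, where the age condition and the target directory are chosen only for illustration:
[root@repo bin]# ./sqoop import \
--connect jdbc:mysql://192.168.9.100:3306/test \
--username root \
--password 123456 \
--query 'select Ownerid, Age from Ownerinfo where Age > 40 and $CONDITIONS' \
--target-dir /user/root/SQOOP/import/Ownerinfo_filtered \
--num-mappers 1 \
--split-by Ownerid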
(7) Import the MySQL table into HDFS with Snappy compression, deleting the target directory first if it already exists
[root@repo bin]# ./sqoop import \
--connect jdbc:mysql://192.168.9.100:3306/test \
--username root \
--password 123456 \
--table Ownerinfo \
--target-dir /user/root/SQOOP/import/Ownerinfo_compress \
--num-mappers 1 \
--columns Ownerid,Age \
--compress \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec \
--delete-target-dir \
--split-by Ownerid
Create a Hive table to read the Snappy-compressed file just imported:
drop table if exists hive_Ownerinfo;
create table hive_Ownerinfo (Ownerid string, Age int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
load data inpath '/user/root/SQOOP/import/Ownerinfo_compress/part-m-00000.snappy' into table hive_Ownerinfo;
hive> select * from hive_Ownerinfo;
110101200001010001 53
110101200001010002 42
110101200001010003 35
110101200001010004 33
110101200001010005 38
110101200001010006 70
4. Incremental import
(1) Importing selected rows
-- Method 1: specify a query statement
[root@repo bin]# ./sqoop import \
--connect jdbc:mysql://192.168.9.100:3306/test \
--username root \
--password 123456 \
--query 'select * from Ownerinfo where Age > 30 and Age < 50 and $CONDITIONS' \
--target-dir /user/root/SQOOP/import/Ownerinfo_select \
--num-mappers 1 \
--split-by Ownerid
-- Method 2: use the incremental import options (append rows with Age > 33)
[root@repo bin]# ./sqoop import \
--connect jdbc:mysql://192.168.9.100:3306/test \
--username root \
--password 123456 \
--table Ownerinfo \
--target-dir /user/root/SQOOP/import/Ownerinfo_incremental \
--num-mappers 1 \
--incremental append \
--check-column Age \
--last-value 33
# Result:
[root@repo bin]# hdfs dfs -cat /user/root/SQOOP/import/Ownerinfo_incremental/part-m-00000
110101200001010001,刘备,53
110101200001010002,关羽,42
110101200001010003,张飞,35
110101200001010005,马超,38
110101200001010006,黄忠,70
Notes:
(1) A non-numeric column cannot be used as the incremental check column:
Error during import: Character column (Ownerid) can not be used to determine which rows to incrementally import.
(2) Incremental import cannot be used together with the --delete-target-dir option.
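A one-off command like the one above forces you to track the next --last-value by hand. A saved Sqoop job (the job tool shown in the help output) records it automatically; a minimal sketch, with the job name ownerinfo_inc being an assumption:
[root@repo bin]# ./sqoop job --create ownerinfo_inc \
-- import \
--connect jdbc:mysql://192.168.9.100:3306/test \
--username root \
--password 123456 \
--table Ownerinfo \
--target-dir /user/root/SQOOP/import/Ownerinfo_incremental \
--num-mappers 1 \
--incremental append \
--check-column Age \
--last-value 33
# Execute the saved job; the stored last-value is updated after each successful run:
[root@repo bin]# ./sqoop job --exec ownerinfo_inc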
(2) Direct mode: pulls data straight from the database, which is more efficient
[root@repo bin]# ./sqoop import \
--connect jdbc:mysql://192.168.9.100:3306/test \
--username root \
--password 123456 \
--table Ownerinfo \
--target-dir /user/root/SQOOP/import/Ownerinfo_direct \
--columns Ownerid,Age \
--direct \
--delete-target-dir \
--split-by Ownerid
5. Export data from HDFS into a MySQL table
Contents of /user/root/SQOOP/export/users.txt:
1,Jed,15
2,Tom,16
3,Tony,17
4,Bob,18
5,Harry,19
6,Jack,20
[root@repo bin]# ./sqoop export \
--connect jdbc:mysql://192.168.9.100:3306/test \
--username root \
--password 123456 \
--table users \
--export-dir /user/root/SQOOP/export \
--num-mappers 1
mysql> select * from users;
+----+-------+------+
| Id | name | age |
+----+-------+------+
| 1 | Jed | 15 |
| 2 | Tom | 16 |
| 3 | Tony | 17 |
| 4 | Bob | 18 |
| 5 | Harry | 19 |
| 6 | Jack | 20 |
+----+-------+------+
Note:
The MySQL users table must be created in advance; a matching DDL sketch follows.
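A minimal sketch of DDL that matches the exported file; the column types and lengths are assumptions:
mysql> create table users (
    ->   Id   int,
    ->   name varchar(32),
    ->   age  int
    -> );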