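Background
Hadoop/Hive does not natively support in-place updates, so dimension data in Hive cannot be maintained with row-level UPDATE statements. The following workaround can be used instead.
- Combining Hive with HBase
I consider this the preferred approach. HBase does not really update rows in place either: it appends a new version of the data with a timestamp and returns the version with the latest timestamp on read, but this is transparent to us. Create a table in HBase, create the same dimension table in Hive, and map the Hive table onto the HBase table; updating dimension data row by row in HBase then becomes much simpler. In ETL, the changed dimension records are typically exported from the databases of other online systems, loaded into Hadoop, and written to the HBase table with MapReduce, which effectively updates the dimension table seen by Hive.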
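The Hive-to-HBase mapping described above can be sketched roughly as follows. This is only an illustration, not the author's actual schema: the table name `dim_user`, the column family `info`, and the columns are hypothetical, and the sketch assumes the HBase table already exists and that Hive's HBase storage handler is available on the cluster.

```shell
# Write the HiveQL to a script file. All table and column names are hypothetical.
cat > create_dim_user.hql <<'EOF'
-- Hive table backed by an existing HBase table named dim_user.
-- EXTERNAL means dropping the Hive table leaves the HBase data in place.
CREATE EXTERNAL TABLE dim_user (
  user_id STRING,
  name    STRING,
  city    STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  -- :key maps to the HBase row key, the rest to family:qualifier pairs.
  "hbase.columns.mapping" = ":key,info:name,info:city"
)
TBLPROPERTIES ("hbase.table.name" = "dim_user");
EOF

# Run the script only where the hive CLI is installed.
if command -v hive >/dev/null 2>&1; then
  hive -f create_dim_user.hql
fi
```

Once the mapping exists, any row written to the HBase table is visible to Hive queries on `dim_user` without reloading the Hive table.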
The statements involved are introduced below, starting with the description from the Sqoop User Guide:
Sqoop provides an incremental import mode which can be used to retrieve only rows newer than some previously-imported set of rows.
Argument | Description |
---|---|
--check-column (col) | Specifies the column to be examined when determining which rows to import. (The column should not be of type CHAR/NCHAR/VARCHAR/VARNCHAR/LONGVARCHAR/LONGNVARCHAR.) |
--incremental (mode) | Specifies how Sqoop determines which rows are new. Legal values for mode include append and lastmodified. |
--last-value (value) | Specifies the maximum value of the check column from the previous import. |
Sqoop supports two types of incremental imports: append and lastmodified. You can use the --incremental argument to specify the type of incremental import to perform.
You should specify append mode when importing a table where new rows are continually being added with increasing row id values. You specify the column containing the row’s id with --check-column. Sqoop imports rows where the check column has a value greater than the one specified with --last-value.
An alternate table update strategy supported by Sqoop is called lastmodified mode. You should use this when rows of the source table may be updated, and each such update will set the value of a last-modified column to the current timestamp. Rows where the check column holds a timestamp more recent than the timestamp specified with --last-value are imported.
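A lastmodified import into HBase might look like the sketch below. It is wrapped in a shell function so the timestamp can be passed in; the function name `import_updated_rows` and the column name `last_update` are hypothetical, the connection placeholders are the same ones used elsewhere in this post, and `-P` asks for the password interactively instead of putting it on the command line.

```shell
# Hypothetical wrapper: import rows changed since the timestamp given as $1.
# Assumes the source table has a last_update column that every update
# sets to the current timestamp.
import_updated_rows() {
  sqoop import \
    --connect "jdbc:mysql://ip:port/db" \
    --username "username" -P \
    --table tablename \
    --hbase-table "namespace:tablename" \
    --column-family columnfamily \
    --incremental lastmodified \
    --check-column last_update \
    --last-value "$1"
}

# Example call (not executed here):
#   import_updated_rows "2024-01-01 00:00:00"
```

Because each imported row is written to HBase under its row key, a row that was updated in the source simply gets a newer version in HBase, which matches the update behaviour described earlier.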
At the end of an incremental import, the value which should be specified as --last-value for a subsequent import is printed to the screen. When running a subsequent import, you should specify --last-value in this way to ensure you import only the new or updated data. This is handled automatically by creating an incremental import as a saved job, which is the preferred mechanism for performing a recurring incremental import. See the section on saved jobs in the Sqoop User Guide for more information.
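As a rough sketch of the saved-job mechanism mentioned above, the incremental import can be registered once with `sqoop job --create` and then re-run with `sqoop job --exec`. The job name `dim_sync` and the two function names are hypothetical; the connection placeholders are the ones used elsewhere in this post.

```shell
# Hypothetical helper: register the incremental import as a saved job (run once).
# Note the standalone "--" that separates the job options from the import tool.
create_dim_sync_job() {
  sqoop job --create dim_sync -- import \
    --connect "jdbc:mysql://ip:port/db" \
    --username "username" -P \
    --table tablename \
    --hbase-table "namespace:tablename" \
    --column-family columnfamily \
    --incremental append \
    --check-column id \
    --last-value 0
}

# Hypothetical helper: run the saved job. Sqoop records the new last-value
# after each successful run, so it does not have to be passed by hand.
run_dim_sync_job() {
  sqoop job --exec dim_sync
}
```

`sqoop job --list` and `sqoop job --show dim_sync` can be used to inspect what has been saved, including the currently stored last value.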
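Summary:
The passage above is not hard to follow. An incremental import takes three arguments.
First argument:
--check-column (col): the column that controls the increment. It should not be a string type; typical choices are a time or id column.
Second argument:
--incremental (mode): the incremental mode. There are two choices: append and lastmodified.
Third argument:
--last-value (value): the value of the check column from which the import starts. For example, with --last-value 0 only rows whose check column is greater than 0 are imported.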
Combined with the remaining options, the full command looks like this:
sqoop import --connect jdbc:mysql://ip:port/db --table tablename --hbase-table namespace:tablename --column-family columnfamily --hbase-create-table --username 'username' --password 'password' --incremental append --check-column 'id' --last-value 0