Reading Elasticsearch data from Hive

June 8, 2019. Source: nicklbx

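References:

Reading and writing ES data with Hive: http://blog.csdn.net/u013063153/article/details/60757307
Official documentation, Hive integration with ES: https://www.elastic.co/guide/en/elasticsearch/hadoop/current/hive.html#hive-type-conversion
Using the Hive complex data types array, map and struct: http://blog.csdn.net/gamer_gyt/article/details/52169441

A few notes:

To map Elasticsearch data types onto Hive types, you can use Hive complex types such as map and array to hold nested (two-level) JSON from ES.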

Add elasticsearch-hadoop-hive-2.2.0-beta1.jar:

add jar file:///home/liuxiaowen/elasticsearch-hadoop-2.2.0-beta1/dist/elasticsearch-hadoop-hive-2.2.0-beta1.jar;

Create the external table:

CREATE EXTERNAL TABLE lxw1234_es_tags (
  cookieid string,
  area string,
  media_view_tags string,
  interest string,
  userInfo map<string,string>
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
  'es.nodes' = '172.16.212.17:9200,172.16.212.102:9200',
  'es.index.auto.create' = 'false',
  'es.resource' = 'lxw1234/tags',
  'es.read.metadata' = 'true',
  'es.mapping.names' = 'cookieid:_metadata._id, area:area, media_view_tags:media_view_tags, interest:interest, userInfo:userInfo');
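
Once the table exists it can be queried like any other Hive table, and map columns are read with the usual bracket syntax. A minimal sketch, assuming the userInfo documents contain a key named age (a hypothetical key, not taken from the original index):

```sql
-- Read a few rows from the ES-backed table.
-- 'age' is a hypothetical key inside userInfo, used only for illustration.
SELECT cookieid,
       area,
       userInfo['age'] AS age
FROM lxw1234_es_tags
LIMIT 10;
```

A key that is missing from the map simply comes back as NULL, so the query does not fail on documents that lack it.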


ES settings in the CREATE TABLE statement (points to note)

(Reference: https://www.elastic.co/guide/en/elasticsearch/hadoop/2.4/configuration.html)

1. es.resource

Elasticsearch resource location, where data is read and written to. Requires the format <index>/<type> (relative to the Elasticsearch host/port (see below)).

es.resource = twitter/tweet # index 'twitter', type 'tweet'

es.resource.read (defaults to es.resource)

Elasticsearch resource used for reading (but not writing) data. Useful when reading and writing data to different Elasticsearch indices within the same job. Typically set automatically (except for the Map/Reduce module which requires manual configuration).

es.resource.write (defaults to es.resource)

Elasticsearch resource used for writing (but not reading) data. Used typically for dynamic resource writes or when writing and reading data to different Elasticsearch indices within the same job. Typically set automatically (except for the Map/Reduce module which requires manual configuration).

Note that multiple indices and/or types are allowed only for reading. Use _all/types to search types in all indices or index/ to search all types within index. Do note that reading multiple indices/types typically works only when they have the same structure and only with some libraries. Integrations that require a strongly typed mapping (such as a table like Hive or SparkSQL) are likely to fail.

Explanation

Search the given types in all indices:
'es.resource' = '_all/types'
Search all types within the index named index:
'es.resource' = 'index/'

2. es.query (default none)

Holds the query used for reading data from the specified es.resource. By default it is not set/empty, meaning the entire data under the specified index/type is returned. es.query can have three forms:

uri query

using the form ?uri_query, one can specify a query string. Notice the leading ?.

query dsl

using the form query_dsl – note the query dsl needs to start with { and end with } as mentioned here

external resource

if none of the two above do match, elasticsearch-hadoop will try to interpret the parameter as a path within the HDFS file-system. If that is not the case, it will try to load the resource from the classpath or, if that fails, from the Hadoop DistributedCache. The resource should contain either a uri query or a query dsl.

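Explanation

1) As a URI query:
es.query = ?q=98E5D2DE059F1D563D8565
2) As a query DSL:
es.query = { "query" : { "term" : { "user" : "costinl" } } }
In a Hive TBLPROPERTIES clause the same setting is written as a quoted property:
'es.query' = '{"query": {"match_all": {}}}'

3) As an external resource:
es.query = org/mypackage/myquery.json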
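
To show where es.query sits in practice, here is a rough sketch of a second external table over the same index that only exposes documents matching a URI query. The table name lxw1234_es_tags_filtered, the reduced column list and the query value beijing are assumptions made for illustration; only the storage handler, es.nodes and es.resource values come from the table definition earlier in this post.

```sql
-- Sketch: an ES-backed table restricted by a URI query.
-- Table name, column subset and the query value are illustrative assumptions.
CREATE EXTERNAL TABLE lxw1234_es_tags_filtered (
  area string,
  interest string
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
  'es.nodes' = '172.16.212.17:9200,172.16.212.102:9200',
  'es.resource' = 'lxw1234/tags',
  -- Note the leading '?', which marks the value as a URI query.
  'es.query' = '?q=area:beijing'
);
```

Because the filter is part of the table definition, every SELECT against this table should see only the matching documents; a different filter needs its own table definition or a change to the table property.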

    Original author: nicklbx
    Original URL: https://www.jianshu.com/p/f797e132d341