Configuring HBase Secondary Indexes with Solr + hbase-solr (hbase-indexer)

Preface:
I set this up because a project needed HBase secondary indexing. The tutorials I found online were riddled with pitfalls, so I put together a reasonably complete walkthrough. Material that an experienced reader would find obvious is deliberately trimmed or skipped; this article is aimed at readers who are new to the area but have some Linux and Hadoop background, not at complete beginners.

Environment:
OS: CentOS 6.7 x86_64
JDK: jdk1.7.0_109
hadoop-2.6.0+cdh5.4.1
hbase-solr-1.5+cdh5.4.1 (hbase-indexer-1.5-cdh5.4.1)
solr-4.10.3-cdh5.4.1
zookeeper-3.4.5-cdh5.4.1
hbase-1.0.0-cdh5.4.1

Download page for the CDH software used in this article:
CDH 5.4.x Packaging and Tarball Information | 5.x | Cloudera Documentation

I. Basic environment preparation

1. A three-node Hadoop cluster, with server roles planned as follows:

[Figure: server role assignment]

Get the NameNode, DataNode, ZooKeeper, JournalNode, and ZKFC processes running first; how to do that is outside the scope of this article.

2. Download the required CDH tarballs:

Download the tarballs from the page linked above. Note that the hbase-solr tarball contains the whole project tree, but we only need the distribution artifact inside it: unpack the hbase-solr-1.5+cdh5.4.1 tarball and, under hbase-solr-1.5-cdh5.4.1/hbase-indexer-dist/target, find hbase-indexer-1.5-cdh5.4.1.tar.gz, which we will use later.
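For example, a minimal shell sketch (assuming the tarball sits in the current directory; the exact filename may differ slightly from the download page):

tar zxvf hbase-solr-1.5-cdh5.4.1.tar.gz
ls hbase-solr-1.5-cdh5.4.1/hbase-indexer-dist/target/hbase-indexer-1.5-cdh5.4.1.tar.gz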

II. Deploying hbase-indexer

Copy hbase-indexer-1.5-cdh5.4.1.tar.gz to node2 (the configured directory will be copied over to node3 later).
Unpack hbase-indexer-1.5-cdh5.4.1.tar.gz:

tar zxvf hbase-indexer-1.5-cdh5.4.1.tar.gz

Edit the hbase-indexer configuration:

vim hbase-indexer-1.5-cdh5.4.1/conf/hbase-indexer-site.xml

<?xml version="1.0"?>
<configuration>
  <property>
    <name>hbaseindexer.zookeeper.connectstring</name>
    <!-- adjust to match your ZooKeeper ensemble -->
    <value>node1:2181,node2:2181,node3:2181</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <!-- adjust to match your ZooKeeper ensemble -->
    <value>node1,node2,node3</value>
  </property>
</configuration>

Configure hbase-indexer-env.sh:

vim hbase-indexer-1.5-cdh5.4.1/conf/hbase-indexer-env.sh

Set JAVA_HOME:

# Set environment variables here.

# This script sets variables multiple times over the course of starting an hbase-indexer process,
# so try to keep things idempotent unless you want to take an even deeper look
# into the startup scripts (bin/hbase-indexer, etc.)

# The java implementation to use.  Java 1.6 required.
export JAVA_HOME=/usr/java/jdk1.7.0/
# adjust for your environment

# Extra Java CLASSPATH elements.  Optional.
# export HBASE_INDEXER_CLASSPATH=

# The maximum amount of heap to use, in MB. Default is 1000.
# export HBASE_INDEXER_HEAPSIZE=1000

# Extra Java runtime options.
# Below are what we set by default.  May only work with SUN JVM.
# For more on why as well as other possible settings,
# see http://wiki.apache.org/hadoop/PerformanceTuning
export HBASE_INDEXER_OPTS="$HBASE_INDEXER_OPTS -XX:+UseConcMarkSweepGC"

Use scp to copy the entire hbase-indexer-1.5-cdh5.4.1 directory to node3.
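For example (the /home/HBasetest target path follows the paths used elsewhere in this article; adjust to your layout):

scp -r hbase-indexer-1.5-cdh5.4.1 node3:/home/HBasetest/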

III. Deploying HBase

Unpack the HBase tarball:

tar zxvf hbase-1.0.0-cdh5.4.1.tar.gz

Likewise, edit hbase-site.xml:

vim hbase-1.0.0-cdh5.4.1/conf/hbase-site.xml

Add the following inside the <configuration> tag:

   <property>
    <name>hbase.rootdir</name>
    <value>hdfs://node1:9000/hbase</value>
    <description>The directory shared by RegionServers</description>
  </property>
  <property>
    <name>hbase.master</name>
    <value>node1:60000</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
    <description>The mode the cluster will be in. Possible values are
      false: standalone and pseudo-distributed setups with managed Zookeeper
      true: fully-distributed with unmanaged Zookeeper Quorum (see hbase-env.sh)
    </description>
  </property>
  <property>
    <name>hbase.replication</name>
    <value>true</value>
    <description>SEP is basically replication, so enable it</description>
  </property>
  <property>
    <name>replication.source.ratio</name>
    <value>1.0</value>
    <description>Source ratio of 100% makes sure that each SEP consumer is actually used (otherwise, some can sit idle, especially with small clusters)</description>
  </property>
  <property>
    <name>replication.source.nb.capacity</name>
    <value>1000</value>
    <description>Maximum number of hlog entries to replicate in one go. If this is large, and a consumer takes a while to process the events, the HBase rpc call will time out.</description>
  </property>
  <property>
    <name>replication.replicationsource.implementation</name>
    <value>com.ngdata.sep.impl.SepReplicationSource</value>
    <description>A custom replication source that fixes a few things and adds some functionality (doesn't interfere with normal replication usage).</description>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>node1,node2,node3</value>
    <description>Comma separated list of servers in the ZooKeeper quorum.</description>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <!-- note: this is the ZooKeeper data directory; see zoo.cfg on the ZooKeeper nodes -->
    <value>/home/HBasetest/zookeeperdata</value>
    <description>Property from ZooKeeper's config zoo.cfg.
      The directory where the snapshot is stored.
    </description>
  </property>

Similarly, edit hbase-env.sh:

vim hbase-1.0.0-cdh5.4.1/conf/hbase-env.sh

Set JAVA_HOME and HBASE_HOME:

# Set environment variables here.

# This script sets variables multiple times over the course of starting an hbase process,
# so try to keep things idempotent unless you want to take an even deeper look
# into the startup scripts (bin/hbase, etc.)

# The java implementation to use.  Java 1.7+ required.
# export JAVA_HOME=/usr/java/jdk1.6.0/

export JAVA_HOME=/opt/jdk1.7.0_79
export HBASE_HOME=/home/HBasetest/hbase-1.0.0-cdh5.4.1
# fill in for your environment

# Extra Java CLASSPATH elements.  Optional.
# export HBASE_CLASSPATH=

# The maximum amount of heap to use, in MB. Default is 1000.
# export HBASE_HEAPSIZE=1000

# Uncomment below if you intend to use off heap cache.
# export HBASE_OFFHEAPSIZE=1000

# For example, to allocate 8G of offheap, to 8G:
# export HBASE_OFFHEAPSIZE=8G

# Extra Java runtime options.
# Below are what we set by default.  May only work with SUN JVM.
# For more on why as well as other possible settings,
# see http://wiki.apache.org/hadoop/PerformanceTuning
export HBASE_OPTS="-XX:+UseConcMarkSweepGC"

Copy these four jars from the hbase-indexer-1.5-cdh5.4.1/lib directory into hbase-1.0.0-cdh5.4.1/lib/:

hbase-sep-api-1.5-cdh5.4.1.jar
hbase-sep-impl-1.5-hbase1.0-cdh5.4.1.jar
hbase-sep-impl-common-1.5-cdh5.4.1.jar
hbase-sep-tools-1.5-cdh5.4.1.jar
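A sketch of the copy, assuming both directories sit side by side on the same machine:

cp hbase-indexer-1.5-cdh5.4.1/lib/hbase-sep-api-1.5-cdh5.4.1.jar \
   hbase-indexer-1.5-cdh5.4.1/lib/hbase-sep-impl-1.5-hbase1.0-cdh5.4.1.jar \
   hbase-indexer-1.5-cdh5.4.1/lib/hbase-sep-impl-common-1.5-cdh5.4.1.jar \
   hbase-indexer-1.5-cdh5.4.1/lib/hbase-sep-tools-1.5-cdh5.4.1.jar \
   hbase-1.0.0-cdh5.4.1/lib/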

Edit hbase-1.0.0-cdh5.4.1/conf/regionservers to contain:

node2
node3

Then copy the hbase-1.0.0-cdh5.4.1 directory to node2 and node3.
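For example (again assuming the /home/HBasetest layout, which matches HBASE_HOME above):

scp -r hbase-1.0.0-cdh5.4.1 node2:/home/HBasetest/
scp -r hbase-1.0.0-cdh5.4.1 node3:/home/HBasetest/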

IV. Deploying Solr

Just unpack the tarball on node1; nothing else is needed at this point.
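That is, something like (filename per the download page):

tar zxvf solr-4.10.3-cdh5.4.1.tar.gz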

V. Starting everything up

1. Start HBase

On node1, run:

./hbase-1.0.0-cdh5.4.1/bin/start-hbase.sh

2. Start HBase-Indexer

On node2 and node3, run:

./hbase-indexer-1.5-cdh5.4.1/bin/hbase-indexer server 

To run it in the background, use screen or nohup.
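For example, with nohup (the log path is arbitrary):

nohup ./hbase-indexer-1.5-cdh5.4.1/bin/hbase-indexer server > hbase-indexer.log 2>&1 &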

3. Start Solr

On node1, go into the example subdirectory under the unpacked Solr directory and run:

java -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -DzkHost=node1:2181,node2:2181,node3:2181/solr -jar start.jar

Again, use screen or nohup if you want it in the background.
Solr's admin page is then reachable at http://node1:8983/solr/#/
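If you prefer the command line, Solr's cores status API gives a quick health check (node and port as configured above):

curl "http://node1:8983/solr/admin/cores?action=STATUS&wt=json"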

VI. Indexing test

With the Hadoop cluster, HBase, HBase-Indexer, and Solr all running, first create a table in HBase.
From the HBase deployment directory on any node, run:

./bin/hbase shell
create 'indexdemo-user', { NAME => 'info', REPLICATION_SCOPE => '1' }

REPLICATION_SCOPE => '1' matters here: hbase-indexer consumes the replication stream (SEP), so only column families with replication enabled get indexed.

On one of the nodes where HBase-Indexer is deployed, go into the deployment directory. An indexer is defined by a field-definition file; sample files live under hbase-indexer-1.5-cdh5.4.1/demo/. Create one with the following content:

<?xml version="1.0"?>
<indexer table="indexdemo-user">
  <field name="firstname_s" value="info:firstname"/>
  <field name="lastname_s" value="info:lastname"/>
  <field name="age_i" value="info:age" type="int"/>
</indexer>

Save it as hbase-indexer-1.5-cdh5.4.1/demo/indexdemo-indexer.xml.

Add the indexer instance. From the hbase-indexer-1.5-cdh5.4.1 directory, run:

./bin/hbase-indexer add-indexer -n myindexer -c demo/indexdemo-indexer.xml -cp \
solr.zk=node1:2181,node2:2181,node3:2181/solr -cp solr.collection=collection1 -z node1,node2,node3
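As a quick smoke test before bulk loading (row key and values here are arbitrary), insert one row from the HBase shell in the HBase deployment directory:

./bin/hbase shell
put 'indexdemo-user', 'row1', 'info:firstname', 'John'
put 'indexdemo-user', 'row1', 'info:lastname', 'Smith'

A search for firstname_s:John in the collection1 UI should return the row shortly afterwards.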

Now prepare some test data. The project calls for index tests on tens of millions of rows, so typing puts by hand is obviously out. The HBase shell can batch-execute a text file of commands, but that too is hopeless at this scale, so in the end I wrote a small Java program to bulk-insert rows quickly.
Create a Java project in Eclipse and add everything under the HBase deployment's lib directory to the build path. The source:

package com.hbasetest.hbtest;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;

public class DataInput {
    private static Configuration configuration;
    static {
        configuration = HBaseConfiguration.create();
        configuration.set("hbase.zookeeper.property.clientPort", "2181");
        configuration.set("hbase.zookeeper.quorum", "node1,node2,node3");
    }

    public static void main(String[] args) {
        try {
            List<Put> putList = new ArrayList<Put>();
            HTable table = new HTable(configuration, "indexdemo-user");
            for (int i = 0; i <= 14000000; i++) {
                // the row key is simply the loop counter as a string
                Put put = new Put(Integer.toString(i).getBytes());
                put.add("info".getBytes(), "firstname".getBytes(),
                        ("Java.value.firstname" + Integer.toString(i)).getBytes());
                put.add("info".getBytes(), "lastname".getBytes(),
                        ("Java.value.lastname" + Integer.toString(i)).getBytes());
                putList.add(put);
                System.out.println("queued row " + Integer.toString(i));
            }
            // all 14 million Puts are sent in one batch call,
            // so the whole list has to fit in the client JVM's heap
            table.put(putList);
            table.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

This program uses one big batched put. If the machine running it is short on memory, divide and conquer: use several smaller putLists and flush them as you go.
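A minimal sketch of that split, replacing the loop and the single table.put(putList) call in main() above (batchSize is an arbitrary knob introduced here; tune it to your heap):

final int batchSize = 10000; // hypothetical chunk size
List<Put> batch = new ArrayList<Put>(batchSize);
for (int i = 0; i <= 14000000; i++) {
    Put put = new Put(Integer.toString(i).getBytes());
    put.add("info".getBytes(), "firstname".getBytes(),
            ("Java.value.firstname" + i).getBytes());
    put.add("info".getBytes(), "lastname".getBytes(),
            ("Java.value.lastname" + i).getBytes());
    batch.add(put);
    if (batch.size() >= batchSize) {
        table.put(batch); // write this chunk
        batch.clear();    // reuse the list for the next chunk
    }
}
if (!batch.isEmpty()) {
    table.put(batch); // flush the remainder
}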

The remaining retrieval tests are straightforward, so I won't belabor them.
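For instance, a quick query from the command line (field name per the indexer definition above; the value assumes the Java loader has run):

curl "http://node1:8983/solr/collection1/select?q=firstname_s:Java.value.firstname100&wt=json"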

    Original author: 耗子在简书
    Original link: https://www.jianshu.com/p/a7e2079ded33