(14) A Few Pitfalls When Connecting SparkSQL to Hive from IDEA on Win10

An earlier post in this series covered accessing Hive data through spark-shell. So how do you connect to Hive and query its data from IDEA?
There are plenty of articles about this online, but (probably because of environment differences) I ran into quite a few problems along the way, so I am recording them here.
I. Initial environment
1. To start with, my pom.xml contained the following dependencies:

    <scala.version>2.11.8</scala.version>
    <spark.version>2.3.1</spark.version>
    <hadoop.version>2.6.0-cdh5.7.0</hadoop.version>
    <hive.version>1.1.0-cdh5.7.0</hive.version>

    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.11</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-hive_2.11</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <dependency>
      <groupId>mysql</groupId>
      <artifactId>mysql-connector-java</artifactId>
      <version>5.1.27</version>
    </dependency>

2. Copy hive-site.xml, core-site.xml, and hdfs-site.xml into the resources folder.
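
If Spark later behaves as if these files were not there, it is usually because they did not end up on the runtime classpath. A quick check I find handy (my own sketch, not from the original post) is to resolve them through the classloader before building the SparkSession:

object ResourceCheck {
  def main(args: Array[String]): Unit = {
    // Each of these should print a file:/... URL (typically pointing into target/classes);
    // a null means the file is not on the classpath and Spark will not see it.
    Seq("hive-site.xml", "core-site.xml", "hdfs-site.xml")
      .foreach(f => println(s"$f -> " + getClass.getClassLoader.getResource(f)))
  }
}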

The contents of the three config files are shown below. (Some of the hive-site.xml entries were copied straight off the internet and turned out to be pitfalls; I will go through them one by one. As a side note, I also tried keeping only hive-site.xml without the other two files, and the connection still worked.)
1)hive-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <!-- Directory used to store execution plans for the map/reduce stages and their intermediate output -->
    <property>
        <name>hive.exec.scratchdir</name>
        <value>hdfs://192.168.137.141:9000/hive/tmp</value>
        <description>Scratch space for Hive jobs</description>
    </property>
    <!-- Directory on HDFS where Hive data is stored -->
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>hdfs://192.168.137.141:9000/hive/warehouse</value>
        <description>location of default database for the warehouse</description>
    </property>
    <!-- Directory for Hive's runtime query logs; if empty, no structured query log is created -->
    <property>
        <name>hive.querylog.location</name>
        <value>hdfs://192.168.137.141:9000/hive/log</value>
        <description>Location of Hive run time structured log file</description>
    </property>
    <!-- Address of the Hive metastore service -->
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://192.168.137.141:9083</value>
    </property>
    <property>
        <name>hive.metastore.local</name>
        <value>false</value>
        <description>controls whether to connect to remote metastore server or open a new metastore server in Hive Client JVM</description>
    </property>
    <!-- TCP port HiveServer2 listens on; defaults to 10000 -->
    <property>
        <name>hive.server2.thrift.port</name>
        <value>10000</value>
        <description>Port number of HiveServer2 Thrift interface. Can be overridden by setting $HIVE_SERVER2_THRIFT_PORT</description>
    </property>
    <!-- Metastore schema verification -->
    <property>
        <name>hive.metastore.schema.verification</name>
        <value>false</value>
        <description>
        Enforce metastore schema version consistency.
        True: Verify that version information stored in metastore matches with one from Hive jars.  Also disable automatic schema migration attempt. Users are required to manually migrate schema after Hive upgrade which ensures proper metastore schema migration. (Default)
        False: Warn if the version information stored in metastore doesn't match with one from in Hive jars.
        </description>
    </property>
    <!-- JDBC connection for the metastore database -->
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://192.168.137.141:3306/ruozedata_basic03?createDatabaseIfNotExist=true</value>
        <description>JDBC connect string for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
        <description>Driver class name for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
        <description>username to use against metastore database</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>123456</value>
        <description>password to use against metastore database</description>
    </property>
</configuration>

2)hdfs-site.xml

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>192.168.137.141:50090</value>
    </property>
    <property>
        <name>dfs.namenode.secondary.https-address</name>
        <value>192.168.137.141:50091</value>
    </property>
</configuration>

3)core-site.xml

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://192.168.137.141:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>hdfs://192.168.137.141:9000/hadoop/tmp</value>
    </property>
</configuration>

3. The code in IDEA is as follows:

import org.apache.spark.sql.SparkSession

object HiveConnApp {
  def main(args: Array[String]): Unit = {
    // Local SparkSession with Hive support; hive-site.xml is picked up from the classpath
    val spark = SparkSession.builder().master("local[2]").appName("HiveConnApp")
      .enableHiveSupport()
      .getOrCreate()
    // Simple sanity queries against the Hive metastore
    spark.sql("show databases").show(false)
    spark.sql("use ruozedata")
    spark.sql("show tables").show(false)
    spark.stop()
  }
}
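
Once the connection works (see section II below), Hive tables behave like ordinary DataFrames. A small follow-up sketch, assuming the ruozedata.city_info table that shows up in the listings below; it can be added to main just before spark.stop():

val cityInfo = spark.table("ruozedata.city_info")
cityInfo.printSchema()       // schema comes from the Hive metastore
println(cityInfo.count())    // triggers an actual read from HDFS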

II. Errors you may hit while running, and how to fix them
1. First run of the code above, trying to connect

// Error
Exception in thread "main" java.lang.IllegalArgumentException: Unable to instantiate SparkSession with Hive support because Hive classes are not found.

Looking at the source of enableHiveSupport():

  /**
   * Enables Hive support, including connectivity to a persistent Hive metastore, support for
   * Hive serdes, and Hive user-defined functions.
   *
   * @since 2.0.0
   */
  def enableHiveSupport(): Builder = synchronized {
    if (hiveClassesArePresent) {
      config(CATALOG_IMPLEMENTATION.key, "hive")
    } else {
      throw new IllegalArgumentException(
        "Unable to instantiate SparkSession with Hive support because " +
          "Hive classes are not found.")
    }
  }

So hiveClassesArePresent has to evaluate to true, otherwise this exact error is thrown. Digging one level deeper:

 /**
   * @return true if Hive classes can be loaded, otherwise false.
   */
  private[spark] def hiveClassesArePresent: Boolean = {
    try {
      Utils.classForName(HIVE_SESSION_STATE_BUILDER_CLASS_NAME)
      Utils.classForName("org.apache.hadoop.hive.conf.HiveConf")
      true
    } catch {
      case _: ClassNotFoundException | _: NoClassDefFoundError => false
    }
  }

hiveClassesArePresent returns true only if both classes can be loaded: HIVE_SESSION_STATE_BUILDER_CLASS_NAME (private val HIVE_SESSION_STATE_BUILDER_CLASS_NAME = "org.apache.spark.sql.hive.HiveSessionStateBuilder") and "org.apache.hadoop.hive.conf.HiveConf".
I had already added the spark-hive jar, so my guess was that the Hadoop jars were missing.
So I added the dependency:

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
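
Whether this guess is right can be verified without starting Spark at all, by running the same two Class.forName calls that hiveClassesArePresent performs. A throwaway sketch (mine, not part of the original code):

object HiveClassCheck {
  def main(args: Array[String]): Unit = {
    // Mirrors Spark's hiveClassesArePresent: both classes must load for enableHiveSupport() to work.
    Seq(
      "org.apache.spark.sql.hive.HiveSessionStateBuilder",
      "org.apache.hadoop.hive.conf.HiveConf"
    ).foreach { cn =>
      try { Class.forName(cn); println(s"OK      $cn") }
      catch { case _: Throwable => println(s"MISSING $cn") }
    }
  }
}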

2. Run again: the previous error is gone, but a new one appears

Exception in thread "main" java.lang.NoSuchFieldError: METASTORE_CLIENT_SOCKET_LIFETIME

Someone has asked about this on Stack Overflow:
https://stackoverflow.com/questions/44151374/getting-either-exception-java-lang-nosuchfielderror-metastore-client-socket-li

The suggested fix there is to upgrade Hive to version 1.2.1 or above (screenshot of the answer omitted).

The description on the official site and the dependency jars that IDEA actually downloaded point the same way (screenshots omitted).

So I changed the Hive version to 1.2.1:

<hive.version>1.2.1</hive.version>
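
For context, Spark 2.3.x bundles Hive 1.2.1 client classes (spark.sql.hive.metastore.version defaults to 1.2.1), so pinning hive.version at 1.2.1 keeps the Hive classes on the classpath consistent with what spark-hive expects. Once the session comes up this can be double-checked from the code itself; a sketch, added to main after getOrCreate():

spark.sql("SET spark.sql.hive.metastore.version").show(false)  // expected to print 1.2.1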

3. Run again: the previous error is gone, and yet another new one appears

HiveConf of name hive.metastore.local does not exist

So I removed the property that sets hive.metastore.local to false from hive-site.xml (this property was dropped in Hive 0.10; whether the metastore is local is now inferred from hive.metastore.uris):

<property>
  <name>hive.metastore.local</name>
  <value>false</value>
  <description>controls whether to connect to remote metastore server or open a new metastore server in Hive Client JVM</description>
</property>

4. That error is gone, but the next run reports another one

metastore: Failed to connect to the MetaStore Server...

So I also removed the property that points at the Hive metastore service from hive-site.xml (most likely the metastore service was simply not running on my server; an alternative for setups where it is running is sketched after the snippet):

<!-- Address of the Hive metastore service -->
<property>
      <name>hive.metastore.uris</name>
      <value>thrift://192.168.137.141:9083</value>
</property>
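
For completeness, the opposite route also works: if the metastore service is actually running on the server (started with hive --service metastore), hive.metastore.uris can stay, or the URI can be passed straight from the code instead of through hive-site.xml. A hedged sketch of that variant, replacing the builder call inside HiveConnApp's main:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("HiveConnApp")
  .config("hive.metastore.uris", "thrift://192.168.137.141:9083")  // talk to the running metastore service
  .enableHiveSupport()
  .getOrCreate()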

5. That error is gone, and the run now fails with

Exception in thread "main" org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: hdfs://192.168.137.141:9000/hive/tmp on HDFS should be writable. Current permissions are: rwxr-xr-x;

The complaint is that the /hive/tmp directory on HDFS is not writable, so I changed its permissions to 777:

[hadoop@hadoop001 ~]$ hadoop fs -chmod -R 777 /hive/tmp
[hadoop@hadoop001 ~]$ hadoop fs -ls /hive
Found 3 items
drwxr-xr-x   - hadoop supergroup          0 2018-11-02 20:08 /hive/log
drwxrwxrwx   - hadoop supergroup          0 2018-11-02 20:08 /hive/tmp
drwxr-xr-x   - hadoop supergroup          0 2018-11-02 20:08 /hive/warehouse

6. Run once more, and the connection finally succeeds

// Hive successfully queried from IDEA
// spark.sql("show databases").show(false)
+---------------+
|databaseName   |
+---------------+
|default        |
|hive           |
|hive2_ruozedata|
|hive3          |
|ruozedata      |
+---------------+
//    spark.sql("use ruozedata")
//    spark.sql("show tables").show(false)
+---------+-----------------------+-----------+
|database |tableName              |isTemporary|
+---------+-----------------------+-----------+
|ruozedata|a                      |false      |
|ruozedata|b                      |false      |
|ruozedata|city_info              |false      |
|ruozedata|dual                   |false      |
|ruozedata|emp_sqoop              |false      |
|ruozedata|order_4_partition      |false      |
|ruozedata|order_mulit_partition  |false      |
|ruozedata|order_partition        |false      |
|ruozedata|product_info           |false      |
|ruozedata|product_rank           |false      |
|ruozedata|ruoze_dept             |false      |
|ruozedata|ruozedata_dynamic_emp  |false      |
|ruozedata|ruozedata_emp          |false      |
|ruozedata|ruozedata_emp2         |false      |
|ruozedata|ruozedata_emp3_new     |false      |
|ruozedata|ruozedata_emp4         |false      |
|ruozedata|ruozedata_emp_partition|false      |
|ruozedata|ruozedata_person       |false      |
|ruozedata|ruozedata_static_emp   |false      |
|ruozedata|user_click             |false      |
+---------+-----------------------+-----------+
only showing top 20 rows

// Checking from the hive CLI for comparison
hive> show databases;
OK
default
hive
hive2_ruozedata
hive3
ruozedata
hive> use ruozedata;
OK
hive> show tables;
OK
a
b
city_info
dual
emp_sqoop
order_4_partition
order_mulit_partition
order_partition
product_info
product_rank
ruoze_dept
ruozedata_dynamic_emp
ruozedata_emp
ruozedata_emp2
ruozedata_emp3_new
ruozedata_emp4
ruozedata_emp_partition
ruozedata_person
ruozedata_static_emp
user_click
user_click_tmp

Next, create a new database:

spark.sql("create database test_1")

Running the code fails with:

ERROR log: Got exception: org.apache.hadoop.security.AccessControlException Permission denied: user=zh, access=WRITE, inode="/hive/warehouse":hadoop:supergroup:drwxr-xr-x

A write-permission problem again, so I simply set all three directories under /hive to 777 (a less heavy-handed alternative is sketched after the listing):

[hadoop@hadoop001 bin]$ hadoop fs -ls /hive
Found 3 items
drwxrwxrwx   - hadoop supergroup          0 2018-11-02 20:08 /hive/log
drwxrwxrwx   - hadoop supergroup          0 2018-11-02 23:59 /hive/tmp
drwxrwxrwx   - hadoop supergroup          0 2018-11-02 20:08 /hive/warehouse
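
chmod 777 gets things moving, but the underlying cause is that the job runs as the local Windows user (zh here) while the HDFS directories belong to hadoop. A less heavy-handed alternative (my own sketch, not what this post did) is to impersonate the hadoop user from the client side, set as the first line of main, before the SparkSession is created:

// Make HDFS operations from this JVM run as "hadoop" instead of the local Windows user.
System.setProperty("HADOOP_USER_NAME", "hadoop")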

Run spark.sql("create database test_1") again:

[hadoop@hadoop001 bin]$ hadoop fs -ls /hive/warehouse                    
Found 1 items
drwxrwxrwx   - zh supergroup          0 2018-11-03 10:48 /hive/warehouse/test_1.db

The database test_1.db has been created and sits under the /hive/warehouse/ directory on HDFS.
Hive now shows test_1 as well:

hive> show databases;
OK
default
hive
hive2_ruozedata
hive3
ruozedata
test_1

Check the metadata in MySQL:

mysql> select * from dbs;
+-------+------------------------------------+--------------------------------------------------------------------+-----------------+------------+------------+
| DB_ID | DESC                               | DB_LOCATION_URI                                                    | NAME            | OWNER_NAME | OWNER_TYPE |
+-------+------------------------------------+--------------------------------------------------------------------+-----------------+------------+------------+
|     1 | Default Hive database              | hdfs://192.168.137.141:9000/user/hive/warehouse                    | default         | public     | ROLE       |
|     6 | NULL                               | hdfs://192.168.137.141:9000/user/hive/warehouse/hive.db            | hive            | hadoop     | USER       |
|     9 | this is ruozedata 03 test database | hdfs://192.168.137.141:9000/user/hive/warehouse/hive2_ruozedata.db | hive2_ruozedata | hadoop     | USER       |
|    10 | NULL                               | hdfs://192.168.137.141:9000/zh                                     | hive3           | hadoop     | USER       |
|    11 | NULL                               | hdfs://192.168.137.141:9000/user/hive/warehouse/ruozedata.db       | ruozedata       | hadoop     | USER       |
|    16 |                                    | hdfs://192.168.137.141:9000/hive/warehouse/test_1.db               | test_1          | NULL       | USER       |
+-------+------------------------------------+--------------------------------------------------------------------+-----------------+------------+------------+
6 rows in set (0.00 sec)
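
Since mysql-connector-java is already in the pom, the same check can be done from Spark itself by reading the metastore's dbs table over JDBC, which makes it easy to compare the output of show databases with what the metastore really stores. A sketch, reusing the connection settings from hive-site.xml, run inside HiveConnApp's main:

val dbs = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://192.168.137.141:3306/ruozedata_basic03")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "dbs")          // the Hive metastore's database table
  .option("user", "root")
  .option("password", "123456")
  .load()
dbs.select("DB_ID", "NAME", "DB_LOCATION_URI").show(false)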

For now /hive/log is still empty, and /hive/tmp only contains an empty per-user scratch directory:

[hadoop@hadoop001 bin]$ hadoop fs -ls /hive/tmp
Found 1 items
drwx------   - zh supergroup          0 2018-11-03 10:48 /hive/tmp/zh
[hadoop@hadoop001 bin]$ hadoop fs -ls /hive/tmp/zh
[hadoop@hadoop001 bin]$ hadoop fs -ls /hive/log

*** The final hive-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

    <!-- Directory used to store execution plans for the map/reduce stages and their intermediate output -->
    <property>
        <name>hive.exec.scratchdir</name>
        <value>hdfs://192.168.137.141:9000/hive/tmp</value>
        <description>Scratch space for Hive jobs</description>
    </property>

    <!-- Directory on HDFS where Hive data is stored; this is the default /user/hive/warehouse, so it could also be left unset -->
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>hdfs://192.168.137.141:9000/user/hive/warehouse</value>
        <description>location of default database for the warehouse</description>
    </property>

    <!-- Directory for Hive's runtime query logs; if empty, no structured query log is created -->
    <property>
        <name>hive.querylog.location</name>
        <value>hdfs://192.168.137.141:9000/hive/log</value>
        <description>Location of Hive run time structured log file</description>
    </property>

    <property>
        <name>hive.metastore.local</name>
        <value>true</value>
        <description>controls whether to connect to remote metastore server or open a new metastore server in Hive Client JVM</description>
    </property>

    <!-- TCP port HiveServer2 listens on; defaults to 10000 -->
    <property>
        <name>hive.server2.thrift.port</name>
        <value>10000</value>
        <description>Port number of HiveServer2 Thrift interface. Can be overridden by setting $HIVE_SERVER2_THRIFT_PORT</description>
    </property>

    <!-- Metastore schema verification -->
    <property>
        <name>hive.metastore.schema.verification</name>
        <value>false</value>
        <description>
        Enforce metastore schema version consistency.
        True: Verify that version information stored in metastore matches with one from Hive jars.  Also disable automatic schema migration attempt. Users are required to manually migrate schema after Hive upgrade which ensures proper metastore schema migration. (Default)
        False: Warn if the version information stored in metastore doesn't match with one from in Hive jars.
        </description>
    </property>

    <!-- JDBC connection for the metastore database -->
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://192.168.137.141:3306/ruozedata_basic03?createDatabaseIfNotExist=true</value>
        <description>JDBC connect string for a JDBC metastore</description>
    </property>

    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
        <description>Driver class name for a JDBC metastore</description>
    </property>

    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
        <description>username to use against metastore database</description>
    </property>

    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>123456</value>
        <description>password to use against metastore database</description>
    </property>

</configuration>

    Original author: 白面葫芦娃92
    Original post: https://www.jianshu.com/p/27a798013990