Complete record of running the official Spark Hive example

Sample Java class:
org.apache.spark.examples.sql.hive.JavaSparkHiveExample

A few modifications to the example:

   SparkSession spark = SparkSession
      .builder()
      .appName("Java Spark Hive jar Example")
      //.master("spark://hadoopnode3:7077")
      .config("spark.executor.memory", "512m")
      //.config("spark.sql.warehouse.dir", warehouseLocation)
      .enableHiveSupport()
      .getOrCreate();

    spark.sql("use myhive");
    spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)");
    //spark.sql("LOAD DATA LOCAL INPATH 'resources/kv1.txt' INTO TABLE src");
    spark.sql("LOAD DATA INPATH '/resources/kv1.txt' INTO TABLE src");

kv1.txt is a simple two-column, space-separated data file. After the LOAD DATA statement runs, the source file disappears from /resources, because LOAD DATA INPATH moves the file into the table's warehouse directory rather than copying it.
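The rest of the class was left as in the official JavaSparkHiveExample. For reference, the queries below are a rough sketch of the parts that produce the three result tables in the spark-submit log further down (the show at JavaSparkHiveExample.java:87/:97/:111 entries); they paraphrase the official example rather than the exact jar that was submitted:

    // extra imports used by this part of the example:
    // import org.apache.spark.api.java.function.MapFunction;
    // import org.apache.spark.sql.Dataset;
    // import org.apache.spark.sql.Encoders;
    // import org.apache.spark.sql.Row;

    // plain HiveQL query -> the key/value table printed below
    spark.sql("SELECT * FROM src").show();

    // aggregation -> the count(1) table printed below
    spark.sql("SELECT COUNT(*) FROM src").show();

    // SQL results are DataFrames; map each Row to a String
    // -> the "Key: 0, Value: val_0" table printed below
    Dataset<Row> sqlDF = spark.sql("SELECT key, value FROM src WHERE key < 10 ORDER BY key");
    Dataset<String> stringsDS = sqlDF.map(
        (MapFunction<Row, String>) row -> "Key: " + row.get(0) + ", Value: " + row.get(1),
        Encoders.STRING());
    stringsDS.show();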

1. Set up the Java project in Eclipse

(screenshot: the Eclipse project setup)

2. Export a jar file from Eclipse

a. Right-click the Java file and export it to a jar file: hiveTesting.jar

b. Upload the txt/json files the example needs to HDFS:

hadoop fs -put ./resources /resources
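If you prefer to do this upload from Java instead of the hadoop CLI, a minimal sketch with the Hadoop FileSystem API could look roughly like the following. The class name UploadResources is just a hypothetical helper, and it assumes the cluster configuration (core-site.xml/hdfs-site.xml) is on the classpath so fs.defaultFS resolves to HDFS:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // hypothetical helper, equivalent to: hadoop fs -put ./resources /resources
    public class UploadResources {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads the Hadoop config from the classpath
        try (FileSystem fs = FileSystem.get(conf)) {
          fs.copyFromLocalFile(new Path("./resources"), new Path("/resources"));
        }
      }
    }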

3. Upload the jar to a test node on the Hadoop cluster:

/tmp/hiveTesting.jar

4. spark-submit

spark-submit --class org.apache.spark.examples.sql.hive.JavaSparkHiveExample --master spark://hadoopnode3:7077 /tmp/hiveTesting.jar

Because one of the Hadoop nodes was down, the --master was pointed directly at the standalone Spark master rather than submitting to the YARN cluster.

2019-01-04 12:21:58 INFO  metastore:376 - Trying to connect to metastore with URI thrift://hadoopnode3:9083
2019-01-04 12:21:58 INFO  metastore:472 - Connected to metastore.
2019-01-04 12:21:58 INFO  CoarseGrainedSchedulerBackend$DriverEndpoint:54 - Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.2.28.8:53935) with ID 0
2019-01-04 12:21:58 INFO  BlockManagerMasterEndpoint:54 - Registering block manager 10.2.28.8:54120 with 93.3 MB RAM, BlockManagerId(0, 10.2.28.8, 54120, None)
2019-01-04 12:22:00 INFO  CoarseGrainedSchedulerBackend$DriverEndpoint:54 - Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.2.135.104:46928) with ID 1
2019-01-04 12:22:00 INFO  BlockManagerMasterEndpoint:54 - Registering block manager 10.2.135.104:35993 with 93.3 MB RAM, BlockManagerId(1, 10.2.135.104, 35993, None)
2019-01-04 12:22:19 INFO  SessionState:641 - Created local directory: /opt/hive/iotmp/dd87c362-9b6e-4f83-b21a-14196a6cd64d_resources
2019-01-04 12:22:19 INFO  SessionState:641 - Created HDFS directory: /user/hive/tmp/hadoop/dd87c362-9b6e-4f83-b21a-14196a6cd64d
2019-01-04 12:22:19 INFO  SessionState:641 - Created local directory: /opt/hive/iotmp/root/dd87c362-9b6e-4f83-b21a-14196a6cd64d
2019-01-04 12:22:19 INFO  SessionState:641 - Created HDFS directory: /user/hive/tmp/hadoop/dd87c362-9b6e-4f83-b21a-14196a6cd64d/_tmp_space.db
2019-01-04 12:22:19 INFO  HiveClientImpl:54 - Warehouse location for Hive client (version 1.2.2) is /user/hive/warehouse
2019-01-04 12:22:21 INFO  SQLStdHiveAccessController:95 - Created SQLStdHiveAccessController for session context : HiveAuthzSessionContext [sessionString=dd87c362-9b6e-4f83-b21a-14196a6cd64d, clientType=HIVECLI]
2019-01-04 12:22:21 INFO  metastore:291 - Mestastore configuration hive.metastore.filter.hook changed from org.apache.hadoop.hive.metastore.DefaultMetaStoreFilterHookImpl to org.apache.hadoop.hive.ql.security.authorization.plugin.AuthorizationMetaStoreFilterHook
2019-01-04 12:22:21 INFO  metastore:376 - Trying to connect to metastore with URI thrift://hadoopnode3:9083
2019-01-04 12:22:21 INFO  metastore:472 - Connected to metastore.
2019-01-04 12:22:21 INFO  metastore:376 - Trying to connect to metastore with URI thrift://hadoopnode3:9083
2019-01-04 12:22:21 INFO  metastore:472 - Connected to metastore.
2019-01-04 12:22:22 ERROR KeyProviderCache:87 - Could not find uri with key [dfs.encryption.key.provider.uri] to create a keyProvider !!
2019-01-04 12:22:22 INFO  Hive:2641 - Renaming src: hdfs://ns1/resources/kv1.txt, dest: hdfs://ns1/user/hive/warehouse/myhive.db/src/kv1_copy_1.txt, Status:true
2019-01-04 12:23:04 INFO  CodeGenerator:54 - Code generated in 263.536255 ms
2019-01-04 12:23:04 INFO  CodeGenerator:54 - Code generated in 34.751521 ms
2019-01-04 12:23:05 INFO  MemoryStore:54 - Block broadcast_0 stored as values in memory (estimated size 618.1 KB, free 911.7 MB)
2019-01-04 12:23:06 INFO  MemoryStore:54 - Block broadcast_0_piece0 stored as bytes in memory (estimated size 54.9 KB, free 911.6 MB)
2019-01-04 12:23:06 INFO  BlockManagerInfo:54 - Added broadcast_0_piece0 in memory on hadoopnode3:41685 (size: 54.9 KB, free: 912.2 MB)
2019-01-04 12:23:06 INFO  SparkContext:54 - Created broadcast 0 from 
2019-01-04 12:23:07 INFO  FileInputFormat:249 - Total input paths to process : 2
2019-01-04 12:23:07 INFO  SparkContext:54 - Starting job: show at JavaSparkHiveExample.java:87
2019-01-04 12:23:07 INFO  DAGScheduler:54 - Got job 0 (show at JavaSparkHiveExample.java:87) with 1 output partitions
2019-01-04 12:23:07 INFO  DAGScheduler:54 - Final stage: ResultStage 0 (show at JavaSparkHiveExample.java:87)
2019-01-04 12:23:07 INFO  DAGScheduler:54 - Parents of final stage: List()
2019-01-04 12:23:07 INFO  DAGScheduler:54 - Missing parents: List()
2019-01-04 12:23:07 INFO  DAGScheduler:54 - Submitting ResultStage 0 (MapPartitionsRDD[6] at show at JavaSparkHiveExample.java:87), which has no missing parents
2019-01-04 12:23:07 INFO  MemoryStore:54 - Block broadcast_1 stored as values in memory (estimated size 11.8 KB, free 911.6 MB)
2019-01-04 12:23:07 INFO  MemoryStore:54 - Block broadcast_1_piece0 stored as bytes in memory (estimated size 5.8 KB, free 911.6 MB)
2019-01-04 12:23:07 INFO  BlockManagerInfo:54 - Added broadcast_1_piece0 in memory on hadoopnode3:41685 (size: 5.8 KB, free: 912.2 MB)
2019-01-04 12:23:07 INFO  SparkContext:54 - Created broadcast 1 from broadcast at DAGScheduler.scala:1039
2019-01-04 12:23:07 INFO  DAGScheduler:54 - Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[6] at show at JavaSparkHiveExample.java:87) (first 15 tasks are for partitions Vector(0))
2019-01-04 12:23:07 INFO  TaskSchedulerImpl:54 - Adding task set 0.0 with 1 tasks
2019-01-04 12:23:07 INFO  TaskSetManager:54 - Starting task 0.0 in stage 0.0 (TID 0, 10.2.28.8, executor 0, partition 0, ANY, 7903 bytes)
2019-01-04 12:23:08 INFO  BlockManagerInfo:54 - Added broadcast_1_piece0 in memory on 10.2.28.8:54120 (size: 5.8 KB, free: 93.3 MB)
2019-01-04 12:23:08 INFO  BlockManagerInfo:54 - Added broadcast_0_piece0 in memory on 10.2.28.8:54120 (size: 54.9 KB, free: 93.2 MB)
2019-01-04 12:23:30 INFO  TaskSetManager:54 - Finished task 0.0 in stage 0.0 (TID 0) in 22968 ms on 10.2.28.8 (executor 0) (1/1)
2019-01-04 12:23:30 INFO  TaskSchedulerImpl:54 - Removed TaskSet 0.0, whose tasks have all completed, from pool 
2019-01-04 12:23:30 INFO  DAGScheduler:54 - ResultStage 0 (show at JavaSparkHiveExample.java:87) finished in 23.171 s
2019-01-04 12:23:30 INFO  DAGScheduler:54 - Job 0 finished: show at JavaSparkHiveExample.java:87, took 23.556848 s
+---+-------+
|key|  value|
+---+-------+
|238|val_238|
| 86| val_86|
|311|val_311|
| 27| val_27|
|165|val_165|
|409|val_409|
|255|val_255|
|278|val_278|
| 98| val_98|
|484|val_484|
|265|val_265|
|193|val_193|
|401|val_401|
|150|val_150|
|273|val_273|
|224|val_224|
|369|val_369|
| 66| val_66|
|128|val_128|
|213|val_213|
+---+-------+
only showing top 20 rows

2019-01-04 12:23:59 INFO  DAGScheduler:54 - Job 1 finished: show at JavaSparkHiveExample.java:97, took 27.595771 s
+--------+
|count(1)|
+--------+
|     502|
+--------+


2019-01-04 12:24:03 INFO  DAGScheduler:54 - Job 5 finished: show at JavaSparkHiveExample.java:111, took 0.074884 s
+--------------------+
|               value|
+--------------------+
|Key: 0, Value: val_0|
|Key: 0, Value: val_0|
|Key: 0, Value: val_0|
|Key: 2, Value: val_2|
|Key: 4, Value: val_4|
|Key: 5, Value: val_5|
|Key: 5, Value: val_5|
|Key: 5, Value: val_5|
|Key: 8, Value: val_8|
|Key: 9, Value: val_9|
+--------------------+

2019-01-04 12:24:04 INFO  AbstractConnector:318 - Stopped Spark@6230c6fb{HTTP/1.1,[http/1.1]}{10.2.28.8:4041}
2019-01-04 12:24:04 INFO  SparkUI:54 - Stopped Spark web UI at http://hadoopnode3:4041
2019-01-04 12:24:04 INFO  StandaloneSchedulerBackend:54 - Shutting down all executors
2019-01-04 12:24:04 INFO  CoarseGrainedSchedulerBackend$DriverEndpoint:54 - Asking each executor to shut down
2019-01-04 12:24:04 INFO  MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
2019-01-04 12:24:04 INFO  MemoryStore:54 - MemoryStore cleared
2019-01-04 12:24:04 INFO  BlockManager:54 - BlockManager stopped
2019-01-04 12:24:04 INFO  BlockManagerMaster:54 - BlockManagerMaster stopped
2019-01-04 12:24:04 INFO  OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
2019-01-04 12:24:04 INFO  SparkContext:54 - Successfully stopped SparkContext
2019-01-04 12:24:04 INFO  ShutdownHookManager:54 - Shutdown hook called
2019-01-04 12:24:04 INFO  ShutdownHookManager:54 - Deleting directory /tmp/spark-80628156-cd36-433a-813d-71dccf299d09
2019-01-04 12:24:04 INFO  ShutdownHookManager:54 - Deleting directory /tmp/spark-e17b4714-922b-4e8d-bcfd-00d2ca298b1a

The example was run twice; the table's directory in HDFS then looks like this (on the second load, Hive renamed the incoming file to kv1_copy_1.txt because a kv1.txt already existed in the table directory, matching the "Renaming src: ..." log line above):

[hadoop@hadoopnode3 ~]$ hadoop fs -ls /user/hive/warehouse/myhive.db/src

Found 2 items
-rwxr-xr-x   2 hadoop supergroup       5812 2019-01-04 11:43 /user/hive/warehouse/myhive.db/src/kv1.txt
-rwxr-xr-x   2 hadoop supergroup         16 2019-01-04 12:14 /user/hive/warehouse/myhive.db/src/kv1_copy_1.txt


[hadoop@hadoopnode3 ~]$ hadoop fs -cat /user/hive/warehouse/myhive.db/src/kv1_copy_1.txt
888 999
666 777

Original author: DONG999
Original source: https://www.jianshu.com/p/afadd84cbc88