(14) Basic Usage of Spark on YARN and Common Errors

Submit a Spark job to YARN for execution.
Spark acts only as a client here.

./spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
 /home/hadoop/app/spark-2.3.1-bin-2.6.0-cdh5.7.0/examples/jars/spark-examples_2.11-2.3.1.jar \
3

--master yarn by itself is equivalent to yarn-client mode (i.e., --deploy-mode client), so in that case --deploy-mode client is optional.
For yarn-cluster mode, you must add --deploy-mode cluster.
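
For reference, the same example submitted in yarn-cluster mode would look like the following (a sketch; only the deploy mode changes, the jar path and argument stay the same as above):

./spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
/home/hadoop/app/spark-2.3.1-bin-2.6.0-cdh5.7.0/examples/jars/spark-examples_2.11-2.3.1.jar \
3

In cluster mode the driver runs inside the YARN ApplicationMaster, so the value of Pi is printed in the ApplicationMaster container's log rather than on the client console.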

Running the command above as-is produces the following error:

Exception in thread "main" java.lang.Exception: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
        at org.apache.spark.deploy.SparkSubmitArguments.validateSubmitArguments(SparkSubmitArguments.scala:288)
        at org.apache.spark.deploy.SparkSubmitArguments.validateArguments(SparkSubmitArguments.scala:248)
        at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:120)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:130)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

You need to set HADOOP_CONF_DIR or YARN_CONF_DIR in the environment:

[hadoop@hadoop001 ~]$ cd $SPARK_HOME/conf
[hadoop@hadoop001 conf]$ vi spark-env.sh
export HADOOP_CONF_DIR=/home/hadoop/app/hadoop-2.6.0/etc/hadoop
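
Alternatively, HADOOP_CONF_DIR can be exported in the submitting user's shell profile instead of spark-env.sh (a sketch, assuming the same Hadoop path as above); spark-submit only needs the variable to be visible in its environment when it validates the arguments:

[hadoop@hadoop001 ~]$ vi ~/.bash_profile
export HADOOP_CONF_DIR=/home/hadoop/app/hadoop-2.6.0/etc/hadoop
[hadoop@hadoop001 ~]$ source ~/.bash_profile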

Checking the logs, one step takes a fairly long time:
Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME

18/09/19 17:30:32 INFO yarn.Client: Preparing resources for our AM container
18/09/19 17:30:35 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
18/09/19 17:30:44 INFO yarn.Client: Uploading resource file:/tmp/spark-8152492d-487e-4d35-962a-42344edea033/__spark_libs__2104928720237052389.zip -> hdfs://192.168.137.251:9000/user/hadoop/.sparkStaging/application_1537349385350_0001/__spark_libs__2104928720237052389.zip
18/09/19 17:30:54 INFO yarn.Client: Uploading resource file:/tmp/spark-8152492d-487e-4d35-962a-42344edea033/__spark_conf__1822648312505136721.zip -> hdfs://192.168.137.251:9000/user/hadoop/.sparkStaging/application_1537349385350_0001/__spark_conf__.zip

The official documentation also explains this:
To make Spark runtime jars accessible from YARN side, you can specify spark.yarn.archive or spark.yarn.jars. For details please refer to Spark Properties. If neither spark.yarn.archive nor spark.yarn.jars is specified, Spark will create a zip file with all jars under $SPARK_HOME/jars and upload it to the distributed cache.
You can configure it as follows:

[hadoop@hadoop000 ~]$ hadoop fs -mkdir -p /system/spark-lib
[hadoop@hadoop000 ~]$ hadoop fs -put /home/hadoop/app/spark-2.3.1-bin-2.6.0-cdh5.7.0/jars/* /system/spark-lib
[hadoop@hadoop000 ~]$ hadoop fs -chmod -R 755 /system/spark-lib
[hadoop@hadoop000 ~]$ cd $SPARK_HOME/conf
[hadoop@hadoop000 conf]$ cp spark-defaults.conf.template spark-defaults.conf
[hadoop@hadoop000 conf]$ vi spark-defaults.conf
spark.yarn.jars    hdfs://192.168.137.251:9000/system/spark-lib/*
(Without the trailing *, you will get: Error: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher)
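
As the documentation quoted above suggests, spark.yarn.archive is an alternative to spark.yarn.jars: package all the jars into a single archive (with the jars at its root) and point the property at it. A rough sketch under the same paths (the archive name spark-libs.zip is illustrative, not from the original post):

[hadoop@hadoop000 ~]$ cd /home/hadoop/app/spark-2.3.1-bin-2.6.0-cdh5.7.0/jars
[hadoop@hadoop000 jars]$ zip -q spark-libs.zip *.jar
[hadoop@hadoop000 jars]$ hadoop fs -put spark-libs.zip /system/spark-lib/
[hadoop@hadoop000 jars]$ vi $SPARK_HOME/conf/spark-defaults.conf
spark.yarn.archive    hdfs://192.168.137.251:9000/system/spark-lib/spark-libs.zip

With a single archive, YARN localizes one file per node instead of several hundred individual jars; set either spark.yarn.jars or spark.yarn.archive, not both.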

After this change, the earlier "Uploading resource ..." lines in the run log become:

18/09/19 19:13:20 INFO yarn.Client: Source and destination file systems are the same. Not copying hdfs://192.168.137.251:9000/system/spark-lib/api-asn1-api-1.0.0-M20.jar
18/09/19 19:13:20 INFO yarn.Client: Source and destination file systems are the same. Not copying hdfs://192.168.137.251:9000/system/spark-lib/api-util-1.0.0-M20.jar
18/09/19 19:13:20 INFO yarn.Client: Source and destination file systems are the same. Not copying hdfs://192.168.137.251:9000/system/spark-lib/arpack_combined_all-0.1.jar
18/09/19 19:13:20 INFO yarn.Client: Source and destination file systems are the same. Not copying hdfs://192.168.137.251:9000/system/spark-lib/arrow-format-0.8.0.jar
18/09/19 19:13:20 INFO yarn.Client: Source and destination file systems are the same. Not copying hdfs://192.168.137.251:9000/system/spark-lib/arrow-memory-0.8.0.jar
.............
.............

spark.yarn.jars points at a shared lib directory of jars on HDFS. With this setting, submitting a job no longer uploads the jars from the local machine; instead they are copied from one HDFS directory to another, which on the whole saves some time. (Some articles online claim this configuration removes the jar-upload step entirely; that is not accurate: it only turns the local upload into a copy within HDFS.)
Distributing these jars is the root cause of the tens of seconds spent on resource preparation for every submission. "Accessible from the YARN side" means every node, and every container, in the YARN cluster must be able to reach these jars. For offline batch jobs this overhead is acceptable, but if latency matters, spending tens of seconds launching every Spark job is not. Spark can be combined with microservices: use Spring Boot or similar to wrap Spark as a long-running service that runs 7x24, so that resources do not have to be re-requested for every job submission.

Other commonly used spark-submit options

--executor-cores NUM        Number of cores per executor. (Default: 1 in YARN mode,
                            or all available cores on the worker in standalone mode)
--queue QUEUE_NAME          The YARN queue to submit to (Default: "default").
--num-executors NUM         Number of executors to launch (Default: 2).
                            If dynamic allocation is enabled, the initial number of
                            executors will be at least NUM.
--executor-memory MEM       Memory per executor (e.g. 1000M, 2G) (Default: 1G).
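
Putting several of these options together, a submission with explicit resources might look like the following (a sketch; the queue name and resource sizes are illustrative and simply match the defaults listed above):

./spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--num-executors 2 \
--executor-memory 1G \
--executor-cores 1 \
--queue default \
/home/hadoop/app/spark-2.3.1-bin-2.6.0-cdh5.7.0/examples/jars/spark-examples_2.11-2.3.1.jar \
3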
Original author: 白面葫芦娃92
Original source: https://www.jianshu.com/p/eef73f3f4819