(14) Basic Usage of Spark on YARN and Common Errors

Submit a Spark job to YARN for execution.
Spark acts only as a client here.

./spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
 /home/hadoop/app/spark-2.3.1-bin-2.6.0-cdh5.7.0/examples/jars/spark-examples_2.11-2.3.1.jar \
3

--master yarn by itself is equivalent to yarn-client mode (i.e., --deploy-mode client), so in that case --deploy-mode client is optional.
For yarn-cluster mode, you must add --deploy-mode cluster.
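
For reference, the same example submitted in yarn-cluster mode would look like the following (a sketch; only the deploy mode changes, the jar path and argument stay the same as above):

./spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
/home/hadoop/app/spark-2.3.1-bin-2.6.0-cdh5.7.0/examples/jars/spark-examples_2.11-2.3.1.jar \
3

In cluster mode the driver runs inside the YARN ApplicationMaster, so the value of Pi is printed in the ApplicationMaster container's log rather than on the client console.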

Running the command above as-is produces the following error:

Exception in thread "main" java.lang.Exception: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
        at org.apache.spark.deploy.SparkSubmitArguments.validateSubmitArguments(SparkSubmitArguments.scala:288)
        at org.apache.spark.deploy.SparkSubmitArguments.validateArguments(SparkSubmitArguments.scala:248)
        at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:120)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:130)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

You need to set HADOOP_CONF_DIR or YARN_CONF_DIR in the environment:

[hadoop@hadoop001 ~]$ cd $SPARK_HOME/conf
[hadoop@hadoop001 conf]$ vi spark-env.sh
export HADOOP_CONF_DIR=/home/hadoop/app/hadoop-2.6.0/etc/hadoop
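
Alternatively, HADOOP_CONF_DIR can be exported in the submitting user's shell profile instead of spark-env.sh (a sketch, assuming the same Hadoop path as above); spark-submit only needs the variable to be visible in its environment when it validates the arguments:

[hadoop@hadoop001 ~]$ vi ~/.bash_profile
export HADOOP_CONF_DIR=/home/hadoop/app/hadoop-2.6.0/etc/hadoop
[hadoop@hadoop001 ~]$ source ~/.bash_profile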

Checking the logs, one step takes a fairly long time:
Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME

18/09/19 17:30:32 INFO yarn.Client: Preparing resources for our AM container
18/09/19 17:30:35 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
18/09/19 17:30:44 INFO yarn.Client: Uploading resource file:/tmp/spark-8152492d-487e-4d35-962a-42344edea033/__spark_libs__2104928720237052389.zip -> hdfs://192.168.137.251:9000/user/hadoop/.sparkStaging/application_1537349385350_0001/__spark_libs__2104928720237052389.zip
18/09/19 17:30:54 INFO yarn.Client: Uploading resource file:/tmp/spark-8152492d-487e-4d35-962a-42344edea033/__spark_conf__1822648312505136721.zip -> hdfs://192.168.137.251:9000/user/hadoop/.sparkStaging/application_1537349385350_0001/__spark_conf__.zip

The official documentation also explains this:
To make Spark runtime jars accessible from YARN side, you can specify spark.yarn.archive or spark.yarn.jars. For details please refer to Spark Properties. If neither spark.yarn.archive nor spark.yarn.jars is specified, Spark will create a zip file with all jars under $SPARK_HOME/jars and upload it to the distributed cache.
You can configure it as follows:

[hadoop@hadoop000 ~]$ hadoop fs -mkdir -p /system/spark-lib
[hadoop@hadoop000 ~]$ hadoop fs -put /home/hadoop/app/spark-2.3.1-bin-2.6.0-cdh5.7.0/jars/* /system/spark-lib
[hadoop@hadoop000 ~]$ hadoop fs -chmod -R 755 /system/spark-lib
[hadoop@hadoop000 ~]$ cd $SPARK_HOME/conf
[hadoop@hadoop000 conf]$ cp spark-defaults.conf.template spark-defaults.conf
[hadoop@hadoop000 conf]$ vi spark-defaults.conf
spark.yarn.jars    hdfs://192.168.137.251:9000/system/spark-lib/*
(Without the trailing *, you will get: Error: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher)
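
As the documentation quoted above suggests, spark.yarn.archive is an alternative to spark.yarn.jars: package all the jars into a single archive (with the jars at its root) and point the property at it. A rough sketch under the same paths (the archive name spark-libs.zip is illustrative, not from the original post):

[hadoop@hadoop000 ~]$ cd /home/hadoop/app/spark-2.3.1-bin-2.6.0-cdh5.7.0/jars
[hadoop@hadoop000 jars]$ zip -q spark-libs.zip *.jar
[hadoop@hadoop000 jars]$ hadoop fs -put spark-libs.zip /system/spark-lib/
[hadoop@hadoop000 jars]$ vi $SPARK_HOME/conf/spark-defaults.conf
spark.yarn.archive    hdfs://192.168.137.251:9000/system/spark-lib/spark-libs.zip

With a single archive, YARN localizes one file per node instead of several hundred individual jars; set either spark.yarn.jars or spark.yarn.archive, not both.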

After this change, the earlier "Uploading resource ..." lines in the run log become:

18/09/19 19:13:20 INFO yarn.Client: Source and destination file systems are the same. Not copying hdfs://192.168.137.251:9000/system/spark-lib/api-asn1-api-1.0.0-M20.jar
18/09/19 19:13:20 INFO yarn.Client: Source and destination file systems are the same. Not copying hdfs://192.168.137.251:9000/system/spark-lib/api-util-1.0.0-M20.jar
18/09/19 19:13:20 INFO yarn.Client: Source and destination file systems are the same. Not copying hdfs://192.168.137.251:9000/system/spark-lib/arpack_combined_all-0.1.jar
18/09/19 19:13:20 INFO yarn.Client: Source and destination file systems are the same. Not copying hdfs://192.168.137.251:9000/system/spark-lib/arrow-format-0.8.0.jar
18/09/19 19:13:20 INFO yarn.Client: Source and destination file systems are the same. Not copying hdfs://192.168.137.251:9000/system/spark-lib/arrow-memory-0.8.0.jar
.............
.............

spark.yarn.jars points at a shared lib directory of jars on HDFS. With this setting, submitting a job no longer uploads the jars from the local machine; instead they are copied from one HDFS directory to another, which on the whole saves some time. (Some articles online claim this configuration removes the jar-upload step entirely; that is not accurate: it only turns the local upload into a copy within HDFS.)
Distributing these jars is the root cause of the tens of seconds spent on resource preparation for every submission. "Accessible from the YARN side" means every node, and every container, in the YARN cluster must be able to reach these jars. For offline batch jobs this overhead is acceptable, but if latency matters, spending tens of seconds launching every Spark job is not. Spark can be combined with microservices: use Spring Boot or similar to wrap Spark as a long-running service that runs 7x24, so that resources do not have to be re-requested for every job submission.

Other commonly used spark-submit options

--executor-cores NUM        Number of cores per executor. (Default: 1 in YARN mode,
                            or all available cores on the worker in standalone mode)
--queue QUEUE_NAME          The YARN queue to submit to (Default: "default").
--num-executors NUM         Number of executors to launch (Default: 2).
                            If dynamic allocation is enabled, the initial number of
                            executors will be at least NUM.
--executor-memory MEM       Memory per executor (e.g. 1000M, 2G) (Default: 1G).
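
Putting several of these options together, a submission with explicit resources might look like the following (a sketch; the queue name and resource sizes are illustrative and simply match the defaults listed above):

./spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--num-executors 2 \
--executor-memory 1G \
--executor-cores 1 \
--queue default \
/home/hadoop/app/spark-2.3.1-bin-2.6.0-cdh5.7.0/examples/jars/spark-examples_2.11-2.3.1.jar \
3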
Original author: 白面葫芦娃92
Original source: https://www.jianshu.com/p/eef73f3f4819