When creating a Spark-based Java application, I build a SparkConf like this:
SparkConf sparkConf = new SparkConf()
    .setAppName("SparkTests")
    .setMaster("local[*]")
    .set("spark.executor.memory", "2g")
    .set("spark.driver.memory", "2g")
    .set("spark.driver.maxResultSize", "2g");
But the documentation here says:
Any values specified as flags or in the properties file will be passed on to the application and merged with those specified through SparkConf. Properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file. A few configuration keys have been renamed since earlier versions of Spark; in such cases, the older key names are still accepted, but take lower precedence than any instance of the newer key.
Spark properties mainly can be divided into two kinds: one is related to deploy, like “spark.driver.memory”, “spark.executor.instances”, this kind of properties may not be affected when setting programmatically through SparkConf in runtime, or the behavior is depending on which cluster manager and deploy mode you choose, so it would be suggested to set through configuration file or spark-submit command line options; another is mainly related to Spark runtime control, like “spark.task.maxFailures”, this kind of properties can be set in either way.
So is there a list of these deploy-related properties that can only be supplied as command-line arguments to spark-submit?
local[*] is shown here, but at runtime we deploy on a YARN cluster.
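For reference, the deploy-related properties the quote mentions could instead be passed on the spark-submit command line, roughly like this (a sketch only; the main class and jar path are placeholders, and the memory values simply mirror the ones above):

spark-submit \
  --class com.example.SparkTests \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 2g \
  --executor-memory 2g \
  --conf spark.driver.maxResultSize=2g \
  path/to/spark-tests.jar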
Best answer: I'm not sure either what the sentence
"this kind of properties may not be affected when setting programmatically through SparkConf in runtime, or the behavior is depending on which cluster manager and deploy mode you choose, so it would be suggested to set through configuration file or spark-submit command line options"
exactly means; maybe someone can clarify it for us. What I do know is that in the case of YARN the order of precedence is as follows:
> If you set a property in code with
SparkSession.builder()
    .config(sparkConf)
    .getOrCreate()
it overrides all other settings (command line, defaults.conf). The only exception is when you modify a setting after the session has already been initialized (i.e. after calling getOrCreate); in that case it is ignored, as you would imagine (see the sketch after this list).
> If you do not change a setting in code, it falls back to the command-line settings (Spark uses whatever was specified on the command line, otherwise it loads the values from defaults.conf)
> Finally, if neither of the above is given, it loads the settings from defaults.conf
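To illustrate the first point, here is a minimal, self-contained sketch (the class name and the spark.task.maxFailures value are only illustrative, and I'm assuming the Spark 2.x SparkSession API):

import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

public class ConfPrecedenceSketch {
    public static void main(String[] args) {
        // Set before the session exists: this value takes precedence over
        // spark-submit flags and spark-defaults.conf
        SparkConf sparkConf = new SparkConf()
                .setAppName("ConfPrecedenceSketch")
                .setMaster("local[*]")
                .set("spark.task.maxFailures", "8");

        SparkSession session = SparkSession.builder()
                .config(sparkConf)
                .getOrCreate();

        // Prints 8: the value set programmatically wins
        System.out.println(session.sparkContext().getConf().get("spark.task.maxFailures"));

        // Mutating the SparkConf after getOrCreate() has no effect on the
        // already-created session (it keeps its own copy of the conf)
        sparkConf.set("spark.task.maxFailures", "16");
        System.out.println(session.sparkContext().getConf().get("spark.task.maxFailures")); // still 8

        session.stop();
    }
}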
So my final advice would be to feel free to set properties such as "spark.driver.memory" and "spark.executor.instances" in code.
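And if you do prefer the configuration-file route the documentation suggests for the deploy-related ones, the corresponding spark-defaults.conf entries would look roughly like this (values are only examples):

spark.master                 yarn
spark.submit.deployMode      cluster
spark.driver.memory          2g
spark.executor.memory        2g
spark.executor.instances     4
spark.driver.maxResultSize   2g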