I have a simple Spark job that reads values from a pipe-delimited file, runs some business logic on them, and writes the processed values to our DB.
To load the file I am using org.apache.spark.sql.SQLContext. Here is the code I have for loading the file as a DataFrame:
DataFrame df = sqlContext.read()
        .format("com.databricks.spark.csv")
        .option("header", "false")
        .option("comment", null)
        .option("delimiter", "|")
        .option("quote", null)
        .load(pathToTheFile);
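For reference, here is a minimal self-contained version of that load. This is only a sketch: the class name and input path are hypothetical, and the two null-valued options are left out to keep it minimal.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class LoadPipeDelimitedFile {
    public static void main(String[] args) {
        // Standard Spark 1.4 setup; the master URL is supplied by spark-submit
        SparkConf conf = new SparkConf().setAppName("pipe-delimited-loader");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // Same load as in the question, without the null-valued options
        DataFrame df = sqlContext.read()
                .format("com.databricks.spark.csv")
                .option("header", "false")
                .option("delimiter", "|")
                .load("/data/input.psv"); // hypothetical path

        // Force an action so a load failure surfaces here, not later in the job
        df.show();

        sc.stop();
    }
}

Calling df.show() (or any other action) right after the load is a cheap way to pin a failure to the load step rather than to downstream business logic.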
Now the problem is:
1. The load function fails to load the file.
2. Apart from the following lines in my console, it gives no further detail (no exception) about the problem:
WARN 2017-11-07 17:26:40,108 akka.remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@172.17.0.2:35359] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
ERROR 2017-11-07 17:26:40,134 org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend: Asked to remove non-existent executor 0
and it keeps on polling.
I am sure the file is available, in the correct format, in the expected folder, but I have no idea what these log lines mean or why SQLContext is not able to load the file.
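To get more detail than those two lines, one option (assuming the standard Spark 1.x logging setup) is to lower the root log level in conf/log4j.properties and rerun the job:

# conf/log4j.properties (start from the shipped log4j.properties.template)
log4j.rootCategory=DEBUG, console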
Here is the dependencies section of my build.gradle:
dependencies {
    provided(
            [group: 'org.apache.spark', name: 'spark-core_2.10', version: '1.4.0'],
            [group: 'org.apache.spark', name: 'spark-sql_2.10', version: '1.4.0'],
            [group: 'com.datastax.spark', name: 'spark-cassandra-connector-java_2.10', version: '1.4.0']
    )
    compile([
            [group: 'com.databricks', name: 'spark-csv_2.10', version: '1.4.0'],
    ])
}
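One thing worth checking with this layout: the Spark artifacts are provided, but spark-csv is an ordinary compile dependency, so its classes still have to reach the driver and executors at runtime. A common way to ensure that (a sketch; the main class and jar path below are hypothetical) is to let spark-submit resolve the package:

# --packages downloads spark-csv plus its transitive dependencies
# and puts them on both the driver and executor classpaths
spark-submit \
  --class com.example.MyJob \
  --packages com.databricks:spark-csv_2.10:1.4.0 \
  build/libs/my-job.jar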
I am running this job inside a Docker container.
Any help would be appreciated.
Best answer: You can check whether your problem is the same as the one discussed in this thread:

In short, akka opens up dynamic ports. So, simple NAT fails.
You might try some trickery with a DNS server and docker's --net=host.

Based on Jacob's suggestion, I started using --net=host, which is a new option in the latest version of docker. I also set SPARK_LOCAL_IP to the host's IP address, and then Akka does not use the hostname and I don't need the Spark driver's hostname to be resolvable.
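Concretely, that suggestion amounts to something like the following (a sketch: the image name and the IP address are placeholders for your own values):

# share the host's network stack so akka's dynamic ports stay reachable,
# and pin Spark to the host's IP instead of the container hostname
docker run --net=host \
    -e SPARK_LOCAL_IP=192.168.1.10 \
    my-spark-job-image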
You can also compare your Dockerfile against the one used in P7h/docker-spark 2.2.0 and look for any differences that might explain the problem.