Why a Spark algorithm that works in local tests fails on the cluster

Recently I was using an algorithm to extract user-related features, calling Spark's built-in algorithm API directly from Java for the data processing and feature extraction. It worked fine in local tests, but on the cluster the job always failed near the end with various errors. The error output on the gateway (submission) machine typically looks like this:

         client token: N/A
         diagnostics: N/A
         ApplicationMaster host: 10.39.14.179
         ApplicationMaster RPC port: 0
         queue: root.weibo_bigdata_push.weibo_bigdata_push
         start time: 1503035528594
         final status: FAILED
         tracking URL: http://yz48223.hadoop.data.sina.com.cn:9008/proxy/application_1501011963582_297128/history/application_1501011963582_297128/2
         user: weibo_bigdata_push
17/08/18 13:56:17 INFO Client: Deleting staging directory hdfs://ns1-backup/user/xxx/.sparkStaging/application_1501011963582_297128
Exception in thread "main" org.apache.spark.SparkException: Application application_1501011963582_297128 finished with failed status
        at org.apache.spark.deploy.yarn.Client.run(Client.scala:1132)
        at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1175)
        at org.apache.spark.deploy.yarn.Client.main(Client.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

The Spark logs show that many nodes failed.

The main UI reports this error:

User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: ShuffleMapStage 9 (map at ALS.scala:854) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:695)
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:691)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:691)
at org.apache.spark.MapOutputTracker.getMapSizesByExecutorId(MapOutputTracker.scala:145)
at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:49)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:109)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:332)
at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:330)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:919)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:910)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:910)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:668)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)

But the detailed logs boil down to two main errors:

The first is org.apache.hadoop.util.Shell$ExitCodeException, reported as "Exception from container-launch: org.apache.hadoop.util.Shell$ExitCodeException": the script that launches the container exited with a non-zero status. (In the log below the exit status is 127, which in a shell conventionally means "command not found".)

The more serious one is org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0, which keeps complaining that the location of the shuffle output can no longer be found.

For example:

17/08/18 13:52:43 WARN TaskSetManager: Lost task 38.0 in stage 9.0 (TID 365, 10.39.15.31): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1501011963582_297128_01_000009 on host: 10.39.15.31. Exit status: 127. Diagnostics: Exception from container-launch: org.apache.hadoop.util.Shell$ExitCodeException: 
org.apache.hadoop.util.Shell$ExitCodeException: 
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:501)
	at org.apache.hadoop.util.Shell.run(Shell.java:418)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:655)
	at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:208)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)

And the missing shuffle-location error:

17/08/18 13:52:44 INFO TaskSetManager: Task 48.1 in stage 9.0 (TID 427) failed, but another instance of the task has already succeeded, so not re-queuing the task to be re-executed.
17/08/18 13:52:44 WARN TaskSetManager: Lost task 99.0 in stage 9.0 (TID 469, 10.39.14.232): FetchFailed(null, shuffleId=0, mapId=-1, reduceId=99, message=
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0
	at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:695)
	at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:691)
	at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
	at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
	at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:691)
	at org.apache.spark.MapOutputTracker.getMapSizesByExecutorId(MapOutputTracker.scala:145)
	at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:49)
	at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:109)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
	at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:332)
	at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:330)
	at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:919)
	at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:910)
	at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:910)
	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:668)
	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
	at org.apache.spark.scheduler.Task.run(Task.scala:85)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)
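
This exception means the reducers asked the MapOutputTracker where the map output for shuffle 0 lives and got no answer: the executor that produced it has been lost, and its shuffle files were lost with it. One commonly suggested mitigation (an assumption about this cluster's setup, not a confirmed fix for this case) is to enable Spark's external shuffle service, so that shuffle files are served by the NodeManager and survive executor loss; this requires the spark_shuffle auxiliary service to be configured on each NodeManager. The class and jar names below are placeholders:

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --conf spark.shuffle.service.enabled=true \
      --class com.example.FeatureExtraction \
      feature-extraction.jar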

After digging through a lot of material, the problem seemed to be caused by mismatched JDK versions: the local JDK was 1.8 while the cluster ran 1.7 (why exactly they are incompatible was unclear). After switching to JDK 1.7 the errors became less frequent, but the job still could not run in parallel across the cluster, even though it ran fine on a single node. The cause was not clear at this point.

(The job was submitted with spark-submit, and the submission parameters were all normal.)
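
If the JDK mismatch really is part of the problem, a workaround worth trying is to pin the same JAVA_HOME for both the ApplicationMaster and the executors at submit time. The JDK path, class name, and jar name below are placeholders for whatever is actually installed and deployed:

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --conf spark.yarn.appMasterEnv.JAVA_HOME=/usr/java/jdk1.7.0 \
      --conf spark.executorEnv.JAVA_HOME=/usr/java/jdk1.7.0 \
      --class com.example.FeatureExtraction \
      feature-extraction.jar

It is also worth confirming that the application jar is compiled with -source 1.7 -target 1.7: a 1.7 JVM cannot load classes built for 1.8 at all (it fails with UnsupportedClassVersionError), which is one plausible way a JDK mismatch turns into container-launch failures like the ExitCodeException above.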

Also, on the cluster, both Spark 1.5 and Spark 2.0 failed to produce results, while Spark 1.3 did. So the current conclusion is that something in Spark 1.5 and later is incompatible with this collaborative filtering job; the exact cause is unknown.
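
For context, the job boils down to a call into Spark's built-in MLlib ALS API, roughly like the minimal sketch below. It is written against the Spark 1.x Java API with JDK 1.7-compatible syntax; all paths, names, and parameter values are illustrative, and setting a checkpoint directory is a commonly suggested way to truncate ALS's long shuffle lineage, not a confirmed fix for this failure:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.Function;
    import org.apache.spark.mllib.recommendation.ALS;
    import org.apache.spark.mllib.recommendation.MatrixFactorizationModel;
    import org.apache.spark.mllib.recommendation.Rating;

    public class ALSFeatureExtraction {  // placeholder driver class
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("ALSFeatureExtraction");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // ALS is iterative; a checkpoint directory lets MLlib periodically
            // truncate the RDD lineage, which is often suggested when long
            // ALS jobs start losing shuffle outputs.
            sc.setCheckpointDir("hdfs:///tmp/als-checkpoint");  // assumed path

            // Parse "user,item,rating" lines into MLlib Rating objects
            // (anonymous class instead of a lambda, to stay JDK 1.7 compatible).
            JavaRDD<Rating> ratings = sc.textFile("hdfs:///data/ratings.csv").map(
                    new Function<String, Rating>() {
                        public Rating call(String line) {
                            String[] f = line.split(",");
                            return new Rating(Integer.parseInt(f[0]),
                                              Integer.parseInt(f[1]),
                                              Double.parseDouble(f[2]));
                        }
                    });

            // rank = 10 latent factors, 10 iterations, lambda = 0.01 (illustrative values).
            MatrixFactorizationModel model = ALS.train(ratings.rdd(), 10, 10, 0.01);

            // The learned user factors are the "user features" being extracted.
            model.userFeatures().saveAsTextFile("hdfs:///output/user-features");
            sc.stop();
        }
    }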

Task skew has many possible causes: network I/O, CPU, and memory pressure can all make the tasks on one node run slowly, so check that node's performance monitoring to find the reason. I once saw a colleague run R jobs on one of the Spark workers, which made the Spark tasks on that node crawl. For this kind of problem, raising executor memory usually helps; increase the CPU cores per executor at the same time so that task parallelism is not reduced (a sketch follows below).
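
A sketch of that tuning at submit time; all numbers are illustrative rather than tuned for this job, and spark.yarn.executor.memoryOverhead is the Spark 1.x name of the off-heap overhead setting:

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --num-executors 50 \
      --executor-memory 8g \
      --executor-cores 4 \
      --conf spark.yarn.executor.memoryOverhead=2048 \
      --class com.example.FeatureExtraction \
      feature-extraction.jar

Raising the memory overhead is worth trying in particular: when YARN kills a container for exceeding its memory limit, the executor's shuffle output dies with it, which is exactly the condition that later surfaces as "Missing an output location for shuffle".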

To recap, the two shuffle-related exceptions to search the logs for are:

org.apache.spark.shuffle.FetchFailedException
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0

    Original author: AI算法践行者
    Original link: https://zhuanlan.zhihu.com/p/28599076