Spark Exception Handling: Executor & Task Lost

Error messages

1. Executor lost
WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, aa.local):
ExecutorLostFailure (executor lost)
2. Task lost
WARN TaskSetManager: Lost task 69.2 in stage 7.0 (TID 1145, 192.168.47.217):
java.io.IOException: Connection from /192.168.47.217:55483 closed
3. Various timeouts
java.util.concurrent.TimeoutException: Futures timed out after [120 second]

ERROR TransportChannelHandler: Connection to /192.168.47.212:35409
has been quiet for 120000 ms while there are outstanding requests.
Assuming connection is dead; please adjust spark.network.timeout
if this is wrong
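
When scanning long driver or executor logs, it can help to sort lines into the three symptoms listed above. The following is only a minimal sketch: the category names and the `classify` helper are hypothetical, and the regular expressions are derived solely from the log excerpts quoted in this post, so other Spark versions may word these messages differently.

```python
import re

# Hypothetical category names; patterns are based only on the excerpts above.
PATTERNS = [
    ("executor_lost", re.compile(r"ExecutorLostFailure")),
    ("task_lost", re.compile(r"java\.io\.IOException: Connection from \S+ closed")),
    ("timeout", re.compile(r"Futures timed out after|has been quiet for \d+ ms")),
]


def classify(line):
    """Return the first matching category name for a log line, or None."""
    for name, pattern in PATTERNS:
        if pattern.search(line):
            return name
    return None


if __name__ == "__main__":
    sample = "java.io.IOException: Connection from /192.168.47.217:55483 closed"
    print(classify(sample))
```

Multi-line log records should be joined into one string before being passed to `classify`, since the exception name often appears on the line after the WARN header.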

Solution

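  • These errors are usually caused by network problems or long GC pauses: heartbeats from the executor or task are not received in time.
  • Increase spark.network.timeout to 300 or higher (300 s = 5 min; the unit is seconds and the default is 120).
  • spark.network.timeout is the default timeout for all network interactions. It is used in place of the following properties whenever they are not set explicitly:
spark.core.connection.ack.wait.timeout
spark.akka.timeout
spark.storage.blockManagerSlaveTimeoutMs
spark.shuffle.io.connectionTimeout
spark.rpc.askTimeout or spark.rpc.lookupTimeout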
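
One way to apply the setting above is to pass it with `--conf` on the `spark-submit` command line. The sketch below only assembles that command; the `build_submit_command` helper, the `my_job.py` application name, and the `300s` value are illustrative assumptions, not part of the original post. Very old Spark releases may expect a bare number of seconds instead of a value with a unit suffix.

```python
import shlex

# Assumed value: 300 s follows the suggestion above; tune it for your cluster.
NETWORK_TIMEOUT = "300s"


def build_submit_command(app_path, network_timeout=NETWORK_TIMEOUT, overrides=None):
    """Return a spark-submit argv list that sets spark.network.timeout.

    `overrides` optionally maps more specific timeout properties
    (for example spark.rpc.askTimeout) to explicit values. Properties
    left out are not passed, so Spark falls back to spark.network.timeout.
    """
    conf = {"spark.network.timeout": network_timeout}
    conf.update(overrides or {})
    argv = ["spark-submit"]
    for key, value in conf.items():
        argv += ["--conf", f"{key}={value}"]
    argv.append(app_path)
    return argv


if __name__ == "__main__":
    # "my_job.py" is a placeholder application path.
    print(shlex.join(build_submit_command("my_job.py")))
```

The same key and value can instead be placed in `spark-defaults.conf` or set on the `SparkConf` object in application code; the command line is used here only because it needs no code change in the job itself.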
    Original author: 望京老司机
    Original URL: https://www.jianshu.com/p/5d0ae6a0343c