Error messages
1. Executor lost
WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, aa.local):
ExecutorLostFailure (executor lost)
2. Task lost
WARN TaskSetManager: Lost task 69.2 in stage 7.0 (TID 1145, 192.168.47.217):
java.io.IOException: Connection from /192.168.47.217:55483 closed
3. Various timeouts
java.util.concurrent.TimeoutException: Futures timed out after [120 second]
ERROR TransportChannelHandler: Connection to /192.168.47.212:35409
has been quiet for 120000 ms while there are outstanding requests.
Assuming connection is dead; please adjust spark.network.timeout if this is wrong
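Solution
- These errors are usually caused by network problems or GC pauses: the worker or executor did not receive the heartbeat feedback from the executor or task in time.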
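- Increase the value of spark.network.timeout to 300 or higher (= 5 min; the unit is seconds, and the default is 120).
- This property sets the timeout for all network interactions. If the following properties are not set explicitly, it is used in their place by default: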
spark.core.connection.ack.wait.timeout
spark.akka.timeout
spark.storage.blockManagerSlaveTimeoutMs
spark.shuffle.io.connectionTimeout
spark.rpc.askTimeout or spark.rpc.lookupTimeout
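
The fallback rule described above can be sketched in a few lines of Python. This is only an illustrative model, not Spark's actual implementation: the helper name effective_timeout_s is hypothetical, the example values 300 and 600 are made up, and every value is treated as seconds for simplicity (the real properties do not all share one unit).

```python
# Illustrative model of the fallback rule; NOT Spark's real code.
NETWORK_TIMEOUT_KEY = "spark.network.timeout"
DEFAULT_NETWORK_TIMEOUT_S = 120  # default stated in the note above

# Properties that fall back to spark.network.timeout when unset
# (taken from the list above).
DEPENDENT_KEYS = [
    "spark.core.connection.ack.wait.timeout",
    "spark.akka.timeout",
    "spark.storage.blockManagerSlaveTimeoutMs",
    "spark.shuffle.io.connectionTimeout",
    "spark.rpc.askTimeout",
    "spark.rpc.lookupTimeout",
]


def effective_timeout_s(conf, key):
    """Hypothetical helper: timeout (in seconds) that applies to `key`."""
    if key in conf:
        # An explicitly set property always wins.
        return conf[key]
    # Otherwise fall back to spark.network.timeout, then to its default.
    return conf.get(NETWORK_TIMEOUT_KEY, DEFAULT_NETWORK_TIMEOUT_S)


# Example: the global timeout is raised to 300 s, one property is overridden.
conf = {"spark.network.timeout": 300, "spark.rpc.askTimeout": 600}
resolved = {key: effective_timeout_s(conf, key) for key in DEPENDENT_KEYS}

for key, seconds in resolved.items():
    print(f"{key} -> {seconds}s")
```

In a real job the value is normally supplied through the Spark configuration, for example with --conf spark.network.timeout=300s on spark-submit.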