[SPARK-27637][Shuffle] For nettyBlockTransferService, if IOException occurred while fetching data, check whether relative executor is alive before retry #24533
Does this account for the case where the external shuffle service is enabled?
ExecutorAliveChecker is only implemented in NettyBlockTransferService.
thanks @cloud-fan
LGTM, also cc @jiangxb1987 @zsxwing
Test build #105491 has finished for PR 24533 at commit
Test build #105554 has finished for PR 24533 at commit
This test (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/105561/console) has been hanging for 2 hours for an unknown reason.
Test build #105569 has finished for PR 24533 at commit
Test build #105561 has finished for PR 24533 at commit
thanks, merging to master!
…occurred while fetching data, check whether relative executor is alive before retry
There are two kinds of shuffle client: blockTransferService and externalShuffleClient.
The externalShuffleClient talks to the corresponding external shuffle service, which serves shuffle block data regardless of the state of the executors.
The blockTransferService is used to fetch broadcast blocks, and to fetch shuffle data when the external shuffle service is not enabled.
When fetching data through blockTransferService, the shuffle client connects to the target executor's blockManager, so if that executor is dead, the fetch can never succeed.
When both spark.shuffle.service.enabled and spark.dynamicAllocation.enabled are true, an executor is removed once it has been idle for longer than idleTimeout.
If a blockTransferService successfully creates a connection to an executor, but that executor is removed just as the broadcast block fetch begins, the fetch is retried (see RetryingBlockFetcher), which is pointless.
If spark.shuffle.io.retryWait and spark.shuffle.io.maxRetries are large, such as 30s and 10 retries, this wastes 5 minutes.
In this PR, we check whether the target executor is alive before retrying.
Unit test.
Closes apache#24533 from turboFei/SPARK-27637.
Authored-by: hustfeiwang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…ce, if IOException occurred while create client, check whether relative executor is alive before retry apache#24533" This reverts commit 036fd39.
…ferService, if IOException occurred while create client, check whether relative executor is alive before retry apache#24533"" This reverts commit 4f3fde2.
…n fast fail time window
Squashed commit of the following:
commit aeedd82
Author: hustfeiwang <[email protected]>
Date: Tue May 21 13:45:42 2019 +0800
[SPARK-27637][SHUFFLE] For nettyBlockTransferService, if IOException occurred while fetching data, check whether relative executor is alive before retry
Closes apache#24533 from turboFei/SPARK-27637.
Authored-by: hustfeiwang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
There are two kinds of shuffle client: blockTransferService and externalShuffleClient.
The externalShuffleClient talks to the corresponding external shuffle service, which serves shuffle block data regardless of the state of the executors.
The blockTransferService is used to fetch broadcast blocks, and to fetch shuffle data when the external shuffle service is not enabled.
When fetching data through blockTransferService, the shuffle client connects to the target executor's blockManager, so if that executor is dead, the fetch can never succeed.
When both spark.shuffle.service.enabled and spark.dynamicAllocation.enabled are true, an executor is removed once it has been idle for longer than idleTimeout.
If a blockTransferService successfully creates a connection to an executor, but that executor is removed just as the broadcast block fetch begins, the fetch is retried (see RetryingBlockFetcher), which is pointless.
If spark.shuffle.io.retryWait and spark.shuffle.io.maxRetries are large, such as 30s and 10 retries, this wastes 5 minutes.
In this PR, we check whether the target executor is alive before retrying.
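For readers who want the idea in code, here is a minimal, self-contained sketch of a retry loop that consults a liveness check before sleeping and retrying. It is not the code in this PR and not Spark's RetryingBlockFetcher: the names RetrySketch, FetchAttempt, ExecutorGoneException, fetchWithRetry and worstCaseWaitMillis are all hypothetical, and the liveness check is modelled as a plain BooleanSupplier. It also shows where the 5-minute figure comes from (10 retries × 30s).

```java
import java.io.IOException;
import java.util.function.BooleanSupplier;

// Hypothetical sketch only; none of these names come from Spark.
class RetrySketch {

    /** Stand-in for a single fetch attempt against a remote executor. */
    interface FetchAttempt {
        void run() throws IOException;
    }

    /** Illustrative exception raised when the remote executor is known to be gone. */
    static class ExecutorGoneException extends IOException {
        ExecutorGoneException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Upper bound on the time spent sleeping between retries. */
    static long worstCaseWaitMillis(int maxRetries, long retryWaitMillis) {
        return (long) maxRetries * retryWaitMillis;
    }

    /**
     * Runs the attempt, retrying on IOException at most maxRetries times.
     * Before each retry it asks executorAlive; if the executor is dead it
     * fails immediately instead of waiting retryWaitMillis again.
     * Returns the number of attempts that were made.
     */
    static int fetchWithRetry(FetchAttempt attempt, BooleanSupplier executorAlive,
                              int maxRetries, long retryWaitMillis)
            throws IOException, InterruptedException {
        int attempts = 0;
        while (true) {
            attempts++;
            try {
                attempt.run();
                return attempts;
            } catch (IOException e) {
                if (!executorAlive.getAsBoolean()) {
                    // Retrying cannot succeed, so skip the remaining waits.
                    throw new ExecutorGoneException("executor is dead, not retrying", e);
                }
                if (attempts > maxRetries) {
                    throw e; // retries exhausted
                }
                Thread.sleep(retryWaitMillis);
            }
        }
    }

    public static void main(String[] args) {
        // 10 retries with a 30s wait: 300000 ms, i.e. 5 minutes.
        System.out.println(worstCaseWaitMillis(10, 30_000L));
    }
}
```

With a dead executor the loop above gives up after the first failed attempt, so none of the worst-case wait is spent; with a live executor it behaves like an ordinary bounded retry.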
How was this patch tested?
Unit test.