[SPARK-27637][Shuffle] For nettyBlockTransferService, if IOException occurred while fetching data, check whether relative executor is alive before retry #24533

turboFei · 2019-05-06T05:00:53Z

What changes were proposed in this pull request?

There are several kinds of shuffle client, such as blockTransferService and externalShuffleClient.

The externalShuffleClient talks to the corresponding external shuffle service, which serves shuffle block data regardless of the state of the executors.

The blockTransferService is used to fetch broadcast blocks, and also to fetch shuffle data when the external shuffle service is not enabled.

When fetching data through blockTransferService, the shuffle client connects to the corresponding executor's blockManager, so if that executor is dead, the fetch can never succeed.

When spark.shuffle.service.enabled is true and spark.dynamicAllocation.enabled is true, an executor is removed once it has been idle for longer than idleTimeout.

If a blockTransferService successfully creates a connection to the corresponding executor, but that executor is removed just as the fetch of a broadcast block begins, the fetch is retried (see RetryingBlockFetcher), which is ineffective.

If spark.shuffle.io.retryWait and spark.shuffle.io.maxRetries are large, for example 30s and 10 retries, the retries waste 5 minutes (10 × 30s) before the fetch finally fails.
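To make the arithmetic explicit, here is a minimal sketch of the worst-case wait, assuming every attempt fails and the client sleeps for the full retry wait before each retry. `RetryWaitEstimate` and `worstCaseWait` are hypothetical names used only for illustration; they are not part of Spark.

```java
import java.time.Duration;

// Hypothetical helper, not part of Spark: estimates how long a fetch spends
// waiting between retries when every attempt against a dead executor fails.
final class RetryWaitEstimate {

    // retryWait corresponds to spark.shuffle.io.retryWait,
    // maxRetries to spark.shuffle.io.maxRetries.
    static Duration worstCaseWait(Duration retryWait, int maxRetries) {
        return retryWait.multipliedBy(maxRetries);
    }
}
```

With a retry wait of 30s and 10 retries this gives 300 seconds, which is the 5 minutes mentioned above.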

In this PR, we check whether the corresponding executor is alive before retrying.
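The idea can be sketched as a retry loop that consults a liveness check after each IOException. This is only an illustration under assumptions: `LivenessAwareRetry`, `FetchAttempt`, `ExecutorGoneException`, and the `executorAlive` supplier are hypothetical names, and the code does not mirror the actual structure of RetryingBlockFetcher or NettyBlockTransferService.

```java
import java.io.IOException;
import java.util.function.BooleanSupplier;

// Illustrative only: all names here are hypothetical and do not reproduce
// the real RetryingBlockFetcher code.
final class LivenessAwareRetry {

    /** Signals that the remote executor is gone, so retrying is pointless. */
    static final class ExecutorGoneException extends RuntimeException {
        ExecutorGoneException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** A single fetch attempt that may fail with an IOException. */
    interface FetchAttempt<T> {
        T run() throws IOException;
    }

    static <T> T fetchWithRetry(FetchAttempt<T> attempt,
                                BooleanSupplier executorAlive,
                                int maxRetries,
                                long retryWaitMillis)
            throws IOException, InterruptedException {
        int retries = 0;
        while (true) {
            try {
                return attempt.run();
            } catch (IOException e) {
                // Fail fast: a dead executor will never serve the block.
                if (!executorAlive.getAsBoolean()) {
                    throw new ExecutorGoneException("executor is dead, not retrying", e);
                }
                // The executor is alive, so the failure may be transient.
                if (retries >= maxRetries) {
                    throw e;
                }
                retries++;
                Thread.sleep(retryWaitMillis);
            }
        }
    }
}
```

With a dead executor the first IOException ends the fetch immediately instead of sleeping through every configured retry; with a live executor the usual retry behaviour is unchanged.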

How was this patch tested?

Unit test.

turboFei · 2019-05-06T05:17:02Z

@cloud-fan @gatorsmile

felixcheung

does this account for the cases when external shuffle service is enabled?

turboFei · 2019-05-06T08:13:59Z

does this account for the cases when external shuffle service is enabled?

ExecutorAliveChecker is only implemented in nettyBlockTransferService.
When the external shuffle service is enabled, nettyBlockTransferService is used to fetch broadcast block data and externalShuffleClient is used to fetch shuffle block data.
Therefore, this change covers fetching broadcast blocks when the external shuffle service is enabled.
When the external shuffle service is not enabled, it covers fetching both shuffle block data and broadcast data. @felixcheung
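The cases described in this reply can be summarised in a small sketch. The enums and method names below are hypothetical and exist only to model the explanation; Spark does not expose an API like this.

```java
// Hypothetical model of the reply above; not Spark code.
final class FetchClientChooser {

    enum BlockKind { BROADCAST, SHUFFLE }

    enum Client { NETTY_BLOCK_TRANSFER_SERVICE, EXTERNAL_SHUFFLE_CLIENT }

    // Shuffle blocks go through the external shuffle client only when the
    // external shuffle service is enabled; everything else uses the
    // netty block transfer service.
    static Client clientFor(BlockKind kind, boolean externalShuffleServiceEnabled) {
        if (kind == BlockKind.SHUFFLE && externalShuffleServiceEnabled) {
            return Client.EXTERNAL_SHUFFLE_CLIENT;
        }
        return Client.NETTY_BLOCK_TRANSFER_SERVICE;
    }

    // The alive check lives only in the netty block transfer service,
    // so only fetches routed through it benefit from this change.
    static boolean coveredByAliveCheck(BlockKind kind, boolean externalShuffleServiceEnabled) {
        return clientFor(kind, externalShuffleServiceEnabled) == Client.NETTY_BLOCK_TRANSFER_SERVICE;
    }
}
```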

turboFei · 2019-05-13T09:02:25Z

thanks @cloud-fan

cloud-fan · 2019-05-17T12:37:40Z

LGTM, also cc @jiangxb1987 @zsxwing

SparkQA · 2019-05-17T15:24:29Z

Test build #105491 has finished for PR 24533 at commit 2b2bf00.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-05-20T07:05:01Z

Test build #105554 has finished for PR 24533 at commit decb68f.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

turboFei · 2019-05-20T11:25:27Z

This test (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/105561/console) has been hanging for 2 hours for an unknown reason,
so I force-pushed a commit to trigger a new test run.

SparkQA · 2019-05-20T13:27:02Z

Test build #105569 has finished for PR 24533 at commit 90b1d7e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-05-20T13:58:13Z

Test build #105561 has finished for PR 24533 at commit 1c43c10.

This patch fails from timeout after a configured wait of 400m.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2019-05-21T05:46:05Z

thanks, merging to master!

…occurred while fetching data, check whether relative executor is alive before retry

There are several kinds of shuffle client, blockTransferService and externalShuffleClient.

For the externalShuffleClient, there are relative external shuffle service, which guarantees the shuffle block data and regardless the state of executors.

For the blockTransferService, it is used to fetch broadcast block, and fetch the shuffle data when external shuffle service is not enabled.

When fetching data by using blockTransferService, the shuffle client would connect relative executor's blockManager, so if the relative executor is dead, it would never fetch successfully.

When spark.shuffle.service.enabled is true and spark.dynamicAllocation.enabled is true, the executor will be removed while it has been idle for more than idleTimeout.

If a blockTransferService create connection to relative executor successfully, but the relative executor is removed when beginning to fetch broadcast block, it would retry (see RetryingBlockFetcher), which is Ineffective.

If the spark.shuffle.io.retryWait and spark.shuffle.io.maxRetries is big, such as 30s and 10 times, it would waste 5 minutes.

In this PR, we check whether relative executor is alive before retry.

Unit test.

Closes apache#24533 from turboFei/SPARK-27637.

Authored-by: hustfeiwang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

…ce, if IOException occurred while create client, check whether relative executor is alive before retry apache#24533" This reverts commit 036fd39.

…ferService, if IOException occurred while create client, check whether relative executor is alive before retry apache#24533"" This reverts commit 4f3fde2.

…n fast fail time window

Squashed commit of the following:

commit aeedd82
Author: hustfeiwang <[email protected]>
Date: Tue May 21 13:45:42 2019 +0800

[SPARK-27637][SHUFFLE] For nettyBlockTransferService, if IOException occurred while fetching data, check whether relative executor is alive before retry

There are several kinds of shuffle client, blockTransferService and externalShuffleClient.

For the externalShuffleClient, there are relative external shuffle service, which guarantees the shuffle block data and regardless the state of executors.

For the blockTransferService, it is used to fetch broadcast block, and fetch the shuffle data when external shuffle service is not enabled.

When fetching data by using blockTransferService, the shuffle client would connect relative executor's blockManager, so if the relative executor is dead, it would never fetch successfully.

When spark.shuffle.service.enabled is true and spark.dynamicAllocation.enabled is true, the executor will be removed while it has been idle for more than idleTimeout.

If a blockTransferService create connection to relative executor successfully, but the relative executor is removed when beginning to fetch broadcast block, it would retry (see RetryingBlockFetcher), which is Ineffective.

If the spark.shuffle.io.retryWait and spark.shuffle.io.maxRetries is big, such as 30s and 10 times, it would waste 5 minutes.

In this PR, we check whether relative executor is alive before retry.

Unit test.

Closes apache#24533 from turboFei/SPARK-27637.

Authored-by: hustfeiwang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

turboFei force-pushed the SPARK-27637 branch from 5cdd45a to bd7d729 Compare May 6, 2019 05:03

felixcheung reviewed May 6, 2019

cloud-fan reviewed May 6, 2019

common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RetryingBlockFetcher.java

cloud-fan reviewed May 6, 2019

core/src/main/scala/org/apache/spark/network/netty/NettyBlockTransferService.scala

cloud-fan reviewed May 8, 2019

core/src/main/scala/org/apache/spark/network/netty/NettyBlockTransferService.scala

cloud-fan reviewed May 8, 2019

common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RetryingBlockFetcher.java

turboFei force-pushed the SPARK-27637 branch 4 times, most recently from 703211a to 3578df9 Compare May 8, 2019 14:52

cloud-fan reviewed May 9, 2019

common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RetryingBlockFetcher.java

cloud-fan reviewed May 9, 2019

common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RetryingBlockFetcher.java

cloud-fan reviewed May 9, 2019

core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala

turboFei force-pushed the SPARK-27637 branch 3 times, most recently from 566848d to 72ce9a4 Compare May 10, 2019 01:27

cloud-fan reviewed May 10, 2019

core/src/main/scala/org/apache/spark/network/netty/NettyBlockTransferService.scala

turboFei force-pushed the SPARK-27637 branch from e6a857b to 1b28755 Compare May 10, 2019 15:31

This was referenced May 12, 2019

About my pull request contributed to apache spark. turboFei/turbofei.github.com#2

Closed

About the implementation of transactions support for transmitting data from sparksql to greenplum. turboFei/turbofei.github.com#1

Closed

cloud-fan reviewed May 13, 2019

common/network-common/src/main/java/org/apache/spark/network/client/ExecutorDeadException.java

cloud-fan reviewed May 17, 2019

core/src/test/scala/org/apache/spark/network/netty/NettyBlockTransferServiceSuite.scala

cloud-fan reviewed May 17, 2019

core/src/test/scala/org/apache/spark/network/netty/NettyBlockTransferServiceSuite.scala

fix nit

2b2bf00

cloud-fan reviewed May 20, 2019

core/src/main/scala/org/apache/spark/network/netty/NettyBlockTransferService.scala

turboFei force-pushed the SPARK-27637 branch from 52d978c to 2b2bf00 Compare May 20, 2019 06:23

[SPARK-27637] If asksync fails, throw the origin exception

decb68f

remove the assert

90b1d7e

turboFei force-pushed the SPARK-27637 branch from 1c43c10 to 90b1d7e Compare May 20, 2019 11:20

cloud-fan closed this in d90c460 May 21, 2019

AngersZhuuuu mentioned this pull request Oct 25, 2019

[SPARK-29551][CORE] Fix a bug about fetch failed when an executor is lost #26206

Closed

Ngone51 mentioned this pull request Aug 17, 2021

[SPARK-35011][CORE] Avoid Block Manager registrations when StopExecutor msg is in-flight #32114

Closed
