@turboFei
Member

@turboFei turboFei commented May 6, 2019

What changes were proposed in this pull request?

There are several kinds of shuffle client, including blockTransferService and externalShuffleClient.

The externalShuffleClient talks to a corresponding external shuffle service, which serves shuffle block data regardless of the state of the executors.

The blockTransferService is used to fetch broadcast blocks, and also to fetch shuffle data when the external shuffle service is not enabled.

When fetching data through blockTransferService, the shuffle client connects to the target executor's blockManager, so if that executor is dead the fetch can never succeed.

When spark.shuffle.service.enabled and spark.dynamicAllocation.enabled are both true, an executor is removed once it has been idle for longer than idleTimeout.

If a blockTransferService successfully creates a connection to the target executor, but that executor is removed just as the broadcast block fetch begins, the fetch is retried (see RetryingBlockFetcher), which is ineffective because the executor is already gone.

If spark.shuffle.io.retryWait and spark.shuffle.io.maxRetries are large, for example 30s and 10 retries, this wastes 5 minutes (10 × 30s).
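
The five-minute figure is the product of the two settings. A minimal sketch of that arithmetic, assuming every attempt fails and each retry is preceded by one full retryWait (connection timeouts are ignored). The class and method names are illustrative only and are not part of Spark:

```java
import java.time.Duration;

// Illustrative helper, not part of Spark: estimates the time spent waiting
// between retries when every fetch attempt fails.
final class RetryBudget {
    // Assumes each of the maxRetries retries waits one full retryWait first.
    static Duration worstCaseWait(int maxRetries, Duration retryWait) {
        return retryWait.multipliedBy(maxRetries);
    }

    public static void main(String[] args) {
        // Values from the description: retryWait = 30s, maxRetries = 10.
        Duration wasted = worstCaseWait(10, Duration.ofSeconds(30));
        System.out.println(wasted.toMinutes() + " minutes");
    }
}
```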

In this PR, we check whether the target executor is still alive before retrying.
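
The idea can be sketched as a retry loop that consults a liveness check before waiting again. This is a simplified, synchronous illustration and not the code in this PR: `AliveCheckedRetry`, `fetchWithRetry`, and the `executorAlive` supplier are hypothetical names, and the real RetryingBlockFetcher is asynchronous.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.function.BooleanSupplier;

// Hypothetical, simplified sketch; not the implementation merged in this PR.
final class AliveCheckedRetry {
    static <T> T fetchWithRetry(Callable<T> fetch,
                                BooleanSupplier executorAlive,
                                int maxRetries,
                                long retryWaitMs) throws Exception {
        int retries = 0;
        while (true) {
            try {
                return fetch.call();
            } catch (IOException e) {
                // Fail fast when retries are exhausted or the remote executor
                // is already gone: waiting retryWaitMs again would not help.
                if (retries >= maxRetries || !executorAlive.getAsBoolean()) {
                    throw e;
                }
                retries++;
                Thread.sleep(retryWaitMs);
            }
        }
    }
}
```

With a dead executor the loop gives up after the first IOException instead of sleeping through every configured retry; with a live executor it behaves like a plain bounded retry.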

How was this patch tested?

Unit test.

@turboFei turboFei changed the title [SPARK-17637] For nettyBlockTransferService, when exception occured hen fetching data, check whether relative executor is alive before retry [SPARK-27637] For nettyBlockTransferService, when exception occured hen fetching data, check whether relative executor is alive before retry May 6, 2019
@turboFei
Member Author

turboFei commented May 6, 2019

@cloud-fan @gatorsmile

@felixcheung felixcheung changed the title [SPARK-27637] For nettyBlockTransferService, when exception occured hen fetching data, check whether relative executor is alive before retry [SPARK-27637] For nettyBlockTransferService, when exception occurred when fetching data, check whether relative executor is alive before retry May 6, 2019
Member

@felixcheung felixcheung left a comment

does this account for the cases when external shuffle service is enabled?

@turboFei
Member Author

turboFei commented May 6, 2019

does this account for the cases when external shuffle service is enabled?

ExecutorAliveChecker is only implemented in nettyBlockTransferService.
When the external shuffle service is enabled, nettyBlockTransferService is used to fetch broadcast block data and externalShuffleClient is used to fetch shuffle block data.
Therefore, when the external shuffle service is enabled, this change covers fetching broadcast blocks.
When the external shuffle service is not enabled, it covers fetching both shuffle block data and broadcast data. @felixcheung
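
The routing described in this reply can be summarised in a small sketch. The enum and method names below are hypothetical and only mirror the cases listed above; they do not correspond to Spark classes:

```java
// Hypothetical names; models only the routing described in the reply above.
final class ClientRouting {
    enum BlockKind { BROADCAST, SHUFFLE }
    enum Client { NETTY_BLOCK_TRANSFER_SERVICE, EXTERNAL_SHUFFLE_CLIENT }

    // Shuffle blocks go through the external shuffle client only when the
    // external shuffle service is enabled; everything else uses netty.
    static Client clientFor(BlockKind kind, boolean externalShuffleServiceEnabled) {
        if (kind == BlockKind.SHUFFLE && externalShuffleServiceEnabled) {
            return Client.EXTERNAL_SHUFFLE_CLIENT;
        }
        return Client.NETTY_BLOCK_TRANSFER_SERVICE;
    }

    // The alive check lives only in the netty path, so it applies exactly
    // when that client handles the fetch.
    static boolean aliveCheckApplies(BlockKind kind, boolean externalShuffleServiceEnabled) {
        return clientFor(kind, externalShuffleServiceEnabled)
                == Client.NETTY_BLOCK_TRANSFER_SERVICE;
    }
}
```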

@turboFei turboFei force-pushed the SPARK-27637 branch 4 times, most recently from 703211a to 3578df9 Compare May 8, 2019 14:52
@turboFei turboFei force-pushed the SPARK-27637 branch 3 times, most recently from 566848d to 72ce9a4 Compare May 10, 2019 01:27
@turboFei turboFei changed the title [SPARK-27637] For nettyBlockTransferService, when exception occurred when fetching data, check whether relative executor is alive before retry [SPARK-27637] For nettyBlockTransferService, if IOException occurred when fetching data, check whether relative executor is alive before retry May 12, 2019
@turboFei turboFei changed the title [SPARK-27637] For nettyBlockTransferService, if IOException occurred when fetching data, check whether relative executor is alive before retry [SPARK-27637] For nettyBlockTransferService, if IOException occurred during fetching data, check whether relative executor is alive before retry May 12, 2019
@turboFei turboFei changed the title [SPARK-27637] For nettyBlockTransferService, if IOException occurred during fetching data, check whether relative executor is alive before retry [SPARK-27637] For nettyBlockTransferService, if IOException occurred while fetching data, check whether relative executor is alive before retry May 12, 2019
@turboFei turboFei changed the title [SPARK-27637] For nettyBlockTransferService, if IOException occurred while fetching data, check whether relative executor is alive before retry [SPARK-27637][Shuffle] For nettyBlockTransferService, if IOException occurred while fetching data, check whether relative executor is alive before retry May 13, 2019
@turboFei
Member Author

thanks @cloud-fan

@cloud-fan
Contributor

LGTM, also cc @jiangxb1987 @zsxwing

@SparkQA

SparkQA commented May 17, 2019

Test build #105491 has finished for PR 24533 at commit 2b2bf00.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 20, 2019

Test build #105554 has finished for PR 24533 at commit decb68f.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@turboFei
Member Author

turboFei commented May 20, 2019

This test (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/105561/console) has been hanging for 2 hours for an unknown reason,
so I force-pushed a commit to trigger a new test run.

@SparkQA

SparkQA commented May 20, 2019

Test build #105569 has finished for PR 24533 at commit 90b1d7e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 20, 2019

Test build #105561 has finished for PR 24533 at commit 1c43c10.

  • This patch fails from timeout after a configured wait of 400m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in d90c460 May 21, 2019
turboFei pushed a commit to turboFei/spark that referenced this pull request May 21, 2019
…occurred while fetching data, check whether relative executor is alive before retry

Closes apache#24533 from turboFei/SPARK-27637.

Authored-by: hustfeiwang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
HyukjinKwon added a commit to HyukjinKwon/spark that referenced this pull request Aug 19, 2019
…ce, if IOException occurred while create client, check whether relative executor is alive before retry apache#24533"

This reverts commit 036fd39.
HyukjinKwon added a commit to HyukjinKwon/spark that referenced this pull request Aug 19, 2019
…ferService, if IOException occurred while create client, check whether relative executor is alive before retry apache#24533""

This reverts commit 4f3fde2.
pan3793 pushed a commit to NetEase/spark that referenced this pull request Feb 9, 2022
…n fast fail time window

Squashed commit of the following:

commit aeedd82
Author: hustfeiwang <[email protected]>
Date:   Tue May 21 13:45:42 2019 +0800

    [SPARK-27637][SHUFFLE] For nettyBlockTransferService, if IOException occurred while fetching data, check whether relative executor is alive before retry

    Closes apache#24533 from turboFei/SPARK-27637.

    Authored-by: hustfeiwang <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>