Skip to content

Conversation

@jxiang
Copy link
Contributor

@jxiang jxiang commented Feb 28, 2017

What changes were proposed in this pull request?

While some executors are being killed due to idleness, if some new tasks come in, driver could assign them to some executors are being killed. These tasks will fail later when the executors are lost. This patch is to make sure CoarseGrainedSchedulerBackend#killExecutors and DriverEndpoint#makeOffers are properly synchronized.

How was this patch tested?

manual tests

@jxiang jxiang force-pushed the spark-19757 branch 3 times, most recently from 6485fb8 to 6bc6b2f Compare March 1, 2017 17:36
Copy link
Contributor

@vanzin vanzin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you fix the title and description?

The title should summarize the change, not the bug. The description should explain better why this is necessary, and you left parts of the template in it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this racing against?

If it's requestTotalExecutors you're using the wrong lock, since that method is in a different class. It would be good to add comments mentioning the race (like the code handling the RegisterExecutor message).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. Updated the description and the title, and added comment to mention the race.
Fixed the synchronized statement.

@jxiang jxiang changed the title [SPARK-19757][CORE] Executor with task scheduled could be killed due to idleness DriverEndpoint#makeOffers race against CoarseGrainedSchedulerBackend#killExecutors Mar 3, 2017
@vanzin
Copy link
Contributor

vanzin commented Mar 3, 2017

ok to test

@SparkQA
Copy link

SparkQA commented Mar 3, 2017

Test build #73859 has finished for PR 17091 at commit 6a04ff7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Copy link
Contributor

vanzin commented Mar 3, 2017

Can you add back the bug to the PR title?

@jxiang jxiang changed the title DriverEndpoint#makeOffers race against CoarseGrainedSchedulerBackend#killExecutors [SPARK-19757] DriverEndpoint#makeOffers race against CoarseGrainedSchedulerBackend#killExecutors Mar 5, 2017
@jxiang
Copy link
Contributor Author

jxiang commented Mar 5, 2017

Sure. Thanks.

@jxiang jxiang changed the title [SPARK-19757] DriverEndpoint#makeOffers race against CoarseGrainedSchedulerBackend#killExecutors [SPARK-19757][CORE] DriverEndpoint#makeOffers race against CoarseGrainedSchedulerBackend#killExecutors Mar 6, 2017
@jxiang
Copy link
Contributor Author

jxiang commented Mar 6, 2017

@vanzin I updated the patch a little so as not to kill an executor if it is busy.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this should change. This is the user's code explicitly asking for executors to be killed, so there's no reason to not honor that request. It would also change the behavior of a public API, and there's no way for the user to know which executors were killed or not from the method's return value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just found out that SPARK-16554 just fixed the issue of killing busy executors. Reverted this change and updated the patch. Thanks.

@SparkQA
Copy link

SparkQA commented Mar 7, 2017

Test build #74046 has finished for PR 17091 at commit 67b874e.

  • This patch fails from timeout after a configured wait of `250m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 7, 2017

Test build #74117 has finished for PR 17091 at commit 9ade0fd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jxiang jxiang closed this Mar 7, 2017
@vanzin
Copy link
Contributor

vanzin commented Mar 8, 2017

Did you mean to close this?

@jxiang jxiang reopened this Mar 8, 2017
@jxiang
Copy link
Contributor Author

jxiang commented Mar 8, 2017

Oh, sorry, I closed it by mistake. I'd like to get it in. Thanks.

@vanzin
Copy link
Contributor

vanzin commented Mar 8, 2017

ok to test

@SparkQA
Copy link

SparkQA commented Mar 8, 2017

Test build #74214 has finished for PR 17091 at commit 9ade0fd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Copy link
Contributor

vanzin commented Mar 9, 2017

Merging to master.

@asfgit asfgit closed this in b60b9fc Mar 9, 2017
@jxiang jxiang deleted the spark-19757 branch June 6, 2017 17:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants