[SPARK-20628][CORE][K8S] Start to improve Spark decommissioning & preemption support #26440
Conversation
…the cloud provider/scheduler lets them know they aren't going to be removed immediately but instead will be removed soon. This concept fits nicely in K8s and also with spot instances on AWS / preemptible instances, all of which can give us notice that our host is going away. For now we simply stop scheduling jobs & caching blocks; in the future we could perform some kind of migration of data during scale-down.
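To make that concrete, here is a minimal, purely illustrative sketch of the executor-side state (the class and method names are invented for the example and are not the PR's actual code): once the decommission notice arrives, in-flight work continues but no new blocks are accepted for caching.

```scala
// Hypothetical sketch of an executor-side decommission flag; not the
// actual implementation from this PR.
class ExecutorDecommissionState {
  @volatile private var decommissioned = false

  // Called when the decommission notice arrives from the cloud
  // provider or the scheduler.
  def decommission(): Unit = { decommissioned = true }

  // Consulted before accepting a new block for local caching.
  def mayAcceptNewBlocks: Boolean = !decommissioned
}
```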
Test build #113474 has finished for PR 26440 at commit
Just took a quick skim through; I know this is WIP, so feel free to ignore comments on parts you just haven't implemented yet.
Looks like the JDK11 build is borked right now. I'll merge in master on Monday.
Test build #113477 has finished for PR 26440 at commit
GitHub Action failure is due to
Hi, @holdenk. Could you fix the failing UT? I'm also facing the UT failure locally on my Mac.
After fixing all UTs, let's trigger JDK11 testing, too.
Sure, I'll work on that this Monday.
OK, digging into it: looks like while I was updating the PR from 2017 I accidentally broke the message-receive code path, but I've just been doing integration on K8s with the new PR, hence why the UT is broken. This might take another day to resolve because the code path is a little convoluted and life is busy.
Test build #113637 has started for PR 26440 at commit
Test build #113654 has finished for PR 26440 at commit
Test build #113655 has finished for PR 26440 at commit
Test build #113658 has finished for PR 26440 at commit
Test build #113660 has finished for PR 26440 at commit
decommissioned = true
// Tell master we are decommissioned so it stops trying to schedule us
if (driver.nonEmpty) {
  driver.get.askSync[Boolean](DecommissionExecutor(executorId))
Instead of decommissioning a single executor, can we have entire-node decommission?
e.g. driver.get.askSync[Boolean](AddNodeToDecommission(hostname, terminationTime, NodeLossReason))
Same as the previous comment: in standalone only, sure, but in YARN/K8s we could see individual executors decommission.
So @tooptoop4, in its present state it could help; you'd call decommission instead of stop. But we'd probably want to see the last step before fully considering https://issues.apache.org/jira/browse/SPARK-30610 solved: right now it won't schedule any new jobs, but it won't exit & shut down automatically (you can use a timer and sort of approximate it, but it's not perfect).
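For the timer-based approximation mentioned above, a rough sketch might look like the following. This is illustrative only; the grace period and the hard `System.exit` are assumptions for the example, not what the PR implements.

```scala
import java.util.concurrent.{Executors, TimeUnit}

// Best-effort exit some time after decommissioning starts. This is the
// imperfect workaround described above: tasks still running when the
// timer fires will simply be cut off.
def scheduleExitAfterDecommission(gracePeriodSeconds: Long): Unit = {
  val timer = Executors.newSingleThreadScheduledExecutor()
  timer.schedule(new Runnable {
    override def run(): Unit = System.exit(0)
  }, gracePeriodSeconds, TimeUnit.SECONDS)
}
```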
@itskals so I'm not 100% sure what you want us to do in
It passes tests now; I'm going to do a read-through this week. If no one has any outstanding concerns, though, I'm planning on merging this on Friday (to master). We can continue the discussion about 3.0 on dev@.
Kubernetes integration test starting
Kubernetes integration test status success
Test build #118380 has finished for PR 26440 at commit
Hi, @holdenk. This PR seems to add a flaky test on
Do you have any idea about the root cause of the flakiness?
I am talking about decommissioning all executors in standalone. IIUC, all executors will shut down at once (because of WorkerWatcher). But I think it would work if SIGPWR is manually controlled.
I'm having difficulty understanding your concern, @Ngone51. So in standalone mode it's up to the user to write their decommissioning shell script and register it with the cloud provider or whatever mechanism is being used to notify the executors of decommissioning (or, if it's maintenance, to send the signal manually). All of the executors will not shut down because one executor receives a SIGPWR: WorkerWatcher just checks to make sure the RPC connection still works, and we don't shut down the RPC mechanism during decommissioning. Can you point out where my understanding doesn't match yours?
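As a hedged sketch of the signal wiring described here (the registration helper and `decommissionSelf` are placeholders for illustration, not the PR's actual API), the executor-side hookup might look roughly like this:

```scala
import sun.misc.{Signal, SignalHandler}

// Register a SIGPWR handler (on platforms that support it) that flips
// the executor into decommissioning instead of shutting it down; the
// RPC connection stays up, so WorkerWatcher does not kill the process.
def registerDecommissionSignal(decommissionSelf: () => Unit): Unit = {
  Signal.handle(new Signal("PWR"), new SignalHandler {
    override def handle(sig: Signal): Unit = decommissionSelf()
  })
}
```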
Thanks, @holdenk. It was me trying to understand the whole story. I can imagine how decommission performs in standalone now if the signal is somehow controlled manually. (At the beginning, I was thinking that if SIGPWR came from a hardware failure, then no one could escape and it would even be impossible to do decommission.)
@holdenk Did we get an LGTM from any other committer before merging this PR?
Some other committers reviewed it early on, and I brought it to the dev@ list and left it open after stating my intent to merge, in case any committer who had been involved with reviewing had blocking issues they wanted to raise.
I left a comment but removed it back here. I will comment on the JIRA to discuss in a single place and avoid having many branches of the same discussion; see SPARK-20624.
For this PR specifically, I think it should have been explicitly approved, as it affects all other components and it's pretty big. I think this is when we needed to call for more reviews and explicit approvals. For #28370 specifically, the review feedback was not fully addressed; it was just merged. Maybe we should avoid merging in this way.
So #28370 was merged with all committer comments addressed. There were some minor concerns from @Ngone51, but they were all suitable for follow-up work. Is there a comment in there that I missed, though, which you believe needed to be addressed, @HyukjinKwon?
I haven't looked into the code closely yet - I will try to read and follow more closely. I just noticed that the discussions in these PRs are virtually all from you. My point is that:
It looks to me like we're rushing on these PRs, where actually we should be the most conservative.
That was my "LGTM, I'm going to merge this" comment, so yeah, that's sort of what I expect. If there was another engaged committer who had expressed interest here, of course I'd wait a bit for them to sign off as well.
I'm not sure I agree, but if you do, feel free to bring it up on the dev@ list and I can refactor the design doc into an SPIP-formatted one.
There are no plans to cut a release from master anytime soon, this isn't being backported to branch-3, we've had multiple eyes on the design doc from various committers, and it's disabled by default. The PR was open for multiple weeks (I've seen committers merge commits larger than this with the PR open for less than a day). I don't agree with you here, and if you still think I've been too hasty, let's have the discussion on dev@ or private@ as appropriate. (edit: formatting)
Also, the first PR I made here was back on Aug 24, 2017. There has been plenty of time.
This PR is based on an existing/previous PR: #19045
What changes were proposed in this pull request?
This change adds a decommissioning state that we can enter when the cloud provider/scheduler lets us know we aren't going to be removed immediately but instead will be removed soon. This concept fits nicely with K8s and also with spot instances on AWS / preemptible instances, all of which can give us notice that our host is going away. For now we simply stop scheduling jobs; in the future we could perform some kind of migration of data during scale-down, or at least stop accepting new blocks to cache.
There is a design document at https://docs.google.com/document/d/1xVO1b6KAwdUhjEJBolVPl9C6sLj7oOveErwDSYdT-pE/edit?usp=sharing
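As an illustration of the "stop scheduling jobs" half of this (the class and method names below are made up for the sketch; this is not the actual scheduler change in the PR), the driver-side bookkeeping can be as simple as tracking which executors have reported decommissioning and skipping them when building new task offers:

```scala
// Hypothetical driver-side tracker: executors that have reported
// decommissioning are excluded when new task offers are constructed.
class DecommissionTracker {
  private val decommissioning = scala.collection.mutable.Set.empty[String]

  // Called when the driver receives a decommission message for an executor.
  def markDecommissioning(executorId: String): Unit = synchronized {
    decommissioning += executorId
  }

  // Consulted before offering new tasks to an executor.
  def isSchedulable(executorId: String): Boolean = synchronized {
    !decommissioning.contains(executorId)
  }
}
```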
Why are the changes needed?
With the growing move to preemptible multi-tenancy, serverless environments, and spot instances, better handling of node scale-down is required.
Does this PR introduce any user-facing change?
There is no API change; however, an additional configuration flag is added to enable/disable this behaviour.
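For illustration only (the actual flag name is not spelled out in this description, so the key below is a placeholder), enabling the behaviour would look something like:

```scala
import org.apache.spark.SparkConf

// Placeholder key: substitute the real configuration name added by the PR.
val conf = new SparkConf().set("spark.decommission.enabled", "true")
```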
How was this patch tested?
New integration tests in the Spark K8s integration testing. Extension of the AppClientSuite to test decommissioning separately from K8s.