[SPARK-20628][CORE][K8S] Start to improve Spark decommissioning & preemption support #26440
Conversation
…the cloud provider/scheduler lets them know they aren't going to be removed immediately but instead will be removed soon. This concept fits nicely in K8s and also with spot instances on AWS / preemptible instances, all of which can give us notice that our host is going away. For now we simply stop scheduling jobs & caching blocks; in the future we could perform some kind of migration of data during scale-down.
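To make that concrete, here is a minimal, purely illustrative sketch of the executor-side state (the class and method names are invented for the example and are not the PR's actual code): once the decommission notice arrives, in-flight work continues but no new blocks are accepted for caching.

```scala
// Hypothetical sketch of an executor-side decommission flag; not the
// actual implementation from this PR.
class ExecutorDecommissionState {
  @volatile private var decommissioned = false

  // Called when the decommission notice arrives from the cloud
  // provider or the scheduler.
  def decommission(): Unit = { decommissioned = true }

  // Consulted before accepting a new block for local caching.
  def mayAcceptNewBlocks: Boolean = !decommissioned
}
```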
Test build #113474 has finished for PR 26440 at commit
Just took a quick skim through; I know this is WIP, so feel free to ignore comments on parts you just haven't implemented yet.
Looks like the JDK11 build is borked right now. I'll merge in master on Monday.
Test build #113477 has finished for PR 26440 at commit
GitHub Action failure is due to
Hi, @holdenk. Could you fix the failing UT? I'm also facing the UT failure locally on my Mac.
After fixing all UTs, let's trigger JDK11 testing, too.
Sure, I'll work on that this Monday.
OK, digging into it: looks like while I was updating the PR from 2017 I accidentally broke the message-receive code path, but I've just been doing integration on K8s with the new PR, hence why the UT is broken. This might take another day to resolve because the code path is a little convoluted and life is busy.
Test build #113637 has started for PR 26440 at commit
Test build #113654 has finished for PR 26440 at commit
Test build #113655 has finished for PR 26440 at commit
Test build #113658 has finished for PR 26440 at commit
Test build #113660 has finished for PR 26440 at commit
decommissioned = true
// Tell master we are decommissioned so it stops trying to schedule us
if (driver.nonEmpty) {
  driver.get.askSync[Boolean](DecommissionExecutor(executorId))
Instead of decommissioning a single executor, can we have entire-node decommission?
e.g. driver.get.askSync[Boolean](AddNodeToDecommission(hostname, terminationTime, NodeLossReason))
Same as the previous comment: in standalone only, sure, but in YARN/K8s we could see individual executors decommission.
So @tooptoop4, in its present state it could help; you'd call decommission instead of stop. But we'd probably want to see the last step before fully considering https://issues.apache.org/jira/browse/SPARK-30610 solved: right now it won't schedule any new jobs, but it won't exit & shut down automatically (you can use a timer and sort of approximate it, but it's not perfect).
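For the timer-based approximation mentioned above, a rough sketch might look like the following. This is illustrative only; the grace period and the hard `System.exit` are assumptions for the example, not what the PR implements.

```scala
import java.util.concurrent.{Executors, TimeUnit}

// Best-effort exit some time after decommissioning starts. This is the
// imperfect workaround described above: tasks still running when the
// timer fires will simply be cut off.
def scheduleExitAfterDecommission(gracePeriodSeconds: Long): Unit = {
  val timer = Executors.newSingleThreadScheduledExecutor()
  timer.schedule(new Runnable {
    override def run(): Unit = System.exit(0)
  }, gracePeriodSeconds, TimeUnit.SECONDS)
}
```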
@itskals so I'm not 100% sure what you want us to do in
It passes tests now; I'm going to do a read-through this week. If no one has any outstanding concerns, though, I'm planning on merging this on Friday (to master). We can continue the discussion about 3.0 on dev@.
Kubernetes integration test starting
Kubernetes integration test status success
Test build #118380 has finished for PR 26440 at commit
Hi, @holdenk. This PR seems to add a flaky test on
Do you have any idea about the root cause of the flakiness?
I am talking about decommissioning all executors in standalone. IIUC, all executors will shut down at once (because of WorkerWatcher). But I think it would work if SIGPWR is manually controlled.
I'm having difficulty understanding your concern, @Ngone51. So in standalone mode it's up to the user to write their decommissioning shell script and register it with the cloud provider or whatever mechanism is being used to notify the executors of decommissioning (or, if it's maintenance, to send the signal manually). All of the executors will not shut down because one executor receives a SIGPWR: WorkerWatcher just checks to make sure the RPC connection still works, and we don't shut down the RPC mechanism during decommissioning. Can you point out where my understanding doesn't match yours?
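As a hedged sketch of the signal wiring described here (the registration helper and `decommissionSelf` are placeholders for illustration, not the PR's actual API), the executor-side hookup might look roughly like this:

```scala
import sun.misc.{Signal, SignalHandler}

// Register a SIGPWR handler (on platforms that support it) that flips
// the executor into decommissioning instead of shutting it down; the
// RPC connection stays up, so WorkerWatcher does not kill the process.
def registerDecommissionSignal(decommissionSelf: () => Unit): Unit = {
  Signal.handle(new Signal("PWR"), new SignalHandler {
    override def handle(sig: Signal): Unit = decommissionSelf()
  })
}
```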
Thanks, @holdenk. It was me trying to understand the whole story. I can imagine how decommission performs in standalone now if the signal is somehow controlled manually. (At the beginning, I was thinking that if SIGPWR came from a hardware failure, then no one could escape and it would even be impossible to do decommission.)
@holdenk Did we get an LGTM from any other committer before merging this PR?
Some other committers reviewed it early on, and I brought it to the dev@ list and left it open after stating my intent to merge, in case any committer who had been involved with reviewing had blocking issues they wanted to raise.
I left a comment but removed it back here. I will comment on the JIRA to discuss in a single place and avoid having many branches of the same discussion; see SPARK-20624.
For this PR specifically, I think it should have been explicitly approved, as it affects all other components and it's pretty big. I think this is when we needed to call for more reviews and explicit approvals. For #28370 specifically, the review feedback was not fully addressed; it was just merged. Maybe we should avoid merging in this way.
So #28370 was merged with all committer comments addressed. There were some minor concerns from @Ngone51, but they were all suitable for follow-up work. Is there a comment in there that I missed, though, which you believe needed to be addressed, @HyukjinKwon?
I haven't looked into the code closely yet - I will try to read and follow more closely. I just noticed that the discussions in these PRs are virtually all from you. My point is that:
It looks to me like we're rushing on these PRs, where actually we should be the most conservative.
That was my "LGTM, I'm going to merge this" comment, so yeah, that's sort of what I expect. If there was another engaged committer who had expressed interest here, of course I'd wait a bit for them to sign off as well.
I'm not sure I agree, but if you do, feel free to bring it up on the dev@ list and I can refactor the design doc into an SPIP-formatted one.
There are no plans to cut a release from master anytime soon, this isn't being backported to branch-3, we've had multiple eyes on the design doc from various committers, and it's disabled by default. The PR was open for multiple weeks (I've seen committers merge commits larger than this with the PR open for less than a day). I don't agree with you here, and if you still think I've been too hasty, let's have the discussion on dev@ or private@ as appropriate. (edit: formatting)
Also, the first PR I made here was back on Aug 24, 2017. There has been plenty of time.
This PR is based on an existing/previous PR: #19045
What changes were proposed in this pull request?
This change adds a decommissioning state that we can enter when the cloud provider/scheduler lets us know we aren't going to be removed immediately but instead will be removed soon. This concept fits nicely with K8s and also with spot instances on AWS / preemptible instances, all of which can give us notice that our host is going away. For now we simply stop scheduling jobs; in the future we could perform some kind of migration of data during scale-down, or at least stop accepting new blocks to cache.
There is a design document at https://docs.google.com/document/d/1xVO1b6KAwdUhjEJBolVPl9C6sLj7oOveErwDSYdT-pE/edit?usp=sharing
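As an illustration of the "stop scheduling jobs" half of this (the class and method names below are made up for the sketch; this is not the actual scheduler change in the PR), the driver-side bookkeeping can be as simple as tracking which executors have reported decommissioning and skipping them when building new task offers:

```scala
// Hypothetical driver-side tracker: executors that have reported
// decommissioning are excluded when new task offers are constructed.
class DecommissionTracker {
  private val decommissioning = scala.collection.mutable.Set.empty[String]

  // Called when the driver receives a decommission message for an executor.
  def markDecommissioning(executorId: String): Unit = synchronized {
    decommissioning += executorId
  }

  // Consulted before offering new tasks to an executor.
  def isSchedulable(executorId: String): Boolean = synchronized {
    !decommissioning.contains(executorId)
  }
}
```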
Why are the changes needed?
With the growing move to preemptible multi-tenancy, serverless environments, and spot instances, better handling of node scale-down is required.
Does this PR introduce any user-facing change?
There is no API change; however, an additional configuration flag is added to enable/disable this behaviour.
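For illustration only (the actual flag name is not spelled out in this description, so the key below is a placeholder), enabling the behaviour would look something like:

```scala
import org.apache.spark.SparkConf

// Placeholder key: substitute the real configuration name added by the PR.
val conf = new SparkConf().set("spark.decommission.enabled", "true")
```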
How was this patch tested?
New integration tests in the Spark K8s integration testing. Extension of the AppClientSuite to test decommissioning separately from K8s.