[SPARK-12552][Core]Correctly count the driver resource when recovering from failure for Master #10506
Conversation
Test build #48410 has finished for PR 10506 at commit
Jenkins, retest this please.
Test build #48413 has finished for PR 10506 at commit
Can you add a unit test? You might have to mock the
Sure, will do.
Jenkins, retest this please.
Test build #48463 has finished for PR 10506 at commit
@andrewor14, would you please review this patch again? It has been pending here for a long time, and I think there actually is a bug here. Thanks a lot.
Test build #52223 has finished for PR 10506 at commit
Test build #52225 has finished for PR 10506 at commit
Thank you for your comment on PR #12054. I have a concern about changing the app state from WAITING to RUNNING in completeRecovery. Suppose some app is WAITING before the master failover; then all apps and all workers learn of the master change. If the last signal (WorkerSchedulerStateResponse or MasterChangeAcknowledged) comes from some worker, completeRecovery is invoked, which means the app mentioned above ends up in the RUNNING state. If the cluster doesn't have enough resources for all apps, that app may be in a wrong state for a while.
Is anyone still working on this, and if not, can you close the PR?
Hi @kayousterhout, I guess the issue still exists, but unfortunately there's no one reviewing this patch. I could rebase the code if someone could review it.
OK fine to leave this open then (I don't have the time or expertise to review this unfortunately)
Ping @zsxwing, hoping you're the right person to review this very old PR. The issue still exists in the latest master; could you please take a look? Thanks a lot.
Test build #73740 has finished for PR 10506 at commit
Test build #73742 has finished for PR 10506 at commit
I also don't feel like I know enough to review this, but if you're confident about the fix, I think you can go ahead. The change looks reasonable on its face.
Thanks @srowen, I think the fix is OK; at least it should be no worse than the previous code.
Could you rebase this? @jerryshao
Sure, I will bring this up to date.
Change-Id: Iee06c055b42757611731f2e0b9419d6adf68d665
Test build #77627 has finished for PR 10506 at commit
driver.worker = Some(worker)
driver.state = DriverState.RUNNING
worker.drivers(driverId) = driver
worker.addDriver(driver)
One major question (though I haven't tested this) -- won't we call schedule() after we complete recovery? I think we will handle the resource change correctly there.
From my understanding, schedule() will only handle waiting drivers, but here we are trying to count the existing drivers, so I don't think schedule() will solve the issue here. Let me try to test on the latest master and get back to you with the result.
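To make the distinction concrete, here is a minimal sketch of the worker-side accounting. The names are modeled after Spark's WorkerInfo but simplified and hypothetical, so treat it as an illustration rather than the exact upstream code: a bare map insert leaves the resource counters untouched, whereas an addDriver-style insert also charges the driver's cores and memory against the worker.

```scala
import scala.collection.mutable

// Hypothetical, simplified stand-ins for Spark's DriverInfo/WorkerInfo,
// used only to illustrate the accounting difference discussed above.
case class DriverInfoSketch(id: String, cores: Int, memory: Int)

class WorkerInfoSketch(val cores: Int, val memory: Int) {
  val drivers = mutable.HashMap.empty[String, DriverInfoSketch]
  var coresUsed = 0
  var memoryUsed = 0

  def coresFree: Int = cores - coresUsed

  // Bare map insert: the driver is tracked, but its resources still look free.
  def trackOnly(driver: DriverInfoSketch): Unit = {
    drivers(driver.id) = driver
  }

  // addDriver-style insert: the driver's resources are counted as used.
  def addDriver(driver: DriverInfoSketch): Unit = {
    drivers(driver.id) = driver
    coresUsed += driver.cores
    memoryUsed += driver.memory
  }
}
```

With trackOnly, a recovered driver's cores would still be handed out to new executors; with addDriver, coresFree reflects the driver's real footprint.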
apps.filter(_.state == ApplicationState.UNKNOWN).foreach(finishApplication)

// Update the state of recovered apps to RUNNING
apps.filter(_.state == ApplicationState.WAITING).foreach(_.state = ApplicationState.RUNNING)
This should also be done later in schedule().
I think this problem shouldn't happen in the general case; could you give a more specific description of your integrated cluster?
@jiangxb1987, to reproduce this issue, you can:
This is mainly because when the Master recovers a driver, it doesn't count the resources (cores/memory) used by that driver, so this part of the resources appears free and will be used to allocate a new executor. When the application finishes, the resources over-occupied by that new executor make the worker's resource counters go negative. Besides, in the current Master, the application state is changed to "RUNNING" only when a new executor is allocated, so a recovered application never gets the chance to move from "WAITING" to "RUNNING" because no new executor is allocated for it. Can you please give it a try? This issue does exist and has been reported in JIRA and on the mailing list several times.
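To make the over-allocation concrete, here is a hedged walk-through with made-up numbers (a 16-core worker and a 1-core driver), assuming the removal path symmetrically subtracts the driver's cores even though recovery never added them:

```scala
// Hypothetical bookkeeping on a 16-core worker, illustrating the asymmetry
// described above; the numbers are invented for illustration.
object NegativeCoresIllustration extends App {
  var coresUsed = 0    // after recovery: the 1-core driver is NOT counted (the bug)
  coresUsed += 16      // Master believes all 16 cores are free and grants a 16-core executor
  // ... the application runs and then finishes ...
  coresUsed -= 16      // executor cores released
  coresUsed -= 1       // driver removed: its core IS subtracted on the removal path
  println(coresUsed)   // -1, surfacing as a negative value in the Master UI
}
```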
@jerryshao Thank you for your effort, I'll try this tomorrow! |
I think the fix is right and the test case also looks good; we'd better merge this after adding some new test cases for the application running-state issue. @cloud-fan could you please have a look too?
BTW, @jerryshao it would be great if we could add a test framework to verify the states and statistics for Driver/Executor Lost/Join/Relaunch; is there any hope that you could invest some time in that?
@jiangxb1987 can you explain more about what you want?
Currently we don't cover the Driver/Executor Lost/Relaunch cases. We don't need to do these in the current PR, but it would be great if we can do that as a follow-up of this PR.
fakeWorkerInfo.coresFree should be(0)
fakeWorkerInfo.coresUsed should be(16)
// State of application should be RUNNING
fakeAppInfo.state should be(ApplicationState.RUNNING)
Shall we also test these before the recovery, to show that we do change something when recovering?
Done, thanks for review.
LGTM
Change-Id: I8eb01af5dc47cf57fcba459670704f481c3f8ac3
master.self.send(
  WorkerSchedulerStateResponse(fakeWorkerInfo.id, fakeExecutors, Seq(fakeDriverInfo.id)))

eventually(timeout(1 second), interval(10 milliseconds)) {
hmmm will this be flaky?
Because RPC send is asynchronous, if we check the app state immediately after send we may get the "UNKNOWN" state instead of "WAITING".
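For reference, a minimal sketch of that polling pattern with ScalaTest's Eventually. The fakeAppInfo and ApplicationState names come from the surrounding test; the fragment is a hedged illustration meant to sit inside the suite, not the exact test code.

```scala
import org.scalatest.concurrent.Eventually._
import org.scalatest.time.SpanSugar._

// Instead of asserting right after the asynchronous send, poll until the
// Master has processed the message: retry every 10 ms, fail after 1 second.
eventually(timeout(1.second), interval(10.milliseconds)) {
  assert(fakeAppInfo.state == ApplicationState.WAITING)
}
```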
// If driver's resource is also counted, free cores should be 0
fakeWorkerInfo.coresFree should be(0)
fakeWorkerInfo.coresUsed should be(16)
we can also test these 2 before recovering
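A possible shape for that pre-recovery check; the 16-core total is assumed from the fixture asserted after recovery, so adjust it to the actual test setup.

```scala
// Before recovery is triggered, nothing should be charged against the fake worker yet
// (hypothetical values based on the 16-core fixture used later in the test).
fakeWorkerInfo.coresFree should be(16)
fakeWorkerInfo.coresUsed should be(0)
```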
Change-Id: Ibe18dc34d629aca0bf2c1f405b8500ded9ce5b04
Test build #77976 has finished for PR 10506 at commit
Test build #77985 has finished for PR 10506 at commit
Jenkins, retest this please.
Test build #77990 has finished for PR 10506 at commit
[SPARK-12552][Core] Correctly count the driver resource when recovering from failure for Master

Currently in Standalone HA mode, the resource usage of driver is not correctly counted in Master when recovering from failure, this will lead to some unexpected behaviors like negative value in UI. So here fix this to also count the driver's resource usage. Also changing the recovered app's state to `RUNNING` when fully recovered. Previously it will always be WAITING even fully recovered. andrewor14 please help to review, thanks a lot.

Author: jerryshao <[email protected]>

Closes #10506 from jerryshao/SPARK-12552.

(cherry picked from commit 9eb0952)
Signed-off-by: Wenchen Fan <[email protected]>
Thanks, merging to master/2.2! The fix is only 2 lines so it should be safe to backport.
Currently in Standalone HA mode, the resource usage of the driver is not correctly counted in the Master when recovering from failure, which leads to unexpected behaviors such as negative values in the UI.

So this fixes the Master to also count the driver's resource usage.

It also changes the recovered app's state to `RUNNING` when it is fully recovered. Previously the app would always stay WAITING even when fully recovered.

@andrewor14 please help to review, thanks a lot.
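For quick reference, the substance of the merged change, as it appears in the diff snippets earlier in this thread (context lines omitted):

```scala
// Master recovery path: register the driver via addDriver so its cores/memory
// are charged to the worker, rather than only inserting it into the drivers map.
worker.addDriver(driver)

// completeRecovery(): move fully recovered apps out of WAITING.
apps.filter(_.state == ApplicationState.WAITING).foreach(_.state = ApplicationState.RUNNING)
```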