
Conversation

jerryshao
Contributor

@jerryshao jerryshao commented Dec 29, 2015

Currently in Standalone HA mode, the driver's resource usage is not correctly counted by the Master when recovering from failure, which leads to unexpected behaviors such as negative values in the UI.

So this fix makes the Master also count the driver's resource usage.

It also changes the recovered app's state to RUNNING when fully recovered. Previously the app would always stay WAITING even when fully recovered.

@andrewor14 please help review, thanks a lot.

@jerryshao jerryshao changed the title [SPARK-12552]Correctly count the driver resource when recover from failure for Master [SPARK-12552][Core]Correctly count the driver resource when recover from failure for Master Dec 29, 2015
@jerryshao jerryshao changed the title [SPARK-12552][Core]Correctly count the driver resource when recover from failure for Master [SPARK-12552][Core]Correctly count the driver resource when recovering from failure for Master Dec 29, 2015
@SparkQA

SparkQA commented Dec 29, 2015

Test build #48410 has finished for PR 10506 at commit 710f5de.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jerryshao
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Dec 29, 2015

Test build #48413 has finished for PR 10506 at commit 710f5de.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor

Can you add a unit test? You might have to mock the completeRecovery method

@jerryshao
Contributor Author

Sure, will do.

@jerryshao
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Dec 30, 2015

Test build #48463 has finished for PR 10506 at commit 3eb0b71.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jerryshao
Contributor Author

@andrewor14 , would you please review this patch again? It has been pending for a long time, and I think it is actually a bug. Thanks a lot.

@SparkQA

SparkQA commented Mar 1, 2016

Test build #52223 has finished for PR 10506 at commit a117dcd.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 1, 2016

Test build #52225 has finished for PR 10506 at commit 7cec07c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@GavinGavinNo1

Thank you for your comment on PR #12054. I have a concern about changing the app state from WAITING to RUNNING in completeRecovery. Suppose some app is WAITING before the master failover, and then all apps and workers learn of the master change. If the last signal (WorkerSchedulerStateResponse or MasterChangeAcknowledged) comes from a worker, completeRecovery is invoked, which means the app mentioned above is moved to the RUNNING state. If the cluster doesn't have enough resources for all apps, that app may be in the wrong state for a while.

@kayousterhout
Contributor

Is anyone still working on this and if not, can you close the PR?

@jerryshao
Contributor Author

Hi @kayousterhout , I guess the issue still exists, but unfortunately there's no one reviewing this patch. I could rebase the code if someone could review it.

@kayousterhout
Contributor

OK fine to leave this open then (I don't have the time or expertise to review this unfortunately)

@jerryshao
Contributor Author

Ping @zsxwing , hope you're the right person to review this very old PR. The issue still exists in the latest master; can you please take a look? Thanks a lot.

@SparkQA

SparkQA commented Mar 2, 2017

Test build #73740 has finished for PR 10506 at commit 88b58eb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 2, 2017

Test build #73742 has finished for PR 10506 at commit f231aed.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Mar 23, 2017

I also don't feel like I know enough to review this, but if you're confident about the fix, I think you can go ahead. The change looks reasonable on its face.

@jerryshao
Contributor Author

Thanks @srowen , I think the fix is OK; at least it should be no worse than the previous code.

@jiangxb1987
Contributor

Could you rebase this? @jerryshao

@jerryshao
Contributor Author

Sure, I will bring this up to date.

@SparkQA

SparkQA commented Jun 1, 2017

Test build #77627 has finished for PR 10506 at commit e2d6dbf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

driver.worker = Some(worker)
driver.state = DriverState.RUNNING
worker.drivers(driverId) = driver
worker.addDriver(driver)
Contributor

One major question (though I haven't tested this): won't we call schedule() after we complete recovery? I think we would handle the resource change correctly there.

Contributor Author

@jerryshao jerryshao Jun 9, 2017

From my understanding, schedule() only handles waiting drivers, but here we are trying to account for the existing drivers, so I don't think schedule() solves the issue. Let me test on the latest master and get back to you with the result.
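
For illustration, here is a paraphrased Scala sketch of the per-worker bookkeeping difference being discussed; the names (WorkerInfoSketch, DriverInfoSketch, recordDriverOnly) are invented for this sketch, and it is not the real WorkerInfo code.

case class DriverInfoSketch(id: String, cores: Int, memory: Int)

class WorkerInfoSketch(val cores: Int, val memory: Int) {
  val drivers = scala.collection.mutable.HashMap.empty[String, DriverInfoSketch]
  var coresUsed = 0
  var memoryUsed = 0
  def coresFree: Int = cores - coresUsed

  // Old recovery path: the driver is tracked but its resources are never charged,
  // so coresFree stays too high and the scheduler may hand those cores out again.
  def recordDriverOnly(d: DriverInfoSketch): Unit = drivers(d.id) = d

  // Fixed path: the driver's cores and memory are charged to the worker.
  def addDriver(d: DriverInfoSketch): Unit = {
    drivers(d.id) = d
    coresUsed += d.cores
    memoryUsed += d.memory
  }
}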

apps.filter(_.state == ApplicationState.UNKNOWN).foreach(finishApplication)

// Update the state of recovered apps to RUNNING
apps.filter(_.state == ApplicationState.WAITING).foreach(_.state = ApplicationState.RUNNING)
Contributor

This should also be done later in schedule().

@jiangxb1987
Contributor

I think this problem shouldn't happen in the general case; could you give a more specific description of your cluster setup?

@jerryshao
Contributor Author

@jiangxb1987 , to reproduce this issue, you can:

  1. Enable standalone HA, for example with "spark.deploy.recoveryMode FILESYSTEM" and "spark.deploy.recoveryDirectory recovery".
  2. Start a local standalone cluster (master and worker on the same machine).
  3. Submit a Spark application in standalone cluster mode, for example "./bin/spark-submit --master spark://NT00022.local:6066 --deploy-mode cluster --class org.apache.spark.examples.SparkPi examples/target/scala-2.11/jars/spark-examples_2.11-2.3.0-SNAPSHOT.jar 10000".
  4. While the application is running, stop the master process and restart it.
  5. Wait for the application to finish; you will see unexpected core/memory numbers in the Master UI.

(Screenshot: Master UI showing unexpected core/memory values, 2017-06-09)

This is mainly because when the Master recovers a driver, it doesn't count the resources (cores/memory) used by that driver, so those resources appear free and may be used to allocate a new executor. When the application finishes, releasing the over-allocated resources makes the worker's resource counters go negative.

Besides, in the current Master the application state is only changed to "RUNNING" when a new executor is allocated, so a recovered application never gets the chance to move from "WAITING" to "RUNNING", because no new executor is allocated for it.
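
To make the accounting above concrete, here is a hypothetical Scala walk-through on a single 8-core worker; the numbers and the object name are illustrative only, not taken from the reproduction or from the actual Master code.

object NegativeCoresSketch extends App {
  val totalCores = 8
  var coresUsed = 0

  // Recovery: the 7-core executor is re-registered and counted,
  // but the recovered 1-core driver is not (the bug).
  coresUsed += 7

  // The Master believes one core is free and may launch an extra 1-core executor,
  // even though all 8 cores are really occupied.
  coresUsed += 1

  // Application finishes: both executors and the driver are released.
  coresUsed -= 7  // original executor
  coresUsed -= 1  // extra executor
  coresUsed -= 1  // driver: its core was never added, so the counter goes negative
  println(s"coresUsed = $coresUsed, coresFree = ${totalCores - coresUsed}")  // -1 and 9
}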

Can you please give it a try? This issue does exist and has been reported on JIRA and the mailing list several times.

@jiangxb1987
Contributor

@jerryshao Thank you for your effort, I'll try this tomorrow!

@jiangxb1987
Contributor

jiangxb1987 commented Jun 12, 2017

I think the fix is right and the test case also looks good; we'd better merge this after adding some new test cases covering the application running-state issue. @cloud-fan could you please have a look too?

@jiangxb1987
Contributor

BTW, @jerryshao it would be great if we could add a test framework to verify the states and statistics under Driver/Executor Lost/Join/Relaunch conditions. Is there any chance you could invest some time in that?

@jerryshao
Contributor Author

It would be great if we could add a test framework to verify the states and statistics under Driver/Executor Lost/Join/Relaunch conditions

@jiangxb1987 can you explain a bit more about what you have in mind?

@jiangxb1987
Contributor

jiangxb1987 commented Jun 13, 2017

Currently we don't cover the Driver/Executor Lost/Relaunch cases in MasterSuite, and we have seen several issues related to relaunching drivers in standalone mode, so it would be great if we could write a test framework to verify that the Driver/Worker states and statistics (memory/cores etc.) meet our expectations on Worker Join/Lost/Rejoin, and fix any inconsistencies in follow-up PRs.

We don't need to do this in the current PR, but it would be great to do it as a follow-up.

fakeWorkerInfo.coresFree should be(0)
fakeWorkerInfo.coresUsed should be(16)
// State of application should be RUNNING
fakeAppInfo.state should be(ApplicationState.RUNNING)
Contributor

Shall we also test these before the recovery, to show that we do change something when recovering?

Contributor Author

Done, thanks for review.

@cloud-fan
Contributor

LGTM

master.self.send(
WorkerSchedulerStateResponse(fakeWorkerInfo.id, fakeExecutors, Seq(fakeDriverInfo.id)))

eventually(timeout(1 second), interval(10 milliseconds)) {
Contributor

hmmm will this be flaky?

Contributor Author

@jerryshao jerryshao Jun 13, 2017

Because the RPC send is asynchronous, if we check the app state immediately after sending we may get the "UNKNOWN" state instead of "WAITING".
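
For context, here is a minimal, self-contained sketch of the ScalaTest eventually pattern used in the test above; this is not the actual MasterSuite code, the object and variable names are invented, and the imports assume a reasonably recent ScalaTest.

import org.scalatest.Assertions._
import org.scalatest.concurrent.Eventually._
import org.scalatest.time.SpanSugar._

object EventuallySketch extends App {
  @volatile var state = "UNKNOWN"

  // Stand-in for the Master updating its state some time after an asynchronous RPC send.
  new Thread(() => { Thread.sleep(50); state = "WAITING" }).start()

  // Retries the block every 10 ms until it passes or the 1-second timeout expires,
  // instead of asserting once right after the send, which would be racy.
  eventually(timeout(1.second), interval(10.milliseconds)) {
    assert(state == "WAITING")
  }
}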


// If the driver's resources are also counted, free cores should be 0
fakeWorkerInfo.coresFree should be(0)
fakeWorkerInfo.coresUsed should be(16)
Contributor

we can also test these 2 before recovering

@SparkQA

SparkQA commented Jun 13, 2017

Test build #77976 has finished for PR 10506 at commit c62889a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 13, 2017

Test build #77985 has finished for PR 10506 at commit 0bb82bb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jerryshao
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Jun 13, 2017

Test build #77990 has finished for PR 10506 at commit 0bb82bb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Jun 14, 2017
[SPARK-12552][Core] Correctly count the driver resource when recovering from failure for Master

Currently in Standalone HA mode, the driver's resource usage is not correctly counted by the Master when recovering from failure, which leads to unexpected behaviors such as negative values in the UI.

So this fix makes the Master also count the driver's resource usage.

It also changes the recovered app's state to `RUNNING` when fully recovered. Previously the app would always stay WAITING even when fully recovered.

andrewor14 please help review, thanks a lot.

Author: jerryshao <[email protected]>

Closes #10506 from jerryshao/SPARK-12552.

(cherry picked from commit 9eb0952)
Signed-off-by: Wenchen Fan <[email protected]>
@cloud-fan
Contributor

thanks, merging to master/2.2! The fix is only 2 lines so should be safe to backport

@asfgit asfgit closed this in 9eb0952 Jun 14, 2017
dataknocker pushed a commit to dataknocker/spark that referenced this pull request Jun 16, 2017
[SPARK-12552][Core] Correctly count the driver resource when recovering from failure for Master

Currently in Standalone HA mode, the driver's resource usage is not correctly counted by the Master when recovering from failure, which leads to unexpected behaviors such as negative values in the UI.

So this fix makes the Master also count the driver's resource usage.

It also changes the recovered app's state to `RUNNING` when fully recovered. Previously the app would always stay WAITING even when fully recovered.

andrewor14 please help review, thanks a lot.

Author: jerryshao <[email protected]>

Closes apache#10506 from jerryshao/SPARK-12552.