[SPARK-14736][core] Deadlock in registering applications while the Master is in the RECOVERING mode #12506

nirandaperera · 2016-04-19T21:28:19Z

What changes were proposed in this pull request?

This PR fixes SPARK-14736, a deadlock in registering applications while the Master is in the RECOVERING mode.
The proposed solution is to keep applications that try to register while the Master is in the RECOVERING mode in a separate list and, once recovery is complete, register them. Please refer to the JIRA for more information.
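
The idea described above can be sketched roughly as follows. This is only an illustration, not the patch itself: `MasterSketch`, `AppInfo`, `RecoveryState`, `register`, and `completeRecovery` are hypothetical names, and only `waitingAppsWhileRecovering` is taken from the discussion on this PR.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-ins for the Master's state; these are not Spark's classes.
object RecoveryState extends Enumeration {
  val Alive, Recovering = Value
}

final case class AppInfo(id: String)

class MasterSketch {
  var state: RecoveryState.Value = RecoveryState.Recovering
  val apps = ArrayBuffer.empty[AppInfo]
  // Buffer name from the review thread; how the patch uses it is assumed here.
  val waitingAppsWhileRecovering = ArrayBuffer.empty[AppInfo]

  // While recovering, park the application instead of dropping the request.
  def register(app: AppInfo): Unit = {
    if (state == RecoveryState.Recovering) waitingAppsWhileRecovering += app
    else apps += app
    ()
  }

  // Once recovery is complete, register every parked application.
  def completeRecovery(): Unit = {
    state = RecoveryState.Alive
    val pending = waitingAppsWhileRecovering.toList
    waitingAppsWhileRecovering.clear()
    pending.foreach(register)
  }
}
```

In this sketch an application that registers during recovery is held back and only shows up in `apps` after `completeRecovery()` runs.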

How was this patch tested?

I have tested the patch manually.

mridulm · 2016-04-19T21:52:49Z

You will need to make this thread safe - the applications are added/re-registered from separate threads, right?

HyukjinKwon · 2016-04-20T00:06:18Z

Maybe we should correct the title just like the others (this is described in https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark). Also, the title looks truncated.

nirandaperera · 2016-04-20T05:01:48Z

@mridulm do you mean to say that the method private def registerApplication(app: ApplicationInfo): Unit = { needs to be synchronized?

mridulm · 2016-04-20T18:28:05Z

No.
Updates to waitingAppsWhileRecovering happen from different threads, right? Hence you will need to protect it.

nirandaperera · 2016-04-21T10:04:56Z

@mridulm You are correct. But if we check the register application method,

  private def registerApplication(app: ApplicationInfo): Unit = {
    val appAddress = app.driver.address
    if (addressToApp.contains(appAddress)) {
      logInfo("Attempted to re-register application at same address: " + appAddress)
      return
    }
    applicationMetricsSystem.registerSource(app.appSource)
    apps += app
    idToApp(app.id) = app
    endpointToApp(app.driver) = app
    addressToApp(appAddress) = app
    waitingApps += app
  }

There are similar array buffers that are updated without being synchronized. That is why I did not synchronize waitingAppsWhileRecovering.
Am I doing anything wrong there?

mridulm · 2016-04-21T18:19:07Z

No, you are right - this is called only from the event loop, which should ensure thread safety.
I misread the re-registration as happening outside of the event loop.
Please ignore my comment.
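
To illustrate why confining updates to an event loop is enough, here is a small sketch. It assumes a plain JDK single-thread executor as a stand-in for the Master's event loop; `EventLoopSketch`, `post`, `snapshot`, and `stop` are hypothetical names and this is not Spark's RPC code.

```scala
import java.util.concurrent.{Callable, Executors}
import scala.collection.mutable.ArrayBuffer

// Stand-in event loop: callers on any thread post messages, but only the
// single loop thread ever touches the unsynchronized buffer.
class EventLoopSketch {
  private val loop = Executors.newSingleThreadExecutor()
  private val waitingApps = ArrayBuffer.empty[String]

  // Safe to call from any thread; the mutation itself runs on the loop thread.
  def post(appId: String): Unit =
    loop.execute(new Runnable {
      def run(): Unit = { waitingApps += appId; () }
    })

  // Reads also go through the loop, so they are ordered after earlier posts.
  def snapshot(): List[String] =
    loop.submit(new Callable[List[String]] {
      def call(): List[String] = waitingApps.toList
    }).get()

  def stop(): Unit = loop.shutdown()
}
```

Because the executor runs one task at a time, the buffer needs no lock even when many threads call `post` concurrently.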

nirandaperera · 2016-04-22T06:15:30Z

Great! Can we get this PR merged then? Is there anything else I should do in order to get this merged?

BryanCutler · 2016-04-22T21:54:45Z

Hi @nirandaperera, I'm not too sure about the Master recovery process so I can't really comment on your code, but it would make a much stronger case for this PR if you could include a test that fails without this change.

nirandaperera · 2016-04-25T04:04:23Z

@BryanCutler I was looking for unit tests that could simulate this scenario in the Master.scala class, but I couldn't find any. But I think I can reproduce this in an integration test environment. Can you point me to the Spark integration tests?

andrewor14 · 2016-05-09T18:54:55Z

ok to test

SparkQA · 2016-05-09T20:57:19Z

Test build #58162 has finished for PR 12506 at commit 17e7949.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2016-05-27T03:02:36Z

@andrewor14 did you review this?

srowen · 2016-06-12T08:01:01Z

Looks reasonable to me but I don't know this code very well. Also @kayousterhout maybe or @squito

kayousterhout · 2016-06-13T19:00:45Z

@aarondav might be the right person to look at this -- looks like he wrote most of the original code around the "RECOVERING" state way back in 2013.

squito · 2016-06-15T05:35:29Z

Unfortunately I'm not very knowledgeable here either. I agree that this change looks reasonable, but I also wish there were a test case for it. I don't think there is any good integration test framework, and it seems there aren't any tests for the recovery state now, so you'd have to build that out yourself. One possibility -- "local-cluster" mode is closely related to a standalone cluster -- maybe that could be used to create a test?

nirandaperera · 2016-06-16T08:05:48Z

@squito hey. Thanks for the heads up. Let me check on that and try to come up with a test.

jiangxb1987 · 2017-06-01T00:00:55Z

Are you still working on this? @nirandaperera

fixing SPARK-14736 Deadlock in registering applications while the Mas…

17e7949

…ter is in the RECOVERING mode

nirandaperera changed the title ~~fixing SPARK-14736 Deadlock in registering applications while the Mas…~~ [SPARK-14736][core] Deadlock in registering applications while the Master is in the RECOVERING mode Apr 20, 2016

HyukjinKwon mentioned this pull request Jun 7, 2017

[INFRA] Close stale PRs #18223

Closed

asfgit closed this in b771fed Jun 8, 2017
