Skip to content

Conversation

@nirandaperera
Copy link
Contributor

@nirandaperera nirandaperera commented Apr 19, 2016

What changes were proposed in this pull request?

this PR fixes the issue SPARK-14736 Deadlock in registering applications while the Master is in the RECOVERING mode.
Proposed solution is to keep the registering apps in a separate list when the Master is in the RECOVERING mode and once the recovery is complete, these apps will be registered back. Pls refer the JIRA for more information

How was this patch tested?

I have tested the patch manually

@mridulm
Copy link
Contributor

mridulm commented Apr 19, 2016

You will need to make this thread safe - the applications are added/re-registered from separate threads, right ?

@HyukjinKwon
Copy link
Member

HyukjinKwon commented Apr 20, 2016

Maybe we should correct the title just like the others (this is described in https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark). Also, the title looks truncated.

@nirandaperera nirandaperera changed the title fixing SPARK-14736 Deadlock in registering applications while the Mas… [SPARK-14736][core] Deadlock in registering applications while the Master is in the RECOVERING mode Apr 20, 2016
@nirandaperera
Copy link
Contributor Author

@mridulm do you mean to say, the method private def registerApplication(app: ApplicationInfo): Unit = { needs to synchronized?

@mridulm
Copy link
Contributor

mridulm commented Apr 20, 2016

No.
Updates to waitingAppsWhileRecovering happens from different threads right ? Hence you will need to protect it.

@nirandaperera
Copy link
Contributor Author

nirandaperera commented Apr 21, 2016

@mridulm You are correct. But if we check the register application method,

 private def registerApplication(app: ApplicationInfo): Unit = {
    val appAddress = app.driver.address
    if (addressToApp.contains(appAddress)) {
      logInfo("Attempted to re-register application at same address: " + appAddress)
      return
    }
    applicationMetricsSystem.registerSource(app.appSource)
    apps += app
    idToApp(app.id) = app
    endpointToApp(app.driver) = app
    addressToApp(appAddress) = app
    waitingApps += app
  }

there are similar array buffers which are getting updated without being synchronized. That is why I omitted making the waitingAppsWhileRecovering synchronized.
am I doing anything wrong there?

@mridulm
Copy link
Contributor

mridulm commented Apr 21, 2016

No, you are right - this is called only from the event loop - which should ensure thread safety.
I misread where the re-registeration was happening as outside of the event loop.
Please ignore my comment.

@nirandaperera
Copy link
Contributor Author

Great! can we get this PR merged then? Is there anything else I should do in order to get this merged?

@BryanCutler
Copy link
Member

Hi @nirandaperera , I'm not too sure about the Master recovery process so I can't really comment on your code, but it would make a much stronger case for this PR if you could include a test that fails without this change.

@nirandaperera
Copy link
Contributor Author

@BryanCutler I was looking for some unit tests which could simulate this scenario in the Master.scala class, but I couldn't find any. But i think I can reproduce this in an integration test environment. can you point me to the spark integration tests?

@andrewor14
Copy link
Contributor

ok to test

@SparkQA
Copy link

SparkQA commented May 9, 2016

Test build #58162 has finished for PR 12506 at commit 17e7949.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Copy link
Contributor

rxin commented May 27, 2016

@andrewor14 did you review this?

@srowen
Copy link
Member

srowen commented Jun 12, 2016

Looks reasonable to me but I don't know this code very well. Also @kayousterhout maybe or @squito

@kayousterhout
Copy link
Contributor

@aarondav might be the right person to look at this -- looks like he wrote most of the original code around the "RECOVERING" state way back in 2013.

@squito
Copy link
Contributor

squito commented Jun 15, 2016

unfortunately I'm not very knowledgeable here either. I agree that this change looks reasonable, but also wish there was a test case for it. I don't think there is any good integration test framework, and it seems there aren't any tests for recovery state now, so you'd have to build that out yourself. One possibility -- "local-cluster" mode is closely related to a standalone cluster -- maybe that could be used to create a test?

@nirandaperera
Copy link
Contributor Author

@squito hey. Thanks for the heads up. let me check on that and try to come up with a test

@jiangxb1987
Copy link
Contributor

Are you still working on this? @nirandaperera

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.