-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-14736][core] Deadlock in registering applications while the Master is in the RECOVERING mode #12506
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…ter is in the RECOVERING mode
|
You will need to make this thread safe - the applications are added/re-registered from separate threads, right ? |
|
Maybe we should correct the title just like the others (this is described in https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark). Also, the title looks truncated. |
|
@mridulm do you mean to say, the method |
|
No. |
|
@mridulm You are correct. But if we check the register application method, there are similar array buffers which are getting updated without being synchronized. That is why I omitted making the waitingAppsWhileRecovering synchronized. |
|
No, you are right - this is called only from the event loop - which should ensure thread safety. |
|
Great! can we get this PR merged then? Is there anything else I should do in order to get this merged? |
|
Hi @nirandaperera , I'm not too sure about the Master recovery process so I can't really comment on your code, but it would make a much stronger case for this PR if you could include a test that fails without this change. |
|
@BryanCutler I was looking for some unit tests which could simulate this scenario in the Master.scala class, but I couldn't find any. But i think I can reproduce this in an integration test environment. can you point me to the spark integration tests? |
|
ok to test |
|
Test build #58162 has finished for PR 12506 at commit
|
|
@andrewor14 did you review this? |
|
Looks reasonable to me but I don't know this code very well. Also @kayousterhout maybe or @squito |
|
@aarondav might be the right person to look at this -- looks like he wrote most of the original code around the "RECOVERING" state way back in 2013. |
|
unfortunately I'm not very knowledgeable here either. I agree that this change looks reasonable, but also wish there was a test case for it. I don't think there is any good integration test framework, and it seems there aren't any tests for recovery state now, so you'd have to build that out yourself. One possibility -- "local-cluster" mode is closely related to a standalone cluster -- maybe that could be used to create a test? |
|
@squito hey. Thanks for the heads up. let me check on that and try to come up with a test |
|
Are you still working on this? @nirandaperera |
What changes were proposed in this pull request?
this PR fixes the issue SPARK-14736 Deadlock in registering applications while the Master is in the RECOVERING mode.
Proposed solution is to keep the registering apps in a separate list when the Master is in the RECOVERING mode and once the recovery is complete, these apps will be registered back. Pls refer the JIRA for more information
How was this patch tested?
I have tested the patch manually