-
Couldn't load subscription status.
- Fork 28.9k
[WIP][SPARK-20079][yarn] Re registration of AM hangs spark cluster in yarn-client mode. #17882
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #76523 has finished for PR 17882 at commit
|
|
Test build #76533 has finished for PR 17882 at commit
|
|
ping @witgo how it is going? |
|
@jerryshao @vanzin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest to change the first -1 to Utils#getDynamicAllocationInitialExecutors, and the second to be "0".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would you remove this method, AFAIK it is still necessary, otherwise who will reset the state of ExecutorAllocationManager?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the current code, we reset the state of ExecutorAllocationManager is not correct. Iy causes the RetrieveLastRequestExecutors message to not work properly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, then if ExecutorAllocationManager#reset is not required, then why would you still keep that method?
Also should we clean this two fields during AM restart in ExecutorAllocationManager?
executorsPendingToRemove.clear()
removeTimes.clear()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think ExecutorAllocationManager#reset is still necessary, but the following code should be removed
initializing = true
numExecutorsTarget = initialNumExecutors
numExecutorsToAdd = 1|
CC @mridulm , can you please review this PR? |
|
Could you remove |
|
@vanzin Done. |
|
Test build #77797 has finished for PR 17882 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why make a new call for this? What I had in mind would re-use RetrieveLastAllocatedExecutorId adding new properties that tell the AM what's the initial state. Might be good to change the name of the message at that point to (e.g. GetAMInitialState or something).
That call is already made during initialization of YarnAllocator, so it should be a smaller change overall.
So, you'd have one already existing RPC instead of a new RPC from the AM to the driver that may cause the AM to send another RPC to itself.
|
Test build #78055 has finished for PR 17882 at commit
|
|
Test build #78052 has finished for PR 17882 at commit
|
| localResources, | ||
| executorIdCounter) | ||
| if (requestExecutors.requestedTotal != allocator.getTargetNumExecutors) { | ||
| amEndpoint.send(requestExecutors) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't work; because you're using send, the handler would have to be in the receive method of the endpoint, and it's actually in receiveAndReply (which is the handler for ask).
You have to call requestTotalExecutorsWithPreferredLocalities directly here, or make that method take a RequestExecutors message as an argument, but you shouldn't go through the RPC system just to make a method call.
| private val currentState = new CurrentAMState(0, | ||
| RequestExecutors(Utils.getDynamicAllocationInitialExecutors(conf), 0, Map.empty, Set.empty)) | ||
|
|
||
| protected class CurrentAMState( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way this class is used isn't really helping. You could have the two fields as mutable fields in the parent instead of declaring a separate type.
I'd suggest either getting rid of this class, or turning it into a proper type to be sent as a reply to the "get" RPC, defined in CoarseGrainedClusterMessage.scala (meaning you'd keep state in fields here, and the message itself would be instantiated only when replying to the RPC, and would be immutable).
Also, either suggestion means you don't need the changes around setCurrentExecutorIdCounter. The old code was fine.
Finally, I know I suggested the name, but making the names of the classes more generic would be better (since that file is not YARN-specific). e.g. GetExecutorAllocatorState and ExecutorAllocatorState for the reply.
|
gentle ping @witgo for review comments above. |
|
Thank you @jerryshao. |
|
I'm very sorry, I haven't taken the time to update this PR recently. |
…art. The main goal of this change is to avoid the situation described in the bug, where an AM restart in the middle of a job may cause no new executors to be allocated because of faulty logic in the reset path. The change does two things: - fixes the executor alloc manager's reset() so that it does not stop allocation after a reset() in the middle of a job - re-orders the initialization of the YarnAllocator class so that it fetches the current executor ID before triggering the reset() above. This ensures both that the new allocator gets new requests for executors, and that it starts from the correct executor id. Tested with unit tests and by manually causing AM restarts while running jobs using spark-shell in YARN mode. Closes #17882 Author: Marcelo Vanzin <[email protected]> Author: Guoqiang Li <[email protected]> Closes #18663 from vanzin/SPARK-20079.
When AM re-registers, the changes are::
lastNumExecutors.lastNumExecutorsexecutor instance.ExecutorAllocationManager#reset.See #17480