[P/D] Handle Abort and Make Lifecycle Explicit #18096
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
njhill left a comment
@robertgshaw2-redhat it's possible that I'm mistaken in my comments.
vllm/v1/core/sched/scheduler.py
Outdated
# If unfreed kv req_ids free the blocks.
if req_id in self.unfreed_kv_req_ids:
    self.unfreed_kv_req_ids.remove(req_id)
    self._free_blocks(req_id)
Though this is the behaviour we want in the current NIXL P/D connector impl, I'm not sure it's what we'd want for generic async connector impls. In this state, the connector may be in the process of saving the KV cache to some external store, and we'd probably want that to complete regardless of whether an abort happens for a request that's already finished generating.
Although I guess we'd never get a "regular" abort from the API server in this case since the generating task would have already finished successfully.
IMO, if an abort happens, we need to make sure that we free the active memory associated with that request.
But in the offload case, it won't depend on a secondary request to pull the blocks; it will just run locally and return the finished event from the worker side of the connector as normal.
IIUC, we might need to let the connector know the request is being freed. I can imagine a case where the connector is still executing some previous requests and storing data into blocks that have already been freed and possibly reused by another request. (Not sure whether this can actually happen, though.)
LMK what you think of this API?
Actually NVM. Just realized there is a request_finished interface in the connector that will be called.
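For context, here is a minimal sketch of how a connector implementation might use that scheduler-side hook to keep blocks alive while an async save is still in flight. The class, the `_pending_saves` tracking set, and the exact signature below are illustrative paraphrases, not the actual base-class definition:

```python
from typing import Any, Optional


class MyAsyncConnector:
    """Hypothetical scheduler-side connector illustrating request_finished."""

    def __init__(self) -> None:
        # req_ids whose KV cache is still being saved/transferred asynchronously.
        self._pending_saves: set[str] = set()

    def request_finished(
        self, request: Any, block_ids: list[int]
    ) -> tuple[bool, Optional[dict[str, Any]]]:
        # Called by the scheduler when a request finishes (or is aborted).
        # Returning True asks the scheduler to delay freeing the request's
        # KV blocks until the worker side later reports the transfer done.
        delay_free = request.request_id in self._pending_saves
        return delay_free, None  # (delay_free_blocks, kv_transfer_params)
```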
This pull request has merge conflicts that must be resolved before it can be merged.
WoosukKwon left a comment
In my previous comment, I actually meant aligning (rather than decoupling) the request's and KV cache's lifecycles by delaying finishing the request. But I think this PR makes more sense.
Co-authored-by: Woosuk Kwon <[email protected]>
# Scheduler-side methods
# ==============================

def should_free_pending_on_abort(self) -> bool:
Discussed offline, but for more flexibility and proper multi-connector support, it would be good to add a req_id: str arg here.
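A minimal sketch of the suggested change, assuming the method stays on the scheduler-side connector API; the `_reqs_in_flight` set is hypothetical and only illustrates the intent:

```python
def should_free_pending_on_abort(self, req_id: str) -> bool:
    """Return True if the scheduler may free this request's KV blocks
    immediately on abort; return False if this connector still needs
    them (e.g. an async save/transfer for req_id is in flight).
    """
    # _reqs_in_flight is a hypothetical per-connector tracking set.
    return req_id not in self._reqs_in_flight
```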
@robertgshaw2-redhat we still might have an annoying edge case here for multi-connector, since we could also receive a finished event from the worker side of one connector for a request that we free here. So we might think two connectors were finished with the request in that case when really one is still transferring the blocks.
Since it's just counts that we propagate back from the worker side to the scheduler, we also don't have a way of keeping track of the completions on a per-connector basis.
to be finished in the worker.
"""
done_sending = self._get_new_notifs()
done_recving = self._pop_done_transfers(self._recving_transfers)
The current implementation of _pop_done_transfers may raise a RuntimeError. Is this exception being used to signal actual transmission failures (e.g., network errors, timeout errors)?
If so, should we add a failed_transfers return value to get_finished() to notify the scheduler? This would allow the scheduler to recover from transmission errors triggered in workers.
If this is a possible design direction, I can help implement a draft change.
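A rough sketch of one possible shape for that extension on the worker side. The three-set return value, the failed_transfers concept, and the `_check_transfer_state` helper are all hypothetical, not part of the current connector API:

```python
def get_finished(
    self, finished_req_ids: set[str]
) -> tuple[set[str], set[str], set[str]]:
    """Return (done_sending, done_recving, failed_transfers) req_ids.

    Instead of raising RuntimeError inside _pop_done_transfers, transfer
    failures (e.g. network or timeout errors) are surfaced to the
    scheduler so it can recover (retry, re-prefill, or abort cleanly).
    """
    done_sending = self._get_new_notifs()
    done_recving: set[str] = set()
    failed_transfers: set[str] = set()
    for req_id, handles in list(self._recving_transfers.items()):
        state = self._check_transfer_state(handles)  # hypothetical helper
        if state == "DONE":
            done_recving.add(req_id)
            del self._recving_transfers[req_id]
        elif state == "ERROR":
            failed_transfers.add(req_id)
            del self._recving_transfers[req_id]
    return done_sending, done_recving, failed_transfers
```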
SUMMARY: