Conversation

@robertgshaw2-redhat (Collaborator) commented May 13, 2025

SUMMARY:

  • further decouple block and request lifecycle
  • handle abort properly

@github-actions commented:

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

mergify bot added the v1 label May 13, 2025
@njhill (Member) left a comment:

@robertgshaw2-redhat possible that I'm mistaken in my comments

Comment on lines 844 to 847
# If the request still has unfreed KV blocks, free them now.
if req_id in self.unfreed_kv_req_ids:
    self.unfreed_kv_req_ids.remove(req_id)
    self._free_blocks(req_id)
Member commented:

Though this is the behaviour we want in the current nixl P/D connector impl, I'm not sure it's what we'd want for generic async connector impls. When in this state, the connector may be in the process of saving the KV cache to some external store, and we would probably want that to complete regardless of whether an abort happens for a request that's already finished generating.

Although I guess we'd never get a "regular" abort from the API server in this case since the generating task would have already finished successfully.

@robertgshaw2-redhat (Collaborator, Author) commented:

IMO, if an abort happens, we need to make sure that we free the active memory associated with that request.

Member commented:

But in the offload case, it won't depend on a secondary request to pull the blocks; it will just run locally and return the finished event from the worker side of the connector as normal.

Collaborator commented:

IIUC, we might need to let the connector know the request is being freed. I can imagine a case where the connector is still executing some previous requests and storing data into blocks that have already been freed and possibly reused by another request. (Not sure whether this can actually happen, though.)

@robertgshaw2-redhat (Collaborator, Author) commented:

LMK what you think of this API?

Collaborator commented:

Actually NVM. Just realized there is a request_finished interface in the connector that will be called.
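
For reference, a rough sketch of how the scheduler-side abort path could use that hook (illustrative only; the request_finished signature and the connector/_free_blocks usage here are assumptions, while unfreed_kv_req_ids mirrors the diff above):

    # Hedged sketch, not the actual vLLM implementation.
    def _handle_abort(self, request) -> None:
        # Ask the connector whether it still needs the blocks (e.g. an async
        # save, or a remote prefill pull that has not completed yet).
        delay_free, _ = self.connector.request_finished(request, request.block_ids)
        if delay_free:
            # Defer freeing; blocks are released once the worker side reports
            # the transfer as finished.
            self.unfreed_kv_req_ids.add(request.request_id)
        else:
            self._free_blocks(request.request_id)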

mergify bot commented May 14, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @robertgshaw2-redhat.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label May 14, 2025
mergify bot removed the needs-rebase label May 14, 2025
@WoosukKwon (Collaborator) left a comment:

In my previous comment, I actually meant aligning (rather than decoupling) the request and KV cache lifecycles by delaying finishing the request. But I think this PR makes more sense.

# Scheduler-side methods
# ==============================

def should_free_pending_on_abort(self) -> bool:
Member commented:

Discussed offline, but for more flexibility and proper multi-connector support, it would be good to add a req_id: str arg here.
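
Something like the following, as a hedged sketch of the suggested per-request signature (the tracking attribute is hypothetical):

    # Hypothetical sketch of the per-request variant suggested above.
    def should_free_pending_on_abort(self, req_id: str) -> bool:
        """Return True if the blocks for `req_id` can be freed immediately on
        abort, or False if an in-flight transfer still needs them."""
        return req_id not in self._reqs_with_pending_transfers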

Member commented:

@robertgshaw2-redhat we still might have an annoying edge case here for multi-connector, since we could still receive a finished event from the worker side of one connector for a request whose blocks we also free here. So we might think both connectors were finished with the request when really one is still transferring the blocks.

Since it's just counts that we propagate back from the worker side to the scheduler, we also don't have a way of keeping track of the completions on a per-connector basis.
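
To illustrate the concern with purely hypothetical code (not vLLM identifiers): if worker-side finished events are aggregated as bare per-request counts, the scheduler cannot tell which connector a given completion came from.

    from collections import Counter

    # Hypothetical illustration of count-based aggregation: each worker rank
    # reports finished request ids, and a request is treated as done once the
    # count reaches world_size, with no record of which connector reported it.
    finished_counts: Counter[str] = Counter()

    def record_finished(req_ids: set[str], world_size: int) -> set[str]:
        done: set[str] = set()
        for req_id in req_ids:
            finished_counts[req_id] += 1
            if finished_counts[req_id] == world_size:
                del finished_counts[req_id]
                done.add(req_id)
        return done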

@robertgshaw2-redhat changed the title from "[V1][P/D] Handle Abort and Make Lifecycle Explicit" to "[P/D] Handle Abort and Make Lifecycle Explicit" on May 14, 2025
to be finished in the worker.
"""
done_sending = self._get_new_notifs()
done_recving = self._pop_done_transfers(self._recving_transfers)
Contributor commented:

The current implementation of _pop_done_transfers may raise a RuntimeError. Is this exception being used to signal actual transmission failures (e.g., network errors, timeout errors)?

If so, should we add a failed_transfers return value to get_finished() to notify the scheduler? This would allow the scheduler to recover from transmission errors triggered in workers.

If this is a possible design direction, I can help implement a draft change.
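
For example, the extension could look roughly like this (a sketch only; the failed-transfer plumbing does not exist in the current connector, and it assumes _pop_done_transfers is changed to return (done, failed) sets instead of raising RuntimeError):

    # Hypothetical sketch of get_finished() also surfacing failed transfers.
    def get_finished(self) -> tuple[set[str], set[str], set[str]]:
        done_sending = self._get_new_notifs()
        done_recving, failed = self._pop_done_transfers(self._recving_transfers)
        # The scheduler could then retry or abort the requests in `failed`.
        return done_sending, done_recving, failed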
