Conversation

@njhill (Member) commented on Aug 3, 2024

There are currently a number of problems with how request cancellation works upon client disconnection in the OpenAI API server front-end and AsyncLLMEngine:

  • Abort logic is missing for streaming chat requests - these continue to run after the client disconnects
  • For multi-prompt completion requests, only one of the n running sequences actually gets aborted
  • Cancelled requests that are queued will still be pre-filled before getting aborted
  • Some of the existing async generator logic may be contributing to clean-shutdown garbage collection issues

This is a problem for production resilience and has a compounding effect when the server is overloaded and client requests time out, with the server continuing to do useless work.

This PR reworks how the cancellation is propagated to make it more robust and consistent:

  • Use native asyncio task/generator cancellation propagation as much as possible. This includes making use of explicitly cancellable async generators (closed via their aclose() method) rather than plain async iterators in most cases
  • Re-implement merge_async_iterators to encapsulate polling for disconnection, even before any results have been produced (i.e. while the request is still queued)
  • Add an equivalent iterate_with_cancellation function for the single-prompt request cases (a sketch of the intended shape follows this list)
  • In AsyncLLMEngine, differentiate between cancelled requests and those that finish normally (the generator returned from generate() will now raise CancelledError in the former case)
  • Don't abort requests that have already completed normally. This avoids some redundant work in the engine looking up sequence groups that are already gone
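
To make the intended shape concrete, here is a minimal hypothetical sketch of such a cancellation-aware iteration helper. It is not the actual vLLM implementation: the is_cancelled callback (modeled on something like Starlette's Request.is_disconnected) and the 1-second polling interval are illustrative assumptions.

```python
import asyncio
from typing import AsyncGenerator, Awaitable, Callable, TypeVar

T = TypeVar("T")


async def iterate_with_cancellation(
    generator: AsyncGenerator[T, None],
    is_cancelled: Callable[[], Awaitable[bool]],
) -> AsyncGenerator[T, None]:
    """Yield items from `generator`, polling `is_cancelled` while waiting.

    If the client disconnects, raise CancelledError and close the source
    generator so the cancellation propagates into the engine, even if no
    result has been produced yet (i.e. the request is still queued).
    """
    task = asyncio.ensure_future(generator.__anext__())
    try:
        while True:
            # Wait for the next item, waking periodically to poll for
            # client disconnection (the 1s interval is an arbitrary choice).
            done, _ = await asyncio.wait({task}, timeout=1.0)
            if await is_cancelled():
                raise asyncio.CancelledError("client disconnected")
            if not done:
                continue
            try:
                item = task.result()
            except StopAsyncIteration:
                return  # Finished normally; nothing to abort.
            yield item
            task = asyncio.ensure_future(generator.__anext__())
    finally:
        if not task.done():
            task.cancel()
            try:
                await task
            except (asyncio.CancelledError, StopAsyncIteration):
                pass
        # Explicitly close the source generator so that cancellation (or an
        # early consumer exit) propagates into it.
        await generator.aclose()
```

The same polling-while-waiting idea is what merge_async_iterators encapsulates for the multi-prompt case, with one pending fetch per sub-generator instead of a single one.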

I also plan to add some new tests to cover these various cases.

github-actions bot commented on Aug 3, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build in the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

@robertgshaw2-redhat (Collaborator) commented:

Nice

@njhill added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Aug 5, 2024
@DarkLight1337 (Member) left a comment:

LGTM, sorry for keeping you waiting.

@DarkLight1337 merged commit 9a3f49a into vllm-project:main on Aug 7, 2024
@njhill deleted the rework-aborts branch on Aug 7, 2024 at 13:07
njhill added a commit to njhill/vllm that referenced this pull request Aug 7, 2024
vllm-project#7111 made a change to the merge_async_iterators utility function to add an is_cancelled arg. It would be good for this new arg to be optional to retain backwards compatibility for other server front-ends that might already be using this function.
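A minimal sketch of what the optional argument could look like is below (apart from the merge_async_iterators and is_cancelled names taken from the PR, the signature, polling interval, and cleanup details are illustrative assumptions, not the actual vLLM code):

```python
import asyncio
from typing import AsyncGenerator, Awaitable, Callable, Optional, Tuple, TypeVar

T = TypeVar("T")


async def merge_async_iterators(
    *iterators: AsyncGenerator[T, None],
    is_cancelled: Optional[Callable[[], Awaitable[bool]]] = None,
) -> AsyncGenerator[Tuple[int, T], None]:
    """Merge async generators into one (source_index, item) stream.

    `is_cancelled` defaults to None so callers written against the old
    signature keep working; when provided, it is polled while waiting so a
    client disconnect cancels every merged generator.
    """
    # One pending "next item" task per source generator.
    tasks = {
        asyncio.ensure_future(it.__anext__()): (i, it)
        for i, it in enumerate(iterators)
    }
    try:
        while tasks:
            done, _ = await asyncio.wait(
                tasks, timeout=1.0, return_when=asyncio.FIRST_COMPLETED)
            if is_cancelled is not None and await is_cancelled():
                raise asyncio.CancelledError("client disconnected")
            for task in done:
                i, it = tasks.pop(task)
                try:
                    item = task.result()
                except StopAsyncIteration:
                    continue  # This source is exhausted.
                yield i, item
                tasks[asyncio.ensure_future(it.__anext__())] = (i, it)
    finally:
        # Cancel any still-pending fetches and close their generators so
        # that cancellation propagates into each underlying request.
        for task in tasks:
            task.cancel()
        for task, (_, it) in tasks.items():
            try:
                await task
            except (asyncio.CancelledError, StopAsyncIteration):
                pass
            await it.aclose()
```

Callers written against the previous signature continue to work unchanged since is_cancelled defaults to None; new callers can pass a disconnect check such as a Starlette request's is_disconnected method.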
njhill added a commit to njhill/vllm that referenced this pull request Aug 9, 2024
Follow-on from vllm-project#7111: avoid unnecessarily enqueuing a final message after an exception, and avoid aborting requests in the engine that were never started.
Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024
LeiWang1999 pushed a commit to LeiWang1999/vllm-bitblas that referenced this pull request Mar 26, 2025