Fix(llm): Abort orphaned requests when llm.chat() batch fails (#27420)
Purpose
Fixes #26081
This PR addresses a bug where `llm.chat()` pollutes the scheduler queue if a batch of requests fails validation mid-way.

The Bug:
When a batch (e.g., 4 requests) is sent to `llm.chat()` and one request (#3) is invalid (too long), the engine enqueues the valid requests (#1, #2) and then throws an exception for #3. These enqueued requests (#1, #2) become "orphaned" and are never cleaned up.

The Consequence:
The next call to `llm.chat()` (e.g., in recursive retry logic) picks up these orphaned requests, leading to more outputs than inputs and breaking the API contract. A sketch of such a retry pattern follows.
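For illustration, a minimal sketch of a retry wrapper that trips over this bug. It is loosely modeled on the `infer()` helper from issue #26081 but is not its actual code, and `fits_in_context()` is an invented placeholder predicate:

```python
# Hypothetical retry wrapper in the spirit of the infer() helper from
# issue #26081 (not its actual code); fits_in_context() is an invented
# placeholder predicate, not a vLLM API.
def infer_with_retry(llm, conversations):
    try:
        return llm.chat(conversations)
    except ValueError:
        # Drop conversations that fail validation and retry the rest.
        valid = [c for c in conversations if fits_in_context(llm, c)]
        outputs = llm.chat(valid)
        # llm.chat() promises one output per input conversation. Before
        # this PR, orphans from the earlier failed call were drained here
        # too, making len(outputs) > len(valid).
        assert len(outputs) == len(valid)
        return outputs
```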
The Fix:
This PR makes the batch addition in `_validate_and_add_requests` transactional, as sketched below:
- The loop that enqueues requests is wrapped in a `try...except` block.
- The `request_id` of each successfully enqueued request from the current batch is tracked.
- On failure, the `except` block iterates through the tracked IDs and calls `self.llm_engine.abort_request()` on each "orphan" before re-raising the exception.
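A minimal sketch of this pattern (simplified, not the exact diff: the real method also threads sampling params, LoRA requests, etc., and `_add_request` here stands in for the actual enqueue helper):

```python
# Sketch of the transactional pattern described above. _add_request is a
# stand-in for the actual enqueue helper and is assumed to return the
# request_id it enqueued; abort_request matches the PR description.
def _validate_and_add_requests(self, prompts, params) -> None:
    added_request_ids: list[str] = []
    try:
        for prompt in prompts:
            # May raise, e.g. when the prompt exceeds the model's max length.
            request_id = self._add_request(prompt, params)
            added_request_ids.append(request_id)
    except Exception:
        # Roll back: abort every request this batch already enqueued, so
        # the next llm.chat() call starts from a clean scheduler queue.
        for request_id in added_request_ids:
            self.llm_engine.abort_request(request_id)
        raise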
Test Plan
- Manual reproduction: ran the `infer()` function from [Bug]: .chat() does not clean up in case of validation failure #26081 to demonstrate the bug (orphaned requests causing incorrect output counts) and to verify the fix (correct output counts). See results below.
- Automated regression test: added `test_chat_batch_failure_cleanup` to `tests/entrypoints/llm/test_chat.py`. This test (a) triggers a batch failure with an invalid prompt, (b) immediately sends a new, valid batch, and (c) asserts that the output count for the new batch is correct, confirming the queue was cleaned. A sketch of its shape follows.
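A hedged sketch of the shape of that test (the model choice, `max_model_len`, message construction, and the exact exception type are assumptions for this sketch, not the committed test code):

```python
import pytest
from vllm import LLM

def test_chat_batch_failure_cleanup():
    # Model and max_model_len are illustrative choices for this sketch.
    llm = LLM(model="facebook/opt-125m", max_model_len=256)
    good = [{"role": "user", "content": "Hello!"}]
    too_long = [{"role": "user", "content": "word " * 10_000}]

    # (a) A batch containing an over-length prompt fails validation.
    with pytest.raises(Exception):
        llm.chat([good, too_long, good])

    # (b) Immediately send a new, valid batch ...
    outputs = llm.chat([good, good])
    # (c) ... and assert one output per input: no orphans were drained.
    assert len(outputs) == 2
```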
Test Result
The fix is confirmed by both the manual reproduction script and the new automated test.
1. Manual Reproduction (Before vs. After)
Before Fix (on `main` branch): Bug Reproduced
The recursive call processed 5 real outputs instead of the expected 3, due to 2 orphaned requests.
After Fix (on this branch): Fix Verified
The recursive call now correctly processes 3 real outputs and 1 error, as the orphaned requests were successfully aborted.
2. Automated Regression Test (Pytest)
The new `pytest` case passes, ensuring this bug does not regress.

Command:
pytest tests/entrypoints/llm/test_chat.py -k test_chat_batch_failure_cleanup