
Conversation

@Shunkangz Shunkangz (Collaborator) commented Jul 7, 2025

Refactor the fetching request logic

Description

In the existing implementation, the request-fetching logic has grown steadily longer. This change refactors that logic into a dedicated ExecutorRequestQueue class, which is much easier to test and makes it simpler to integrate more advanced fetching policies.
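
To make the shape of the change concrete, here is a minimal, self-contained sketch of such a queue manager. The class and method names echo this PR, but every body below is an illustrative assumption, not the actual TensorRT-LLM implementation:

```python
import queue
import threading
from typing import Any, List, Optional


class ExecutorRequestQueue:
    """Illustrative stand-in for the real class in
    tensorrt_llm/_torch/pyexecutor/executor_request_queue.py."""

    def __init__(self) -> None:
        self._queue: queue.Queue = queue.Queue()
        self._lock = threading.Lock()
        self._next_id = 0

    def enqueue_request(self, request: Any, query: Optional[Any] = None) -> int:
        # Hand out ids under a lock so concurrent producers never collide.
        with self._lock:
            req_id = self._next_id
            self._next_id += 1
        self._queue.put((req_id, request, query))
        return req_id

    def fetch_new_requests(self, max_batch: int) -> List[Any]:
        # Drain up to max_batch items without blocking the executor loop.
        items: List[Any] = []
        while len(items) < max_batch:
            try:
                items.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return items
```

Keeping this logic behind one class boundary is what makes it straightforward to unit-test in isolation and to swap in smarter fetching policies later.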

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]

Launch build/test pipelines. All previously running jobs will be killed.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since a lack of user care and validation can break top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since a lack of user care and validation can break top of tree.

Summary by CodeRabbit

  • New Features

    • Introduced a dedicated request queue manager to handle request enqueueing, cancellation, shutdown, and distributed request balancing with performance metrics tracking.
    • Added support for attaching and broadcasting Python-only metadata across distributed pipeline stages.
  • Refactor

    • Centralized and streamlined request queue management by introducing a dedicated request queue manager, resulting in cleaner and more maintainable request handling.
    • Simplified the executor logic by delegating all queue operations, cancellation, and shutdown processes to the new request queue manager.
  • Chores

    • Removed redundant internal queue structures and related helper methods from the executor, reducing code complexity and improving maintainability.
    • Added comprehensive unit and integration tests covering request queue functionality, concurrency, validation, and performance metrics.

@Shunkangz Shunkangz changed the title Support routing request to specific attention DP rank feat: Support routing request to specific attention DP rank Jul 7, 2025
@Shunkangz Shunkangz marked this pull request as ready for review July 7, 2025 07:27
@Shunkangz Shunkangz requested a review from a team as a code owner July 7, 2025 07:27
@Shunkangz Shunkangz requested review from achartier and pcastonguay and removed request for achartier July 7, 2025 07:27
@Shunkangz Shunkangz self-assigned this Jul 7, 2025
@Shunkangz Shunkangz force-pushed the attention_dp_route branch from e727517 to 8df54fb Compare July 7, 2025 08:20
@Shunkangz Shunkangz force-pushed the attention_dp_route branch from 8df54fb to ce88c8f Compare July 14, 2025 06:56
@Shunkangz Shunkangz requested a review from QiJune July 14, 2025 07:28
@Shunkangz Shunkangz force-pushed the attention_dp_route branch from ce88c8f to 38708ce Compare July 14, 2025 07:29
@Shunkangz Shunkangz (Collaborator, Author) commented:

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator) commented:

PR_Github #11779 [ run ] triggered by Bot

@Shunkangz Shunkangz changed the title feat: Support routing request to specific attention DP rank feat: Refactor the fetching request logic Jul 14, 2025
@Shunkangz Shunkangz force-pushed the attention_dp_route branch from 38708ce to b498216 Compare July 14, 2025 08:54
@Shunkangz Shunkangz (Collaborator, Author) commented:

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator) commented:

PR_Github #11794 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator) commented:

PR_Github #11779 [ run ] completed with state ABORTED

@Shunkangz Shunkangz force-pushed the attention_dp_route branch from b498216 to 0d3f197 Compare July 14, 2025 09:10
@Shunkangz Shunkangz (Collaborator, Author) commented:

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator) commented:

PR_Github #11796 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator) commented:

PR_Github #11794 [ run ] completed with state ABORTED

@tensorrt-cicd (Collaborator) commented:

PR_Github #11796 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #8740 completed with status: 'FAILURE'

@Shunkangz Shunkangz force-pushed the attention_dp_route branch from 0d3f197 to 8cb76d1 Compare July 14, 2025 15:00
@Shunkangz Shunkangz (Collaborator, Author) commented:

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator) commented:

PR_Github #11824 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator) commented:

PR_Github #11824 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #8763 completed with status: 'FAILURE'

@pcastonguay pcastonguay requested a review from HuiGao-NV July 18, 2025 11:01
@tensorrt-cicd (Collaborator) commented:

PR_Github #12303 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #9139 completed with status: 'FAILURE'

@Shunkangz Shunkangz (Collaborator, Author) commented:

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator) commented:

PR_Github #12397 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator) commented:

PR_Github #12397 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #9216 completed with status: 'SUCCESS'

@QiJune QiJune (Collaborator) left a comment:

LGTM

@Shunkangz Shunkangz requested a review from pcastonguay July 21, 2025 05:51
Shunkang added 5 commits July 21, 2025 05:59
Signed-off-by: Shunkang <[email protected]>
Signed-off-by: Shunkang <[email protected]>
Signed-off-by: Shunkang <[email protected]>
Signed-off-by: Shunkang <[email protected]>
@Shunkangz Shunkangz force-pushed the attention_dp_route branch from 9469ede to f9060a9 Compare July 21, 2025 06:00
@Shunkangz Shunkangz (Collaborator, Author) commented:

/bot run --disable-fail-fast

@coderabbitai coderabbitai bot (Contributor) left a comment:

Actionable comments posted: 0

♻️ Duplicate comments (4)
tests/unittest/_torch/test_executor_request_queue.py (1)

103-117: Test documents a critical bug in the implementation

This test correctly verifies that when a query parameter is passed to enqueue_request, it should be stored in the query field of the RequestQueueItem. However, this test will currently fail due to a bug in the implementation where the query parameter is incorrectly passed as a positional argument.
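
For reference, the test in question presumably looks something like this sketch (the fixture names and the internal queue access are assumptions, not the file's actual contents):

```python
def test_enqueue_request_stores_query(req_queue, sample_request):
    query_tokens = [1, 2, 3]
    req_id = req_queue.enqueue_request(sample_request, query=query_tokens)

    item = req_queue.request_queue.get_nowait()  # hypothetical internal access
    assert item.id == req_id
    # Fails while the positional-argument bug described below is present:
    assert item.query == query_tokens
```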

tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (3)

144-162: Fix critical bug in enqueue_request method

The query parameter is incorrectly passed as a positional argument, which makes it the is_canceled_request parameter instead of query.
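
In sketch form, assuming the RequestQueueItem(id, request=None, is_canceled_request=False, query=None) shape described in this review:

```python
# Bug: the third positional slot is is_canceled_request, so `query`
# silently lands in the wrong field.
req_item = RequestQueueItem(req_id, request, query)

# Fix: bind the argument by keyword so it reaches the intended field.
req_item = RequestQueueItem(req_id, request, query=query)
```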


164-168: Use try/finally pattern for lock handling

The current implementation could leave the lock held if an exception occurs between acquire and release.
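
A sketch of the suggested pattern (attribute names are assumed for illustration):

```python
# Explicit try/finally: the lock is released even if put() raises.
self.enqueue_lock.acquire()
try:
    self.request_queue.put(item)
finally:
    self.enqueue_lock.release()

# Equivalent and more idiomatic: a `with` block does the same thing.
with self.enqueue_lock:
    self.request_queue.put(item)
```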


490-493: Replace print statement with proper logging and fix line length

Use the logging framework instead of print statements for consistency and better control over output.
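
A sketch of the change; tensorrt_llm.logger does expose a shared logger object, though the message text here is invented for illustration:

```python
from tensorrt_llm.logger import logger

# Before: an unconditional print() that also exceeded the 120-char limit.
# After: routed through the shared logging framework, filterable by level.
# (req_id and target_rank are placeholder names for this illustration.)
logger.debug(f"Attention DP: routed request {req_id} to rank {target_rank}")
```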

🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (1)

522-522: Fix line length violation

The assertion line exceeds the 120-character limit. Consider breaking it into multiple lines for better readability.

```diff
-        assert anchor_block_size <= block_size, f'cp_anchor_size {anchor_block_size} should be smaller than block_size {block_size}'
+        assert anchor_block_size <= block_size, (
+            f'cp_anchor_size {anchor_block_size} should be smaller than '
+            f'block_size {block_size}'
+        )
```
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9469ede and f9060a9.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (16 hunks)
  • tests/unittest/_torch/test_executor_request_queue.py (1 hunks)
🧰 Additional context used
🧠 Learnings (3)

The same learning applies to all three files under review (tests/unittest/_torch/test_executor_request_queue.py, tensorrt_llm/_torch/pyexecutor/executor_request_queue.py, and tensorrt_llm/_torch/pyexecutor/py_executor.py):

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py

491-491: Line too long (133 > 120)

(E501)


522-522: Line too long (132 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (35)
tests/unittest/_torch/test_executor_request_queue.py (4)

1-11: LGTM: Well-structured test imports and setup

The imports are appropriate and the module correctly imports the target classes for testing.


14-31: LGTM: Comprehensive mock setup for distributed testing

The mock_dist fixture properly sets up all necessary distributed attributes for isolated testing.


319-347: Excellent thread safety test coverage

This test properly validates concurrent enqueue operations and ensures unique request ID generation without race conditions.
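
A self-contained sketch of this style of test, reusing the illustrative queue from the sketch near the top of this thread rather than the file's real fixtures:

```python
import threading


def test_concurrent_enqueue_ids_are_unique():
    q = ExecutorRequestQueue()  # illustrative class from the earlier sketch
    ids, ids_lock = [], threading.Lock()

    def worker():
        for _ in range(100):
            rid = q.enqueue_request(object())
            with ids_lock:
                ids.append(rid)

    threads = [threading.Thread(target=worker) for _ in range(8)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    assert len(ids) == 800
    assert len(ids) == len(set(ids))  # no duplicate ids under contention
```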


389-418: Comprehensive integration test workflow

The test effectively simulates a complete workflow from enqueueing requests through cancellation and validation, providing good end-to-end coverage.

tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (8)

22-36: LGTM: Well-designed RequestQueueItem dataclass

The dataclass design is clean with appropriate default values and useful property methods for request type checking.
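
A plausible reconstruction of the dataclass (the fields named in this review are real; the defaults and the shutdown convention are assumptions):

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class RequestQueueItem:
    id: int
    request: Optional[Any] = None
    is_canceled_request: bool = False
    query: Optional[List[int]] = None

    @property
    def is_shutdown_request(self) -> bool:
        # Assumed convention: an empty, non-cancel item signals shutdown.
        return self.request is None and not self.is_canceled_request

    @property
    def is_normal_request(self) -> bool:
        return self.request is not None and not self.is_canceled_request
```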


38-67: LGTM: Comprehensive initialization with proper state management

The constructor properly initializes all necessary attributes for queue management, distributed coordination, and performance tracking.


113-126: LGTM: Proper lock handling in enqueue_requests

The method correctly uses try/finally to ensure the lock is always released, and properly tracks start times for performance metrics.


128-142: LGTM: Clean implementation of cancel and shutdown requests

Both methods properly handle locking and create appropriate RequestQueueItem instances for their respective operations.


208-215: LGTM: Clean routing logic for attention DP

The method properly delegates to the appropriate fetch method based on the attention DP configuration.
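
The routing described above reduces to a small dispatch, sketched here with assumed method names:

```python
def fetch_new_requests(self, total_num_active_requests: int):
    if self.enable_attention_dp:
        # Balance incoming requests across attention-DP ranks.
        return self._fetch_and_balance_across_dp_ranks(total_num_active_requests)
    # Single-rank path: plain FIFO fetch up to remaining capacity.
    return self._fetch_new_requests_single_rank(total_num_active_requests)
```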


293-315: LGTM: Robust request validation and filtering

The method properly handles shutdown signals, canceled requests, and beam width validation with clear error messages.


550-569: LGTM: Well-documented scheduler overlap logic

The method includes excellent documentation explaining the complex interaction between overlap scheduling and different sampler types.


571-601: LGTM: Comprehensive performance metrics interface

The getter methods provide clean access to internal queue state and performance metrics with appropriate return types.

tensorrt_llm/_torch/pyexecutor/py_executor.py (23)

23-23: LGTM! Proper import addition for SpecDecodingStats.

The import of SpecDecodingStats is correctly added to support the refactored functionality.


31-31: LGTM! ExecutorRequestQueue import is properly added.

The import statement correctly brings in the new ExecutorRequestQueue and RequestQueueItem classes for the refactored queue management.


35-35: LGTM! Import statement is properly formatted.

The import for LlmResponse is correctly maintained.


37-37: LGTM! Sampler imports are properly maintained.

The import statement correctly includes all necessary sampler-related classes.


220-232: LGTM! ExecutorRequestQueue initialization is comprehensive and well-structured.

The initialization properly delegates all queue management to the new ExecutorRequestQueue class with appropriate configuration parameters. The call to set_exclude_last_generation_logits correctly configures the queue based on scheduler and sampler settings.
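
Roughly, the wiring looks like the following sketch; the constructor and setter arguments are assumptions inferred from this review, not verified signatures:

```python
self.executor_request_queue = ExecutorRequestQueue(
    dist=self.dist,
    enable_attention_dp=self.enable_attention_dp,
    max_batch_size=max_batch_size,
    max_beam_width=max_beam_width,
)
self.executor_request_queue.set_exclude_last_generation_logits(
    self.disable_overlap_scheduler, self.sampler)
```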


295-295: LGTM! Proper delegation to ExecutorRequestQueue.

The enqueue_requests method correctly delegates to the new queue implementation.


328-328: LGTM! Cancel request properly delegated.

The cancel request functionality is correctly delegated to ExecutorRequestQueue.


334-334: LGTM! Shutdown request properly delegated.

The shutdown functionality is correctly delegated to ExecutorRequestQueue.


349-349: LGTM! Can enqueue check properly delegated.

The can_enqueue_requests method correctly delegates to the queue implementation.


387-388: LGTM! Single request enqueue properly delegated.

The enqueue_request method correctly delegates to ExecutorRequestQueue and maintains the same interface.


396-397: LGTM! Should stop processing logic properly updated.

The property correctly uses the new queue's get_waiting_queue_size() method to determine if processing should stop.


536-536: LGTM! Request queue access properly delegated.

The statistics collection correctly accesses the request queue through the new interface.


553-554: LGTM! Queue size retrieval properly delegated.

The iteration statistics correctly retrieve the queue size from the new ExecutorRequestQueue.


667-668: LGTM! Queue latency metrics properly delegated.

The performance statistics correctly retrieve queue latency metrics from the new queue implementation.


828-829: LGTM! Queue latency retrieval in executor loop.

The executor loop correctly retrieves queue latency metrics from ExecutorRequestQueue.


948-951: LGTM! Benchmark queue size check properly delegated.

The benchmark logic correctly uses the new queue's size method for monitoring.


972-973: LGTM! Queue latency in overlap executor loop.

The overlap executor loop correctly retrieves queue latency from the new implementation.


1107-1115: LGTM! Fetch new requests properly refactored.

The _fetch_new_requests method is correctly refactored to delegate all operations to ExecutorRequestQueue. The method properly:

  • Fetches new requests from the queue
  • Extends the active requests list
  • Updates shutdown state
  • Retrieves expected number of active requests

1405-1405: LGTM! Canceled request IDs check properly delegated.

The check for canceled request IDs correctly uses the new queue's method.


1408-1409: LGTM! Waiting queue update properly delegated.

The waiting queue update operation is correctly delegated to ExecutorRequestQueue.


1413-1413: LGTM! Canceled request ID lookup properly delegated.

The canceled request ID check correctly uses the new queue's method.


1423-1423: LGTM! Clear canceled request IDs properly delegated.

The operation to clear canceled request IDs is correctly delegated to ExecutorRequestQueue.


1515-1516: LGTM! Active requests update pattern is correct.

The pattern of clearing and extending the active requests list is appropriate and maintains the same functionality while being more explicit about the operation.

@Shunkangz Shunkangz enabled auto-merge (squash) July 21, 2025 06:02
@tensorrt-cicd (Collaborator) commented:

PR_Github #12417 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator) commented:

PR_Github #12417 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #9233 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@Shunkangz Shunkangz merged commit ee45e0c into NVIDIA:main Jul 22, 2025
3 checks passed
NVShreyas pushed a commit to NVShreyas/TensorRT-LLM that referenced this pull request Jul 28, 2025
Ransiki pushed a commit to Ransiki/TensorRT-LLM that referenced this pull request Jul 29, 2025