
Conversation

@Shunkangz Shunkangz (Collaborator) commented Jul 7, 2025

Refactor the fetching request logic

Description

In the existing implementation, the request-fetching logic has grown steadily longer. This change refactors that logic into a dedicated ExecutorRequestQueue class, which is much easier to test and makes it simpler to integrate more advanced fetching policies.
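
To make the shape of the change concrete, here is a minimal, self-contained sketch of such a queue manager. The class and method names echo this PR, but every body below is an illustrative assumption, not the actual TensorRT-LLM implementation:

```python
import queue
import threading
from typing import Any, List, Optional


class ExecutorRequestQueue:
    """Illustrative stand-in for the real class in
    tensorrt_llm/_torch/pyexecutor/executor_request_queue.py."""

    def __init__(self) -> None:
        self._queue: queue.Queue = queue.Queue()
        self._lock = threading.Lock()
        self._next_id = 0

    def enqueue_request(self, request: Any, query: Optional[Any] = None) -> int:
        # Hand out ids under a lock so concurrent producers never collide.
        with self._lock:
            req_id = self._next_id
            self._next_id += 1
        self._queue.put((req_id, request, query))
        return req_id

    def fetch_new_requests(self, max_batch: int) -> List[Any]:
        # Drain up to max_batch items without blocking the executor loop.
        items: List[Any] = []
        while len(items) < max_batch:
            try:
                items.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return items
```

Keeping this logic behind one class boundary is what makes it straightforward to unit-test in isolation and to swap in smarter fetching policies later.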

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]

Launch build/test pipelines. All previously running jobs will be killed.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since a lack of user care and validation can break top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since a lack of user care and validation can break top of tree.

Summary by CodeRabbit

  • New Features

    • Introduced a dedicated request queue manager to handle request enqueueing, cancellation, shutdown, and distributed request balancing with performance metrics tracking.
    • Added support for attaching and broadcasting Python-only metadata across distributed pipeline stages.
  • Refactor

    • Centralized and streamlined request queue management by introducing a dedicated request queue manager, resulting in cleaner and more maintainable request handling.
    • Simplified the executor logic by delegating all queue operations, cancellation, and shutdown processes to the new request queue manager.
  • Chores

    • Removed redundant internal queue structures and related helper methods from the executor, reducing code complexity and improving maintainability.
    • Added comprehensive unit and integration tests covering request queue functionality, concurrency, validation, and performance metrics.

@Shunkangz Shunkangz changed the title Support routing request to specific attention DP rank feat: Support routing request to specific attention DP rank Jul 7, 2025
@Shunkangz Shunkangz marked this pull request as ready for review July 7, 2025 07:27
@Shunkangz Shunkangz requested a review from a team as a code owner July 7, 2025 07:27
@Shunkangz Shunkangz requested review from achartier and pcastonguay and removed request for achartier July 7, 2025 07:27
@Shunkangz Shunkangz self-assigned this Jul 7, 2025
@Shunkangz Shunkangz force-pushed the attention_dp_route branch from e727517 to 8df54fb Compare July 7, 2025 08:20
@Shunkangz Shunkangz force-pushed the attention_dp_route branch from 8df54fb to ce88c8f Compare July 14, 2025 06:56
@Shunkangz Shunkangz requested a review from QiJune July 14, 2025 07:28
@Shunkangz Shunkangz force-pushed the attention_dp_route branch from ce88c8f to 38708ce Compare July 14, 2025 07:29
@Shunkangz Shunkangz (Collaborator, Author) commented:

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator) commented:

PR_Github #11779 [ run ] triggered by Bot

@Shunkangz Shunkangz changed the title feat: Support routing request to specific attention DP rank feat: Refactor the fetching request logic Jul 14, 2025
@Shunkangz Shunkangz force-pushed the attention_dp_route branch from 38708ce to b498216 Compare July 14, 2025 08:54
@Shunkangz Shunkangz (Collaborator, Author) commented:

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator) commented:

PR_Github #11794 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator) commented:

PR_Github #11779 [ run ] completed with state ABORTED

@Shunkangz Shunkangz force-pushed the attention_dp_route branch from b498216 to 0d3f197 Compare July 14, 2025 09:10
@Shunkangz Shunkangz (Collaborator, Author) commented:

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator) commented:

PR_Github #11796 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator) commented:

PR_Github #11794 [ run ] completed with state ABORTED

@tensorrt-cicd (Collaborator) commented:

PR_Github #11796 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #8740 completed with status: 'FAILURE'

@Shunkangz Shunkangz force-pushed the attention_dp_route branch from 0d3f197 to 8cb76d1 Compare July 14, 2025 15:00
@Shunkangz Shunkangz (Collaborator, Author) commented:

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator) commented:

PR_Github #11824 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator) commented:

PR_Github #11824 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #8763 completed with status: 'FAILURE'

@pcastonguay pcastonguay requested a review from HuiGao-NV July 18, 2025 11:01
@tensorrt-cicd (Collaborator) commented:

PR_Github #12303 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #9139 completed with status: 'FAILURE'

@Shunkangz Shunkangz (Collaborator, Author) commented:

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator) commented:

PR_Github #12397 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator) commented:

PR_Github #12397 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #9216 completed with status: 'SUCCESS'

@QiJune QiJune (Collaborator) left a comment:

LGTM

@Shunkangz Shunkangz requested a review from pcastonguay July 21, 2025 05:51
Shunkang added 5 commits July 21, 2025 05:59
Signed-off-by: Shunkang <[email protected]>
Signed-off-by: Shunkang <[email protected]>
Signed-off-by: Shunkang <[email protected]>
Signed-off-by: Shunkang <[email protected]>
@Shunkangz Shunkangz force-pushed the attention_dp_route branch from 9469ede to f9060a9 Compare July 21, 2025 06:00
@Shunkangz Shunkangz (Collaborator, Author) commented:

/bot run --disable-fail-fast

@coderabbitai coderabbitai bot (Contributor) left a comment:

Actionable comments posted: 0

♻️ Duplicate comments (4)
tests/unittest/_torch/test_executor_request_queue.py (1)

103-117: Test documents a critical bug in the implementation

This test correctly verifies that when a query parameter is passed to enqueue_request, it should be stored in the query field of the RequestQueueItem. However, this test will currently fail due to a bug in the implementation where the query parameter is incorrectly passed as a positional argument.
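
For reference, the test in question presumably looks something like this sketch (the fixture names and the internal queue access are assumptions, not the file's actual contents):

```python
def test_enqueue_request_stores_query(req_queue, sample_request):
    query_tokens = [1, 2, 3]
    req_id = req_queue.enqueue_request(sample_request, query=query_tokens)

    item = req_queue.request_queue.get_nowait()  # hypothetical internal access
    assert item.id == req_id
    # Fails while the positional-argument bug described below is present:
    assert item.query == query_tokens
```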

tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (3)

144-162: Fix critical bug in enqueue_request method

The query parameter is incorrectly passed as a positional argument, which makes it the is_canceled_request parameter instead of query.
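
In sketch form, assuming the RequestQueueItem(id, request=None, is_canceled_request=False, query=None) shape described in this review:

```python
# Bug: the third positional slot is is_canceled_request, so `query`
# silently lands in the wrong field.
req_item = RequestQueueItem(req_id, request, query)

# Fix: bind the argument by keyword so it reaches the intended field.
req_item = RequestQueueItem(req_id, request, query=query)
```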


164-168: Use try/finally pattern for lock handling

The current implementation could leave the lock held if an exception occurs between acquire and release.
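
A sketch of the suggested pattern (attribute names are assumed for illustration):

```python
# Explicit try/finally: the lock is released even if put() raises.
self.enqueue_lock.acquire()
try:
    self.request_queue.put(item)
finally:
    self.enqueue_lock.release()

# Equivalent and more idiomatic: a `with` block does the same thing.
with self.enqueue_lock:
    self.request_queue.put(item)
```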


490-493: Replace print statement with proper logging and fix line length

Use the logging framework instead of print statements for consistency and better control over output.
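
A sketch of the change; tensorrt_llm.logger does expose a shared logger object, though the message text here is invented for illustration:

```python
from tensorrt_llm.logger import logger

# Before: an unconditional print() that also exceeded the 120-char limit.
# After: routed through the shared logging framework, filterable by level.
# (req_id and target_rank are placeholder names for this illustration.)
logger.debug(f"Attention DP: routed request {req_id} to rank {target_rank}")
```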

🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (1)

522-522: Fix line length violation

The assertion line exceeds the 120-character limit. Consider breaking it into multiple lines for better readability.

```diff
-        assert anchor_block_size <= block_size, f'cp_anchor_size {anchor_block_size} should be smaller than block_size {block_size}'
+        assert anchor_block_size <= block_size, (
+            f'cp_anchor_size {anchor_block_size} should be smaller than '
+            f'block_size {block_size}'
+        )
```
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9469ede and f9060a9.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (16 hunks)
  • tests/unittest/_torch/test_executor_request_queue.py (1 hunks)
🧰 Additional context used
🧠 Learnings (3)

The same learning applies to all three files under review (tests/unittest/_torch/test_executor_request_queue.py, tensorrt_llm/_torch/pyexecutor/executor_request_queue.py, and tensorrt_llm/_torch/pyexecutor/py_executor.py):

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/executor_request_queue.py

491-491: Line too long (133 > 120)

(E501)


522-522: Line too long (132 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (35)
tests/unittest/_torch/test_executor_request_queue.py (4)

1-11: LGTM: Well-structured test imports and setup

The imports are appropriate and the module correctly imports the target classes for testing.


14-31: LGTM: Comprehensive mock setup for distributed testing

The mock_dist fixture properly sets up all necessary distributed attributes for isolated testing.


319-347: Excellent thread safety test coverage

This test properly validates concurrent enqueue operations and ensures unique request ID generation without race conditions.
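
A self-contained sketch of this style of test, reusing the illustrative queue from the sketch near the top of this thread rather than the file's real fixtures:

```python
import threading


def test_concurrent_enqueue_ids_are_unique():
    q = ExecutorRequestQueue()  # illustrative class from the earlier sketch
    ids, ids_lock = [], threading.Lock()

    def worker():
        for _ in range(100):
            rid = q.enqueue_request(object())
            with ids_lock:
                ids.append(rid)

    threads = [threading.Thread(target=worker) for _ in range(8)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    assert len(ids) == 800
    assert len(ids) == len(set(ids))  # no duplicate ids under contention
```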


389-418: Comprehensive integration test workflow

The test effectively simulates a complete workflow from enqueueing requests through cancellation and validation, providing good end-to-end coverage.

tensorrt_llm/_torch/pyexecutor/executor_request_queue.py (8)

22-36: LGTM: Well-designed RequestQueueItem dataclass

The dataclass design is clean with appropriate default values and useful property methods for request type checking.
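
A plausible reconstruction of the dataclass (the fields named in this review are real; the defaults and the shutdown convention are assumptions):

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class RequestQueueItem:
    id: int
    request: Optional[Any] = None
    is_canceled_request: bool = False
    query: Optional[List[int]] = None

    @property
    def is_shutdown_request(self) -> bool:
        # Assumed convention: an empty, non-cancel item signals shutdown.
        return self.request is None and not self.is_canceled_request

    @property
    def is_normal_request(self) -> bool:
        return self.request is not None and not self.is_canceled_request
```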


38-67: LGTM: Comprehensive initialization with proper state management

The constructor properly initializes all necessary attributes for queue management, distributed coordination, and performance tracking.


113-126: LGTM: Proper lock handling in enqueue_requests

The method correctly uses try/finally to ensure the lock is always released, and properly tracks start times for performance metrics.


128-142: LGTM: Clean implementation of cancel and shutdown requests

Both methods properly handle locking and create appropriate RequestQueueItem instances for their respective operations.


208-215: LGTM: Clean routing logic for attention DP

The method properly delegates to the appropriate fetch method based on the attention DP configuration.
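
The routing described above reduces to a small dispatch, sketched here with assumed method names:

```python
def fetch_new_requests(self, total_num_active_requests: int):
    if self.enable_attention_dp:
        # Balance incoming requests across attention-DP ranks.
        return self._fetch_and_balance_across_dp_ranks(total_num_active_requests)
    # Single-rank path: plain FIFO fetch up to remaining capacity.
    return self._fetch_new_requests_single_rank(total_num_active_requests)
```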


293-315: LGTM: Robust request validation and filtering

The method properly handles shutdown signals, canceled requests, and beam width validation with clear error messages.


550-569: LGTM: Well-documented scheduler overlap logic

The method includes excellent documentation explaining the complex interaction between overlap scheduling and different sampler types.


571-601: LGTM: Comprehensive performance metrics interface

The getter methods provide clean access to internal queue state and performance metrics with appropriate return types.

tensorrt_llm/_torch/pyexecutor/py_executor.py (23)

23-23: LGTM! Proper import addition for SpecDecodingStats.

The import of SpecDecodingStats is correctly added to support the refactored functionality.


31-31: LGTM! ExecutorRequestQueue import is properly added.

The import statement correctly brings in the new ExecutorRequestQueue and RequestQueueItem classes for the refactored queue management.


35-35: LGTM! Import statement is properly formatted.

The import for LlmResponse is correctly maintained.


37-37: LGTM! Sampler imports are properly maintained.

The import statement correctly includes all necessary sampler-related classes.


220-232: LGTM! ExecutorRequestQueue initialization is comprehensive and well-structured.

The initialization properly delegates all queue management to the new ExecutorRequestQueue class with appropriate configuration parameters. The call to set_exclude_last_generation_logits correctly configures the queue based on scheduler and sampler settings.
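
Roughly, the wiring looks like the following sketch; the constructor and setter arguments are assumptions inferred from this review, not verified signatures:

```python
self.executor_request_queue = ExecutorRequestQueue(
    dist=self.dist,
    enable_attention_dp=self.enable_attention_dp,
    max_batch_size=max_batch_size,
    max_beam_width=max_beam_width,
)
self.executor_request_queue.set_exclude_last_generation_logits(
    self.disable_overlap_scheduler, self.sampler)
```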


295-295: LGTM! Proper delegation to ExecutorRequestQueue.

The enqueue_requests method correctly delegates to the new queue implementation.


328-328: LGTM! Cancel request properly delegated.

The cancel request functionality is correctly delegated to ExecutorRequestQueue.


334-334: LGTM! Shutdown request properly delegated.

The shutdown functionality is correctly delegated to ExecutorRequestQueue.


349-349: LGTM! Can enqueue check properly delegated.

The can_enqueue_requests method correctly delegates to the queue implementation.


387-388: LGTM! Single request enqueue properly delegated.

The enqueue_request method correctly delegates to ExecutorRequestQueue and maintains the same interface.


396-397: LGTM! Should stop processing logic properly updated.

The property correctly uses the new queue's get_waiting_queue_size() method to determine if processing should stop.


536-536: LGTM! Request queue access properly delegated.

The statistics collection correctly accesses the request queue through the new interface.


553-554: LGTM! Queue size retrieval properly delegated.

The iteration statistics correctly retrieve the queue size from the new ExecutorRequestQueue.


667-668: LGTM! Queue latency metrics properly delegated.

The performance statistics correctly retrieve queue latency metrics from the new queue implementation.


828-829: LGTM! Queue latency retrieval in executor loop.

The executor loop correctly retrieves queue latency metrics from ExecutorRequestQueue.


948-951: LGTM! Benchmark queue size check properly delegated.

The benchmark logic correctly uses the new queue's size method for monitoring.


972-973: LGTM! Queue latency in overlap executor loop.

The overlap executor loop correctly retrieves queue latency from the new implementation.


1107-1115: LGTM! Fetch new requests properly refactored.

The _fetch_new_requests method is correctly refactored to delegate all operations to ExecutorRequestQueue. The method properly:

  • Fetches new requests from the queue
  • Extends the active requests list
  • Updates shutdown state
  • Retrieves expected number of active requests

1405-1405: LGTM! Canceled request IDs check properly delegated.

The check for canceled request IDs correctly uses the new queue's method.


1408-1409: LGTM! Waiting queue update properly delegated.

The waiting queue update operation is correctly delegated to ExecutorRequestQueue.


1413-1413: LGTM! Canceled request ID lookup properly delegated.

The canceled request ID check correctly uses the new queue's method.


1423-1423: LGTM! Clear canceled request IDs properly delegated.

The operation to clear canceled request IDs is correctly delegated to ExecutorRequestQueue.


1515-1516: LGTM! Active requests update pattern is correct.

The pattern of clearing and extending the active requests list is appropriate and maintains the same functionality while being more explicit about the operation.

@Shunkangz Shunkangz enabled auto-merge (squash) July 21, 2025 06:02
@tensorrt-cicd (Collaborator) commented:

PR_Github #12417 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator) commented:

PR_Github #12417 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #9233 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@Shunkangz Shunkangz merged commit ee45e0c into NVIDIA:main Jul 22, 2025
3 checks passed
NVShreyas pushed a commit to NVShreyas/TensorRT-LLM that referenced this pull request Jul 28, 2025
Ransiki pushed a commit to Ransiki/TensorRT-LLM that referenced this pull request Jul 29, 2025