
Conversation

@reasonsolo reasonsolo (Collaborator) commented Jul 28, 2025

Summary by CodeRabbit

  • New Features

    • Added new integration test configurations and YAML files to support and validate disaggregated serving with various combinations of tensor and pipeline parallelism for context and generation servers.
    • Introduced a utility function for mapping dataset names to accuracy tasks in integration tests.
  • Bug Fixes

    • Improved error handling in the build script when removing files during cleanup.
  • Enhancements

    • Expanded integration tests to cover more parallelism scenarios, including symmetric and asymmetric configurations.
    • Improved logging and diagnostics for cache state mismatches and KV cache management.
    • Enhanced resource management and request handling in pipeline parallel executor logic.
  • Tests

    • Added new and parameterized test cases for disaggregated serving with different parallelism setups.
    • Included device availability checks to ensure tests only run when sufficient hardware is present.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Only [pytorch, cpp, tensorrt, triton] are supported. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
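
For example, a few typical invocations combining the flags above (the stage names and pipeline ID are illustrative):

    /bot run --stage-list "A10-PyTorch-1" --disable-fail-fast
    /bot run --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp"
    /bot run --reuse-test 13201 --extra-stage "H100_PCIe-TensorRT-Post-Merge-1"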

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause top of tree to break.

@reasonsolo reasonsolo requested a review from a team as a code owner July 28, 2025 08:52
@reasonsolo reasonsolo requested a review from dongxuy04 July 28, 2025 08:52
@coderabbitai coderabbitai bot commented Jul 28, 2025

📝 Walkthrough

This update introduces new and extended integration tests for disaggregated serving, with explicit support for asymmetric tensor and pipeline parallelism in both context and generation servers. It adds supporting configuration files, enhances the resource manager and executor logic for KV cache handling, and improves logging and error handling in both Python and C++ components.

Changes

  • Disaggregated Integration Test Configs (tests/integration/defs/disaggregated/test_configs/disagg_config_ctxpp2_genpp2.yaml, .../disagg_config_ctxtp2_genpp2.yaml, .../disagg_config_ctxpp2_gentp2.yaml, .../disagg_config_ctxtp2pp2_gentp2pp2.yaml, .../disagg_config_ctxpp4_genpp4.yaml): Added five new YAML configuration files for integration tests, specifying various asymmetric/symmetric tensor and pipeline parallelism setups for context and generation servers.
  • Disaggregated Integration Test Functions (tests/integration/defs/disaggregated/test_disaggregated.py): Added six new test functions and corresponding test config mappings to support the new disaggregated configurations, each requiring at least 4 devices.
  • Disaggregated Accuracy Test Extensions (tests/integration/defs/accuracy/test_disaggregated_serving.py): Enhanced tests to explicitly handle separate tensor/pipeline parallelism for context/generation servers, added a helper for parallel tests, and added new parameterized test cases for symmetric/asymmetric setups.
  • Accuracy Task Helper (tests/integration/defs/accuracy/accuracy_core.py): Added a get_accuracy_task function to map dataset names to AccuracyTask subclasses, with error handling.
  • KV Cache Manager Enhancements (tensorrt_llm/_torch/pyexecutor/resource_manager.py): Added a total_num_kv_heads_per_layer attribute to KVCacheManager for tracking KV heads per layer globally; refactored the head-count initialization logic.
  • KV Cache Transceiver Update (tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py): Updated BindKvCacheTransceiver to use total_num_kv_heads_per_layer instead of num_kv_heads_per_layer when initializing the C++ transceiver.
  • Pipeline Parallel Executor Loop Overhaul (tensorrt_llm/_torch/pyexecutor/py_executor.py): Integrated the KV cache transceiver into the pipeline parallel executor loop; added new state fields, resource management, request sorting, and context/generation cache transfer logic.
  • Cache Formatter Logging (cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp): Added a detailed debug log for mismatched layer counts in CacheFormatter::inquireSupport.
  • Build Script Error Handling (scripts/build_wheel.py): Added error handling for file removal failures in clear_folder.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant TestRunner
    participant ContextServer
    participant GenerationServer

    User->>TestRunner: Launch integration test with config (asymmetric/symmetric TP/PP)
    TestRunner->>ContextServer: Start with specified TP/PP, KV cache config
    TestRunner->>GenerationServer: Start with specified TP/PP, KV cache config
    ContextServer->>GenerationServer: Transmit KV cache (via transceiver)
    GenerationServer->>ContextServer: Acknowledge KV cache receipt
    TestRunner->>ContextServer: Send context requests
    ContextServer->>GenerationServer: Forward prepared requests
    GenerationServer->>TestRunner: Return generation results
sequenceDiagram
    participant PyExecutor
    participant KVCacheTransceiver
    participant BatchStatePP

    PyExecutor->>BatchStatePP: Initialize with scheduled_ctx_reqs
    PyExecutor->>KVCacheTransceiver: Check transfer status
    PyExecutor->>BatchStatePP: Prepare resources for fitting requests
    PyExecutor->>BatchStatePP: Sort and handle generation requests
    PyExecutor->>KVCacheTransceiver: Send context cache if needed
    PyExecutor->>KVCacheTransceiver: Terminate finished context requests

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • pcastonguay
  • Shixiaowei02
  • nv-guomingz
  • litaotju


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 7

🧹 Nitpick comments (7)
tests/integration/defs/disaggregated/test_configs/disagg_config_ctxpp2_genpp2.yaml (1)

6-6: Avoid duplicating free_gpu_memory_fraction at multiple hierarchy levels

The same value is declared once at the root (line 6) and again inside each kv_cache_config block (lines 15-16). Keeping a single source of truth prevents silent drift when values are tuned later.

- free_gpu_memory_fraction: 0.2

(or drop the inner fields if the code really consumes only the nested value)

Also applies to: 15-16

tests/integration/defs/disaggregated/test_configs/disagg_config_ctxpp2_gentp2.yaml (1)

6-6: Consolidate free_gpu_memory_fraction to one location

As in the previous config, the value is declared twice. Remove the duplicate to avoid confusion during future tuning.

Also applies to: 15-16

tests/integration/defs/disaggregated/test_configs/disagg_config_ctxpp4_genpp4.yaml (1)

6-6: Single-source the GPU–memory reserve

Duplicated free_gpu_memory_fraction (root vs. kv_cache_config) increases maintenance overhead. Prefer defining it only once.

Also applies to: 15-16

tests/integration/defs/disaggregated/test_configs/disagg_config_ctxtp2_genpp2.yaml (1)

6-6: Deduplicate free_gpu_memory_fraction

Same remark as for the other YAML configs: keep the setting in one place to avoid accidental divergence.

Also applies to: 15-16

tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

681-684: Fix line length issue.

The warning message exceeds the 120-character limit. Consider breaking it into multiple lines.

-                        logger.warning(
-                            "num_fitting_reqs=0 and fitting_disagg_gen_init_requests is empty, may not have enough kvCache"
-                        )
+                        logger.warning(
+                            "num_fitting_reqs=0 and fitting_disagg_gen_init_requests is empty, "
+                            "may not have enough kvCache"
+                        )
tests/integration/defs/accuracy/test_disaggregated_serving.py (2)

77-81: Consider using logging instead of print for the warning message.

For better integration with test frameworks and log management, consider using Python's logging module instead of print statements.

+import logging
+
+logger = logging.getLogger(__name__)
+
 if tensor_parallel_size > 1:
-    print(
-        f"Using unified tp parameter for testing is not recommended. Please use server configs instead."
-    )
+    logger.warning(
+        "Using unified tp parameter for testing is not recommended. Please use server configs instead."
+    )

337-377: Test method has inconsistent tensor parallelism configuration.

The test method test_ctxpp2_genpp2 sets only pipeline parallelism but doesn't explicitly set tensor parallelism, which defaults to 1. This might cause confusion given the method name suggests both parameters should be 2.

 ctx_server_config = {
+    "tensor_parallel_size": 1,
     "pipeline_parallel_size": 2,
     "disable_overlap_scheduler": True,
     "kv_cache_config": kv_cache_config,
     "cache_transceiver_config": {
         "backend": "default"
     }
 }
 gen_server_config = {
+    "tensor_parallel_size": 1,
     "pipeline_parallel_size": 2,
     "disable_overlap_scheduler": True,
     "kv_cache_config": kv_cache_config,
     "cache_transceiver_config": {
         "backend": "default"
     }
 }
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 413a83f and 4e78593.

📒 Files selected for processing (13)
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (1 hunks)
  • docker/Makefile (0 hunks)
  • tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (9 hunks)
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py (1 hunks)
  • tests/integration/defs/accuracy/accuracy_core.py (1 hunks)
  • tests/integration/defs/accuracy/test_disaggregated_serving.py (4 hunks)
  • tests/integration/defs/disaggregated/test_configs/disagg_config_ctxpp2_genpp2.yaml (1 hunks)
  • tests/integration/defs/disaggregated/test_configs/disagg_config_ctxpp2_gentp2.yaml (1 hunks)
  • tests/integration/defs/disaggregated/test_configs/disagg_config_ctxpp4_genpp4.yaml (1 hunks)
  • tests/integration/defs/disaggregated/test_configs/disagg_config_ctxtp2_genpp2.yaml (1 hunks)
  • tests/integration/defs/disaggregated/test_configs/disagg_config_ctxtp2pp2_gentp2pp2.yaml (1 hunks)
  • tests/integration/defs/disaggregated/test_disaggregated.py (2 hunks)
💤 Files with no reviewable changes (1)
  • docker/Makefile
🧰 Additional context used
🧠 Learnings (2)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

🧬 Code Graph Analysis (1)
tests/integration/defs/disaggregated/test_disaggregated.py (1)
tests/integration/defs/conftest.py (4)
  • disaggregated_test_root (2335-2340)
  • llm_venv (707-723)
  • disaggregated_example_root (270-275)
  • llama_model_root (964-1039)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/py_executor.py

682-682: Line too long (123 > 120)

(E501)

🔇 Additional comments (13)
tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py (1)

99-105: LGTM! Clean update to use comprehensive KV head tracking.

The change from num_kv_heads_per_layer to total_num_kv_heads_per_layer correctly aligns with the enhanced KV cache manager that now tracks KV heads across all layers rather than just local ones. This is essential for disaggregated serving scenarios where the transceiver needs complete visibility into the KV head layout.

tests/integration/defs/accuracy/accuracy_core.py (1)

740-748: Well-implemented dynamic task retrieval function.

The get_accuracy_task function provides a clean programmatic interface for retrieving AccuracyTask subclasses. The implementation includes proper error handling for both missing datasets and invalid class types, with clear and informative error messages.
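
As a rough illustration, such a lookup needs little more than a dictionary access plus a type check. The sketch below is not the actual implementation; the DATASET_TO_TASK registry name and the error wording are assumptions:

    def get_accuracy_task(dataset_name: str) -> type:
        # DATASET_TO_TASK is an assumed module-level registry mapping
        # dataset names to AccuracyTask subclasses.
        try:
            task_cls = DATASET_TO_TASK[dataset_name]
        except KeyError:
            raise ValueError(f"Unknown dataset: {dataset_name}") from None
        if not (isinstance(task_cls, type) and issubclass(task_cls, AccuracyTask)):
            raise TypeError(f"{dataset_name} maps to {task_cls!r}, "
                            f"which is not an AccuracyTask subclass")
        return task_cls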

tests/integration/defs/disaggregated/test_configs/disagg_config_ctxtp2pp2_gentp2pp2.yaml (1)

1-37: Well-structured configuration for disaggregated serving tests.

The YAML configuration appropriately defines a 2x2 tensor and pipeline parallelism setup for both context and generation servers. The conservative memory settings, disabled optimizations, and separate port allocations are suitable for integration testing environments.

tensorrt_llm/_torch/pyexecutor/resource_manager.py (3)

158-161: Good addition of comprehensive KV head tracking.

The new total_num_kv_heads_per_layer attribute correctly extends the existing logic to track KV heads across all layers rather than just local ones. This is essential for disaggregated serving scenarios where the KV cache transceiver needs complete visibility into the model structure.


165-171: Excellent refactoring with the helper function.

The append_to_kv_heads_per_layer helper function eliminates code duplication and provides a clean, reusable way to handle KV head calculations with tensor parallelism. The function properly handles edge cases where kv_head might be None.


173-184: Well-implemented usage of the helper function.

The code correctly uses the new helper function to populate both local and total KV head lists. The distinction between iterating over self.pp_layers for local heads and range(self.num_layers) for total heads is appropriate and maintains the semantic difference between local and comprehensive tracking.
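
To make the local-versus-total distinction concrete, here is a simplified sketch of how such a helper might be used. The rounding-up division, the handling of None, and the kv_heads list are assumptions inferred from the description above, not the exact code:

    def append_to_kv_heads_per_layer(dest: list, kv_head, tp_size: int):
        # Layers without attention may report kv_head=None; record 0 heads
        # for them (placeholder behavior, assumed here).
        if kv_head is None:
            dest.append(0)
        else:
            # Split the layer's KV heads across tensor-parallel ranks.
            dest.append((kv_head + tp_size - 1) // tp_size)

    # Local view: only the layers owned by this pipeline-parallel rank.
    for layer_idx in self.pp_layers:
        append_to_kv_heads_per_layer(self.num_kv_heads_per_layer,
                                     kv_heads[layer_idx], tp_size)

    # Global view: every layer in the model, as needed by the transceiver.
    for layer_idx in range(self.num_layers):
        append_to_kv_heads_per_layer(self.total_num_kv_heads_per_layer,
                                     kv_heads[layer_idx], tp_size)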

tensorrt_llm/_torch/pyexecutor/py_executor.py (4)

125-125: LGTM!

The addition of scheduled_ctx_reqs field follows the established pattern and is properly integrated with the KV cache transceiver functionality.


647-647: Good addition of startup logging.

This logging statement helps with debugging and monitoring the executor loop initialization.


714-719: Well-integrated KV cache transceiver handling.

The additions for preparing disaggregated generation transmission, handling first token responses, and storing scheduled context requests are properly integrated with the existing pipeline parallel flow.

Also applies to: 732-735, 766-766


845-847: LGTM!

The context termination logic is correctly placed and follows the expected flow.

tests/integration/defs/disaggregated/test_disaggregated.py (2)

62-71: Good coverage of parallelism configurations.

The new test configurations provide comprehensive coverage of different tensor and pipeline parallelism combinations for disaggregated serving scenarios.


553-571: Well-structured integration tests following established patterns.

All five new test functions are properly implemented with:

  • Correct GPU requirement decorators matching the parallelism configurations
  • Consistent setup and execution following the existing test pattern
  • Appropriate test descriptors corresponding to the config map entries

Also applies to: 573-591, 593-611, 613-631, 633-651

tests/integration/defs/accuracy/test_disaggregated_serving.py (1)

421-468: Good implementation of the flexible parallel test runner.

The run_parallel_test method provides excellent flexibility for testing various parallelism configurations with proper device count validation.
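
The device-count validation mentioned here typically reduces to a skip guard at the top of the runner. A hedged sketch follows; the parameter order and the skip message are assumptions inferred from the surrounding comments:

    import pytest
    import torch

    def run_parallel_test(self, ctx_pp, ctx_tp, gen_pp, gen_tp, task):
        # Context and generation servers each need tp * pp GPUs.
        required = ctx_pp * ctx_tp + gen_pp * gen_tp
        if torch.cuda.device_count() < required:
            pytest.skip(f"Requires {required} GPUs, "
                        f"only {torch.cuda.device_count()} available")
        # ... launch ctx/gen servers with the given parallelism and run `task` ...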

Comment on lines 816 to +818
TLLM_LOG_WARNING("CacheFormatter::inquireSupport: only support same number of layers");
TLLM_LOG_WARNING("self: %d dest %d", selfConfig.getModelConfig().mNbKvHeadsPerLayer.size(),
destConfig.getModelConfig().mNbKvHeadsPerLayer.size());

⚠️ Potential issue

Incorrect printf specifier for size_t — will trigger warnings or UB

size() returns size_t, yet the format string uses %d, which expects int. On LP64/LLP64 targets this causes a type/width mismatch.

-        TLLM_LOG_WARNING("self: %d dest %d", selfConfig.getModelConfig().mNbKvHeadsPerLayer.size(),
-            destConfig.getModelConfig().mNbKvHeadsPerLayer.size());
+        TLLM_LOG_WARNING("self: %zu dest %zu",
+            selfConfig.getModelConfig().mNbKvHeadsPerLayer.size(),
+            destConfig.getModelConfig().mNbKvHeadsPerLayer.size());

Comment on lines +832 to +836
if self.kv_cache_transceiver and previous_batch.scheduled_ctx_reqs:
self._send_disagg_ctx_cache(
previous_batch.scheduled_ctx_reqs
) if self.kv_cache_transceiver else []

@coderabbitai coderabbitai bot commented Jul 28, 2025

🛠️ Refactor suggestion

Remove redundant conditional check.

The ternary operator on line 835 is redundant since the code is already inside an if self.kv_cache_transceiver block.

-                        if self.kv_cache_transceiver and previous_batch.scheduled_ctx_reqs:
-                            self._send_disagg_ctx_cache(
-                                previous_batch.scheduled_ctx_reqs
-                            ) if self.kv_cache_transceiver else []
+                        if self.kv_cache_transceiver and previous_batch.scheduled_ctx_reqs:
+                            self._send_disagg_ctx_cache(
+                                previous_batch.scheduled_ctx_reqs
+                            )

@reasonsolo @raayandhar can you apply this suggestion.

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Comment on lines +480 to +482
def test_ctx_pp_gen_tp_asymmetric(self, ctx_pp, gen_tp, testset):
return self.run_parallel_test(ctx_pp, 1, gen_tp,
get_accuracy_task(testset))

⚠️ Potential issue

Fix incorrect method signature in test_ctx_pp_gen_tp_asymmetric.

The method is missing the gen_pp parameter - it's passing gen_tp where gen_pp is expected in run_parallel_test.

 def test_ctx_pp_gen_tp_asymmetric(self, ctx_pp, gen_tp, testset):
-    return self.run_parallel_test(ctx_pp, 1, gen_tp,
+    return self.run_parallel_test(ctx_pp, 1, 1, gen_tp,
                                   get_accuracy_task(testset))

Signed-off-by: Lizhi Zhou <[email protected]>
@coderabbitai coderabbitai bot requested a review from litaotju July 28, 2025 09:18
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
tensorrt_llm/_torch/pyexecutor/py_executor.py (2)

722-731: Add explicit ordering validation after sorting.

The stable sort ensures requests without py_batch_idx come first to satisfy model_engine.py::_forward_step assumptions. To prevent future regressions, insert a runtime assertion immediately after the sort as suggested in the previous review.

                     scheduled_batch.generation_requests = sorted(  # stable sort
                         scheduled_batch.generation_requests,
                         key=lambda req: int(req.py_batch_idx is not None),
                     )
+                    # Validate that all non-batched (py_batch_idx is None) requests precede batched ones
+                    first_batched = next(
+                        (i for i, r in enumerate(scheduled_batch.generation_requests) if r.py_batch_idx is not None),
+                        len(scheduled_batch.generation_requests)
+                    )
+                    assert all(
+                        r.py_batch_idx is None
+                        for r in scheduled_batch.generation_requests[:first_batched]
+                    ) and all(
+                        r.py_batch_idx is not None
+                        for r in scheduled_batch.generation_requests[first_batched:]
+                    ), "generation_requests ordering violated: non-batched must come first"

832-836: Remove redundant conditional check.

The ternary operator on line 835 is redundant since the code is already inside an if self.kv_cache_transceiver block.

-                        if self.kv_cache_transceiver and previous_batch.scheduled_ctx_reqs:
-                            self._send_disagg_ctx_cache(
-                                previous_batch.scheduled_ctx_reqs
-                            ) if self.kv_cache_transceiver else []
+                        if self.kv_cache_transceiver and previous_batch.scheduled_ctx_reqs:
+                            self._send_disagg_ctx_cache(
+                                previous_batch.scheduled_ctx_reqs
+                            )
🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

682-682: Fix line length violation.

The line exceeds the 120-character limit as flagged by static analysis.

-                            "num_fitting_reqs=0 and fitting_disagg_gen_init_requests is empty, may not have enough kvCache"
+                            "num_fitting_reqs=0 and fitting_disagg_gen_init_requests is empty, "
+                            "may not have enough kvCache"
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4e78593 and 5997e07.

📒 Files selected for processing (4)
  • scripts/build_wheel.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (9 hunks)
  • tests/integration/defs/accuracy/accuracy_core.py (1 hunks)
  • tests/integration/defs/accuracy/test_disaggregated_serving.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/integration/defs/accuracy/accuracy_core.py
  • tests/integration/defs/accuracy/test_disaggregated_serving.py
🧰 Additional context used
🧠 Learnings (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (3)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor() is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation() to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

Learnt from: yiqingy0
PR: #5198
File: jenkins/mergeWaiveList.py:0-0
Timestamp: 2025-07-22T08:33:49.109Z
Learning: In the TensorRT-LLM waive list merging system, removed lines are always located at the end of the merge waive lists, which is why the mergeWaiveList.py script uses reverse traversal - it's an optimization for this specific domain constraint.

🪛 Ruff (0.12.2)
scripts/build_wheel.py

76-76: Do not use bare except

(E722)

tensorrt_llm/_torch/pyexecutor/py_executor.py

682-682: Line too long (123 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (11)
tensorrt_llm/_torch/pyexecutor/py_executor.py (11)

125-125: LGTM: Clean addition of context request tracking field.

The new scheduled_ctx_reqs field follows the established pattern and properly supports the KV cache transceiver functionality by tracking scheduled context requests in the pipeline parallel batch state.
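
Conceptually, the batch state gains one more field. A minimal sketch, with the surrounding fields elided and the element type assumed:

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class BatchStatePP:
        # ... existing microbatch bookkeeping fields ...
        # Context requests scheduled in this microbatch, retained so the
        # KV cache transceiver can send their cache once the batch completes.
        scheduled_ctx_reqs: Optional[List["LlmRequest"]] = None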


647-647: LGTM: Useful debugging information.

The logging statement provides valuable information for debugging pipeline parallel execution by clearly identifying which rank is starting its executor loop.


661-663: LGTM: Consistent transceiver status checking.

The addition of the KV cache transceiver status check follows the established pattern used in other executor loop methods and properly guards the operation behind the transceiver availability check.


672-674: LGTM: Enhanced scheduling for disaggregated serving.

The updated scheduling call properly returns additional information needed for disaggregated serving functionality, including fitting requests and their count.
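
Schematically, the schedule step now surfaces the disaggregated-generation bookkeeping alongside the batch itself (the method name is assumed; the variable names come from the warning message quoted earlier):

    scheduled_batch, fitting_disagg_gen_init_requests, num_fitting_reqs = \
        self._schedule()
    if num_fitting_reqs == 0 and not fitting_disagg_gen_init_requests:
        logger.warning("num_fitting_reqs=0 and fitting_disagg_gen_init_requests "
                       "is empty, may not have enough kvCache")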


675-686: LGTM: Comprehensive resource preparation for transceiver.

The resource preparation logic for fitting disaggregated generation initialization requests is well-structured, with appropriate warning logging when no requests are fitting and proper status checking.


687-689: LGTM: Conditional assertion properly addresses previous feedback.

The assertion is now appropriately conditioned on the transceiver being inactive, which addresses the previous review comment about preserving the assertion with appropriate conditions.


713-713: LGTM: Proper inflight request tracking.

Adding inflight IDs for the scheduled batch ensures proper request state tracking throughout the execution pipeline.


715-718: LGTM: Consistent preparation pattern for transmission complete requests.

The preparation of disaggregated generation transmission complete requests follows the established pattern and is appropriately guarded by the transceiver availability check.


732-734: LGTM: Consistent first token response handling.

The first token response handling for the transceiver case follows the established pattern and ensures proper client communication in disaggregated serving scenarios.


766-766: LGTM: Proper batch state field population.

The addition of scheduled_ctx_reqs=scheduled_batch.context_requests correctly populates the new batch state field with the relevant context requests for tracking.


845-846: LGTM: Proper cleanup of transmission requests.

The termination of finished context requests still in transmission ensures proper resource cleanup and state management in the disaggregated serving pipeline.

Comment on lines +74 to +77
try:
os.remove(item_path)
except:
print(f"failed to remove {item_path}")
@coderabbitai coderabbitai bot commented Jul 28, 2025

⚠️ Potential issue

Improve error handling specificity and reporting.

The bare except clause catches all exceptions including system errors and keyboard interrupts, which can mask critical issues. Additionally, the current implementation doesn't provide information about what went wrong.

Apply this diff to improve error handling:

-            try:
-                os.remove(item_path)
-            except:
-                print(f"failed to remove {item_path}")
+            try:
+                os.remove(item_path)
+            except (OSError, IOError) as e:
+                print(f"Failed to remove {item_path}: {e}", file=sys.stderr)

This change:

  • Catches specific filesystem-related exceptions
  • Provides detailed error information including the underlying cause
  • Uses sys.stderr for error output following best practices
  • Allows critical exceptions like KeyboardInterrupt to propagate normally

@reasonsolo can we apply this suggestion?


Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!




@pcastonguay (Collaborator) commented
/bot run

@tensorrt-cicd (Collaborator) commented
PR_Github #13201 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator) commented
PR_Github #13201 [ run ] completed with state FAILURE

@reasonsolo reasonsolo (Collaborator, Author) commented
Close this since #6369 already cherry-picked the last commit.

@reasonsolo reasonsolo closed this Jul 29, 2025