[TRTLLM-6759] Disagg-serving + PP tests on top of #6369 #6398
Conversation
Signed-off-by: Patrice Castonguay <[email protected]>
Signed-off-by: Patrice Castonguay <[email protected]>
Signed-off-by: Patrice Castonguay <[email protected]>
Signed-off-by: raayandhar <[email protected]>
Signed-off-by: raayandhar <[email protected]>
Signed-off-by: raayandhar <[email protected]>
📝 Walkthrough

This update introduces new and extended integration tests for disaggregated serving, with explicit support for asymmetric tensor and pipeline parallelism in both context and generation servers. It adds supporting configuration files, enhances the resource manager and executor logic for KV cache handling, and improves logging and error handling in both Python and C++ components.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant TestRunner
    participant ContextServer
    participant GenerationServer
    User->>TestRunner: Launch integration test with config (asymmetric/symmetric TP/PP)
    TestRunner->>ContextServer: Start with specified TP/PP, KV cache config
    TestRunner->>GenerationServer: Start with specified TP/PP, KV cache config
    ContextServer->>GenerationServer: Transmit KV cache (via transceiver)
    GenerationServer->>ContextServer: Acknowledge KV cache receipt
    TestRunner->>ContextServer: Send context requests
    ContextServer->>GenerationServer: Forward prepared requests
    GenerationServer->>TestRunner: Return generation results
```
```mermaid
sequenceDiagram
    participant PyExecutor
    participant KVCacheTransceiver
    participant BatchStatePP
    PyExecutor->>BatchStatePP: Initialize with scheduled_ctx_reqs
    PyExecutor->>KVCacheTransceiver: Check transfer status
    PyExecutor->>BatchStatePP: Prepare resources for fitting requests
    PyExecutor->>BatchStatePP: Sort and handle generation requests
    PyExecutor->>KVCacheTransceiver: Send context cache if needed
    PyExecutor->>KVCacheTransceiver: Terminate finished context requests
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 7
🧹 Nitpick comments (7)
tests/integration/defs/disaggregated/test_configs/disagg_config_ctxpp2_genpp2.yaml (1)
6-6: Avoid duplicating `free_gpu_memory_fraction` at multiple hierarchy levels

The same value is declared once at the root (line 6) and again inside each `kv_cache_config` block (lines 15-16). Keeping a single source of truth prevents silent drift when values are tuned later.

```diff
- free_gpu_memory_fraction: 0.2
```

(or drop the inner fields if the code really consumes only the nested value)
Also applies to: 15-16
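For illustration only (the surrounding keys are assumed rather than copied from the file), a deduplicated layout keeps the fraction solely inside the nested block:

```yaml
# Sketch, not the actual test config: the root-level
# free_gpu_memory_fraction is dropped and only the nested copy remains.
context_servers:
  kv_cache_config:
    free_gpu_memory_fraction: 0.2
generation_servers:
  kv_cache_config:
    free_gpu_memory_fraction: 0.2
```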
tests/integration/defs/disaggregated/test_configs/disagg_config_ctxpp2_gentp2.yaml (1)
6-6: Consolidate `free_gpu_memory_fraction` to one location

As in the previous config, the value is declared twice. Remove the duplicate to avoid confusion during future tuning.
Also applies to: 15-16
tests/integration/defs/disaggregated/test_configs/disagg_config_ctxpp4_genpp4.yaml (1)
6-6: Single-source the GPU memory reserve

Duplicated `free_gpu_memory_fraction` (root vs. `kv_cache_config`) increases maintenance overhead. Prefer defining it only once.

Also applies to: 15-16
tests/integration/defs/disaggregated/test_configs/disagg_config_ctxtp2_genpp2.yaml (1)
6-6: Deduplicate `free_gpu_memory_fraction`

Same remark as for the other YAML configs: keep the setting in one place to avoid accidental divergence.
Also applies to: 15-16
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
681-684: Fix line length issue.

The warning message exceeds the 120-character limit. Consider breaking it into multiple lines.

```diff
-        logger.warning(
-            "num_fitting_reqs=0 and fitting_disagg_gen_init_requests is empty, may not have enough kvCache"
-        )
+        logger.warning(
+            "num_fitting_reqs=0 and fitting_disagg_gen_init_requests is empty, "
+            "may not have enough kvCache"
+        )
```

tests/integration/defs/accuracy/test_disaggregated_serving.py (2)
77-81: Consider using logging instead of print for the warning message.

For better integration with test frameworks and log management, consider using Python's logging module instead of print statements.

```diff
+import logging
+
+logger = logging.getLogger(__name__)
+
     if tensor_parallel_size > 1:
-        print(
-            f"Using unified tp parameter for testing is not recommended. Please use server configs instead."
-        )
+        logger.warning(
+            "Using unified tp parameter for testing is not recommended. Please use server configs instead."
+        )
```
337-377: Test method has inconsistent tensor parallelism configuration.

The test method `test_ctxpp2_genpp2` sets only pipeline parallelism but doesn't explicitly set tensor parallelism, which defaults to 1. This might cause confusion given the method name suggests both parameters should be 2.

```diff
         ctx_server_config = {
+            "tensor_parallel_size": 1,
             "pipeline_parallel_size": 2,
             "disable_overlap_scheduler": True,
             "kv_cache_config": kv_cache_config,
             "cache_transceiver_config": {
                 "backend": "default"
             }
         }
         gen_server_config = {
+            "tensor_parallel_size": 1,
             "pipeline_parallel_size": 2,
             "disable_overlap_scheduler": True,
             "kv_cache_config": kv_cache_config,
             "cache_transceiver_config": {
                 "backend": "default"
             }
         }
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
- cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (1 hunks)
- docker/Makefile (0 hunks)
- tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (9 hunks)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (1 hunks)
- tests/integration/defs/accuracy/accuracy_core.py (1 hunks)
- tests/integration/defs/accuracy/test_disaggregated_serving.py (4 hunks)
- tests/integration/defs/disaggregated/test_configs/disagg_config_ctxpp2_genpp2.yaml (1 hunks)
- tests/integration/defs/disaggregated/test_configs/disagg_config_ctxpp2_gentp2.yaml (1 hunks)
- tests/integration/defs/disaggregated/test_configs/disagg_config_ctxpp4_genpp4.yaml (1 hunks)
- tests/integration/defs/disaggregated/test_configs/disagg_config_ctxtp2_genpp2.yaml (1 hunks)
- tests/integration/defs/disaggregated/test_configs/disagg_config_ctxtp2pp2_gentp2pp2.yaml (1 hunks)
- tests/integration/defs/disaggregated/test_disaggregated.py (2 hunks)
💤 Files with no reviewable changes (1)
- docker/Makefile
🧰 Additional context used
🧠 Learnings (2)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🧬 Code Graph Analysis (1)
tests/integration/defs/disaggregated/test_disaggregated.py (1)
tests/integration/defs/conftest.py (4)
- disaggregated_test_root (2335-2340)
- llm_venv (707-723)
- disaggregated_example_root (270-275)
- llama_model_root (964-1039)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/py_executor.py
682-682: Line too long (123 > 120)
(E501)
🔇 Additional comments (13)
tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py (1)
99-105: LGTM! Clean update to use comprehensive KV head tracking.

The change from `num_kv_heads_per_layer` to `total_num_kv_heads_per_layer` correctly aligns with the enhanced KV cache manager that now tracks KV heads across all layers rather than just local ones. This is essential for disaggregated serving scenarios where the transceiver needs complete visibility into the KV head layout.

tests/integration/defs/accuracy/accuracy_core.py (1)
740-748: Well-implemented dynamic task retrieval function.

The `get_accuracy_task` function provides a clean programmatic interface for retrieving AccuracyTask subclasses. The implementation includes proper error handling for both missing datasets and invalid class types, with clear and informative error messages.
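For illustration, a lookup of this kind typically resolves a dataset name to a registered task class and validates the result. The sketch below is an assumption about the general shape of such a helper (the registry and class names are invented), not the actual implementation in accuracy_core.py:

```python
# Hypothetical sketch of a dataset-name -> AccuracyTask-subclass lookup.
class AccuracyTask:
    """Base class for accuracy evaluation tasks."""


class MMLU(AccuracyTask):
    pass


TASK_REGISTRY = {"MMLU": MMLU}  # illustrative registry, not the real one


def get_accuracy_task(dataset_name: str) -> type:
    # Missing dataset -> clear error naming the offending key.
    try:
        task_cls = TASK_REGISTRY[dataset_name]
    except KeyError as e:
        raise ValueError(f"Unknown dataset: {dataset_name}") from e
    # Invalid registration -> reject anything that is not an AccuracyTask subclass.
    if not (isinstance(task_cls, type) and issubclass(task_cls, AccuracyTask)):
        raise TypeError(f"{dataset_name} does not map to an AccuracyTask subclass")
    return task_cls


print(get_accuracy_task("MMLU").__name__)  # MMLU
```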
tests/integration/defs/disaggregated/test_configs/disagg_config_ctxtp2pp2_gentp2pp2.yaml (1)

1-37: Well-structured configuration for disaggregated serving tests.

The YAML configuration appropriately defines a 2x2 tensor and pipeline parallelism setup for both context and generation servers. The conservative memory settings, disabled optimizations, and separate port allocations are suitable for integration testing environments.
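As a rough sketch of what such a 2x2 layout can look like (key names, ports, and URLs here are assumptions based on the file name, not the file's contents):

```yaml
# Illustrative sketch only; values and exact keys are invented.
hostname: localhost
port: 8000
context_servers:
  num_instances: 1
  tensor_parallel_size: 2
  pipeline_parallel_size: 2
  kv_cache_config:
    free_gpu_memory_fraction: 0.2
  urls:
    - "localhost:8001"
generation_servers:
  num_instances: 1
  tensor_parallel_size: 2
  pipeline_parallel_size: 2
  kv_cache_config:
    free_gpu_memory_fraction: 0.2
  urls:
    - "localhost:8002"
```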
tensorrt_llm/_torch/pyexecutor/resource_manager.py (3)
158-161: Good addition of comprehensive KV head tracking.

The new `total_num_kv_heads_per_layer` attribute correctly extends the existing logic to track KV heads across all layers rather than just local ones. This is essential for disaggregated serving scenarios where the KV cache transceiver needs complete visibility into the model structure.

165-171: Excellent refactoring with the helper function.

The `append_to_kv_heads_per_layer` helper function eliminates code duplication and provides a clean, reusable way to handle KV head calculations with tensor parallelism. The function properly handles edge cases where `kv_head` might be None.
173-184: Well-implemented usage of the helper function.

The code correctly uses the new helper function to populate both local and total KV head lists. The distinction between iterating over `self.pp_layers` for local heads and `range(self.num_layers)` for total heads is appropriate and maintains the semantic difference between local and comprehensive tracking.
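To make the local-vs-total distinction concrete, here is a minimal sketch of what a helper of this shape could look like; the signature and the ceil-division TP-splitting rule are assumptions, not the code under review:

```python
# Hypothetical sketch; names and the splitting rule are assumptions.
from typing import List, Optional


def append_to_kv_heads_per_layer(kv_heads_per_layer: List[int],
                                 kv_head: Optional[int],
                                 tp_size: int) -> None:
    """Append the per-rank KV head count for one layer, handling kv_head=None."""
    if kv_head is None:
        kv_heads_per_layer.append(0)  # layer contributes no KV cache
    else:
        # Split KV heads across tensor-parallel ranks, rounding up.
        kv_heads_per_layer.append((kv_head + tp_size - 1) // tp_size)


# Usage mirroring the reviewed pattern: the local list covers this pipeline
# rank's layers, the total list covers every layer of the model.
num_kv_heads = [8, 8, 8, 8]   # per-layer KV heads (illustrative)
pp_layers = [0, 1]            # layers owned by this pipeline rank
tp_size = 2

local, total = [], []
for layer in pp_layers:
    append_to_kv_heads_per_layer(local, num_kv_heads[layer], tp_size)
for layer in range(len(num_kv_heads)):
    append_to_kv_heads_per_layer(total, num_kv_heads[layer], tp_size)

print(local)  # [4, 4]
print(total)  # [4, 4, 4, 4]
```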
tensorrt_llm/_torch/pyexecutor/py_executor.py (4)

125-125: LGTM!

The addition of the `scheduled_ctx_reqs` field follows the established pattern and is properly integrated with the KV cache transceiver functionality.

647-647: Good addition of startup logging.

This logging statement helps with debugging and monitoring the executor loop initialization.

714-719: Well-integrated KV cache transceiver handling.

The additions for preparing disaggregated generation transmission, handling first token responses, and storing scheduled context requests are properly integrated with the existing pipeline parallel flow.

Also applies to: 732-735, 766-766

845-847: LGTM!

The context termination logic is correctly placed and follows the expected flow.
tests/integration/defs/disaggregated/test_disaggregated.py (2)
62-71: Good coverage of parallelism configurations.

The new test configurations provide comprehensive coverage of different tensor and pipeline parallelism combinations for disaggregated serving scenarios.

553-571: Well-structured integration tests following established patterns.

All five new test functions are properly implemented with:
- Correct GPU requirement decorators matching the parallelism configurations
- Consistent setup and execution following the existing test pattern
- Appropriate test descriptors corresponding to the config map entries
Also applies to: 573-591, 593-611, 613-631, 633-651
tests/integration/defs/accuracy/test_disaggregated_serving.py (1)
421-468: Good implementation of the flexible parallel test runner.

The `run_parallel_test` method provides excellent flexibility for testing various parallelism configurations with proper device count validation.
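To illustrate the kind of device-count check such a runner needs (the function and parameter names below are assumptions, not the test's actual API):

```python
# Hypothetical sketch of validating GPU counts for asymmetric ctx/gen parallelism.
def required_gpus(ctx_tp: int, ctx_pp: int, gen_tp: int, gen_pp: int) -> int:
    """Context and generation servers run on disjoint GPU sets."""
    return ctx_tp * ctx_pp + gen_tp * gen_pp


def check_device_count(available_gpus: int, ctx_tp: int, ctx_pp: int,
                       gen_tp: int, gen_pp: int) -> None:
    needed = required_gpus(ctx_tp, ctx_pp, gen_tp, gen_pp)
    if needed > available_gpus:
        raise RuntimeError(
            f"Test needs {needed} GPUs (ctx {ctx_tp}x{ctx_pp} + gen {gen_tp}x{gen_pp}), "
            f"but only {available_gpus} are available")


# Example: ctx PP=2 (TP=1) plus gen TP=2 (PP=1) needs 4 GPUs in total.
check_device_count(available_gpus=4, ctx_tp=1, ctx_pp=2, gen_tp=2, gen_pp=1)
```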
```cpp
TLLM_LOG_WARNING("CacheFormatter::inquireSupport: only support same number of layers");
TLLM_LOG_WARNING("self: %d dest %d", selfConfig.getModelConfig().mNbKvHeadsPerLayer.size(),
    destConfig.getModelConfig().mNbKvHeadsPerLayer.size());
```
Incorrect printf specifier for size_t — will trigger warnings or UB

`size()` returns `size_t`, yet the format string uses `%d`, which expects `int`. On LP64/LLP64 targets this causes a type/width mismatch.

```diff
-    TLLM_LOG_WARNING("self: %d dest %d", selfConfig.getModelConfig().mNbKvHeadsPerLayer.size(),
-        destConfig.getModelConfig().mNbKvHeadsPerLayer.size());
+    TLLM_LOG_WARNING("self: %zu dest %zu",
+        selfConfig.getModelConfig().mNbKvHeadsPerLayer.size(),
+        destConfig.getModelConfig().mNbKvHeadsPerLayer.size());
```
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp around lines 816 to 818,
the format specifier %d is incorrectly used for size_t values returned by
size(), which can cause warnings or undefined behavior. Replace %d with %zu in
the TLLM_LOG_WARNING calls to correctly format size_t types.
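For reference, here is a tiny standalone C++ example of the matching specifier (illustrative only, not taken from the patch):

```cpp
#include <cstdio>
#include <vector>

int main() {
    std::vector<int> heads(32);
    // %zu is the portable printf specifier for size_t; %d would mismatch
    // on LP64/LLP64 platforms where size_t is wider than int.
    std::printf("self: %zu dest %zu\n", heads.size(), heads.size());
    return 0;
}
```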
```python
if self.kv_cache_transceiver and previous_batch.scheduled_ctx_reqs:
    self._send_disagg_ctx_cache(
        previous_batch.scheduled_ctx_reqs
    ) if self.kv_cache_transceiver else []
```
🛠️ Refactor suggestion
Remove redundant conditional check.
The ternary operator on line 835 is redundant since the code is already inside an `if self.kv_cache_transceiver` block.

```diff
-        if self.kv_cache_transceiver and previous_batch.scheduled_ctx_reqs:
-            self._send_disagg_ctx_cache(
-                previous_batch.scheduled_ctx_reqs
-            ) if self.kv_cache_transceiver else []
+        if self.kv_cache_transceiver and previous_batch.scheduled_ctx_reqs:
+            self._send_disagg_ctx_cache(
+                previous_batch.scheduled_ctx_reqs
+            )
```
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/py_executor.py around lines 832 to 836, remove
the redundant ternary conditional on line 835 that checks
self.kv_cache_transceiver again inside an if block that already ensures
self.kv_cache_transceiver is truthy. Simply call
self._send_disagg_ctx_cache(previous_batch.scheduled_ctx_reqs) directly without
the ternary operator.
@reasonsolo @raayandhar can you apply this suggestion.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
```python
def test_ctx_pp_gen_tp_asymmetric(self, ctx_pp, gen_tp, testset):
    return self.run_parallel_test(ctx_pp, 1, gen_tp,
                                  get_accuracy_task(testset))
```
Fix incorrect method signature in test_ctx_pp_gen_tp_asymmetric.

The method is missing the `gen_pp` parameter - it's passing `gen_tp` where `gen_pp` is expected in `run_parallel_test`.

```diff
     def test_ctx_pp_gen_tp_asymmetric(self, ctx_pp, gen_tp, testset):
-        return self.run_parallel_test(ctx_pp, 1, gen_tp,
+        return self.run_parallel_test(ctx_pp, 1, 1, gen_tp,
                                       get_accuracy_task(testset))
```
🤖 Prompt for AI Agents
In tests/integration/defs/accuracy/test_disaggregated_serving.py around lines
480 to 482, the method test_ctx_pp_gen_tp_asymmetric is missing the gen_pp
parameter in its signature but is passing gen_tp where gen_pp is expected in the
call to run_parallel_test. Fix this by adding the gen_pp parameter to the method
signature and passing gen_pp instead of gen_tp to run_parallel_test.
Signed-off-by: Lizhi Zhou <[email protected]>
4e78593 to 5997e07 (Compare)
Actionable comments posted: 1
♻️ Duplicate comments (2)
tensorrt_llm/_torch/pyexecutor/py_executor.py (2)
722-731: Add explicit ordering validation after sorting.

The stable sort ensures requests without `py_batch_idx` come first to satisfy `model_engine.py::_forward_step` assumptions. To prevent future regressions, insert a runtime assertion immediately after the sort as suggested in the previous review.

```diff
         scheduled_batch.generation_requests = sorted(  # stable sort
             scheduled_batch.generation_requests,
             key=lambda req: int(req.py_batch_idx is not None),
         )
+        # Validate that all non-batched (py_batch_idx is None) requests precede batched ones
+        first_batched = next(
+            (i for i, r in enumerate(scheduled_batch.generation_requests) if r.py_batch_idx is not None),
+            len(scheduled_batch.generation_requests)
+        )
+        assert all(
+            r.py_batch_idx is None
+            for r in scheduled_batch.generation_requests[:first_batched]
+        ) and all(
+            r.py_batch_idx is not None
+            for r in scheduled_batch.generation_requests[first_batched:]
+        ), "generation_requests ordering violated: non-batched must come first"
```
832-836: Remove redundant conditional check.

The ternary operator on line 835 is redundant since the code is already inside an `if self.kv_cache_transceiver` block.

```diff
-        if self.kv_cache_transceiver and previous_batch.scheduled_ctx_reqs:
-            self._send_disagg_ctx_cache(
-                previous_batch.scheduled_ctx_reqs
-            ) if self.kv_cache_transceiver else []
+        if self.kv_cache_transceiver and previous_batch.scheduled_ctx_reqs:
+            self._send_disagg_ctx_cache(
+                previous_batch.scheduled_ctx_reqs
+            )
```
🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
682-682: Fix line length violation.

The line exceeds the 120-character limit as flagged by static analysis.

```diff
-                "num_fitting_reqs=0 and fitting_disagg_gen_init_requests is empty, may not have enough kvCache"
+                "num_fitting_reqs=0 and fitting_disagg_gen_init_requests is empty, "
+                "may not have enough kvCache"
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- scripts/build_wheel.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (9 hunks)
- tests/integration/defs/accuracy/accuracy_core.py (1 hunks)
- tests/integration/defs/accuracy/test_disaggregated_serving.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/integration/defs/accuracy/accuracy_core.py
- tests/integration/defs/accuracy/test_disaggregated_serving.py
🧰 Additional context used
🧠 Learnings (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (3)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor() is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation() to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
Learnt from: yiqingy0
PR: #5198
File: jenkins/mergeWaiveList.py:0-0
Timestamp: 2025-07-22T08:33:49.109Z
Learning: In the TensorRT-LLM waive list merging system, removed lines are always located at the end of the merge waive lists, which is why the mergeWaiveList.py script uses reverse traversal - it's an optimization for this specific domain constraint.
🪛 Ruff (0.12.2)
scripts/build_wheel.py
76-76: Do not use bare except
(E722)
tensorrt_llm/_torch/pyexecutor/py_executor.py
682-682: Line too long (123 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (11)
tensorrt_llm/_torch/pyexecutor/py_executor.py (11)
125-125: LGTM: Clean addition of context request tracking field.

The new `scheduled_ctx_reqs` field follows the established pattern and properly supports the KV cache transceiver functionality by tracking scheduled context requests in the pipeline parallel batch state.

647-647: LGTM: Useful debugging information.

The logging statement provides valuable information for debugging pipeline parallel execution by clearly identifying which rank is starting its executor loop.

661-663: LGTM: Consistent transceiver status checking.

The addition of the KV cache transceiver status check follows the established pattern used in other executor loop methods and properly guards the operation behind the transceiver availability check.

672-674: LGTM: Enhanced scheduling for disaggregated serving.

The updated scheduling call properly returns additional information needed for disaggregated serving functionality, including fitting requests and their count.

675-686: LGTM: Comprehensive resource preparation for transceiver.

The resource preparation logic for fitting disaggregated generation initialization requests is well-structured, with appropriate warning logging when no requests are fitting and proper status checking.

687-689: LGTM: Conditional assertion properly addresses previous feedback.

The assertion is now appropriately conditioned on the transceiver being inactive, which addresses the previous review comment about preserving the assertion with appropriate conditions.
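As a rough illustration of what "conditioned on the transceiver being inactive" can look like (the names and message below are assumptions, not the reviewed code):

```python
# Hypothetical sketch: the schedule-sanity assertion only fires when there is
# no KV cache transceiver, i.e. no disaggregated-generation init path exists.
def check_schedule(batch_size: int, has_kv_cache_transceiver: bool) -> None:
    if not has_kv_cache_transceiver:
        # Without a transceiver, an empty schedule indicates a real resource problem.
        assert batch_size > 0, (
            "failed to schedule any pending request; "
            "the executor may have run out of KV cache resources")


check_schedule(batch_size=2, has_kv_cache_transceiver=False)  # passes
check_schedule(batch_size=0, has_kv_cache_transceiver=True)   # allowed with transceiver
```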
713-713: LGTM: Proper inflight request tracking.

Adding inflight IDs for the scheduled batch ensures proper request state tracking throughout the execution pipeline.

715-718: LGTM: Consistent preparation pattern for transmission complete requests.

The preparation of disaggregated generation transmission complete requests follows the established pattern and is appropriately guarded by the transceiver availability check.

732-734: LGTM: Consistent first token response handling.

The first token response handling for the transceiver case follows the established pattern and ensures proper client communication in disaggregated serving scenarios.

766-766: LGTM: Proper batch state field population.

The addition of `scheduled_ctx_reqs=scheduled_batch.context_requests` correctly populates the new batch state field with the relevant context requests for tracking.

845-846: LGTM: Proper cleanup of transmission requests.

The termination of finished context requests still in transmission ensures proper resource cleanup and state management in the disaggregated serving pipeline.
```python
try:
    os.remove(item_path)
except:
    print(f"failed to remove {item_path}")
```
Improve error handling specificity and reporting.
The bare except clause catches all exceptions including system errors and keyboard interrupts, which can mask critical issues. Additionally, the current implementation doesn't provide information about what went wrong.
Apply this diff to improve error handling:
```diff
-            try:
-                os.remove(item_path)
-            except:
-                print(f"failed to remove {item_path}")
+            try:
+                os.remove(item_path)
+            except (OSError, IOError) as e:
+                print(f"Failed to remove {item_path}: {e}", file=sys.stderr)
```

This change:
- Catches specific filesystem-related exceptions
- Provides detailed error information including the underlying cause
- Uses `sys.stderr` for error output following best practices
- Allows critical exceptions like `KeyboardInterrupt` to propagate normally
🧰 Tools
🪛 Ruff (0.12.2)
76-76: Do not use bare except
(E722)
🤖 Prompt for AI Agents
In scripts/build_wheel.py around lines 74 to 77, replace the bare except clause
with a more specific exception handler that catches only filesystem-related
errors such as OSError. Modify the except block to capture the exception
instance and print a detailed error message including the exception details to
sys.stderr. This will improve error reporting and allow critical exceptions like
KeyboardInterrupt to propagate normally.
@reasonsolo can we apply this suggestion?
/bot run

PR_Github #13201 [ run ] triggered by Bot

PR_Github #13201 [ run ] completed with state
Close this since #6369 already cherry-picked the last commit. |
Summary by CodeRabbit
New Features
Bug Fixes
Enhancements
Tests
Description
Test Coverage
GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`

Provide a user-friendly way for developers to interact with a Jenkins server.
Run `/bot [-h|--help]` to print this help message.

See details below for each supported subcommand.
run

`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.
- `--reuse-test (optional)pipeline-id` (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
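For example, a typical invocation combining two of the documented flags might look like the following (the stage name is the sample value from the option description above):

```
/bot run --disable-fail-fast --stage-list "A10-PyTorch-1"
```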
kill

`kill` : Kill all running builds associated with pull request.
skip
`skip --comment COMMENT` : Skip testing for latest commit on pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

`reuse-pipeline` : Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.