[TRTLLM-6392][feat] Support turning on/off spec decoding dynamically #6363
Conversation
📝 Walkthrough
This change introduces dynamic runtime control for speculative decoding in the TensorRT-LLM PyTorch executor. A new utility function, `get_draft_token_length`, centralizes draft token length retrieval, drafters expose a `should_use_spec_decode` hook, and the model engine can enable or disable speculative decoding at runtime.
Changes
Sequence Diagram(s)
```mermaid
sequenceDiagram
    participant User
    participant PyExecutor
    participant Drafter
    participant ModelEngine
    User->>PyExecutor: submit requests
    PyExecutor->>Drafter: should_use_spec_decode(requests)
    Drafter-->>PyExecutor: True/False
    PyExecutor->>ModelEngine: set enable_spec_decode (runtime)
    PyExecutor->>PyExecutor: prepare draft tokens if enabled
    PyExecutor->>ModelEngine: forward(requests)
    ModelEngine->>ModelEngine: select CUDA graph by batch/draft length
    ModelEngine-->>PyExecutor: output
    PyExecutor-->>User: return result
```
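To make the flow concrete, here is a minimal Python sketch of one executor iteration. The names `drafter`, `model_engine`, `scheduled_requests`, and the attribute/method names are assumptions drawn from the walkthrough, not the actual py_executor.py code.

```python
def run_iteration(drafter, model_engine, scheduled_requests):
    """Illustrative only: one executor step with dynamic spec-decode control."""
    # Ask the drafter whether speculation is worthwhile for this batch.
    use_spec_decode = drafter.should_use_spec_decode(scheduled_requests)

    # Tell the engine, so it selects the CUDA graph captured for the matching
    # draft length (draft length 0 when speculation is off).
    model_engine.enable_spec_decode = use_spec_decode

    # Only pay the drafting cost when speculation is actually enabled.
    if use_spec_decode:
        drafter.prepare_draft_tokens(scheduled_requests)

    return model_engine.forward(scheduled_requests)
```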
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Actionable comments posted: 2
🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
731-784: Well-designed multi-draft-length CUDA graph capture logic. The implementation correctly:
- Captures graphs for draft length 0 to support runtime disabling
- Manages runtime state appropriately during warmup
- Resets state after warmup completion
Consider breaking line 531 to respect the 120-character limit:
```diff
-        available_tokens = kv_cache_manager.get_num_available_tokens(draft_len)
+        available_tokens = kv_cache_manager.get_num_available_tokens(
+            draft_len)
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
- tensorrt_llm/_torch/models/modeling_speculative.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (23 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (2 hunks)
- tensorrt_llm/_torch/pyexecutor/sampler.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/scheduler.py (2 hunks)
- tensorrt_llm/_torch/speculative/drafter.py (2 hunks)
- tensorrt_llm/_torch/speculative/model_drafter.py (3 hunks)
- tensorrt_llm/_torch/speculative/ngram.py (1 hunks)
- tests/integration/defs/accuracy/test_llm_api_pytorch.py (1 hunks)
- tests/integration/test_lists/test-db/l0_b200.yml (1 hunks)
- tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
tensorrt_llm/_torch/speculative/ngram.py
tensorrt_llm/_torch/pyexecutor/sampler.py
tensorrt_llm/_torch/pyexecutor/llm_request.py
tensorrt_llm/_torch/pyexecutor/py_executor.py
tensorrt_llm/_torch/pyexecutor/resource_manager.py
tensorrt_llm/_torch/speculative/drafter.py
tensorrt_llm/_torch/pyexecutor/scheduler.py
tests/unittest/_torch/speculative/test_dynamic_spec_decode.py
tensorrt_llm/_torch/speculative/model_drafter.py
tests/integration/defs/accuracy/test_llm_api_pytorch.py
tensorrt_llm/_torch/models/modeling_speculative.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
tensorrt_llm/_torch/speculative/ngram.py
tensorrt_llm/_torch/pyexecutor/sampler.py
tensorrt_llm/_torch/pyexecutor/llm_request.py
tensorrt_llm/_torch/pyexecutor/py_executor.py
tensorrt_llm/_torch/pyexecutor/resource_manager.py
tensorrt_llm/_torch/speculative/drafter.py
tensorrt_llm/_torch/pyexecutor/scheduler.py
tests/unittest/_torch/speculative/test_dynamic_spec_decode.py
tensorrt_llm/_torch/speculative/model_drafter.py
tests/integration/defs/accuracy/test_llm_api_pytorch.py
tensorrt_llm/_torch/models/modeling_speculative.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/speculative/ngram.py
- 202-202: `List` may be undefined, or defined from star imports (F405)
- 202-202: `LlmRequest` may be undefined, or defined from star imports (F405)
tests/integration/defs/accuracy/test_llm_api_pytorch.py
- 104-104: Undefined name `_with_ids` (F821)
tensorrt_llm/_torch/pyexecutor/model_engine.py
- 531-531: Line too long (130 > 120) (E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (37)
tensorrt_llm/_torch/speculative/drafter.py (2)
2-4: LGTM: Clean import additions for new abstract method. The import additions for `List` and `LlmRequest` are correctly placed and necessary for the new method signature.
26-29: LGTM: Well-designed abstract method for dynamic spec decode control. The `should_use_spec_decode` method follows the established pattern for abstract methods in this class, with proper typing and documentation. This provides a clean interface for drafter implementations to control spec decode usage dynamically.
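For orientation, a minimal sketch of what such an abstract hook can look like; the exact signature and docstring in drafter.py may differ.

```python
from abc import ABC, abstractmethod
from typing import List


class Drafter(ABC):
    """Sketch of a drafter base class exposing the dynamic-control hook."""

    @abstractmethod
    def should_use_spec_decode(self, requests: List["LlmRequest"]) -> bool:
        """Return True if speculative decoding should be used for this batch of requests."""
```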
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
482-493: LGTM: Well-implemented utility function with proper defensive programming. The `get_draft_token_length` function is cleanly implemented with:
- Proper type hints and Google-style docstring
- Defensive handling of None values
- Clear, single-responsibility design
- Consistent naming following snake_case convention
This centralizes draft token length access and improves code maintainability across the speculative decoding pipeline.
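Given the properties listed above, the helper presumably looks roughly like this (a sketch, not the code under review):

```python
def get_draft_token_length(request) -> int:
    """Return the number of draft tokens attached to a request.

    Returns 0 when ``py_draft_tokens`` is absent or ``None``, so callers never
    need their own None checks.
    """
    draft_tokens = getattr(request, "py_draft_tokens", None)
    return len(draft_tokens) if draft_tokens is not None else 0
```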
tests/integration/test_lists/test-db/l0_b200.yml (1)
62-62: LGTM: Appropriate test configuration for dynamic spec decode feature. The addition of the dynamic spec decode test to the B200 GPU test suite is well-placed, providing integration test coverage for the new runtime spec decode control functionality.
tensorrt_llm/_torch/pyexecutor/resource_manager.py (3)
18-19: LGTM! Clean import addition. The import of `get_draft_token_length` is properly added to the existing import statement, maintaining namespace consistency.
357-358: Excellent refactoring for safer draft token handling. Replacing direct access to `len(req.py_draft_tokens)` with `get_draft_token_length(req)` improves safety by using a centralized utility that handles edge cases where draft tokens may be absent.
362-363: Consistent application of the draft token utility. This change maintains consistency with the context batch processing by using the same safe `get_draft_token_length(req)` utility function.
tensorrt_llm/_torch/pyexecutor/sampler.py (3)
26-26: Clean import addition. The import of `get_draft_token_length` is properly integrated into the existing import statement.
340-340: Safer conditional check for draft tokens. Using `get_draft_token_length(req) > 0` instead of direct length access prevents potential AttributeError exceptions when draft tokens are absent.
404-404: Consistent application of draft token utility in calculations. The change maintains the same logic while using the safer `get_draft_token_length(req)` utility for calculating processing steps.
tensorrt_llm/_torch/speculative/model_drafter.py (3)
11-12: Proper import addition. The import of `get_draft_token_length` is correctly added to the existing import statement, following the established pattern.
322-322: Safe draft token length access in padding logic. Using `get_draft_token_length(req)` ensures safe access to draft token length when calculating padding requirements.
402-403: New method implements dynamic speculative decoding interface. The `should_use_spec_decode` method correctly implements the abstract method from the base `Drafter` class. The unconditional `True` return makes sense for `ModelDrafter` since model-based drafting should generally always be used when available.
tensorrt_llm/_torch/pyexecutor/py_executor.py (3)
829-832: LGTM! Dynamic speculative decoding control implemented correctly. The implementation properly queries the drafter to determine whether speculative decoding should be enabled and notifies the model engine accordingly. This enables runtime toggling of speculative decoding as intended by the PR objectives.
875-875: Good conditional guarding for draft token preparation. The additional check for `self.use_spec_decode` ensures draft tokens are only prepared when speculative decoding is actually enabled, preventing unnecessary work when the feature is dynamically disabled.
931-931: Appropriate conditional logic for draft token allocation. The check for `self.use_spec_decode` ensures draft tokens are only allocated when speculative decoding is enabled, completing the runtime control implementation by preventing unnecessary resource allocation.
tensorrt_llm/_torch/pyexecutor/scheduler.py (2)
8-8: Good addition of utility function import. The import of `get_draft_token_length` supports the safer draft token length handling pattern being adopted across the codebase.
188-188: Improved safety with utility function usage. Replacing direct length access with `get_draft_token_length(request)` provides safer handling of draft token lengths, likely including proper None checking and edge case handling.
tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (4)
17-21: Excellent resource checking for memory-intensive test. The memory check with clear skip message prevents test failures on systems with insufficient GPU memory. The 35GB requirement is appropriate for loading both target and draft models.
54-63: Well-designed mock for testing dynamic behavior. The mock implementation with call counting effectively simulates dynamic toggling of speculative decoding, enabling validation of the runtime control feature. The logic is clear and deterministic.
61-78: Well-structured test execution with proper resource management. The test properly uses a context manager for mocking, ensures deterministic output with temperature=0, and includes proper resource cleanup with explicit shutdown calls.
85-87: Appropriate correctness validation for speculative decoding. The direct comparison of outputs between speculative and non-speculative modes effectively validates that the dynamic toggling maintains correctness, with clear documentation of the expectation.
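A rough pytest-style sketch of the mocking pattern these comments describe. The patch target, the `run_llm` helper, and the sampling knobs are placeholders, not the actual test fixtures.

```python
from unittest import mock


def test_dynamic_spec_decode_sketch(run_llm):
    """Illustrative only: alternate spec decode on/off and compare outputs."""
    call_count = 0

    def fake_should_use_spec_decode(self, requests):
        nonlocal call_count
        call_count += 1
        # Enable speculation only on every other iteration to exercise both paths.
        return call_count % 2 == 1

    target = ("tensorrt_llm._torch.speculative.model_drafter."
              "ModelDrafter.should_use_spec_decode")  # placeholder patch target
    with mock.patch(target, new=fake_should_use_spec_decode):
        spec_outputs = run_llm(speculative=True, temperature=0)  # deterministic

    ref_outputs = run_llm(speculative=False, temperature=0)

    # Dynamic toggling must not change the generated text.
    assert spec_outputs == ref_outputs
```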
tensorrt_llm/_torch/models/modeling_speculative.py (3)
133-135: Good defensive programming with spec_metadata None check. The None check prevents AttributeError when speculative metadata is not available, which is important for the dynamic speculative decoding feature where metadata may be conditionally present.
253-254: Appropriate early return for missing speculative metadata. The early None return prevents attempting to access speculative metadata methods when unavailable, which is essential for supporting dynamic speculative decoding where metadata may not always be present.
387-387: Enhanced conditional check for speculative decoding prerequisites. The additional `spec_metadata is not None` check ensures both the draft model and speculative metadata are available before proceeding, preventing runtime errors in dynamic speculative decoding scenarios.
tensorrt_llm/_torch/pyexecutor/model_engine.py (12)
92-95: LGTM! The abstract method is properly defined and follows the established pattern for abstract methods in the base class.
284-287: LGTM! The runtime state fields are properly initialized from their static counterparts, enabling dynamic control while preserving the original configuration.
493-500: LGTM! The implementation correctly manages the runtime speculative decoding state, with appropriate handling of both enabled and disabled states.
516-558: LGTM! The parameterization of `get_cuda_graph_warmup_request` with `draft_len` correctly enables CUDA graph capture for different draft lengths.
828-844: LGTM! Correctly uses runtime configuration (`spec_config_rt`) for spec metadata setup, enabling dynamic control.
935-980: LGTM! The CUDA graph management correctly handles multiple draft lengths with a proper nested dictionary structure and conditional spec metadata creation based on runtime state.
1131-1137: LGTM! The cleanup logic correctly handles the nested dictionary structure for multi-draft-length CUDA graphs.
1150-1169: LGTM! Correctly uses runtime flag (`is_spec_decode_rt`) for conditional preprocessing logic.
1249-1290: LGTM! Correctly uses the `get_draft_token_length` utility function for safe access to draft token lengths.
1268-1270: LGTM! All runtime state checks and configuration accesses correctly use the runtime fields (`is_spec_decode_rt` and `spec_config_rt`). Also applies to: 1293-1294, 1487-1490, 1518-1520, 1525-1525, 1599-1599
1643-1646: LGTM! Correctly uses runtime flag for conditional logic in the no-cache path.
2099-2142: LGTM! The forward method correctly handles spec metadata setup based on runtime state for both CUDA graph and regular execution paths.
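The nested graph bookkeeping described in the comments above can be pictured roughly as follows; the capture callback and runner type are stand-ins, not the engine's real API.

```python
from typing import Callable, Dict

# Sketch: CUDA graph cache keyed by batch size, then by draft length.
G_CUDA_GRAPHS: Dict[int, Dict[int, object]] = {}


def get_or_capture_graph(batch_size: int, draft_len: int,
                         capture_fn: Callable[[int, int], object]) -> object:
    """Return the graph for (batch_size, draft_len), capturing it on first use."""
    per_batch = G_CUDA_GRAPHS.setdefault(batch_size, {})
    if draft_len not in per_batch:
        per_batch[draft_len] = capture_fn(batch_size, draft_len)
    return per_batch[draft_len]


def release_graphs() -> None:
    """Mirror of the cleanup the review describes: drop every captured graph."""
    for per_batch in G_CUDA_GRAPHS.values():
        per_batch.clear()
    G_CUDA_GRAPHS.clear()
```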
ba7ee74
to
e742c1b
Compare
/bot run
PR_Github #13004 [ run ] triggered by Bot
Actionable comments posted: 0
♻️ Duplicate comments (1)
tensorrt_llm/_torch/speculative/ngram.py (1)
203-204: LGTM! Method implementation follows established pattern. The `should_use_spec_decode` method correctly implements the abstract interface from the base `Drafter` class. Returning `True` is appropriate for the NGram drafter, as it should consistently enable speculative decoding when available. The static analysis warning about `LlmRequest` being undefined is a false positive, as discussed in previous reviews; `LlmRequest` is available through the star import.
🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
744-783: LGTM: Nested warmup loop properly implemented. The nested iteration over batch sizes and draft lengths ensures CUDA graphs are captured for all necessary combinations. The runtime spec decode toggling during warmup is correctly implemented.
Note: Line 531 exceeds the 120-character limit as flagged by static analysis. Consider breaking it into multiple lines. However, this appears to be a formatting issue that doesn't affect functionality.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
- tensorrt_llm/_torch/models/modeling_speculative.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (23 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (2 hunks)
- tensorrt_llm/_torch/pyexecutor/sampler.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/scheduler.py (2 hunks)
- tensorrt_llm/_torch/speculative/drafter.py (2 hunks)
- tensorrt_llm/_torch/speculative/model_drafter.py (3 hunks)
- tensorrt_llm/_torch/speculative/ngram.py (2 hunks)
- tests/integration/test_lists/test-db/l0_b200.yml (1 hunks)
- tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (1 hunks)
✅ Files skipped from review due to trivial changes (3)
- tensorrt_llm/_torch/pyexecutor/llm_request.py
- tests/integration/test_lists/test-db/l0_b200.yml
- tensorrt_llm/_torch/pyexecutor/sampler.py
🚧 Files skipped from review as they are similar to previous changes (7)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
- tensorrt_llm/_torch/speculative/drafter.py
- tensorrt_llm/_torch/speculative/model_drafter.py
- tensorrt_llm/_torch/pyexecutor/scheduler.py
- tests/unittest/_torch/speculative/test_dynamic_spec_decode.py
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tensorrt_llm/_torch/models/modeling_speculative.py
🧰 Additional context used
🧬 Code Graph Analysis (2)
tensorrt_llm/_torch/pyexecutor/model_engine.py (6)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (1): get_draft_token_length (482-493)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (1): get_num_available_tokens (545-547)
- tensorrt_llm/logger.py (1): info (137-138)
- tensorrt_llm/_torch/speculative/interface.py (2): support_overlap_scheduler (55-56), extend_ctx (83-94)
- tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (1): DecodingCUDAGraphRunner (28-139)
- tensorrt_llm/_torch/speculative/utils.py (1): get_num_extra_kv_tokens (158-168)
tensorrt_llm/_torch/speculative/ngram.py (3)
- tensorrt_llm/_torch/speculative/drafter.py (1): should_use_spec_decode (27-29)
- tensorrt_llm/_torch/speculative/model_drafter.py (1): should_use_spec_decode (402-403)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (1): LlmRequest (264-352)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/model_engine.py
- 531-531: Line too long (130 > 120) (E501)
tensorrt_llm/_torch/speculative/ngram.py
- 203-203: `LlmRequest` may be undefined, or defined from star imports (F405)
🔇 Additional comments (14)
tensorrt_llm/_torch/speculative/ngram.py (1)
2-2: LGTM! Import addition addresses previous feedback. This explicit import of `List` improves code clarity and resolves the static analysis concerns discussed in previous reviews.
tensorrt_llm/_torch/pyexecutor/model_engine.py (13)
59-59: LGTM: Import supports centralized draft token length handling. The import of `get_draft_token_length` aligns with the refactoring to centralize draft token length retrieval logic.
92-94: LGTM: Abstract method properly defines the interface. The abstract method `use_runtime_spec_decode` establishes a clear contract for implementations to enable/disable speculative decoding at runtime.
284-286: LGTM: Runtime flags properly initialized. The runtime flags are correctly initialized from their static counterparts, providing the expected baseline behavior.
493-499: LGTM: Runtime spec decode toggling correctly implemented. The method properly toggles runtime flags and configuration, enabling dynamic control over speculative decoding behavior.
516-516: LGTM: Function signature updated to support draft length. The addition of the `draft_len` parameter enables generating warmup requests with specific draft token lengths.
528-542: LGTM: Draft length parameter properly utilized. The function correctly uses the `draft_len` parameter when adding dummy requests, ensuring proper warmup for different draft token configurations.
731-738: LGTM: Draft length array logic is sound. The logic correctly determines which draft lengths to capture CUDA graphs for:
- Always include 0 for non-draft models (enables runtime disabling)
- Include max_draft_len when speculative decoding is configured
782-783: LGTM: Runtime flags properly restored after warmup. The runtime flags are correctly reset to their original values after warmup completion, ensuring the engine returns to its initial state.
935-935: LGTM: Runtime config used for draft length calculation. The code correctly uses the runtime configuration to determine the current draft length for CUDA graph selection.
953-979: LGTM: CUDA graph nested dictionary structure properly implemented. The transition to a nested dictionary structure (batch_size -> draft_len -> graph) correctly supports multiple CUDA graphs per batch size for different draft lengths. The initialization and access patterns are consistent.
1131-1134: LGTM: CUDA graph cleanup handles nested structure. The cleanup method correctly iterates through the nested dictionary structure to delete all CUDA graphs.
828-838: LGTM: Runtime flags consistently used throughout. All references to speculative decoding configuration and flags correctly use the runtime versions (`is_spec_decode_rt`, `spec_config_rt`), ensuring dynamic behavior works consistently across metadata creation, input preparation, and conditional logic. Also applies to: 935-935, 962-971, 1099-1141, 1487-1487, 1518-1525, 1599-1599, 1643-1643
1249-1249: LGTM: Centralized draft token length function properly utilized. The code correctly uses `get_draft_token_length(request)` instead of directly accessing `request.py_draft_tokens`, supporting consistent draft token length handling across the codebase. Also applies to: 1289-1289
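In spirit, the runtime toggle behaves like the following toy sketch (field names follow this review's `*_rt` naming; the real method in model_engine.py does more):

```python
class EngineStateSketch:
    """Toy stand-in for the engine's static vs. runtime spec-decode state."""

    def __init__(self, spec_config):
        self.spec_config = spec_config              # static configuration
        self.is_spec_decode = spec_config is not None
        # Runtime views start out mirroring the static configuration.
        self.spec_config_rt = spec_config
        self.is_spec_decode_rt = self.is_spec_decode

    def use_runtime_spec_decode(self, enable: bool) -> None:
        # Never enable at runtime what the static config does not provide.
        self.is_spec_decode_rt = enable and self.is_spec_decode
        self.spec_config_rt = self.spec_config if self.is_spec_decode_rt else None
```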
PR_Github #13004 [ run ] completed with state
/bot run
PR_Github #13019 [ run ] triggered by Bot
PR_Github #13019 [ run ] completed with state
e742c1b
to
d4771a5
Compare
/bot run
Actionable comments posted: 0
🧹 Nitpick comments (2)
tensorrt_llm/_torch/speculative/ngram.py (1)
203-204: Implementation looks correct and follows established patterns. The method correctly implements the abstract interface from the base `Drafter` class and follows the same pattern as `ModelDrafter`. The logic of always returning `True` is appropriate for the NGram drafter.
Note: Static analysis still flags `LlmRequest` as potentially undefined due to the star import. While functionally correct, you could optionally add an explicit import to silence the warning:
```diff
 from ..pyexecutor.llm_request import *
+from ..pyexecutor.llm_request import LlmRequest  # Explicit import for type hints
```
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
531-531: Fix line length violation. The line exceeds the 120-character limit as flagged by static analysis.
```diff
-        available_tokens = kv_cache_manager.get_num_available_tokens(draft_len)
+        available_tokens = kv_cache_manager.get_num_available_tokens(
+            draft_len)
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
- tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (1 hunks)
- tensorrt_llm/_torch/models/modeling_speculative.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (23 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (2 hunks)
- tensorrt_llm/_torch/pyexecutor/sampler.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/scheduler.py (2 hunks)
- tensorrt_llm/_torch/speculative/drafter.py (2 hunks)
- tensorrt_llm/_torch/speculative/model_drafter.py (3 hunks)
- tensorrt_llm/_torch/speculative/ngram.py (2 hunks)
- tests/integration/test_lists/test-db/l0_b200.yml (1 hunks)
- tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (1 hunks)
✅ Files skipped from review due to trivial changes (2)
- tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
- tests/integration/test_lists/test-db/l0_b200.yml
🚧 Files skipped from review as they are similar to previous changes (9)
- tensorrt_llm/_torch/pyexecutor/llm_request.py
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
- tensorrt_llm/_torch/speculative/model_drafter.py
- tensorrt_llm/_torch/speculative/drafter.py
- tensorrt_llm/_torch/pyexecutor/sampler.py
- tensorrt_llm/_torch/pyexecutor/scheduler.py
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tests/unittest/_torch/speculative/test_dynamic_spec_decode.py
- tensorrt_llm/_torch/models/modeling_speculative.py
🧰 Additional context used
🧬 Code Graph Analysis (2)
tensorrt_llm/_torch/pyexecutor/model_engine.py (4)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (1): get_draft_token_length (482-493)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (1): get_num_available_tokens (545-547)
- tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (1): DecodingCUDAGraphRunner (28-139)
- tensorrt_llm/_torch/speculative/utils.py (1): get_num_extra_kv_tokens (158-168)
tensorrt_llm/_torch/speculative/ngram.py (3)
- tensorrt_llm/_torch/speculative/drafter.py (1): should_use_spec_decode (27-29)
- tensorrt_llm/_torch/speculative/model_drafter.py (1): should_use_spec_decode (402-403)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (1): LlmRequest (264-352)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/model_engine.py
- 531-531: Line too long (130 > 120) (E501)
tensorrt_llm/_torch/speculative/ngram.py
- 203-203: `LlmRequest` may be undefined, or defined from star imports (F405)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (16)
tensorrt_llm/_torch/speculative/ngram.py (1)
2-2: LGTM! Import addition addresses previous feedback. The explicit `List` import resolves the static analysis warning and improves code clarity, as discussed in previous reviews.
tensorrt_llm/_torch/pyexecutor/model_engine.py (15)
92-94: LGTM: Abstract method addition follows proper interface pattern. The addition of the abstract `use_runtime_spec_decode` method to the base `ModelEngine` class is correctly implemented and ensures all concrete implementations must provide this functionality.
59-59: LGTM: Import of helper function improves code modularity. The import of `get_draft_token_length` from `.llm_request` centralizes draft token length handling and improves code maintainability.
284-286: LGTM: Runtime flag initialization. The initialization of runtime flags `spec_config_rt` and `is_spec_decode_rt` from their static counterparts provides a clean foundation for dynamic control.
493-499: LGTM: Runtime spec decode toggle implementation. The `use_runtime_spec_decode` method correctly updates both runtime flags atomically. The implementation properly handles both enabling and disabling cases.
516-528: LGTM: Updated warmup request generation with draft length parameter. The addition of the `draft_len` parameter to `get_cuda_graph_warmup_request` enables proper CUDA graph capture for different draft lengths while maintaining backward compatibility.
731-758: LGTM: Enhanced warmup with multi-draft-length CUDA graph capture. The changes properly implement CUDA graph capture for multiple draft lengths:
- Correctly handles non-draft models by capturing draft_len=0 graphs
- Properly sets runtime spec decode flags during warmup
- Follows good practices by iterating through both batch sizes and draft lengths
760-783: LGTM: Runtime flag restoration after warmup. The code correctly restores the original runtime flag values after warmup completion, ensuring the engine returns to its pre-warmup state.
935-935: LGTM: Runtime draft length calculation. Using `self.spec_config_rt.max_draft_len if self.is_spec_decode_rt else 0` correctly determines the draft length based on the current runtime state.
953-979: LGTM: Nested CUDA graph structure with proper initialization. The restructuring of `_cuda_graphs` from a flat dictionary to a nested dictionary `{batch_size: {draft_len: graph}}` is well-implemented:
- Proper nested dictionary initialization
- Correct key structure for batch_size and draft_len
- Maintains existing API compatibility
1131-1134: LGTM: Updated CUDA graph cleanup for nested structure. The `_release_cuda_graphs` method correctly handles the new nested dictionary structure when cleaning up CUDA graphs.
1249-1250: LGTM: Consistent use of helper function. Using `get_draft_token_length(request)` instead of directly accessing `request.py_draft_tokens` improves code consistency and maintainability.
1268-1270: LGTM: Runtime flag usage in overlap scheduler logic. The code correctly uses `self.is_spec_decode_rt` and `self.spec_config_rt` for runtime decision making in the overlap scheduler logic.
828-828: LGTM: Spec metadata setup using runtime config. The `_set_up_spec_metadata` method correctly uses `self.spec_config_rt` instead of the static config, ensuring spec metadata reflects the current runtime state. Also applies to: 838-838
1150-1150: LGTM: Consistent runtime flag usage throughout the codebase. All references to speculative decoding flags have been correctly updated to use the runtime variants (`is_spec_decode_rt`, `spec_config_rt`) instead of static ones. This ensures dynamic behavior works correctly across all code paths. Also applies to: 1487-1487, 1518-1518, 1525-1525, 1599-1599, 1643-1643
2099-2099: LGTM: Forward method uses runtime flags. The forward method correctly uses `self.is_spec_decode_rt` to determine whether to set up spec metadata and CUDA graph selection, ensuring the runtime state is properly respected during inference. Also applies to: 2135-2141
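Putting the warmup-related comments together, the capture loop can be sketched like this (the `capture_graph` helper and the exact signature of `get_cuda_graph_warmup_request` are assumptions):

```python
def warmup_cuda_graphs(engine, batch_sizes, max_draft_len):
    """Illustrative warmup over (batch_size, draft_len) combinations."""
    # Capture a draft_len == 0 variant so speculation can be disabled at runtime,
    # plus the configured max_draft_len variant when speculation is available.
    draft_lens = [0] + ([max_draft_len] if max_draft_len > 0 else [])

    original_flag = engine.is_spec_decode_rt
    try:
        for batch_size in batch_sizes:
            for draft_len in draft_lens:
                engine.use_runtime_spec_decode(draft_len > 0)
                request = engine.get_cuda_graph_warmup_request(batch_size, draft_len)
                engine.capture_graph(request, batch_size, draft_len)
    finally:
        # Restore the pre-warmup runtime state, as the review points out.
        engine.use_runtime_spec_decode(original_flag)
```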
PR_Github #13052 [ run ] triggered by Bot
PR_Github #13052 [ run ] completed with state
/bot run
PR_Github #13069 [ run ] triggered by Bot
PR_Github #13069 [ run ] completed with state
d4771a5
to
a2a042f
Compare
PR_Github #13605 [ run ] triggered by Bot
Actionable comments posted: 0
🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
518-518: Consider breaking the long line for better readability. The line exceeds the 120-character limit:
```diff
-        available_tokens = kv_cache_manager.get_num_available_tokens(draft_len)
+        available_tokens = kv_cache_manager.get_num_available_tokens(
+            draft_len)
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (21 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (2 hunks)
- tensorrt_llm/_torch/pyexecutor/sampler.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/scheduler.py (2 hunks)
- tensorrt_llm/_torch/speculative/drafter.py (2 hunks)
- tensorrt_llm/_torch/speculative/model_drafter.py (2 hunks)
- tests/integration/test_lists/test-db/l0_b200.yml (1 hunks)
- tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (9)
- tests/integration/test_lists/test-db/l0_b200.yml
- tensorrt_llm/_torch/speculative/drafter.py
- tensorrt_llm/_torch/pyexecutor/scheduler.py
- tensorrt_llm/_torch/speculative/model_drafter.py
- tensorrt_llm/_torch/pyexecutor/llm_request.py
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tensorrt_llm/_torch/pyexecutor/sampler.py
- tests/unittest/_torch/speculative/test_dynamic_spec_decode.py
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
🧬 Code Graph Analysis (1)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (2): append (77-96), append (123-140)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/model_engine.py
- 518-518: Line too long (130 > 120) (E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (16)
tensorrt_llm/_torch/pyexecutor/model_engine.py (16)
59-59: LGTM! Good refactoring to centralize draft token length logic. The import of `get_draft_token_length` helps standardize how draft token lengths are retrieved across the codebase.
281-281: LGTM! Good separation of static configuration from runtime control. The introduction of `enable_spec_decode` as a runtime flag while keeping `is_spec_decode` as a static configuration flag is well-designed for dynamic speculative decoding control.
503-530: LGTM! Good enhancement to support dynamic draft lengths. The changes to `get_cuda_graph_warmup_request` properly parameterize the draft length, enabling CUDA graph capture for different speculative decoding configurations.
718-726: LGTM! Smart approach to support runtime spec decode toggling. The logic to capture CUDA graphs for both `max_draft_len` and 0 (when not in one-engine mode) is well-designed. This allows the system to switch between speculative decoding enabled/disabled at runtime without requiring new graph captures.
732-747: LGTM! Proper management of runtime flag during warmup. The nested loop structure correctly handles both batch sizes and draft lengths, and the dynamic setting of `enable_spec_decode` during warmup ensures the correct CUDA graphs are captured for each configuration.
769-770: LGTM! Proper restoration of original flag value. Resetting `enable_spec_decode` to `is_spec_decode` after warmup ensures the runtime flag returns to its original state.
813-832: LGTM! Proper use of runtime flag in spec metadata setup. The conditional spec_config assignment based on `enable_spec_decode` correctly ensures that spec metadata is only created when speculative decoding is enabled at runtime.
934-934: LGTM! Correct use of runtime flag for draft length calculation. The dynamic calculation of `draft_len` based on `enable_spec_decode` ensures the correct CUDA graph is selected based on the current runtime configuration.
952-978: LGTM! Well-designed nested dictionary structure for CUDA graphs. The enhancement to support a nested dictionary structure indexed by both batch_size and draft_len is excellent. This allows efficient CUDA graph lookup based on the current runtime configuration while maintaining good performance.
1131-1134: LGTM! Proper cleanup of nested CUDA graph structure. The enhanced cleanup logic correctly iterates through both batch_size and draft_len dimensions of the nested dictionary, ensuring all CUDA graph resources are properly released.
1249-1250: LGTM! Consistent use of centralized helper function. The use of `get_draft_token_length()` instead of direct length checks improves code consistency and maintainability across the codebase.
1268-1271: LGTM! Proper runtime flag usage in spec config logic. The conditional spec_config assignment based on `enable_spec_decode` ensures that speculative decoding logic is only applied when it's enabled at runtime.
1488-1488: LGTM! Consistent application of runtime flag throughout input preparation. The systematic replacement of `is_spec_decode` with `enable_spec_decode` throughout the input preparation logic ensures that all speculative decoding behavior respects the runtime configuration. Also applies to: 1519-1526, 1600-1600, 1644-1644
2100-2115: LGTM! Proper conditional spec metadata setup in forward pass. The conditional setup of spec_metadata based on `enable_spec_decode` ensures that speculative decoding resources are only allocated and configured when needed at runtime.
2134-2143: LGTM! Consistent runtime flag usage in CUDA graph and metadata handling. The use of `enable_spec_decode` for both CUDA graph selection and spec_metadata handling ensures consistent behavior throughout the forward pass based on the runtime configuration.
281-281: Excellent implementation of dynamic speculative decoding feature. This implementation demonstrates excellent software engineering practices:
- Clean separation of concerns: static configuration (`is_spec_decode`) vs. runtime control (`enable_spec_decode`)
- Performance preservation: CUDA graphs are pre-captured for different configurations to avoid runtime overhead
- Comprehensive coverage: the runtime flag is consistently applied throughout input preparation, metadata setup, and forward execution
- Smart resource management: the nested dictionary structure for CUDA graphs allows efficient lookup while supporting multiple configurations
The complexity introduced is well-justified by the functionality gained, and the implementation maintains backward compatibility while adding powerful dynamic control capabilities.
Also applies to: 718-747, 934-978
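The "conditional spec_config" pattern highlighted above reduces to something like this small helper (illustrative, not the method in model_engine.py):

```python
def effective_draft_len(engine) -> int:
    """Illustrative: gate spec-decode branches on a single conditional spec_config."""
    spec_config = engine.spec_config if engine.enable_spec_decode else None
    return spec_config.max_draft_len if spec_config is not None else 0
```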
PR_Github #13605 [ run ] completed with state
Signed-off-by: ziyixiong-nv <[email protected]>
Signed-off-by: ziyixiong-nv <[email protected]>
Signed-off-by: ziyixiong-nv <[email protected]>
fae7a0f
to
fb786ed
Compare
/bot run
Actionable comments posted: 0
🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
518-518: Address line length violation. Line 518 exceeds the 120-character limit. Consider breaking the function parameters across multiple lines for better readability. Note: The actual fix would depend on the full context of the line, but the principle is to break long parameter lists across multiple lines.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (21 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (2 hunks)
- tensorrt_llm/_torch/pyexecutor/sampler.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/scheduler.py (2 hunks)
- tensorrt_llm/_torch/speculative/drafter.py (2 hunks)
- tensorrt_llm/_torch/speculative/model_drafter.py (2 hunks)
- tests/integration/test_lists/test-db/l0_b200.yml (1 hunks)
- tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (9)
- tests/integration/test_lists/test-db/l0_b200.yml
- tensorrt_llm/_torch/speculative/model_drafter.py
- tensorrt_llm/_torch/pyexecutor/llm_request.py
- tensorrt_llm/_torch/speculative/drafter.py
- tensorrt_llm/_torch/pyexecutor/scheduler.py
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tensorrt_llm/_torch/pyexecutor/sampler.py
- tests/unittest/_torch/speculative/test_dynamic_spec_decode.py
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
🧰 Additional context used
🧬 Code Graph Analysis (1)
tensorrt_llm/_torch/pyexecutor/model_engine.py (6)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (1): get_draft_token_length (482-493)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (1): get_num_available_tokens (560-562)
- tensorrt_llm/logger.py (1): info (137-138)
- tensorrt_llm/_torch/attention_backend/interface.py (4): forward (575-598), create_cuda_graph_metadata (282-322), num_contexts (202-203), num_contexts (206-209)
- tensorrt_llm/_torch/speculative/utils.py (2): get_spec_metadata (14-57), get_num_extra_kv_tokens (158-168)
- tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (1): DecodingCUDAGraphRunner (28-139)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/model_engine.py
- 518-518: Line too long (130 > 120) (E501)
🔇 Additional comments (13)
tensorrt_llm/_torch/pyexecutor/model_engine.py (13)
59-59: LGTM! Good centralization of draft token length logic. The import of the `get_draft_token_length` utility function supports the centralized approach to retrieving draft token lengths, improving code consistency and maintainability.
281-281: LGTM! Runtime flag initialization is well-designed. The `enable_spec_decode` flag properly initializes from `is_spec_decode` and provides the foundation for dynamic runtime control of speculative decoding.
503-545: LGTM! Parameterization of warmup request generation. The addition of the `draft_len` parameter to `get_cuda_graph_warmup_request` properly supports generating warmup requests for different draft token lengths, enabling the multi-draft-length CUDA graph approach.
718-726: Excellent logic for supporting dynamic speculative decoding. The logic to capture CUDA graphs for both `max_draft_len` and `0` draft lengths enables runtime toggling of speculative decoding. The conditional exclusion of one-engine mode is correct since those implementations cannot disable spec decode at runtime.
732-747: LGTM! Proper runtime flag management during warmup. The nested loop structure correctly captures CUDA graphs for each batch size and draft length combination. Setting `enable_spec_decode` based on `draft_len > 0 or self.is_draft_model` ensures the correct behavior during graph capture.
769-770: Important restoration of original runtime flag. Correctly restores `enable_spec_decode` to its original value after warmup, ensuring the runtime state is not affected by the warmup process.
934-934: Correct use of runtime flag for draft length determination. Using `self.enable_spec_decode` instead of the static configuration properly supports dynamic runtime control of speculative decoding.
952-954: Well-implemented nested CUDA graph structure. The change from a flat dictionary to a nested dictionary structure (`batch_size -> draft_length -> graph`) properly supports storing multiple CUDA graph instances per batch size for different draft token lengths. The initialization logic correctly handles the nested structure. Also applies to: 971-978
813-832: Clean pattern for conditional spec config usage. The use of a local `spec_config` variable set from `self.spec_config if self.enable_spec_decode else None` provides a clean, consistent way to conditionally enable speculative decoding throughout the method based on the runtime flag.
1249-1250: Consistent use of centralized draft token length utility. The replacement of direct `py_draft_tokens` length checks with `get_draft_token_length()` calls improves consistency and safety by using the centralized utility function that properly handles None cases. Also applies to: 1290-1290
1150-1150: Consistent runtime flag usage throughout input preparation. The consistent use of `self.enable_spec_decode` instead of `self.is_spec_decode` throughout input preparation methods properly supports dynamic runtime control of speculative decoding. Also applies to: 1488-1488, 1519-1519, 1644-1644
2100-2100: Proper runtime flag usage in forward method. The forward method correctly uses `self.enable_spec_decode` to conditionally set up spec metadata and determine CUDA graph usage, supporting the dynamic speculative decoding feature. Also applies to: 2140-2143
1131-1134: Correctly updated CUDA graph cleanup for nested structure. The `_release_cuda_graphs` method properly handles the new nested dictionary structure, iterating through both batch sizes and draft lengths to clean up all graph instances.
PR_Github #13669 [ run ] triggered by Bot
Added unittest takes ~1 minute, which is ok, so I'm approving CI-time wise.
I think coderabbit re-requested review, which revoked my previous approval
PR_Github #13669 [ run ] completed with state
…VIDIA#6363) Signed-off-by: ziyixiong-nv <[email protected]> Signed-off-by: Lanyu Liao <[email protected]>
…VIDIA#6363) Signed-off-by: ziyixiong-nv <[email protected]>
Summary by CodeRabbit
New Features
Bug Fixes
Tests
Chores
Description
Main changes:
- Add `use_runtime_spec_decode` to enable or disable speculative decoding in the ModelEngine.
- Add `should_use_spec_decode` to determine whether to enable speculative decoding or not.
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run `/bot [-h|--help]` to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
--disable-fail-fast (OPTIONAL): Disable fail fast on build/tests/infra failures.
--skip-test (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx" (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe" (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp" (OPTIONAL): Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test (OPTIONAL): Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
--post-merge (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL): Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.