[TRTLLM-6392][feat] Support turning on/off spec decoding dynamically #6363
Conversation
📝 Walkthrough
This change introduces dynamic runtime control for speculative decoding in the TensorRT-LLM PyTorch executor. A new utility function, `get_draft_token_length`, centralizes draft token length retrieval, drafters expose a `should_use_spec_decode` hook, and the model engine can enable or disable speculative decoding at runtime.
Changes
Sequence Diagram(s)
```mermaid
sequenceDiagram
    participant User
    participant PyExecutor
    participant Drafter
    participant ModelEngine
    User->>PyExecutor: submit requests
    PyExecutor->>Drafter: should_use_spec_decode(requests)
    Drafter-->>PyExecutor: True/False
    PyExecutor->>ModelEngine: set enable_spec_decode (runtime)
    PyExecutor->>PyExecutor: prepare draft tokens if enabled
    PyExecutor->>ModelEngine: forward(requests)
    ModelEngine->>ModelEngine: select CUDA graph by batch/draft length
    ModelEngine-->>PyExecutor: output
    PyExecutor-->>User: return result
```
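To make the flow concrete, here is a minimal Python sketch of one executor iteration. The names `drafter`, `model_engine`, `scheduled_requests`, and the attribute/method names are assumptions drawn from the walkthrough, not the actual py_executor.py code.

```python
def run_iteration(drafter, model_engine, scheduled_requests):
    """Illustrative only: one executor step with dynamic spec-decode control."""
    # Ask the drafter whether speculation is worthwhile for this batch.
    use_spec_decode = drafter.should_use_spec_decode(scheduled_requests)

    # Tell the engine, so it selects the CUDA graph captured for the matching
    # draft length (draft length 0 when speculation is off).
    model_engine.enable_spec_decode = use_spec_decode

    # Only pay the drafting cost when speculation is actually enabled.
    if use_spec_decode:
        drafter.prepare_draft_tokens(scheduled_requests)

    return model_engine.forward(scheduled_requests)
```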
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Actionable comments posted: 2
🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
731-784: Well-designed multi-draft-length CUDA graph capture logic. The implementation correctly:
- Captures graphs for draft length 0 to support runtime disabling
- Manages runtime state appropriately during warmup
- Resets state after warmup completion
Consider breaking line 531 to respect the 120-character limit:
```diff
-        available_tokens = kv_cache_manager.get_num_available_tokens(draft_len)
+        available_tokens = kv_cache_manager.get_num_available_tokens(
+            draft_len)
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
- tensorrt_llm/_torch/models/modeling_speculative.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (23 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (2 hunks)
- tensorrt_llm/_torch/pyexecutor/sampler.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/scheduler.py (2 hunks)
- tensorrt_llm/_torch/speculative/drafter.py (2 hunks)
- tensorrt_llm/_torch/speculative/model_drafter.py (3 hunks)
- tensorrt_llm/_torch/speculative/ngram.py (1 hunks)
- tests/integration/defs/accuracy/test_llm_api_pytorch.py (1 hunks)
- tests/integration/test_lists/test-db/l0_b200.yml (1 hunks)
- tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
tensorrt_llm/_torch/speculative/ngram.py
tensorrt_llm/_torch/pyexecutor/sampler.py
tensorrt_llm/_torch/pyexecutor/llm_request.py
tensorrt_llm/_torch/pyexecutor/py_executor.py
tensorrt_llm/_torch/pyexecutor/resource_manager.py
tensorrt_llm/_torch/speculative/drafter.py
tensorrt_llm/_torch/pyexecutor/scheduler.py
tests/unittest/_torch/speculative/test_dynamic_spec_decode.py
tensorrt_llm/_torch/speculative/model_drafter.py
tests/integration/defs/accuracy/test_llm_api_pytorch.py
tensorrt_llm/_torch/models/modeling_speculative.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
tensorrt_llm/_torch/speculative/ngram.py
tensorrt_llm/_torch/pyexecutor/sampler.py
tensorrt_llm/_torch/pyexecutor/llm_request.py
tensorrt_llm/_torch/pyexecutor/py_executor.py
tensorrt_llm/_torch/pyexecutor/resource_manager.py
tensorrt_llm/_torch/speculative/drafter.py
tensorrt_llm/_torch/pyexecutor/scheduler.py
tests/unittest/_torch/speculative/test_dynamic_spec_decode.py
tensorrt_llm/_torch/speculative/model_drafter.py
tests/integration/defs/accuracy/test_llm_api_pytorch.py
tensorrt_llm/_torch/models/modeling_speculative.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/speculative/ngram.py
- 202-202: `List` may be undefined, or defined from star imports (F405)
- 202-202: `LlmRequest` may be undefined, or defined from star imports (F405)
tests/integration/defs/accuracy/test_llm_api_pytorch.py
- 104-104: Undefined name `_with_ids` (F821)
tensorrt_llm/_torch/pyexecutor/model_engine.py
- 531-531: Line too long (130 > 120) (E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (37)
tensorrt_llm/_torch/speculative/drafter.py (2)
2-4: LGTM: Clean import additions for new abstract method. The import additions for `List` and `LlmRequest` are correctly placed and necessary for the new method signature.
26-29: LGTM: Well-designed abstract method for dynamic spec decode control. The `should_use_spec_decode` method follows the established pattern for abstract methods in this class, with proper typing and documentation. This provides a clean interface for drafter implementations to control spec decode usage dynamically.
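For orientation, a minimal sketch of what such an abstract hook can look like; the exact signature and docstring in drafter.py may differ.

```python
from abc import ABC, abstractmethod
from typing import List


class Drafter(ABC):
    """Sketch of a drafter base class exposing the dynamic-control hook."""

    @abstractmethod
    def should_use_spec_decode(self, requests: List["LlmRequest"]) -> bool:
        """Return True if speculative decoding should be used for this batch of requests."""
```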
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
482-493: LGTM: Well-implemented utility function with proper defensive programming. The `get_draft_token_length` function is cleanly implemented with:
- Proper type hints and Google-style docstring
- Defensive handling of None values
- Clear, single-responsibility design
- Consistent naming following snake_case convention
This centralizes draft token length access and improves code maintainability across the speculative decoding pipeline.
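Given the properties listed above, the helper presumably looks roughly like this (a sketch, not the code under review):

```python
def get_draft_token_length(request) -> int:
    """Return the number of draft tokens attached to a request.

    Returns 0 when ``py_draft_tokens`` is absent or ``None``, so callers never
    need their own None checks.
    """
    draft_tokens = getattr(request, "py_draft_tokens", None)
    return len(draft_tokens) if draft_tokens is not None else 0
```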
tests/integration/test_lists/test-db/l0_b200.yml (1)
62-62: LGTM: Appropriate test configuration for dynamic spec decode feature. The addition of the dynamic spec decode test to the B200 GPU test suite is well-placed, providing integration test coverage for the new runtime spec decode control functionality.
tensorrt_llm/_torch/pyexecutor/resource_manager.py (3)
18-19: LGTM! Clean import addition. The import of `get_draft_token_length` is properly added to the existing import statement, maintaining namespace consistency.
357-358: Excellent refactoring for safer draft token handling. Replacing direct access to `len(req.py_draft_tokens)` with `get_draft_token_length(req)` improves safety by using a centralized utility that handles edge cases where draft tokens may be absent.
362-363: Consistent application of the draft token utility. This change maintains consistency with the context batch processing by using the same safe `get_draft_token_length(req)` utility function.
tensorrt_llm/_torch/pyexecutor/sampler.py (3)
26-26: Clean import addition. The import of `get_draft_token_length` is properly integrated into the existing import statement.
340-340: Safer conditional check for draft tokens. Using `get_draft_token_length(req) > 0` instead of direct length access prevents potential AttributeError exceptions when draft tokens are absent.
404-404: Consistent application of draft token utility in calculations. The change maintains the same logic while using the safer `get_draft_token_length(req)` utility for calculating processing steps.
tensorrt_llm/_torch/speculative/model_drafter.py (3)
11-12: Proper import addition. The import of `get_draft_token_length` is correctly added to the existing import statement, following the established pattern.
322-322: Safe draft token length access in padding logic. Using `get_draft_token_length(req)` ensures safe access to draft token length when calculating padding requirements.
402-403: New method implements dynamic speculative decoding interface. The `should_use_spec_decode` method correctly implements the abstract method from the base `Drafter` class. The unconditional `True` return makes sense for `ModelDrafter` since model-based drafting should generally always be used when available.
tensorrt_llm/_torch/pyexecutor/py_executor.py (3)
829-832: LGTM! Dynamic speculative decoding control implemented correctly. The implementation properly queries the drafter to determine whether speculative decoding should be enabled and notifies the model engine accordingly. This enables runtime toggling of speculative decoding as intended by the PR objectives.
875-875: Good conditional guarding for draft token preparation. The additional check for `self.use_spec_decode` ensures draft tokens are only prepared when speculative decoding is actually enabled, preventing unnecessary work when the feature is dynamically disabled.
931-931: Appropriate conditional logic for draft token allocation. The check for `self.use_spec_decode` ensures draft tokens are only allocated when speculative decoding is enabled, completing the runtime control implementation by preventing unnecessary resource allocation.
tensorrt_llm/_torch/pyexecutor/scheduler.py (2)
8-8: Good addition of utility function import. The import of `get_draft_token_length` supports the safer draft token length handling pattern being adopted across the codebase.
188-188: Improved safety with utility function usage. Replacing direct length access with `get_draft_token_length(request)` provides safer handling of draft token lengths, likely including proper None checking and edge case handling.
tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (4)
17-21: Excellent resource checking for memory-intensive test. The memory check with clear skip message prevents test failures on systems with insufficient GPU memory. The 35GB requirement is appropriate for loading both target and draft models.
54-63: Well-designed mock for testing dynamic behavior. The mock implementation with call counting effectively simulates dynamic toggling of speculative decoding, enabling validation of the runtime control feature. The logic is clear and deterministic.
61-78: Well-structured test execution with proper resource management. The test properly uses a context manager for mocking, ensures deterministic output with temperature=0, and includes proper resource cleanup with explicit shutdown calls.
85-87: Appropriate correctness validation for speculative decoding. The direct comparison of outputs between speculative and non-speculative modes effectively validates that the dynamic toggling maintains correctness, with clear documentation of the expectation.
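A rough pytest-style sketch of the mocking pattern these comments describe. The patch target, the `run_llm` helper, and the sampling knobs are placeholders, not the actual test fixtures.

```python
from unittest import mock


def test_dynamic_spec_decode_sketch(run_llm):
    """Illustrative only: alternate spec decode on/off and compare outputs."""
    call_count = 0

    def fake_should_use_spec_decode(self, requests):
        nonlocal call_count
        call_count += 1
        # Enable speculation only on every other iteration to exercise both paths.
        return call_count % 2 == 1

    target = ("tensorrt_llm._torch.speculative.model_drafter."
              "ModelDrafter.should_use_spec_decode")  # placeholder patch target
    with mock.patch(target, new=fake_should_use_spec_decode):
        spec_outputs = run_llm(speculative=True, temperature=0)  # deterministic

    ref_outputs = run_llm(speculative=False, temperature=0)

    # Dynamic toggling must not change the generated text.
    assert spec_outputs == ref_outputs
```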
tensorrt_llm/_torch/models/modeling_speculative.py (3)
133-135: Good defensive programming with spec_metadata None check. The None check prevents AttributeError when speculative metadata is not available, which is important for the dynamic speculative decoding feature where metadata may be conditionally present.
253-254: Appropriate early return for missing speculative metadata. The early None return prevents attempting to access speculative metadata methods when unavailable, which is essential for supporting dynamic speculative decoding where metadata may not always be present.
387-387: Enhanced conditional check for speculative decoding prerequisites. The additional `spec_metadata is not None` check ensures both the draft model and speculative metadata are available before proceeding, preventing runtime errors in dynamic speculative decoding scenarios.
tensorrt_llm/_torch/pyexecutor/model_engine.py (12)
92-95: LGTM! The abstract method is properly defined and follows the established pattern for abstract methods in the base class.
284-287: LGTM! The runtime state fields are properly initialized from their static counterparts, enabling dynamic control while preserving the original configuration.
493-500: LGTM! The implementation correctly manages the runtime speculative decoding state, with appropriate handling of both enabled and disabled states.
516-558: LGTM! The parameterization of `get_cuda_graph_warmup_request` with `draft_len` correctly enables CUDA graph capture for different draft lengths.
828-844: LGTM! Correctly uses runtime configuration (`spec_config_rt`) for spec metadata setup, enabling dynamic control.
935-980: LGTM! The CUDA graph management correctly handles multiple draft lengths with a proper nested dictionary structure and conditional spec metadata creation based on runtime state.
1131-1137: LGTM! The cleanup logic correctly handles the nested dictionary structure for multi-draft-length CUDA graphs.
1150-1169: LGTM! Correctly uses runtime flag (`is_spec_decode_rt`) for conditional preprocessing logic.
1249-1290: LGTM! Correctly uses the `get_draft_token_length` utility function for safe access to draft token lengths.
1268-1270: LGTM! All runtime state checks and configuration accesses correctly use the runtime fields (`is_spec_decode_rt` and `spec_config_rt`). Also applies to: 1293-1294, 1487-1490, 1518-1520, 1525-1525, 1599-1599
1643-1646: LGTM! Correctly uses runtime flag for conditional logic in the no-cache path.
2099-2142: LGTM! The forward method correctly handles spec metadata setup based on runtime state for both CUDA graph and regular execution paths.
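The nested graph bookkeeping described in the comments above can be pictured roughly as follows; the capture callback and runner type are stand-ins, not the engine's real API.

```python
from typing import Callable, Dict

# Sketch: CUDA graph cache keyed by batch size, then by draft length.
G_CUDA_GRAPHS: Dict[int, Dict[int, object]] = {}


def get_or_capture_graph(batch_size: int, draft_len: int,
                         capture_fn: Callable[[int, int], object]) -> object:
    """Return the graph for (batch_size, draft_len), capturing it on first use."""
    per_batch = G_CUDA_GRAPHS.setdefault(batch_size, {})
    if draft_len not in per_batch:
        per_batch[draft_len] = capture_fn(batch_size, draft_len)
    return per_batch[draft_len]


def release_graphs() -> None:
    """Mirror of the cleanup the review describes: drop every captured graph."""
    for per_batch in G_CUDA_GRAPHS.values():
        per_batch.clear()
    G_CUDA_GRAPHS.clear()
```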
ba7ee74
to
e742c1b
Compare
/bot run
PR_Github #13004 [ run ] triggered by Bot
Actionable comments posted: 0
♻️ Duplicate comments (1)
tensorrt_llm/_torch/speculative/ngram.py (1)
203-204: LGTM! Method implementation follows established pattern. The `should_use_spec_decode` method correctly implements the abstract interface from the base `Drafter` class. Returning `True` is appropriate for the NGram drafter, as it should consistently enable speculative decoding when available. The static analysis warning about `LlmRequest` being undefined is a false positive, as discussed in previous reviews; `LlmRequest` is available through the star import.
🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
744-783: LGTM: Nested warmup loop properly implemented. The nested iteration over batch sizes and draft lengths ensures CUDA graphs are captured for all necessary combinations. The runtime spec decode toggling during warmup is correctly implemented.
Note: Line 531 exceeds the 120-character limit as flagged by static analysis. Consider breaking it into multiple lines. However, this appears to be a formatting issue that doesn't affect functionality.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
- tensorrt_llm/_torch/models/modeling_speculative.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (23 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (2 hunks)
- tensorrt_llm/_torch/pyexecutor/sampler.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/scheduler.py (2 hunks)
- tensorrt_llm/_torch/speculative/drafter.py (2 hunks)
- tensorrt_llm/_torch/speculative/model_drafter.py (3 hunks)
- tensorrt_llm/_torch/speculative/ngram.py (2 hunks)
- tests/integration/test_lists/test-db/l0_b200.yml (1 hunks)
- tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (1 hunks)
✅ Files skipped from review due to trivial changes (3)
- tensorrt_llm/_torch/pyexecutor/llm_request.py
- tests/integration/test_lists/test-db/l0_b200.yml
- tensorrt_llm/_torch/pyexecutor/sampler.py
🚧 Files skipped from review as they are similar to previous changes (7)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
- tensorrt_llm/_torch/speculative/drafter.py
- tensorrt_llm/_torch/speculative/model_drafter.py
- tensorrt_llm/_torch/pyexecutor/scheduler.py
- tests/unittest/_torch/speculative/test_dynamic_spec_decode.py
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tensorrt_llm/_torch/models/modeling_speculative.py
🧰 Additional context used
🧬 Code Graph Analysis (2)
tensorrt_llm/_torch/pyexecutor/model_engine.py (6)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (1): get_draft_token_length (482-493)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (1): get_num_available_tokens (545-547)
- tensorrt_llm/logger.py (1): info (137-138)
- tensorrt_llm/_torch/speculative/interface.py (2): support_overlap_scheduler (55-56), extend_ctx (83-94)
- tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (1): DecodingCUDAGraphRunner (28-139)
- tensorrt_llm/_torch/speculative/utils.py (1): get_num_extra_kv_tokens (158-168)
tensorrt_llm/_torch/speculative/ngram.py (3)
- tensorrt_llm/_torch/speculative/drafter.py (1): should_use_spec_decode (27-29)
- tensorrt_llm/_torch/speculative/model_drafter.py (1): should_use_spec_decode (402-403)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (1): LlmRequest (264-352)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/model_engine.py
- 531-531: Line too long (130 > 120) (E501)
tensorrt_llm/_torch/speculative/ngram.py
- 203-203: `LlmRequest` may be undefined, or defined from star imports (F405)
🔇 Additional comments (14)
tensorrt_llm/_torch/speculative/ngram.py (1)
2-2: LGTM! Import addition addresses previous feedback. This explicit import of `List` improves code clarity and resolves the static analysis concerns discussed in previous reviews.
tensorrt_llm/_torch/pyexecutor/model_engine.py (13)
59-59: LGTM: Import supports centralized draft token length handling. The import of `get_draft_token_length` aligns with the refactoring to centralize draft token length retrieval logic.
92-94: LGTM: Abstract method properly defines the interface. The abstract method `use_runtime_spec_decode` establishes a clear contract for implementations to enable/disable speculative decoding at runtime.
284-286: LGTM: Runtime flags properly initialized. The runtime flags are correctly initialized from their static counterparts, providing the expected baseline behavior.
493-499: LGTM: Runtime spec decode toggling correctly implemented. The method properly toggles runtime flags and configuration, enabling dynamic control over speculative decoding behavior.
516-516: LGTM: Function signature updated to support draft length. The addition of the `draft_len` parameter enables generating warmup requests with specific draft token lengths.
528-542: LGTM: Draft length parameter properly utilized. The function correctly uses the `draft_len` parameter when adding dummy requests, ensuring proper warmup for different draft token configurations.
731-738: LGTM: Draft length array logic is sound. The logic correctly determines which draft lengths to capture CUDA graphs for:
- Always include 0 for non-draft models (enables runtime disabling)
- Include max_draft_len when speculative decoding is configured
782-783: LGTM: Runtime flags properly restored after warmup. The runtime flags are correctly reset to their original values after warmup completion, ensuring the engine returns to its initial state.
935-935: LGTM: Runtime config used for draft length calculation. The code correctly uses the runtime configuration to determine the current draft length for CUDA graph selection.
953-979: LGTM: CUDA graph nested dictionary structure properly implemented. The transition to a nested dictionary structure (batch_size -> draft_len -> graph) correctly supports multiple CUDA graphs per batch size for different draft lengths. The initialization and access patterns are consistent.
1131-1134: LGTM: CUDA graph cleanup handles nested structure. The cleanup method correctly iterates through the nested dictionary structure to delete all CUDA graphs.
828-838: LGTM: Runtime flags consistently used throughout. All references to speculative decoding configuration and flags correctly use the runtime versions (`is_spec_decode_rt`, `spec_config_rt`), ensuring dynamic behavior works consistently across metadata creation, input preparation, and conditional logic. Also applies to: 935-935, 962-971, 1099-1141, 1487-1487, 1518-1525, 1599-1599, 1643-1643
1249-1249: LGTM: Centralized draft token length function properly utilized. The code correctly uses `get_draft_token_length(request)` instead of directly accessing `request.py_draft_tokens`, supporting consistent draft token length handling across the codebase. Also applies to: 1289-1289
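In spirit, the runtime toggle behaves like the following toy sketch (field names follow this review's `*_rt` naming; the real method in model_engine.py does more):

```python
class EngineStateSketch:
    """Toy stand-in for the engine's static vs. runtime spec-decode state."""

    def __init__(self, spec_config):
        self.spec_config = spec_config              # static configuration
        self.is_spec_decode = spec_config is not None
        # Runtime views start out mirroring the static configuration.
        self.spec_config_rt = spec_config
        self.is_spec_decode_rt = self.is_spec_decode

    def use_runtime_spec_decode(self, enable: bool) -> None:
        # Never enable at runtime what the static config does not provide.
        self.is_spec_decode_rt = enable and self.is_spec_decode
        self.spec_config_rt = self.spec_config if self.is_spec_decode_rt else None
```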
PR_Github #13004 [ run ] completed with state
/bot run
PR_Github #13019 [ run ] triggered by Bot
PR_Github #13019 [ run ] completed with state
e742c1b
to
d4771a5
Compare
/bot run
Actionable comments posted: 0
🧹 Nitpick comments (2)
tensorrt_llm/_torch/speculative/ngram.py (1)
203-204: Implementation looks correct and follows established patterns. The method correctly implements the abstract interface from the base `Drafter` class and follows the same pattern as `ModelDrafter`. The logic of always returning `True` is appropriate for the NGram drafter.
Note: Static analysis still flags `LlmRequest` as potentially undefined due to the star import. While functionally correct, you could optionally add an explicit import to silence the warning:
```diff
 from ..pyexecutor.llm_request import *
+from ..pyexecutor.llm_request import LlmRequest  # Explicit import for type hints
```
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
531-531: Fix line length violation. The line exceeds the 120-character limit as flagged by static analysis.
```diff
-        available_tokens = kv_cache_manager.get_num_available_tokens(draft_len)
+        available_tokens = kv_cache_manager.get_num_available_tokens(
+            draft_len)
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
- tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (1 hunks)
- tensorrt_llm/_torch/models/modeling_speculative.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (23 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (2 hunks)
- tensorrt_llm/_torch/pyexecutor/sampler.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/scheduler.py (2 hunks)
- tensorrt_llm/_torch/speculative/drafter.py (2 hunks)
- tensorrt_llm/_torch/speculative/model_drafter.py (3 hunks)
- tensorrt_llm/_torch/speculative/ngram.py (2 hunks)
- tests/integration/test_lists/test-db/l0_b200.yml (1 hunks)
- tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (1 hunks)
✅ Files skipped from review due to trivial changes (2)
- tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
- tests/integration/test_lists/test-db/l0_b200.yml
🚧 Files skipped from review as they are similar to previous changes (9)
- tensorrt_llm/_torch/pyexecutor/llm_request.py
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
- tensorrt_llm/_torch/speculative/model_drafter.py
- tensorrt_llm/_torch/speculative/drafter.py
- tensorrt_llm/_torch/pyexecutor/sampler.py
- tensorrt_llm/_torch/pyexecutor/scheduler.py
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tests/unittest/_torch/speculative/test_dynamic_spec_decode.py
- tensorrt_llm/_torch/models/modeling_speculative.py
🧰 Additional context used
🧬 Code Graph Analysis (2)
tensorrt_llm/_torch/pyexecutor/model_engine.py (4)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (1): get_draft_token_length (482-493)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (1): get_num_available_tokens (545-547)
- tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (1): DecodingCUDAGraphRunner (28-139)
- tensorrt_llm/_torch/speculative/utils.py (1): get_num_extra_kv_tokens (158-168)
tensorrt_llm/_torch/speculative/ngram.py (3)
- tensorrt_llm/_torch/speculative/drafter.py (1): should_use_spec_decode (27-29)
- tensorrt_llm/_torch/speculative/model_drafter.py (1): should_use_spec_decode (402-403)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (1): LlmRequest (264-352)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/model_engine.py
- 531-531: Line too long (130 > 120) (E501)
tensorrt_llm/_torch/speculative/ngram.py
- 203-203: `LlmRequest` may be undefined, or defined from star imports (F405)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (16)
tensorrt_llm/_torch/speculative/ngram.py (1)
2-2: LGTM! Import addition addresses previous feedback. The explicit `List` import resolves the static analysis warning and improves code clarity, as discussed in previous reviews.
tensorrt_llm/_torch/pyexecutor/model_engine.py (15)
92-94: LGTM: Abstract method addition follows proper interface pattern. The addition of the abstract `use_runtime_spec_decode` method to the base `ModelEngine` class is correctly implemented and ensures all concrete implementations must provide this functionality.
59-59: LGTM: Import of helper function improves code modularity. The import of `get_draft_token_length` from `.llm_request` centralizes draft token length handling and improves code maintainability.
284-286: LGTM: Runtime flag initialization. The initialization of runtime flags `spec_config_rt` and `is_spec_decode_rt` from their static counterparts provides a clean foundation for dynamic control.
493-499: LGTM: Runtime spec decode toggle implementation. The `use_runtime_spec_decode` method correctly updates both runtime flags atomically. The implementation properly handles both enabling and disabling cases.
516-528: LGTM: Updated warmup request generation with draft length parameter. The addition of the `draft_len` parameter to `get_cuda_graph_warmup_request` enables proper CUDA graph capture for different draft lengths while maintaining backward compatibility.
731-758: LGTM: Enhanced warmup with multi-draft-length CUDA graph capture. The changes properly implement CUDA graph capture for multiple draft lengths:
- Correctly handles non-draft models by capturing draft_len=0 graphs
- Properly sets runtime spec decode flags during warmup
- Follows good practices by iterating through both batch sizes and draft lengths
760-783: LGTM: Runtime flag restoration after warmup. The code correctly restores the original runtime flag values after warmup completion, ensuring the engine returns to its pre-warmup state.
935-935: LGTM: Runtime draft length calculation. Using `self.spec_config_rt.max_draft_len if self.is_spec_decode_rt else 0` correctly determines the draft length based on the current runtime state.
953-979: LGTM: Nested CUDA graph structure with proper initialization. The restructuring of `_cuda_graphs` from a flat dictionary to a nested dictionary `{batch_size: {draft_len: graph}}` is well-implemented:
- Proper nested dictionary initialization
- Correct key structure for batch_size and draft_len
- Maintains existing API compatibility
1131-1134: LGTM: Updated CUDA graph cleanup for nested structure. The `_release_cuda_graphs` method correctly handles the new nested dictionary structure when cleaning up CUDA graphs.
1249-1250: LGTM: Consistent use of helper function. Using `get_draft_token_length(request)` instead of directly accessing `request.py_draft_tokens` improves code consistency and maintainability.
1268-1270: LGTM: Runtime flag usage in overlap scheduler logic. The code correctly uses `self.is_spec_decode_rt` and `self.spec_config_rt` for runtime decision making in the overlap scheduler logic.
828-828: LGTM: Spec metadata setup using runtime config. The `_set_up_spec_metadata` method correctly uses `self.spec_config_rt` instead of the static config, ensuring spec metadata reflects the current runtime state. Also applies to: 838-838
1150-1150: LGTM: Consistent runtime flag usage throughout the codebase. All references to speculative decoding flags have been correctly updated to use the runtime variants (`is_spec_decode_rt`, `spec_config_rt`) instead of static ones. This ensures dynamic behavior works correctly across all code paths. Also applies to: 1487-1487, 1518-1518, 1525-1525, 1599-1599, 1643-1643
2099-2099: LGTM: Forward method uses runtime flags. The forward method correctly uses `self.is_spec_decode_rt` to determine whether to set up spec metadata and CUDA graph selection, ensuring the runtime state is properly respected during inference. Also applies to: 2135-2141
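Putting the warmup-related comments together, the capture loop can be sketched like this (the `capture_graph` helper and the exact signature of `get_cuda_graph_warmup_request` are assumptions):

```python
def warmup_cuda_graphs(engine, batch_sizes, max_draft_len):
    """Illustrative warmup over (batch_size, draft_len) combinations."""
    # Capture a draft_len == 0 variant so speculation can be disabled at runtime,
    # plus the configured max_draft_len variant when speculation is available.
    draft_lens = [0] + ([max_draft_len] if max_draft_len > 0 else [])

    original_flag = engine.is_spec_decode_rt
    try:
        for batch_size in batch_sizes:
            for draft_len in draft_lens:
                engine.use_runtime_spec_decode(draft_len > 0)
                request = engine.get_cuda_graph_warmup_request(batch_size, draft_len)
                engine.capture_graph(request, batch_size, draft_len)
    finally:
        # Restore the pre-warmup runtime state, as the review points out.
        engine.use_runtime_spec_decode(original_flag)
```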
PR_Github #13052 [ run ] triggered by Bot
PR_Github #13052 [ run ] completed with state
/bot run
PR_Github #13069 [ run ] triggered by Bot
PR_Github #13069 [ run ] completed with state
d4771a5
to
a2a042f
Compare
PR_Github #13605 [ run ] triggered by Bot
Actionable comments posted: 0
🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
518-518: Consider breaking the long line for better readability. The line exceeds the 120-character limit:
```diff
-        available_tokens = kv_cache_manager.get_num_available_tokens(draft_len)
+        available_tokens = kv_cache_manager.get_num_available_tokens(
+            draft_len)
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (21 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (2 hunks)
- tensorrt_llm/_torch/pyexecutor/sampler.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/scheduler.py (2 hunks)
- tensorrt_llm/_torch/speculative/drafter.py (2 hunks)
- tensorrt_llm/_torch/speculative/model_drafter.py (2 hunks)
- tests/integration/test_lists/test-db/l0_b200.yml (1 hunks)
- tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (9)
- tests/integration/test_lists/test-db/l0_b200.yml
- tensorrt_llm/_torch/speculative/drafter.py
- tensorrt_llm/_torch/pyexecutor/scheduler.py
- tensorrt_llm/_torch/speculative/model_drafter.py
- tensorrt_llm/_torch/pyexecutor/llm_request.py
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tensorrt_llm/_torch/pyexecutor/sampler.py
- tests/unittest/_torch/speculative/test_dynamic_spec_decode.py
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
🧬 Code Graph Analysis (1)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (2): append (77-96), append (123-140)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/model_engine.py
- 518-518: Line too long (130 > 120) (E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (16)
tensorrt_llm/_torch/pyexecutor/model_engine.py (16)
59-59: LGTM! Good refactoring to centralize draft token length logic. The import of `get_draft_token_length` helps standardize how draft token lengths are retrieved across the codebase.
281-281: LGTM! Good separation of static configuration from runtime control. The introduction of `enable_spec_decode` as a runtime flag while keeping `is_spec_decode` as a static configuration flag is well-designed for dynamic speculative decoding control.
503-530: LGTM! Good enhancement to support dynamic draft lengths. The changes to `get_cuda_graph_warmup_request` properly parameterize the draft length, enabling CUDA graph capture for different speculative decoding configurations.
718-726: LGTM! Smart approach to support runtime spec decode toggling. The logic to capture CUDA graphs for both `max_draft_len` and 0 (when not in one-engine mode) is well-designed. This allows the system to switch between speculative decoding enabled/disabled at runtime without requiring new graph captures.
732-747: LGTM! Proper management of runtime flag during warmup. The nested loop structure correctly handles both batch sizes and draft lengths, and the dynamic setting of `enable_spec_decode` during warmup ensures the correct CUDA graphs are captured for each configuration.
769-770: LGTM! Proper restoration of original flag value. Resetting `enable_spec_decode` to `is_spec_decode` after warmup ensures the runtime flag returns to its original state.
813-832: LGTM! Proper use of runtime flag in spec metadata setup. The conditional spec_config assignment based on `enable_spec_decode` correctly ensures that spec metadata is only created when speculative decoding is enabled at runtime.
934-934: LGTM! Correct use of runtime flag for draft length calculation. The dynamic calculation of `draft_len` based on `enable_spec_decode` ensures the correct CUDA graph is selected based on the current runtime configuration.
952-978: LGTM! Well-designed nested dictionary structure for CUDA graphs. The enhancement to support a nested dictionary structure indexed by both batch_size and draft_len is excellent. This allows efficient CUDA graph lookup based on the current runtime configuration while maintaining good performance.
1131-1134: LGTM! Proper cleanup of nested CUDA graph structure. The enhanced cleanup logic correctly iterates through both batch_size and draft_len dimensions of the nested dictionary, ensuring all CUDA graph resources are properly released.
1249-1250: LGTM! Consistent use of centralized helper function. The use of `get_draft_token_length()` instead of direct length checks improves code consistency and maintainability across the codebase.
1268-1271: LGTM! Proper runtime flag usage in spec config logic. The conditional spec_config assignment based on `enable_spec_decode` ensures that speculative decoding logic is only applied when it's enabled at runtime.
1488-1488: LGTM! Consistent application of runtime flag throughout input preparation. The systematic replacement of `is_spec_decode` with `enable_spec_decode` throughout the input preparation logic ensures that all speculative decoding behavior respects the runtime configuration. Also applies to: 1519-1526, 1600-1600, 1644-1644
2100-2115: LGTM! Proper conditional spec metadata setup in forward pass. The conditional setup of spec_metadata based on `enable_spec_decode` ensures that speculative decoding resources are only allocated and configured when needed at runtime.
2134-2143: LGTM! Consistent runtime flag usage in CUDA graph and metadata handling. The use of `enable_spec_decode` for both CUDA graph selection and spec_metadata handling ensures consistent behavior throughout the forward pass based on the runtime configuration.
281-281: Excellent implementation of dynamic speculative decoding feature. This implementation demonstrates excellent software engineering practices:
- Clean separation of concerns: static configuration (`is_spec_decode`) vs. runtime control (`enable_spec_decode`)
- Performance preservation: CUDA graphs are pre-captured for different configurations to avoid runtime overhead
- Comprehensive coverage: the runtime flag is consistently applied throughout input preparation, metadata setup, and forward execution
- Smart resource management: the nested dictionary structure for CUDA graphs allows efficient lookup while supporting multiple configurations
The complexity introduced is well-justified by the functionality gained, and the implementation maintains backward compatibility while adding powerful dynamic control capabilities.
Also applies to: 718-747, 934-978
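The "conditional spec_config" pattern highlighted above reduces to something like this small helper (illustrative, not the method in model_engine.py):

```python
def effective_draft_len(engine) -> int:
    """Illustrative: gate spec-decode branches on a single conditional spec_config."""
    spec_config = engine.spec_config if engine.enable_spec_decode else None
    return spec_config.max_draft_len if spec_config is not None else 0
```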
PR_Github #13605 [ run ] completed with state
Signed-off-by: ziyixiong-nv <[email protected]>
Signed-off-by: ziyixiong-nv <[email protected]>
Signed-off-by: ziyixiong-nv <[email protected]>
fae7a0f
to
fb786ed
Compare
/bot run
Actionable comments posted: 0
🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
518-518: Address line length violation. Line 518 exceeds the 120-character limit. Consider breaking the function parameters across multiple lines for better readability. Note: The actual fix would depend on the full context of the line, but the principle is to break long parameter lists across multiple lines.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (21 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (2 hunks)
- tensorrt_llm/_torch/pyexecutor/sampler.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/scheduler.py (2 hunks)
- tensorrt_llm/_torch/speculative/drafter.py (2 hunks)
- tensorrt_llm/_torch/speculative/model_drafter.py (2 hunks)
- tests/integration/test_lists/test-db/l0_b200.yml (1 hunks)
- tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (9)
- tests/integration/test_lists/test-db/l0_b200.yml
- tensorrt_llm/_torch/speculative/model_drafter.py
- tensorrt_llm/_torch/pyexecutor/llm_request.py
- tensorrt_llm/_torch/speculative/drafter.py
- tensorrt_llm/_torch/pyexecutor/scheduler.py
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tensorrt_llm/_torch/pyexecutor/sampler.py
- tests/unittest/_torch/speculative/test_dynamic_spec_decode.py
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
🧰 Additional context used
🧬 Code Graph Analysis (1)
tensorrt_llm/_torch/pyexecutor/model_engine.py (6)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (1): get_draft_token_length (482-493)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (1): get_num_available_tokens (560-562)
- tensorrt_llm/logger.py (1): info (137-138)
- tensorrt_llm/_torch/attention_backend/interface.py (4): forward (575-598), create_cuda_graph_metadata (282-322), num_contexts (202-203), num_contexts (206-209)
- tensorrt_llm/_torch/speculative/utils.py (2): get_spec_metadata (14-57), get_num_extra_kv_tokens (158-168)
- tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (1): DecodingCUDAGraphRunner (28-139)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/model_engine.py
- 518-518: Line too long (130 > 120) (E501)
🔇 Additional comments (13)
tensorrt_llm/_torch/pyexecutor/model_engine.py (13)
59-59: LGTM! Good centralization of draft token length logic. The import of the `get_draft_token_length` utility function supports the centralized approach to retrieving draft token lengths, improving code consistency and maintainability.
281-281: LGTM! Runtime flag initialization is well-designed. The `enable_spec_decode` flag properly initializes from `is_spec_decode` and provides the foundation for dynamic runtime control of speculative decoding.
503-545: LGTM! Parameterization of warmup request generation. The addition of the `draft_len` parameter to `get_cuda_graph_warmup_request` properly supports generating warmup requests for different draft token lengths, enabling the multi-draft-length CUDA graph approach.
718-726: Excellent logic for supporting dynamic speculative decoding. The logic to capture CUDA graphs for both `max_draft_len` and `0` draft lengths enables runtime toggling of speculative decoding. The conditional exclusion of one-engine mode is correct since those implementations cannot disable spec decode at runtime.
732-747: LGTM! Proper runtime flag management during warmup. The nested loop structure correctly captures CUDA graphs for each batch size and draft length combination. Setting `enable_spec_decode` based on `draft_len > 0 or self.is_draft_model` ensures the correct behavior during graph capture.
769-770: Important restoration of original runtime flag. Correctly restores `enable_spec_decode` to its original value after warmup, ensuring the runtime state is not affected by the warmup process.
934-934: Correct use of runtime flag for draft length determination. Using `self.enable_spec_decode` instead of the static configuration properly supports dynamic runtime control of speculative decoding.
952-954: Well-implemented nested CUDA graph structure. The change from a flat dictionary to a nested dictionary structure (`batch_size -> draft_length -> graph`) properly supports storing multiple CUDA graph instances per batch size for different draft token lengths. The initialization logic correctly handles the nested structure. Also applies to: 971-978
813-832: Clean pattern for conditional spec config usage. The use of a local `spec_config` variable set from `self.spec_config if self.enable_spec_decode else None` provides a clean, consistent way to conditionally enable speculative decoding throughout the method based on the runtime flag.
1249-1250: Consistent use of centralized draft token length utility. The replacement of direct `py_draft_tokens` length checks with `get_draft_token_length()` calls improves consistency and safety by using the centralized utility function that properly handles None cases. Also applies to: 1290-1290
1150-1150: Consistent runtime flag usage throughout input preparation. The consistent use of `self.enable_spec_decode` instead of `self.is_spec_decode` throughout input preparation methods properly supports dynamic runtime control of speculative decoding. Also applies to: 1488-1488, 1519-1519, 1644-1644
2100-2100: Proper runtime flag usage in forward method. The forward method correctly uses `self.enable_spec_decode` to conditionally set up spec metadata and determine CUDA graph usage, supporting the dynamic speculative decoding feature. Also applies to: 2140-2143
1131-1134: Correctly updated CUDA graph cleanup for nested structure. The `_release_cuda_graphs` method properly handles the new nested dictionary structure, iterating through both batch sizes and draft lengths to clean up all graph instances.
PR_Github #13669 [ run ] triggered by Bot
Added unittest takes ~1 minute, which is ok, so I'm approving CI-time wise.
I think coderabbit re-requested review, which revoked my previous approval
PR_Github #13669 [ run ] completed with state
…VIDIA#6363) Signed-off-by: ziyixiong-nv <[email protected]> Signed-off-by: Lanyu Liao <[email protected]>
…VIDIA#6363) Signed-off-by: ziyixiong-nv <[email protected]>
Summary by CodeRabbit
New Features
Bug Fixes
Tests
Chores
Description
Main changes:
- Add `use_runtime_spec_decode` to enable or disable speculative decoding in the ModelEngine.
- Add `should_use_spec_decode` to determine whether to enable speculative decoding or not.
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run `/bot [-h|--help]` to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
--disable-fail-fast (OPTIONAL): Disable fail fast on build/tests/infra failures.
--skip-test (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx" (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe" (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp" (OPTIONAL): Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test (OPTIONAL): Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
--post-merge (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL): Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.