[TRTLLM-6853][feat] refactor deepseekv3 model #6698
Conversation
📝 Walkthrough
Replace DeepseekV3's bespoke MTP/MoE wiring with a SpecDecOneEngineForCausalLM-based speculative decoding path: introduce SpecMetadata, centralize draft_model for MTP/Eagle drafting, add MTP/spec classes and workers, and update forward signatures to accept spec_metadata.
Changes
Sequence Diagram(s)
sequenceDiagram
participant User
participant DeepseekV3ForCausalLM
participant SpecDecOneEngineForCausalLM
participant SpecWorker
participant DraftModel
User->>DeepseekV3ForCausalLM: forward(input_ids, spec_metadata, ...)
DeepseekV3ForCausalLM->>SpecDecOneEngineForCausalLM: forward(..., spec_metadata)
SpecDecOneEngineForCausalLM->>SpecWorker: invoke(spec_metadata, draft_model)
SpecWorker->>DraftModel: use draft_model.mtp_layers / embed_tokens / lm_head
SpecWorker-->>SpecDecOneEngineForCausalLM: speculative logits/metadata
SpecDecOneEngineForCausalLM-->>DeepseekV3ForCausalLM: final logits
DeepseekV3ForCausalLM-->>User: output
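The diagram above compresses a lot; a minimal Python sketch of the same delegation pattern, with heavily simplified signatures (`spec_worker` and `draft_model` are the names used in this review, not verbatim code from the repository):

```python
from torch import nn


class SpecDecOneEngineSketch(nn.Module):
    """Illustrative only: one engine owns the target model, a draft model, and a spec worker."""

    def __init__(self, target_model: nn.Module, draft_model: nn.Module, spec_worker):
        super().__init__()
        self.model = target_model        # full target model (e.g. DeepseekV3)
        self.draft_model = draft_model   # MTP / Eagle draft wrapper sharing embed_tokens and lm_head
        self.spec_worker = spec_worker   # drafting + verification logic

    def forward(self, input_ids, attn_metadata, spec_metadata=None, **kwargs):
        # Run the target model once; spec_metadata is threaded through so the
        # speculative path can reuse hidden states and sequence bookkeeping.
        hidden_states = self.model(input_ids, attn_metadata=attn_metadata,
                                   spec_metadata=spec_metadata, **kwargs)
        if spec_metadata is None:
            return hidden_states
        # The worker drives the draft model (mtp_layers / embed_tokens / lm_head)
        # to produce and verify speculative tokens.
        return self.spec_worker(hidden_states,
                                spec_metadata=spec_metadata,
                                draft_model=self.draft_model,
                                attn_metadata=attn_metadata)
```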
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~40 minutes
Possibly related PRs
Suggested labels
Suggested reviewers
✨ Finishing Touches
🧪 Generate unit tests
Actionable comments posted: 8
🔭 Outside diff range comments (2)
tensorrt_llm/llmapi/llm_args.py (1)
1-6
: Missing NVIDIA copyright header – mandatory for OSS files
All TensorRT-LLM *.py sources must begin with the standard NVIDIA copyright & Apache-2.0 license boilerplate (see coding-guidelines). Add it before the first import to stay policy-compliant.

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#     http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.

tensorrt_llm/_torch/models/modeling_speculative.py (1)
1288-1301
: Fix comparison style and exception type.
Two issues in this function:
- Line 1289: Use `is not None` instead of `!= None` (PEP 8 style)
- Line 1298: Should raise `NotImplementedError` instead of `NotImplemented`

-    assert getattr(model_config, 'spec_config', None) != None
+    assert getattr(model_config, 'spec_config', None) is not None
     spec_dec_mode = model_config.spec_config.spec_dec_mode
     if spec_dec_mode.is_eagle3_one_model():
         return Eagle3ForCausalLM(
             draft_config, model_config.pretrained_config.num_hidden_layers)
     elif spec_dec_mode.is_mtp():
         return MTPForCausalLM(model, draft_config,
                               model_config.pretrained_config.num_hidden_layers)
     else:
-        raise NotImplemented(
+        raise NotImplementedError(
             f"get_draft_model does not support speculative decoding mode {spec_dec_mode}."
         )
🧹 Nitpick comments (2)
tensorrt_llm/_torch/modules/attention.py (1)
519-542
: New FP8 batched GEMM implementation for SM 100
The previously commented-out SM 100 implementation is now active with DeepSeek-specific optimizations. The implementation:
- Pads input to tile boundaries for optimal performance
- Uses specialized FP8 batched GEMM with DeepSeek optimizations
- Removes the unused `mat2_dequant` parameter

The hardcoded parameters (`tile_size=8`, `epilogue_tile_m=64`) appear to be tuned for specific use cases. Consider making these configurable if different models require different settings.

tensorrt_llm/_torch/models/modeling_deepseekv3.py (1)
303-307
: Remove or use unused function `check_weight_dtype`.
The function `check_weight_dtype` is defined but never called in the code. Either remove this unused function or use it where appropriate for dtype validation:
-    def check_weight_dtype(module_name: str, dtype):
-        weight_name = "weight"
-        w_dtype = weights[f"{module_name}.{weight_name}"].dtype
-        return w_dtype == dtype
-
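If the helper were kept rather than deleted, a plausible use would be an assertion inside the same weight-loading scope that defines `weights` (hypothetical call site; the module name is invented for illustration):

```python
import torch

# Hypothetical: validate that a quantized projection actually ships FP8 weights.
if not check_weight_dtype("model.layers.0.mlp.gate_proj", torch.float8_e4m3fn):
    raise ValueError("Expected an FP8 weight for the quantized MLP projection")
```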
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (21)
- examples/llm-api/quickstart_advanced.py (1 hunks)
- requirements.txt (1 hunks)
- tensorrt_llm/_torch/models/checkpoints/hf/weight_loader.py (1 hunks)
- tensorrt_llm/_torch/models/modeling_deepseekv3.py (12 hunks)
- tensorrt_llm/_torch/models/modeling_speculative.py (3 hunks)
- tensorrt_llm/_torch/modules/attention.py (3 hunks)
- tensorrt_llm/_torch/modules/fused_moe/create_moe.py (0 hunks)
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (0 hunks)
- tensorrt_llm/_torch/modules/fused_moe/quantization.py (0 hunks)
- tensorrt_llm/_torch/modules/linear.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/_util.py (0 hunks)
- tensorrt_llm/_torch/speculative/mtp.py (5 hunks)
- tensorrt_llm/_torch/speculative/utils.py (1 hunks)
- tensorrt_llm/_utils.py (0 hunks)
- tensorrt_llm/llmapi/llm_args.py (1 hunks)
- tensorrt_llm/quantization/utils/__init__.py (1 hunks)
- tensorrt_llm/quantization/utils/fp8_utils.py (0 hunks)
- tests/unittest/_torch/helpers.py (0 hunks)
- tests/unittest/_torch/modules/test_fused_moe.py (1 hunks)
- tests/unittest/_torch/thop/test_fp8_block_scale_gemm.py (1 hunks)
- tests/unittest/test_pip_install.py (0 hunks)
💤 Files with no reviewable changes (8)
- tests/unittest/test_pip_install.py
- tests/unittest/_torch/helpers.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py
- tensorrt_llm/_torch/modules/fused_moe/create_moe.py
- tensorrt_llm/_utils.py
- tensorrt_llm/quantization/utils/fp8_utils.py
- tensorrt_llm/_torch/modules/fused_moe/quantization.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
tensorrt_llm/quantization/utils/__init__.py
tensorrt_llm/_torch/models/checkpoints/hf/weight_loader.py
tensorrt_llm/_torch/speculative/utils.py
examples/llm-api/quickstart_advanced.py
tensorrt_llm/llmapi/llm_args.py
tests/unittest/_torch/modules/test_fused_moe.py
tests/unittest/_torch/thop/test_fp8_block_scale_gemm.py
tensorrt_llm/_torch/speculative/mtp.py
tensorrt_llm/_torch/modules/attention.py
tensorrt_llm/_torch/modules/linear.py
tensorrt_llm/_torch/models/modeling_speculative.py
tensorrt_llm/_torch/models/modeling_deepseekv3.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
tensorrt_llm/quantization/utils/__init__.py
tensorrt_llm/_torch/models/checkpoints/hf/weight_loader.py
tensorrt_llm/_torch/speculative/utils.py
examples/llm-api/quickstart_advanced.py
tensorrt_llm/llmapi/llm_args.py
tests/unittest/_torch/modules/test_fused_moe.py
tests/unittest/_torch/thop/test_fp8_block_scale_gemm.py
tensorrt_llm/_torch/speculative/mtp.py
tensorrt_llm/_torch/modules/attention.py
tensorrt_llm/_torch/modules/linear.py
tensorrt_llm/_torch/models/modeling_speculative.py
tensorrt_llm/_torch/models/modeling_deepseekv3.py
🧠 Learnings (5)
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Applied to files:
tensorrt_llm/_torch/models/checkpoints/hf/weight_loader.py
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tensorrt_llm/_torch/models/checkpoints/hf/weight_loader.py
tests/unittest/_torch/modules/test_fused_moe.py
tensorrt_llm/_torch/models/modeling_deepseekv3.py
📚 Learning: in tensorrt-llm's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()...
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
Applied to files:
tensorrt_llm/_torch/modules/attention.py
tensorrt_llm/_torch/models/modeling_deepseekv3.py
📚 Learning: in tensorrt-llm, examples directory can have different dependency versions than the root requirement...
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.
Applied to files:
tensorrt_llm/_torch/modules/linear.py
tensorrt_llm/_torch/models/modeling_speculative.py
tensorrt_llm/_torch/models/modeling_deepseekv3.py
📚 Learning: in tensorrt-llm, test files (files under tests/ directories) do not require nvidia copyright headers...
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
tensorrt_llm/_torch/modules/linear.py
tensorrt_llm/_torch/models/modeling_deepseekv3.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/speculative/mtp.py
1049-1049: Undefined name mapping
(F821)
tensorrt_llm/_torch/models/modeling_speculative.py
418-418: Undefined name get_sm_version
(F821)
435-435: Line too long (135 > 120)
(E501)
487-487: Undefined name F
(F821)
491-491: Undefined name enable_llm_debug
(F821)
494-494: Undefined name warnings
(F821)
580-580: Line too long (122 > 120)
(E501)
712-712: Line too long (125 > 120)
(E501)
718-718: Line too long (136 > 120)
(E501)
730-730: Undefined name TRTLLMGenFusedMoE
(F821)
1289-1289: Comparison to None
should be cond is not None
Replace with cond is not None
(E711)
tensorrt_llm/_torch/models/modeling_deepseekv3.py
40-40: Redefinition of unused ModelConfig
from line 39
(F811)
🔇 Additional comments (16)
requirements.txt (1)
63-66
: Do not remove the `deep_gemm` dependency – it's still in use
A search for "deep_gemm" reveals active references in the codebase, so the VCS-pinned package is still required:
• tests/unittest/_torch/thop/test_fp8_block_scale_gemm.py (calls `test_deep_gemm_in_subprocess`)
• tests/unittest/_torch/thop/deep_gemm_tests.py (actual test suite)
• scripts/build_wheel.py (copies the `deep_gemm` include directory into the wheel)
Please keep the existing line in requirements.txt.
Likely an incorrect or invalid review comment.
tensorrt_llm/_torch/models/checkpoints/hf/weight_loader.py (1)
15-15
: Synchronization Removal in HfWeightLoader.prefetch_files Is Intentional and Safe
The `prefetch_files` method shards files across ranks using `local_mpi_rank()` / `local_mpi_size()` and each rank only prefetches its own subset. No cross-rank dependency exists, so an explicit barrier after prefetch isn't required for correctness—each rank independently reads its files into OS cache before loading.
• No `mpi_barrier` calls remain or are needed in this file.
• Other MPI synchronization sites (e.g., in allreduce, executor, build) are unaffected and still present where inter-rank ordering matters.
Please ignore the suggestion to reintroduce a barrier here.
Likely an incorrect or invalid review comment.
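For reference, the rank-sharded prefetch described above amounts to each rank touching a disjoint slice of the file list, roughly as in this simplified sketch (not the actual HfWeightLoader code):

```python
def prefetch_files_sketch(files, local_rank: int, local_size: int) -> None:
    """Warm the OS page cache: each rank reads only its own stride-sliced subset."""
    for path in files[local_rank::local_size]:
        with open(path, "rb") as f:
            while f.read(64 * 1024 * 1024):  # read and discard 64 MiB chunks
                pass
```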
tensorrt_llm/quantization/utils/__init__.py (1)
1-3
: LGTM — all `fp8_utils` references removed.
Verified via ripgrep searches that there are no remaining import statements or direct usages of `fp8_utils` in the codebase.

tensorrt_llm/_torch/speculative/utils.py (1)
150-150
: MTPWorker constructor correctly updated to accept `mapping`
The `__init__` in tensorrt_llm/_torch/speculative/mtp.py now defines `mapping=None`, so using `MTPWorker(spec_config, model_config.mapping)` is valid and remains backward-compatible with tests that omit the mapping. No further changes required.
- Constructor signature in tensorrt_llm/_torch/speculative/mtp.py
- Usage in tensorrt_llm/_torch/speculative/utils.py updated to pass `model_config.mapping`
- Tests in tests/unittest/_torch/speculative/test_mtp.py still call `MTPWorker(spec_config)`, which works with the default `mapping=None`
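In other words, the constructor shape being verified is roughly the following (sketch based on this review, not copied from mtp.py):

```python
class MTPWorkerSketch:
    def __init__(self, spec_config, mapping=None):
        self.spec_config = spec_config
        self.mapping = mapping  # stays None in single-rank unit tests


# Both call styles mentioned above remain valid:
# MTPWorkerSketch(spec_config)                         # tests
# MTPWorkerSketch(spec_config, model_config.mapping)   # speculative/utils.py
```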
tensorrt_llm/llmapi/llm_args.py (1)
170-172
: No leftover `DEEPGEMM` references detected – changes approved
A repo-wide search for "DEEPGEMM" returned no matches, confirming the literal has been fully pruned. No further action required.

tests/unittest/_torch/modules/test_fused_moe.py (1)
12-12
: LGTM! Clean import cleanup.
The removal of unused FP8 quantization helpers (`per_block_cast_to_fp8_e8m0` and `per_token_cast_to_fp8_e8m0`) aligns with the DeepGEMM backend removal. The remaining import `per_block_cast_to_fp8` is still needed for other tests.

tests/unittest/_torch/thop/test_fp8_block_scale_gemm.py (1)
21-21
: LGTM! Clean import cleanup.
The removal of unused FP8 quantization helpers (`per_block_cast_to_fp8_e8m0` and `per_token_cast_to_fp8_e8m0`) is consistent with the DeepGEMM test removal. The remaining imports are still needed for other tests.

tensorrt_llm/_torch/speculative/mtp.py (5)
321-321
: LGTM! Constructor signature simplification.
The change from `model_config` to `mapping` parameter simplifies the constructor interface and reduces dependencies.
324-324
: LGTM! Attribute assignment update.
The assignment properly stores the mapping parameter as an instance attribute.
335-335
: LGTM! Forward method signature consolidation.
Consolidating model components into a single `draft_model` parameter provides a cleaner interface than separate `lm_head`, `embed_tokens`, and `mtp_layers` parameters.
471-474
: LGTM! Proper utilization of consolidated draft_model parameter.
The code correctly accesses MTP layers and model components through the consolidated `draft_model` object structure.
1091-1094
: LGTM! Consistent mapping access pattern.
The changes properly use `self.mapping` instead of `self.model_config.mapping`, consistent with the constructor refactoring.

tensorrt_llm/_torch/modules/linear.py (3)
573-576
: Code simplified by removing DeepGEMM backend
The removal of SM 100-specific deep_gemm logic simplifies the FP8 GEMM computation path. This aligns with the broader removal of DeepGEMM support from the codebase.
601-617
: Critical fix: Missing weight copy in fused QKV loading
Good catch! The fused weight tensor is now properly copied to the module's weight parameter before copying the scales. This fixes a critical bug where the concatenated QKV weights were not being assigned to the module.
619-632
: Critical fix: Missing weight copy in fused gate/up loading
Same critical fix as in the QKV method - the fused weight tensor is now properly copied to the module's weight parameter. This ensures gate and up weights are correctly loaded.
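The fix in both methods amounts to the standard concatenate-then-copy idiom for fused projections; a generic sketch with invented shapes (not the actual linear.py code):

```python
import torch
from torch import nn

# Two projections fused into a single parameter, as in fused gate/up (or QKV) loading.
gate_w = torch.randn(128, 64)
up_w = torch.randn(128, 64)

fused = nn.Linear(64, 256, bias=False)
fused_weight = torch.cat([gate_w, up_w], dim=0)  # concatenate along the output dimension
fused.weight.data.copy_(fused_weight)            # the copy that was previously missing
```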
tensorrt_llm/_torch/modules/attention.py (1)
1351-1352
: Updated function calls to match new signature
The `fp8_block_scaling_bmm_out` calls have been correctly updated to remove the `mat2_dequant` argument, matching the simplified function signature.
Also applies to: 1401-1403
a07584d to 2e064cd Compare
Actionable comments posted: 0
♻️ Duplicate comments (7)
tensorrt_llm/_torch/speculative/mtp.py (1)
1049-1049
: Fix undefined variable reference.
The code references `mapping.tp_rank` but `mapping` is not defined in the local scope. This should be `self.mapping.tp_rank`.
Apply this fix:

-            torch.int32) + (mapping.tp_rank * vocab_per_rank)
+            torch.int32) + (self.mapping.tp_rank * vocab_per_rank)

tensorrt_llm/_torch/models/modeling_speculative.py (4)
1-1
: Add required NVIDIA copyright header.
Per the coding guidelines and retrieved learnings, all TensorRT-LLM source files (.py, .cpp, .h, .cu, etc.) must contain an NVIDIA copyright header with the current year.
418-418
: Fix undefined function `get_sm_version()`.
The function `get_sm_version()` is used but not imported, which will cause a `NameError` at runtime.
487-494
: Fix multiple undefined names in routing implementation.
The `noaux_tc` method uses several undefined names that will cause runtime errors:
- Line 487: `F.sigmoid` - `F` is not imported
- Line 491: `enable_llm_debug()` - function not imported
- Line 494: `warnings` - module not imported
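A likely fix is simply adding the missing imports near the top of modeling_speculative.py; the exact module paths below are assumptions and should be verified against where these helpers actually live in this repository:

```python
import warnings

import torch.nn.functional as F

from tensorrt_llm._utils import get_sm_version    # assumed path; also fixes the line-418 finding
from tensorrt_llm.logger import enable_llm_debug  # assumed path
```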
730-730
: Fix undefined class `TRTLLMGenFusedMoE`.
The class `TRTLLMGenFusedMoE` is used in an isinstance check but not imported.

tensorrt_llm/_torch/models/modeling_deepseekv3.py (2)
49-51
: Remove duplicate import of `ModelConfig`.
Line 50 redefines `ModelConfig` that was already imported on line 49, causing confusion and potential bugs.
402-402
: Remove debug print statement.
The print statement appears to be debug code that should not be in production.
🧹 Nitpick comments (2)
tensorrt_llm/_torch/models/modeling_speculative.py (2)
468-546
: Review routing implementation for potential performance and correctness issues.
The `Deepseekv3RoutingImpl` class has a complex routing algorithm with both fused and non-fused paths. Consider the following:
- The non-fused path (lines 498-533) has nested operations that could be optimized
- The fused path relies on a custom operator `torch.ops.trtllm.noaux_tc_op` which may not be available in all environments
- Error handling for the debug path uses undefined names (already flagged)
Consider adding fallback logic if the fused operator is not available:
         else:
+            try:
                 topk_values, topk_indices = torch.ops.trtllm.noaux_tc_op(
                     scores, scores_with_bias, n_group, self.topk_group,
                     self.top_k, self.routed_scaling_factor)
+            except (AttributeError, RuntimeError):
+                # Fallback to non-fused implementation
+                return self._non_fused_routing(scores, scores_with_bias)
             return topk_values, topk_indices
1288-1300
: Fix comparison operator for None check.
Use `is not None` instead of `!= None` for consistency with Python best practices.

-    assert getattr(model_config, 'spec_config', None) != None
+    assert getattr(model_config, 'spec_config', None) is not None
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- tensorrt_llm/_torch/models/modeling_deepseekv3.py (10 hunks)
- tensorrt_llm/_torch/models/modeling_speculative.py (3 hunks)
- tensorrt_llm/_torch/speculative/mtp.py (5 hunks)
- tensorrt_llm/_torch/speculative/utils.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tensorrt_llm/_torch/speculative/utils.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
tensorrt_llm/_torch/models/modeling_speculative.py
tensorrt_llm/_torch/models/modeling_deepseekv3.py
tensorrt_llm/_torch/speculative/mtp.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
tensorrt_llm/_torch/models/modeling_speculative.py
tensorrt_llm/_torch/models/modeling_deepseekv3.py
tensorrt_llm/_torch/speculative/mtp.py
🧠 Learnings (5)
📚 Learning: applies to **/*.{cpp,h,hpp,cc,cxx,cu,py} : all tensorrt-llm open source software code should contain...
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T21:22:55.018Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx,cu,py} : All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Applied to files:
tensorrt_llm/_torch/models/modeling_speculative.py
📚 Learning: in tensorrt-llm, test files (files under tests/ directories) do not require nvidia copyright headers...
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
tensorrt_llm/_torch/models/modeling_speculative.py
tensorrt_llm/_torch/models/modeling_deepseekv3.py
📚 Learning: applies to **/*.{h,hpp} : use a preprocessor guard in header files. the guard name must have prefix ...
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T21:22:55.018Z
Learning: Applies to **/*.{h,hpp} : Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.
Applied to files:
tensorrt_llm/_torch/models/modeling_speculative.py
📚 Learning: in tensorrt-llm, examples directory can have different dependency versions than the root requirement...
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.
Applied to files:
tensorrt_llm/_torch/models/modeling_speculative.py
📚 Learning: in tensorrt-llm's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()...
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
Applied to files:
tensorrt_llm/_torch/models/modeling_deepseekv3.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/models/modeling_speculative.py
418-418: Undefined name get_sm_version
(F821)
435-435: Line too long (135 > 120)
(E501)
487-487: Undefined name F
(F821)
491-491: Undefined name enable_llm_debug
(F821)
494-494: Undefined name warnings
(F821)
580-580: Line too long (122 > 120)
(E501)
712-712: Line too long (125 > 120)
(E501)
718-718: Line too long (136 > 120)
(E501)
730-730: Undefined name TRTLLMGenFusedMoE
(F821)
1289-1289: Comparison to None
should be cond is not None
Replace with cond is not None
(E711)
tensorrt_llm/_torch/speculative/mtp.py
1049-1049: Undefined name mapping
(F821)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (14)
tensorrt_llm/_torch/speculative/mtp.py (5)
321-321
: LGTM! Constructor signature updated correctly.
The constructor signature change from accepting `model_config` to `mapping` aligns with the broader refactoring to use more direct configuration objects.
324-324
: LGTM! Mapping assignment is correct.
The assignment of `mapping` to `self.mapping` properly stores the configuration for later use in tensor parallelism operations.
335-335
: LGTM! Forward method signature simplified appropriately.
The change from individual parameters (`lm_head`, `embed_tokens`, `mtp_layers`) to a single `draft_model` parameter simplifies the interface and aligns with the new draft model encapsulation approach.
471-475
: LGTM! Draft model usage is implemented correctly.
The code properly accesses the MTP layers and components through the draft model:
- `draft_model.model.mtp_layers` for iterating over layers
- `draft_model.model.embed_tokens` for embeddings
- `draft_model.lm_head` for the language model head
This aligns with the new draft model encapsulation pattern.
1091-1094
: LGTM! Mapping usage updated correctly.
The code properly uses `self.mapping` instead of `self.model_config.mapping` for tensor parallelism operations, which is consistent with the constructor changes and the new configuration approach.

tensorrt_llm/_torch/models/modeling_speculative.py (4)
334-372
: LGTM: Well-implemented MTP head with proper logits processing.
The `DeepseekV3MTPHead` class correctly handles last token extraction using cumulative sum and provides proper gather control for distributed attention. The compiled `get_last_token_states` method should provide good performance for the token extraction logic.
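The cumulative-sum technique referenced here is the usual way to locate each sequence's last token in a packed batch; a standalone illustration (not the actual DeepseekV3MTPHead code):

```python
import torch


def last_token_indices(seq_lens: torch.Tensor) -> torch.Tensor:
    """For packed sequences of lengths [3, 2, 4] the last-token positions are [2, 4, 8]."""
    return torch.cumsum(seq_lens, dim=0) - 1


hidden = torch.randn(9, 16)  # 9 packed tokens, hidden size 16
last_hidden = hidden[last_token_indices(torch.tensor([3, 2, 4]))]  # shape [3, 16]
```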
617-794
: LGTM: Comprehensive MoE implementation with proper optimization patterns.
The `Deepseekv3MoE` class demonstrates good engineering practices:
- Proper handling of distributed parallelism with attention DP vs tensor parallelism
- Smart tensor parallelism size computation based on quantization constraints
- Effective use of auxiliary streams for overlapping computation
- Correct all-reduce handling for different execution modes
The implementation properly balances performance optimizations with correctness.
796-1106
: Well-structured decoder layer with comprehensive fusion optimizations.
The `DeepseekV3DecoderLayer` implementation shows excellent attention to performance:
- Intelligent fusion configuration based on hardware capabilities and model configuration
- Proper handling of different MLP vs MoE execution paths
- Smart tensor parallelism size computation considering quantization block sizes
- Effective use of eager fusion for reducing communication overhead
The separation between `forward_MoE` and `forward_mlp` methods maintains code clarity while handling the complexity of different execution modes.
1312-1335
: LGTM: Proper integration with speculative decoding framework.
The updated constructor properly handles both Eagle3 and MTP speculative decoding modes, with appropriate configuration handling and draft model instantiation. The conditional logic correctly sets up the different speculative decoding paths.
tensorrt_llm/_torch/models/modeling_deepseekv3.py (5)
153-154
: LGTM: Proper integration with speculative decoding metadata.
Adding the `spec_metadata` parameter aligns the `DeepseekV3Model` forward method with the speculative decoding framework requirements. The parameter is correctly typed and positioned.
179-197
: LGTM: Clean integration with speculative decoding engine.
The inheritance change from `DecoderModelForCausalLM` to `SpecDecOneEngineForCausalLM` and the updated constructor properly integrate with the new speculative decoding architecture. The use of keyword arguments in the superclass call is more explicit and maintainable.
214-233
: LGTM: Proper draft model integration.
The changes correctly integrate with the new draft model architecture:
- Lines 214-215: Properly extend layers from `self.draft_model.model.mtp_layers`
- Line 233: Correctly append `self.spec_worker` instead of the previous `self.mtp_worker`
This aligns with the new speculative decoding framework where draft models encapsulate MTP layers.
246-252
: LGTM: Clean delegation to speculative decoding engine.
The simplified forward method correctly delegates all processing to the superclass, which now handles the speculative decoding logic. This is a good example of separation of concerns - the base model focuses on its core functionality while the speculative decoding engine handles the advanced features.
406-407
: LGTM: Proper filtering of draft model weights.
The addition of draft model filtering in weight loading is appropriate since draft models are handled separately in the new architecture. This prevents conflicts during weight loading.
4b20324 to aa6a243 Compare
aa6a243 to 694e973 Compare
ae87923 to d9d16b6 Compare
/bot run
Actionable comments posted: 0
🔭 Outside diff range comments (7)
tensorrt_llm/_torch/models/modeling_speculative.py (1)
449-455
: Guard against AttributeError when draft_model has no weights API (MTP path).
`MTPForCausalLM` is a lightweight wrapper without `load_weights`/`load_weights_from_target_model`. Unconditional calls will raise AttributeError. Make this method resilient.

-        self.draft_model.load_weights(weights=weights,
-                                      weight_mapper=weight_mapper)
-        self.draft_model.load_weights_from_target_model(self)
+        # Some draft models (e.g., MTPForCausalLM) are weightless.
+        if hasattr(self.draft_model, "load_weights"):
+            self.draft_model.load_weights(weights=weights,
+                                          weight_mapper=weight_mapper)
+        if hasattr(self.draft_model, "load_weights_from_target_model"):
+            self.draft_model.load_weights_from_target_model(self)

tensorrt_llm/_torch/speculative/mtp.py (3)
103-110
: Trailing comma makes mtp_num_modules default a tuple, not int.
`mtp_num_modules: int = 1,` evaluates to a tuple `(1,)`. This will break consumers (e.g., `torch.arange(self.mtp_num_modules)`). Remove the comma.

-    mtp_num_modules: int = 1,
+    mtp_num_modules: int = 1
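The pitfall is easy to reproduce in isolation: a stray trailing comma on a class-level attribute silently turns the value into a tuple (illustrative snippet, not the real config class):

```python
class ConfigSketch:
    mtp_num_modules = 1,   # trailing comma: the value is the tuple (1,)
    relaxed_topk = 1       # no trailing comma: plain int


print(type(ConfigSketch.mtp_num_modules))  # <class 'tuple'> -> torch.arange() on this would fail
print(type(ConfigSketch.relaxed_topk))     # <class 'int'>
```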
495-496
: Fix None check order to avoid AttributeError.
`if attn_metadata.is_cuda_graph and attn_metadata is not None` dereferences `attn_metadata` before checking None. Reverse the order.

-        if attn_metadata.is_cuda_graph and attn_metadata is not None:
+        if attn_metadata is not None and attn_metadata.is_cuda_graph:
1166-1198
: Bug: MTPEagleWorker always uses mtp_layers[0] instead of layer i.
Within the loop over `self.mtp_num_modules`, calls to the MTP layer and shared_head are hard-coded to index 0, breaking multi-layer MTP. Use the loop index.

-            if i == 0:
-                hidden_states = draft_model.mtp_layers[0](
+            if i == 0:
+                hidden_states = draft_model.mtp_layers[i](
                     embed_tokens=draft_model.embed_tokens,
                     all_rank_num_tokens=spec_metadata.all_rank_num_tokens,
                     all_rank_max_num_tokens=spec_metadata.all_rank_max_num_tokens,
                     **inputs)
             ...
-            else:
-                hidden_states = draft_model.mtp_layers[0](
+            else:
+                hidden_states = draft_model.mtp_layers[i](
                     embed_tokens=draft_model.embed_tokens,
                     all_rank_num_tokens=spec_metadata.subseq_all_rank_num_tokens,
                     all_rank_max_num_tokens=max(
                         spec_metadata.subseq_all_rank_num_tokens)
                     if spec_metadata.subseq_all_rank_num_tokens is not None else None,
                     **inputs)
             ...
-            logits = draft_model.mtp_layers[0].shared_head(
+            logits = draft_model.mtp_layers[i].shared_head(
                 hidden_states[gather_ids], draft_model.lm_head, attn_metadata, True)
1-1
: Add required NVIDIA copyright header (current year).This production source file lacks the NVIDIA SPDX header. Retain the third-party MIT notice, but prepend the NVIDIA Apache-2.0 header at the very top.
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + # -------------------------------------------------- # Portions of this code were derived from DeepSeek‑V3:
214-221
: Fused A GEMM path ignores LoRA delta — gate fast path when LoRA is active.
When `lora_params` is provided, the fused path will drop LoRA contributions. Only take the fused path when `lora_params is None`.

-        if (not self.has_any_quant and 1 <= num_tokens <= 16
-                and get_sm_version() != 120):
+        if (lora_params is None
+                and not self.has_any_quant
+                and 1 <= num_tokens <= 16
+                and get_sm_version() != 120):
             output = torch.ops.trtllm.dsv3_fused_a_gemm_op(
                 input, self.weight.t(), bias, None)
         else:
             output = super().apply_linear(input, bias, lora_params, layer_idx)
1128-1157
: Fix: self.model_nextn never updated from spec_config.
`forward()` sets `attn_metadata.num_generations_per_batch` based on `self.model_nextn`, but `__init__` never assigns it from `spec_config`. Set it to `model_nextn` once computed.

         self.model_nextn = 0
         if model_config.spec_config is not None:
             model_nextn = model_config.spec_config.num_nextn_predict_layers
+            self.model_nextn = model_nextn
             ckpt_nextn = self.config.num_nextn_predict_layers
♻️ Duplicate comments (1)
tensorrt_llm/_torch/models/modeling_speculative.py (1)
1-1
: Add required NVIDIA copyright header (current year).
Per repo guidelines, all OSS source files must include the NVIDIA copyright header. Please add it at the very top of the file, before any imports. This was previously flagged and remains unresolved.
Apply this patch:
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
🧹 Nitpick comments (7)
tensorrt_llm/_torch/models/modeling_speculative.py (5)
351-353
: Use identity comparison for None (fix Ruff E711).
Replace `!= None` with `is not None`.

-    assert getattr(model_config, 'spec_config', None) != None
+    assert getattr(model_config, 'spec_config', None) is not None
: Fix incorrect return annotation on init.
__init__
must return None. Current annotation is a Tuple, which is incorrect.- ) -> Tuple[torch.Tensor, torch.Tensor]: + ) -> None:
104-112
: Correct forward() return type annotation (returns a tuple).
Eagle3DecoderLayer.forward
returns(hidden_states, residual)
but the annotation saystorch.Tensor
.- ) -> torch.Tensor: + ) -> Tuple[torch.Tensor, torch.Tensor]:
187-196
: Correct forward() return type annotation (returns a tuple).
Eagle3DraftModel.forward
returns a tuple(hidden_states, hidden_states_to_save)
but the annotation istorch.Tensor
.- ) -> torch.Tensor: + ) -> Tuple[torch.Tensor, torch.Tensor]:
323-349
: MTPForCausalLM relies on non-None model/lm_head — assert early.Given
model.aux_stream_dict
,model.embed_tokens
, andlm_head
are dereferenced, assert they are provided to fail-fast with a clear message.def __init__( self, model_config: ModelConfig[PretrainedConfig], start_layer_idx: int = 0, lm_head: nn.Module = None, model: nn.Module = None, ): super().__init__() + assert model is not None and lm_head is not None, \ + "MTPForCausalLM requires non-None `model` and `lm_head` from the target model."tensorrt_llm/_torch/models/modeling_deepseekv3.py (2)
969-971
: Nit: fix comment typo.Minor typo in comment.
- return self.enorm(embed_tokens(input_ids)) #emdedding + return self.enorm(embed_tokens(input_ids)) # embedding
1076-1105
: Unused spec_metadata in model forward — either use or document.
`DeepseekV3Model.forward()` accepts `spec_metadata` but never uses it. If this is solely for signature unification with the engine, add a short docstring note or a comment; otherwise, remove it to avoid confusion.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- tensorrt_llm/_torch/models/modeling_deepseekv3.py (13 hunks)
- tensorrt_llm/_torch/models/modeling_speculative.py (3 hunks)
- tensorrt_llm/_torch/speculative/mtp.py (7 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
tensorrt_llm/_torch/speculative/mtp.py
tensorrt_llm/_torch/models/modeling_deepseekv3.py
tensorrt_llm/_torch/models/modeling_speculative.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
tensorrt_llm/_torch/speculative/mtp.py
tensorrt_llm/_torch/models/modeling_deepseekv3.py
tensorrt_llm/_torch/models/modeling_speculative.py
🧠 Learnings (3)
📚 Learning: 2025-08-06T21:22:55.018Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T21:22:55.018Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx,cu,py} : All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Applied to files:
tensorrt_llm/_torch/models/modeling_speculative.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
tensorrt_llm/_torch/models/modeling_speculative.py
📚 Learning: 2025-07-17T09:01:27.402Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Applied to files:
tensorrt_llm/_torch/models/modeling_speculative.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/models/modeling_speculative.py
351-351: Comparison to None
should be cond is not None
Replace with cond is not None
(E711)
🔇 Additional comments (3)
tensorrt_llm/_torch/models/modeling_speculative.py (1)
375-393
: LGTM: unified draft_model construction and one-engine wiring.
Using `get_draft_model(...)` and preserving the kv_cache quant algo for Eagle3 is clean and minimizes special cases. Good separation of concerns.

tensorrt_llm/_torch/speculative/mtp.py (1)
467-475
: LGTM: refactor to use draft_model encapsulation.
Passing a single `draft_model` into MTP workers simplifies wiring and reduces parameter surface. The use of `draft_model.lm_head` and `embed_tokens` is consistent.

tensorrt_llm/_torch/models/modeling_deepseekv3.py (1)
536-545
: LGTM: replace boolean-or on tensor with explicit None check.
The `hidden_states_fp4 if hidden_states_fp4 is not None else hidden_states` pattern avoids PyTorch boolean evaluation pitfalls. Good fix.
PR_Github #14893 [ run ] triggered by Bot
Actionable comments posted: 3
🔭 Outside diff range comments (1)
tensorrt_llm/_torch/speculative/mtp.py (1)
1-1
: Add required NVIDIA copyright header.
Per the coding guidelines, all TensorRT-LLM source files must contain an NVIDIA copyright header with the current year.
Add the following copyright header at the beginning of the file:
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
♻️ Duplicate comments (4)
tensorrt_llm/_torch/models/modeling_speculative.py (1)
1-1
: Add required NVIDIA copyright header.
Per the coding guidelines, all TensorRT-LLM source files (.py, .cpp, .h, .cu, etc.) must contain an NVIDIA copyright header with the current year.
Add the following copyright header at the beginning of the file:
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

tensorrt_llm/_torch/speculative/mtp.py (1)
1041-1048
: Fix undefined variable reference.
The code references `self.model_config.mapping.tp_rank` but based on the context and line 217 where `self.mapping = None` is initialized, this should be `self.mapping.tp_rank` after proper initialization.

-            torch.int32) + (self.model_config.mapping.tp_rank * vocab_per_rank)
+            torch.int32) + (self.mapping.tp_rank * vocab_per_rank)

tensorrt_llm/_torch/models/modeling_deepseekv3.py (2)
382-382
: Remove debug print statement.
The print statement on line 382 appears to be debug code that should not be in production.

         for name, module in tqdm(all_named_modules.items(),
                                  desc="Loading weights"):
-            print(f"name: {name}")
1082-1084
: Remove unused parameters or handle them appropriately.
The `spec_metadata` and `**kwargs` parameters are accepted but never used in the forward method. Either remove them from the signature or pass them through to the layers that might need them.
🧹 Nitpick comments (3)
tensorrt_llm/_torch/models/modeling_speculative.py (2)
351-351
: Use `is not None` for None comparison.
Replace equality check with identity check for better Python practice.

-    assert getattr(model_config, 'spec_config', None) != None
+    assert getattr(model_config, 'spec_config', None) is not None
350-363
: Consider simplifying the conditional logic.
The function has three branches but might be clearer with early returns.

 def get_draft_model(model_config, draft_config, lm_head, model):
-    assert getattr(model_config, 'spec_config', None) != None
+    spec_config = getattr(model_config, 'spec_config', None)
+    assert spec_config is not None, "spec_config is required for draft model creation"
+
     spec_dec_mode = model_config.spec_config.spec_dec_mode
+
     if spec_dec_mode.is_eagle3_one_model():
         return Eagle3ForCausalLM(
             draft_config, model_config.pretrained_config.num_hidden_layers)
-    elif spec_dec_mode.is_mtp():
+
+    if spec_dec_mode.is_mtp():
         return MTPForCausalLM(model_config,
                               model_config.pretrained_config.num_hidden_layers,
                               lm_head, model)
-    else:
-        raise NotImplementedError(
-            f"get_draft_model does not support speculative decoding mode {spec_dec_mode}."
-        )
+
+    raise NotImplementedError(
+        f"get_draft_model does not support speculative decoding mode {spec_dec_mode}."
+    )
747-748
: Update return type annotations for consistency.Multiple methods now return
Tuple[torch.Tensor, torch.Tensor]
but the model's forward method returns justtorch.Tensor
. Ensure all return type annotations accurately reflect what's being returned.Also applies to: 779-780, 863-864
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
tensorrt_llm/_torch/models/modeling_deepseekv3.py
(12 hunks)tensorrt_llm/_torch/models/modeling_speculative.py
(3 hunks)tensorrt_llm/_torch/speculative/mtp.py
(7 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
tensorrt_llm/_torch/speculative/mtp.py
tensorrt_llm/_torch/models/modeling_deepseekv3.py
tensorrt_llm/_torch/models/modeling_speculative.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
tensorrt_llm/_torch/speculative/mtp.py
tensorrt_llm/_torch/models/modeling_deepseekv3.py
tensorrt_llm/_torch/models/modeling_speculative.py
🧠 Learnings (3)
📚 Learning: 2025-08-06T21:22:55.018Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T21:22:55.018Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx,cu,py} : All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Applied to files:
tensorrt_llm/_torch/models/modeling_speculative.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
tensorrt_llm/_torch/models/modeling_speculative.py
📚 Learning: 2025-07-17T09:01:27.402Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Applied to files:
tensorrt_llm/_torch/models/modeling_speculative.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/models/modeling_speculative.py
351-351: Comparison to None
should be cond is not None
Replace with cond is not None
(E711)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (7)
tensorrt_llm/_torch/models/modeling_speculative.py (1)
323-348
: Well-structured MTP draft model initialization.The MTPForCausalLM class properly integrates with the DeepseekV3 model by:
- Correctly creating MTP layers based on configuration
- Sharing the auxiliary stream dictionary from the target model
- Properly exposing lm_head and embed_tokens for shared use
tensorrt_llm/_torch/speculative/mtp.py (2)
327-336
: Clean refactor to use draft_model parameter.The forward method signature has been properly updated to accept a centralized
draft_model
parameter, eliminating the need for separate lm_head, embed_tokens, and mtp_layers parameters. This improves encapsulation and maintainability.
471-476
: Verify shared_head signature across all call sitesI wasn’t able to locate the
shared_head
definition in the repo, but it’s invoked with three args at L474 and four at L1195—this may be an optional flag or overload. Please:• Check the
shared_head
method declaration in tensorrt_llm/_torch/speculative/mtp.py
• Ensure it accepts both
– 3 parameters (hidden_states, lm_head, attn_metadata)
– 4 parameters (plus the boolean flag)
• Update either the signature or the call sites for consistency:– Line 474:
mtp_layer.shared_head(hidden_states, draft_model.lm_head, attn_metadata)
– Line 1195:
draft_model.mtp_layers[0].shared_head(hidden_states[gather_ids], draft_model.lm_head, attn_metadata, True)
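If both call styles are intended, one way to reconcile them is a defaulted fourth argument; a hedged sketch (the flag name and the pre-head normalization are guesses, not the real signature):

```python
from torch import nn


class SharedHeadSketch(nn.Module):
    """Accepts both the 3-argument and 4-argument call styles quoted above."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)  # stand-in for the real pre-head norm

    def forward(self, hidden_states, lm_head, attn_metadata,
                return_context_logits: bool = False):
        # attn_metadata is accepted only for signature compatibility in this sketch.
        logits = lm_head(self.norm(hidden_states))
        return (logits, hidden_states) if return_context_logits else logits
```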
tensorrt_llm/_torch/models/modeling_deepseekv3.py (4)
1-26
: Copyright header properly includes DeepSeek attribution.
The file correctly includes both the DeepSeek copyright notice and the required attribution for derived code.
1108-1109
: Excellent refactoring to use SpecDecOneEngineForCausalLM base class.
The DeepseekV3ForCausalLM class has been cleanly refactored to inherit from SpecDecOneEngineForCausalLM, which centralizes the speculative decoding logic and eliminates code duplication.
536-537
: Consistent use of ternary operator for tensor selection.
Good use of explicit None checks to avoid issues with tensor boolean evaluation.
Also applies to: 562-563
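For reference, the pitfall that the explicit check avoids: applying `or` to a multi-element tensor forces a truth-value evaluation and raises, whereas the `is not None` comparison never touches the data:

```python
import torch

hidden_states = torch.randn(4, 8)
hidden_states_fp4 = None

# Safe: only compares against None, never evaluates tensor truthiness.
x = hidden_states_fp4 if hidden_states_fp4 is not None else hidden_states

# Unsafe once hidden_states_fp4 is a real tensor:
hidden_states_fp4 = torch.randn(4, 8)
try:
    y = hidden_states_fp4 or hidden_states
except RuntimeError as err:
    print(err)  # Boolean value of Tensor with more than one element is ambiguous
```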
955-956
: Well-designed MTP head implementation.
The DeepseekV3MTPHead class provides a clean abstraction for the MTP-specific head operations with proper normalization and context-aware logit extraction.
/bot run
PR_Github #14918 [ run ] triggered by Bot
PR_Github #14893 [ run ] completed with state
/bot run
PR_Github #14931 [ run ] triggered by Bot
PR_Github #14918 [ run ] completed with state
PR_Github #14931 [ run ] completed with state
@mikeiovine The pipeline has passed. Could you please review the PR again? Thanks.
Signed-off-by: linquanh <[email protected]>
Signed-off-by: linquanh <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
Signed-off-by: linquanh <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
Signed-off-by: linquanh <[email protected]>
Signed-off-by: linquanh <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
Signed-off-by: linquanh <[email protected]>
Signed-off-by: linquanh <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
Summary by CodeRabbit
New Features
Refactor
Bug Fixes
Description
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run `/bot [-h|--help]` to print this help message.
See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.