
Conversation

@kris1025 kris1025 commented Aug 7, 2025

Summary by CodeRabbit

  • New Features

    • Introduces a unified speculative decoding engine supporting both drafting modes (vanilla MTP and Eagle3), with an integrated draft-model drafting path for faster speculative generation.
  • Refactor

    • Model inference and initialization now delegate to the speculative engine; drafting layers, routing, and gating are integrated into the model graph and forward flow.
    • Draft workers now consume a single draft-model interface for embeddings, layers, and head.
  • Bug Fixes

    • More precise speculative-mode detection to ensure correct worker selection.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Only [pytorch, cpp, tensorrt, triton] are supported. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since a lack of user care and validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since a lack of user care and validation can break the top of tree.

@kris1025 kris1025 requested review from a team as code owners August 7, 2025 09:08

coderabbitai bot commented Aug 7, 2025

📝 Walkthrough

Walkthrough

Replace DeepseekV3's bespoke MTP/MoE wiring with a SpecDecOneEngineForCausalLM-based speculative decoding path: introduce SpecMetadata, centralize draft_model for MTP/Eagle drafting, add MTP/spec classes and workers, update signatures to accept spec_metadata, and refactor worker APIs to consume a single draft_model.
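
As a rough orientation, the delegation described above could look like the following minimal sketch. The class and attribute names mirror the walkthrough (SpecDecOneEngineForCausalLM, SpecMetadata, draft_model, spec_worker); the bodies and fields are placeholders, not the PR's actual implementation.

from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class SpecMetadata:
    """Per-step speculative-decoding state (placeholder fields)."""
    draft_tokens: Optional[Any] = None
    all_rank_num_tokens: Optional[Any] = None


class SpecDecOneEngineForCausalLM:
    """Target model that owns a draft model and a spec worker (sketch only)."""

    def __init__(self, model, draft_model, spec_worker):
        self.model = model              # target (DeepseekV3) backbone
        self.draft_model = draft_model  # MTP or Eagle3 draft path
        self.spec_worker = spec_worker  # MTPWorker / MTPEagleWorker

    def forward(self, input_ids, spec_metadata: Optional[SpecMetadata] = None, **kwargs):
        hidden_states = self.model(input_ids, spec_metadata=spec_metadata, **kwargs)
        if spec_metadata is None:
            return hidden_states  # plain (non-speculative) decoding
        # Speculative path: the worker drives drafting through draft_model.
        return self.spec_worker(hidden_states, spec_metadata,
                                draft_model=self.draft_model)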

Changes

Cohort / File(s) Change Summary
DeepseekV3 model & speculative MTP classes
tensorrt_llm/_torch/models/modeling_deepseekv3.py
Replace MTP/MoE worker wiring with a speculative engine: subclass SpecDecOneEngineForCausalLM, add spec_metadata parameters, introduce new MTP/spec classes (DeepseekV3MTPHead, DeepseekV3Linear, DeepseekV3Attention, Deepseekv3RoutingImpl, DeepseekV3Gate, Deepseekv3MoE, DeepseekV3DecoderLayer, DeepseekV3MTP), and update init/forward signatures and return types.
Speculative engine & draft-model selection
tensorrt_llm/_torch/models/modeling_speculative.py
Add MTPForCausalLM draft-model path; change get_draft_model(model_config, draft_config, lm_head, model) to return Eagle3 or MTP drafts; wire draft_model creation in SpecDecOneEngineForCausalLM; preserve kv-cache quant handling; raise NotImplementedError for unsupported modes.
MTP worker API consolidation
tensorrt_llm/_torch/speculative/mtp.py
Centralize worker inputs by passing a single draft_model (holding mtp_layers, embed_tokens, lm_head); update MTPWorker / MTPEagleWorker forward and skip_forward signatures and internals to use draft_model.
Speculative utils refactor
tensorrt_llm/_torch/speculative/utils.py
Use local spec_dec_mode = spec_config.spec_dec_mode and switch conditional to spec_dec_mode.is_mtp_vanilla() in get_spec_worker to distinguish vanilla MTP vs MTP_EAGLE.
Speculative mode enum
tensorrt_llm/_torch/speculative/interface.py
Add is_mtp_vanilla() to SpeculativeDecodingMode to detect exact MTP (distinct from MTP_EAGLE).
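
To make the last two rows concrete, here is a hedged sketch of how the new is_mtp_vanilla() helper and the worker selection in get_spec_worker might fit together. The enum members and the two worker classes are simplified stand-ins for the real ones.

from enum import Enum, auto


class SpeculativeDecodingMode(Enum):
    MTP = auto()
    MTP_EAGLE = auto()
    EAGLE3_ONE_MODEL = auto()

    def is_mtp(self) -> bool:
        # either MTP flavour
        return self in (SpeculativeDecodingMode.MTP,
                        SpeculativeDecodingMode.MTP_EAGLE)

    def is_mtp_vanilla(self) -> bool:
        # exact MTP only, distinct from MTP_EAGLE
        return self is SpeculativeDecodingMode.MTP


class MTPWorker:            # stand-in for the real worker class
    def __init__(self, spec_config, mapping=None):
        self.spec_config, self.mapping = spec_config, mapping


class MTPEagleWorker(MTPWorker):
    pass


def get_spec_worker(spec_config, mapping):
    spec_dec_mode = spec_config.spec_dec_mode
    if spec_dec_mode.is_mtp_vanilla():
        return MTPWorker(spec_config, mapping)
    return MTPEagleWorker(spec_config, mapping)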

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant DeepseekV3ForCausalLM
    participant SpecDecOneEngineForCausalLM
    participant SpecWorker
    participant DraftModel

    User->>DeepseekV3ForCausalLM: forward(input_ids, spec_metadata, ...)
    DeepseekV3ForCausalLM->>SpecDecOneEngineForCausalLM: forward(..., spec_metadata)
    SpecDecOneEngineForCausalLM->>SpecWorker: invoke(spec_metadata, draft_model)
    SpecWorker->>DraftModel: use draft_model.mtp_layers / embed_tokens / lm_head
    SpecWorker-->>SpecDecOneEngineForCausalLM: speculative logits/metadata
    SpecDecOneEngineForCausalLM-->>DeepseekV3ForCausalLM: final logits
    DeepseekV3ForCausalLM-->>User: output
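
The draft step in the diagram boils down to the consolidated draft_model interface. A minimal sketch, assuming the attribute layout described later in the review (draft_model.model.mtp_layers, draft_model.model.embed_tokens, draft_model.lm_head); the layer call signature is illustrative only:

def run_mtp_draft(draft_model, hidden_states, spec_metadata):
    """Hypothetical helper showing how a worker consumes one draft_model."""
    embed_tokens = draft_model.model.embed_tokens
    for mtp_layer in draft_model.model.mtp_layers:
        # each drafting layer sees the shared embeddings and spec metadata
        hidden_states = mtp_layer(hidden_states,
                                  embed_tokens=embed_tokens,
                                  spec_metadata=spec_metadata)
    # the shared LM head produces the draft logits
    return draft_model.lm_head(hidden_states)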

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~40 minutes

Possibly related PRs

Suggested labels

SW Architecture

Suggested reviewers

  • dongxuy04
  • yechank-nvidia
  • juney-nvidia
  • nv-guomingz
  • byshiue

@kris1025 kris1025 marked this pull request as draft August 7, 2025 09:08

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

🔭 Outside diff range comments (2)
tensorrt_llm/llmapi/llm_args.py (1)

1-6: Missing NVIDIA copyright header – mandatory for OSS files

All TensorRT-LLM *.py sources must begin with the standard NVIDIA copyright & Apache-2.0 license boilerplate (see coding-guidelines).
Add it before the first import to stay policy-compliant.

+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#     http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
tensorrt_llm/_torch/models/modeling_speculative.py (1)

1288-1301: Fix comparison style and exception type.

Two issues in this function:

  1. Line 1289: Use is not None instead of != None (PEP 8 style)
  2. Line 1298: Should raise NotImplementedError instead of NotImplemented
-    assert getattr(model_config, 'spec_config', None) != None
+    assert getattr(model_config, 'spec_config', None) is not None
     spec_dec_mode = model_config.spec_config.spec_dec_mode
     if spec_dec_mode.is_eagle3_one_model():
         return Eagle3ForCausalLM(
             draft_config, model_config.pretrained_config.num_hidden_layers)
     elif spec_dec_mode.is_mtp():
         return MTPForCausalLM(model, draft_config,
                               model_config.pretrained_config.num_hidden_layers)
     else:
-        raise NotImplemented(
+        raise NotImplementedError(
             f"get_draft_model does not support speculative decoding mode {spec_dec_mode}."
         )
🧹 Nitpick comments (2)
tensorrt_llm/_torch/modules/attention.py (1)

519-542: New FP8 batched GEMM implementation for SM 100

The previously commented-out SM 100 implementation is now active with DeepSeek-specific optimizations. The implementation:

  • Pads input to tile boundaries for optimal performance
  • Uses specialized FP8 batched GEMM with DeepSeek optimizations
  • Removes the unused mat2_dequant parameter

The hardcoded parameters (tile_size=8, epilogue_tile_m=64) appear to be tuned for specific use cases. Consider making these configurable if different models require different settings.
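
For illustration only, the padding idea (not the kernel call itself) can be sketched as below; tile=8 mirrors the hardcoded tile_size mentioned above, and the helper name is hypothetical.

import torch
import torch.nn.functional as F


def pad_rows_to_tile(x: torch.Tensor, tile: int = 8) -> torch.Tensor:
    """Pad the row dimension of a (..., rows, cols) tensor up to a tile multiple."""
    rows = x.shape[-2]
    pad = (-rows) % tile
    if pad == 0:
        return x
    # F.pad pads the last dims first: (cols_left, cols_right, rows_top, rows_bottom)
    return F.pad(x, (0, 0, 0, pad))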

tensorrt_llm/_torch/models/modeling_deepseekv3.py (1)

303-307: Remove or use unused function check_weight_dtype.

The function check_weight_dtype is defined but never called in the code.

Either remove this unused function or use it where appropriate for dtype validation:

-        def check_weight_dtype(module_name: str, dtype):
-            weight_name = "weight"
-            w_dtype = weights[f"{module_name}.{weight_name}"].dtype
-            return w_dtype == dtype
-
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8ec3b1d and a07584d.

📒 Files selected for processing (21)
  • examples/llm-api/quickstart_advanced.py (1 hunks)
  • requirements.txt (1 hunks)
  • tensorrt_llm/_torch/models/checkpoints/hf/weight_loader.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py (12 hunks)
  • tensorrt_llm/_torch/models/modeling_speculative.py (3 hunks)
  • tensorrt_llm/_torch/modules/attention.py (3 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/create_moe.py (0 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (0 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/quantization.py (0 hunks)
  • tensorrt_llm/_torch/modules/linear.py (3 hunks)
  • tensorrt_llm/_torch/pyexecutor/_util.py (0 hunks)
  • tensorrt_llm/_torch/speculative/mtp.py (5 hunks)
  • tensorrt_llm/_torch/speculative/utils.py (1 hunks)
  • tensorrt_llm/_utils.py (0 hunks)
  • tensorrt_llm/llmapi/llm_args.py (1 hunks)
  • tensorrt_llm/quantization/utils/__init__.py (1 hunks)
  • tensorrt_llm/quantization/utils/fp8_utils.py (0 hunks)
  • tests/unittest/_torch/helpers.py (0 hunks)
  • tests/unittest/_torch/modules/test_fused_moe.py (1 hunks)
  • tests/unittest/_torch/thop/test_fp8_block_scale_gemm.py (1 hunks)
  • tests/unittest/test_pip_install.py (0 hunks)
💤 Files with no reviewable changes (8)
  • tests/unittest/test_pip_install.py
  • tests/unittest/_torch/helpers.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py
  • tensorrt_llm/_torch/modules/fused_moe/create_moe.py
  • tensorrt_llm/_utils.py
  • tensorrt_llm/quantization/utils/fp8_utils.py
  • tensorrt_llm/_torch/modules/fused_moe/quantization.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tensorrt_llm/quantization/utils/__init__.py
  • tensorrt_llm/_torch/models/checkpoints/hf/weight_loader.py
  • tensorrt_llm/_torch/speculative/utils.py
  • examples/llm-api/quickstart_advanced.py
  • tensorrt_llm/llmapi/llm_args.py
  • tests/unittest/_torch/modules/test_fused_moe.py
  • tests/unittest/_torch/thop/test_fp8_block_scale_gemm.py
  • tensorrt_llm/_torch/speculative/mtp.py
  • tensorrt_llm/_torch/modules/attention.py
  • tensorrt_llm/_torch/modules/linear.py
  • tensorrt_llm/_torch/models/modeling_speculative.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tensorrt_llm/quantization/utils/__init__.py
  • tensorrt_llm/_torch/models/checkpoints/hf/weight_loader.py
  • tensorrt_llm/_torch/speculative/utils.py
  • examples/llm-api/quickstart_advanced.py
  • tensorrt_llm/llmapi/llm_args.py
  • tests/unittest/_torch/modules/test_fused_moe.py
  • tests/unittest/_torch/thop/test_fp8_block_scale_gemm.py
  • tensorrt_llm/_torch/speculative/mtp.py
  • tensorrt_llm/_torch/modules/attention.py
  • tensorrt_llm/_torch/modules/linear.py
  • tensorrt_llm/_torch/models/modeling_speculative.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
🧠 Learnings (5)
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Applied to files:

  • tensorrt_llm/_torch/models/checkpoints/hf/weight_loader.py
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tensorrt_llm/_torch/models/checkpoints/hf/weight_loader.py
  • tests/unittest/_torch/modules/test_fused_moe.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
📚 Learning: in tensorrt-llm's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()...
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

Applied to files:

  • tensorrt_llm/_torch/modules/attention.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
📚 Learning: in tensorrt-llm, examples directory can have different dependency versions than the root requirement...
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • tensorrt_llm/_torch/modules/linear.py
  • tensorrt_llm/_torch/models/modeling_speculative.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
📚 Learning: in tensorrt-llm, test files (files under tests/ directories) do not require nvidia copyright headers...
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tensorrt_llm/_torch/modules/linear.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/speculative/mtp.py

1049-1049: Undefined name mapping

(F821)

tensorrt_llm/_torch/models/modeling_speculative.py

418-418: Undefined name get_sm_version

(F821)


435-435: Line too long (135 > 120)

(E501)


487-487: Undefined name F

(F821)


491-491: Undefined name enable_llm_debug

(F821)


494-494: Undefined name warnings

(F821)


580-580: Line too long (122 > 120)

(E501)


712-712: Line too long (125 > 120)

(E501)


718-718: Line too long (136 > 120)

(E501)


730-730: Undefined name TRTLLMGenFusedMoE

(F821)


1289-1289: Comparison to None should be cond is not None

Replace with cond is not None

(E711)

tensorrt_llm/_torch/models/modeling_deepseekv3.py

40-40: Redefinition of unused ModelConfig from line 39

(F811)

🔇 Additional comments (16)
requirements.txt (1)

63-66: Do not remove the deep_gemm dependency – it’s still in use

A search for “deep_gemm” reveals active references in the codebase, so the VCS-pinned package is still required:

• tests/unittest/_torch/thop/test_fp8_block_scale_gemm.py (calls test_deep_gemm_in_subprocess)
• tests/unittest/_torch/thop/deep_gemm_tests.py (actual test suite)
• scripts/build_wheel.py (copies the deep_gemm include directory into the wheel)

Please keep the existing line in requirements.txt.

Likely an incorrect or invalid review comment.

tensorrt_llm/_torch/models/checkpoints/hf/weight_loader.py (1)

15-15: Synchronization Removal in HfWeightLoader.prefetch_files Is Intentional and Safe

The prefetch_files method shards files across ranks using local_mpi_rank()/local_mpi_size() and each rank only prefetches its own subset. No cross-rank dependency exists, so an explicit barrier after prefetch isn't required for correctness; each rank independently reads its files into the OS cache before loading.

• No mpi_barrier calls remain or are needed in this file.
  • Other MPI synchronization sites (e.g., in allreduce, executor, build) are unaffected and still present where inter-rank ordering matters.

Please ignore the suggestion to reintroduce a barrier here.

Likely an incorrect or invalid review comment.
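
The sharding argument above can be illustrated with a small sketch. This is not HfWeightLoader's code; the helper name and the round-robin split are assumptions, and the read loop only warms the OS page cache.

def prefetch_shard(files, local_rank, local_size, chunk=16 << 20):
    """Each rank prefetches only its own slice of the file list (sketch)."""
    for path in files[local_rank::local_size]:  # round-robin shard by rank
        with open(path, "rb") as f:
            while f.read(chunk):                # stream into the OS page cache
                pass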

tensorrt_llm/quantization/utils/__init__.py (1)

1-3: LGTM — all fp8_utils references removed.

Verified via ripgrep searches that there are no remaining import statements or direct usages of fp8_utils in the codebase.

tensorrt_llm/_torch/speculative/utils.py (1)

150-150: MTPWorker constructor correctly updated to accept mapping

The __init__ in tensorrt_llm/_torch/speculative/mtp.py now defines mapping=None, so using MTPWorker(spec_config, model_config.mapping) is valid and remains backward-compatible with tests that omit the mapping. No further changes required.

  • Constructor signature in tensorrt_llm/_torch/speculative/mtp.py
  • Usage in tensorrt_llm/_torch/speculative/utils.py updated to pass model_config.mapping
  • Tests in tests/unittest/_torch/speculative/test_mtp.py still call MTPWorker(spec_config), which works with the default mapping=None
tensorrt_llm/llmapi/llm_args.py (1)

170-172: No leftover DEEPGEMM references detected – changes approved

A repo-wide search for "DEEPGEMM" returned no matches, confirming the literal has been fully pruned.
No further action required.

tests/unittest/_torch/modules/test_fused_moe.py (1)

12-12: LGTM! Clean import cleanup.

The removal of unused FP8 quantization helpers (per_block_cast_to_fp8_e8m0 and per_token_cast_to_fp8_e8m0) aligns with the DeepGEMM backend removal. The remaining import per_block_cast_to_fp8 is still needed for other tests.

tests/unittest/_torch/thop/test_fp8_block_scale_gemm.py (1)

21-21: LGTM! Clean import cleanup.

The removal of unused FP8 quantization helpers (per_block_cast_to_fp8_e8m0 and per_token_cast_to_fp8_e8m0) is consistent with the DeepGEMM test removal. The remaining imports are still needed for other tests.

tensorrt_llm/_torch/speculative/mtp.py (5)

321-321: LGTM! Constructor signature simplification.

The change from model_config to mapping parameter simplifies the constructor interface and reduces dependencies.


324-324: LGTM! Attribute assignment update.

The assignment properly stores the mapping parameter as an instance attribute.


335-335: LGTM! Forward method signature consolidation.

Consolidating model components into a single draft_model parameter provides a cleaner interface than separate lm_head, embed_tokens, and mtp_layers parameters.


471-474: LGTM! Proper utilization of consolidated draft_model parameter.

The code correctly accesses MTP layers and model components through the consolidated draft_model object structure.


1091-1094: LGTM! Consistent mapping access pattern.

The changes properly use self.mapping instead of self.model_config.mapping, consistent with the constructor refactoring.

tensorrt_llm/_torch/modules/linear.py (3)

573-576: Code simplified by removing DeepGEMM backend

The removal of SM 100-specific deep_gemm logic simplifies the FP8 GEMM computation path. This aligns with the broader removal of DeepGEMM support from the codebase.


601-617: Critical fix: Missing weight copy in fused QKV loading

Good catch! The fused weight tensor is now properly copied to the module's weight parameter before copying the scales. This fixes a critical bug where the concatenated QKV weights were not being assigned to the module.


619-632: Critical fix: Missing weight copy in fused gate/up loading

Same critical fix as in the QKV method - the fused weight tensor is now properly copied to the module's weight parameter. This ensures gate and up weights are correctly loaded.
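
The essence of both fixes, as a hedged sketch (hypothetical helper; the real methods also handle scales and quantized layouts):

import torch


def load_fused_qkv_weights(module, q_weight, k_weight, v_weight):
    """Concatenate Q/K/V and copy the result into module.weight (sketch)."""
    fused = torch.cat([q_weight, k_weight, v_weight], dim=0)
    # The copy into the module parameter is the step the review says was missing.
    module.weight.data.copy_(fused)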

tensorrt_llm/_torch/modules/attention.py (1)

1351-1352: Updated function calls to match new signature

The fp8_block_scaling_bmm_out calls have been correctly updated to remove the mat2_dequant argument, matching the simplified function signature.

Also applies to: 1401-1403

@kris1025 kris1025 force-pushed the feature-refactor-deepseekv3-linquanh branch from a07584d to 2e064cd Compare August 7, 2025 09:15

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (7)
tensorrt_llm/_torch/speculative/mtp.py (1)

1049-1049: Fix undefined variable reference.

The code references mapping.tp_rank but mapping is not defined in the local scope. This should be self.mapping.tp_rank.

Apply this fix:

-            torch.int32) + (mapping.tp_rank * vocab_per_rank)
+            torch.int32) + (self.mapping.tp_rank * vocab_per_rank)
tensorrt_llm/_torch/models/modeling_speculative.py (4)

1-1: Add required NVIDIA copyright header.

Per the coding guidelines and retrieved learnings, all TensorRT-LLM source files (.py, .cpp, .h, .cu, etc.) must contain an NVIDIA copyright header with the current year.


418-418: Fix undefined function get_sm_version().

The function get_sm_version() is used but not imported, which will cause a NameError at runtime.


487-494: Fix multiple undefined names in routing implementation.

The noaux_tc method uses several undefined names that will cause runtime errors:

  • Line 487: F.sigmoid - F is not imported
  • Line 491: enable_llm_debug() - function not imported
  • Line 494: warnings - module not imported

730-730: Fix undefined class TRTLLMGenFusedMoE.

The class TRTLLMGenFusedMoE is used in isinstance check but not imported.

tensorrt_llm/_torch/models/modeling_deepseekv3.py (2)

49-51: Remove duplicate import of ModelConfig.

Line 50 redefines ModelConfig that was already imported on line 49, causing confusion and potential bugs.


402-402: Remove debug print statement.

The print statement appears to be debug code that should not be in production.

🧹 Nitpick comments (2)
tensorrt_llm/_torch/models/modeling_speculative.py (2)

468-546: Review routing implementation for potential performance and correctness issues.

The Deepseekv3RoutingImpl class has a complex routing algorithm with both fused and non-fused paths. Consider the following:

  1. The non-fused path (lines 498-533) has nested operations that could be optimized
  2. The fused path relies on a custom operator torch.ops.trtllm.noaux_tc_op which may not be available in all environments
  3. Error handling for the debug path uses undefined names (already flagged)

Consider adding fallback logic if the fused operator is not available:

 else:
+    try:
         topk_values, topk_indices = torch.ops.trtllm.noaux_tc_op(
             scores, scores_with_bias, n_group, self.topk_group, self.top_k,
             self.routed_scaling_factor)
+    except (AttributeError, RuntimeError):
+        # Fallback to non-fused implementation
+        return self._non_fused_routing(scores, scores_with_bias)
     return topk_values, topk_indices

1288-1300: Fix comparison operator for None check.

Use is not None instead of != None for consistency with Python best practices.

-    assert getattr(model_config, 'spec_config', None) != None
+    assert getattr(model_config, 'spec_config', None) is not None
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a07584d and 2e064cd.

📒 Files selected for processing (4)
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py (10 hunks)
  • tensorrt_llm/_torch/models/modeling_speculative.py (3 hunks)
  • tensorrt_llm/_torch/speculative/mtp.py (5 hunks)
  • tensorrt_llm/_torch/speculative/utils.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tensorrt_llm/_torch/speculative/utils.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tensorrt_llm/_torch/models/modeling_speculative.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
  • tensorrt_llm/_torch/speculative/mtp.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tensorrt_llm/_torch/models/modeling_speculative.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
  • tensorrt_llm/_torch/speculative/mtp.py
🧠 Learnings (5)
📚 Learning: applies to **/*.{cpp,h,hpp,cc,cxx,cu,py} : all tensorrt-llm open source software code should contain...
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T21:22:55.018Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx,cu,py} : All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Applied to files:

  • tensorrt_llm/_torch/models/modeling_speculative.py
📚 Learning: in tensorrt-llm, test files (files under tests/ directories) do not require nvidia copyright headers...
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tensorrt_llm/_torch/models/modeling_speculative.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
📚 Learning: applies to **/*.{h,hpp} : use a preprocessor guard in header files. the guard name must have prefix ...
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T21:22:55.018Z
Learning: Applies to **/*.{h,hpp} : Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.

Applied to files:

  • tensorrt_llm/_torch/models/modeling_speculative.py
📚 Learning: in tensorrt-llm, examples directory can have different dependency versions than the root requirement...
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • tensorrt_llm/_torch/models/modeling_speculative.py
📚 Learning: in tensorrt-llm's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()...
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

Applied to files:

  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/models/modeling_speculative.py

418-418: Undefined name get_sm_version

(F821)


435-435: Line too long (135 > 120)

(E501)


487-487: Undefined name F

(F821)


491-491: Undefined name enable_llm_debug

(F821)


494-494: Undefined name warnings

(F821)


580-580: Line too long (122 > 120)

(E501)


712-712: Line too long (125 > 120)

(E501)


718-718: Line too long (136 > 120)

(E501)


730-730: Undefined name TRTLLMGenFusedMoE

(F821)


1289-1289: Comparison to None should be cond is not None

Replace with cond is not None

(E711)

tensorrt_llm/_torch/speculative/mtp.py

1049-1049: Undefined name mapping

(F821)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (14)
tensorrt_llm/_torch/speculative/mtp.py (5)

321-321: LGTM! Constructor signature updated correctly.

The constructor signature change from accepting model_config to mapping aligns with the broader refactoring to use more direct configuration objects.


324-324: LGTM! Mapping assignment is correct.

The assignment of mapping to self.mapping properly stores the configuration for later use in tensor parallelism operations.


335-335: LGTM! Forward method signature simplified appropriately.

The change from individual parameters (lm_head, embed_tokens, mtp_layers) to a single draft_model parameter simplifies the interface and aligns with the new draft model encapsulation approach.


471-475: LGTM! Draft model usage is implemented correctly.

The code properly accesses the MTP layers and components through the draft model:

  • draft_model.model.mtp_layers for iterating over layers
  • draft_model.model.embed_tokens for embeddings
  • draft_model.lm_head for the language model head

This aligns with the new draft model encapsulation pattern.


1091-1094: LGTM! Mapping usage updated correctly.

The code properly uses self.mapping instead of self.model_config.mapping for tensor parallelism operations, which is consistent with the constructor changes and the new configuration approach.

tensorrt_llm/_torch/models/modeling_speculative.py (4)

334-372: LGTM: Well-implemented MTP head with proper logits processing.

The DeepseekV3MTPHead class correctly handles last token extraction using cumulative sum and provides proper gather control for distributed attention. The compiled get_last_token_states method should provide good performance for the token extraction logic.
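
The cumulative-sum trick mentioned here can be shown in isolation. A minimal sketch, assuming packed sequences and a hypothetical helper name (not the class's actual method):

import torch


def last_token_indices(seq_lens: torch.Tensor) -> torch.Tensor:
    """Index of each sequence's last token in a packed (concatenated) batch."""
    return torch.cumsum(seq_lens, dim=0) - 1


# Example: sequences of length 3, 5, 2 packed into 10 tokens.
hidden_states = torch.randn(10, 8)
last_states = hidden_states[last_token_indices(torch.tensor([3, 5, 2]))]  # rows 2, 7, 9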


617-794: LGTM: Comprehensive MoE implementation with proper optimization patterns.

The Deepseekv3MoE class demonstrates good engineering practices:

  • Proper handling of distributed parallelism with attention DP vs tensor parallelism
  • Smart tensor parallelism size computation based on quantization constraints
  • Effective use of auxiliary streams for overlapping computation
  • Correct all-reduce handling for different execution modes

The implementation properly balances performance optimizations with correctness.


796-1106: Well-structured decoder layer with comprehensive fusion optimizations.

The DeepseekV3DecoderLayer implementation shows excellent attention to performance:

  • Intelligent fusion configuration based on hardware capabilities and model configuration
  • Proper handling of different MLP vs MoE execution paths
  • Smart tensor parallelism size computation considering quantization block sizes
  • Effective use of eager fusion for reducing communication overhead

The separation between forward_MoE and forward_mlp methods maintains code clarity while handling the complexity of different execution modes.


1312-1335: LGTM: Proper integration with speculative decoding framework.

The updated constructor properly handles both Eagle3 and MTP speculative decoding modes, with appropriate configuration handling and draft model instantiation. The conditional logic correctly sets up the different speculative decoding paths.

tensorrt_llm/_torch/models/modeling_deepseekv3.py (5)

153-154: LGTM: Proper integration with speculative decoding metadata.

Adding the spec_metadata parameter aligns the DeepseekV3Model forward method with the speculative decoding framework requirements. The parameter is correctly typed and positioned.


179-197: LGTM: Clean integration with speculative decoding engine.

The inheritance change from DecoderModelForCausalLM to SpecDecOneEngineForCausalLM and the updated constructor properly integrate with the new speculative decoding architecture. The use of keyword arguments in the superclass call is more explicit and maintainable.


214-233: LGTM: Proper draft model integration.

The changes correctly integrate with the new draft model architecture:

  • Lines 214-215: Properly extend layers from self.draft_model.model.mtp_layers
  • Line 233: Correctly append self.spec_worker instead of the previous self.mtp_worker

This aligns with the new speculative decoding framework where draft models encapsulate MTP layers.


246-252: LGTM: Clean delegation to speculative decoding engine.

The simplified forward method correctly delegates all processing to the superclass, which now handles the speculative decoding logic. This is a good example of separation of concerns - the base model focuses on its core functionality while the speculative decoding engine handles the advanced features.


406-407: LGTM: Proper filtering of draft model weights.

The addition of draft model filtering in weight loading is appropriate since draft models are handled separately in the new architecture. This prevents conflicts during weight loading.

@kris1025 kris1025 force-pushed the feature-refactor-deepseekv3-linquanh branch 4 times, most recently from 4b20324 to aa6a243 Compare August 8, 2025 08:38
@kris1025 kris1025 changed the title Feature refactor deepseekv3 [TRTLLM-6668] feat refactor deepseekv3 model Aug 8, 2025
@kris1025 kris1025 changed the title [TRTLLM-6668] feat refactor deepseekv3 model [TRTLLM-6668] [feat] refactor deepseekv3 model Aug 8, 2025
@kris1025 kris1025 changed the title [TRTLLM-6668] [feat] refactor deepseekv3 model [TRTLLM-6668][feat] refactor deepseekv3 model Aug 8, 2025
@kris1025 kris1025 requested review from moconnor725 and mikeiovine and removed request for moconnor725 August 8, 2025 08:59
@kris1025 kris1025 self-assigned this Aug 8, 2025
@kris1025 kris1025 marked this pull request as ready for review August 8, 2025 09:03
@kris1025 kris1025 force-pushed the feature-refactor-deepseekv3-linquanh branch from aa6a243 to 694e973 Compare August 8, 2025 09:12
@kris1025 kris1025 force-pushed the feature-refactor-deepseekv3-linquanh branch from ae87923 to d9d16b6 Compare August 12, 2025 03:32
@kris1025

/bot run


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🔭 Outside diff range comments (7)
tensorrt_llm/_torch/models/modeling_speculative.py (1)

449-455: Guard against AttributeError when draft_model has no weights API (MTP path).

MTPForCausalLM is a lightweight wrapper without load_weights/load_weights_from_target_model. Unconditional calls will raise AttributeError. Make this method resilient.

-        self.draft_model.load_weights(weights=weights,
-                                      weight_mapper=weight_mapper)
-        self.draft_model.load_weights_from_target_model(self)
+        # Some draft models (e.g., MTPForCausalLM) are weightless.
+        if hasattr(self.draft_model, "load_weights"):
+            self.draft_model.load_weights(weights=weights,
+                                          weight_mapper=weight_mapper)
+        if hasattr(self.draft_model, "load_weights_from_target_model"):
+            self.draft_model.load_weights_from_target_model(self)
tensorrt_llm/_torch/speculative/mtp.py (3)

103-110: Trailing comma makes mtp_num_modules default a tuple, not int.

mtp_num_modules: int = 1, evaluates to a tuple (1,). This will break consumers (e.g., torch.arange(self.mtp_num_modules)). Remove the comma.

-    mtp_num_modules: int = 1,
+    mtp_num_modules: int = 1

495-496: Fix None check order to avoid AttributeError.

if attn_metadata.is_cuda_graph and attn_metadata is not None dereferences attn_metadata before checking None. Reverse the order.

-        if attn_metadata.is_cuda_graph and attn_metadata is not None:
+        if attn_metadata is not None and attn_metadata.is_cuda_graph:

1166-1198: Bug: MTPEagleWorker always uses mtp_layers[0] instead of layer i.

Within the loop over self.mtp_num_modules, calls to the MTP layer and shared_head are hard-coded to index 0, breaking multi-layer MTP. Use the loop index.

-            if i == 0:
-                hidden_states = draft_model.mtp_layers[0](
+            if i == 0:
+                hidden_states = draft_model.mtp_layers[i](
                     embed_tokens=draft_model.embed_tokens,
                     all_rank_num_tokens=spec_metadata.all_rank_num_tokens,
                     all_rank_max_num_tokens=spec_metadata.
                     all_rank_max_num_tokens,
                     **inputs)
...
-            else:
-                hidden_states = draft_model.mtp_layers[0](
+            else:
+                hidden_states = draft_model.mtp_layers[i](
                     embed_tokens=draft_model.embed_tokens,
                     all_rank_num_tokens=spec_metadata.
                     subseq_all_rank_num_tokens,
                     all_rank_max_num_tokens=max(
                         spec_metadata.subseq_all_rank_num_tokens)
                     if spec_metadata.subseq_all_rank_num_tokens is not None else
                     None,
                     **inputs)
...
-            logits = draft_model.mtp_layers[0].shared_head(
+            logits = draft_model.mtp_layers[i].shared_head(
                 hidden_states[gather_ids], draft_model.lm_head, attn_metadata,
                 True)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (3)

1-1: Add required NVIDIA copyright header (current year).

This production source file lacks the NVIDIA SPDX header. Retain the third-party MIT notice, but prepend the NVIDIA Apache-2.0 header at the very top.

+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
 # --------------------------------------------------
 # Portions of this code were derived from DeepSeek‑V3:

214-221: Fused A GEMM path ignores LoRA delta — gate fast path when LoRA is active.

When lora_params is provided, the fused path will drop LoRA contributions. Only take the fused path when lora_params is None.

-        if (not self.has_any_quant and 1 <= num_tokens <= 16
-                and get_sm_version() != 120):
+        if (lora_params is None
+                and not self.has_any_quant
+                and 1 <= num_tokens <= 16
+                and get_sm_version() != 120):
             output = torch.ops.trtllm.dsv3_fused_a_gemm_op(
                 input, self.weight.t(), bias, None)
         else:
             output = super().apply_linear(input, bias, lora_params, layer_idx)

1128-1157: Fix: self.model_nextn never updated from spec_config.

forward() sets attn_metadata.num_generations_per_batch based on self.model_nextn, but __init__ never assigns it from spec_config. Set it to model_nextn once computed.

         self.model_nextn = 0
         if model_config.spec_config is not None:
             model_nextn = model_config.spec_config.num_nextn_predict_layers
+            self.model_nextn = model_nextn
             ckpt_nextn = self.config.num_nextn_predict_layers
♻️ Duplicate comments (1)
tensorrt_llm/_torch/models/modeling_speculative.py (1)

1-1: Add required NVIDIA copyright header (current year).

Per repo guidelines, all OSS source files must include the NVIDIA copyright header. Please add it at the very top of the file, before any imports. This was previously flagged and remains unresolved.

Apply this patch:

+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
🧹 Nitpick comments (7)
tensorrt_llm/_torch/models/modeling_speculative.py (5)

351-353: Use identity comparison for None (fix Ruff E711).

Replace != None with is not None.

-    assert getattr(model_config, 'spec_config', None) != None
+    assert getattr(model_config, 'spec_config', None) is not None

69-75: Fix incorrect return annotation on init.

__init__ must return None. Current annotation is a Tuple, which is incorrect.

-    ) -> Tuple[torch.Tensor, torch.Tensor]:
+    ) -> None:

104-112: Correct forward() return type annotation (returns a tuple).

Eagle3DecoderLayer.forward returns (hidden_states, residual) but the annotation says torch.Tensor.

-    ) -> torch.Tensor:
+    ) -> Tuple[torch.Tensor, torch.Tensor]:

187-196: Correct forward() return type annotation (returns a tuple).

Eagle3DraftModel.forward returns a tuple (hidden_states, hidden_states_to_save) but the annotation is torch.Tensor.

-    ) -> torch.Tensor:
+    ) -> Tuple[torch.Tensor, torch.Tensor]:

323-349: MTPForCausalLM relies on non-None model/lm_head — assert early.

Given model.aux_stream_dict, model.embed_tokens, and lm_head are dereferenced, assert they are provided to fail-fast with a clear message.

     def __init__(
         self,
         model_config: ModelConfig[PretrainedConfig],
         start_layer_idx: int = 0,
         lm_head: nn.Module = None,
         model: nn.Module = None,
     ):
         super().__init__()
+        assert model is not None and lm_head is not None, \
+            "MTPForCausalLM requires non-None `model` and `lm_head` from the target model."
tensorrt_llm/_torch/models/modeling_deepseekv3.py (2)

969-971: Nit: fix comment typo.

Minor typo in comment.

-            return self.enorm(embed_tokens(input_ids))  #emdedding
+            return self.enorm(embed_tokens(input_ids))  # embedding

1076-1105: Unused spec_metadata in model forward — either use or document.

DeepseekV3Model.forward() accepts spec_metadata but never uses it. If this is solely for signature-unification with the engine, add a short docstring note or a comment; otherwise, remove it to avoid confusion.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f1d678b and ae87923.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py (13 hunks)
  • tensorrt_llm/_torch/models/modeling_speculative.py (3 hunks)
  • tensorrt_llm/_torch/speculative/mtp.py (7 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tensorrt_llm/_torch/speculative/mtp.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
  • tensorrt_llm/_torch/models/modeling_speculative.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tensorrt_llm/_torch/speculative/mtp.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
  • tensorrt_llm/_torch/models/modeling_speculative.py
🧠 Learnings (3)
📚 Learning: 2025-08-06T21:22:55.018Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T21:22:55.018Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx,cu,py} : All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Applied to files:

  • tensorrt_llm/_torch/models/modeling_speculative.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tensorrt_llm/_torch/models/modeling_speculative.py
📚 Learning: 2025-07-17T09:01:27.402Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Applied to files:

  • tensorrt_llm/_torch/models/modeling_speculative.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/models/modeling_speculative.py

351-351: Comparison to None should be cond is not None

Replace with cond is not None

(E711)

🔇 Additional comments (3)
tensorrt_llm/_torch/models/modeling_speculative.py (1)

375-393: LGTM: unified draft_model construction and one-engine wiring.

Using get_draft_model(...) and preserving kv_cache quant algo for Eagle3 is clean and minimizes special cases. Good separation of concerns.

tensorrt_llm/_torch/speculative/mtp.py (1)

467-475: LGTM: refactor to use draft_model encapsulation.

Passing a single draft_model into MTP workers simplifies wiring and reduces parameter surface. The use of draft_model.lm_head and embed_tokens is consistent.

tensorrt_llm/_torch/models/modeling_deepseekv3.py (1)

536-545: LGTM: replace boolean-or on tensor with explicit None check.

The hidden_states_fp4 if hidden_states_fp4 is not None else hidden_states pattern avoids PyTorch boolean evaluation pitfalls. Good fix.
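To make the pitfall concrete, here is a minimal standalone illustration (illustration only, not the model code):

import torch

hidden_states = torch.randn(2, 4)
hidden_states_fp4 = None

# A truthiness-based fallback such as `hidden_states_fp4 or hidden_states` raises
# "Boolean value of Tensor with more than one element is ambiguous" as soon as
# hidden_states_fp4 holds a multi-element tensor, so the explicit None check is the safe form:
x = hidden_states_fp4 if hidden_states_fp4 is not None else hidden_states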

@tensorrt-cicd
Copy link
Collaborator

PR_Github #14893 [ run ] triggered by Bot

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🔭 Outside diff range comments (1)
tensorrt_llm/_torch/speculative/mtp.py (1)

1-1: Add required NVIDIA copyright header.

Per the coding guidelines, all TensorRT-LLM source files must contain an NVIDIA copyright header with the current year.

Add the following copyright header at the beginning of the file:

# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
♻️ Duplicate comments (4)
tensorrt_llm/_torch/models/modeling_speculative.py (1)

1-1: Add required NVIDIA copyright header.

Per the coding guidelines, all TensorRT-LLM source files (.py, .cpp, .h, .cu, etc.) must contain an NVIDIA copyright header with the current year.

Add the following copyright header at the beginning of the file:

# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
tensorrt_llm/_torch/speculative/mtp.py (1)

1041-1048: Fix undefined variable reference.

The code references self.model_config.mapping.tp_rank, but based on the surrounding context and line 217 (where self.mapping is initialized to None), this should be self.mapping.tp_rank once self.mapping has been properly set.

-            torch.int32) + (self.model_config.mapping.tp_rank * vocab_per_rank)
+            torch.int32) + (self.mapping.tp_rank * vocab_per_rank)
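For context, a minimal sketch of the per-rank vocabulary offset this line applies (values and names here are illustrative, not taken from the file):

import torch

# Illustrative numbers only; not the actual model configuration.
tp_size = 4
vocab_size = 32000
vocab_per_rank = vocab_size // tp_size
tp_rank = 2

# Token ids produced from a rank-local slice of the vocabulary must be shifted
# into the global id space by the rank's starting offset.
local_token_ids = torch.tensor([5, 17], dtype=torch.int32)
global_token_ids = local_token_ids + tp_rank * vocab_per_rank   # tensor([16005, 16017])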
tensorrt_llm/_torch/models/modeling_deepseekv3.py (2)

382-382: Remove debug print statement.

The print statement on line 382 appears to be debug code that should not be in production.

         for name, module in tqdm(all_named_modules.items(),
                                  desc="Loading weights"):
-            print(f"name: {name}")

1082-1084: Remove unused parameters or handle them appropriately.

The spec_metadata and **kwargs parameters are accepted but never used in the forward method. Either remove them from the signature or pass them through to the layers that might need them.

🧹 Nitpick comments (3)
tensorrt_llm/_torch/models/modeling_speculative.py (2)

351-351: Use is not None for None comparison.

Replace equality check with identity check for better Python practice.

-    assert getattr(model_config, 'spec_config', None) != None
+    assert getattr(model_config, 'spec_config', None) is not None
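Beyond style, != None can silently misbehave when __eq__ is overloaded; a small self-contained illustration:

class AlwaysEqual:
    # Deliberately pathological __eq__ for illustration: claims equality with everything.
    def __eq__(self, other):
        return True

obj = AlwaysEqual()
print(obj != None)             # False, even though obj is clearly not None
print(obj is not None)         # True; the identity check is unaffected by __eq__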

350-363: Consider simplifying the conditional logic.

The function has three branches but might be clearer with early returns.

 def get_draft_model(model_config, draft_config, lm_head, model):
-    assert getattr(model_config, 'spec_config', None) != None
+    spec_config = getattr(model_config, 'spec_config', None)
+    assert spec_config is not None, "spec_config is required for draft model creation"
+    
     spec_dec_mode = model_config.spec_config.spec_dec_mode
+    
     if spec_dec_mode.is_eagle3_one_model():
         return Eagle3ForCausalLM(
             draft_config, model_config.pretrained_config.num_hidden_layers)
-    elif spec_dec_mode.is_mtp():
+    
+    if spec_dec_mode.is_mtp():
         return MTPForCausalLM(model_config,
                               model_config.pretrained_config.num_hidden_layers,
                               lm_head, model)
-    else:
-        raise NotImplementedError(
-            f"get_draft_model does not support speculative decoding mode {spec_dec_mode}."
-        )
+    
+    raise NotImplementedError(
+        f"get_draft_model does not support speculative decoding mode {spec_dec_mode}."
+    )
tensorrt_llm/_torch/models/modeling_deepseekv3.py (1)

747-748: Update return type annotations for consistency.

Multiple methods now return Tuple[torch.Tensor, torch.Tensor] but the model's forward method returns just torch.Tensor. Ensure all return type annotations accurately reflect what's being returned.

Also applies to: 779-780, 863-864

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ae87923 and d9d16b6.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py (12 hunks)
  • tensorrt_llm/_torch/models/modeling_speculative.py (3 hunks)
  • tensorrt_llm/_torch/speculative/mtp.py (7 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tensorrt_llm/_torch/speculative/mtp.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
  • tensorrt_llm/_torch/models/modeling_speculative.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tensorrt_llm/_torch/speculative/mtp.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
  • tensorrt_llm/_torch/models/modeling_speculative.py
🧠 Learnings (3)
📚 Learning: 2025-08-06T21:22:55.018Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T21:22:55.018Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx,cu,py} : All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Applied to files:

  • tensorrt_llm/_torch/models/modeling_speculative.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tensorrt_llm/_torch/models/modeling_speculative.py
📚 Learning: 2025-07-17T09:01:27.402Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Applied to files:

  • tensorrt_llm/_torch/models/modeling_speculative.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/models/modeling_speculative.py

351-351: Comparison to None should be cond is not None

Replace with cond is not None

(E711)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (7)
tensorrt_llm/_torch/models/modeling_speculative.py (1)

323-348: Well-structured MTP draft model initialization.

The MTPForCausalLM class properly integrates with the DeepseekV3 model by:

  • Correctly creating MTP layers based on configuration
  • Sharing the auxiliary stream dictionary from the target model
  • Properly exposing lm_head and embed_tokens for shared use (a rough shape sketch follows below)
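A rough, hypothetical sketch of that wiring (class and attribute names are assumed for illustration, not the actual implementation):

import torch.nn as nn


class DraftModelSketch(nn.Module):
    """Illustrative shape only: draft layers are owned, head and embeddings are shared."""

    def __init__(self, target_model: nn.Module, num_mtp_layers: int, hidden_size: int):
        super().__init__()
        # The draft model owns its MTP layers (placeholder modules here).
        self.mtp_layers = nn.ModuleList(
            nn.Linear(hidden_size, hidden_size) for _ in range(num_mtp_layers))
        # The lm_head and token embeddings are reused from the target model rather
        # than duplicated, which is what the draft workers rely on.
        self.lm_head = target_model.lm_head
        self.embed_tokens = target_model.embed_tokens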
tensorrt_llm/_torch/speculative/mtp.py (2)

327-336: Clean refactor to use draft_model parameter.

The forward method signature has been properly updated to accept a centralized draft_model parameter, eliminating the need for separate lm_head, embed_tokens, and mtp_layers parameters. This improves encapsulation and maintainability.
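Illustrative only (parameter names other than draft_model are assumed), the shape of the consolidation being described:

# Before: shared components threaded through as separate arguments (hypothetical signature).
def forward_old(input_ids, hidden_states, lm_head, embed_tokens, mtp_layers, attn_metadata):
    ...

# After: one draft_model handle owns lm_head, embed_tokens, and mtp_layers.
def forward_new(input_ids, hidden_states, draft_model, attn_metadata):
    inputs_embeds = draft_model.embed_tokens(input_ids)
    ...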


471-476: Verify shared_head signature across all call sites

I wasn’t able to locate the shared_head definition in the repo, but it is invoked with three arguments at line 474 and four at line 1195; the fourth may be an optional flag or an overload. Please:

• Check the shared_head method declaration in tensorrt_llm/_torch/speculative/mtp.py
• Ensure it accepts both
– 3 parameters (hidden_states, lm_head, attn_metadata)
– 4 parameters (plus the boolean flag)
• Update either the signature or the call sites for consistency:

– Line 474:
mtp_layer.shared_head(hidden_states, draft_model.lm_head, attn_metadata)
– Line 1195:
draft_model.mtp_layers[0].shared_head(hidden_states[gather_ids], draft_model.lm_head, attn_metadata, True)
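One way the two call sites above could be reconciled, sketched with a hypothetical flag name (the real parameter must be confirmed against the actual definition):

import torch
import torch.nn as nn


def shared_head(hidden_states: torch.Tensor,
                lm_head: nn.Linear,
                attn_metadata=None,
                return_all_logits: bool = False) -> torch.Tensor:
    # Sketch only: a defaulted fourth parameter lets the 3-argument call (line 474)
    # and the 4-argument call (line 1195) share one signature; the flag name is an assumption.
    logits = lm_head(hidden_states)
    return logits if return_all_logits else logits[-1:]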

tensorrt_llm/_torch/models/modeling_deepseekv3.py (4)

1-26: Copyright header properly includes DeepSeek attribution.

The file correctly includes both the DeepSeek copyright notice and the required attribution for derived code.


1108-1109: Excellent refactoring to use SpecDecOneEngineForCausalLM base class.

The DeepseekV3ForCausalLM class has been cleanly refactored to inherit from SpecDecOneEngineForCausalLM, which centralizes the speculative decoding logic and eliminates code duplication.


536-537: Consistent use of ternary operator for tensor selection.

Good use of explicit None checks to avoid issues with tensor boolean evaluation.

Also applies to: 562-563


955-956: Well-designed MTP head implementation.

The DeepseekV3MTPHead class provides a clean abstraction for the MTP-specific head operations with proper normalization and context-aware logit extraction.
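A hypothetical sketch of what such a head looks like (module and argument names are illustrative; the LayerNorm is a placeholder for the model's actual normalization):

from typing import Optional

import torch
import torch.nn as nn


class MTPHeadSketch(nn.Module):
    """Illustrative only: normalize hidden states, then emit logits for the positions of interest."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)  # placeholder for the model's own norm

    def forward(self, hidden_states: torch.Tensor, lm_head: nn.Linear,
                last_token_indices: Optional[torch.Tensor] = None) -> torch.Tensor:
        h = self.norm(hidden_states)
        if last_token_indices is not None:
            # Context phase: only the last token of each request needs logits.
            h = h[last_token_indices]
        return lm_head(h)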

@kris1025
Copy link
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #14918 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #14893 [ run ] completed with state ABORTED

@kris1025
Copy link
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #14931 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #14918 [ run ] completed with state ABORTED

@tensorrt-cicd
Copy link
Collaborator

PR_Github #14931 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11272 completed with status: 'SUCCESS'

@kris1025
Copy link
Collaborator Author

@mikeiovine The pipeline has passed. Could you please review the PR again? Thanks.

@mikeiovine mikeiovine merged commit 4aed7a7 into NVIDIA:main Aug 14, 2025
5 checks passed
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Aug 17, 2025
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Aug 17, 2025
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Aug 17, 2025
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Aug 17, 2025
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Aug 18, 2025
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Aug 18, 2025
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Aug 18, 2025