
Conversation

@sunnygqq (Collaborator) commented Aug 18, 2025

Summary by CodeRabbit

  • New Features
    • MTP Eagle decoding now supports both one-model and two-model modes.
    • Configurable Eagle capture layers and relaxed acceptance controls (top-k, delta).
    • MTP draft decoding enabled with automatic routing; draft prompts drop first token in Eagle/MTP Eagle modes.
  • Refactor
    • Weight-loading modularized for more reliable loading, dequantization, and tensor-parallel support.
    • Dynamic capture-layer handling to reduce memory and scale with configuration.
  • Chores
    • Updated decoding-mode and config wiring to enable new variants.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
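
For example, a typical invocation combining the options above might look like this (values are illustrative):

```
/bot run --disable-fail-fast --gpu-type "A30, H100_PCIe"
```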

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

@coderabbitai bot commented Aug 18, 2025

📝 Walkthrough

Adds MTP Eagle variants and one-model vs two-model speculative decoding, introduces MTP draft models and a dedicated DeepseekV3 weight loader, makes Eagle3 capture layers dynamic, extends decoding configs and speculative-mode predicates, and updates auto-model routing and executor draft overrides.

Changes

Cohort / File(s) and Summary:

  • Example setup (Quickstart): examples/llm-api/quickstart_advanced.py
    Adjusts MTP/Eagle branching: configures EagleDecodingConfig for two-model flow, augments MTPDecodingConfig fields for one-model flow, and updates printed messaging.
  • Decoding config / CLI API: tensorrt_llm/llmapi/llm_args.py
    Adds eagle3_layers_to_capture, is_mtp_eagle, num_capture_layers to EagleDecodingConfig; adds mtp_eagle_one_model to MTPDecodingConfig; updates from_dict and spec_dec_mode logic.
  • Speculative interface & mode predicates: tensorrt_llm/_torch/speculative/interface.py
    Adds MTP_EAGLE_ONE_MODEL; replaces is_mtp with is_mtp_one_model; adds is_mtp_eagle_one_model and SpecMetadata.is_layer_capture; updates helper predicates.
  • Speculative utilities & resource wiring: tensorrt_llm/_torch/speculative/utils.py
    Refactors mode checks for one-model vs two-model MTP/Eagle, propagates layers_to_capture/is_mtp_eagle, expands Eagle3ResourceManager signature, and updates resource/worker selection and extra-KV logic.
  • Eagle3 metadata & resource manager: tensorrt_llm/_torch/speculative/eagle3.py
    Makes capture-layer count dynamic (lazy layers_to_capture), removes fixed num_capture_layers, adds is_mtp_eagle, and sizes hidden-state buffers based on capture count.
  • MTP metadata & flow: tensorrt_llm/_torch/speculative/mtp.py
    Switches checks to one-model predicates, adjusts token-count adjustments and cross-rank token handling for MTP Eagle one-model vs two-model.
  • Speculative drafter: tensorrt_llm/_torch/speculative/model_drafter.py
    Drops the first token in the draft prompt for Eagle3 and MTP Eagle modes.
  • New draft models & speculative modeling: tensorrt_llm/_torch/models/modeling_speculative.py
    Adds MTPDraftModel and MTPDraftModelForCausalLM; updates MTP/Eagle layer-count checks; conditions Eagle3DraftModel FC init; wires draft-model streams and load paths.
  • Auto-model routing: tensorrt_llm/_torch/models/modeling_auto.py
    Remaps arch to MTPDraftModelForCausalLM when DeepseekV3ForCausalLM and max_draft_len == 0 before class lookup.
  • DeepseekV3 weight loading: tensorrt_llm/_torch/models/modeling_deepseekv3.py
    Replaces DeepseekV3MTPHead with DeepseekV3WeightLoader; moves weight-loading, TP-splitting, KV projection reconstruction and dequant into load_weights invoked from the model load path.
  • Executor draft override: tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
    When the draft model arch is DeepseekV3ForCausalLM, sets num_hidden_layers=1 on the draft model pretrained_config before KV cache/weight wiring.

Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  participant User
  participant Quickstart
  participant Config
  participant AutoModel
  participant DraftModel
  participant TargetModel
  Note over Quickstart,Config: build decoding config (Eagle/MTP + one-model flags)

  User->>Quickstart: setup_llm(args)
  Quickstart->>Config: construct Eagle/MTP config (is_mtp_eagle, mtp_eagle_one_model, layers_to_capture)
  Quickstart->>AutoModel: create draft model (spec mode)
  AutoModel->>AutoModel: remap arch (DeepseekV3 & max_draft_len==0 -> MTPDraftModelForCausalLM)
  AutoModel-->>DraftModel: instantiate draft model
  Quickstart->>TargetModel: instantiate target model
  DraftModel->>DraftModel: load_weights (DeepseekV3WeightLoader.load_weights)
  DraftModel->>TargetModel: optionally wire shared embeddings / lm_head
  User->>TargetModel: generate(request with spec_metadata)
  TargetModel->>DraftModel: request draft logits/hidden states
  DraftModel-->>TargetModel: draft outputs
  TargetModel-->>User: final tokens
```

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

Suggested reviewers

  • syuoni
  • nv-guomingz
  • hlu1
  • 2ez4bz


@sunnygqq changed the title from "TRTLLM-6746]Enable two-model spec dec for MTP Eagle" to "[TRTLLM-6746]Enable two-model spec dec for MTP Eagle" on Aug 18, 2025
@coderabbitai bot left a comment

Actionable comments posted: 10

🔭 Outside diff range comments (8)
tensorrt_llm/_torch/speculative/mtp.py (3)

1-1: Add NVIDIA copyright header.

Per coding guidelines, prepend the NVIDIA copyright header (current year) to all source files.

Apply at top of file:

+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.

70-71: Scalar copy_ misuse — use zero_/fill_ instead.

copy_ expects a Tensor, not a Python scalar. This will error at runtime on some PyTorch versions.

Use zero_ or fill_ for in-place scalar writes:

-                    self.mtp_relaxed_delta_pool[slot_id].copy_(
-                        0, non_blocking=True)
+                    self.mtp_relaxed_delta_pool[slot_id].zero_()
-            self.mtp_relaxed_delta_pool[free_slot_id].copy_(0,
-                                                            non_blocking=True)
+            self.mtp_relaxed_delta_pool[free_slot_id].zero_()

Also applies to: 79-81


1168-1196: Draft loop calls first MTP layer repeatedly (likely bug).

Inside the loop over self.mtp_num_modules, you always invoke draft_model.mtp_layers[0] rather than indexing by i. This will ignore all but the first layer and produce incorrect tokens/hidden states when num_nextn_predict_layers > 1.

Use the ith layer:

-            if i == 0:
-                hidden_states = draft_model.mtp_layers[0](
+            if i == 0:
+                hidden_states = draft_model.mtp_layers[i](
                     embed_tokens=draft_model.embed_tokens,
                     all_rank_num_tokens=spec_metadata.all_rank_num_tokens,
                     all_rank_max_num_tokens=spec_metadata.
                     all_rank_max_num_tokens,
                     **inputs)
...
-            else:
-                hidden_states = draft_model.mtp_layers[0](
+            else:
+                hidden_states = draft_model.mtp_layers[i](
                     embed_tokens=draft_model.embed_tokens,
                     all_rank_num_tokens=spec_metadata.
                     subseq_all_rank_num_tokens,
                     all_rank_max_num_tokens=max(
                         spec_metadata.subseq_all_rank_num_tokens)
                     if spec_metadata.subseq_all_rank_num_tokens is not None else
                     None,
                     **inputs)
examples/llm-api/quickstart_advanced.py (1)

1-1: Add NVIDIA copyright header.

Per coding guidelines, prepend the NVIDIA copyright header (current year) to all source files.

+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
tensorrt_llm/_torch/speculative/eagle3.py (1)

1-1: Add NVIDIA copyright header.

Per coding guidelines, prepend the NVIDIA copyright header (current year) to all source files.

+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
tensorrt_llm/_torch/speculative/utils.py (1)

1-1: Add NVIDIA copyright header.

Per coding guidelines, prepend the NVIDIA copyright header (current year) to all source files.

+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
tensorrt_llm/llmapi/llm_args.py (2)

1-1: Add NVIDIA copyright header.

Per coding guidelines, prepend the NVIDIA copyright header (current year) to all source files.

+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.

1676-1691: EagleDecodingConfig validate: reinforces need for draft_model_dir in quickstart.

Given speculative_model_dir is required for Eagle-based paths, the quickstart must pass --draft_model_dir in two-model MTP Eagle.

See related comment in quickstart_advanced.py with a concrete diff to fix this wiring mistake.

🧹 Nitpick comments (15)
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1)

264-266: Avoid post-construction mutation of num_hidden_layers; guard against MTPDraftModel remap

Mutating pretrained_config.num_hidden_layers after the draft model is already instantiated risks mismatched assumptions (e.g., KV cache sizing, layer-indexed logic) and will also trigger even when AutoModel remapped to MTPDraftModelForCausalLM (architectures remains "DeepseekV3ForCausalLM"). Add guards so this only applies to an actual DeepseekV3ForCausalLM draft engine. Also handle missing/empty architectures gracefully.

Apply this minimal guard:

-            if draft_model_engine.model.model_config.pretrained_config.architectures[
-                    0] == "DeepseekV3ForCausalLM":
-                draft_model_engine.model.model_config.pretrained_config.num_hidden_layers = 1
+            archs = getattr(
+                draft_model_engine.model.model_config.pretrained_config,
+                "architectures", None)
+            model_arch = archs[0] if archs and len(archs) > 0 else None
+            # Only force to 1 layer when the instantiated class is DeepseekV3ForCausalLM
+            if (model_arch == "DeepseekV3ForCausalLM"
+                    and draft_model_engine.model.__class__.__name__
+                    == "DeepseekV3ForCausalLM"):
+                draft_model_engine.model.model_config.pretrained_config.num_hidden_layers = 1

If the intent was to influence KV cache allocation only, consider moving this into a draft-engine construction-time knob (or pass through PyTorchModelEngine is_draft_model logic) rather than mutating config after instantiation.

Also, per repository guidelines, prepend the NVIDIA copyright header to this file.

# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
tensorrt_llm/_torch/speculative/interface.py (1)

99-109: Verify SM gating in extend_ctx; consider >=100 instead of ==100

extend_ctx disables capture for TrtllmAttention only when get_sm_version() == 100. If future Blackwell revisions report >100, this may degrade unexpectedly. Consider using >= 100 unless there’s a known incompatibility.

Would you like me to adjust the predicate to >= 100?

tensorrt_llm/_torch/models/modeling_deepseekv3.py (4)

132-141: New DeepseekV3WeightLoader improves modularity; add a brief docstring

The separation of weight-loading concerns into a dedicated class is a solid move. Add a short docstring to clarify scope and expected inputs per guidelines.

Example:

class DeepseekV3WeightLoader:
    """TP-aware weight loader for DeepseekV3 variants.

    Args:
        model: The target model instance (e.g., DeepseekV3ForCausalLM or draft variant).
        is_draft_model: Whether weights are being loaded for the draft path.
    """

Also, add the NVIDIA header to this file in addition to preserving upstream license:

# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.

514-535: Wrap long line; comply with Ruff E501 (<=120 chars)

predicted_tokens_per_seq assignment exceeds 120 chars.

-        predicted_tokens_per_seq = model_config.spec_config.max_draft_len + 1 if model_config.spec_config is not None else 1
+        spec_cfg = model_config.spec_config
+        predicted_tokens_per_seq = (spec_cfg.max_draft_len + 1) if spec_cfg is not None else 1

566-574: Prefer torch.sigmoid over torch.nn.functional.sigmoid (deprecation)

torch.nn.functional.sigmoid is deprecated; use torch.sigmoid for clarity and future-proofing.

-        scores = F.sigmoid(logits)
+        scores = torch.sigmoid(logits)

659-667: Ruff E501: wrap long call to DeepseekV3Gate constructor

A few of the argument lines exceed 120 chars. Wrap to satisfy E501.

For example, break routed_scaling_factor line:

-            routed_scaling_factor=config.routed_scaling_factor,
+            routed_scaling_factor=config.routed_scaling_factor,

and ensure each argument is on its own line with hanging indentation (PEP8).

If you want, I can run through the file and submit a style-only patch for the flagged lines.

tensorrt_llm/_torch/speculative/model_drafter.py (1)

32-35: Add NVIDIA copyright header to file

Per repo guidelines, prepend the header.

# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
tensorrt_llm/_torch/speculative/mtp.py (2)

495-497: None check order in boolean guard.

Accessing attn_metadata.is_cuda_graph before verifying attn_metadata is not None risks AttributeError. Reverse the condition order.

-        if attn_metadata.is_cuda_graph and attn_metadata is not None:
+        if attn_metadata is not None and attn_metadata.is_cuda_graph:

292-296: Minor performance nit: avoid clone() when not needed.

Shifting input buffers uses clone() unnecessarily. copy_ achieves the same semantics with less overhead.

-            input_ids[:-1] = input_ids[1:].clone()
+            input_ids[:-1].copy_(input_ids[1:])
...
-            draft_hidden_states[:-1] = draft_hidden_states[1:].clone()
+            draft_hidden_states[:-1].copy_(draft_hidden_states[1:])
tensorrt_llm/_torch/speculative/eagle3.py (1)

95-105: Default layer-capture pattern: “last-4” could be fragile.

Using (1, mid-1, last-4) is a notable change. For shallow models (<=5 layers) you error out; for 6–7 layers, mid-1 and last-4 may collide or yield very early layers. Please confirm this is intentional and measured. Consider clamping and deduping, or fall back to capturing the last layer when spacing becomes too tight.

If intended, add a short inline comment explaining the rationale for last-4 vs last-1.
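
If clamping and deduping is the direction taken, a rough standalone sketch (illustrative only, not the current eagle3.py code) could look like this:

```python
def default_layers_to_capture(num_hidden_layers: int) -> tuple:
    """Return sorted, de-duplicated default capture layers.

    Mirrors the (1, mid-1, last-4) default described above, but clamps the
    indices into range and falls back to the last layer when the model is
    too shallow for the usual spacing.
    """
    if num_hidden_layers < 1:
        raise ValueError("Model must have at least one layer to capture from.")
    candidates = [1, num_hidden_layers // 2 - 1, num_hidden_layers - 4]
    clamped = []
    for idx in candidates:
        # Clamp into the valid range and drop duplicates while preserving order.
        idx = min(max(idx, 0), num_hidden_layers - 1)
        if idx not in clamped:
            clamped.append(idx)
    # If the spacing collapsed (very shallow model), just capture the last layer.
    if len(clamped) < 2:
        clamped = [num_hidden_layers - 1]
    return tuple(sorted(clamped))
```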

tensorrt_llm/_torch/models/modeling_speculative.py (5)

382-385: Use LongTensor for token indices in type hints.

Token IDs and position IDs are conventionally torch.LongTensor (int64). The current hints use torch.IntTensor in some places.

Apply this diff:

-        input_ids: torch.IntTensor,
-        position_ids: torch.IntTensor,
+        input_ids: torch.LongTensor,
+        position_ids: torch.LongTensor,

And:

-                input_ids: torch.IntTensor = None,
-                position_ids: torch.IntTensor = None,
+                input_ids: torch.LongTensor = None,
+                position_ids: torch.LongTensor = None,

Also applies to: 437-439


370-375: Align Embedding instantiation with TP mapping (optional).

Elsewhere embeddings are usually constructed with mapping/tensor-parallel settings and gathered outputs. Consider matching that here to avoid surprises in TP runs.

Apply this diff:

-        self.embed_tokens = Embedding(
-            self.config.vocab_size,
-            self.config.hidden_size,
-            dtype=self.config.torch_dtype,
-        )
+        self.embed_tokens = Embedding(
+            self.config.vocab_size,
+            self.config.hidden_size,
+            dtype=self.config.torch_dtype,
+            mapping=model_config.mapping,
+            tensor_parallel_mode=TensorParallelMode.COLUMN,
+            gather_output=True,
+        )

431-433: Embedding sharing likely never happens; decide whether to always share or always own.

self.model.embed_tokens is created in MTPDraftModel.__init__, so this if ... is None branch will never run. If the intent is to share embeddings with the target model to save memory and keep weights in sync, set it unconditionally; otherwise, remove the dead branch.

Proposed change to share:

-        if self.model.embed_tokens is None:
-            self.model.embed_tokens = target_model.model.embed_tokens
+        # Share embeddings with target model to save memory and keep weights in sync.
+        self.model.embed_tokens = target_model.model.embed_tokens

Please confirm whether the draft model should own separate embeddings (loaded via weight loader) or share with the target model.


353-359: Optional: Add concise docstring to new public class.

Add a short, Google-style docstring to clarify the per-layer draft behavior and expected inputs.

Example:

 class MTPDraftModel(nn.Module):
 
     def __init__(self, model_config: ModelConfig[PretrainedConfig],
                  layer_idx: int, aux_stream_dict: Dict[AuxStreamType,
                                                        torch.cuda.Stream]):
         super().__init__()
+        """
+        Per-layer MTP draft block used in the two-model MTP-Eagle flow.
+        
+        Args:
+            model_config: Global model configuration.
+            layer_idx: Index of the draft layer relative to the target stack.
+            aux_stream_dict: CUDA aux streams for attention/MoE overlap.
+        """

403-411: Optional: Add docstring to the public MTPDraftModelForCausalLM wrapper.

Clarify the two-stream setup and how weights are loaded.

Example:

 @register_auto_model("MTPDraftModelForCausalLM")
 class MTPDraftModelForCausalLM(DecoderModelForCausalLM[MTPDraftModel,
                                                        PretrainedConfig]):
 
     def __init__(self, model_config: ModelConfig[PretrainedConfig]):
-        self.model_config = model_config
+        """
+        Causal LM wrapper around a per-layer MTP draft model for two-model
+        MTP-Eagle speculative decoding.
+
+        Sets up CUDA aux streams and delegates weight loading to the DeepseekV3 loader.
+        """
+        self.model_config = model_config
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 1ce2354 and 9ae0936.

📒 Files selected for processing (11)
  • examples/llm-api/quickstart_advanced.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_auto.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py (9 hunks)
  • tensorrt_llm/_torch/models/modeling_speculative.py (5 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1 hunks)
  • tensorrt_llm/_torch/speculative/eagle3.py (4 hunks)
  • tensorrt_llm/_torch/speculative/interface.py (6 hunks)
  • tensorrt_llm/_torch/speculative/model_drafter.py (1 hunks)
  • tensorrt_llm/_torch/speculative/mtp.py (3 hunks)
  • tensorrt_llm/_torch/speculative/utils.py (10 hunks)
  • tensorrt_llm/llmapi/llm_args.py (6 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/_torch/speculative/model_drafter.py
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
  • tensorrt_llm/_torch/speculative/mtp.py
  • tensorrt_llm/_torch/models/modeling_auto.py
  • examples/llm-api/quickstart_advanced.py
  • tensorrt_llm/_torch/speculative/interface.py
  • tensorrt_llm/llmapi/llm_args.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
  • tensorrt_llm/_torch/models/modeling_speculative.py
  • tensorrt_llm/_torch/speculative/eagle3.py
  • tensorrt_llm/_torch/speculative/utils.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/_torch/speculative/model_drafter.py
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
  • tensorrt_llm/_torch/speculative/mtp.py
  • tensorrt_llm/_torch/models/modeling_auto.py
  • examples/llm-api/quickstart_advanced.py
  • tensorrt_llm/_torch/speculative/interface.py
  • tensorrt_llm/llmapi/llm_args.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
  • tensorrt_llm/_torch/models/modeling_speculative.py
  • tensorrt_llm/_torch/speculative/eagle3.py
  • tensorrt_llm/_torch/speculative/utils.py
🧠 Learnings (1)
📚 Learning: 2025-08-14T06:36:40.681Z
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.681Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.

Applied to files:

  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/models/modeling_deepseekv3.py

514-514: Line too long (124 > 120)

(E501)


659-659: Line too long (122 > 120)

(E501)


797-797: Line too long (125 > 120)

(E501)


803-803: Line too long (136 > 120)

(E501)

tensorrt_llm/_torch/models/modeling_speculative.py

463-463: Comparison to None should be cond is not None

Replace with cond is not None

(E711)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (24)
tensorrt_llm/_torch/speculative/interface.py (3)

24-29: Helper predicates for one-model MTP look good

is_mtp_one_model and is_mtp_eagle_one_model are clear and simplify downstream checks. No issues.


39-41: use_one_engine correctly covers MTP one-model

The updated predicate now includes MTP one-model alongside EAGLE3 one-model, which is consistent with the intended architecture split.


196-202: Default SpecMetadata.is_layer_capture stub is fine

Providing an overridable hook with a safe default is appropriate.

tensorrt_llm/_torch/models/modeling_deepseekv3.py (1)

1496-1499: Weight loader integration looks correct

Switching to DeepseekV3WeightLoader.load_weights simplifies the class and centralizes logic. LGTM.

tensorrt_llm/_torch/speculative/model_drafter.py (1)

32-35: Drop-first-token logic extended to MTP_EAGLE (two-model) — LGTM

Aligns with EAGLE3 behavior and the broader MTP Eagle flow. Safe for short prompts (returns empty tensor when len==1).
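
To make the short-prompt behavior concrete, here is a tiny standalone illustration (hypothetical helper name; the real logic lives in model_drafter.py):

```python
import torch

def draft_prompt(prompt_ids: torch.Tensor, drop_first_token: bool) -> torch.Tensor:
    """Return the prompt the draft model sees, optionally without its first token."""
    return prompt_ids[1:] if drop_first_token else prompt_ids

# A single-token prompt yields an empty tensor rather than raising.
print(draft_prompt(torch.tensor([42]), drop_first_token=True).shape)  # torch.Size([0])
```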

tensorrt_llm/_torch/speculative/mtp.py (3)

154-156: One-model Eagle gating: confirm behavior.

Setting subseq_all_rank_num_tokens only for is_mtp_eagle_one_model looks intentional for the MTPEagle one-model path. Please confirm that two-model Eagle does not rely on subseq_all_rank_num_tokens here (it should be using Eagle3 metadata instead). If not, add the appropriate branch.


171-173: Double-check num_tokens adjustment gating.

MTP vanilla uses total max_draft_len tokens, while MTP Eagle’s first step uses max_draft_len+1 and subsequent steps 1. The change to gate on not is_mtp_eagle_one_model() seems correct for the one-model Eagle path. Please verify that the two-model Eagle path won’t traverse this code (i.e., uses Eagle3 metadata exclusively).


181-202: Vanilla-only hidden-state updates gating looks correct.

Updating hidden-state pointers only under is_mtp_one_model() aligns with excluding the MTP Eagle one-model path, which has its own flow.

examples/llm-api/quickstart_advanced.py (1)

186-193: One-model MTP path wiring looks consistent.

Using MTPDecodingConfig with num_nextn_predict_layers and mtp_eagle_one_model mirrors the new spec_dec_mode predicates.
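
For reference, the one-model wiring could look roughly like the following sketch; the field names come from this review, the values are placeholders, and other quickstart arguments are omitted:

```python
from tensorrt_llm.llmapi.llm_args import MTPDecodingConfig

spec_config = MTPDecodingConfig(
    num_nextn_predict_layers=3,  # example value: number of MTP modules per step
    mtp_eagle_one_model=True,    # select the one-model MTP Eagle flow
)
```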

tensorrt_llm/_torch/speculative/eagle3.py (2)

39-41: Capacity of hidden_states now depends on config.num_capture_layers — verify alignment.

Resource allocation uses EagleDecodingConfig.num_capture_layers, while metadata can derive layers_to_capture dynamically. Ensure config.num_capture_layers and metadata.layers_to_capture length always match, or document the assumption. Mismatch can cause OOB writes during capture.
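
A small guard like the following sketch (hypothetical helper; attribute names follow the walkthrough) would make the assumption explicit:

```python
def check_capture_alignment(config, spec_metadata) -> None:
    """Fail fast if the allocated buffer count and the captured layer count diverge."""
    expected = config.num_capture_layers
    actual = len(spec_metadata.layers_to_capture)
    assert expected == actual, (
        f"num_capture_layers ({expected}) must match len(layers_to_capture) "
        f"({actual}); a mismatch risks out-of-bounds writes into the "
        "pre-allocated hidden-state buffers.")
```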


187-219: Eagle3OneModelSpecMetadata mirrors capture logic — good parity.

The mirrored lazy init and allocation for one-model Eagle3 aligns with the two-model path.

tensorrt_llm/_torch/speculative/utils.py (6)

22-30: One-model MTP metadata unification looks correct.

Routing both vanilla MTP and MTP Eagle one-model through MTPSpecMetadata via is_mtp_one_model() simplifies the stack.


75-85: Resource manager for MTP Eagle one-model only when relaxed acceptance is enabled.

This conditional avoids unnecessary buffers. Please confirm MTPEagleWorker never expects mtp_* pools when use_relaxed_acceptance_for_thinking=False. Otherwise, allocate conditionally per usage site.


92-101: Two-model Eagle and MTP Eagle share Eagle3ResourceManager — assert message updated.

The assertion covers both modes. Good consistency.


111-118: Decoder selection via is_mtp_one_model() is fine if it includes Eagle one-model.

Ensure SpeculativeDecodingMode.is_mtp_one_model() returns True for both vanilla MTP and MTP Eagle one-model. If not, add an explicit is_mtp_eagle_one_model() branch to return MTPSampler.
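
For clarity, this is the relationship being asked about, shown as a reduced, illustrative enum rather than the actual interface.py contents:

```python
from enum import Enum, auto

class SpeculativeDecodingMode(Enum):
    MTP = auto()
    MTP_EAGLE = auto()
    MTP_EAGLE_ONE_MODEL = auto()
    EAGLE3 = auto()
    EAGLE3_ONE_MODEL = auto()

    def is_mtp_eagle_one_model(self) -> bool:
        return self is SpeculativeDecodingMode.MTP_EAGLE_ONE_MODEL

    def is_mtp_one_model(self) -> bool:
        # The property this comment asks to confirm: True for vanilla MTP and
        # for the one-model MTP Eagle variant, so both route to MTPSampler.
        return self is SpeculativeDecodingMode.MTP or self.is_mtp_eagle_one_model()
```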


162-170: Worker selection split is clear and correct.

MTP Vanilla → MTPWorker; MTP Eagle one-model → MTPEagleWorker; Eagle3 one-model → Eagle3OneModelWorker.


186-191: MTP one-model max_draft_len reconciliation is necessary — good.

Ensures low-level APIs expecting max_draft_len receive a consistent value derived from num_nextn_predict_layers.

tensorrt_llm/llmapi/llm_args.py (4)

445-452: num_capture_layers logic: confirm alignment with eagle3.py defaults.

If eagle3_layers_to_capture is unset, you return 1 for MTP Eagle and 3 otherwise. This should match eagle3.py’s computed layers_to_capture length. Please verify consistency to avoid allocation/usage mismatches.
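
Written out as a standalone sketch (illustrative; the real logic is a property on EagleDecodingConfig), the described fallback is:

```python
from typing import Optional, Tuple

def num_capture_layers(eagle3_layers_to_capture: Optional[Tuple[int, ...]],
                       is_mtp_eagle: bool) -> int:
    """Number of hidden-state capture layers the Eagle3 buffers are sized for."""
    if eagle3_layers_to_capture is not None:
        return len(eagle3_layers_to_capture)
    # Defaults per the review: one capture layer for MTP Eagle, three otherwise.
    return 1 if is_mtp_eagle else 3
```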


457-461: Spec mode mapping for MTP Eagle is correct.

EAGLE3_ONE_MODEL takes priority; otherwise MTP_EAGLE if flagged.


542-547: from_dict sets max_draft_len — good for downstream APIs.

Keeps legacy code that reads max_draft_len working.


557-561: MTP_EAGLE vs MTP_EAGLE_ONE_MODEL selection looks right.

Using num_nextn_predict_layers_from_model_config and mtp_eagle_one_model disambiguates both modes cleanly.

tensorrt_llm/_torch/models/modeling_speculative.py (3)

22-22: LGTM: AuxStreamType import is appropriate.

Import looks correct and aligns with the new aux-stream wiring.


340-343: LGTM: MTP layer count computation is consistent with the new predicates.

Using is_mtp_one_model() and is_mtp_eagle_one_model() reads clearly, and the fallback to num_nextn_predict_layers makes sense.


468-474: LGTM: New MTP-Eagle branch in get_draft_model.

Adding the is_mtp_eagle() path to return MTPDraftModelForCausalLM is consistent with enabling the two-model spec-dec for MTP Eagle.

@svc-trtllm-gh-bot added the "Community want to contribute" label (PRs initiated from Community) on Aug 18, 2025
@mikeiovine self-requested a review on August 18, 2025 14:38
@mikeiovine (Collaborator) left a comment

Have not reviewed fully. Will come back to it once #6210 is merged; looks like this one has a lot of overlap with it.

@sunnygqq changed the title from "[TRTLLM-6746]Enable two-model spec dec for MTP Eagle" to "[TRTLLM-6746][feat]Enable two-model spec dec for MTP Eagle" on Aug 20, 2025
@coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
examples/llm-api/quickstart_advanced.py (1)

173-179: Two-model MTP Eagle must use draft_model_dir (not model_dir) and guard when missing

EagleDecodingConfig.validate requires a draft model path. Passing model_dir here will resolve to the target model path and likely break draft loading for two-model MTP Eagle.

Apply:

-        if not args.use_one_model:
-            print("Running MTP eagle with two model style.")
-            spec_config = EagleDecodingConfig(
-                max_draft_len=args.spec_decode_max_draft_len,
-                speculative_model_dir=args.model_dir,
-                eagle3_one_model=args.use_one_model,
-                is_mtp_eagle=True)
+        if not args.use_one_model:
+            print("Running MTP eagle with two model style.")
+            if args.draft_model_dir is None:
+                raise ValueError("For two-model MTP Eagle, --draft_model_dir must be provided.")
+            spec_config = EagleDecodingConfig(
+                max_draft_len=args.spec_decode_max_draft_len,
+                speculative_model_dir=args.draft_model_dir,
+                eagle3_one_model=args.use_one_model,
+                is_mtp_eagle=True)
🧹 Nitpick comments (7)
tensorrt_llm/_torch/models/modeling_auto.py (2)

23-24: Good guard on spec_config; consider intent for EAGLE3 and wrap long line

  • The None-check on spec_config prevents AttributeError. Looks good.
  • Verify intent: this remap won’t trigger when draft_vocab_size is present (EAGLE3 path) since model_arch is prefixed to "EAGLE3...". If you also want the remap for EAGLE3 checkpoints, include "EAGLE3DeepseekV3ForCausalLM" in the condition or move the remap above the EAGLE3 prefix step.

Also fix Ruff E501 (126 > 120). Example:

-        if model_arch == "DeepseekV3ForCausalLM" and config.spec_config is not None and config.spec_config.max_draft_len == 0:
+        if (model_arch == "DeepseekV3ForCausalLM"
+                and config.spec_config is not None
+                and config.spec_config.max_draft_len == 0):
             model_arch = "MTPDraftModelForCausalLM"

1-1: Prepend NVIDIA copyright header

Per repo guidelines, prepend the 2025 NVIDIA header.

+ # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
 from typing import Generic
examples/llm-api/quickstart_advanced.py (2)

174-174: Prefer logger over print in library/examples

Minor: use the project logger to keep output consistent and suppressible.

-            print("Running MTP eagle with two model style.")
+            from tensorrt_llm.logger import logger
+            logger.info("Running MTP eagle with two model style.")

1-1: Prepend NVIDIA copyright header

Per repo guidelines, add the 2025 header.

+ # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
 import argparse
tensorrt_llm/llmapi/llm_args.py (3)

428-447: Eagle capture layer controls and derived num_capture_layers are reasonable

  • Tuple[int, ...] for eagle3_layers_to_capture is fine.
  • num_capture_layers fallback to 1 for MTP Eagle and 3 otherwise makes sense.

Optional: add a validator to ensure elements are non-negative and strictly increasing when provided.
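
One possible shape for that check, written as a standalone helper rather than tied to the actual config class:

```python
from typing import Optional, Tuple

def validate_layers_to_capture(
        layers: Optional[Tuple[int, ...]]) -> Optional[Tuple[int, ...]]:
    """Require non-negative, strictly increasing capture-layer indices."""
    if layers is None:
        return None
    if any(idx < 0 for idx in layers):
        raise ValueError(
            f"eagle3_layers_to_capture must be non-negative, got {layers}")
    if any(b <= a for a, b in zip(layers, layers[1:])):
        raise ValueError(
            f"eagle3_layers_to_capture must be strictly increasing, got {layers}")
    return layers
```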


540-542: from_dict sets max_draft_len: OK; minor duplication elsewhere

Setting max_draft_len = num_nextn_predict_layers here is fine. Note it’s also set again in BaseLlmArgs.validate_speculative_config (Lines 1701-1704), so you could deduplicate in one place later.


1-1: Prepend NVIDIA copyright header

Per repo guidelines, prepend the 2025 NVIDIA header.

+ # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
 import copy
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 9ae0936 and 830a4a2.

📒 Files selected for processing (3)
  • examples/llm-api/quickstart_advanced.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_auto.py (1 hunks)
  • tensorrt_llm/llmapi/llm_args.py (6 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/llmapi/llm_args.py
  • examples/llm-api/quickstart_advanced.py
  • tensorrt_llm/_torch/models/modeling_auto.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/llmapi/llm_args.py
  • examples/llm-api/quickstart_advanced.py
  • tensorrt_llm/_torch/models/modeling_auto.py
🧬 Code Graph Analysis (3)
tensorrt_llm/llmapi/llm_args.py (1)
tensorrt_llm/_torch/speculative/interface.py (2)
  • SpeculativeDecodingMode (12-120)
  • is_mtp_eagle (33-34)
examples/llm-api/quickstart_advanced.py (2)
tensorrt_llm/llmapi/llm_args.py (5)
  • EagleDecodingConfig (411-449)
  • speculative_model_dir (1347-1348)
  • model_dir (1104-1106)
  • model_dir (1109-1113)
  • MTPDecodingConfig (510-548)
tensorrt_llm/_torch/speculative/interface.py (1)
  • is_mtp_eagle (33-34)
tensorrt_llm/_torch/models/modeling_auto.py (1)
tensorrt_llm/llmapi/llm_utils.py (1)
  • model_arch (92-94)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/models/modeling_auto.py

23-23: Line too long (126 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (4)
tensorrt_llm/llmapi/llm_args.py (4)

12-13: Typing import updates look correct

Importing Tuple/Type/TypeAlias is appropriate for new annotations.


454-456: Extended spec_dec_mode mapping for MTP Eagle

The new branch returning MTP_EAGLE when is_mtp_eagle is set (and not one-model) aligns with the new SpeculativeDecodingMode predicates.


525-525: New flag for one-model MTP Eagle

mtp_eagle_one_model adds the necessary distinction for one-model vs two-model handling. No concerns.


553-556: MTP spec_dec_mode branching aligns with new modes

Branching on num_nextn_predict_layers_from_model_config and mtp_eagle_one_model to return MTP_EAGLE_ONE_MODEL / MTP_EAGLE looks consistent with interface predicates.

@sunnygqq (Collaborator, Author):

/bot run

@tensorrt-cicd (Collaborator):

PR_Github #15924 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator):

PR_Github #15924 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11968 completed with status: 'FAILURE'

@sunnygqq force-pushed the qgai/deepseekeagle2models branch from 830a4a2 to c3e3985 on September 1, 2025 03:09
@sunnygqq (Collaborator, Author) commented Sep 1, 2025

/bot run

1 similar comment
@sunnygqq (Collaborator, Author) commented Sep 1, 2025

/bot run

@sunnygqq force-pushed the qgai/deepseekeagle2models branch from 1898a10 to 364d801 on September 1, 2025 10:02
@sunnygqq force-pushed the qgai/deepseekeagle2models branch from dad1051 to 1e93f53 on September 17, 2025 02:19
@sunnygqq (Collaborator, Author):

/bot run

@sunnygqq force-pushed the qgai/deepseekeagle2models branch from 1e93f53 to e1f5058 on September 17, 2025 02:41
@sunnygqq (Collaborator, Author):

/bot run

@tensorrt-cicd (Collaborator):

PR_Github #18867 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator):

PR_Github #18867 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14142 completed with status: 'FAILURE'

@sunnygqq force-pushed the qgai/deepseekeagle2models branch from dfa8bc0 to 4d54c27 on September 17, 2025 06:19
@sunnygqq (Collaborator, Author):

/bot run

@tensorrt-cicd (Collaborator):

PR_Github #18907 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator):

PR_Github #18907 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14174 completed with status: 'FAILURE'

@sunnygqq force-pushed the qgai/deepseekeagle2models branch from 4d54c27 to f8f48ce on September 17, 2025 09:37
@sunnygqq (Collaborator, Author):

/bot run

@tensorrt-cicd (Collaborator):

PR_Github #18967 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator):

PR_Github #18967 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14218 completed with status: 'FAILURE'

@sunnygqq force-pushed the qgai/deepseekeagle2models branch from f8f48ce to 0716a6b on September 18, 2025 02:17
@sunnygqq (Collaborator, Author):

/bot run

@tensorrt-cicd (Collaborator):

PR_Github #19086 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator):

PR_Github #19086 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14318 completed with status: 'FAILURE'

@sunnygqq (Collaborator, Author):

/bot run

@tensorrt-cicd (Collaborator):

PR_Github #19146 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator):

PR_Github #19146 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14366 completed with status: 'FAILURE'

@sunnygqq (Collaborator, Author):

/bot run

@tensorrt-cicd (Collaborator):

PR_Github #19173 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator):

PR_Github #19173 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14389 completed with status: 'SUCCESS'

@mikeiovine merged commit 80dd8fe into NVIDIA:main on Sep 18, 2025
5 checks passed
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Sep 19, 2025
Wong4j pushed a commit to Wong4j/TensorRT-LLM that referenced this pull request Sep 20, 2025
MrGeva pushed a commit to nv-auto-deploy/TensorRT-LLM that referenced this pull request Sep 21, 2025