[TRTLLM-7412][feat] Turn off spec decode when the rolling average acceptance length drops below threshold. #7283
base: main
Conversation
📝 Walkthrough
Adds rolling acceptance-based speculative-decoding controls: new DecodingBaseConfig fields, a SpeculationGate class (in two module locations), integration into PyTorchModelEngine to track a permanent-disable state, executor changes to consult the Drafter under resource constraints and to disable speculation via the gate, and unit-test updates for the new Drafter signature and logic.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor Client
    participant PyExec as PyExecutor
    participant Engine as PyTorchModelEngine
    participant Drafter as Drafter
    Client->>PyExec: _prepare_and_schedule_batch(active_requests)
    alt Engine.speculation_permanently_disabled == True
        PyExec->>Engine: enable_spec_decode = False
    else
        PyExec->>Drafter: should_use_spec_decode(active_requests, max_batch_size, max_num_tokens, max_draft_len)
        Drafter-->>PyExec: decision (True/False)
        PyExec->>Engine: enable_spec_decode = decision
        opt decision == True
            PyExec->>Drafter: _prepare_draft_requests(...)
        end
    end
    PyExec-->>Client: scheduled batch
```
```mermaid
sequenceDiagram
    autonumber
    participant PyExec as PyExecutor
    participant Engine as PyTorchModelEngine
    participant Gate as SpeculationGate
    PyExec->>PyExec: _handle_responses(request_done)
    alt Engine.is_spec_decode && not Engine.speculation_permanently_disabled && Gate exists
        PyExec->>Gate: record_avg_decoded(avg_decoded_tokens_per_iter, request_id)
        Gate-->>PyExec: (disabled_now, avg_accept)
        alt disabled_now == True
            PyExec->>Engine: enable_spec_decode = False
            PyExec->>Engine: speculation_permanently_disabled = True
        end
    end
    PyExec-->>PyExec: continue routing/termination
```
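A minimal Python sketch of the flow in the two diagrams above. The attribute and method names (speculation_permanently_disabled, enable_spec_decode, should_use_spec_decode, record_avg_decoded, avg_decoded_tokens_per_iter) come from this PR; the surrounding classes and helpers are simplified stand-ins, not the actual executor code.

```python
from typing import List, Optional, Tuple


class SpeculationGateSketch:
    """Rolling-average acceptance gate; see speculation_gate.py in this PR for the real class."""

    def __init__(self, window: int, threshold: float) -> None:
        self.window = window
        self.threshold = threshold
        self.history: List[float] = []
        self.disabled = False

    def record_avg_decoded(self, avg_decoded: Optional[float],
                           request_id: Optional[int] = None) -> Tuple[bool, Optional[float]]:
        """Returns (disabled_now, rolling_avg); disabled_now is True only on the disabling call."""
        if self.disabled or avg_decoded is None:
            return False, None
        # avg_decoded counts the target token too, so accepted draft tokens ~= avg_decoded - 1
        self.history.append(max(0.0, avg_decoded - 1.0))
        if len(self.history) < self.window:
            return False, None  # still warming up
        avg = sum(self.history[-self.window:]) / self.window
        if avg < self.threshold:
            self.disabled = True
            return True, avg
        return False, avg


def prepare_batch(engine, drafter, active_requests, max_batch_size, max_num_tokens):
    """First diagram: decide whether this iteration should draft at all."""
    if engine.speculation_permanently_disabled:
        engine.enable_spec_decode = False
        return
    engine.enable_spec_decode = drafter.should_use_spec_decode(
        active_requests, max_batch_size, max_num_tokens, engine.max_draft_len)


def handle_response(engine, gate: SpeculationGateSketch, request) -> None:
    """Second diagram: feed each finished request into the gate."""
    disabled_now, _ = gate.record_avg_decoded(
        getattr(request, "avg_decoded_tokens_per_iter", None),
        request_id=getattr(request, "py_request_id", None))
    if disabled_now:
        engine.speculation_permanently_disabled = True
        engine.enable_spec_decode = False
```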
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
tensorrt_llm/llmapi/llm_args.py (1)
1-1: Add NVIDIA copyright header (2025). Repository guidelines require the NVIDIA header on all source files.

tensorrt_llm/_torch/speculative/drafter.py (1)
1-1: Add NVIDIA copyright header (2025).

tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (1)
1-1: Add NVIDIA copyright header (2025).

tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
1-1: Add NVIDIA copyright header (2025).

tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
1-1: Add NVIDIA copyright header (2025).

For each file, apply this diff at the top (replace with the repo-standard header if different):

```diff
+# Copyright (c) 2025, NVIDIA CORPORATION.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
```
🧹 Nitpick comments (5)
tensorrt_llm/llmapi/llm_args.py (1)
356-365
: Validate acceptance gating knobs ("both-or-none", positive values).
Prevent silent misconfigurations: require both fields together and enforce acceptance_window > 0 and acceptance_threshold > 0.
Add a validator to DecodingBaseConfig:

```python
@model_validator(mode="after")
def validate_acceptance_gating(self):
    if (self.acceptance_window is None) ^ (self.acceptance_threshold is None):
        raise ValueError("Both acceptance_window and acceptance_threshold must be set together.")
    if self.acceptance_window is not None:
        if self.acceptance_window <= 0:
            raise ValueError("acceptance_window must be > 0")
        if self.acceptance_threshold <= 0:
            raise ValueError("acceptance_threshold must be > 0")
    return self
```

tensorrt_llm/_torch/speculative/drafter.py (1)
30-55
: Solid resource-aware gating; add a guard for non-positive max_concurrency.
The logic looks good. Treat max_concurrency <= 0 as "always off" explicitly.
Apply this diff:

```diff
     if self.max_concurrency is None:
         return True
+    if isinstance(self.max_concurrency, int) and self.max_concurrency <= 0:
+        return False
```

tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (1)
91-145
: Good coverage for the new budgeting logic; consider parametrize to reduce repetition. The scenarios are clear; you can condense them with pytest.mark.parametrize for readability, as sketched below.
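A possible parametrized shape for these scenarios. The `drafter` and `make_requests` fixtures and the specific budget values below are illustrative placeholders, not the actual test contents.

```python
import pytest


@pytest.mark.parametrize(
    "num_requests, max_batch_size, max_num_tokens, max_draft_len, expected",
    [
        (1, 8, 4096, 4, True),    # small batch, plenty of token budget
        (8, 8, 4096, 4, True),    # at the concurrency limit
        (16, 8, 4096, 4, False),  # over the batch-size budget
        (8, 8, 16, 4, False),     # token budget too small for drafts
    ],
)
def test_should_use_spec_decode_budget(drafter, make_requests, num_requests,
                                       max_batch_size, max_num_tokens,
                                       max_draft_len, expected):
    # `drafter` and `make_requests` are hypothetical fixtures standing in for
    # the Drafter instance and request construction used in the real test file.
    requests = make_requests(num_requests)
    assert drafter.should_use_spec_decode(
        requests, max_batch_size, max_num_tokens, max_draft_len) is expected
```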
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
295-308
: Instantiate the gate only for meaningful values; warn on partial config.
Avoid creating a gate for zero/invalid thresholds and surface partial configs.
Apply this diff:

```diff
-        if self.acceptance_window and self.acceptance_threshold is not None:
-            self.speculation_gate = SpeculationGate(self.acceptance_window,
-                                                    self.acceptance_threshold)
+        if (self.acceptance_window or 0) > 0 and (self.acceptance_threshold or 0) > 0:
+            self.speculation_gate = SpeculationGate(
+                self.acceptance_window, self.acceptance_threshold
+            )
+        elif (self.acceptance_window is not None) ^ (self.acceptance_threshold is not None):
+            logger.warning(
+                "SpeculationGate requires both acceptance_window (>0) and "
+                "acceptance_threshold (>0); partial config will be ignored."
+            )
```

tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
1688-1711
: Make gating update immediate; fix long lines; guard None avg.
- Set self.use_spec_decode = False when permanently disabling to take effect next iteration without recompute.
- Wrap long log lines (Ruff E501).
- Skip gate update when avg_decoded is None.
Apply this diff:
```diff
-            logger.info(
-                f"[PyExecutor] _handle_responses: request_done={request_done}, request.py_request_id={request.py_request_id}"
-            )
+            logger.info(
+                "[PyExecutor] _handle_responses: "
+                f"request_done={request_done}, "
+                f"request.py_request_id={request.py_request_id}"
+            )
             try:
                 if self.model_engine.is_spec_decode and not self.model_engine.speculation_permanently_disabled:
-                    logger.info(
-                        f"[PyExecutor] _handle_responses: self.model_engine.is_spec_decode={self.model_engine.is_spec_decode}, self.model_engine.speculation_permanently_disabled={self.model_engine.speculation_permanently_disabled}"
-                    )
+                    logger.info(
+                        "[PyExecutor] _handle_responses: "
+                        f"is_spec_decode={self.model_engine.is_spec_decode}, "
+                        f"permanently_disabled={self.model_engine.speculation_permanently_disabled}"
+                    )
                     if self.model_engine.speculation_gate is not None:
-                        avg_decoded = getattr(
-                            request, 'avg_decoded_tokens_per_iter', None)
-                        disabled_now, _ = self.model_engine.speculation_gate.record_avg_decoded(
-                            avg_decoded,
-                            request_id=getattr(request, 'py_request_id', None))
-                        if disabled_now:
+                        avg_decoded = getattr(request, 'avg_decoded_tokens_per_iter', None)
+                        if avg_decoded is not None:
+                            disabled_now, _ = self.model_engine.speculation_gate.record_avg_decoded(
+                                avg_decoded, request_id=getattr(request, 'py_request_id', None)
+                            )
+                        else:
+                            disabled_now = False
+                        if disabled_now:
                             self.model_engine.speculation_permanently_disabled = True
                             self.model_engine.enable_spec_decode = False
+                            # Ensure executor stops drafting immediately on subsequent loop
+                            self.use_spec_decode = False
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (5)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (2 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (2 hunks)
- tensorrt_llm/_torch/speculative/drafter.py (1 hunks)
- tensorrt_llm/llmapi/llm_args.py (1 hunks)
- tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures
Files:
tensorrt_llm/_torch/pyexecutor/model_engine.py
tensorrt_llm/_torch/speculative/drafter.py
tensorrt_llm/llmapi/llm_args.py
tensorrt_llm/_torch/pyexecutor/py_executor.py
tests/unittest/_torch/speculative/test_dynamic_spec_decode.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
tensorrt_llm/_torch/pyexecutor/model_engine.py
tensorrt_llm/_torch/speculative/drafter.py
tensorrt_llm/llmapi/llm_args.py
tensorrt_llm/_torch/pyexecutor/py_executor.py
tests/unittest/_torch/speculative/test_dynamic_spec_decode.py
🧠 Learnings (1)
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.
Applied to files:
tensorrt_llm/_torch/pyexecutor/py_executor.py
🧬 Code graph analysis (3)
tensorrt_llm/_torch/speculative/drafter.py (1)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (1): LlmRequest (282-424)

tensorrt_llm/_torch/pyexecutor/py_executor.py (3)
- tensorrt_llm/_torch/speculative/drafter.py (1): should_use_spec_decode (30-54)
- tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (1): enable_spec_decode (52-53)
- tensorrt_llm/logger.py (1): warning (131-132)

tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (1)
- tensorrt_llm/_torch/speculative/drafter.py (3): Drafter (9-54), prepare_draft_tokens (16-27), should_use_spec_decode (30-54)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/py_executor.py
1690-1690: Line too long (129 > 120)
(E501)
1695-1695: Line too long (235 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (3)
tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (1)
54-65
: Mock shape aligns with new signature. LGTM.
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
65-65
: Import of SpeculationGate looks right.
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
887-897
: Respect permanent-disable first; resource-aware Drafter call is correct.
Actionable comments posted: 5
🧹 Nitpick comments (7)
tensorrt_llm/_torch/pyexecutor/speculation_gate.py (7)
4-4
: Import style: prefer module namespace for logger (repo guideline). To preserve module namespaces, consider importing the module and accessing the member via it (e.g., `import tensorrt_llm.logger as trt_logger; trt_logger.logger.info(...)`). If this diverges from existing local convention, feel free to ignore.
8-11
: Docstring formatting and content (Google style). The docstring uses tabs and lacks Args/Behavior details; propose a concise Google-style docstring.
Apply this diff:

```diff
 class SpeculationGate:
-    """
-    Tracks rolling average of accepted draft tokens per iteration over the last N completed requests.
-    Permanently disables speculation when average falls below a threshold.
-    """
+    """
+    Tracks a rolling average of accepted draft tokens per iteration across the last N completed requests.
+
+    When the rolling average falls below a threshold, speculative decoding is permanently disabled
+    until `reset()` is called.
+    """
```
30-38
: Method docstring: clarify contract and return values. Make the API crystal-clear for call sites and tests.
Apply this diff:

```diff
-        """
-        Record a completed request's avg_decoded_tokens_per_iter.
-        Returns (disabled_now, current_avg_accept) where disabled_now is True only when the call causes disable.
-        """
+        """
+        Record the per-request average decoded tokens per iteration.
+
+        Args:
+            avg_decoded_tokens_per_iter: Average tokens decoded per iteration for the request.
+                Interpreted as accepted_len = max(0, value - 1). None or invalid values are treated as 0.
+            request_id: Optional request identifier for logging.
+
+        Returns:
+            (disabled_now, current_avg_accept)
+            disabled_now: True only on the call that causes permanent disable.
+            current_avg_accept: The rolling average once at least `window` samples have been observed;
+                otherwise None.
+        """
```
44-47
: Redundant None/<=0 checks for window/threshold. The constructor enforces valid values, so these early returns are dead code. Remove them for clarity, or make the ctor accept Optional and keep the checks.
Apply this diff to remove them:

```diff
-        if self.window is None or self.threshold is None:
-            return False, None
-        if self.window <= 0:
-            return False, None
```
30-91
: Potential data race if called from multiple threads. If record_avg_decoded can be invoked concurrently (e.g., multiple response-handling threads), updates to the deque/sum/counters need a lock.
Apply this diff if multi-threaded:

```diff
+import threading
@@ def __init__(self, window: int, threshold: float):
         self.disabled = False
+        self._lock = threading.Lock()
@@ def record_avg_decoded(
             self,
             avg_decoded_tokens_per_iter: Optional[float],
             *,
             request_id: Optional[int] = None) -> Tuple[bool, Optional[float]]:
-        logger.debug("[SpeculationGate] record_avg_decoded avg=%s req_id=%s",
-                     avg_decoded_tokens_per_iter, request_id)
-        if self.disabled:
-            return False, None
+        logger.debug("[SpeculationGate] record_avg_decoded avg=%s req_id=%s",
+                     avg_decoded_tokens_per_iter, request_id)
+        with self._lock:
+            if self.disabled:
+                return False, None
@@
-        self.acceptance_history.append(accepted_len)
-        self.acceptance_sum += accepted_len
+        with self._lock:
+            self.acceptance_history.append(accepted_len)
+            self.acceptance_sum += accepted_len
             if len(self.acceptance_history) > self.window:
                 removed = self.acceptance_history.popleft()
                 self.acceptance_sum -= removed
@@
-        self.num_completed_for_acceptance += 1
+        with self._lock:
+            self.num_completed_for_acceptance += 1
@@
-        if self.num_completed_for_acceptance >= self.window:
-            avg_accept = self.acceptance_sum / len(self.acceptance_history)
+        with self._lock:
+            if self.num_completed_for_acceptance >= self.window:
+                avg_accept = self.acceptance_sum / len(self.acceptance_history)
```
24-29
: Optional: log reset() for traceability. A lightweight DEBUG log helps correlate state transitions.
Apply this diff:

```diff
     def reset(self) -> None:
         self.acceptance_history.clear()
         self.acceptance_sum = 0.0
         self.num_completed_for_acceptance = 0
         self.disabled = False
+        logger.debug("[SpeculationGate] reset() called; state cleared and re-enabled")
```
1-92
: Unit tests to add (happy path + edge cases). Recommend adding tests to pin behavior: warmup gate, disable trigger, skip/invalid inputs, reset(), and idempotence after disable.
- window=3, threshold=0.5: feed [1.0, 1.2, 1.1] → accepted lengths 0.0/0.2/0.1, rolling avg ≈ 0.1 < 0.5, so speculation should be disabled; verify disabled_now on the 3rd call.
- window=2, threshold=0.0: any inputs → never disable.
- Include None/NaN/negative → treated as 0; ensure no NaN propagation.
- After disable, subsequent calls return (False, None) and don’t mutate history/sum.
- reset() re-enables and clears stats.
I can draft these tests if helpful.
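A rough sketch of the tests listed above, assuming the module path used elsewhere in this review (tensorrt_llm._torch.pyexecutor.speculation_gate) and the record_avg_decoded contract described in the comments; exact return values are per that contract, not verified against the final code.

```python
from tensorrt_llm._torch.pyexecutor.speculation_gate import SpeculationGate


def test_disables_after_warmup_window():
    gate = SpeculationGate(window=3, threshold=0.5)
    # avg_decoded 1.0/1.2/1.1 -> accepted 0.0/0.2/0.1, rolling avg 0.1 < 0.5
    assert gate.record_avg_decoded(1.0) == (False, None)  # still warming up
    assert gate.record_avg_decoded(1.2) == (False, None)
    disabled_now, _ = gate.record_avg_decoded(1.1)
    assert disabled_now is True
    assert gate.disabled is True


def test_zero_threshold_never_disables():
    gate = SpeculationGate(window=2, threshold=0.0)
    for value in (1.0, 1.0, 1.0, 1.0):
        disabled_now, _ = gate.record_avg_decoded(value)
        assert disabled_now is False


def test_disabled_gate_is_inert_until_reset():
    gate = SpeculationGate(window=1, threshold=1.0)
    assert gate.record_avg_decoded(1.0)[0] is True     # disables immediately
    assert gate.record_avg_decoded(10.0) == (False, None)  # no further mutation
    gate.reset()
    assert gate.disabled is False
```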
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
tensorrt_llm/_torch/pyexecutor/speculation_gate.py
(1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures
Files:
tensorrt_llm/_torch/pyexecutor/speculation_gate.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
tensorrt_llm/_torch/pyexecutor/speculation_gate.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/speculation_gate.py
40-40: Line too long (146 > 120)
(E501)
56-56: Line too long (198 > 120)
(E501)
66-66: Line too long (128 > 120)
(E501)
71-71: Line too long (203 > 120)
(E501)
77-77: Line too long (143 > 120)
(E501)
82-82: Line too long (173 > 120)
(E501)
87-87: Line too long (148 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tensorrt_llm/_torch/pyexecutor/speculation_gate.py (2)
69-75
: Confirm counting semantics for None/invalid samples. Currently, requests with a None/invalid avg contribute 0 and count toward the N-sample warmup. Verify this matches the product requirement; alternatively, skip such samples to avoid premature disables.
If you decide to skip them, apply:
```diff
-        self.num_completed_for_acceptance += 1
+        # Count only valid samples toward warmup.
+        if avg_decoded_tokens_per_iter is not None and math.isfinite(float(avg_decoded_tokens_per_iter)) \
+                and float(avg_decoded_tokens_per_iter) >= 0.0:
+            self.num_completed_for_acceptance += 1
+        else:
+            logger.debug("[SpeculationGate] skipping invalid sample for warmup")
-        if self.num_completed_for_acceptance >= self.window:
+        if self.num_completed_for_acceptance >= self.window and len(self.acceptance_history) > 0:
```
74-91
: LGTM on core gating logic. Windowed average, permanent disable, and return contract look correct and align with the PR intent.
Actionable comments posted: 4
🧹 Nitpick comments (8)
tensorrt_llm/_torch/speculative/speculation_gate.py (4)
39-47
: Return-early branches fine; reduce log verbosity to debug. These paths run per request and log at INFO; switch to DEBUG to avoid log flood.

```diff
-        logger.info(
+        logger.debug(
             f"[SpeculationGate] record_avg_decoded called with avg_decoded_tokens_per_iter={avg_decoded_tokens_per_iter}, request_id={request_id}"
         )
```
49-58
: Acceptance metric assumption may not hold across algorithms. accepted_len = max(0, avg_decoded - 1) assumes 1 target token plus the rest accepted. Verify this is consistent for all speculative modes you gate; otherwise, take the actual "accepted" metric from request stats and record that.
I can wire this to a more direct “accepted per iter” field if available.
60-72
: Long log lines >120 cols (ruff E501). Break lines or lower to DEBUG to comply with style.

```diff
-            logger.info(
-                f"[SpeculationGate] Rolling window: removed old value {removed:.3f}, window size={len(self.acceptance_history)}"
-            )
+            logger.debug(
+                "[SpeculationGate] Rolling window: removed old value %.3f, window size=%d",
+                removed, len(self.acceptance_history)
+            )
```
74-91
: Simplify condition; rely on history length instead of a separate counter. num_completed_for_acceptance is redundant; using len(self.acceptance_history) improves clarity and avoids divergence if code changes pop logic later.

```diff
-        self.num_completed_for_acceptance += 1
-        logger.info(
-            f"[SpeculationGate] Rolling stats: completed={self.num_completed_for_acceptance}/{self.window}, current_sum={self.acceptance_sum:.3f}, history={[f'{x:.3f}' for x in self.acceptance_history]}"
-        )
-
-        if self.num_completed_for_acceptance >= self.window:
+        logger.debug(
+            "[SpeculationGate] Rolling stats: completed=%d/%d, current_sum=%.3f",
+            len(self.acceptance_history), self.window, self.acceptance_sum
+        )
+        if len(self.acceptance_history) >= self.window:
```

tests/unittest/_torch/speculative/test_spec_gate.py (3)
15-21
: Guard on model availability; skip gracefully. If models are missing at llm_models_root(), generate a clear skip instead of failing later in model load.

```diff
     models_path = llm_models_root()
+    if not os.path.isdir(models_path):
+        pytest.skip(f"Models path not found: {models_path}")
```
45-52
: This test doesn't exercise gating disablement. You set acceptance_window/threshold but run only 2 prompts (window=3), so the gate never triggers. Add a lightweight unit test for the SpeculationGate logic that doesn't need GPUs.
I can add a new fast test (no CUDA) that feeds synthetic averages to trigger disable and asserts state transitions.
Additional file (new):

```python
# tests/unittest/_torch/speculative/test_speculation_gate_unit.py
import pytest

from tensorrt_llm._torch.speculative.speculation_gate import SpeculationGate


def test_gate_triggers_disable():
    g = SpeculationGate(window=3, threshold=0.6)
    # accepted lens: 0.2, 0.4, 0.5 -> avg 0.366 < 0.6 => disable
    outs = [g.record_avg_decoded(a) for a in (1.2, 1.4, 1.5)]
    assert outs[-1][0] is True
    assert g.disabled is True
```
71-75
: Remove prints in tests. Use assertion messages instead of print noise.

```diff
-    print(f"text_spec: {text_spec}")
-    print(f"text_ref: {text_ref}")
-    # The spec decode algorithm currently guarantees identical results
-    assert text_spec == text_ref
+    assert text_spec == text_ref, f"Mismatch:\n  spec={text_spec}\n  ref={text_ref}"
```

tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
1689-1713
: Shorten long INFO logs (ruff E501). Break long f-strings or switch to structured logging.

```diff
-        logger.info(
-            f"[PyExecutor] _handle_responses: request_done={request_done}, request.py_request_id={request.py_request_id}"
-        )
+        logger.info("[PyExecutor] request_done=%s, req_id=%s",
+                    request_done, request.py_request_id)
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (5)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (2 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (3 hunks)
- tensorrt_llm/_torch/speculative/speculation_gate.py (1 hunks)
- tensorrt_llm/llmapi/llm_args.py (1 hunks)
- tests/unittest/_torch/speculative/test_spec_gate.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tensorrt_llm/_torch/pyexecutor/model_engine.py
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{cpp,cc,cxx,cu,py,h,hpp,hh,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use spaces only; indent 4 spaces
Files:
tests/unittest/_torch/speculative/test_spec_gate.py
tensorrt_llm/llmapi/llm_args.py
tensorrt_llm/_torch/pyexecutor/py_executor.py
tensorrt_llm/_torch/speculative/speculation_gate.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Indent with 4 spaces; do not use tabs
Preserve module namespaces in imports; prefer from package.subpackage import module then module.Symbol
Python file names use snake_case (e.g., some_file.py)
Class names use PascalCase
Function and method names use snake_case
Local variable names use snake_case; if starting with a number, prefix k (e.g., k_99th_percentile)
Global variables use G_ prefix and UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE
Avoid shadowing variables from an outer scope
Initialize all externally visible members of a class in init
For interfaces used outside a file, prefer docstrings; reserve comments for internal code or local interfaces
Use Google-style docstrings for classes and functions; document attributes/variables inline as shown
Avoid reflection when simple, explicit code suffices (e.g., prefer def make_complex(x,y) over locals()/dict tricks)
Catch the narrowest exceptions possible in try/except
For duck-typing try/except, keep try body minimal and use else for main logic
Files:
tests/unittest/_torch/speculative/test_spec_gate.py
tensorrt_llm/llmapi/llm_args.py
tensorrt_llm/_torch/pyexecutor/py_executor.py
tensorrt_llm/_torch/speculative/speculation_gate.py
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
tests/unittest/_torch/speculative/test_spec_gate.py
tensorrt_llm/llmapi/llm_args.py
tensorrt_llm/_torch/pyexecutor/py_executor.py
tensorrt_llm/_torch/speculative/speculation_gate.py
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/unittest/_torch/speculative/test_spec_gate.py
🧬 Code graph analysis (2)
tests/unittest/_torch/speculative/test_spec_gate.py (2)
- tensorrt_llm/llmapi/llm_args.py (4): CudaGraphConfig (106-163), EagleDecodingConfig (455-493), KvCacheConfig (980-1075), speculative_model_dir (1398-1399)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (1): shutdown (353-366)

tensorrt_llm/_torch/pyexecutor/py_executor.py (3)
- tensorrt_llm/_torch/speculative/drafter.py (1): should_use_spec_decode (30-54)
- tensorrt_llm/_torch/speculative/speculation_gate.py (1): record_avg_decoded (30-91)
- tensorrt_llm/logger.py (1): warning (131-132)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/py_executor.py
1691-1691: Line too long (129 > 120)
(E501)
1696-1696: Line too long (235 > 120)
(E501)
tensorrt_llm/_torch/speculative/speculation_gate.py
16-16: Undefined name Deque
(F821)
40-40: Line too long (146 > 120)
(E501)
56-56: Line too long (198 > 120)
(E501)
66-66: Line too long (128 > 120)
(E501)
71-71: Line too long (203 > 120)
(E501)
77-77: Line too long (143 > 120)
(E501)
82-82: Line too long (173 > 120)
(E501)
87-87: Line too long (148 > 120)
(E501)
🔇 Additional comments (3)
tensorrt_llm/llmapi/llm_args.py (1)
369-391
: Validators OK; align semantics with docs
- Window=0 disables feature per comments; this is good. Consider clamping very small thresholds (e.g., negative rejected already) or documenting that 0 means “never disable.”
If you want a warning on the "no-op" case (window==0 with threshold set), I can add one.
tests/unittest/_torch/speculative/test_spec_gate.py (1)
1-14
: Missing NVIDIA copyright header. Add the standard header per repo guidelines.
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
⛔ Skipped due to learnings
Learnt from: CR PR: NVIDIA/TensorRT-LLM#0 File: CODING_GUIDELINES.md:0-0 Timestamp: 2025-08-29T06:18:00.220Z Learning: Applies to **/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py} : Prepend NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Learnt from: moraxu PR: NVIDIA/TensorRT-LLM#6303 File: tests/integration/test_lists/qa/examples_test_list.txt:494-494 Timestamp: 2025-07-28T17:06:08.621Z Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Learnt from: galagam PR: NVIDIA/TensorRT-LLM#6487 File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12 Timestamp: 2025-08-06T13:58:07.506Z Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
1029-1031
: Retain empty list for req.py_draft_tokens
: Downstream code uniformly treats py_draft_tokens as a list (falsy checks, extends, appends), so None isn't expected and would break list operations. Likely an incorrect or invalid review comment.
Generally LGTM, thanks!
73f2065 to a760ddc (Compare)
/bot run --disable-fail-fast
PR_Github #19715 [ run ] triggered by Bot
PR_Github #19715 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #19749 [ run ] triggered by Bot
PR_Github #19749 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #19823 [ run ] triggered by Bot
PR_Github #19823 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #19847 [ run ] triggered by Bot
PR_Github #19847 [ run ] completed with state
/bot run
PR_Github #19884 [ run ] triggered by Bot
PR_Github #19884 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #19988 [ run ] triggered by Bot
PR_Github #19988 [ run ] completed with state
… test in test_dynamic_spec_decode(patch is not called at all). Signed-off-by: Zheyu Fu <[email protected]>
… threshold. Signed-off-by: Zheyu Fu <[email protected]>
Signed-off-by: Zheyu Fu <[email protected]>
…r. Also clean. Signed-off-by: Zheyu Fu <[email protected]>
Signed-off-by: Zheyu Fu <[email protected]>
Signed-off-by: Zheyu Fu <[email protected]>
Signed-off-by: Zheyu Fu <[email protected]>
/bot run --disable-fail-fast
PR_Github #20124 [ run ] triggered by Bot
PR_Github #20124 [ run ] completed with state
LGTM on the llmapi changes.
Description
This pull request (PR) depends on PR#7511 and should be merged after it.
Feature requested from Microsoft. Keep a rolling average of the acceptance length over the last N completed requests (N is specified via the DecodingConfig). Permanently turn off speculative decoding when the rolling average drops below a user-specified threshold. The gate only kicks in after at least N requests have completed, since the average fluctuates heavily at the beginning.
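For illustration, the new knobs might be wired up through the LLM API roughly as follows. acceptance_window and acceptance_threshold are the DecodingBaseConfig fields added in this PR; the model paths, draft length, and threshold values are placeholders.

```python
# Illustrative only: exact values and paths are placeholders, not recommendations.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import EagleDecodingConfig

spec_config = EagleDecodingConfig(
    max_draft_len=3,
    speculative_model_dir="<path-to-eagle-draft-model>",
    # Permanently disable speculation once the rolling average acceptance
    # length over the last 16 finished requests drops below 1.5.
    acceptance_window=16,
    acceptance_threshold=1.5,
)

llm = LLM(model="<path-to-target-model>", speculative_config=spec_config)
```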
Test Coverage
Added unit tests in test_spec_gate.py, which contains an end-to-end test and several functional tests covering only the SpeculationGate class.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run
/bot [-h|--help]
to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id
(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test
(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast
(OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test
(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx"
(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe"
(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp"
(OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test
(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test
(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test
(OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge
(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"
(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log
(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug
(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill
kill
Kill all running builds associated with pull request.

skip
skip --comment COMMENT
Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.