@dominicshanshan (Collaborator) commented Aug 22, 2025

Summary by CodeRabbit

  • New Features

    • Multimodal quickstart: multi-turn conversation mode, new CLI flags, and a new TensorRT-engine quickstart example with an accompanying script.
  • Documentation

    • Added legacy TensorRT quickstart; expanded Gemma3 VLM guidance; release-branch gating moved to commented/documentation‑only entries.
  • Improvements

    • Per-instance CUDA workspace/stream handling, KV‑cache error messaging, FP8 weight/scale compatibility, and multimodal preprocessing/API refinements.
  • Tests

    • Expanded FP8 coverage, MIG benchmarking, multimodal multi-turn and 2-GPU E2E tests, a new LLM API example test, and many test-list/skip updates.
  • Chores

    • Parameterized the Triton build tag and build script, reduced the Docker build context by excluding safetensors files, and canonicalized the Makefile cache path.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. This ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Only [pytorch, cpp, tensorrt, triton] are supported. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the --stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
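
For example, an illustrative invocation that reuses artifacts from the last pipeline, disables fail-fast, and adds one post-merge stage (stage name taken from the examples above):

/bot run --reuse-test --disable-fail-fast --extra-stage "H100_PCIe-TensorRT-Post-Merge-1"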

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since skipping tests without careful validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since reusing a pipeline without careful validation can break the top of tree.

coderabbitai bot (Contributor) commented Aug 22, 2025

📝 Walkthrough

Walkthrough

Removed active release-branch gating from CODEOWNERS (kept as commented block), added TensorRT-engine quickstart docs and example, made MoE workspace per-instance and stream-aware, parameterized Triton build with TRITON_BASE_TAG, refactored multiple models/quantization and multimodal paths, replaced multiprocessing with thread pools for weight prefetch, and applied broad test/CI manifest updates.

Changes

Cohort / File(s) Summary
Governance
/.github/CODEOWNERS
Removed active release-branch gating block and appended it as a commented trailing block; deleted per-path API gating entries for tests/unittest/api_stability/*.
CUDA MoE workspace
cpp/tensorrt_llm/thop/moeOp.cpp
Added per-instance WorkspaceInfo workspace_info member; changed getWorkspaceInfo(...) to accept cudaStream_t stream and return WorkspaceInfo const&; allocation is persisted, stream- and CUDA-graph-aware; call sites bind const-ref and pass stream.
Triton build / Docker
docker/Dockerfile.multi, triton_backend/inflight_batcher_llm/scripts/build.sh, .dockerignore, docker/Makefile
Added TRITON_BASE_TAG ARG and pass derived short tag to build script (-s); build script gains -s handling, DIRNAME/realpath build dir resolution, updated cmake args and install step; .dockerignore excludes examples/**/*.safetensors; Makefile canonicalizes USER_CACHE_DIR.
Docs / Quickstart
docs/source/index.rst, docs/source/legacy/tensorrt_quickstart.md
Added hidden toctree entry and a new legacy TensorRT quickstart page that literal-includes the quickstart example.
Examples — TensorRT engine & runtime
examples/llm-api/_tensorrt_engine/quickstart_example.py, examples/llm-api/quickstart_example.py, examples/llm-api/llm_runtime.py
New TensorRT-engine quickstart example; moved LLM import to tensorrt_llm._tensorrt_engine, added BuildConfig usage in example, and updated KvCacheConfig free_gpu_memory_fraction arguments.
Examples — Multimodal multiturn
examples/llm-api/quickstart_multimodal.py
Added --multiturn and --conversation_turns CLI flags and implemented a multi-turn conversation loop (requires prompts+media); single-turn path preserved.
Models — attention / fusion / FP8 scales
tensorrt_llm/_torch/attention_backend/trtllm.py, tensorrt_llm/_torch/modules/linear.py, tensorrt_llm/_torch/models/modeling_llama.py
Improved KV-cache assertion messaging; FP8 block-scale loaders updated to handle 4D singleton-dimension scales; adjusted decoder-layer post-attention fusion/all-reduce logic and NVFP4/FP4 tuple handling.
Models — Mistral / batching
tensorrt_llm/_torch/models/modeling_mistral.py, tests/unittest/_torch/modeling/test_modeling_mistral.py
Exposed batch_pixel_values (was private), added _get_sub_model_config(..., **changes) override support, changed batching/padding to use image_sizes, and added unit tests for batching behavior.
Models — Gemma3 & Phi4‑MM
tensorrt_llm/_torch/models/modeling_gemma3vl.py, tensorrt_llm/_torch/models/modeling_phi4mm.py, examples/models/core/multimodal/README.md
Removed internal weight-loading helper; added Gemma3 documentation (duplicated block); refactored Phi4‑MM to a unified generation-ready multimodal encoder path, added NewPreTrainedModel, MM_TOKEN_IDS, and changed Phi4MM input-processor signature to accept model_path.
Checkpoints / weight loader
tensorrt_llm/_torch/models/checkpoints/hf/weight_loader.py
Replaced multiprocessing.Pool with concurrent.futures.ThreadPoolExecutor in prefetch_files; renamed max_processes to max_workers and switched to executor.map (a minimal sketch of this pattern appears after the Changes table).
Integration tests — accuracy refs & tests
tests/integration/defs/accuracy/references/*.yaml, tests/integration/defs/accuracy/test_cli_flow.py, tests/integration/defs/accuracy/test_disaggregated_serving.py, tests/integration/defs/accuracy/test_llm_api_pytorch.py
Added/updated FP8 and kv_cache FP8 accuracy variants in YAMLs; renamed/added FP8 tests; moved many class-level skip/gating decorators to per-test, and added device/memory gating markers.
Integration tests — llmapi examples & perf config
tests/integration/defs/llmapi/test_llm_examples.py, tests/integration/defs/perf/pytorch_model_config.py
Added test running the TensorRT-engine quickstart example; changed kv_cache_config from KvCacheConfig instance to a plain dict and set dtype via key.
Integration tests — e2e / bench runner
tests/integration/defs/test_e2e.py, tests/integration/test_lists/**/*
BenchRunner gains concurrency & num_requests, __call__/run_bench return parsed metrics via new parse_benchmark_output; added MIG bench and multimodal multiturn/2‑GPU tests; updated test lists and manifests.
Test waivers & lists
tests/integration/test_lists/waives.txt, tests/integration/test_lists/*
Removed and added multiple SKIP entries, updated bug IDs, and adjusted QA/test manifests to include FP8 and multimodal tests.
Unit tests — structural & fixtures
tests/unittest/llmapi/apps/_test_openai_chat_json.py, tests/unittest/llmapi/apps/_test_openai_chat_structural_tag.py, tests/unittest/_torch/speculative/test_eagle3.py
Simplified extra options fixture, changed fixture signatures, added tool fixtures and schema-based validation in structural-tag test, and added unconditional skip for an Eagle3 unit test.
Integration test harness — registration
tests/integration/defs/llmapi/test_llm_examples.py, tests/integration/test_lists/test-db/*.yml
Registered new TensorRT-engine example test in sanity and l0 test lists.
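
A minimal sketch of the thread-pool prefetch pattern referenced in the weight-loader row above; the names and worker heuristic here are illustrative, not the exact implementation:

import os
from concurrent.futures import ThreadPoolExecutor

def _prefetch_one_file(file_name: str) -> None:
    # Read the file once so the OS page cache is warm before weight loading.
    if os.path.exists(file_name):
        with open(file_name, "rb") as f:
            f.read()

def prefetch_files(local_file_names) -> None:
    # Prefetch is I/O-bound, so threads suffice and avoid process-spawn overhead.
    if not local_file_names:
        return
    max_workers = min((os.cpu_count() or 1) * 2, 16, len(local_file_names))
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map is lazy; list() forces all reads to complete before returning.
        list(executor.map(_prefetch_one_file, local_file_names))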

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Caller
  participant FusedMoeRunner
  participant CUDA as CUDA Runtime

  Caller->>FusedMoeRunner: getWorkspaceInfo(num_rows, hidden_size,..., min_latency_mode, stream)
  activate FusedMoeRunner
  FusedMoeRunner->>CUDA: inspect stream & capture state
  alt capturing or workspace too small
    FusedMoeRunner->>CUDA: allocate/resize workspace on stream
    Note over FusedMoeRunner: persist into workspace_info.workspace<br/>update src_to_dest_map
  end
  FusedMoeRunner-->>Caller: return WorkspaceInfo const&
  deactivate FusedMoeRunner

  Caller->>CUDA: launch kernels using provided stream and workspace buffers
sequenceDiagram
  autonumber
  participant User
  participant Script as quickstart_multimodal.py
  participant Loader as default_multimodal_input_loader
  participant LLM

  User->>Script: run --multiturn --conversation_turns=N
  loop for each turn ≤ N
    Script->>Loader: load inputs(conversation_history, media, device)
    Loader-->>Script: pixel_values, image_sizes, prompts
    Script->>LLM: generate(inputs[, lora_request])
    LLM-->>Script: response
    Script->>Script: append response to conversation_history
  end
  Script-->>User: print recap of prompts and responses
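
A minimal Python sketch of the loop above, with the loader and LLM passed in as parameters (an assumed simplification; the actual wiring in quickstart_multimodal.py differs):

def run_multiturn(llm, loader, sampling_params, prompts, media, turns=2):
    # Illustrative only: folds each response back into the next prompt as plain text.
    history = ""
    results = []
    for _ in range(turns):
        prompt = history + prompts[0]
        inputs = loader(prompts=[prompt], media=media)  # assumed loader signature
        outputs = llm.generate(inputs, sampling_params)
        response = outputs[0].outputs[0].text
        results.append({"user_input": prompt, "assistant_response": response})
        history += prompt + "\n" + response + "\n"
    # Recap, mirroring the diagram's final step.
    for i, item in enumerate(results):
        print(f"[{i}] Prompt: {item['user_input']!r}, Generated text: {item['assistant_response']!r}")
    return results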

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Suggested labels

CI


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 24

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (16)
tests/unittest/llmapi/apps/_test_openai_chat_structural_tag.py (2)

25-37: Temp file naming can collide across workers; generate a unique file.

Hard-coding extra_llm_api_options.yaml in the system temp dir risks cross-test/process collisions (pytest-xdist) and flaky cleanup. Prefer a unique, per-fixture temp file.

Apply this diff:

-@pytest.fixture(scope="module")
-def temp_extra_llm_api_options_file():
-    temp_dir = tempfile.gettempdir()
-    temp_file_path = os.path.join(temp_dir, "extra_llm_api_options.yaml")
+@pytest.fixture(scope="module")
+def temp_extra_llm_api_options_file():
+    fd, temp_file_path = tempfile.mkstemp(prefix="extra_llm_api_options_", suffix=".yaml")
+    os.close(fd)
     try:
         extra_llm_api_options_dict = {"guided_decoding_backend": "xgrammar"}
-
-        with open(temp_file_path, 'w') as f:
-            yaml.dump(extra_llm_api_options_dict, f)
+        with open(temp_file_path, "w") as f:
+            yaml.safe_dump(extra_llm_api_options_dict, f)
 
         yield temp_file_path
     finally:
-        if os.path.exists(temp_file_path):
-            os.remove(temp_file_path)
+        try:
+            os.remove(temp_file_path)
+        except FileNotFoundError:
+            pass

123-146: Fix prompt contradictions and remove unused tool reference.

The prompt instructs “Only call one function at a time,” yet the test asserts both function tags are present in a single reply. Also, “brave_search” isn’t a defined tool here and may distract guided decoding.

Apply this focused replacement within the system_prompt string:

-If a you choose to call a function ONLY reply in the following format:
+If you choose to call a function, ONLY reply in the following format:
@@
-Reminder:
-- Function calls MUST follow the specified format
-- Required parameters MUST be specified
-- Only call one function at a time
-- Put the entire function call reply on one line
-- Always add your sources when using search results to answer the user query
+Reminder:
+- Function calls MUST follow the specified format.
+- Required parameters MUST be specified.
+- You may call multiple functions to fulfill a single request; if calling more than one, put each function call on its own line.
+- Put each function call on a single line and do not include any text outside the <function=...>...</function> tags.

Optionally drop this line earlier in the prompt since it references an undefined tool here:

-- When looking for real time information use relevant functions if available else fallback to brave_search
tests/unittest/llmapi/apps/_test_openai_chat_json.py (3)

1-2: Add NVIDIA copyright header (2025) at file top.

Required by repository guidelines.

Apply this diff at the very top:

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
 # Adapted from
 # https://github.com/vllm-project/vllm/blob/aae6927be06dedbda39c6b0c30f6aa3242b84388/tests/entrypoints/openai/test_chat.py

24-37: Avoid fixed temp filename and drop unused request parameter.

Same collision risk as the other test; also, request isn’t used.

Apply this diff:

-@pytest.fixture(scope="module")
-def temp_extra_llm_api_options_file(request):
-    temp_dir = tempfile.gettempdir()
-    temp_file_path = os.path.join(temp_dir, "extra_llm_api_options.yaml")
+@pytest.fixture(scope="module")
+def temp_extra_llm_api_options_file():
+    fd, temp_file_path = tempfile.mkstemp(prefix="extra_llm_api_options_", suffix=".yaml")
+    os.close(fd)
     try:
         extra_llm_api_options_dict = {"guided_decoding_backend": "xgrammar"}
 
         with open(temp_file_path, "w") as f:
-            yaml.dump(extra_llm_api_options_dict, f)
+            yaml.safe_dump(extra_llm_api_options_dict, f)
 
         yield temp_file_path
     finally:
-        if os.path.exists(temp_file_path):
-            os.remove(temp_file_path)
+        try:
+            os.remove(temp_file_path)
+        except FileNotFoundError:
+            pass

6-6: Fix Python 3.8 typing compatibility (builtins generics).

list[dict[...]] requires Python 3.9+. Tests target 3.8+, so use typing.List/Dict (or add future annotations and still avoid PEP 585).

Apply this diff:

-from typing import Any
+from typing import Any, Dict, List
@@
-    def _create_and_validate_response(
-            messages: list[dict[str, Any]]) -> dict[str, Any]:
+    def _create_and_validate_response(
+            messages: List[Dict[str, Any]]) -> Dict[str, Any]:

Also applies to: 79-81

tests/integration/defs/perf/pytorch_model_config.py (1)

30-31: Replace PEP 585 generic type hints with typing generics for Python 3.8 compatibility

Python 3.8 does not support subscripted built-in generics (e.g. list[str], dict[str, Any], etc.); all of these will raise a SyntaxError when parsing. Our ripgrep scan uncovered hundreds of such occurrences across the repo, so this is a critical, cross-cutting compatibility issue that must be fixed before merging.

• In tests/integration/defs/perf/pytorch_model_config.py (lines 30–31):
– Add from typing import Optional, List near the top of the file.
– Change the signature to wrap lora_dirs in Optional[List[str]] instead of using list[str].

• Apply the same pattern throughout the codebase:
– Replace list[...] with List[...], dict[...] with Dict[...], set[...] with Set[...], and tuple[...] with Tuple[...].
– Wrap any default-None parameters in Optional[...].
– Ensure all required names (List, Dict, Set, Tuple, Optional, etc.) are imported from typing.

• To find all remaining builtin-generic annotations, run:

#!/bin/bash
# Find PEP 585 builtin generics incompatible with Python 3.8
rg -n --type=py -e '\b(list|dict|set|tuple)\['

–––

Example diff for the one function:

--- a/tests/integration/defs/perf/pytorch_model_config.py
+++ b/tests/integration/defs/perf/pytorch_model_config.py
@@
+from typing import Optional, List
 def get_model_yaml_config(model_label: str,
-                          lora_dirs: list[str] = None) -> dict:
+                          lora_dirs: Optional[List[str]] = None) -> dict:
examples/llm-api/llm_runtime.py (1)

1-1: Add NVIDIA SPDX header to comply with repository licensing standards.

All source files should carry the current-year NVIDIA copyright header.

+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# -*- coding: utf-8 -*-
tests/integration/defs/accuracy/test_disaggregated_serving.py (3)

1-4: Add NVIDIA copyright header (2025) per repo guideline.

All source files must include the current-year NVIDIA header. Please prepend the standard header used elsewhere in this repo.

Apply this diff (adjust to your repo’s exact header if different):

+ # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+ # SPDX-License-Identifier: Apache-2.0

53-66: Fix Python 3.8 incompatibilities: PEP 585 generics and cancel_futures kwarg.

  • list[... ] requires Python 3.9+, but our target is 3.8+. Use typing.List.
  • shutdown(cancel_futures=...) is 3.9+. Provide a 3.8 fallback.

Apply this diff:

 class MyThreadPoolExecutor(ThreadPoolExecutor):
 
     def __init__(self, *args, **kwargs) -> None:
         super().__init__(*args, **kwargs)
-        self.futures: list[concurrent.futures.Future[RequestOutput]] = []
+        # Python 3.8-compatible typing (avoid PEP 585 generics)
+        from typing import List
+        self.futures: List[concurrent.futures.Future] = []
 
     def __exit__(self, exc_type, exc_val, exc_tb):
         if exc_type is None:
             for future in self.futures:
                 future.result()
             return super().__exit__(exc_type, exc_val, exc_tb)
 
         for future in self.futures:
             future.cancel()
-        self.shutdown(wait=True, cancel_futures=True)
+        # Python 3.8 fallback for shutdown(cancel_futures=...)
+        try:
+            self.shutdown(wait=True, cancel_futures=True)
+        except TypeError:
+            self.shutdown(wait=True)
         return False

207-217: Avoid potential indefinite hang in health check; add timeout and use time.monotonic().

requests.get() without a timeout can block forever even inside the 1h loop. Also prefer time.monotonic() for elapsed time.

Apply this diff:

-        start_time = time.time()
-        while time.time() - start_time < 3600:
+        start_time = time.monotonic()
+        while time.monotonic() - start_time < 3600:
             time.sleep(1)
             try:
                 print("Checking health endpoint")
-                response = requests.get("http://localhost:8000/health")
+                response = requests.get("http://localhost:8000/health", timeout=5)
                 if response.status_code == 200:
                     break
-            except requests.exceptions.ConnectionError:
+            except (requests.exceptions.ConnectionError, requests.exceptions.Timeout):
                 continue

Optional: rate-limit the "Checking health endpoint" log (e.g., every 10s) to keep CI logs concise.

examples/llm-api/quickstart_multimodal.py (1)

1-4: Prepend the NVIDIA 2025 copyright header.

Per repository guidelines, every Python source must include the current-year NVIDIA header. This file is missing it.

Apply at file top:

+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+
 import argparse
 import json
 import os
.github/CODEOWNERS (1)

158-163: Catch‑all CODEOWNERS rule overrides everything; will require release-branch-approval on all PRs to main

GitHub CODEOWNERS uses last-match-wins. The line with “* @NVIDIA/trt-llm-release-branch-approval” at the end overrides all specific entries above, effectively gating every change (including on main) behind that team. The adjacent comments say this rule should only be uncommented on release branches.

Recommend commenting it out here (main) to avoid blocking merges, and only enable it on release/* branches.

- * @NVIDIA/trt-llm-release-branch-approval
+# * @NVIDIA/trt-llm-release-branch-approval

Alternative: if you truly want a fallback owner while preserving specific entries, place the * rule at the top so later, more-specific patterns take precedence. But per your comment, keeping it commented on main is safest.

tensorrt_llm/bench/benchmark/throughput.py (1)

1-1: Missing NVIDIA copyright header (2025).

Per coding guidelines, prepend the current-year NVIDIA copyright header to all source files.

Apply this diff at the top of the file:

+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
tensorrt_llm/_torch/models/modeling_llama.py (1)

1-1: Missing NVIDIA copyright header (2025).

Per coding guidelines, prepend the current-year NVIDIA copyright header to all source files.

Apply this diff at the top of the file:

+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)

1-1: Missing NVIDIA copyright header (2025).

Per coding guidelines, prepend the current-year NVIDIA copyright header to all source files.

Apply this diff at the top of the file:

+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
tensorrt_llm/_torch/models/modeling_phi4mm.py (1)

45-81: Avoid global monkey-patching of transformers.PreTrainedModel.

Temporarily reassigning transformers.modeling_utils.PreTrainedModel is unsafe under concurrency. Prefer a local subclass wrapper or patch only the target class of the loaded model.

Option A (preferred): Load normally, then dynamically mix in GenerationMixin to the instance (no global patch):

-        OldPreTrainedModel = transformers.modeling_utils.PreTrainedModel
-        transformers.modeling_utils.PreTrainedModel = NewPreTrainedModel
-        ref_phi4mm_model = transformers.AutoModelForCausalLM.from_pretrained(
+        ref_phi4mm_model = transformers.AutoModelForCausalLM.from_pretrained(
             model_path,
             trust_remote_code=True,
             # Flash_attn_2 only supports bf16 or fp16 and set in HF config.
             torch_dtype='auto',
             _attn_implementation='flash_attention_2',
         ).eval()
-        transformers.modeling_utils.PreTrainedModel = OldPreTrainedModel
+        # Ensure prepare_inputs_for_generation is available even if the loaded class didn't
+        if not isinstance(ref_phi4mm_model, transformers.generation.GenerationMixin):
+            ref_phi4mm_model.__class__ = type(
+                "PretrainedWithGen",
+                (ref_phi4mm_model.__class__, transformers.generation.GenerationMixin),
+                {},
+            )

This avoids touching global state.

I can push a follow-up patch applying Option A throughout this block if you’d like.

🧹 Nitpick comments (48)
examples/models/core/multimodal/README.md (5)

356-359: Clarify support scope and fix capitalization ("PyTorch")

The note should explicitly state that Gemma 3 is PyTorch LLM API–only, and use the correct capitalization.

-**NOTE: We only support Gemma3 VLMs in Pytorch workflow.**
+**NOTE: Gemma 3 VLMs are currently supported only via the PyTorch LLM API backend; TensorRT engine build is not supported.**

360-363: Tighten the attention-mask description and wording

Minor clarity/grammar improvements; also use consistent model naming.

-Gemma3VL decoder requires a custom attention mask while processing images. During the context phase:
-- Text tokens attend to other tokens in a causal fashion (standard autoregressive behavior)
-- Image tokens attend to other tokens in a causal fashion AND attend to other tokens from the same image in a bidirectional manner
+The Gemma 3 VLM decoder requires a custom attention mask when processing images. During the context phase:
+- Text tokens attend to prior tokens in a causal fashion (standard autoregressive behavior).
+- Image tokens attend causally to all prior tokens and bidirectionally within the same image.

366-374: Make backend requirement exclusive and add an incompatibility note

  • Make it clear that FlashInfer is the only supported backend for this mask.
  • Add a note that Gemma 3’s “must disable chunked prefill” requirement conflicts with Embedding Table Offloading (which relies on chunking).
-We support this custom mask with FlashInfer attention backend.
+This custom mask is supported only with the FlashInfer attention backend.
@@
-### Requirements
+### Requirements
@@
-To ensure expected behavior with Gemma3VL, the following configurations are **required**:
-- **Attention Backend**: Use the FlashInfer attention backend
-- **Chunked Prefill**: Must be disabled
-- **KV Cache Reuse**: Must be disabled
+To ensure expected behavior with Gemma 3 VLMs, the following configurations are **required**:
+- **Attention Backend**: Use the FlashInfer attention backend (other backends are not supported)
+- **Chunked Prefill**: Must be disabled.
+- **KV Cache Reuse**: Must be disabled.
+
+Note: Embedding Table Offloading depends on context chunking and is therefore not compatible with Gemma 3 at this time. See "Enabling Embedding Table Offloading" below.

420-423: Optional: enumerate exact HF IDs or a link for each supported variant

Listing concrete HF model IDs (e.g., google/gemma-3-4b-it, 12b-it, 27b-it) or linking to the canonical model page will reduce guesswork for users.


15-15: TOC hint: call out PyTorch-only to set expectations early (optional)

Consider signaling the PyTorch-only constraint directly in the TOC entry to prevent users from jumping to TRT build steps by habit.

-- [Gemma3](#gemma3)
+- [Gemma3 (PyTorch only)](#gemma3)
tests/unittest/llmapi/apps/_test_openai_chat_structural_tag.py (1)

44-48: Args formatting differs from sibling test; optional unification.

This test passes CLI flags with “=”, while _test_openai_chat_json.py passes “--flag value”. Both likely work; consider standardizing across tests for consistency.

tests/unittest/llmapi/apps/_test_openai_chat_json.py (2)

41-45: Optional: apply the same resource-limiting flags as the structural_tag test.

Helps avoid OOM on smaller GPUs (A10/A30) and keeps CI deterministic.

Apply this diff:

-    args = ["--extra_llm_api_options", temp_extra_llm_api_options_file]
+    args = [
+        "--max_batch_size=8",
+        "--max_seq_len=1024",
+        "--max_num_tokens=1024",
+        "--extra_llm_api_options", temp_extra_llm_api_options_file
+    ]

41-45: Minor: CLI style consistency.

This test uses ["--flag", value] while the structural_tag test uses "--flag=value". Consider standardizing for readability.

tests/integration/defs/perf/pytorch_model_config.py (2)

205-209: Streamline kv_cache_config merge and avoid redundant reassignment.

Use setdefault to ensure we always mutate the dict inside base_config and drop the extra update. Functionally equivalent and a bit cleaner.

-    kv_cache_config = base_config.get('kv_cache_config', {})
-    if 'kv_cache_dtype' in base_config:
-        kv_cache_dtype = base_config.pop('kv_cache_dtype', 'auto')
-        kv_cache_config['dtype'] = kv_cache_dtype
-        base_config.update({'kv_cache_config': kv_cache_config})
+    kv_cache_config = base_config.setdefault('kv_cache_config', {})
+    if 'kv_cache_dtype' in base_config:
+        kv_cache_config['dtype'] = base_config.pop('kv_cache_dtype', 'auto')

45-49: Make kv_cache_dtype extraction resilient to formatting.

Current split-based parsing is brittle. A small regex avoids index errors and unexpected formats like trailing spaces.

-    if 'kv_cache_dtype' in model_label:
-        base_config.update({
-            'kv_cache_dtype':
-            model_label.split('kv_cache_dtype:')[1].split('-')[0]
-        })
+    if 'kv_cache_dtype:' in model_label:
+        m = re.search(r'kv_cache_dtype:([^-]+)', model_label)
+        if m:
+            base_config['kv_cache_dtype'] = m.group(1).strip()

Add this import at the top of the file if not already present:

import re
examples/llm-api/llm_runtime.py (1)

8-8: Prefer module-namespace imports for llmapi types; also remove redundant enable_block_reuse=True.

  • Importing the module keeps symbols namespaced and reduces potential name collisions.
  • enable_block_reuse defaults to True in KvCacheConfig; passing it explicitly adds noise without effect.
-from tensorrt_llm.llmapi import CudaGraphConfig, KvCacheConfig
+from tensorrt_llm import llmapi
@@
-    cuda_graph_config = CudaGraphConfig(
+    cuda_graph_config = llmapi.CudaGraphConfig(
@@
-        kv_cache_config=KvCacheConfig(free_gpu_memory_fraction=0.5))
+        kv_cache_config=llmapi.KvCacheConfig(free_gpu_memory_fraction=0.5))
@@
-                       kv_cache_config=KvCacheConfig(
-                           free_gpu_memory_fraction=0.5,
-                           enable_block_reuse=True))
+                       kv_cache_config=llmapi.KvCacheConfig(
+                           free_gpu_memory_fraction=0.5))

Also applies to: 22-26, 32-32, 57-60

tests/integration/defs/accuracy/test_disaggregated_serving.py (7)

34-36: Avoid shadowing built-in id; rename to request_id.

Shadowing id is a minor readability smell and can confuse tooling.

Apply this diff:

-class Result(GenerationResultBase):
-
-    def __init__(self, id: int, sampling_params: SamplingParams,
-                 outputs: List[CompletionOutput]):
-        super().__init__(id, sampling_params)
+class Result(GenerationResultBase):
+
+    def __init__(self, request_id: int, sampling_params: SamplingParams,
+                 outputs: List[CompletionOutput]):
+        super().__init__(request_id, sampling_params)
         self._outputs = outputs
         self._streaming = False
-            result = Result(id=0,
+            result = Result(request_id=0,
                             sampling_params=sampling_params,
                             outputs=[
                                 CompletionOutput(text=response.choices[0].text,
                                                  index=0)
                             ])

Also applies to: 252-257


141-148: Guard against GPU index overruns when launch_disaggregated_llm is used directly.

When called outside run_parallel_test, this function can oversubscribe GPUs via CUDA_VISIBLE_DEVICES if gating is missing or misconfigured. Add a quick check vs get_device_count() before launching.

For example:

     ctx_total_gpus = ctx_tp * ctx_pp
     gen_total_gpus = gen_tp * gen_pp
+    # Optional safety guard
+    total_needed = ctx_total_gpus * len(ctx_ports) + gen_total_gpus * len(gen_ports)
+    if total_needed > get_device_count():
+        pytest.skip(f"Not enough devices for disaggregated serving: need {total_needed}, have {get_device_count()}")

Also applies to: 165-169


135-137: Nit: URL port parsing is brittle.

url.split(":")[1] breaks for IPv6 or if a scheme/path sneaks in. Prefer rsplit(":", 1) or urllib.parse. Low priority since tests use localhost:port.

Consider:

-    ctx_ports = [int(url.split(":")[1]) for url in ctx_urls]
-    gen_ports = [int(url.split(":")[1]) for url in gen_urls]
+    ctx_ports = [int(url.rsplit(":", 1)[1]) for url in ctx_urls]
+    gen_ports = [int(url.rsplit(":", 1)[1]) for url in gen_urls]

221-247: Streaming path not implemented; fail fast or implement.

You pass stream=streaming to the OpenAI client, but the downstream handling assumes non-streaming. If streaming=True is ever used, this will likely break. Either implement the streaming consumer or fail fast.

Apply a defensive guard:

         def send_request(prompt: str, sampling_params: SamplingParams,
                          streaming: bool):
             kwargs = {}
+            if streaming:
+                raise NotImplementedError("Streaming completions not yet supported in disaggregated tests.")

If you want streaming support, I can sketch a compatible event-consumer for OpenAI-style completions.


682-687: Remove duplicate @pytest.mark.skip_less_device(8) decorator.

skip_less_device(8) appears twice on test_auto_dtype, which is redundant.

Apply this diff:

 @pytest.mark.skip_less_device(8)
 @parametrize_with_ids("overlap_scheduler", [True, False])
 @parametrize_with_ids("mtp_nextn",
                       [0, pytest.param(2, marks=skip_pre_hopper)])
-@pytest.mark.skip_less_device(8)
 def test_auto_dtype(self, overlap_scheduler, mtp_nextn):

730-736: Prefer a collection-time skip marker over runtime pytest.skip for Gemma3 1B test.

This test is always skipped; using @pytest.mark.skip(reason=...) is cleaner and avoids executing setup code.

Apply this diff:

-    @pytest.mark.skip_less_device(2)
-    @pytest.mark.parametrize("overlap_scheduler", [False, True])
-    def test_auto_dtype(self, overlap_scheduler):
-        pytest.skip(
-            "Currently we require full kvcache for variable sliding window. "
-            "This test only transfers the kvcache inside the sliding window.")
+    @pytest.mark.skip_less_device(2)
+    @pytest.mark.parametrize("overlap_scheduler", [False, True])
+    @pytest.mark.skip(reason="Requires full KV cache for variable sliding window; current test only transfers KV within sliding window.")
+    def test_auto_dtype(self, overlap_scheduler):

2-4: Nit: Fix small typo in file header comment.

“I need to to this” → “I need to do this”.

-# I need to to this by creating a new class that mimics LLM class. Instead of implementing the
+# I need to do this by creating a new class that mimics the LLM class. Instead of implementing the
examples/llm-api/quickstart_multimodal.py (5)

125-134: CLI: Good addition; add lightweight validation for conversation_turns.

Flags look fine. Consider rejecting non-positive values early to avoid silent no-op runs.

 parser.add_argument(
     "--conversation_turns",
     type=int,
     default=2,
-    help="Number of conversation turns for automated testing.")
+    help="Number of conversation turns for automated testing (must be >= 1).")

Or validate after parsing:

 args = parser.parse_args()
+if args.conversation_turns < 1:
+    raise ValueError("--conversation_turns must be >= 1")

230-238: LoRA: remove redundant model_class None check; ensure model_dir type consistency.

model_class is guaranteed to be set earlier when args.load_lora is True. The None-check is dead code and adds confusion. Also prefer consistent str() for model_dir.

-                lora_request = None
-                if args.load_lora:
-                    if model_class is None:
-                        raise ValueError(
-                            "model_class must be provided when load_lora is True"
-                        )
-                    lora_request = model_class.lora_request(
-                        len(inputs), args.modality, llm._hf_model_dir)
+                lora_request = None
+                if args.load_lora:
+                    lora_request = model_class.lora_request(
+                        len(inputs), args.modality, str(llm._hf_model_dir))

240-246: Noise reduction: disable tqdm for per-turn single-request generates.

A progress bar per turn is noisy and slower. Pass use_tqdm=False for this inner loop.

-                outputs = llm.generate(inputs,
-                                       sampling_params,
-                                       lora_request=lora_request)
+                outputs = llm.generate(
+                    inputs,
+                    sampling_params,
+                    use_tqdm=False,
+                    lora_request=lora_request)

256-260: Conversation formatting is not role-aware; consider structured chat turns.

Concatenating text and responses into a single user prompt bypasses chat templates and roles, which can degrade quality or break models relying on role markers. Prefer building a conversation list of messages (user/assistant) and applying the chat template over that structure, or introduce a dedicated multiturn input loader that accepts prior turns.

If you want, I can sketch a minimal helper that:

  • accumulates List[ConversationMessage] with roles,
  • applies apply_chat_template(model_type, tokenizer, processor, conversation=list, add_generation_prompt=True),
  • reuses the same multi_modal_data across turns.
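
A rough sketch of such a helper, assuming apply_chat_template behaves as described in the bullets above (signatures unverified; the template function is passed in to keep the sketch self-contained):

from typing import Callable, Dict, List

def build_multiturn_prompt(apply_chat_template: Callable, model_type, tokenizer,
                           processor, history: List[Dict[str, str]], user_text: str):
    # history holds prior turns as {"role": "user" | "assistant", "content": ...}
    conversation = history + [{"role": "user", "content": user_text}]
    # Assumed call shape, mirroring the bullets above; multi_modal_data would be reused unchanged per turn.
    return apply_chat_template(model_type, tokenizer, processor,
                               conversation=conversation,
                               add_generation_prompt=True)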

267-271: Return the captured generated_outputs for programmatic callers.

You already collect generated_outputs “for return” but discard it. Returning it enables tests to assert per-turn content without parsing stdout.

-        for i, output in enumerate(generated_outputs):
+        for i, output in enumerate(generated_outputs):
             print(
                 f"[{i}] Prompt: {output['user_input']!r}, Generated text: {output['assistant_response']!r}"
             )
-        return
+        return generated_outputs
tests/unittest/_torch/speculative/test_eagle3.py (1)

16-16: Unconditional skip is fine for triage; add a reminder to revert.

Recommend adding a short TODO with the nvbugs ID to make it obvious this should be re-enabled once fixed, or convert to a targeted skipif (e.g., by device/arch) if the failure scope is known.

tensorrt_llm/_torch/models/checkpoints/hf/weight_loader.py (3)

124-127: Thread pool switch is appropriate for I/O-bound prefetch; simplify cpu count and make workers tunable.

Reading files is I/O-bound and releases the GIL; threads are a good choice. Minor cleanups:

  • Drop the multiprocessing import in favor of os.cpu_count().
  • Make max_workers configurable via env to adapt to different storage backends without code changes.

Apply:

-        max_workers = min(multiprocessing.cpu_count() * 2, 16,
-                          len(local_file_names))
-        with ThreadPoolExecutor(max_workers=max_workers) as executor:
+        # Default heuristic: 2x CPUs capped at 16; allow overrides for fast/slow storage.
+        default_workers = min(((os.cpu_count() or 1) * 2), 16, len(local_file_names))
+        max_workers = int(os.environ.get("TLLM_PREFETCH_MAX_WORKERS", default_workers))
+        with ThreadPoolExecutor(max_workers=max_workers) as executor:
             list(executor.map(self._prefetch_one_file, local_file_names))

And remove the unused multiprocessing import (see separate diff on Line 1).


106-112: Harden prefetch against transient filesystem errors.

Between exists() and open(), files can be moved/removed. Swallowing I/O errors here avoids failing an optional optimization step.

Apply:

     def _prefetch_one_file(self, file_name):
-        if os.path.exists(file_name):
-            logger.info(f"Prefetching {file_name} to memory...")
-            with open(file_name, 'rb') as f:
-                f.read()
-            logger.info(f"Finished prefetching {file_name}.")
+        if os.path.exists(file_name):
+            logger.info(f"Prefetching {file_name} to memory...")
+            try:
+                with open(file_name, 'rb') as f:
+                    # Read once to populate page cache; we discard the bytes object.
+                    _ = f.read()
+            except OSError as e:
+                logger.warning(f"Prefetch failed for {file_name}: {e}")
+                return
+            logger.info(f"Finished prefetching {file_name}.")

33-45: Consider per-rank prefetch gating.

enable_prefetch is decided using the total size of all weight files, but each rank only prefetches a subset. Using per-rank size could avoid unnecessarily disabling prefetch on multi-rank jobs with large global checkpoints.

If feasible, compute the prefetch size for local_file_names (the per-rank slice) before deciding to prefetch, or move the memory check into prefetch_files() using that subset. No need to block the PR if this is intended; just a potential perf improvement.

docker/Dockerfile.multi (1)

180-182: Ensure helper script is executable inside the image.

If get_triton_tag.sh is invoked by the build scripts, make sure it has exec perms (Docker preserves mode, but this avoids host-FS surprises).

Apply:

 COPY ./jenkins/scripts/get_triton_tag.sh ./jenkins/scripts/get_triton_tag.sh
 COPY ./docker/Dockerfile.multi ./docker/Dockerfile.multi
+RUN chmod +x ./jenkins/scripts/get_triton_tag.sh
 RUN bash ./triton_backend/inflight_batcher_llm/scripts/build.sh
tensorrt_llm/_torch/attention_backend/trtllm.py (1)

768-776: Make the assertion scalar and reuse the computed max; current message prints a Tensor repr

As written, the f-string embeds a 0-dim Tensor and the assert compares a Tensor to an int. While this often works, it’s cleaner and more readable to convert to Python scalars and compute once.

Apply this diff:

-            error_message = (
-                f"The max KV cache length of input sequences ({self.kv_lens[:self.num_seqs].max()}) "
-                f"exceeds the KV cache manager's maximum supported length "
-                f"({self.kv_cache_manager.max_seq_len}).")
-
-            assert self.kv_lens[:self.num_seqs].max(
-            ) <= self.kv_cache_manager.max_seq_len, error_message
+            max_kv_cache_len = int(self.kv_lens[:self.num_seqs].max().item())
+            error_message = (
+                f"The max KV cache length of input sequences ({max_kv_cache_len}) "
+                f"exceeds the KV cache manager's maximum supported length "
+                f"({self.kv_cache_manager.max_seq_len})."
+            )
+            assert max_kv_cache_len <= self.kv_cache_manager.max_seq_len, error_message
tests/integration/test_lists/waives.txt (3)

243-244: Duplicate SKIP for the same test with different bug IDs

llmapi/test_llm_examples.py::test_llmapi_speculative_decoding_mtp appears twice (Lines 243 and 304) with different NVBugs links. This is confusing and increases maintenance debt.

Pick one bug reference (prefer the newer one if it supersedes the old) and remove the duplicate entry. Example:

-llmapi/test_llm_examples.py::test_llmapi_speculative_decoding_mtp SKIP (https://nvbugs/5410399)
 llmapi/test_llm_examples.py::test_llmapi_speculative_decoding_mtp SKIP (https://nvbugs/5461796)

Also applies to: 304-304


266-267: Duplicate SKIP for llava v1.6 mistral multimodal quickstart

The two entries target the same test variant. Keep one and, if helpful, include both bug links in a single comment.

For example:

-examples/test_multimodal.py::test_llm_multimodal_general[llava-v1.6-mistral-7b-llava-v1.6-mistral-7b-hf-image-False] SKIP (https://nvbugs/5444095)
+examples/test_multimodal.py::test_llm_multimodal_general[llava-v1.6-mistral-7b-llava-v1.6-mistral-7b-hf-image-False] SKIP (https://nvbugs/5444060, https://nvbugs/5444095)

And remove the other duplicate line.

Also applies to: 322-323


1-331: Add CI duplicate detection for waives.txt

The provided script confirms duplicate entries in tests/integration/test_lists/waives.txt. To keep the list tidy and prevent regressions, please:

  • Remove the redundant waives at the following locations:
    • llmapi/test_llm_examples.py::test_llmapi_speculative_decoding_mtp (lines 243 & 304)
    • test_e2e.py::test_ptp_quickstart_multimodal[llava-v1.6-mistral-7b-llava-v1.6-mistral-7b-hf-image-False] (lines 266 & 322)
  • Integrate the duplicate-detection script into your CI or pre-commit hook using the provided Bash snippet. This will automatically flag any repeated SKIP entries before they land in the repo.

Let me know if you’d like help wiring this into your CI configuration.
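
A minimal Python equivalent of such a duplicate check (the Bash snippet referenced above is not reproduced here; the default path below is the repo's waives file):

import collections
import sys

def find_duplicate_waives(path="tests/integration/test_lists/waives.txt"):
    # Count test identifiers (the first whitespace-separated token on each SKIP line).
    counts = collections.Counter()
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            counts[line.split()[0]] += 1
    return sorted(name for name, n in counts.items() if n > 1)

if __name__ == "__main__":
    duplicates = find_duplicate_waives()
    if duplicates:
        print("Duplicate waive entries:")
        print("\n".join(duplicates))
        sys.exit(1)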

examples/llm-api/quickstart_example.py (1)

1-1: Add NVIDIA copyright header

Per repo guidelines, Python sources should carry the NVIDIA copyright header.

Add at the top of the file (before imports):

# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and limitations under the License.
tensorrt_llm/_torch/models/modeling_llama.py (3)

186-188: Safeguard FP4 tuple -> Fp4QuantizedTensor conversion.

The tuple-to-container conversion is fine, but please confirm Fp4QuantizedTensor supports positional construction. If it's not a @dataclass (or the signature differs), this will throw. If uncertain, use a defensive init.

-        if isinstance(attn_output, tuple):
-            attn_output = Fp4QuantizedTensor(attn_output[0], attn_output[1])
+        if isinstance(attn_output, tuple):
+            # attn_output: (fp4_tensor, scaling_factor)
+            attn_output = Fp4QuantizedTensor(attn_output[0], attn_output[1])
+        # Optionally, add a runtime check during bring-up:
+        # assert isinstance(attn_output, (torch.Tensor, Fp4QuantizedTensor))

Would you like me to scan the repo to verify the Fp4QuantizedTensor constructor signature?


556-567: Prefer explicit “no fusion” over None for fusion_op.

AllReduceParams typically expects an enum; passing None relies on downstream permissiveness. Use AllReduceFusionOp.NONE to be explicit and avoid type pitfalls.

-                hidden_states, residual = self.all_reduce(
+                hidden_states, residual = self.all_reduce(
                     hidden_states,
                     all_reduce_params=AllReduceParams(
-                        fusion_op=None,
+                        fusion_op=AllReduceFusionOp.NONE,
                         residual=residual,
                     ))

768-799: Align post-MLP fusion path with explicit “no fusion” enum.

Same concern as above; prefer AllReduceFusionOp.NONE over None.

-                hidden_states, residual = self.all_reduce(
+                hidden_states, residual = self.all_reduce(
                     hidden_states,
                     all_reduce_params=AllReduceParams(
-                        fusion_op=None,
+                        fusion_op=AllReduceFusionOp.NONE,
                         residual=residual,
                     ))
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)

740-743: Line length > 120 chars (Ruff E501).

Minor style nit. Consider breaking the long logger.info f-strings to satisfy E501.

Example tweak:

-                                logger.info(f"Run warmup for batch size={bs}, pure {'context' if num_tokens_per_request > 1 else 'generation'} phase")
+                                phase = 'context' if num_tokens_per_request > 1 else 'generation'
+                                logger.info(f"Run warmup for batch size={bs}, pure {phase} phase")
-            logger.info(f"Creating CUDA graph instances for {len(self._cuda_graph_batch_sizes)} batch sizes.")
+            num_sizes = len(self._cuda_graph_batch_sizes)
+            logger.info(f"Creating CUDA graph instances for {num_sizes} batch sizes.")

Also applies to: 771-773

cpp/tensorrt_llm/thop/moeOp.cpp (2)

392-394: Callsite update to stream-aware getWorkspaceInfo looks correct; ensure header for isCapturing is visible.

The switch to binding a const-ref and passing the CUDA stream is good. However, since getWorkspaceInfo now calls tensorrt_llm::common::isCapturing(stream), make sure this TU includes the declaration to avoid ODR/implicit-decl issues on some compilers.

Apply this diff near the other includes to make the dependency explicit:

 #include "tensorrt_llm/common/workspace.h"
+#include "tensorrt_llm/common/cudaUtils.h"
 #include "tensorrt_llm/kernels/cutlass_kernels/fp8_blockscale_gemm/fp8_blockscale_gemm.h"

705-706: Member naming: use m-prefix for member variables.

Coding guidelines require mPrefix for members. workspace_info should be renamed to mWorkspaceInfo. Update all references (call sites returning the const-ref and internal assignments).

Apply these diffs:

@@
-    WorkspaceInfo workspace_info;
+    WorkspaceInfo mWorkspaceInfo;
@@
-        WorkspaceInfo const& workspace_info = getWorkspaceInfo(num_rows, hidden_size, inter_size, num_experts_total,
+        WorkspaceInfo const& workspace_info = getWorkspaceInfo(num_rows, hidden_size, inter_size, num_experts_total,
             static_cast<int>(experts_per_token), base_activation_type, parallelism_config, min_latency_mode, stream);
@@
-        WorkspaceInfo const& workspace_info = getWorkspaceInfo(num_rows, hidden_size, inter_size, num_experts_total,
+        WorkspaceInfo const& workspace_info = getWorkspaceInfo(num_rows, hidden_size, inter_size, num_experts_total,
             static_cast<int>(experts_per_token), base_activation_type, parallelism_config, min_latency_mode, stream);
@@
-        return workspace_info;
+        return mWorkspaceInfo;
@@
-        if (is_capturing || workspace_info.workspace.numel() < total_workspace_size)
+        if (is_capturing || mWorkspaceInfo.workspace.numel() < static_cast<int64_t>(total_workspace_size))
         {
             if (is_capturing)
             {
                 TLLM_LOG_DEBUG(
-                    "Allocating MoE workspace with %ld bytes size during cuda graph capture", total_workspace_size);
+                    "Allocating MoE workspace with %zu bytes size during cuda graph capture", total_workspace_size);
             }
             else
             {
-                TLLM_LOG_DEBUG("MoE workspace size is not enough, increase the size from %ld bytes to %ld bytes",
-                    workspace_info.workspace.numel(), total_workspace_size);
+                TLLM_LOG_DEBUG("MoE workspace size is not enough, increase the size from %lld bytes to %zu bytes",
+                    static_cast<long long>(mWorkspaceInfo.workspace.numel()), total_workspace_size);
             }
-            workspace_info.workspace = torch::empty({static_cast<long>(total_workspace_size)},
+            mWorkspaceInfo.workspace = torch::empty({static_cast<int64_t>(total_workspace_size)},
                 torch::dtype(torch::kInt8).device(torch::kCUDA).requires_grad(false));
         }
-        workspace_info.src_to_dest_map
-            = common::nextWorkspacePtr(static_cast<int8_t*>(workspace_info.workspace.data_ptr()), moe_workspace_size);
+        mWorkspaceInfo.src_to_dest_map
+            = common::nextWorkspacePtr(static_cast<int8_t*>(mWorkspaceInfo.workspace.data_ptr()), moe_workspace_size);

Also applies to: 392-394, 411-426, 550-552, 784-788

tests/integration/defs/test_e2e.py (1)

626-685: Throughput increase assertion might be too strict for all environments.

The 1.3x step-up per concurrency level could be brittle across devices/MIG partitions. Consider relaxing to >1.1x or asserting monotonic increase only to reduce false negatives in CI variability.

tensorrt_llm/_torch/models/modeling_mistral.py (1)

456-461: Type hint tuple[...] is Python 3.9+; repo targets 3.8+.

Per guidelines, Python must target 3.8+. Use typing.Tuple for forward/backward compatibility.

Apply this diff (typing is already imported at top):

-    ) -> tuple[torch.Tensor, torch.Tensor]:
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
tensorrt_llm/_torch/models/modeling_phi4mm.py (2)

26-36: PreTrainedModel shim and special tokens: LGTM with a caveat.

Special tokens centralized is good. The temporary NewPreTrainedModel solves HF 4.53.1 interface needs, but see below for a safer pattern to avoid global monkey-patching.


86-88: mm_processor_kwargs is unused; either wire it through or drop it.

Ruff flagged F841. If the processor supports extra kwargs, forward them; otherwise, remove the variable.

Apply one of the following:

  • Wire through:
-        text_prompt, mm_data, mm_processor_kwargs = inputs.get("prompt"), \
-                        inputs.get("multi_modal_data", {}), inputs.get("mm_processor_kwargs", {})
+        text_prompt = inputs.get("prompt")
+        mm_data = inputs.get("multi_modal_data", {})
+        mm_processor_kwargs = inputs.get("mm_processor_kwargs", {})
@@
-        inputs = self.processor(text=[text_prompt],
-                                images=images,
-                                audios=audios,
-                                return_tensors='pt').to(self.device)
+        inputs = self.processor(
+            text=[text_prompt],
+            images=images,
+            audios=audios,
+            return_tensors='pt',
+            **mm_processor_kwargs,
+        ).to(self.device)
  • Or drop the unused local:
-        text_prompt, mm_data, mm_processor_kwargs = inputs.get("prompt"), \
-                        inputs.get("multi_modal_data", {}), inputs.get("mm_processor_kwargs", {})
+        text_prompt = inputs.get("prompt")
+        mm_data = inputs.get("multi_modal_data", {})

Also applies to: 100-104

examples/llm-api/_tensorrt_engine/quickstart_example.py (2)

6-9: Optional: Make “engine” intent explicit (if API supports it).

Given this lives under _tensorrt_engine/, consider showing an explicit engine load so readers don't assume the PyTorch backend. If LLM supports passing an engine directory or a from_engine() constructor, update the snippet to highlight that path. If not, disregard.

I can propose a concrete update once you confirm the preferred public API (e.g., LLM(engine_dir=...), LLM.from_engine(...), or runtime="trt").


25-30: Wrap or trim exceedingly long comment lines (Ruff E501).

The sample output comments exceed the 120-char limit flagged by Ruff. Breaking them improves readability and keeps linters quiet.

Apply this diff to wrap the example comments:

-    # Got output like
-    # Prompt: 'Hello, my name is', Generated text: '\n\nJane Smith. I am a student pursuing my degree in Computer Science at [university]. I enjoy learning new things, especially technology and programming'
-    # Prompt: 'The president of the United States is', Generated text: 'likely to nominate a new Supreme Court justice to fill the seat vacated by the death of Antonin Scalia. The Senate should vote to confirm the'
-    # Prompt: 'The capital of France is', Generated text: 'Paris.'
-    # Prompt: 'The future of AI is', Generated text: 'an exciting time for us. We are constantly researching, developing, and improving our platform to create the most advanced and efficient model available. We are'
+    # Example outputs:
+    # Prompt: 'Hello, my name is'
+    #   Generated text: 'Jane Smith. I am a student pursuing my degree in
+    #   Computer Science at [university]...'
+    # Prompt: 'The capital of France is'
+    #   Generated text: 'Paris.'
+    # Prompt: 'The future of AI is'
+    #   Generated text: 'an exciting time for us...'
tests/integration/defs/accuracy/test_llm_api_pytorch.py (3)

444-456: Add an explicit FP8 assertion and clarify CUDA graph capture size.

Minor but helpful:

  • Assert the quant algo is FP8 to fail fast if a wrong artifact is routed.
  • Since you’re using max_batch_size=1 for CUDA graphs but LLM(..., max_batch_size=16), consider explicitly pinning batch sizes to [1] for clarity.

Apply:

@@
-    def test_fp8_eagle3_tp8(self, eagle3_one_model):
+    def test_fp8_eagle3_tp8(self, eagle3_one_model):
         model_path = f"{llm_models_root()}/modelopt-hf-model-hub/Llama-3.3-70B-Instruct-fp8"
@@
-        pytorch_config = dict(
-            disable_overlap_scheduler=True,
-            cuda_graph_config=CudaGraphConfig(max_batch_size=1))
+        pytorch_config = dict(
+            disable_overlap_scheduler=True,
+            cuda_graph_config=CudaGraphConfig(batch_sizes=[1], max_batch_size=1))
@@
-        with LLM(model_path,
+        with LLM(model_path,
                  max_batch_size=16,
                  tensor_parallel_size=8,
                  speculative_config=spec_config,
                  kv_cache_config=kv_cache_config,
                  **pytorch_config) as llm:
+            assert llm.args.quant_config.quant_algo == QuantAlgo.FP8
             task = CnnDailymail(self.MODEL_NAME)
             task.evaluate(llm)

784-794: Auto-dtype for MistralSmall24B: memory guard present; consider documenting why 0.75 is chosen.

The 80GB skip and free_gpu_memory_fraction=0.75 look sane for 24B. Optional: add a brief comment on why 0.75 is needed (empirical headroom vs warmup/OOM).
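
For example, the rationale could live next to the value itself (sketch; the stated headroom reasoning is an assumption to be replaced with the actual measurement):

```python
from tensorrt_llm.llmapi import KvCacheConfig

# Keep ~25% of GPU memory free for activations, CUDA-graph warmup, and allocator
# fragmentation on 80GB parts (assumed rationale; update with the measured numbers).
kv_cache_config = KvCacheConfig(free_gpu_memory_fraction=0.75)
```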


795-807: MistralSmall24B FP8: assert kv-cache quant if expected; otherwise add a note.

You assert weights/activations FP8. If the prequantized artifact also sets KV to FP8, add an assertion for kv_cache_quant_algo == QuantAlgo.FP8 to catch regressions. If KV remains bf16, add a one-liner comment to avoid confusion.

Proposed tweak if KV is FP8:

@@
-        with LLM(model_path, kv_cache_config=kv_cache_config) as llm:
-            assert llm.args.quant_config.quant_algo == QuantAlgo.FP8
+        with LLM(model_path, kv_cache_config=kv_cache_config) as llm:
+            assert llm.args.quant_config.quant_algo == QuantAlgo.FP8
+            # If artifact enables KV FP8 as well, keep this; else drop it.
+            assert llm.args.quant_config.kv_cache_quant_algo == QuantAlgo.FP8

If KV is intentionally not FP8, add:

@@
-            assert llm.args.quant_config.quant_algo == QuantAlgo.FP8
+            assert llm.args.quant_config.quant_algo == QuantAlgo.FP8
+            # Note: KV cache remains bf16 for this artifact.

Comment on lines +765 to +790
        int64_t const total_workspace_size = common::calculateTotalWorkspaceSize(workspaces.data(), workspaces.size());

        WorkspaceInfo info{};
        info.workspace = torch::empty({static_cast<long>(total_workspace_size)},
            torch::dtype(torch::kInt8).device(torch::kCUDA).requires_grad(false));
        info.src_to_dest_map
            = common::nextWorkspacePtr(static_cast<int8_t*>(info.workspace.data_ptr()), moe_workspace_size);
        bool is_capturing = tensorrt_llm::common::isCapturing(stream);
        // Always allocate workspace when capturing cuda graph to avoid illegal memory access during replay
        if (is_capturing || workspace_info.workspace.numel() < total_workspace_size)
        {
            if (is_capturing)
            {
                TLLM_LOG_DEBUG(
                    "Allocating MoE workspace with %ld bytes size during cuda graph capture", total_workspace_size);
            }
            else
            {
                TLLM_LOG_DEBUG("MoE workspace size is not enough, increase the size from %ld bytes to %ld bytes",
                    workspace_info.workspace.numel(), total_workspace_size);
            }
            workspace_info.workspace = torch::empty({static_cast<long>(total_workspace_size)},
                torch::dtype(torch::kInt8).device(torch::kCUDA).requires_grad(false));
        }

🛠️ Refactor suggestion

Use size_t for byte sizes and correct printf-specifiers; avoid narrowing to long.

  • calculateTotalWorkspaceSize returns size_t; keep it size_t.
  • Prefer %zu for size_t in logs, and cast int64 prints explicitly to long long (%lld).
  • Avoid static_cast<long>(...) for tensor shapes; Torch expects int64_t dimensions.

Apply this diff:

-        int64_t const total_workspace_size = common::calculateTotalWorkspaceSize(workspaces.data(), workspaces.size());
+        size_t const total_workspace_size = common::calculateTotalWorkspaceSize(workspaces.data(), workspaces.size());
@@
-                TLLM_LOG_DEBUG(
-                    "Allocating MoE workspace with %ld bytes size during cuda graph capture", total_workspace_size);
+                TLLM_LOG_DEBUG(
+                    "Allocating MoE workspace with %zu bytes size during cuda graph capture", total_workspace_size);
@@
-                TLLM_LOG_DEBUG("MoE workspace size is not enough, increase the size from %ld bytes to %ld bytes",
-                    workspace_info.workspace.numel(), total_workspace_size);
+                TLLM_LOG_DEBUG("MoE workspace size is not enough, increase the size from %lld bytes to %zu bytes",
+                    static_cast<long long>(mWorkspaceInfo.workspace.numel()), total_workspace_size);
@@
-            workspace_info.workspace = torch::empty({static_cast<long>(total_workspace_size)},
+            mWorkspaceInfo.workspace = torch::empty({static_cast<int64_t>(total_workspace_size)},
                 torch::dtype(torch::kInt8).device(torch::kCUDA).requires_grad(false));

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In cpp/tensorrt_llm/thop/moeOp.cpp around lines 765-783, change
total_workspace_size to size_t (it’s returned as size_t), update log format
specifiers to use %zu for size_t values, and when printing int64 values like
workspace_info.workspace.numel() cast explicitly to long long and use %lld;
finally avoid static_cast<long>(...) when building the torch::empty size — use
an int64_t cast (e.g., static_cast<int64_t>(total_workspace_size)) so the tensor
dimension type matches Torch’s expected int64_t.

Comment on lines +4 to +7
```{literalinclude} ../../examples/llm-api/_tensorrt_engine/quickstart_example.py
:language: python
:linenos:
```

💡 Verification agent

🧩 Analysis chain

Fix literalinclude relative path (currently off by one level).

The code block includes the script via "../../examples/…", but this document lives under docs/source/legacy. To reach the repo-level examples directory, you need to go up three levels (not two). The link below (Line 9) already uses "../../../…", which further indicates the mismatch.

Apply this diff:

-```{literalinclude} ../../examples/llm-api/_tensorrt_engine/quickstart_example.py
+```{literalinclude} ../../../examples/llm-api/_tensorrt_engine/quickstart_example.py
     :language: python
     :linenos:


Build docs locally to catch include errors:

---

🏁 Script executed:

```shell
#!/bin/bash
# Strict docs build to fail on warnings
python -m pip install -r docs/requirements.txt
sphinx-build -nW docs/source docs/_build/html
```

Length of output: 11225


Let’s bypass the PATH issue and invoke Sphinx as a module:


🏁 Script executed:

```shell
#!/bin/bash
# Strict docs build to fail on warnings (invoke via module to avoid PATH problems)
python -m sphinx -nW docs/source docs/_build/html
```

Length of output: 2359


🏁 Script executed:

```shell
#!/bin/bash
# Verify existence of example file via both relative paths from the docs file
BASE="docs/source/legacy"
INCORRECT="../../examples/llm-api/_tensorrt_engine/quickstart_example.py"
CORRECT="../../../examples/llm-api/_tensorrt_engine/quickstart_example.py"

echo "Checking incorrect path (../../examples/...):"
if [[ -f "${BASE}/${INCORRECT}" ]]; then
  echo "FOUND: ${BASE}/${INCORRECT}"
else
  echo "MISSING: ${BASE}/${INCORRECT}"
fi

echo "Checking corrected path (../../../examples/...):"
if [[ -f "${BASE}/${CORRECT}" ]]; then
  echo "FOUND: ${BASE}/${CORRECT}"
else
  echo "MISSING: ${BASE}/${CORRECT}"
fi
```

Length of output: 949


Fix literalinclude path in docs/source/legacy/tensorrt_quickstart.md

The current {literalinclude} goes up only two levels and fails to locate the example; it must go up three levels to reach the repository’s examples directory (verified via shell check).

Apply this patch:

--- docs/source/legacy/tensorrt_quickstart.md
@@ lines 4-7
-```{literalinclude} ../../examples/llm-api/_tensorrt_engine/quickstart_example.py
+```{literalinclude} ../../../examples/llm-api/_tensorrt_engine/quickstart_example.py
     :language: python
     :linenos:

🤖 Prompt for AI Agents
In docs/source/legacy/tensorrt_quickstart.md around lines 4 to 7 the
{literalinclude} path incorrectly goes up two directories and cannot find the
example; change the include to go up three directories so it points to
../../../examples/llm-api/_tensorrt_engine/quickstart_example.py instead of
../../examples/llm-api/_tensorrt_engine/quickstart_example.py, keeping the
:language: python and :linenos: options intact.

@@ -0,0 +1,33 @@
from tensorrt_llm import LLM, SamplingParams

🛠️ Refactor suggestion

Add NVIDIA SPDX header per repo guidelines.

Examples are source files and should carry the NVIDIA SPDX header.

Apply this diff to prepend the header:

+# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# SPDX-License-Identifier: Apache-2.0
+
 from tensorrt_llm import LLM, SamplingParams
🤖 Prompt for AI Agents
In examples/llm-api/_tensorrt_engine/quickstart_example.py around lines 1 to 1,
the file is missing the required NVIDIA SPDX header; prepend the repository's
standard NVIDIA SPDX header block (e.g., SPDX-License-Identifier and
copyright/ownership lines as specified by project guidelines) at the very top of
the file before any imports so the example carries the required license
metadata.

Comment on lines +1 to +2
from tensorrt_llm import BuildConfig, SamplingParams
from tensorrt_llm._tensorrt_engine import LLM # NOTE the change

🛠️ Refactor suggestion

⚠️ Potential issue

Use public API import for LLM; avoid private module path

End-user examples should import LLM from the public package namespace. Importing from _tensorrt_engine is an internal detail and may change without notice.

Apply this diff:

-from tensorrt_llm import BuildConfig, SamplingParams
-from tensorrt_llm._tensorrt_engine import LLM  # NOTE the change
+from tensorrt_llm import BuildConfig, SamplingParams, LLM
🤖 Prompt for AI Agents
In examples/llm-api/quickstart_example.py around lines 1 to 2, the code imports
LLM from a private module path (tensorrt_llm._tensorrt_engine) instead of the
public package API; change the import to pull LLM from the public package (e.g.,
from tensorrt_llm import LLM alongside BuildConfig and SamplingParams) so
examples rely only on the stable public surface area and avoid referencing
internal modules.

Comment on lines +219 to +229
                inputs = default_multimodal_input_loader(
                    tokenizer=llm.tokenizer,
                    model_dir=llm._hf_model_dir,
                    model_type=model_type,
                    modality=args.modality,
                    prompts=[cur_prompt],
                    media=args.media,
                    image_data_format="pt",
                    num_frames=8,
                    device="cpu")


⚠️ Potential issue

Fix: nested-media modalities will assert in loader; also honor CLI image_format/num_frames/device.

default_multimodal_input_loader asserts when len(prompts)=1 and media is a list-of-lists (e.g., modality="image_audio"). Your current call passes prompts=[cur_prompt] and media=args.media unchanged, which will trip the assert for nested media. Additionally, the code ignores user CLI values and hardcodes "pt"/8/"cpu".

Apply this refactor to (1) select a single sample for nested-media modalities, (2) pass user-specified format/frames/device, and (3) keep model_dir type consistent:

-                inputs = default_multimodal_input_loader(
-                    tokenizer=llm.tokenizer,
-                    model_dir=llm._hf_model_dir,
-                    model_type=model_type,
-                    modality=args.modality,
-                    prompts=[cur_prompt],
-                    media=args.media,
-                    image_data_format="pt",
-                    num_frames=8,
-                    device="cpu")
+                # For nested-media (e.g., image_audio = [ [img,aud], [img,aud], ... ]),
+                # pick one sample to pair with a single-turn prompt. For flat media
+                # (image/video/audio), 1 prompt + N media is supported by the loader.
+                media_for_turn = args.media
+                if isinstance(media_for_turn, list) and media_for_turn and isinstance(media_for_turn[0], list):
+                    media_for_turn = [media_for_turn[0]]
+
+                inputs = default_multimodal_input_loader(
+                    tokenizer=llm.tokenizer,
+                    model_dir=str(llm._hf_model_dir),
+                    model_type=model_type,
+                    modality=args.modality,
+                    prompts=[cur_prompt],
+                    media=media_for_turn,
+                    image_data_format=image_format,
+                    num_frames=args.num_frames,
+                    device=args.device)

Follow-up: If you want to reuse the same nested media across turns, consider extracting the first sample once outside the loop and reusing it to avoid repeated conditionals.
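
A sketch of that follow-up, reusing the names from the diff above (the surrounding multi-turn loop is paraphrased and its loop variable is hypothetical):

```python
# Hoisted once, before the multi-turn loop: pick a single nested-media sample.
media_for_turn = args.media
if isinstance(media_for_turn, list) and media_for_turn and isinstance(media_for_turn[0], list):
    media_for_turn = [media_for_turn[0]]

for cur_prompt in conversation_prompts:  # hypothetical loop over turns
    inputs = default_multimodal_input_loader(
        tokenizer=llm.tokenizer,
        model_dir=str(llm._hf_model_dir),
        model_type=model_type,
        modality=args.modality,
        prompts=[cur_prompt],
        media=media_for_turn,
        image_data_format=image_format,
        num_frames=args.num_frames,
        device=args.device)
```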

🤖 Prompt for AI Agents
In examples/llm-api/quickstart_multimodal.py around lines 219 to 229, the call
to default_multimodal_input_loader will assert for nested-media modalities when
prompts=[cur_prompt] because media may be a list-of-lists; also
image_data_format/num_frames/device are hardcoded and model_dir type must remain
consistent. Fix by: if media is nested (e.g., a list and its first element is a
list) and you are sending a single prompt, select the corresponding single-media
sample (media = media[0]) before calling the loader; replace hardcoded "pt", 8
and "cpu" with args.image_format, args.num_frames, and args.device respectively;
and ensure the model_dir argument uses the same type the loader expects (use
llm._hf_model_dir as-is or cast consistently). Keep this selection logic minimal
and consider hoisting the single-sample extraction outside any loop if you want
to reuse the same nested media across turns.

Comment on lines 160 to 166
    chat_completion = client.chat.completions.create(
        model=model_name,
        messages=messages,
-       max_completion_tokens=100,
+       max_completion_tokens=256,
        response_format={
            "type":
            "structural_tag",

💡 Verification agent

🧩 Analysis chain

Use the correct parameter name: max_tokens (not max_completion_tokens).

chat.completions.create expects max_tokens; max_completion_tokens is unsupported and may be ignored or raise.

Apply this diff:

-        max_completion_tokens=256,
+        max_tokens=256,

Run to find other occurrences to fix:


🏁 Script executed:

#!/bin/bash
rg -nP '\bmax_completion_tokens\b' -C2 tests/

Length of output: 14891


Replace max_completion_tokens with max_tokens in all affected tests

The following test files use the unsupported max_completion_tokens parameter and must be updated to max_tokens:

• tests/unittest/llmapi/apps/_test_openai_chat_structural_tag.py (line 163)
• tests/unittest/llmapi/apps/_test_openai_chat.py (lines 103, 162, 201, 223, 236, 258, 269, 322, 338, 402, 414, 438, 448, 458, 484, 497, 520)
• tests/unittest/llmapi/apps/_test_openai_reasoning.py (lines 48, 95, 108)
• tests/unittest/llmapi/apps/_test_openai_multi_gpu.py (line 123)
• tests/unittest/llmapi/apps/_test_openai_consistent_chat.py (lines 109, 154, 200)
• tests/unittest/llmapi/apps/_test_openai_chat_multimodal.py (lines 85, 90, 145, 150, 198, 203, 253, 258, 310, 321)

Apply this change everywhere:

-    max_completion_tokens=<value>,
+    max_tokens=<value>,

This ensures the client’s chat.completions.create calls use the correct max_tokens parameter.
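
A throwaway script along these lines could apply the rename across the listed files (sketch; run from the repository root and review the resulting diff before committing):

```python
import pathlib
import re

# Word-boundary match so unrelated identifiers are left untouched.
pattern = re.compile(r"\bmax_completion_tokens\b")

for path in pathlib.Path("tests/unittest/llmapi/apps").glob("_test_openai_*.py"):
    text = path.read_text()
    new_text = pattern.sub("max_tokens", text)
    if new_text != text:
        path.write_text(new_text)
        print(f"updated {path}")
```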

🤖 Prompt for AI Agents
In tests/unittest/llmapi/apps/_test_openai_chat_structural_tag.py around lines
160 to 166, the chat.completions.create call uses the unsupported parameter
max_completion_tokens; replace that parameter name with max_tokens preserving
its value and placement. Also apply the same replacement across all affected
test files and line ranges listed in the review comment so every
chat.completions.create call uses max_tokens instead of max_completion_tokens.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (3)
tensorrt_llm/_torch/modules/linear.py (3)

625-630: Fix: make FP8 block-scale shape handling safetensors-friendly; also pin device for sharded load.

Calling .dim()/.squeeze() on weights[0][scale_name] breaks when it’s a safetensors slice (PySafeSlice). Use a helper that inspects shape via get_shape()/.shape and drops singleton dims through indexing. Also pass device=module.weight_scale.device to avoid CPU/CUDA mismatches when modules are already on GPU.

Apply this diff within the selected lines:

-        full_weight_scale = weights[0][scale_name]
-        # modelopt fp8_pb_wo can have 2 extra singleton dimensions
-        if full_weight_scale.dim() == 4:
-            full_weight_scale = full_weight_scale.squeeze(1).squeeze(-1)
-        weight_scale = load_weight_shard(full_weight_scale, module.tp_size,
-                                         module.tp_rank, module.tp_mode)
+        full_weight_scale = self._normalize_fp8_block_scale_like(weights[0][scale_name])
+        weight_scale = load_weight_shard(
+            full_weight_scale,
+            module.tp_size,
+            module.tp_rank,
+            module.tp_mode,
+            device=module.weight_scale.device,
+        )

Add this helper inside FP8BlockScalesLinearMethod (outside the selected range):

def _normalize_fp8_block_scale_like(self, scale_like):
    """
    Accept a torch.Tensor or a safetensors slice-like. If the shape is [R, 1, C, 1],
    lazily drop singleton dims to [R, C] without materializing.
    """
    get_shape = getattr(scale_like, "get_shape", None)
    shape = tuple(get_shape()) if callable(get_shape) else tuple(getattr(scale_like, "shape", ()))
    if len(shape) == 4 and shape[1] == 1 and shape[-1] == 1:
        # Indexing preserves safetensors laziness
        return scale_like[:, 0, :, 0]
    return scale_like

643-660: Refactor Q/K/V scale normalization to helper; avoid safetensors-incompatible .dim()/.squeeze() and ensure correct device.

Same issue as above for full_q_scale/full_k_scale/full_v_scale. Normalize with the helper and shard onto module.weight_scale.device to prevent device mismatch during copy_.

Apply this diff within the selected lines:

-        full_q_scale = weights[0][scale_name]
-        full_k_scale = weights[1][scale_name]
-        full_v_scale = weights[2][scale_name]
-        # modelopt fp8_pb_wo can have 2 extra singleton dimensions
-        if full_q_scale.dim() == 4:
-            full_q_scale = full_q_scale.squeeze(1).squeeze(-1)
-        if full_k_scale.dim() == 4:
-            full_k_scale = full_k_scale.squeeze(1).squeeze(-1)
-        if full_v_scale.dim() == 4:
-            full_v_scale = full_v_scale.squeeze(1).squeeze(-1)
-        q_scale = load_weight_shard(full_q_scale, module.tp_size,
-                                    module.tp_rank, module.tp_mode)
-        k_scale = load_weight_shard(full_k_scale, module.tp_size,
-                                    module.tp_rank, module.tp_mode)
-        v_scale = load_weight_shard(full_v_scale, module.tp_size,
-                                    module.tp_rank, module.tp_mode)
+        full_q_scale = self._normalize_fp8_block_scale_like(weights[0][scale_name])
+        full_k_scale = self._normalize_fp8_block_scale_like(weights[1][scale_name])
+        full_v_scale = self._normalize_fp8_block_scale_like(weights[2][scale_name])
+        q_scale = load_weight_shard(
+            full_q_scale, module.tp_size, module.tp_rank, module.tp_mode,
+            device=module.weight_scale.device)
+        k_scale = load_weight_shard(
+            full_k_scale, module.tp_size, module.tp_rank, module.tp_mode,
+            device=module.weight_scale.device)
+        v_scale = load_weight_shard(
+            full_v_scale, module.tp_size, module.tp_rank, module.tp_mode,
+            device=module.weight_scale.device)
         fused_fp8_block_scale = torch.cat((q_scale, k_scale, v_scale))

671-682: Normalize left/right FP8 scales via helper and shard on the correct device.

Eliminate direct .dim()/.squeeze() to support safetensors slices and pass the destination device to load_weight_shard.

Apply this diff within the selected lines:

-        full_left_scale = weights[0][scale_name]
-        full_right_scale = weights[1][scale_name]
-        # modelopt fp8_pb_wo can have 2 extra singleton dimensions
-        if full_left_scale.dim() == 4:
-            full_left_scale = full_left_scale.squeeze(1).squeeze(-1)
-        if full_right_scale.dim() == 4:
-            full_right_scale = full_right_scale.squeeze(1).squeeze(-1)
-        left_scale = load_weight_shard(full_left_scale, module.tp_size,
-                                       module.tp_rank, module.tp_mode)
-        right_scale = load_weight_shard(full_right_scale, module.tp_size,
-                                        module.tp_rank, module.tp_mode)
+        full_left_scale = self._normalize_fp8_block_scale_like(weights[0][scale_name])
+        full_right_scale = self._normalize_fp8_block_scale_like(weights[1][scale_name])
+        left_scale = load_weight_shard(
+            full_left_scale, module.tp_size, module.tp_rank, module.tp_mode,
+            device=module.weight_scale.device)
+        right_scale = load_weight_shard(
+            full_right_scale, module.tp_size, module.tp_rank, module.tp_mode,
+            device=module.weight_scale.device)
         fused_scale = torch.cat([left_scale, right_scale], dim=0)
🧹 Nitpick comments (1)
tensorrt_llm/_torch/modules/linear.py (1)

1-1: Add NVIDIA copyright header per repo guidelines.

Per coding guidelines, prepend the NVIDIA copyright header (current year) to all source files.

Apply at the top of the file:

+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+
 from __future__ import annotations
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between e49078f and 7da226a.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/attention_backend/trtllm.py (1 hunks)
  • tensorrt_llm/_torch/modules/linear.py (3 hunks)
  • tests/integration/test_lists/waives.txt (3 hunks)
✅ Files skipped from review due to trivial changes (1)
  • tests/integration/test_lists/waives.txt
🚧 Files skipped from review as they are similar to previous changes (1)
  • tensorrt_llm/_torch/attention_backend/trtllm.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/_torch/modules/linear.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/_torch/modules/linear.py
🧬 Code graph analysis (1)
tensorrt_llm/_torch/modules/linear.py (2)
tensorrt_llm/_torch/distributed/communicator.py (2)
  • tp_size (46-47)
  • tp_rank (54-55)
tensorrt_llm/mapping.py (1)
  • tp_rank (338-339)

@tensorrt-cicd
Collaborator

PR_Github #17076 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #17075 [ run ] completed with state ABORTED

@dominicshanshan
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #17076 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12835 completed with status: 'FAILURE'

@tensorrt-cicd
Collaborator

PR_Github #17077 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #17077 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12836 completed with status: 'FAILURE'

@dominicshanshan
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #17081 [ run ] triggered by Bot

@dominicshanshan
Collaborator Author

The previous bot run already passed the full CI pipeline, but we keep hitting org.jenkinsci.plugins.workflow.actions.ErrorAction$ErrorId: 6aa38ccf-181e-4f45-894b-4d674bb99d8a java.io.IOException: Failed to kill container '526f7d4dd88de86db97929d898581b47d9d3ed47edb957047045b855943db929'., which is a known infrastructure issue, so I will run the bot skip command to pass this PR. @EmmaQiaoCh @chzblych

@tensorrt-cicd
Collaborator

PR_Github #17081 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12840 completed with status: 'FAILURE'

@dominicshanshan
Collaborator Author

/bot skip --comment "passed full CI pipeline"

@tensorrt-cicd
Collaborator

PR_Github #17086 [ skip ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #17086 [ skip ] completed with state SUCCESS
Skipping testing for commit 2576c53

crazydemo and others added 9 commits August 31, 2025 19:08
Signed-off-by: Ivy Zhang <[email protected]>
Signed-off-by: Wangshanshan <[email protected]>
…H connection (NVIDIA#6971)

Signed-off-by: Yanchao Lu <[email protected]>
Co-authored-by: Zhanrui Sun <[email protected]>
Signed-off-by: Wangshanshan <[email protected]>
…images (NVIDIA#7157)

Signed-off-by: Dimitrios Bariamis <[email protected]>
Signed-off-by: Dimitrios Bariamis <[email protected]>
Co-authored-by: Dimitrios Bariamis <[email protected]>
Signed-off-by: Wangshanshan <[email protected]>
Signed-off-by: Dimitrios Bariamis <[email protected]>
Co-authored-by: Dimitrios Bariamis <[email protected]>
Signed-off-by: Wangshanshan <[email protected]>
@dominicshanshan
Collaborator Author

/bot skip --comment "passed full CI pipeline"

@tensorrt-cicd
Collaborator

PR_Github #17135 [ skip ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #17135 [ skip ] completed with state SUCCESS
Skipping testing for commit 7040a83

@joyang-nv joyang-nv merged commit b0558c7 into NVIDIA:main Sep 1, 2025
5 checks passed