[None][chore] Mass integration of release/1.0 - 2nd #7171
📝 Walkthrough

Removed active release-branch gating from CODEOWNERS (kept as a commented block), added TensorRT-engine quickstart docs and example, made the MoE workspace per-instance and stream-aware, parameterized the Triton build with TRITON_BASE_TAG, refactored multiple models/quantization and multimodal paths, replaced multiprocessing with thread pools for weight prefetch, and applied broad test/CI manifest updates.
Sequence Diagram(s)

sequenceDiagram
autonumber
participant Caller
participant FusedMoeRunner
participant CUDA as CUDA Runtime
Caller->>FusedMoeRunner: getWorkspaceInfo(num_rows, hidden_size,..., min_latency_mode, stream)
activate FusedMoeRunner
FusedMoeRunner->>CUDA: inspect stream & capture state
alt capturing or workspace too small
FusedMoeRunner->>CUDA: allocate/resize workspace on stream
Note over FusedMoeRunner: persist into workspace_info.workspace<br/>update src_to_dest_map
end
FusedMoeRunner-->>Caller: return WorkspaceInfo const&
deactivate FusedMoeRunner
Caller->>CUDA: launch kernels using provided stream and workspace buffers
sequenceDiagram
autonumber
participant User
participant Script as quickstart_multimodal.py
participant Loader as default_multimodal_input_loader
participant LLM
User->>Script: run --multiturn --conversation_turns=N
loop for each turn ≤ N
Script->>Loader: load inputs(conversation_history, media, device)
Loader-->>Script: pixel_values, image_sizes, prompts
Script->>LLM: generate(inputs[, lora_request])
LLM-->>Script: response
Script->>Script: append response to conversation_history
end
Script-->>User: print recap of prompts and responses
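A rough, self-contained Python sketch of the multiturn loop shown in the second diagram; the loader and LLM calls are replaced with local stand-ins, so only the conversation bookkeeping mirrors the script:

```python
# Sketch of the multiturn flow from the diagram above. `load_inputs` and
# `generate` are local stand-ins for the real multimodal input loader and
# LLM.generate; the point here is how the conversation history accumulates.
def load_inputs(history: str, media: list) -> dict:
    return {"prompt": history, "media": media}


def generate(inputs: dict) -> str:
    return f"(response to: {inputs['prompt']!r})"


def run_multiturn(turn_prompts: list, media: list) -> list:
    conversation_history = ""
    recap = []
    for user_prompt in turn_prompts:
        conversation_history += f"USER: {user_prompt}\n"
        inputs = load_inputs(conversation_history, media)
        response = generate(inputs)
        conversation_history += f"ASSISTANT: {response}\n"
        recap.append((user_prompt, response))
    return recap


for prompt, response in run_multiturn(["Describe the image.", "Any text in it?"],
                                      ["img.png"]):
    print(f"Prompt: {prompt!r}, Generated text: {response!r}")
```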
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
f3f6a88 to e49078f (Compare)
Actionable comments posted: 24
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (16)
tests/unittest/llmapi/apps/_test_openai_chat_structural_tag.py (2)
25-37
: Temp file naming can collide across workers; generate a unique file.Hard-coding extra_llm_api_options.yaml in the system temp dir risks cross-test/process collisions (pytest-xdist) and flaky cleanup. Prefer a unique, per-fixture temp file.
Apply this diff:
-@pytest.fixture(scope="module") -def temp_extra_llm_api_options_file(): - temp_dir = tempfile.gettempdir() - temp_file_path = os.path.join(temp_dir, "extra_llm_api_options.yaml") +@pytest.fixture(scope="module") +def temp_extra_llm_api_options_file(): + fd, temp_file_path = tempfile.mkstemp(prefix="extra_llm_api_options_", suffix=".yaml") + os.close(fd) try: extra_llm_api_options_dict = {"guided_decoding_backend": "xgrammar"} - - with open(temp_file_path, 'w') as f: - yaml.dump(extra_llm_api_options_dict, f) + with open(temp_file_path, "w") as f: + yaml.safe_dump(extra_llm_api_options_dict, f) yield temp_file_path finally: - if os.path.exists(temp_file_path): - os.remove(temp_file_path) + try: + os.remove(temp_file_path) + except FileNotFoundError: + pass
123-146
: Fix prompt contradictions and remove unused tool reference.The prompt instructs “Only call one function at a time,” yet the test asserts both function tags are present in a single reply. Also, “brave_search” isn’t a defined tool here and may distract guided decoding.
Apply this focused replacement within the system_prompt string:
-If a you choose to call a function ONLY reply in the following format: +If you choose to call a function, ONLY reply in the following format: @@ -Reminder: -- Function calls MUST follow the specified format -- Required parameters MUST be specified -- Only call one function at a time -- Put the entire function call reply on one line -- Always add your sources when using search results to answer the user query +Reminder: +- Function calls MUST follow the specified format. +- Required parameters MUST be specified. +- You may call multiple functions to fulfill a single request; if calling more than one, put each function call on its own line. +- Put each function call on a single line and do not include any text outside the <function=...>...</function> tags.Optionally drop this line earlier in the prompt since it references an undefined tool here:
-- When looking for real time information use relevant functions if available else fallback to brave_search
tests/unittest/llmapi/apps/_test_openai_chat_json.py (3)
1-2
: Add NVIDIA copyright header (2025) at file top.Required by repository guidelines.
Apply this diff at the very top:
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. # Adapted from # https://github.com/vllm-project/vllm/blob/aae6927be06dedbda39c6b0c30f6aa3242b84388/tests/entrypoints/openai/test_chat.py
24-37
: Avoid fixed temp filename and drop unused request parameter.Same collision risk as the other test; also, request isn’t used.
Apply this diff:
-@pytest.fixture(scope="module") -def temp_extra_llm_api_options_file(request): - temp_dir = tempfile.gettempdir() - temp_file_path = os.path.join(temp_dir, "extra_llm_api_options.yaml") +@pytest.fixture(scope="module") +def temp_extra_llm_api_options_file(): + fd, temp_file_path = tempfile.mkstemp(prefix="extra_llm_api_options_", suffix=".yaml") + os.close(fd) try: extra_llm_api_options_dict = {"guided_decoding_backend": "xgrammar"} with open(temp_file_path, "w") as f: - yaml.dump(extra_llm_api_options_dict, f) + yaml.safe_dump(extra_llm_api_options_dict, f) yield temp_file_path finally: - if os.path.exists(temp_file_path): - os.remove(temp_file_path) + try: + os.remove(temp_file_path) + except FileNotFoundError: + pass
6-6
: Fix Python 3.8 typing compatibility (builtins generics).list[dict[...]] requires Python 3.9+. Tests target 3.8+, so use typing.List/Dict (or add future annotations and still avoid PEP 585).
Apply this diff:
-from typing import Any +from typing import Any, Dict, List @@ - def _create_and_validate_response( - messages: list[dict[str, Any]]) -> dict[str, Any]: + def _create_and_validate_response( + messages: List[Dict[str, Any]]) -> Dict[str, Any]:Also applies to: 79-81
tests/integration/defs/perf/pytorch_model_config.py (1)
30-31
: Replace PEP 585 generic type hints with `typing` generics for Python 3.8 compatibility. Python 3.8 does not support subscripted built-in generics (e.g. `list[str]`, `dict[str, Any]`, etc.); all of these will raise a `SyntaxError` when parsing. Our ripgrep scan uncovered hundreds of such occurrences across the repo, so this is a critical, cross-cutting compatibility issue that must be fixed before merging.

• In tests/integration/defs/perf/pytorch_model_config.py (lines 30–31):
  – Add `from typing import Optional, List` near the top of the file.
  – Change the signature to wrap `lora_dirs` in `Optional[List[str]]` instead of using `list[str]`.

• Apply the same pattern throughout the codebase:
  – Replace `list[...]` → `List[...]`, `dict[...]` → `Dict[...]`, `set[...]` → `Set[...]`, `tuple[...]` → `Tuple[...]`.
  – Wrap any default-`None` parameters in `Optional[...]`.
  – Ensure all required names (`List`, `Dict`, `Set`, `Tuple`, `Optional`, etc.) are imported from `typing`.

• To find all remaining builtin-generic annotations, run:
  #!/bin/bash
  # Find PEP 585 builtin generics incompatible with Python 3.8
  rg -n --type=py -e '\b(list|dict|set|tuple)\['
Example diff for the one function:
--- a/tests/integration/defs/perf/pytorch_model_config.py
+++ b/tests/integration/defs/perf/pytorch_model_config.py
@@
+from typing import Optional, List
 def get_model_yaml_config(model_label: str,
-                          lora_dirs: list[str] = None) -> dict:
+                          lora_dirs: Optional[List[str]] = None) -> dict:

examples/llm-api/llm_runtime.py (1)
1-1
: Add NVIDIA SPDX header to comply with repository licensing standards.All source files should carry the current-year NVIDIA copyright header.
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# -*- coding: utf-8 -*-

tests/integration/defs/accuracy/test_disaggregated_serving.py (3)
1-4
: Add NVIDIA copyright header (2025) per repo guideline.All source files must include the current-year NVIDIA header. Please prepend the standard header used elsewhere in this repo.
Apply this diff (adjust to your repo’s exact header if different):
+ # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. + # SPDX-License-Identifier: Apache-2.0
53-66
: Fix Python 3.8 incompatibilities: PEP 585 generics andcancel_futures
kwarg.
list[... ]
requires Python 3.9+, but our target is 3.8+. Usetyping.List
.shutdown(cancel_futures=...)
is 3.9+. Provide a 3.8 fallback.Apply this diff:
class MyThreadPoolExecutor(ThreadPoolExecutor): def __init__(self, *args, **kwargs) -> None: super().__init__(*args, **kwargs) - self.futures: list[concurrent.futures.Future[RequestOutput]] = [] + # Python 3.8-compatible typing (avoid PEP 585 generics) + from typing import List + self.futures: List[concurrent.futures.Future] = [] def __exit__(self, exc_type, exc_val, exc_tb): if exc_type is None: for future in self.futures: future.result() return super().__exit__(exc_type, exc_val, exc_tb) for future in self.futures: future.cancel() - self.shutdown(wait=True, cancel_futures=True) + # Python 3.8 fallback for shutdown(cancel_futures=...) + try: + self.shutdown(wait=True, cancel_futures=True) + except TypeError: + self.shutdown(wait=True) return False
207-217
: Avoid potential indefinite hang in health check; add timeout and usetime.monotonic()
.
requests.get()
without a timeout can block forever even inside the 1h loop. Also prefertime.monotonic()
for elapsed time.Apply this diff:
- start_time = time.time() - while time.time() - start_time < 3600: + start_time = time.monotonic() + while time.monotonic() - start_time < 3600: time.sleep(1) try: print("Checking health endpoint") - response = requests.get("http://localhost:8000/health") + response = requests.get("http://localhost:8000/health", timeout=5) if response.status_code == 200: break - except requests.exceptions.ConnectionError: + except (requests.exceptions.ConnectionError, requests.exceptions.Timeout): continueOptional: rate-limit the "Checking health endpoint" log (e.g., every 10s) to keep CI logs concise.
examples/llm-api/quickstart_multimodal.py (1)
1-4
: Prepend the NVIDIA 2025 copyright header.Per repository guidelines, every Python source must include the current-year NVIDIA header. This file is missing it.
Apply at file top:
+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+
 import argparse
 import json
 import os

.github/CODEOWNERS (1)
158-163
: Catch‑all CODEOWNERS rule overrides everything; will require release-branch-approval on all PRs to mainGitHub CODEOWNERS uses last-match-wins. The line with “* @NVIDIA/trt-llm-release-branch-approval” at the end overrides all specific entries above, effectively gating every change (including on main) behind that team. The adjacent comments say this rule should only be uncommented on release branches.
Recommend commenting it out here (main) to avoid blocking merges, and only enable it on release/* branches.
- * @NVIDIA/trt-llm-release-branch-approval +# * @NVIDIA/trt-llm-release-branch-approvalAlternative: if you truly want a fallback owner while preserving specific entries, place the
*
rule at the top so later, more-specific patterns take precedence. But per your comment, keeping it commented on main is safest.tensorrt_llm/bench/benchmark/throughput.py (1)
1-1
: Missing NVIDIA copyright header (2025).Per coding guidelines, prepend the current-year NVIDIA copyright header to all source files.
Apply this diff at the top of the file:
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
tensorrt_llm/_torch/models/modeling_llama.py (1)
1-1
: Missing NVIDIA copyright header (2025).Per coding guidelines, prepend the current-year NVIDIA copyright header to all source files.
Apply this diff at the top of the file:
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
1-1
: Missing NVIDIA copyright header (2025).Per coding guidelines, prepend the current-year NVIDIA copyright header to all source files.
Apply this diff at the top of the file:
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
tensorrt_llm/_torch/models/modeling_phi4mm.py (1)
45-81
: Avoid global monkey-patching of transformers.PreTrainedModel.Temporarily reassigning transformers.modeling_utils.PreTrainedModel is unsafe under concurrency. Prefer a local subclass wrapper or patch only the target class of the loaded model.
Option A (preferred): Load normally, then dynamically mix in GenerationMixin to the instance (no global patch):
- OldPreTrainedModel = transformers.modeling_utils.PreTrainedModel - transformers.modeling_utils.PreTrainedModel = NewPreTrainedModel - ref_phi4mm_model = transformers.AutoModelForCausalLM.from_pretrained( + ref_phi4mm_model = transformers.AutoModelForCausalLM.from_pretrained( model_path, trust_remote_code=True, # Flash_attn_2 only supports bf16 or fp16 and set in HF config. torch_dtype='auto', _attn_implementation='flash_attention_2', ).eval() - transformers.modeling_utils.PreTrainedModel = OldPreTrainedModel + # Ensure prepare_inputs_for_generation is available even if the loaded class didn't + if not isinstance(ref_phi4mm_model, transformers.generation.GenerationMixin): + ref_phi4mm_model.__class__ = type( + "PretrainedWithGen", + (ref_phi4mm_model.__class__, transformers.generation.GenerationMixin), + {}, + )This avoids touching global state.
I can push a follow-up patch applying Option A throughout this block if you’d like.
🧹 Nitpick comments (48)
examples/models/core/multimodal/README.md (5)
356-359
: Clarify support scope and fix capitalization ("PyTorch")The note should explicitly state that Gemma 3 is PyTorch LLM API–only, and use the correct capitalization.
-**NOTE: We only support Gemma3 VLMs in Pytorch workflow.** +**NOTE: Gemma 3 VLMs are currently supported only via the PyTorch LLM API backend; TensorRT engine build is not supported.**
360-363
: Tighten the attention-mask description and wordingMinor clarity/grammar improvements; also use consistent model naming.
-Gemma3VL decoder requires a custom attention mask while processing images. During the context phase: -- Text tokens attend to other tokens in a causal fashion (standard autoregressive behavior) -- Image tokens attend to other tokens in a causal fashion AND attend to other tokens from the same image in a bidirectional manner +The Gemma 3 VLM decoder requires a custom attention mask when processing images. During the context phase: +- Text tokens attend to prior tokens in a causal fashion (standard autoregressive behavior). +- Image tokens attend causally to all prior tokens and bidirectionally within the same image.
366-374
: Make backend requirement exclusive and add an incompatibility note
- Make it clear that FlashInfer is the only supported backend for this mask.
- Add a note that Gemma 3’s “must disable chunked prefill” requirement conflicts with Embedding Table Offloading (which relies on chunking).
-We support this custom mask with FlashInfer attention backend. +This custom mask is supported only with the FlashInfer attention backend. @@ -### Requirements +### Requirements @@ -To ensure expected behavior with Gemma3VL, the following configurations are **required**: -- **Attention Backend**: Use the FlashInfer attention backend -- **Chunked Prefill**: Must be disabled -- **KV Cache Reuse**: Must be disabled +To ensure expected behavior with Gemma 3 VLMs, the following configurations are **required**: +- **Attention Backend**: Use the FlashInfer attention backend (other backends are not supported) +- **Chunked Prefill**: Must be disabled. +- **KV Cache Reuse**: Must be disabled. + +Note: Embedding Table Offloading depends on context chunking and is therefore not compatible with Gemma 3 at this time. See "Enabling Embedding Table Offloading" below.
420-423
: Optional: enumerate exact HF IDs or a link for each supported variantListing concrete HF model IDs (e.g., google/gemma-3-4b-it, 12b-it, 27b-it) or linking to the canonical model page will reduce guesswork for users.
15-15
: TOC hint: call out PyTorch-only to set expectations early (optional)Consider signaling the PyTorch-only constraint directly in the TOC entry to prevent users from jumping to TRT build steps by habit.
-- [Gemma3](#gemma3)
+- [Gemma3 (PyTorch only)](#gemma3)

tests/unittest/llmapi/apps/_test_openai_chat_structural_tag.py (1)
44-48
: Args formatting differs from sibling test; optional unification.This test passes CLI flags with “=”, while _test_openai_chat_json.py passes “--flag value”. Both likely work; consider standardizing across tests for consistency.
tests/unittest/llmapi/apps/_test_openai_chat_json.py (2)
41-45
: Optional: apply the same resource-limiting flags as the structural_tag test.Helps avoid OOM on smaller GPUs (A10/A30) and keeps CI deterministic.
Apply this diff:
- args = ["--extra_llm_api_options", temp_extra_llm_api_options_file] + args = [ + "--max_batch_size=8", + "--max_seq_len=1024", + "--max_num_tokens=1024", + "--extra_llm_api_options", temp_extra_llm_api_options_file + ]
41-45
: Minor: CLI style consistency.This test uses ["--flag", value] while the structural_tag test uses "--flag=value". Consider standardizing for readability.
tests/integration/defs/perf/pytorch_model_config.py (2)
205-209
: Streamline kv_cache_config merge and avoid redundant reassignment.Use
setdefault
to ensure we always mutate the dict insidebase_config
and drop the extraupdate
. Functionally equivalent and a bit cleaner.- kv_cache_config = base_config.get('kv_cache_config', {}) - if 'kv_cache_dtype' in base_config: - kv_cache_dtype = base_config.pop('kv_cache_dtype', 'auto') - kv_cache_config['dtype'] = kv_cache_dtype - base_config.update({'kv_cache_config': kv_cache_config}) + kv_cache_config = base_config.setdefault('kv_cache_config', {}) + if 'kv_cache_dtype' in base_config: + kv_cache_config['dtype'] = base_config.pop('kv_cache_dtype', 'auto')
45-49
: Makekv_cache_dtype
extraction resilient to formatting.Current split-based parsing is brittle. A small regex avoids index errors and unexpected formats like trailing spaces.
- if 'kv_cache_dtype' in model_label: - base_config.update({ - 'kv_cache_dtype': - model_label.split('kv_cache_dtype:')[1].split('-')[0] - }) + if 'kv_cache_dtype:' in model_label: + m = re.search(r'kv_cache_dtype:([^-]+)', model_label) + if m: + base_config['kv_cache_dtype'] = m.group(1).strip()Add this import at the top of the file if not already present:
import reexamples/llm-api/llm_runtime.py (1)
8-8
: Prefer module-namespace imports for llmapi types; also remove redundantenable_block_reuse=True
.
- Importing the module keeps symbols namespaced and reduces potential name collisions.
enable_block_reuse
defaults to True in KvCacheConfig; passing it explicitly adds noise without effect.-from tensorrt_llm.llmapi import CudaGraphConfig, KvCacheConfig +from tensorrt_llm import llmapi @@ - cuda_graph_config = CudaGraphConfig( + cuda_graph_config = llmapi.CudaGraphConfig( @@ - kv_cache_config=KvCacheConfig(free_gpu_memory_fraction=0.5)) + kv_cache_config=llmapi.KvCacheConfig(free_gpu_memory_fraction=0.5)) @@ - kv_cache_config=KvCacheConfig( - free_gpu_memory_fraction=0.5, - enable_block_reuse=True)) + kv_cache_config=llmapi.KvCacheConfig( + free_gpu_memory_fraction=0.5))Also applies to: 22-26, 32-32, 57-60
tests/integration/defs/accuracy/test_disaggregated_serving.py (7)
34-36
: Avoid shadowing built-inid
; rename torequest_id
.Shadowing
id
is a minor readability smell and can confuse tooling.Apply this diff:
-class Result(GenerationResultBase): - - def __init__(self, id: int, sampling_params: SamplingParams, - outputs: List[CompletionOutput]): - super().__init__(id, sampling_params) +class Result(GenerationResultBase): + + def __init__(self, request_id: int, sampling_params: SamplingParams, + outputs: List[CompletionOutput]): + super().__init__(request_id, sampling_params) self._outputs = outputs self._streaming = False- result = Result(id=0, + result = Result(request_id=0, sampling_params=sampling_params, outputs=[ CompletionOutput(text=response.choices[0].text, index=0) ])Also applies to: 252-257
141-148
: Guard against GPU index overruns whenlaunch_disaggregated_llm
is used directly.When called outside
run_parallel_test
, this function can oversubscribe GPUs viaCUDA_VISIBLE_DEVICES
if gating is missing or misconfigured. Add a quick check vsget_device_count()
before launching.For example:
ctx_total_gpus = ctx_tp * ctx_pp gen_total_gpus = gen_tp * gen_pp + # Optional safety guard + total_needed = ctx_total_gpus * len(ctx_ports) + gen_total_gpus * len(gen_ports) + if total_needed > get_device_count(): + pytest.skip(f"Not enough devices for disaggregated serving: need {total_needed}, have {get_device_count()}")Also applies to: 165-169
135-137
: Nit: URL port parsing is brittle.
url.split(":")[1]
breaks for IPv6 or if a scheme/path sneaks in. Preferrsplit(":", 1)
orurllib.parse
. Low priority since tests uselocalhost:port
.Consider:
- ctx_ports = [int(url.split(":")[1]) for url in ctx_urls] - gen_ports = [int(url.split(":")[1]) for url in gen_urls] + ctx_ports = [int(url.rsplit(":", 1)[1]) for url in ctx_urls] + gen_ports = [int(url.rsplit(":", 1)[1]) for url in gen_urls]
221-247
: Streaming path not implemented; fail fast or implement.You pass
stream=streaming
to the OpenAI client, but the downstream handling assumes non-streaming. Ifstreaming=True
is ever used, this will likely break. Either implement the streaming consumer or fail fast.Apply a defensive guard:
def send_request(prompt: str, sampling_params: SamplingParams, streaming: bool): kwargs = {} + if streaming: + raise NotImplementedError("Streaming completions not yet supported in disaggregated tests.")If you want streaming support, I can sketch a compatible event-consumer for OpenAI-style completions.
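For instance, a compatible event-consumer could look like this (a sketch only; it assumes the same openai>=1.0 client and completions endpoint the test already drives, and is not wired into the test):

```python
# Sketch: consume a streamed /v1/completions response and stitch the text
# back together. Assumes an already-constructed openai>=1.0 client.
def consume_streaming_completion(client, model_name: str, prompt: str,
                                 max_tokens: int = 64) -> str:
    chunks = []
    stream = client.completions.create(model=model_name,
                                       prompt=prompt,
                                       max_tokens=max_tokens,
                                       stream=True)
    for event in stream:
        # Each event carries an incremental piece of text in choices[0].text.
        if event.choices and event.choices[0].text:
            chunks.append(event.choices[0].text)
    return "".join(chunks)
```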
682-687
: Remove duplicate@pytest.mark.skip_less_device(8)
decorator.
skip_less_device(8)
appears twice ontest_auto_dtype
, which is redundant.Apply this diff:
@pytest.mark.skip_less_device(8) @parametrize_with_ids("overlap_scheduler", [True, False]) @parametrize_with_ids("mtp_nextn", [0, pytest.param(2, marks=skip_pre_hopper)]) -@pytest.mark.skip_less_device(8) def test_auto_dtype(self, overlap_scheduler, mtp_nextn):
730-736
: Prefer a collection-time skip marker over runtimepytest.skip
for Gemma3 1B test.This test is always skipped; using
@pytest.mark.skip(reason=...)
is cleaner and avoids executing setup code.Apply this diff:
- @pytest.mark.skip_less_device(2) - @pytest.mark.parametrize("overlap_scheduler", [False, True]) - def test_auto_dtype(self, overlap_scheduler): - pytest.skip( - "Currently we require full kvcache for variable sliding window. " - "This test only transfers the kvcache inside the sliding window.") + @pytest.mark.skip_less_device(2) + @pytest.mark.parametrize("overlap_scheduler", [False, True]) + @pytest.mark.skip(reason="Requires full KV cache for variable sliding window; current test only transfers KV within sliding window.") + def test_auto_dtype(self, overlap_scheduler):
2-4
: Nit: Fix small typo in file header comment.“I need to to this” → “I need to do this”.
-# I need to to this by creating a new class that mimics LLM class. Instead of implementing the
+# I need to do this by creating a new class that mimics the LLM class. Instead of implementing the

examples/llm-api/quickstart_multimodal.py (5)
125-134
: CLI: Good addition; add lightweight validation for conversation_turns.Flags look fine. Consider rejecting non-positive values early to avoid silent no-op runs.
parser.add_argument( "--conversation_turns", type=int, default=2, - help="Number of conversation turns for automated testing.") + help="Number of conversation turns for automated testing (must be >= 1).") +parser.set_defaults()Or validate after parsing:
args = parser.parse_args() +if args.conversation_turns < 1: + raise ValueError("--conversation_turns must be >= 1")
230-238
: LoRA: remove redundant model_class None check; ensure model_dir type consistency.model_class is guaranteed to be set earlier when args.load_lora is True. The None-check is dead code and adds confusion. Also prefer consistent str() for model_dir.
- lora_request = None - if args.load_lora: - if model_class is None: - raise ValueError( - "model_class must be provided when load_lora is True" - ) - lora_request = model_class.lora_request( - len(inputs), args.modality, llm._hf_model_dir) + lora_request = None + if args.load_lora: + lora_request = model_class.lora_request( + len(inputs), args.modality, str(llm._hf_model_dir))
240-246
: Noise reduction: disable tqdm for per-turn single-request generates.A progress bar per turn is noisy and slower. Pass use_tqdm=False for this inner loop.
- outputs = llm.generate(inputs, - sampling_params, - lora_request=lora_request) + outputs = llm.generate( + inputs, + sampling_params, + use_tqdm=False, + lora_request=lora_request)
256-260
: Conversation formatting is not role-aware; consider structured chat turns.Concatenating text and responses into a single user prompt bypasses chat templates and roles, which can degrade quality or break models relying on role markers. Prefer building a conversation list of messages (user/assistant) and applying the chat template over that structure, or introduce a dedicated multiturn input loader that accepts prior turns.
If you want, I can sketch a minimal helper that:
- accumulates List[ConversationMessage] with roles,
- applies apply_chat_template(model_type, tokenizer, processor, conversation=list, add_generation_prompt=True),
- reuses the same multi_modal_data across turns.
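A minimal sketch of such a helper (assuming an HF-style tokenizer whose apply_chat_template accepts a list of role/content dicts; multimodal data handling stays with the existing loader):

```python
# Sketch of a role-aware conversation builder. Assumes an HF-style tokenizer
# with apply_chat_template; multi_modal_data is kept separate and reused as-is.
from typing import Dict, List


class Conversation:

    def __init__(self):
        self.messages: List[Dict[str, str]] = []

    def add_user(self, text: str) -> None:
        self.messages.append({"role": "user", "content": text})

    def add_assistant(self, text: str) -> None:
        self.messages.append({"role": "assistant", "content": text})

    def to_prompt(self, tokenizer) -> str:
        # Render the structured turns through the model's chat template
        # instead of concatenating raw strings into a single user prompt.
        return tokenizer.apply_chat_template(self.messages,
                                             tokenize=False,
                                             add_generation_prompt=True)
```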
267-271
: Return the captured generated_outputs for programmatic callers.You already collect generated_outputs “for return” but discard it. Returning it enables tests to assert per-turn content without parsing stdout.
- for i, output in enumerate(generated_outputs): + for i, output in enumerate(generated_outputs): print( f"[{i}] Prompt: {output['user_input']!r}, Generated text: {output['assistant_response']!r}" ) - return + return generated_outputstests/unittest/_torch/speculative/test_eagle3.py (1)
16-16
: Unconditional skip is fine for triage; add a reminder to revert.Recommend adding a short TODO with the nvbugs ID to make it obvious this should be re-enabled once fixed, or convert to a targeted skipif (e.g., by device/arch) if the failure scope is known.
tensorrt_llm/_torch/models/checkpoints/hf/weight_loader.py (3)
124-127
: Thread pool switch is appropriate for I/O-bound prefetch; simplify cpu count and make workers tunable.Reading files is I/O-bound and releases the GIL; threads are a good choice. Minor cleanups:
- Drop the
multiprocessing
import in favor ofos.cpu_count()
.- Make
max_workers
configurable via env to adapt to different storage backends without code changes.Apply:
- max_workers = min(multiprocessing.cpu_count() * 2, 16, - len(local_file_names)) - with ThreadPoolExecutor(max_workers=max_workers) as executor: + # Default heuristic: 2x CPUs capped at 16; allow overrides for fast/slow storage. + default_workers = min(((os.cpu_count() or 1) * 2), 16, len(local_file_names)) + max_workers = int(os.environ.get("TLLM_PREFETCH_MAX_WORKERS", default_workers)) + with ThreadPoolExecutor(max_workers=max_workers) as executor: list(executor.map(self._prefetch_one_file, local_file_names))And remove the unused
multiprocessing
import (see separate diff on Line 1).
106-112
: Harden prefetch against transient filesystem errors.Between
exists()
andopen()
, files can be moved/removed. Swallowing I/O errors here avoids failing an optional optimization step.Apply:
def _prefetch_one_file(self, file_name): - if os.path.exists(file_name): - logger.info(f"Prefetching {file_name} to memory...") - with open(file_name, 'rb') as f: - f.read() - logger.info(f"Finished prefetching {file_name}.") + if os.path.exists(file_name): + logger.info(f"Prefetching {file_name} to memory...") + try: + with open(file_name, 'rb') as f: + # Read once to populate page cache; we discard the bytes object. + _ = f.read() + except OSError as e: + logger.warning(f"Prefetch failed for {file_name}: {e}") + return + logger.info(f"Finished prefetching {file_name}.")
33-45
: Consider per-rank prefetch gating.
enable_prefetch
is decided using the total size of all weight files, but each rank only prefetches a subset. Using per-rank size could avoid unnecessarily disabling prefetch on multi-rank jobs with large global checkpoints.If feasible, compute the prefetch size for
local_file_names
(the per-rank slice) before deciding to prefetch, or move the memory check intoprefetch_files()
using that subset. No need to block the PR if this is intended; just a potential perf improvement.docker/Dockerfile.multi (1)
180-182
: Ensure helper script is executable inside the image.If
get_triton_tag.sh
is invoked by the build scripts, make sure it has exec perms (Docker preserves mode, but this avoids host-FS surprises).Apply:
COPY ./jenkins/scripts/get_triton_tag.sh ./jenkins/scripts/get_triton_tag.sh COPY ./docker/Dockerfile.multi ./docker/Dockerfile.multi +RUN chmod +x ./jenkins/scripts/get_triton_tag.sh RUN bash ./triton_backend/inflight_batcher_llm/scripts/build.sh
tensorrt_llm/_torch/attention_backend/trtllm.py (1)
768-776
: Make the assertion scalar and reuse the computed max; current message prints a Tensor reprAs written, the f-string embeds a 0-dim Tensor and the assert compares a Tensor to an int. While this often works, it’s cleaner and more readable to convert to Python scalars and compute once.
Apply this diff:
- error_message = ( - f"The max KV cache length of input sequences ({self.kv_lens[:self.num_seqs].max()}) " - f"exceeds the KV cache manager's maximum supported length " - f"({self.kv_cache_manager.max_seq_len}).") - - assert self.kv_lens[:self.num_seqs].max( - ) <= self.kv_cache_manager.max_seq_len, error_message + max_kv_cache_len = int(self.kv_lens[:self.num_seqs].max().item()) + error_message = ( + f"The max KV cache length of input sequences ({max_kv_cache_len}) " + f"exceeds the KV cache manager's maximum supported length " + f"({self.kv_cache_manager.max_seq_len})." + ) + assert max_kv_cache_len <= self.kv_cache_manager.max_seq_len, error_messagetests/integration/test_lists/waives.txt (3)
243-244
: Duplicate SKIP for the same test with different bug IDs
llmapi/test_llm_examples.py::test_llmapi_speculative_decoding_mtp
appears twice (Lines 243 and 304) with different NVBugs links. This is confusing and increases maintenance debt.Pick one bug reference (prefer the newer one if it supersedes the old) and remove the duplicate entry. Example:
-llmapi/test_llm_examples.py::test_llmapi_speculative_decoding_mtp SKIP (https://nvbugs/5410399) llmapi/test_llm_examples.py::test_llmapi_speculative_decoding_mtp SKIP (https://nvbugs/5461796)
Also applies to: 304-304
266-267
: Duplicate SKIP for llava v1.6 mistral multimodal quickstartThe two entries target the same test variant. Keep one and, if helpful, include both bug links in a single comment.
For example:
-examples/test_multimodal.py::test_llm_multimodal_general[llava-v1.6-mistral-7b-llava-v1.6-mistral-7b-hf-image-False] SKIP (https://nvbugs/5444095) +examples/test_multimodal.py::test_llm_multimodal_general[llava-v1.6-mistral-7b-llava-v1.6-mistral-7b-hf-image-False] SKIP (https://nvbugs/5444060, https://nvbugs/5444095)And remove the other duplicate line.
Also applies to: 322-323
1-331
: Add CI duplicate detection for waives.txtThe provided script confirms duplicate entries in
tests/integration/test_lists/waives.txt
. To keep the list tidy and prevent regressions, please:
- Remove the redundant waives at the following locations:
llmapi/test_llm_examples.py::test_llmapi_speculative_decoding_mtp
(lines 243 & 304)test_e2e.py::test_ptp_quickstart_multimodal[llava-v1.6-mistral-7b-llava-v1.6-mistral-7b-hf-image-False]
(lines 266 & 322)- Integrate the duplicate-detection script into your CI or pre-commit hook using the provided Bash snippet. This will automatically flag any repeated SKIP entries before they land in the repo.
Let me know if you’d like help wiring this into your CI configuration.
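Since the referenced Bash snippet is not reproduced here, a Python equivalent is sketched below (it assumes each waive line has the form `<test-id> SKIP (<reason>)`):

```python
# Sketch: flag duplicate test IDs in waives.txt so CI/pre-commit can fail early.
import sys
from collections import Counter


def find_duplicate_waives(path: str = "tests/integration/test_lists/waives.txt"):
    with open(path) as f:
        test_ids = [
            line.split()[0] for line in f
            if line.strip() and not line.startswith("#")
        ]
    return {tid: n for tid, n in Counter(test_ids).items() if n > 1}


if __name__ == "__main__":
    dupes = find_duplicate_waives()
    for tid, count in sorted(dupes.items()):
        print(f"duplicate waive ({count}x): {tid}")
    sys.exit(1 if dupes else 0)
```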
examples/llm-api/quickstart_example.py (1)
1-1
: Add NVIDIA copyright headerPer repo guidelines, Python sources should carry the NVIDIA copyright header.
Add at the top of the file (before imports):
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 # Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and limitations under the License.tensorrt_llm/_torch/models/modeling_llama.py (3)
186-188
: Safeguard FP4 tuple -> Fp4QuantizedTensor conversion.The tuple-to-container conversion is fine, but please confirm Fp4QuantizedTensor supports positional construction. If it’s not a @DataClass (or signature differs), this will throw. If uncertain, use a defensive init.
- if isinstance(attn_output, tuple): - attn_output = Fp4QuantizedTensor(attn_output[0], attn_output[1]) + if isinstance(attn_output, tuple): + # attn_output: (fp4_tensor, scaling_factor) + attn_output = Fp4QuantizedTensor(attn_output[0], attn_output[1]) + # Optionally, add a runtime check during bring-up: + # assert isinstance(attn_output, (torch.Tensor, Fp4QuantizedTensor))Would you like me to scan the repo to verify the Fp4QuantizedTensor constructor signature?
556-567
: Prefer explicit “no fusion” over None for fusion_op.AllReduceParams typically expects an enum; passing None relies on downstream permissiveness. Use AllReduceFusionOp.NONE to be explicit and avoid type pitfalls.
- hidden_states, residual = self.all_reduce( + hidden_states, residual = self.all_reduce( hidden_states, all_reduce_params=AllReduceParams( - fusion_op=None, + fusion_op=AllReduceFusionOp.NONE, residual=residual, ))
768-799
: Align post-MLP fusion path with explicit “no fusion” enum.Same concern as above; prefer AllReduceFusionOp.NONE over None.
- hidden_states, residual = self.all_reduce( + hidden_states, residual = self.all_reduce( hidden_states, all_reduce_params=AllReduceParams( - fusion_op=None, + fusion_op=AllReduceFusionOp.NONE, residual=residual, ))tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
740-743
: Line length > 120 chars (Ruff E501).Minor style nit. Consider breaking the long logger.info f-strings to satisfy E501.
Example tweak:
- logger.info(f"Run warmup for batch size={bs}, pure {'context' if num_tokens_per_request > 1 else 'generation'} phase") + phase = 'context' if num_tokens_per_request > 1 else 'generation' + logger.info(f"Run warmup for batch size={bs}, pure {phase} phase")- logger.info(f"Creating CUDA graph instances for {len(self._cuda_graph_batch_sizes)} batch sizes.") + num_sizes = len(self._cuda_graph_batch_sizes) + logger.info(f"Creating CUDA graph instances for {num_sizes} batch sizes.")Also applies to: 771-773
cpp/tensorrt_llm/thop/moeOp.cpp (2)
392-394
: Callsite update to stream-aware getWorkspaceInfo looks correct; ensure header for isCapturing is visible.The switch to binding a const-ref and passing the CUDA stream is good. However, since getWorkspaceInfo now calls tensorrt_llm::common::isCapturing(stream), make sure this TU includes the declaration to avoid ODR/implicit-decl issues on some compilers.
Apply this diff near the other includes to make the dependency explicit:
#include "tensorrt_llm/common/workspace.h" +#include "tensorrt_llm/common/cudaUtils.h" #include "tensorrt_llm/kernels/cutlass_kernels/fp8_blockscale_gemm/fp8_blockscale_gemm.h"
705-706
: Member naming: use m-prefix for member variables.Coding guidelines require mPrefix for members. workspace_info should be renamed to mWorkspaceInfo. Update all references (call sites returning the const-ref and internal assignments).
Apply these diffs:
@@ - WorkspaceInfo workspace_info; + WorkspaceInfo mWorkspaceInfo;@@ - WorkspaceInfo const& workspace_info = getWorkspaceInfo(num_rows, hidden_size, inter_size, num_experts_total, + WorkspaceInfo const& workspace_info = getWorkspaceInfo(num_rows, hidden_size, inter_size, num_experts_total, static_cast<int>(experts_per_token), base_activation_type, parallelism_config, min_latency_mode, stream);@@ - WorkspaceInfo const& workspace_info = getWorkspaceInfo(num_rows, hidden_size, inter_size, num_experts_total, + WorkspaceInfo const& workspace_info = getWorkspaceInfo(num_rows, hidden_size, inter_size, num_experts_total, static_cast<int>(experts_per_token), base_activation_type, parallelism_config, min_latency_mode, stream);@@ - return workspace_info; + return mWorkspaceInfo;@@ - if (is_capturing || workspace_info.workspace.numel() < total_workspace_size) + if (is_capturing || mWorkspaceInfo.workspace.numel() < static_cast<int64_t>(total_workspace_size)) { if (is_capturing) { TLLM_LOG_DEBUG( - "Allocating MoE workspace with %ld bytes size during cuda graph capture", total_workspace_size); + "Allocating MoE workspace with %zu bytes size during cuda graph capture", total_workspace_size); } else { - TLLM_LOG_DEBUG("MoE workspace size is not enough, increase the size from %ld bytes to %ld bytes", - workspace_info.workspace.numel(), total_workspace_size); + TLLM_LOG_DEBUG("MoE workspace size is not enough, increase the size from %lld bytes to %zu bytes", + static_cast<long long>(mWorkspaceInfo.workspace.numel()), total_workspace_size); } - workspace_info.workspace = torch::empty({static_cast<long>(total_workspace_size)}, + mWorkspaceInfo.workspace = torch::empty({static_cast<int64_t>(total_workspace_size)}, torch::dtype(torch::kInt8).device(torch::kCUDA).requires_grad(false)); } - workspace_info.src_to_dest_map - = common::nextWorkspacePtr(static_cast<int8_t*>(workspace_info.workspace.data_ptr()), moe_workspace_size); + mWorkspaceInfo.src_to_dest_map + = common::nextWorkspacePtr(static_cast<int8_t*>(mWorkspaceInfo.workspace.data_ptr()), moe_workspace_size);Also applies to: 392-394, 411-426, 550-552, 784-788
tests/integration/defs/test_e2e.py (1)
626-685
: Throughput increase assertion might be too strict for all environments.The 1.3x step-up per concurrency level could be brittle across devices/MIG partitions. Consider relaxing to >1.1x or asserting monotonic increase only to reduce false negatives in CI variability.
tensorrt_llm/_torch/models/modeling_mistral.py (1)
456-461
: Type hint tuple[...] is Python 3.9+; repo targets 3.8+.Per guidelines, Python must target 3.8+. Use typing.Tuple for forward/backward compatibility.
Apply this diff (typing is already imported at top):
- ) -> tuple[torch.Tensor, torch.Tensor]: + ) -> Tuple[torch.Tensor, torch.Tensor]:tensorrt_llm/_torch/models/modeling_phi4mm.py (2)
26-36
: PreTrainedModel shim and special tokens: LGTM with a caveat.Special tokens centralized is good. The temporary NewPreTrainedModel solves HF 4.53.1 interface needs, but see below for a safer pattern to avoid global monkey-patching.
86-88
: mm_processor_kwargs is unused; either wire it through or drop it.Ruff flagged F841. If the processor supports extra kwargs, forward them; otherwise, remove the variable.
Apply one of the following:
- Wire through:
- text_prompt, mm_data, mm_processor_kwargs = inputs.get("prompt"), \ - inputs.get("multi_modal_data", {}), inputs.get("mm_processor_kwargs", {}) + text_prompt = inputs.get("prompt") + mm_data = inputs.get("multi_modal_data", {}) + mm_processor_kwargs = inputs.get("mm_processor_kwargs", {}) @@ - inputs = self.processor(text=[text_prompt], - images=images, - audios=audios, - return_tensors='pt').to(self.device) + inputs = self.processor( + text=[text_prompt], + images=images, + audios=audios, + return_tensors='pt', + **mm_processor_kwargs, + ).to(self.device)
- Or drop the unused local:
- text_prompt, mm_data, mm_processor_kwargs = inputs.get("prompt"), \ - inputs.get("multi_modal_data", {}), inputs.get("mm_processor_kwargs", {}) + text_prompt = inputs.get("prompt") + mm_data = inputs.get("multi_modal_data", {})Also applies to: 100-104
examples/llm-api/_tensorrt_engine/quickstart_example.py (2)
6-9
: Optional: Make “engine” intent explicit (if API supports it).Given this lives under _tensorrt_engine/, consider showing an explicit engine load so readers don’t assume PyTorch build. If LLM supports passing an engine directory or a from_engine() constructor, update the snippet to highlight that path. If not, disregard.
I can propose a concrete update once you confirm the preferred public API (e.g., LLM(engine_dir=...), LLM.from_engine(...), or runtime="trt").
25-30
: Wrap or trim exceedingly long comment lines (Ruff E501).The sample output comments exceed the 120-char limit flagged by Ruff. Breaking them improves readability and keeps linters quiet.
Apply this diff to wrap the example comments:
- # Got output like - # Prompt: 'Hello, my name is', Generated text: '\n\nJane Smith. I am a student pursuing my degree in Computer Science at [university]. I enjoy learning new things, especially technology and programming' - # Prompt: 'The president of the United States is', Generated text: 'likely to nominate a new Supreme Court justice to fill the seat vacated by the death of Antonin Scalia. The Senate should vote to confirm the' - # Prompt: 'The capital of France is', Generated text: 'Paris.' - # Prompt: 'The future of AI is', Generated text: 'an exciting time for us. We are constantly researching, developing, and improving our platform to create the most advanced and efficient model available. We are' + # Example outputs: + # Prompt: 'Hello, my name is' + # Generated text: 'Jane Smith. I am a student pursuing my degree in + # Computer Science at [university]...' + # Prompt: 'The capital of France is' + # Generated text: 'Paris.' + # Prompt: 'The future of AI is' + # Generated text: 'an exciting time for us...'tests/integration/defs/accuracy/test_llm_api_pytorch.py (3)
444-456
: Add an explicit FP8 assertion and clarify CUDA graph capture size.Minor but helpful:
- Assert the quant algo is FP8 to fail fast if a wrong artifact is routed.
- Since you’re using
max_batch_size=1
for CUDA graphs butLLM(..., max_batch_size=16)
, consider explicitly pinning batch sizes to[1]
for clarity.Apply:
@@ - def test_fp8_eagle3_tp8(self, eagle3_one_model): + def test_fp8_eagle3_tp8(self, eagle3_one_model): model_path = f"{llm_models_root()}/modelopt-hf-model-hub/Llama-3.3-70B-Instruct-fp8" @@ - pytorch_config = dict( - disable_overlap_scheduler=True, - cuda_graph_config=CudaGraphConfig(max_batch_size=1)) + pytorch_config = dict( + disable_overlap_scheduler=True, + cuda_graph_config=CudaGraphConfig(batch_sizes=[1], max_batch_size=1)) @@ - with LLM(model_path, + with LLM(model_path, max_batch_size=16, tensor_parallel_size=8, speculative_config=spec_config, kv_cache_config=kv_cache_config, **pytorch_config) as llm: + assert llm.args.quant_config.quant_algo == QuantAlgo.FP8 task = CnnDailymail(self.MODEL_NAME) task.evaluate(llm)
784-794
: Auto-dtype for MistralSmall24B: memory guard present; consider documenting why 0.75 is chosen.The 80GB skip and
free_gpu_memory_fraction=0.75
look sane for 24B. Optional: add a brief comment on why 0.75 is needed (empirical headroom vs warmup/OOM).
795-807
: MistralSmall24B FP8: assert kv-cache quant if expected; otherwise add a note.You assert weights/activations FP8. If the prequantized artifact also sets KV to FP8, add an assertion for
kv_cache_quant_algo == QuantAlgo.FP8
to catch regressions. If KV remains bf16, add a one-liner comment to avoid confusion.Proposed tweak if KV is FP8:
@@ - with LLM(model_path, kv_cache_config=kv_cache_config) as llm: - assert llm.args.quant_config.quant_algo == QuantAlgo.FP8 + with LLM(model_path, kv_cache_config=kv_cache_config) as llm: + assert llm.args.quant_config.quant_algo == QuantAlgo.FP8 + # If artifact enables KV FP8 as well, keep this; else drop it. + assert llm.args.quant_config.kv_cache_quant_algo == QuantAlgo.FP8If KV is intentionally not FP8, add:
@@ - assert llm.args.quant_config.quant_algo == QuantAlgo.FP8 + assert llm.args.quant_config.quant_algo == QuantAlgo.FP8 + # Note: KV cache remains bf16 for this artifact.
int64_t const total_workspace_size = common::calculateTotalWorkspaceSize(workspaces.data(), workspaces.size());

WorkspaceInfo info{};
info.workspace = torch::empty({static_cast<long>(total_workspace_size)},
    torch::dtype(torch::kInt8).device(torch::kCUDA).requires_grad(false));
info.src_to_dest_map
    = common::nextWorkspacePtr(static_cast<int8_t*>(info.workspace.data_ptr()), moe_workspace_size);
bool is_capturing = tensorrt_llm::common::isCapturing(stream);
// Always allocate workspace when capturing cuda graph to avoid illegal memory access during replay
if (is_capturing || workspace_info.workspace.numel() < total_workspace_size)
{
    if (is_capturing)
    {
        TLLM_LOG_DEBUG(
            "Allocating MoE workspace with %ld bytes size during cuda graph capture", total_workspace_size);
    }
    else
    {
        TLLM_LOG_DEBUG("MoE workspace size is not enough, increase the size from %ld bytes to %ld bytes",
            workspace_info.workspace.numel(), total_workspace_size);
    }
    workspace_info.workspace = torch::empty({static_cast<long>(total_workspace_size)},
        torch::dtype(torch::kInt8).device(torch::kCUDA).requires_grad(false));
}
🛠️ Refactor suggestion
Use size_t for byte sizes and correct printf-specifiers; avoid narrowing to long.
- calculateTotalWorkspaceSize returns size_t; keep it size_t.
- Prefer %zu for size_t in logs, and cast int64 prints explicitly to long long (%lld).
- Avoid static_cast<long>(...) for tensor shapes; Torch expects an int64_t dimension.
Apply this diff:
- int64_t const total_workspace_size = common::calculateTotalWorkspaceSize(workspaces.data(), workspaces.size());
+ size_t const total_workspace_size = common::calculateTotalWorkspaceSize(workspaces.data(), workspaces.size());
@@
- TLLM_LOG_DEBUG(
- "Allocating MoE workspace with %ld bytes size during cuda graph capture", total_workspace_size);
+ TLLM_LOG_DEBUG(
+ "Allocating MoE workspace with %zu bytes size during cuda graph capture", total_workspace_size);
@@
- TLLM_LOG_DEBUG("MoE workspace size is not enough, increase the size from %ld bytes to %ld bytes",
- workspace_info.workspace.numel(), total_workspace_size);
+ TLLM_LOG_DEBUG("MoE workspace size is not enough, increase the size from %lld bytes to %zu bytes",
+ static_cast<long long>(mWorkspaceInfo.workspace.numel()), total_workspace_size);
@@
- workspace_info.workspace = torch::empty({static_cast<long>(total_workspace_size)},
+ mWorkspaceInfo.workspace = torch::empty({static_cast<int64_t>(total_workspace_size)},
torch::dtype(torch::kInt8).device(torch::kCUDA).requires_grad(false));
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/thop/moeOp.cpp around lines 765-783, change
total_workspace_size to size_t (it’s returned as size_t), update log format
specifiers to use %zu for size_t values, and when printing int64 values like
workspace_info.workspace.numel() cast explicitly to long long and use %lld;
finally avoid static_cast<long>(...) when building the torch::empty size — use
an int64_t cast (e.g., static_cast<int64_t>(total_workspace_size)) so the tensor
dimension type matches Torch’s expected int64_t.
```{literalinclude} ../../examples/llm-api/_tensorrt_engine/quickstart_example.py
:language: python
:linenos:
```
💡 Verification agent
🧩 Analysis chain
Fix literalinclude relative path (currently off by one level).
The code block includes the script via "../../examples/…", but this document lives under docs/source/legacy. To reach the repo-level examples directory, you need to go up three levels (not two). The link below (Line 9) already uses "../../../…", which further indicates the mismatch.
Apply this diff:
-```{literalinclude} ../../examples/llm-api/_tensorrt_engine/quickstart_example.py
+```{literalinclude} ../../../examples/llm-api/_tensorrt_engine/quickstart_example.py
:language: python
:linenos:
Build docs locally to catch include errors:
---
🏁 Script executed:
```shell
#!/bin/bash
# Strict docs build to fail on warnings
python -m pip install -r docs/requirements.txt
sphinx-build -nW docs/source docs/_build/html
```

Length of output: 11225
Let’s bypass the PATH issue and invoke Sphinx as a module:
🏁 Script executed:
#!/bin/bash
# Strict docs build to fail on warnings (invoke via module to avoid PATH problems)
python -m sphinx -nW docs/source docs/_build/html
Length of output: 2359
🏁 Script executed:
#!/bin/bash
# Verify existence of example file via both relative paths from the docs file
BASE="docs/source/legacy"
INCORRECT="../../examples/llm-api/_tensorrt_engine/quickstart_example.py"
CORRECT="../../../examples/llm-api/_tensorrt_engine/quickstart_example.py"
echo "Checking incorrect path (../../examples/...):"
if [[ -f "${BASE}/${INCORRECT}" ]]; then
echo "FOUND: ${BASE}/${INCORRECT}"
else
echo "MISSING: ${BASE}/${INCORRECT}"
fi
echo "Checking corrected path (../../../examples/...):"
if [[ -f "${BASE}/${CORRECT}" ]]; then
echo "FOUND: ${BASE}/${CORRECT}"
else
echo "MISSING: ${BASE}/${CORRECT}"
fi
Length of output: 949
Fix literalinclude path in docs/source/legacy/tensorrt_quickstart.md
The current {literalinclude} goes up only two levels and fails to locate the example; it must go up three levels to reach the repository's examples directory (verified via shell check).
Apply this patch:
--- docs/source/legacy/tensorrt_quickstart.md
@@ lines 4-7
-```{literalinclude} ../../examples/llm-api/_tensorrt_engine/quickstart_example.py
+```{literalinclude} ../../../examples/llm-api/_tensorrt_engine/quickstart_example.py
:language: python
:linenos:
🤖 Prompt for AI Agents
In docs/source/legacy/tensorrt_quickstart.md around lines 4 to 7 the
{literalinclude} path incorrectly goes up two directories and cannot find the
example; change the include to go up three directories so it points to
../../../examples/llm-api/_tensorrt_engine/quickstart_example.py instead of
../../examples/llm-api/_tensorrt_engine/quickstart_example.py, keeping the
:language: python and :linenos: options intact.
@@ -0,0 +1,33 @@
from tensorrt_llm import LLM, SamplingParams |
🛠️ Refactor suggestion
Add NVIDIA SPDX header per repo guidelines.
Examples are source files and should carry the NVIDIA SPDX header.
Apply this diff to prepend the header:
+# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# SPDX-License-Identifier: Apache-2.0
+
from tensorrt_llm import LLM, SamplingParams
🤖 Prompt for AI Agents
In examples/llm-api/_tensorrt_engine/quickstart_example.py around lines 1 to 1,
the file is missing the required NVIDIA SPDX header; prepend the repository's
standard NVIDIA SPDX header block (e.g., SPDX-License-Identifier and
copyright/ownership lines as specified by project guidelines) at the very top of
the file before any imports so the example carries the required license
metadata.
from tensorrt_llm import BuildConfig, SamplingParams
from tensorrt_llm._tensorrt_engine import LLM  # NOTE the change
🛠️ Refactor suggestion
Use public API import for LLM; avoid private module path
End-user examples should import LLM from the public package namespace. Importing from _tensorrt_engine
is an internal detail and may change without notice.
Apply this diff:
-from tensorrt_llm import BuildConfig, SamplingParams
-from tensorrt_llm._tensorrt_engine import LLM # NOTE the change
+from tensorrt_llm import BuildConfig, SamplingParams, LLM
🤖 Prompt for AI Agents
In examples/llm-api/quickstart_example.py around lines 1 to 2, the code imports
LLM from a private module path (tensorrt_llm._tensorrt_engine) instead of the
public package API; change the import to pull LLM from the public package (e.g.,
from tensorrt_llm import LLM alongside BuildConfig and SamplingParams) so
examples rely only on the stable public surface area and avoid referencing
internal modules.
inputs = default_multimodal_input_loader(
    tokenizer=llm.tokenizer,
    model_dir=llm._hf_model_dir,
    model_type=model_type,
    modality=args.modality,
    prompts=[cur_prompt],
    media=args.media,
    image_data_format="pt",
    num_frames=8,
    device="cpu")
Fix: nested-media modalities will assert in loader; also honor CLI image_format/num_frames/device.
default_multimodal_input_loader asserts when len(prompts)=1 and media is a list-of-lists (e.g., modality="image_audio"). Your current call passes prompts=[cur_prompt] and media=args.media unchanged, which will trip the assert for nested media. Additionally, the code ignores user CLI values and hardcodes "pt"/8/"cpu".
Apply this refactor to (1) select a single sample for nested-media modalities, (2) pass user-specified format/frames/device, and (3) keep model_dir type consistent:
- inputs = default_multimodal_input_loader(
- tokenizer=llm.tokenizer,
- model_dir=llm._hf_model_dir,
- model_type=model_type,
- modality=args.modality,
- prompts=[cur_prompt],
- media=args.media,
- image_data_format="pt",
- num_frames=8,
- device="cpu")
+ # For nested-media (e.g., image_audio = [ [img,aud], [img,aud], ... ]),
+ # pick one sample to pair with a single-turn prompt. For flat media
+ # (image/video/audio), 1 prompt + N media is supported by the loader.
+ media_for_turn = args.media
+ if isinstance(media_for_turn, list) and media_for_turn and isinstance(media_for_turn[0], list):
+ media_for_turn = [media_for_turn[0]]
+
+ inputs = default_multimodal_input_loader(
+ tokenizer=llm.tokenizer,
+ model_dir=str(llm._hf_model_dir),
+ model_type=model_type,
+ modality=args.modality,
+ prompts=[cur_prompt],
+ media=media_for_turn,
+ image_data_format=image_format,
+ num_frames=args.num_frames,
+ device=args.device)
Follow-up: If you want to reuse the same nested media across turns, consider extracting the first sample once outside the loop and reusing it to avoid repeated conditionals.
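For illustration, a minimal sketch of the two media layouts the loader distinguishes (file names are hypothetical; is_nested mirrors the check in the refactor above):

# Flat media (e.g., modality="image"): one prompt can pair with several files.
flat_media = ["img0.png", "img1.png"]

# Nested media (e.g., modality="image_audio"): each inner list is one sample,
# so a single-turn prompt should be paired with exactly one inner list.
nested_media = [["img0.png", "clip0.wav"], ["img1.png", "clip1.wav"]]

def is_nested(media):
    return isinstance(media, list) and bool(media) and isinstance(media[0], list)

assert not is_nested(flat_media)
assert is_nested(nested_media)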
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
# For nested-media (e.g., image_audio = [[img, aud], [img, aud], …]),
# pick one sample to pair with a single-turn prompt. For flat media
# (image/video/audio), 1 prompt + N media is supported by the loader.
media_for_turn = args.media
if isinstance(media_for_turn, list) and media_for_turn and isinstance(media_for_turn[0], list):
    media_for_turn = [media_for_turn[0]]

inputs = default_multimodal_input_loader(
    tokenizer=llm.tokenizer,
    model_dir=str(llm._hf_model_dir),
    model_type=model_type,
    modality=args.modality,
    prompts=[cur_prompt],
    media=media_for_turn,
    image_data_format=image_format,
    num_frames=args.num_frames,
    device=args.device)
🤖 Prompt for AI Agents
In examples/llm-api/quickstart_multimodal.py around lines 219 to 229, the call
to default_multimodal_input_loader will assert for nested-media modalities when
prompts=[cur_prompt] because media may be a list-of-lists; also
image_data_format/num_frames/device are hardcoded and model_dir type must remain
consistent. Fix by: if media is nested (e.g., a list and its first element is a
list) and you are sending a single prompt, select the corresponding single-media
sample (media = media[0]) before calling the loader; replace hardcoded "pt", 8
and "cpu" with args.image_format, args.num_frames, and args.device respectively;
and ensure the model_dir argument uses the same type the loader expects (use
llm._hf_model_dir as-is or cast consistently). Keep this selection logic minimal
and consider hoisting the single-sample extraction outside any loop if you want
to reuse the same nested media across turns.
chat_completion = client.chat.completions.create(
    model=model_name,
    messages=messages,
-   max_completion_tokens=100,
+   max_completion_tokens=256,
    response_format={
        "type":
        "structural_tag",
💡 Verification agent
🧩 Analysis chain
Use the correct parameter name: max_tokens (not max_completion_tokens).
chat.completions.create expects max_tokens; max_completion_tokens is unsupported and may be ignored or raise.
Apply this diff:
- max_completion_tokens=256,
+ max_tokens=256,
Run to find other occurrences to fix:
🏁 Script executed:
#!/bin/bash
rg -nP '\bmax_completion_tokens\b' -C2 tests/
Length of output: 14891
Replace max_completion_tokens with max_tokens in all affected tests
The following test files use the unsupported max_completion_tokens parameter and must be updated to max_tokens:
• tests/unittest/llmapi/apps/_test_openai_chat_structural_tag.py (line 163)
• tests/unittest/llmapi/apps/_test_openai_chat.py (lines 103, 162, 201, 223, 236, 258, 269, 322, 338, 402, 414, 438, 448, 458, 484, 497, 520)
• tests/unittest/llmapi/apps/_test_openai_reasoning.py (lines 48, 95, 108)
• tests/unittest/llmapi/apps/_test_openai_multi_gpu.py (line 123)
• tests/unittest/llmapi/apps/_test_openai_consistent_chat.py (lines 109, 154, 200)
• tests/unittest/llmapi/apps/_test_openai_chat_multimodal.py (lines 85, 90, 145, 150, 198, 203, 253, 258, 310, 321)
Apply this change everywhere:
- max_completion_tokens=<value>,
+ max_tokens=<value>,
This ensures the client's chat.completions.create calls use the correct max_tokens parameter.
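For reference, a minimal sketch of the corrected call shape against an OpenAI-compatible endpoint (the server URL, model name, and message content are placeholders, not taken from the tests):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
chat_completion = client.chat.completions.create(
    model="served-model-name",
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=256,  # parameter name the review asks the tests to use
)
print(chat_completion.choices[0].message.content)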
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
chat_completion = client.chat.completions.create(
    model=model_name,
    messages=messages,
    max_tokens=256,
    response_format={
        "type":
        "structural_tag",
🤖 Prompt for AI Agents
In tests/unittest/llmapi/apps/_test_openai_chat_structural_tag.py around lines
160 to 166, the chat.completions.create call uses the unsupported parameter
max_completion_tokens; replace that parameter name with max_tokens preserving
its value and placement. Also apply the same replacement across all affected
test files and line ranges listed in the review comment so every
chat.completions.create call uses max_tokens instead of max_completion_tokens.
e49078f to 7da226a Compare
Actionable comments posted: 0
♻️ Duplicate comments (3)
tensorrt_llm/_torch/modules/linear.py (3)
625-630
: Fix: make FP8 block-scale shape handling safetensors-friendly; also pin device for sharded load.Calling
.dim()
/.squeeze()
onweights[0][scale_name]
breaks when it’s a safetensors slice (PySafeSlice). Use a helper that inspects shape viaget_shape()
/.shape
and drops singleton dims through indexing. Also passdevice=module.weight_scale.device
to avoid CPU/CUDA mismatches when modules are already on GPU.Apply this diff within the selected lines:
- full_weight_scale = weights[0][scale_name] - # modelopt fp8_pb_wo can have 2 extra singleton dimensions - if full_weight_scale.dim() == 4: - full_weight_scale = full_weight_scale.squeeze(1).squeeze(-1) - weight_scale = load_weight_shard(full_weight_scale, module.tp_size, - module.tp_rank, module.tp_mode) + full_weight_scale = self._normalize_fp8_block_scale_like(weights[0][scale_name]) + weight_scale = load_weight_shard( + full_weight_scale, + module.tp_size, + module.tp_rank, + module.tp_mode, + device=module.weight_scale.device, + )Add this helper inside
FP8BlockScalesLinearMethod
(outside the selected range):def _normalize_fp8_block_scale_like(self, scale_like): """ Accept a torch.Tensor or a safetensors slice-like. If the shape is [R, 1, C, 1], lazily drop singleton dims to [R, C] without materializing. """ get_shape = getattr(scale_like, "get_shape", None) shape = tuple(get_shape()) if callable(get_shape) else tuple(getattr(scale_like, "shape", ())) if len(shape) == 4 and shape[1] == 1 and shape[-1] == 1: # Indexing preserves safetensors laziness return scale_like[:, 0, :, 0] return scale_like
643-660
: Refactor Q/K/V scale normalization to helper; avoid safetensors-incompatible.dim()
/.squeeze()
and ensure correct device.Same issue as above for
full_q_scale/full_k_scale/full_v_scale
. Normalize with the helper and shard ontomodule.weight_scale.device
to prevent device mismatch duringcopy_
.Apply this diff within the selected lines:
- full_q_scale = weights[0][scale_name] - full_k_scale = weights[1][scale_name] - full_v_scale = weights[2][scale_name] - # modelopt fp8_pb_wo can have 2 extra singleton dimensions - if full_q_scale.dim() == 4: - full_q_scale = full_q_scale.squeeze(1).squeeze(-1) - if full_k_scale.dim() == 4: - full_k_scale = full_k_scale.squeeze(1).squeeze(-1) - if full_v_scale.dim() == 4: - full_v_scale = full_v_scale.squeeze(1).squeeze(-1) - q_scale = load_weight_shard(full_q_scale, module.tp_size, - module.tp_rank, module.tp_mode) - k_scale = load_weight_shard(full_k_scale, module.tp_size, - module.tp_rank, module.tp_mode) - v_scale = load_weight_shard(full_v_scale, module.tp_size, - module.tp_rank, module.tp_mode) + full_q_scale = self._normalize_fp8_block_scale_like(weights[0][scale_name]) + full_k_scale = self._normalize_fp8_block_scale_like(weights[1][scale_name]) + full_v_scale = self._normalize_fp8_block_scale_like(weights[2][scale_name]) + q_scale = load_weight_shard( + full_q_scale, module.tp_size, module.tp_rank, module.tp_mode, + device=module.weight_scale.device) + k_scale = load_weight_shard( + full_k_scale, module.tp_size, module.tp_rank, module.tp_mode, + device=module.weight_scale.device) + v_scale = load_weight_shard( + full_v_scale, module.tp_size, module.tp_rank, module.tp_mode, + device=module.weight_scale.device) fused_fp8_block_scale = torch.cat((q_scale, k_scale, v_scale))
671-682
: Normalize left/right FP8 scales via helper and shard on the correct device.Eliminate direct
.dim()
/.squeeze()
to support safetensors slices and pass the destination device toload_weight_shard
.Apply this diff within the selected lines:
- full_left_scale = weights[0][scale_name] - full_right_scale = weights[1][scale_name] - # modelopt fp8_pb_wo can have 2 extra singleton dimensions - if full_left_scale.dim() == 4: - full_left_scale = full_left_scale.squeeze(1).squeeze(-1) - if full_right_scale.dim() == 4: - full_right_scale = full_right_scale.squeeze(1).squeeze(-1) - left_scale = load_weight_shard(full_left_scale, module.tp_size, - module.tp_rank, module.tp_mode) - right_scale = load_weight_shard(full_right_scale, module.tp_size, - module.tp_rank, module.tp_mode) + full_left_scale = self._normalize_fp8_block_scale_like(weights[0][scale_name]) + full_right_scale = self._normalize_fp8_block_scale_like(weights[1][scale_name]) + left_scale = load_weight_shard( + full_left_scale, module.tp_size, module.tp_rank, module.tp_mode, + device=module.weight_scale.device) + right_scale = load_weight_shard( + full_right_scale, module.tp_size, module.tp_rank, module.tp_mode, + device=module.weight_scale.device) fused_scale = torch.cat([left_scale, right_scale], dim=0)
🧹 Nitpick comments (1)
tensorrt_llm/_torch/modules/linear.py (1)
1-1
: Add NVIDIA copyright header per repo guidelines.Per coding guidelines, prepend the NVIDIA copyright header (current year) to all source files.
Apply at the top of the file:
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. + from __future__ import annotations
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
tensorrt_llm/_torch/attention_backend/trtllm.py
(1 hunks)tensorrt_llm/_torch/modules/linear.py
(3 hunks)tests/integration/test_lists/waives.txt
(3 hunks)
✅ Files skipped from review due to trivial changes (1)
- tests/integration/test_lists/waives.txt
🚧 Files skipped from review as they are similar to previous changes (1)
- tensorrt_llm/_torch/attention_backend/trtllm.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/_torch/modules/linear.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/_torch/modules/linear.py
🧬 Code graph analysis (1)
tensorrt_llm/_torch/modules/linear.py (2)
tensorrt_llm/_torch/distributed/communicator.py (2)
tp_size
(46-47)tp_rank
(54-55)tensorrt_llm/mapping.py (1)
tp_rank
(338-339)
PR_Github #17076 [ run ] triggered by Bot |
PR_Github #17075 [ run ] completed with state |
/bot run |
PR_Github #17076 [ run ] completed with state |
PR_Github #17077 [ run ] triggered by Bot |
PR_Github #17077 [ run ] completed with state |
/bot run |
PR_Github #17081 [ run ] triggered by Bot |
Since previous bot run already passed the full CI pipeline, but always encounter |
PR_Github #17081 [ run ] completed with state |
/bot skip --comment "passed full CI pipeline" |
PR_Github #17086 [ skip ] triggered by Bot |
PR_Github #17086 [ skip ] completed with state |
Signed-off-by: Ivy Zhang <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
…VIDIA#6937) Signed-off-by: peaceh <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
Signed-off-by: qqiao <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
Signed-off-by: Yiqing Yan <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <[email protected]> Co-authored-by: Zhanrui Sun <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
…NVIDIA#6913) Signed-off-by: Aurelien Chartier <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
Signed-off-by: junq <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
…images (NVIDIA#7157) Signed-off-by: Dimitrios Bariamis <[email protected]> Signed-off-by: Dimitrios Bariamis <[email protected]> Co-authored-by: Dimitrios Bariamis <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
Signed-off-by: Dimitrios Bariamis <[email protected]> Co-authored-by: Dimitrios Bariamis <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
2576c53 to 7040a83 Compare
/bot skip --comment "passed full CI pipeline" |
PR_Github #17135 [ skip ] triggered by Bot |
PR_Github #17135 [ skip ] completed with state |
Summary by CodeRabbit
New Features
Documentation
Improvements
Tests
Chores
Description
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run
/bot [-h|--help]
to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
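For example (flag values are illustrative; see the flag descriptions that follow), a typical comment might be:

/bot run --disable-fail-fast --gpu-type "A30, H100_PCIe"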
--reuse-test (optional)pipeline-id
(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test
(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
--disable-fail-fast
(OPTIONAL) : Disable fail fast on build/tests/infra failures.
--skip-test
(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx"
(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe"
(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp"
(OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test
(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test
(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test
(OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
--post-merge
(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"
(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log
(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug
(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in thestage-list
parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.For guidance on mapping tests to stage names, see
docs/source/reference/ci-overview.md
and the
scripts/test_to_stage_mapping.py
helper.kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test"
is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.