
Conversation

nv-guomingz (Collaborator) commented Sep 8, 2025

Summary by CodeRabbit

  • New Features

    • Release 1.0: broader model support (large multimodal and MoE models), LoRA and PyTorch backend enhancements, improved KV-cache, streaming/dynamic decoding, FP8 and other quantization improvements, attention/MLA scheduling, disaggregation, expanded benchmarking.
  • Infrastructure

    • Upgraded base images and core dependencies for improved stability and performance.
  • Breaking Changes / API

    • PyTorch promoted as the default backend; runtime defaults and public configuration shapes changed; new performance/quantization options exposed.
  • Bug Fixes

    • Fixes for MLA, disaggregated serving, scheduling, memory, and workflow issues.
  • Known Issues

    • Platform-specific disaggregation and multi-node caveats remain.
  • Documentation

    • Added comprehensive Release 1.0 notes and updated API docs.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR follows the TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run the build, package, and sanity-check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.
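
For reference, a few illustrative invocations composed from the options documented above (the stage and GPU names are the examples given in the help text, and the skip comment is a placeholder):

/bot run --disable-fail-fast --gpu-type "A30, H100_PCIe"
/bot run --stage-list "A10-PyTorch-1" --debug
/bot skip --comment "docs-only change, no functional code paths touched"
/bot reuse-pipeline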

nv-guomingz requested a review from a team as a code owner, September 8, 2025 07:27

coderabbitai bot (Contributor) commented Sep 8, 2025

📝 Walkthrough

Walkthrough

Adds a new top-level "TensorRT-LLM Release 1.0" section to docs/source/release-notes.md documenting new model support, extensive feature/infrastructure/benchmarking/documentation updates, and multiple API BREAKING CHANGES and public API declaration edits; only the release-notes documentation file was modified.

Changes

Cohort / File(s): Release Notes (docs/source/release-notes.md)
Summary: Adds an additive "TensorRT-LLM Release 1.0" section marking the PyTorch-based architecture and LLM API as stable; lists new model support (Mistral3.1 VLM; TensorRT-Engine Qwen3 dense; phi-4-multimodal; EXAONE 4.0; Qwen3 MoE); extensive features/enhancements (LoRA updates, PyTorch backend improvements and promotion to default, attention DP tweaks, KV cache/events, MLA optimizations, FP8/KV cache changes, external multimodal embeddings, disaggregation, streaming/dynamic decoding, LLMAPI surface changes); benchmarking additions (MOE benchmarking, speculative metrics, request timeline, latency/throughput metrics, bench serving integration); documentation refactor emphasizing PyTorch workflow and stabilized APIs; infrastructure dependency bumps (PyTorch 25.06 + Triton 25.06 base images, ModelOpt 0.33, xgrammar 0.1.21, transformers 4.53); and detailed API CHANGES including multiple BREAKING CHANGES, renames, removals, and new options (e.g., allreduce_strategy on TorchLlmArgs, dynamic quantization on LlmArgs, several field renames, removal/relocation of KvCacheConfig and other types, request performance metrics exposure). No other files modified.

Sequence Diagram(s)

(Skipped — changes are documentation-only and do not introduce or modify runtime control flow.)

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Description Check (⚠️ Warning): The PR description contains only the repository template and placeholder headings without a filled "Description" or "Test Coverage" section, and lacks a concise summary or testing information, so it is largely incomplete and does not satisfy the repository's required template usage. Resolution: Please populate the "Description" with a short summary of what changed and why, add a "Test Coverage" section listing relevant tests or a justification if none are required, mark or explain the PR checklist items (e.g., "docs-only change, tests skipped"), and include a link or pointer to the release-notes file/commit so reviewers can quickly find the full content.
✅ Passed checks (2 passed)
  • Title Check (✅ Passed): The PR title "[TRTLLM-7958][doc] add 1.0 release notes" is concise, follows the repository's ticket/type prefix conventions, and accurately summarizes the primary change (adding v1.0 release notes), making it clear to reviewers scanning history.
  • Docstring Coverage (✅ Passed): No functions found in the changes. Docstring coverage check skipped.
✨ Finishing touches
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 4

🧹 Nitpick comments (4)
docs/source/release-notes.md (4)

39-66: Normalize capitalization/terminology (FP8/FP4, ModelOpt, n‑gram, disagg).

Keeps terms consistent with prior notes.

-  - Add support for MXFP8xMXFP4 in pytorch
+  - Add support for MXFP8 x MXFP4 in PyTorch
-  - Support nvfp4 model and fp8 kv cache for MLA chunked prefill (Blackwell)
+  - Support NVFP4 model and FP8 KV cache for MLA chunked prefill (Blackwell)
-  - Add support for Modelopt fp8_pb_wo quantization scheme
+  - Add support for ModelOpt fp8_pb_wo quantization scheme
-  - Support deepEP fp4 post quant all2all dispatch
+  - Support DeepEP FP4 post‑quant all‑to‑all dispatch
-  - Fuse w4a8 moe pre-quant scale on Hopper
+  - Fuse W4A8 MoE pre‑quant scale on Hopper
-  - Add support for per expert activation scaling factors
+  - Add support for per‑expert activation scaling factors
-  - Auto-enable ngram with concurrency <= 32
+  - Auto‑enable n‑gram with concurrency <= 32
-  - Support structural tag in C++ runtime and upgrade xgrammar to 0.1.21
+  - Support structural tags in C++ runtime and upgrade xgrammar to 0.1.21
-  - Add support for disaggregation with pp with pytorch backend
+  - Add support for disaggregation with pipeline parallelism (PP) in the PyTorch backend
-  - Support n-gram speculative decoding with disagg
+  - Support n‑gram speculative decoding with disaggregated serving
-  - Add LLGuidance Support for PyTorch Backend
+  - Add LLGuidance support for PyTorch backend

83-94: Capitalization fix for PyTorch.

-  - Update trtllm-bench to support new Pytorch default
+  - Update trtllm-bench to support new PyTorch default

171-172: Clarify Known Issue wording (cuda-python version and error message).

Improves readability and avoids ambiguity.

-- On bare-metal Ubuntu 22.04 or 24.04, please install the `cuda-python==12.9.1` package after installing the TensorRT-LLM wheel. This resolves an incompatibility issue with the default cuda-python 13 of error `ImportError: cannot import name 'cuda' from 'cuda'`.
+- On bare‑metal Ubuntu 22.04 or 24.04, install `cuda-python==12.9.1` after installing the TensorRT‑LLM wheel. This resolves an incompatibility with the default cuda-python 13.x that results in `ImportError: cannot import name 'cuda' from 'cuda'`.

104-117: Add migration pointers for BREAKING CHANGEs and Known Issue workaround.

Link to relevant docs/upgrade guides to reduce support churn.

  • For each BREAKING CHANGE, add a short “Migration” hint with a link (e.g., default backend change → trtllm-serve docs; KvCacheConfig unification → LLM API guide).
  • For the cuda-python Known Issue, add a link to the install/troubleshooting page where the workaround is tracked.

Also applies to: 171-172
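
To illustrate the kind of migration hint suggested here for the KvCacheConfig unification, below is a minimal sketch assuming the LLM and KvCacheConfig classes from the LLM API; the model id and field values are placeholders, and the authoritative signatures live in the LLM API guide.

```python
# Minimal sketch of the migration direction suggested above: configure the KV cache
# through the unified KvCacheConfig on the LLM class (PyTorch backend is the 1.0 default).
# Assumes the imports shown below exist as documented; model id and values are placeholders.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

kv_cache_config = KvCacheConfig(free_gpu_memory_fraction=0.8)  # placeholder value

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # any HF id or local checkpoint
    kv_cache_config=kv_cache_config,
)

# Generate from a single prompt and print the first completion.
for output in llm.generate(["Hello, my name is"]):
    print(output.outputs[0].text)
```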

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 88d1bde and 6f7e39a.

📒 Files selected for processing (1)
  • docs/source/release-notes.md (1 hunks)
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: venkywonka
PR: NVIDIA/TensorRT-LLM#6029
File: .github/pull_request_template.md:45-53
Timestamp: 2025-08-27T17:50:13.264Z
Learning: For PR templates in TensorRT-LLM, avoid suggesting changes that would increase developer overhead, such as converting plain bullets to mandatory checkboxes. The team prefers guidance-style bullets that don't require explicit interaction to reduce friction in the PR creation process.
Learnt from: farshadghodsian
PR: NVIDIA/TensorRT-LLM#7101
File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36
Timestamp: 2025-08-21T00:16:56.457Z
Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.
📚 Learning: 2025-08-21T00:16:56.457Z
Learnt from: farshadghodsian
PR: NVIDIA/TensorRT-LLM#7101
File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36
Timestamp: 2025-08-21T00:16:56.457Z
Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-21T02:39:12.009Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:12.009Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • docs/source/release-notes.md
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
docs/source/release-notes.md (1)

96-103: Update transformers version to latest stable
All NGC images (nvcr.io/nvidia/pytorch:25.06-py3, nvcr.io/nvidia/tritonserver:25.06-py3) and dependencies (PyTorch 2.8.0, ModelOpt 0.33, xgrammar 0.1.21) are published and current; transformers 4.51.3 is superseded by 4.56.0 (released Aug 29, 2025), so bump the documented version or clarify its pin.

⛔ Skipped due to learnings
Learnt from: farshadghodsian
PR: NVIDIA/TensorRT-LLM#7101
File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36
Timestamp: 2025-08-21T00:16:56.457Z
Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.
Learnt from: ixlmar
PR: NVIDIA/TensorRT-LLM#7294
File: tensorrt_llm/_torch/modules/rms_norm.py:17-17
Timestamp: 2025-08-27T14:23:55.566Z
Learning: The TensorRT-LLM project requires Python 3.10+ as evidenced by the use of TypeAlias from typing module, match/case statements, and union type | syntax throughout the codebase, despite some documentation still mentioning Python 3.8+.
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

♻️ Duplicate comments (4)
docs/source/release-notes.md (4)

11-15: Polish model names and phrasing.

-  - Add Mistral3.1 VLM model support
-  - Add TensorRT-Engine Qwen3 (dense) model support
+  - Add Mistral 3.1 VLM model support
+  - Add Qwen3 (dense) model support for the TensorRT engine
-  - Add Qwen3 MoE support to TensorRT backend
+  - Add Qwen3 MoE support in the TensorRT backend

18-33: Fix casing/typos in early “Features” bullets (PyTorch, DP, CUDA Graph, NVFP4, etc.).

-  - Add support for sm121
+  - Add support for SM121
-  - Support pytorch LoRA adapter eviction
-  - Add LoRA support for pytorch backend in trtllm-serve 
-  - Add support of scheduling attention dp request
+  - Support PyTorch LoRA adapter eviction
+  - Add LoRA support for the PyTorch backend in trtllm-serve
+  - Add scheduling support for attention DP requests
-  - Remove padding of FusedMoE in attention DP
+  - Remove padding of Fused MoE in attention DP
-  - Support torch compile for attention dp
+  - Support torch.compile for attention DP
-  - Add KV events support for sliding window attention
+  - Add KV events support for sliding‑window attention
-  - Add TRTLLM MoE nvfp4 cubins for mid-high concurrency; attention_dp for TRTLLM MoE
+  - Add TRTLLM MoE NVFP4 CUBINs for mid‑ to high‑concurrency; attention DP for TRTLLM MoE
-  - Add Piecewise cuda graph support for MLA
+  - Add piecewise CUDA Graph support for MLA
-  - Support mutliCtasKvMode for high-throughput MLA kernels
+  - Support multiCtasKvMode for high‑throughput MLA kernels
-  - Enable kvcache to be reused during request generation
+  - Enable KV cache reuse during request generation
-  - Enable Multi-block mode for Hopper spec dec XQA kernel
+  - Enable multi‑block mode for Hopper speculative‑decoding XQA kernel

105-117: BREAKING CHANGE bullets: add missing spaces and fix API names.

- - **BREAKING CHANGE**Promote PyTorch to be the default LLM backend
- - **BREAKING CHANGE**Change default backend to PyTorch in trtllm-serve
- - **BREAKING CHANGE**Unify KvCacheConfig in LLM class for pytorch backend
- - **BREAKING CHANGE**Rename cuda_graph_config padding_enabled field
- - **BREAKING CHANGE**Rename mixed_sampler to enable_mixed_sampler
- - **BREAKING CHANGE**Rename LLM.autotuner_enabled to enable_autotuner
+ - **BREAKING CHANGE** Promote PyTorch to be the default LLM backend
+ - **BREAKING CHANGE** Change default backend to PyTorch in trtllm-serve
+ - **BREAKING CHANGE** Unify KvCacheConfig in the LLM class for the PyTorch backend
+ - **BREAKING CHANGE** Rename cuda_graph_config padding_enabled field
+ - **BREAKING CHANGE** Rename mixed_sampler to enable_mixed_sampler
+ - **BREAKING CHANGE** Rename LLM.autotuner_enabled to enable_autotuner
- - Add back allreduce_strategy parameter into TorchLlmArgs
- - Add LLmArgs option to force using dynamic quantization 
- - Add request_perf_metrics to LLMAPI
+ - Add back allreduce_strategy parameter to TorchLlmArgs
+ - Add LlmArgs option to force using dynamic quantization
+ - Add request_perf_metrics to the LLM API
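
To make these renames concrete, here is a small, self-contained sketch (plain Python, not part of the TensorRT-LLM API) that maps the pre-1.0 argument names called out in the bullets above to their 1.0 replacements:

```python
# Illustrative helper only; not part of the TensorRT-LLM API.
# The mapping is taken directly from the release-note bullets above:
# mixed_sampler -> enable_mixed_sampler, autotuner_enabled -> enable_autotuner.
RENAMED_LLM_ARGS = {
    "mixed_sampler": "enable_mixed_sampler",
    "autotuner_enabled": "enable_autotuner",
}

def upgrade_llm_kwargs(kwargs: dict) -> dict:
    """Return a copy of kwargs with pre-1.0 names replaced by their 1.0 equivalents."""
    return {RENAMED_LLM_ARGS.get(key, key): value for key, value in kwargs.items()}

# Example usage:
print(upgrade_llm_kwargs({"mixed_sampler": True, "autotuner_enabled": False}))
# prints: {'enable_mixed_sampler': True, 'enable_autotuner': False}
```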

119-170: Proofread “Fixed Issues” for typos/casing (illegal, Llama 4, MoE, FP8/FP4, etc.).

- - Fix illeagel memory access in MLA (#6437)
+ - Fix illegal memory access in MLA (#6437)
- - Fix nemotronNAS loading for TP>1 (#6447)
+ - Fix NemotronNAS loading for TP > 1 (#6447)
- - Fix eagle3 two model disaggregated serving test (#6014)
+ - Fix Eagle‑3 two‑model disaggregated serving test (#6014)
- - Fix mgmn postprocess error (#5835)
+ - Fix MGMN post‑process error (#5835)
- - Fallback to cubins for fp8 fmha kernels on Ada (#5779)
+ - Fall back to CUBINs for FP8 FMHA kernels on Ada (#5779)
- - Fix llama4 multimodal support (#5809)
+ - Fix Llama 4 multimodal support (#5809)
- - Fix moe regression for sm120 (#5823)
+ - Fix MoE regression for SM120 (#5823)
- - Fix the illegal memory access issue in moe gemm on SM120 (#5636)
+ - Fix illegal memory access in MoE GEMM on SM120 (#5636)
- - Fix tileN cannot % 16==0 & support sm89 deepgemm bmm (#5531)
+ - Fix tileN not divisible by 16 and add SM89 DeepGEMM BMM support (#5531)

Recommend a quick pass to normalize remaining model/acronym casing throughout this block.

🧹 Nitpick comments (5)
docs/source/release-notes.md (5)

34-47: Standardize acronyms and product names (FP8/NVFP4/ModelOpt/DeepEP).

-  - Add support for fused gate_up_proj scales for FP8 blockwise
+  - Add support for fused gate_up_proj scales for FP8 block‑wise
-  - Support FP8 row-wise dense GEMM in torch flow
+  - Support FP8 row‑wise dense GEMM in the PyTorch workflow
-  - Enable fp8 SwiGLU to minimize host overhead
+  - Enable FP8 SwiGLU to minimize host overhead
-  - Add Deepseek R1 FP8 Support on Blackwell
+  - Add DeepSeek R1 FP8 support on Blackwell
-  - Add support for MXFP8xMXFP4 in pytorch
+  - Add support for MXFP8×MXFP4 in PyTorch
-  - Support nvfp4 model and fp8 kv cache for MLA chunked prefill (Blackwell)
+  - Support NVFP4 model and FP8 KV cache for MLA chunked prefill (Blackwell)
-  - Opensource MOE MXFP8-MXFP4 implementation
+  - Open source MoE MXFP8‑MXFP4 implementation
-  - Add support for Modelopt fp8_pb_wo quantization scheme
+  - Add support for ModelOpt fp8_pb_wo quantization scheme
-  - Support deepEP fp4 post quant all2all dispatch
+  - Support DeepEP FP4 post‑quant all‑to‑all dispatch
-  - Fuse w4a8 moe pre-quant scale on Hopper
+  - Fuse W4A8 MoE pre‑quant scale on Hopper
-  - Support Weight-Only-Quantization in PyTorch Workflow
+  - Support weight‑only quantization in the PyTorch workflow
-  - Add support for per expert activation scaling factors
+  - Add support for per‑expert activation scaling factors

49-66: Clarify/capitalize a few feature bullets.

-  - Add support for YARN in NemotronNAS models
+  - Add support for YARN in NemotronNAS models
-  - Switch to internal version of MMProjector in Gemma3
+  - Switch to the internal version of MMProjector in Gemma3
-  - Disable add special tokens for Llama3.3 70B
+  - Disable add_special_tokens for Llama 3.3 70B
-  - Auto-enable ngram with concurrency <= 32
+  - Auto‑enable n‑gram with concurrency <= 32
-  - Support turning on/off spec decoding dynamically
+  - Support enabling/disabling speculative decoding dynamically
-  - Support structural tag in C++ runtime and upgrade xgrammar to 0.1.21
+  - Support structural_tag in the C++ runtime and upgrade XGrammar to 0.1.21
-  - Add support for disaggregation with pp with pytorch backend
+  - Add support for disaggregation with PP in the PyTorch backend
-  - Add status tags to LLM API reference
+  - Add status tags to the LLM API reference
-  - Add KV cache reuse support for multimodal models 
+  - Add KV cache reuse support for multimodal models
-  - Add support for two-model engine KV cache reuse
+  - Add support for two‑model engine KV cache reuse
-  - Add Eagle-3 support for qwen3 dense model
+  - Add Eagle‑3 support for Qwen3 dense model
-  - Migrate Eagle-3 and draft/target speculation to Drafter
+  - Migrate Eagle‑3 and draft/target speculation to Drafter

79-81: Triton/LLM API naming consistency.

-  - Remove support for llmapi + TRT backend in Triton
-  - Add request_perf_metrics to triton LLMAPI backend 
+  - Remove support for the LLM API with the TensorRT backend in Triton
+  - Add request_perf_metrics to the Triton LLM API backend

83-95: Benchmark bullets: fix PyTorch casing.

-  - Update trtllm-bench to support new Pytorch default
+  - Update trtllm-bench to support new PyTorch default

171-172: Tighten wording for Known Issues (cuda-python).

- - On bare-metal Ubuntu 22.04 or 24.04, please install the `cuda-python==12.9.1` package after installing the TensorRT-LLM wheel. This resolves an incompatibility issue with the default cuda-python 13 of error `ImportError: cannot import name 'cuda' from 'cuda'`.
+ - On bare‑metal Ubuntu 22.04 or 24.04, install `cuda-python==12.9.1` after installing the TensorRT‑LLM wheel. This resolves an incompatibility with the default cuda‑python 13 that causes: `ImportError: cannot import name 'cuda' from 'cuda'`.
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6f7e39a and bb3046b.

📒 Files selected for processing (1)
  • docs/source/release-notes.md (1 hunks)
🧰 Additional context used
🧠 Learnings (7)
📓 Common learnings
Learnt from: farshadghodsian
PR: NVIDIA/TensorRT-LLM#7101
File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36
Timestamp: 2025-08-21T00:16:56.457Z
Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.
📚 Learning: 2025-08-21T00:16:56.457Z
Learnt from: farshadghodsian
PR: NVIDIA/TensorRT-LLM#7101
File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36
Timestamp: 2025-08-21T00:16:56.457Z
Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-27T17:50:13.264Z
Learnt from: venkywonka
PR: NVIDIA/TensorRT-LLM#6029
File: .github/pull_request_template.md:45-53
Timestamp: 2025-08-27T17:50:13.264Z
Learning: For PR templates in TensorRT-LLM, avoid suggesting changes that would increase developer overhead, such as converting plain bullets to mandatory checkboxes. The team prefers guidance-style bullets that don't require explicit interaction to reduce friction in the PR creation process.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-21T02:39:12.009Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:12.009Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • docs/source/release-notes.md
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (2)
docs/source/release-notes.md (2)

7-9: Nice structure for the 1.0 section.

Clear headings and grouping. Carry on with the polish below.


97-103: Verify NGC container tags and pinned dependencies
Ensure nvcr.io/nvidia/pytorch:25.06-py3 and nvcr.io/nvidia/tritonserver:25.06-py3 are already published on NGC, and that PyTorch 2.8.0, NVIDIA ModelOpt 0.33, xgrammar 0.1.21 and transformers 4.51.3 match the actual release constraints—update the bullets in docs/source/release-notes.md (lines 97–103) if any mismatch.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

♻️ Duplicate comments (4)
docs/source/release-notes.md (4)

10-16: Polish model names and phrasing in “Model Support”.

Minor casing/wording fixes for consistency with repo terms.

-  - Add Mistral3.1 VLM model support
-  - Add TensorRT-Engine Qwen3 (dense) model support
-  - Add phi-4-multimodal model support
+  - Add Mistral 3.1 VLM model support
+  - Add Qwen3 (dense) model support for the TensorRT engine
+  - Add Phi‑4 multimodal model support
-  - Add Qwen3 MoE support to TensorRT backend
+  - Add Qwen3 MoE support in the TensorRT backend

18-82: Fix typos/capitalization and tighten phrasing in “Features”.

User-facing list; normalize acronyms and terms.

-  - Add support for sm121
+  - Add support for SM121
-  - Add LoRA support for PyTorch backend in trtllm-serve 
+  - Add LoRA support for the PyTorch backend in trtllm‑serve
-  - Add support of scheduling attention dp request
+  - Add scheduling support for attention DP requests
-  - Remove padding of FusedMoE in attention DP
+  - Remove padding for FusedMoE in attention DP
-  - Support torch compile for attention dp
+  - Support torch.compile for attention DP
-  - Add KV events support for sliding window attention
+  - Add KV events support for sliding‑window attention
-  - Add TRTLLM MoE nvfp4 cubins for mid-high concurrency; attention_dp for TRTLLM MoE
+  - Add TRT‑LLM MoE NVFP4 CUBINs for mid‑ to high‑concurrency; attention DP for TRT‑LLM MoE
-  - Add Piecewise cuda graph support for MLA
+  - Add piecewise CUDA Graph support for MLA
-  - Support mutliCtasKvMode for high-throughput MLA kernels
+  - Support multiCtasKvMode for high‑throughput MLA kernels
-  - Enable kvcache to be reused during request generation
+  - Enable KV cache reuse during request generation
-  - Add vLLM KV Pool support for XQA kernel
+  - Add vLLM KV pool support for XQA kernel
-  - Allow sending more than 2GiB through MPI by using mpi4py.util.pkl5
+  - Allow sending >2 GiB through MPI by using mpi4py.util.pkl5
-  - Add support for fused gate_up_proj scales for FP8 blockwise
+  - Add support for fused gate_up_proj scales for FP8 block‑wise
-  - Support FP8 row-wise dense GEMM in torch flow
+  - Support FP8 row‑wise dense GEMM in PyTorch flow
-  - Enable fp8 SwiGLU to minimize host overhead
+  - Enable FP8 SwiGLU to minimize host overhead
-  - Add Deepseek R1 FP8 Support on Blackwell
+  - Add DeepSeek R1 FP8 support on Blackwell
-  - Add support for MXFP8xMXFP4 in pytorch
+  - Add support for MXFP8×MXFP4 in PyTorch
-  - Support nvfp4 model and fp8 kv cache for MLA chunked prefill (Blackwell)
+  - Support NVFP4 model and FP8 KV cache for MLA chunked prefill (Blackwell)
-  - Opensource MOE MXFP8-MXFP4 implementation
+  - Open‑source MoE MXFP8‑MXFP4 implementation
-  - Add support for Modelopt fp8_pb_wo quantization scheme
+  - Add support for ModelOpt fp8_pb_wo quantization scheme
-  - Support deepEP fp4 post quant all2all dispatch
+  - Support DeepEP FP4 post‑quant all‑to‑all dispatch
-  - Fuse w4a8 moe pre-quant scale on Hopper
+  - Fuse w4a8 MoE pre‑quant scale on Hopper
-  - Support Weight-Only-Quantization in PyTorch Workflow
+  - Support weight‑only quantization in PyTorch workflow
-  - Add support for per expert activation scaling factors
+  - Add support for per‑expert activation scaling factors
-  - Add ReDrafter support for Qwen
+  - Add ReDrafter support for Qwen
-  - Enable CUDA graphs for Nemotron-H
+  - Enable CUDA Graphs for Nemotron‑H
-  - Add support for YARN in NemotronNAS models
+  - Add support for YaRN in NemotronNAS models
-  - Switch to internal version of MMProjector in Gemma3
+  - Switch to the internal MMProjector in Gemma3
-  - Disable add special tokens for Llama3.3 70B
+  - Disable add_special_tokens for Llama 3.3 70B
-  - Auto-enable ngram with concurrency <= 32
+  - Auto‑enable n‑gram with concurrency <= 32
-  - Support turning on/off spec decoding dynamically
+  - Support toggling speculative decoding dynamically
-  - Support structural tag in C++ runtime and upgrade xgrammar to 0.1.21
+  - Support structural tags in the C++ runtime and upgrade xgrammar to 0.1.21
-  - Add support for external multimodal embeddings
+  - Add support for external multi‑modal embeddings
-  - Add support for disaggregation with pp with pytorch backend
+  - Add support for disaggregated serving with pipeline parallelism in the PyTorch backend
-  - Add status tags to LLM API reference
+  - Add status tags to the LLM API reference
-  - Support JSON Schema in OpenAI-Compatible API
+  - Support JSON Schema in the OpenAI‑compatible API
-  - Support chunked prefill on spec decode 2 model
+  - Support chunked prefill on SpecDecode‑2 models
-  - Add KV cache reuse support for multimodal models 
+  - Add KV cache reuse support for multi‑modal models
-  - Add support for two-model engine KV cache reuse
+  - Add support for two‑model engine KV cache reuse
-  - Add Eagle-3 support for qwen3 dense model
+  - Add EAGLE‑3 support for Qwen3 dense model
-  - Migrate Eagle-3 and draft/target speculation to Drafter
+  - Migrate EAGLE‑3 and draft/target speculation to Drafter
-  - Enable guided decoding with overlap scheduler
+  - Enable guided decoding with the overlap scheduler
-  - Support n-gram speculative decoding with disagg
+  - Support n‑gram speculative decoding with disaggregated serving
-  - Add beam search support to the PyTorch Workflow
+  - Add beam search support to the PyTorch workflow
-  - Add LLGuidance Support for PyTorch Backend
+  - Add LLGuidance support for the PyTorch backend
-  - Add NGrams V2 support
+  - Add NGrams v2 support
-  - Add MTP support for Online EPLB
+  - Add MTP support for online EPLB
-  - Support disaggregated serving in TRTLLM Sampler
+  - Support disaggregated serving in the TRT‑LLM Sampler
-  - Add core infrastructure to enable loading of custom checkpoint formats
+  - Add core infrastructure to load custom checkpoint formats
-  - Support TRTLLM_DEEP_EP_TOKEN_LIMIT to allow run deep-ep on memory-constrained GPUs
+  - Support TRTLLM_DEEP_EP_TOKEN_LIMIT to allow running DeepEP on memory‑constrained GPUs
-  - Use huge page mapping for host accessible memory on GB200
+  - Use huge‑page mapping for host‑accessible memory on GB200
-  - Add user-provided speculative decoding support
+  - Add user‑provided speculative decoding support
-  - Add streaming scaffolding_llm.generate_async support
+  - Add streaming support for scaffolding_llm.generate_async
-  - Detokenize option in /v1/completions request
+  - Add detokenize option in /v1/completions requests
-  - Integrate TRT-LLM Gen FP4 block scale MoE with Pytorch workflow kernel autotuner
+  - Integrate TRT‑LLM Gen FP4 block‑scale MoE with the PyTorch workflow kernel autotuner
-  - Remove support for llmapi + TRT backend in Triton
+  - Remove support for LLM API + TRT backend in Triton
-  - Add request_perf_metrics to triton LLMAPI backend 
+  - Add request_perf_metrics to the Triton LLM API backend

105-117: BREAKING CHANGE bullets: spacing, casing, API name normalization.

- - **BREAKING CHANGE** Promote PyTorch to be the default LLM backend
- - **BREAKING CHANGE** Change default backend to PyTorch in trtllm-serve
- - **BREAKING CHANGE** Unify KvCacheConfig in LLM class for pytorch backend
+ - **BREAKING CHANGE** Promote PyTorch as the default LLM backend
+ - **BREAKING CHANGE** Change default backend to PyTorch in trtllm‑serve
+ - **BREAKING CHANGE** Unify KvCacheConfig in the LLM class for the PyTorch backend
- - **BREAKING CHANGE** Rename cuda_graph_config padding_enabled field
+ - **BREAKING CHANGE** Rename cuda_graph_config padding_enabled field
- - **BREAKING CHANGE** Rename mixed_sampler to enable_mixed_sampler
+ - **BREAKING CHANGE** Rename mixed_sampler to enable_mixed_sampler
- - **BREAKING CHANGE** Rename LLM.autotuner_enabled to enable_autotuner
+ - **BREAKING CHANGE** Rename LLM.autotuner_enabled to enable_autotuner
- - Add back allreduce_strategy parameter into TorchLlmArgs
+ - Add back allreduce_strategy parameter to TorchLlmArgs
- - Add LLmArgs option to force using dynamic quantization 
+ - Add LlmArgs option to force dynamic quantization
- - Add request_perf_metrics to LLMAPI
+ - Add request_perf_metrics to the LLM API
- - Remove batch_manager::KvCacheConfig and use executor::KvCacheConfig instead 
+ - Remove batch_manager::KvCacheConfig; use executor::KvCacheConfig instead

119-170: Fixed Issues: typos/capitalization and clarity.

Standardize model/library names and fix misspellings.

- - Fix nemotronNAS loading for TP>1 (#6447)
+ - Fix NemotronNAS loading for TP>1 (#6447)
- - Fix bugs caused by None attention_bias during Qwen3 model convert engine (#6344)
+ - Fix bugs caused by None attention_bias during Qwen3 engine conversion (#6344)
- - Fix bug of Qwen3 when using fp4 on sm120 (#6065)
+ - Fix Qwen3 bug when using FP4 on SM120 (#6065)
- - Fix TMA error with GEMM+AR on TP=2 (#6075)
+ - Fix TMA error with GEMM+AR at TP=2 (#6075)
- - Fix KV Cache overrides in trtllm-bench (#6103)
+ - Fix KV cache overrides in trtllm‑bench (#6103)
- - Fix MOE benchmark to rotate buffers to prevent L2 cache reuse (#4135)
+ - Fix MoE benchmark to rotate buffers and prevent L2 cache reuse (#4135)
- - Fix eagle3 two model disaggregated serving test (#6014)
+ - Fix EAGLE‑3 two‑model disaggregated serving test (#6014)
- - Fallback to cubins for fp8 fmha kernels on Ada (#5779)
+ - Fall back to CUBINs for FP8 FMHA kernels on Ada (#5779)
- - Fix disagg + speculative decoding (#5558)
+ - Fix disaggregated serving + speculative decoding (#5558)
- - Fix llama4 multimodal support (#5809)
+ - Fix Llama 4 multimodal support (#5809)
- - Fix moe regression for sm120 (#5823)
+ - Fix MoE regression for SM120 (#5823)
- - Fix Qwen2.5VL FP8 support (#5029)
+ - Fix Qwen2.5‑VL FP8 support (#5029)
- - Fix the illegal memory access issue in moe gemm on SM120 (#5636)
+ - Fix illegal memory access in MoE GEMM on SM120 (#5636)
- - Fix tileN cannot % 16==0 & support sm89 deepgemm bmm (#5531)
+ - Fix tileN not divisible by 16; add SM89 DeepGEMM BMM support (#5531)
- - Fix MoE workspace info by storing Torch tensor itself instead of data_ptr (#5900)
+ - Fix MoE workspace info by storing the Torch tensor instead of data_ptr (#5900)
- - Fix mistral unit tests due to transformers upgrade (#5904)
+ - Fix Mistral unit tests due to transformers upgrade (#5904)
- - Fix the Llama3.1 405B hanging issue. (#5698) (#5925)
+ - Fix Llama 3.1 405B hanging issue (#5698, #5925)
- - Fix alltoall for llama4 (apply_router_weight_on_input=True) (#5902)
+ - Fix alltoall for Llama 4 (apply_router_weight_on_input=True) (#5902)
- - Fix moe autotune fallback failed to query default heuristic (#5520) 
+ - Fix MoE autotune fallback failing to query default heuristic (#5520)
- - Fix block scale fp8 support for deepseek v3 on Blackwell. (#5514)
+ - Fix block‑scale FP8 support for DeepSeek V3 on Blackwell (#5514)
🧹 Nitpick comments (2)
docs/source/release-notes.md (2)

83-94: Benchmark bullets: minor casing and consistency.

-  - Add speculative metrics for trtllm-bench
+  - Add speculative‑decoding metrics for trtllm‑bench
-  - Add the ability to write a request timeline for trtllm-bench
+  - Add request‑timeline output for trtllm‑bench
-  - Add no_kv_cache_reuse option and streaming support for trtllm-serve bench
+  - Add no_kv_cache_reuse option and streaming support for trtllm‑serve bench
-  - Update trtllm-bench to support new Pytorch default
+  - Update trtllm‑bench to support new PyTorch default
-  - Make benchmark_serving part of the library
+  - Make benchmark_serving part of the library

171-173: Known issue wording and version check.

Clarify the sentence; confirm version pin is correct for 1.0.

- - On bare-metal Ubuntu 22.04 or 24.04, please install the `cuda-python==12.9.1` package after installing the TensorRT-LLM wheel. This resolves an incompatibility issue with the default cuda-python 13 of error `ImportError: cannot import name 'cuda' from 'cuda'`.
+ - On bare‑metal Ubuntu 22.04 or 24.04, install `cuda‑python==12.9.1` after installing the TensorRT‑LLM wheel. This resolves an incompatibility with the default cuda‑python 13 that leads to `ImportError: cannot import name 'cuda' from 'cuda'`.

Please verify that 12.9.1 is still the recommended pin for TRT‑LLM 1.0 on those distros and that cuda‑python 13 remains incompatible.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bb3046b and befdca1.

📒 Files selected for processing (1)
  • docs/source/release-notes.md (1 hunks)
🧰 Additional context used
🧠 Learnings (6)
📚 Learning: 2025-08-21T00:16:56.457Z
Learnt from: farshadghodsian
PR: NVIDIA/TensorRT-LLM#7101
File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36
Timestamp: 2025-08-21T00:16:56.457Z
Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-27T17:50:13.264Z
Learnt from: venkywonka
PR: NVIDIA/TensorRT-LLM#6029
File: .github/pull_request_template.md:45-53
Timestamp: 2025-08-27T17:50:13.264Z
Learning: For PR templates in TensorRT-LLM, avoid suggesting changes that would increase developer overhead, such as converting plain bullets to mandatory checkboxes. The team prefers guidance-style bullets that don't require explicit interaction to reduce friction in the PR creation process.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-22T01:54:35.850Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h:999-1000
Timestamp: 2025-08-22T01:54:35.850Z
Learning: The `internal_cutlass_kernels` directory in TensorRT-LLM is a mirror of an internal NVIDIA repository and maintains its own implementation and API that may diverge from the public `cutlass_kernels` version. API inconsistencies between these two directories are intentional and by design, not bugs to be fixed.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-21T02:39:12.009Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:12.009Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • docs/source/release-notes.md
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
docs/source/release-notes.md (1)

96-103: I wasn’t able to confirm those dependency versions from the public docs alone. To get exact versions from the TensorRT-LLM 1.0 containers, which approach would you like?

  • a) Pull the nvcr.io/nvidia/tensorrt-llm:1.0 image and inspect installed package versions (e.g. via pip list).
  • b) Locate and share the 1.0 container’s manifest or Dockerfile so I can grep for the version strings.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

♻️ Duplicate comments (2)
docs/source/release-notes.md (2)

11-15: Polish model names and phrasing in “Model Support”.

Consistency/readability fixes; aligns with repo style.

Apply:

-  - Add Mistral3.1 VLM model support
-  - Add TensorRT-Engine Qwen3 (dense) model support
-  - Add phi-4-multimodal model support
-  - Add EXAONE 4.0 model support
-  - Add Qwen3 MoE support to TensorRT backend
+  - Add Mistral 3.1 VLM model support
+  - Add Qwen3 (dense) model support for the TensorRT engine
+  - Add Phi‑4 multimodal model support
+  - Add EXAONE 4.0 model support
+  - Add Qwen3 MoE support in the TensorRT backend

18-41: Normalize terminology, capitalization, and fix typos across “Features”.

Standardize PyTorch/DP/NVFP4/FP8 casing, fix typos, clarify ambiguous terms, and remove trailing spaces.

Apply:

-  - Add support for sm121
+  - Add support for SM121
   - Add LoRA support for Gemma3
-  - Support PyTorch LoRA adapter eviction
-  - Add LoRA support for PyTorch backend in trtllm-serve 
-  - Add support of scheduling attention dp request
-  - Remove padding of FusedMoE in attention DP
-  - Support torch compile for attention dp
-  - Add KV events support for sliding window attention
-  - Add TRTLLM MoE nvfp4 cubins for mid-high concurrency; attention_dp for TRTLLM MoE
-  - Add Piecewise CUDA Graph support for MLA
-  - Support mutliCtasKvMode for high-throughput MLA kernels
-  - Enable kvcache to be reused during request generation
+  - Support PyTorch LoRA adapter eviction
+  - Add LoRA support for the PyTorch backend in trtllm-serve
+  - Add scheduling support for attention DP requests
+  - Remove padding of FusedMoE in attention DP
+  - Support torch.compile for attention DP
+  - Add KV events support for sliding‑window attention
+  - Add TRTLLM MoE NVFP4 CUBINs for mid‑ to high‑concurrency; attention DP for TRTLLM MoE
+  - Add piecewise CUDA Graph support for MLA
+  - Support multiCtasKvMode for high‑throughput MLA kernels
+  - Enable KV cache reuse during request generation
   - Add ADP schedule balance optimization
   - Add chunked prefill support for MLA (Blackwell)
-  - Enable Multi-block mode for Hopper spec dec XQA kernel
+  - Enable multi‑block mode for Hopper speculative‑decoding XQA kernel
   - Add vLLM KV Pool support for XQA kernel
   - Allow sending more than 2GiB through MPI by using mpi4py.util.pkl5
-  - Add support for fused gate_up_proj scales for FP8 blockwise
-  - Support FP8 row-wise dense GEMM in torch flow
-  - Enable fp8 SwiGLU to minimize host overhead
-  - Add Deepseek R1 FP8 Support on Blackwell
-  - Add support for MXFP8xMXFP4 in pytorch
-  - Support nvfp4 model and fp8 kv cache for MLA chunked prefill (Blackwell)
-  - Opensource MOE MXFP8-MXFP4 implementation
-  - Add support for Modelopt fp8_pb_wo quantization scheme
-  - Support deepEP fp4 post quant all2all dispatch
-  - Fuse w4a8 moe pre-quant scale on Hopper
-  - Support Weight-Only-Quantization in PyTorch Workflow
-  - Add support for per expert activation scaling factors
+  - Add support for fused gate_up_proj scales for FP8 block‑wise
+  - Support FP8 row‑wise dense GEMM in the PyTorch workflow
+  - Enable FP8 SwiGLU to minimize host overhead
+  - Add DeepSeek R1 FP8 support on Blackwell
+  - Add support for MXFP8×MXFP4 in PyTorch
+  - Support NVFP4 model and FP8 KV cache for MLA chunked prefill (Blackwell)
+  - Open‑source MoE MXFP8‑MXFP4 implementation
+  - Add support for ModelOpt FP8_PB_WO quantization scheme
+  - Support DeepEP FP4 post‑quant all‑to‑all dispatch
+  - Fuse W4A8 MoE pre‑quant scale on Hopper
+  - Support weight‑only quantization in the PyTorch workflow
+  - Add support for per‑expert activation scaling factors
   - Add ReDrafter support for Qwen
-  - Enable CUDA Graph for Nemotron-H
-  - Add support for YARN in NemotronNAS models
+  - Enable CUDA Graphs for Nemotron‑H
+  - Add support for YARN in NemotronNAS models
   - Switch to internal version of MMProjector in Gemma3
-  - Disable add special tokens for Llama3.3 70B
-  - Auto-enable ngram with concurrency <= 32
-  - Support turning on/off spec decoding dynamically
+  - Disable adding special tokens for Llama 3.3 70B
+  - Auto‑enable n‑gram with concurrency <= 32
+  - Support turning speculative decoding on/off dynamically
   - Support structural tag in C++ runtime and upgrade xgrammar to 0.1.21
   - Add support for external multimodal embeddings
-  - Add support for disaggregation with pp with pytorch backend
+  - Add support for disaggregation with pipeline parallelism (PP) on the PyTorch backend
   - Add status tags to LLM API reference
   - Support JSON Schema in OpenAI‑Compatible API
-  - Support chunked prefill on spec decode 2 model
-  - Add KV cache reuse support for multimodal models 
+  - Support chunked prefill on speculative‑decoding v2 model
+  - Add KV cache reuse support for multimodal models
   - Support nanobind bindings
   - Add support for two-model engine KV cache reuse
-  - Add Eagle-3 support for qwen3 dense model
+  - Add Eagle‑3 support for Qwen3 dense model
   - Migrate Eagle-3 and draft/target speculation to Drafter
   - Enable guided decoding with overlap scheduler
-  - Support n-gram speculative decoding with disagg
+  - Support n‑gram speculative decoding with disaggregated serving
   - Add beam search support to the PyTorch Workflow
-  - Add LLGuidance Support for PyTorch Backend
+  - Add LLGuidance support for PyTorch backend
   - Add NGrams V2 support
   - Add MTP support for Online EPLB
-  - Support disaggregated serving in TRTLLM Sampler
+  - Support disaggregated serving in TRTLLM sampler
   - Add core infrastructure to enable loading of custom checkpoint formats
-  - Support TRTLLM_DEEP_EP_TOKEN_LIMIT to allow run deep-ep on memory-constrained GPUs
-  - Use huge page mapping for host accessible memory on GB200
+  - Support TRTLLM_DEEP_EP_TOKEN_LIMIT to allow running DeepEP on memory‑constrained GPUs
+  - Use hugepage mapping for host‑accessible memory on GB200
   - Add user-provided speculative decoding support
   - Add streaming scaffolding_llm.generate_async support
   - Detokenize option in /v1/completions request
-  - Integrate TRT-LLM Gen FP4 block scale MoE with Pytorch workflow kernel autotuner
-  - Remove support for llmapi + TRT backend in Triton
-  - Add request_perf_metrics to triton LLMAPI backend 
+  - Integrate TRT‑LLM Gen FP4 block‑scale MoE with PyTorch workflow kernel autotuner
+  - Remove support for LLM API + TensorRT backend in Triton
+  - Add request_perf_metrics to the Triton LLM API backend

Also applies to: 45-63, 65-82

🧹 Nitpick comments (4)
docs/source/release-notes.md (4)

83-94: Benchmark section: minor wording/casing fixes.

Apply:

   - Add Acceptance Rate calculation to benchmark_serving 
   - Add wide-ep benchmarking scripts
-  - Update trtllm-bench to support new Pytorch default
+  - Update trtllm-bench to support new PyTorch default
   - Add support for TRTLLM CustomDataset
   - Make benchmark_serving part of the library

104-119: API changes: casing/grammar consistency and nomenclature.

Apply:

- - **BREAKING CHANGE** Unify KvCacheConfig in LLM class for pytorch backend
+ - **BREAKING CHANGE** Unify KVCacheConfig in LLM class for the PyTorch backend
- - Add back allreduce_strategy parameter into TorchLlmArgs
- - Add LLmArgs option to force using dynamic quantization
+ - Add back allreduce_strategy parameter to TorchLlmArgs
+ - Add LlmArgs option to force using dynamic quantization
- - Remove deprecated LoRA LLM args, that are already specified in lora_config
+ - Remove deprecated LoRA LLM args that are already specified in lora_config
- - Add request_perf_metrics to LLMAPI
+ - Add request_perf_metrics to the LLM API
- - Remove batch_manager::KvCacheConfig and use executor::KvCacheConfig instead 
+ - Remove batch_manager::KVCacheConfig and use executor::KVCacheConfig instead 
- - Remove ptuning knobs from TorchLlmArgs
+ - Remove P‑Tuning knobs from TorchLlmArgs

121-172: Fixed Issues: standardize names/casing and clarify one ambiguous bullet.

Apply:

- - Fix nemotronNAS loading for TP>1 (#6447)
+ - Fix NemotronNAS loading for TP>1 (#6447)
- - Fix eagle3 two model disaggregated serving test (#6014)
+ - Fix Eagle‑3 two‑model disaggregated serving test (#6014)
- - Fallback to cubins for fp8 fmha kernels on Ada (#5779)
+ - Fallback to CUBINs for FP8 FMHA kernels on Ada (#5779)
- - Fix llama4 multimodal support (#5809)
+ - Fix Llama 4 multimodal support (#5809)
- - Fix moe regression for sm120 (#5823)
+ - Fix MoE regression for SM120 (#5823)
- - Fix the illegal memory access issue in moe gemm on SM120 (#5636)
+ - Fix the illegal memory access issue in MoE GEMM on SM120 (#5636)
- - Fix tileN cannot % 16==0 & support sm89 deepgemm bmm (#5531)
+ - Fix handling when tileN is not divisible by 16, and add support for SM89 DeepGEMM BMM (#5531)
- - Fix mistral unit tests due to transformers upgrade (#5904)
+ - Fix Mistral unit tests due to transformers upgrade (#5904)
- - Fix mtp vanilla draft inputs (#5568) 
+ - Fix MTP vanilla draft inputs (#5568)

174-175: Known issues: tighten wording.

Apply:

-- On bare-metal Ubuntu 22.04 or 24.04, please install the `cuda-python==12.9.1` package after installing the TensorRT-LLM wheel. This resolves an incompatibility issue with the default cuda-python 13 of error `ImportError: cannot import name 'cuda' from 'cuda'`.
+- On bare‑metal Ubuntu 22.04 or 24.04, install `cuda-python==12.9.1` after installing the TensorRT‑LLM wheel. This resolves an incompatibility with the default cuda-python 13.x that leads to `ImportError: cannot import name 'cuda' from 'cuda'`.
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7c2f04f and 68066b7.

📒 Files selected for processing (1)
  • docs/source/release-notes.md (1 hunks)
🧰 Additional context used
🧠 Learnings (11)
📓 Common learnings
Learnt from: farshadghodsian
PR: NVIDIA/TensorRT-LLM#7101
File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36
Timestamp: 2025-08-21T00:16:56.457Z
Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.
📚 Learning: 2025-08-27T17:50:13.264Z
Learnt from: venkywonka
PR: NVIDIA/TensorRT-LLM#6029
File: .github/pull_request_template.md:45-53
Timestamp: 2025-08-27T17:50:13.264Z
Learning: For PR templates in TensorRT-LLM, avoid suggesting changes that would increase developer overhead, such as converting plain bullets to mandatory checkboxes. The team prefers guidance-style bullets that don't require explicit interaction to reduce friction in the PR creation process.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-22T01:54:35.850Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h:999-1000
Timestamp: 2025-08-22T01:54:35.850Z
Learning: The `internal_cutlass_kernels` directory in TensorRT-LLM is a mirror of an internal NVIDIA repository and maintains its own implementation and API that may diverge from the public `cutlass_kernels` version. API inconsistencies between these two directories are intentional and by design, not bugs to be fixed.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-11T20:09:24.389Z
Learnt from: achartier
PR: NVIDIA/TensorRT-LLM#6763
File: tests/integration/defs/triton_server/conftest.py:16-22
Timestamp: 2025-08-11T20:09:24.389Z
Learning: In the TensorRT-LLM test infrastructure, the team prefers simple, direct solutions (like hard-coding directory traversal counts) over more complex but robust approaches when dealing with stable directory structures. They accept the maintenance cost of updating tests if the layout changes.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-14T15:38:01.771Z
Learnt from: MatthiasKohl
PR: NVIDIA/TensorRT-LLM#6904
File: cpp/tensorrt_llm/pybind/thop/bindings.cpp:55-57
Timestamp: 2025-08-14T15:38:01.771Z
Learning: In TensorRT-LLM Python bindings, tensor parameter collections like mla_tensor_params and spec_decoding_tensor_params are kept as required parameters without defaults to maintain API consistency, even when it might affect backward compatibility.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-21T00:16:56.457Z
Learnt from: farshadghodsian
PR: NVIDIA/TensorRT-LLM#7101
File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36
Timestamp: 2025-08-21T00:16:56.457Z
Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-07-17T09:01:27.402Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-21T02:39:12.009Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:12.009Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • docs/source/release-notes.md
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
docs/source/release-notes.md (1)

97-102: NGC container tags verified — both nvcr.io/nvidia/pytorch:25.06-py3 and nvcr.io/nvidia/tritonserver:25.06-py3 are available on NGC and their bundled PyTorch and TensorRT versions align with the release notes.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (4)
docs/source/release-notes.md (4)

11-15: Polish model names and phrasing in “Model Support”.

Align naming/casing and phrasing for clarity and consistency.

-  - Add Mistral3.1 VLM model support
-  - Add TensorRT-Engine Qwen3 (dense) model support
-  - Add phi-4-multimodal model support
-  - Add EXAONE 4.0 model support
-  - Add Qwen3 MoE support to TensorRT backend
+  - Add Mistral 3.1 VLM model support
+  - Add Qwen3 (dense) model support for the TensorRT engine
+  - Add Phi-4 Multimodal model support
+  - Add EXAONE 4.0 model support
+  - Add Qwen3 MoE support in the TensorRT backend

21-41: Fix typos/casing and normalize technical terms in “Features” (part 1).

Correct PyTorch, DP, torch.compile, NVFP4, and KV cache terminology, and fix a typo.

-  - Add LoRA support for PyTorch backend in trtllm-serve 
-  - Add support of scheduling attention dp request
+  - Add LoRA support for the PyTorch backend in trtllm-serve
+  - Add scheduling support for attention DP requests
-  - Remove padding of FusedMoE in attention DP
-  - Support torch compile for attention dp
-  - Add KV events support for sliding window attention
-  - Add TRTLLM MoE nvfp4 cubins for mid-high concurrency; attention_dp for TRTLLM MoE
+  - Remove padding of FusedMoE in attention DP
+  - Support torch.compile for attention DP
+  - Add KV events support for sliding‑window attention
+  - Add TRTLLM MoE NVFP4 CUBINs for mid‑ to high‑concurrency; attention DP for TRTLLM MoE
-  - Add Piecewise CUDA Graph support for MLA
-  - Support mutliCtasKvMode for high-throughput MLA kernels
-  - Enable kvcache to be reused during request generation
+  - Add piecewise CUDA Graph support for MLA
+  - Support multiCtasKvMode for high‑throughput MLA kernels
+  - Enable KV cache reuse during request generation
-  - Enable fp8 SwiGLU to minimize host overhead
+  - Enable FP8 SwiGLU to minimize host overhead
-  - Add support for MXFP8xMXFP4 in pytorch
+  - Add support for MXFP8xMXFP4 in PyTorch
-  - Support nvfp4 model and fp8 kv cache for MLA chunked prefill (Blackwell)
+  - Support NVFP4 model and FP8 KV cache for MLA chunked prefill (Blackwell)
-  - Opensource MOE MXFP8-MXFP4 implementation
+  - Open‑source MoE MXFP8‑MXFP4 implementation
-  - Add support for Modelopt fp8_pb_wo quantization scheme
+  - Add support for ModelOpt fp8_pb_wo quantization scheme
-  - Support deepEP fp4 post quant all2all dispatch
+  - Support DeepEP FP4 post‑quant all‑to‑all dispatch
-  - Fuse w4a8 moe pre-quant scale on Hopper
+  - Fuse W4A8 MoE pre‑quant scale on Hopper
-  - Support Weight-Only-Quantization in PyTorch Workflow
+  - Support weight‑only quantization in the PyTorch workflow
-  - Add support for per expert activation scaling factors
+  - Add support for per‑expert activation scaling factors

105-116: BREAKING/API bullets: spacing/casing + API name normalization.

Normalize “PyTorch”, “KvCacheConfig”, “LlmArgs”, “LLM API”; streamline grammar.

- - **BREAKING CHANGE** Promote PyTorch to be the default LLM backend
+ - **BREAKING CHANGE** Promote PyTorch as the default LLM backend
- - **BREAKING CHANGE** Change default backend to PyTorch in trtllm-serve
+ - **BREAKING CHANGE** Change the default backend to PyTorch in trtllm-serve
- - **BREAKING CHANGE** Unify KvCacheConfig in LLM class for pytorch backend
+ - **BREAKING CHANGE** Unify KvCacheConfig in the LLM class for the PyTorch backend
- - **BREAKING CHANGE** Rename cuda_graph_config padding_enabled field
+ - **BREAKING CHANGE** Rename cuda_graph_config.padding_enabled
- - **BREAKING CHANGE** Rename mixed_sampler to enable_mixed_sampler
- - **BREAKING CHANGE** Rename LLM.autotuner_enabled to enable_autotuner
+ - **BREAKING CHANGE** Rename mixed_sampler to enable_mixed_sampler
+ - **BREAKING CHANGE** Rename LLM.autotuner_enabled to enable_autotuner
- - Add back allreduce_strategy parameter into TorchLlmArgs
+ - Add back allreduce_strategy parameter to TorchLlmArgs
- - Add LLmArgs option to force using dynamic quantization
+ - Add LlmArgs option to force dynamic quantization
- - Add request_perf_metrics to LLMAPI
+ - Add request_perf_metrics to the LLM API

121-172: Fix obvious typos/casing in “Fixed Issues”.

Address user‑visible misspellings without changing semantics.

- - Fix nemotronNAS loading for TP>1 (#6447)
+ - Fix NemotronNAS loading for TP>1 (#6447)
- - Fallback to cubins for fp8 fmha kernels on Ada (#5779)
+ - Fallback to CUBINs for FP8 FMHA kernels on Ada (#5779)
- - Fix llama4 multimodal support (#5809)
+ - Fix Llama 4 multimodal support (#5809)
- - Fix Qwen2.5VL FP8 support (#5029)
+ - Fix Qwen2.5‑VL FP8 support (#5029)
- - Fix tileN cannot % 16==0 & support sm89 deepgemm bmm (#5531)
+ - Fix tileN not divisible by 16; add SM89 DeepGEMM BMM support (#5531)
- - Fix the Llama3.1 405B hanging issue. (#5698) (#5925)
+ - Fix the Llama 3.1 405B hanging issue. (#5698) (#5925)
- - Fix broken cyclic reference detect (#5417) 
+ - Fix broken cyclic reference detection (#5417)
- - Fix permission for local user issues in NGC docker container. (#5373)
+ - Fix local user permissions in NGC Docker container. (#5373)
- - Fix mPtrExpertCounts allocation in MoE TRT-LLM backend (nvfp4) (#5519) 
+ - Fix mPtrExpertCounts allocation in MoE TensorRT‑LLM backend (NVFP4) (#5519)
- - Fix block scale fp8 support for deepseek v3 on Blackwell. (#5514)
+ - Fix block‑scale FP8 support for DeepSeek V3 on Blackwell. (#5514)
- - Fix the issue MoE autotune fallback failed to query default heuristic (#5520) 
+ - Fix MoE autotune fallback failing to query default heuristic (#5520)
🧹 Nitpick comments (6)
docs/source/release-notes.md (6)

48-66: Style/clarity nits in “Features” (part 2).

Consistent model/API names; avoid abbreviations; fix spacing.

-  - Switch to internal version of MMProjector in Gemma3
-  - Disable add special tokens for Llama3.3 70B
-  - Auto-enable ngram with concurrency <= 32
-  - Support turning on/off spec decoding dynamically
+  - Switch to the internal MMProjector in Gemma 3
+  - Disable add_special_tokens for Llama 3.3 70B
+  - Auto‑enable n‑gram with concurrency ≤ 32
+  - Support turning speculative decoding on/off dynamically
-  - Add support for disaggregation with pp with pytorch backend
+  - Add support for disaggregation with pipeline parallelism in the PyTorch backend
-  - Support JSON Schema in OpenAI-Compatible API
+  - Support JSON Schema in the OpenAI‑compatible API
-  - Support n-gram speculative decoding with disagg
+  - Support n‑gram speculative decoding with disaggregated serving
-  - Add LLGuidance Support for PyTorch Backend
+  - Add LLGuidance support for the PyTorch backend
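
To make the JSON Schema bullet above concrete, here is a hedged sketch of a structured-output request against a trtllm-serve OpenAI-compatible endpoint. The base URL, model name, and the exact `response_format` payload shape are assumptions for illustration, following the OpenAI structured-output convention.

```python
# Hypothetical JSON-Schema-constrained request to the OpenAI-compatible API
# served by trtllm-serve. Endpoint, model name, and response_format shape are
# assumptions; adjust to your deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

schema = {
    "type": "object",
    "properties": {"answer": {"type": "string"}, "confidence": {"type": "number"}},
    "required": ["answer"],
}

resp = client.chat.completions.create(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder served model
    messages=[{"role": "user", "content": "Reply as JSON: what is TensorRT-LLM?"}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "qa", "schema": schema},
    },
)
print(resp.choices[0].message.content)
```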

71-81: Phrasing/casing nits (part 3).

Tighten wording and casing; remove trailing spaces.

-  - Support disaggregated serving in TRTLLM Sampler
+  - Support disaggregated serving in the TRTLLM sampler
-  - Add core infrastructure to enable loading of custom checkpoint formats
+  - Add core infrastructure to enable loading custom checkpoint formats
-  - Support TRTLLM_DEEP_EP_TOKEN_LIMIT to allow run deep-ep on memory-constrained GPUs
+  - Support TRTLLM_DEEP_EP_TOKEN_LIMIT to allow DeepEP on memory‑constrained GPUs
-  - Use huge page mapping for host accessible memory on GB200
+  - Use huge‑page mapping for host‑accessible memory on GB200
-  - Add user-provided speculative decoding support
+  - Add user‑provided speculative decoding support
-  - Add streaming scaffolding_llm.generate_async support
+  - Add streaming scaffolding_llm.generate_async support
-  - Integrate TRT-LLM Gen FP4 block scale MoE with Pytorch workflow kernel autotuner
+  - Integrate TRT‑LLM Gen FP4 block‑scale MoE with the PyTorch workflow kernel autotuner
-  - Remove support for llmapi + TRT backend in Triton
+  - Remove support for LLM API + TensorRT backend in Triton
-  - Add request_perf_metrics to triton LLMAPI backend 
+  - Add request_perf_metrics to the Triton LLM API backend

83-94: Benchmark section: casing/consistency.

Minor polish; keep tool names as code where applicable.

-  - Add support for benchmarking individual gemms in MOE benchmark (#6080)
+  - Add support for benchmarking individual GEMMs in MoE benchmark (#6080)
-  - Add speculative metrics for trtllm-bench
+  - Add speculative‑decoding metrics for trtllm-bench
-  - Add Acceptance Rate calculation to benchmark_serving 
+  - Add acceptance‑rate calculation to benchmark_serving
-  - Update trtllm-bench to support new Pytorch default
+  - Update trtllm-bench to support new PyTorch default

116-119: API bullets: grammar and scoping.

Minor fixes.

- - Remove batch_manager::KvCacheConfig and use executor::KvCacheConfig instead 
+ - Remove batch_manager::KvCacheConfig and use executor::KvCacheConfig instead
- - Remove TrtGptModelOptionalParams 
+ - Remove TrtGptModelOptionalParams
- - Remove ptuning knobs from TorchLlmArgs
+ - Remove P‑Tuning knobs from TorchLlmArgs

173-176: Known issues: clarify PP notation.

Make pipeline‑parallelism notation explicit.

- - When using disaggregated serving with pipeline parallelism and KV cache reuse, a hang can occur. This will be fixed in a future release. In the meantime, disabling KV cache reuse will fix this issue. 
+ - When using disaggregated serving with pipeline parallelism (e.g., PP=2) and KV cache reuse, a hang can occur. This will be fixed in a future release. In the meantime, disabling KV cache reuse will fix this issue.
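
Since the workaround in the bullet above is "disable KV cache reuse", a minimal sketch of what that looks like through the LLM API follows; it assumes the `KvCacheConfig.enable_block_reuse` field name, and the equivalent YAML knob would apply when launching trtllm-serve.

```python
# Workaround sketch for the disaggregated-serving + pipeline-parallelism hang:
# turn off KV cache block reuse. Field name assumed from the LLM API.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model
    kv_cache_config=KvCacheConfig(enable_block_reuse=False),  # avoid the hang
)
```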

7-7: Optional: add release date under the 1.0 header.

Helps downstream users correlate with container tags and PyPI wheels.

-## TensorRT-LLM Release 1.0
+## TensorRT-LLM Release 1.0
+Release date: 2025‑09‑??
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 68066b7 and 0fe01e1.

📒 Files selected for processing (1)
  • docs/source/release-notes.md (1 hunks)
🧰 Additional context used
🧠 Learnings (12)
📓 Common learnings
Learnt from: farshadghodsian
PR: NVIDIA/TensorRT-LLM#7101
File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36
Timestamp: 2025-08-21T00:16:56.457Z
Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.
📚 Learning: 2025-08-21T00:16:56.457Z
Learnt from: farshadghodsian
PR: NVIDIA/TensorRT-LLM#7101
File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36
Timestamp: 2025-08-21T00:16:56.457Z
Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-27T17:50:13.264Z
Learnt from: venkywonka
PR: NVIDIA/TensorRT-LLM#6029
File: .github/pull_request_template.md:45-53
Timestamp: 2025-08-27T17:50:13.264Z
Learning: For PR templates in TensorRT-LLM, avoid suggesting changes that would increase developer overhead, such as converting plain bullets to mandatory checkboxes. The team prefers guidance-style bullets that don't require explicit interaction to reduce friction in the PR creation process.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-22T01:54:35.850Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h:999-1000
Timestamp: 2025-08-22T01:54:35.850Z
Learning: The `internal_cutlass_kernels` directory in TensorRT-LLM is a mirror of an internal NVIDIA repository and maintains its own implementation and API that may diverge from the public `cutlass_kernels` version. API inconsistencies between these two directories are intentional and by design, not bugs to be fixed.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-11T20:09:24.389Z
Learnt from: achartier
PR: NVIDIA/TensorRT-LLM#6763
File: tests/integration/defs/triton_server/conftest.py:16-22
Timestamp: 2025-08-11T20:09:24.389Z
Learning: In the TensorRT-LLM test infrastructure, the team prefers simple, direct solutions (like hard-coding directory traversal counts) over more complex but robust approaches when dealing with stable directory structures. They accept the maintenance cost of updating tests if the layout changes.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-14T15:38:01.771Z
Learnt from: MatthiasKohl
PR: NVIDIA/TensorRT-LLM#6904
File: cpp/tensorrt_llm/pybind/thop/bindings.cpp:55-57
Timestamp: 2025-08-14T15:38:01.771Z
Learning: In TensorRT-LLM Python bindings, tensor parameter collections like mla_tensor_params and spec_decoding_tensor_params are kept as required parameters without defaults to maintain API consistency, even when it might affect backward compatibility.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-07-17T09:01:27.402Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-21T02:39:12.009Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:12.009Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-15T06:46:53.813Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.813Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.

Applied to files:

  • docs/source/release-notes.md
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
docs/source/release-notes.md (1)

96-103: NGC container tags confirmed. Both nvcr.io/nvidia/pytorch:25.06-py3 and nvcr.io/nvidia/tritonserver:25.06-py3 are publicly available on NVIDIA NGC.

Collaborator

@laikhtewari laikhtewari left a comment

@nv-guomingz @juney-nvidia where do we want to talk about LLM API stability?

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (4)
docs/source/release-notes.md (4)

11-15: Polish model names and phrasing (consistency).

Minor, user-facing. Align naming/casing and phrasing.

-  - Add Mistral3.1 VLM model support
-  - Add TensorRT-Engine Qwen3 (dense) model support
-  - Add phi-4-multimodal model support
-  - Add EXAONE 4.0 model support
-  - Add Qwen3 MoE support to TensorRT backend
+  - Add Mistral 3.1 VLM model support
+  - Add Qwen3 (dense) model support for the TensorRT engine
+  - Add Phi-4 multimodal model support
+  - Add EXAONE 4.0 model support
+  - Add Qwen3 MoE support in the TensorRT backend

18-33: Fix typos/capitalization in “Features” (PyTorch, DP, torch.compile, NVFP4, KV cache, typo).

Clean up widely visible text; keep repo terminology consistent.

-  - Add LoRA support for PyTorch backend in trtllm-serve 
-  - Add support of scheduling attention dp request
+  - Add LoRA support for the PyTorch backend in trtllm-serve
+  - Add scheduling support for attention DP requests
-  - Remove padding of FusedMoE in attention DP
+  - Remove padding for FusedMoE in attention DP
-  - Support torch compile for attention dp
+  - Support torch.compile for attention DP
-  - Add KV events support for sliding window attention
+  - Add KV events support for sliding-window attention
-  - Add TRTLLM MoE nvfp4 cubins for mid-high concurrency; attention_dp for TRTLLM MoE
+  - Add TRTLLM MoE NVFP4 CUBINs for mid‑ to high‑concurrency; attention DP for TRTLLM MoE
-  - Add Piecewise CUDA Graph support for MLA
+  - Add piecewise CUDA Graph support for MLA
-  - Support mutliCtasKvMode for high-throughput MLA kernels
+  - Support multiCtasKvMode for high‑throughput MLA kernels
-  - Enable kvcache to be reused during request generation
+  - Enable KV cache reuse during request generation
-  - Support FP8 row-wise dense GEMM in torch flow
+  - Support FP8 row‑wise dense GEMM in the PyTorch flow
-  - Enable fp8 SwiGLU to minimize host overhead
+  - Enable FP8 SwiGLU to minimize host overhead
-  - Add support for MXFP8xMXFP4 in pytorch
+  - Add support for MXFP8×MXFP4 in PyTorch
-  - Support nvfp4 model and fp8 kv cache for MLA chunked prefill (Blackwell)
+  - Support NVFP4 model and FP8 KV cache for MLA chunked prefill (Blackwell)
-  - Opensource MOE MXFP8-MXFP4 implementation
+  - Open‑source MoE MXFP8‑MXFP4 implementation
-  - Add support for Modelopt fp8_pb_wo quantization scheme
+  - Add support for ModelOpt fp8_pb_wo quantization scheme

Also applies to: 36-41


105-116: API section: casing/grammar (“PyTorch”, “LlmArgs”, “LLM API”, preposition).

Tidy public API names.

- - **BREAKING CHANGE** Unify KvCacheConfig in LLM class for pytorch backend
+ - **BREAKING CHANGE** Unify KvCacheConfig in LLM class for the PyTorch backend
- - Add back allreduce_strategy parameter into TorchLlmArgs
+ - Add back allreduce_strategy parameter to TorchLlmArgs
- - Add LLmArgs option to force using dynamic quantization
+ - Add LlmArgs option to force using dynamic quantization
- - Add request_perf_metrics to LLMAPI
+ - Add request_perf_metrics to the LLM API
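
To illustrate the request_perf_metrics bullet above, a defensive sketch of how per-request metrics might be consumed from the LLM API follows. Both the `return_perf_metrics` flag and the `request_perf_metrics` attribute are assumptions for illustration, not confirmed signatures; check the 1.0 API reference for the exact names.

```python
# Hypothetical sketch of reading the new per-request performance metrics via
# the LLM API. Flag and attribute names are assumptions; the code probes for
# them defensively so it still runs if they differ.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # placeholder model
try:
    params = SamplingParams(max_tokens=16, return_perf_metrics=True)  # assumed flag
except TypeError:
    params = SamplingParams(max_tokens=16)  # flag name differs in this version

out = llm.generate(["Hello"], params)[0]
metrics = getattr(out.outputs[0], "request_perf_metrics", None)  # assumed attribute
print(metrics if metrics is not None else "perf metrics not exposed on this output")
```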

122-172: Fixed Issues: typos/casing and clearer phrasing.

User-facing; fix brand/model names and a confusing math phrase.

- - Fix nemotronNAS loading for TP>1 (#6447)
+ - Fix NemotronNAS loading for TP>1 (#6447)
- - Fix eagle3 two model disaggregated serving test (#6014)
+ - Fix Eagle‑3 two‑model disaggregated serving test (#6014)
- - Fix llama4 multimodal support (#5809)
+ - Fix Llama 4 multimodal support (#5809)
- - Fix alltoall for llama4 (apply_router_weight_on_input=True) (#5902)
+ - Fix all‑to‑all for Llama 4 (apply_router_weight_on_input=True) (#5902)
- - Fix Qwen2.5VL FP8 support (#5029)
+ - Fix Qwen2.5‑VL FP8 support (#5029)
- - Fix the illegal memory access issue in moe gemm on SM120 (#5636)
+ - Fix illegal memory access in MoE GEMM on SM120 (#5636)
- - Fix tileN cannot % 16==0 & support sm89 deepgemm bmm (#5531)
+ - Fix tileN % 16 != 0 case and add SM89 DeepGEMM BMM support (#5531)
- - Fallback to cubins for fp8 fmha kernels on Ada (#5779)
+ - Fall back to CUBINs for FP8 FMHA kernels on Ada (#5779)
- - Fix mPtrExpertCounts allocation in MoE TRT-LLM backend (nvfp4) (#5519) 
+ - Fix mPtrExpertCounts allocation in MoE TRT‑LLM backend (NVFP4) (#5519)
- - Fix block scale fp8 support for deepseek v3 on Blackwell. (#5514)
+ - Fix block‑scale FP8 support for DeepSeek V3 on Blackwell. (#5514)
🧹 Nitpick comments (3)
docs/source/release-notes.md (3)

51-66: Clarify/capitalize feature bullets; avoid abbreviations in docs.

Tighten wording; expand “pp/disagg”; standardize product names.

-  - Disable add special tokens for Llama3.3 70B
+  - Disable add_special_tokens for Llama 3.3 70B
-  - Auto-enable ngram with concurrency <= 32
+  - Auto‑enable n‑gram with concurrency ≤ 32
-  - Add support for disaggregation with pp with pytorch backend
+  - Add support for disaggregation with pipeline parallelism (PP) with the PyTorch backend
-  - Support chunked prefill on spec decode 2 model
+  - Support chunked prefill on Speculative Decode v2 models
-  - Add KV cache reuse support for multimodal models 
+  - Add KV cache reuse support for multimodal models
-  - Add support for two-model engine KV cache reuse
+  - Add support for two‑model engine KV cache reuse
-  - Add Eagle-3 support for qwen3 dense model
+  - Add Eagle‑3 support for Qwen3 dense model
-  - Support n-gram speculative decoding with disagg
+  - Support n‑gram speculative decoding with disaggregated serving
-  - Support disaggregated serving in TRTLLM Sampler
+  - Support disaggregated serving in the TRTLLM sampler
-  - Support TRTLLM_DEEP_EP_TOKEN_LIMIT to allow run deep-ep on memory-constrained GPUs
+  - Support TRTLLM_DEEP_EP_TOKEN_LIMIT to allow running DeepEP on memory‑constrained GPUs
-  - Add request_perf_metrics to triton LLMAPI backend 
+  - Add request_perf_metrics to the Triton LLM API backend

Also applies to: 71-81


83-95: Benchmark: fix “PyTorch” casing.

One visible nit.

-  - Update trtllm-bench to support new Pytorch default
+  - Update trtllm-bench to support new PyTorch default

116-119: Minor: trailing space + consistency.

Tighten phrasing and remove stray space.

- - Remove batch_manager::KvCacheConfig and use executor::KvCacheConfig instead 
+ - Remove batch_manager::KvCacheConfig and use executor::KvCacheConfig instead
- - Remove TrtGptModelOptionalParams 
+ - Remove TrtGptModelOptionalParams
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0fe01e1 and 93d5840.

📒 Files selected for processing (1)
  • docs/source/release-notes.md (1 hunks)
🧰 Additional context used
🧠 Learnings (12)
📓 Common learnings
Learnt from: farshadghodsian
PR: NVIDIA/TensorRT-LLM#7101
File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36
Timestamp: 2025-08-21T00:16:56.457Z
Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.
📚 Learning: 2025-08-21T00:16:56.457Z
Learnt from: farshadghodsian
PR: NVIDIA/TensorRT-LLM#7101
File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36
Timestamp: 2025-08-21T00:16:56.457Z
Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-27T17:50:13.264Z
Learnt from: venkywonka
PR: NVIDIA/TensorRT-LLM#6029
File: .github/pull_request_template.md:45-53
Timestamp: 2025-08-27T17:50:13.264Z
Learning: For PR templates in TensorRT-LLM, avoid suggesting changes that would increase developer overhead, such as converting plain bullets to mandatory checkboxes. The team prefers guidance-style bullets that don't require explicit interaction to reduce friction in the PR creation process.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-22T01:54:35.850Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h:999-1000
Timestamp: 2025-08-22T01:54:35.850Z
Learning: The `internal_cutlass_kernels` directory in TensorRT-LLM is a mirror of an internal NVIDIA repository and maintains its own implementation and API that may diverge from the public `cutlass_kernels` version. API inconsistencies between these two directories are intentional and by design, not bugs to be fixed.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-11T20:09:24.389Z
Learnt from: achartier
PR: NVIDIA/TensorRT-LLM#6763
File: tests/integration/defs/triton_server/conftest.py:16-22
Timestamp: 2025-08-11T20:09:24.389Z
Learning: In the TensorRT-LLM test infrastructure, the team prefers simple, direct solutions (like hard-coding directory traversal counts) over more complex but robust approaches when dealing with stable directory structures. They accept the maintenance cost of updating tests if the layout changes.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-14T15:38:01.771Z
Learnt from: MatthiasKohl
PR: NVIDIA/TensorRT-LLM#6904
File: cpp/tensorrt_llm/pybind/thop/bindings.cpp:55-57
Timestamp: 2025-08-14T15:38:01.771Z
Learning: In TensorRT-LLM Python bindings, tensor parameter collections like mla_tensor_params and spec_decoding_tensor_params are kept as required parameters without defaults to maintain API consistency, even when it might affect backward compatibility.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-07-17T09:01:27.402Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-21T02:39:12.009Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:12.009Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-15T06:46:53.813Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.813Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.

Applied to files:

  • docs/source/release-notes.md
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
docs/source/release-notes.md (1)

174-177: Known Issues: LGTM.

Clear and actionable. Thanks for calling out the cuda-python pin and the PP+KV reuse hang.

@nv-guomingz
Collaborator Author

@nv-guomingz @juney-nvidia where do we want to talk about LLM API stability?

@chzblych chzblych added the Release Blocker label (PRs that block the final release build or branching out the release branch) on Sep 12, 2025
@tensorrt-cicd
Collaborator

PR_Github #18875 [ skip ] triggered by Bot

@nv-guomingz nv-guomingz enabled auto-merge (squash) September 17, 2025 03:25
@tensorrt-cicd
Collaborator

PR_Github #18875 [ skip ] completed with state SUCCESS
Skipping testing for commit 6673690

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (4)
docs/source/release-notes.md (4)

13-17: Polish model names and phrasing in “Model Support”.

Consistency and clarity tweaks.

-  - Add Mistral3.1 VLM model support
-  - Add TensorRT-Engine Qwen3 (dense) model support
+  - Add Mistral 3.1 VLM model support
+  - Add Qwen3 (dense) model support for the TensorRT engine
-  - Add phi-4-multimodal model support
+  - Add Phi-4‑Multimodal model support
-  - Add Qwen3 MoE support to TensorRT backend
+  - Add Qwen3 MoE support in the TensorRT backend

20-35: Fix typos/capitalization in “Features”.

User-facing; keep terms canonical.

-  - Add support for sm121
+  - Add support for SM121
-  - Add LoRA support for PyTorch backend in trtllm-serve 
+  - Add LoRA support for the PyTorch backend in trtllm-serve
-  - Add support of scheduling attention dp request
+  - Add scheduling support for attention DP requests
-  - Support torch compile for attention dp
+  - Support torch.compile for attention DP
-  - Add KV events support for sliding window attention
+  - Add KV events support for sliding‑window attention
-  - Add TRTLLM MoE nvfp4 cubins for mid-high concurrency; attention_dp for TRTLLM MoE
+  - Add TRTLLM MoE NVFP4 CUBINs for mid‑ to high‑concurrency; attention DP for TRTLLM MoE
-  - Add Piecewise CUDA Graph support for MLA
+  - Add piecewise CUDA Graph support for MLA
-  - Support mutliCtasKvMode for high-throughput MLA kernels
+  - Support multiCtasKvMode for high‑throughput MLA kernels
-  - Enable kvcache to be reused during request generation
+  - Enable KV cache reuse during request generation
-  - Enable Multi-block mode for Hopper spec dec XQA kernel
+  - Enable multi‑block mode for Hopper speculative‑decoding XQA kernel

112-120: API names and wording consistency.

- **BREAKING CHANGE** Unify KvCacheConfig in LLM class for pytorch backend
+ **BREAKING CHANGE** Unify KvCacheConfig in LLM class for the PyTorch backend
- Add back allreduce_strategy parameter into TorchLlmArgs
+ Add back allreduce_strategy parameter to TorchLlmArgs
- Add LLmArgs option to force using dynamic quantization
+ Add LlmArgs option to force using dynamic quantization
- Add request_perf_metrics to LLMAPI
+ Add request_perf_metrics to the LLM API

126-156: Proofread “Fixed Issues” for capitalization and clarity.

Representative fixes below; please sweep the whole block similarly.

- - Fix nemotronNAS loading for TP>1 (#6447)
+ - Fix NemotronNAS loading for TP>1 (#6447)
- - Fallback to cubins for fp8 fmha kernels on Ada (#5779)
+ - Fallback to CUBINs for FP8 FMHA kernels on Ada (#5779)
- - Fix disagg + speculative decoding (#5558)
+ - Fix disaggregated serving + speculative decoding (#5558)
- - Fix llama4 multimodal support (#5809)
+ - Fix Llama 4 multimodal support (#5809)
- - Fix the illegal memory access issue in moe gemm on SM120 (#5636)
+ - Fix illegal memory access in MoE GEMM on SM120 (#5636)
- - Fix tileN cannot % 16==0 & support sm89 deepgemm bmm (#5531)
+ - Fix tileN alignment requirement (tileN % 16 == 0); add SM89 DeepGEMM BMM support (#5531)
🧹 Nitpick comments (4)
docs/source/release-notes.md (4)

41-45: Normalize brand/casing (PyTorch, ModelOpt, DeepEP).

-  - Add support for MXFP8xMXFP4 in pytorch
+  - Add support for MXFP8×MXFP4 in PyTorch
-  - Add support for Modelopt fp8_pb_wo quantization scheme
+  - Add support for ModelOpt fp8_pb_wo quantization scheme
-  - Support deepEP fp4 post quant all2all dispatch
+  - Support DeepEP FP4 post‑quant All‑to‑All dispatch

58-58: Grammar: disaggregation + PP + backend.

-  - Add support for disaggregation with pp with pytorch backend
+  - Add support for disaggregation with pipeline parallelism (PP) in the PyTorch backend

62-62: Trim trailing whitespace.

-  - Add KV cache reuse support for multimodal models 
+  - Add KV cache reuse support for multimodal models

121-121: Trim trailing whitespace.

- - Remove batch_manager::KvCacheConfig and use executor::KvCacheConfig instead 
+ - Remove batch_manager::KvCacheConfig and use executor::KvCacheConfig instead
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6673690 and 1aebc2c.

📒 Files selected for processing (1)
  • docs/source/release-notes.md (1 hunks)
🧰 Additional context used
🧠 Learnings (13)
📓 Common learnings
Learnt from: farshadghodsian
PR: NVIDIA/TensorRT-LLM#7101
File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36
Timestamp: 2025-08-21T00:16:56.457Z
Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.
📚 Learning: 2025-08-21T00:16:56.457Z
Learnt from: farshadghodsian
PR: NVIDIA/TensorRT-LLM#7101
File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36
Timestamp: 2025-08-21T00:16:56.457Z
Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-27T17:50:13.264Z
Learnt from: venkywonka
PR: NVIDIA/TensorRT-LLM#6029
File: .github/pull_request_template.md:45-53
Timestamp: 2025-08-27T17:50:13.264Z
Learning: For PR templates in TensorRT-LLM, avoid suggesting changes that would increase developer overhead, such as converting plain bullets to mandatory checkboxes. The team prefers guidance-style bullets that don't require explicit interaction to reduce friction in the PR creation process.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-22T01:54:35.850Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h:999-1000
Timestamp: 2025-08-22T01:54:35.850Z
Learning: The `internal_cutlass_kernels` directory in TensorRT-LLM is a mirror of an internal NVIDIA repository and maintains its own implementation and API that may diverge from the public `cutlass_kernels` version. API inconsistencies between these two directories are intentional and by design, not bugs to be fixed.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-11T20:09:24.389Z
Learnt from: achartier
PR: NVIDIA/TensorRT-LLM#6763
File: tests/integration/defs/triton_server/conftest.py:16-22
Timestamp: 2025-08-11T20:09:24.389Z
Learning: In the TensorRT-LLM test infrastructure, the team prefers simple, direct solutions (like hard-coding directory traversal counts) over more complex but robust approaches when dealing with stable directory structures. They accept the maintenance cost of updating tests if the layout changes.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-14T15:38:01.771Z
Learnt from: MatthiasKohl
PR: NVIDIA/TensorRT-LLM#6904
File: cpp/tensorrt_llm/pybind/thop/bindings.cpp:55-57
Timestamp: 2025-08-14T15:38:01.771Z
Learning: In TensorRT-LLM Python bindings, tensor parameter collections like mla_tensor_params and spec_decoding_tensor_params are kept as required parameters without defaults to maintain API consistency, even when it might affect backward compatibility.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-07-17T09:01:27.402Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-27T14:23:55.566Z
Learnt from: ixlmar
PR: NVIDIA/TensorRT-LLM#7294
File: tensorrt_llm/_torch/modules/rms_norm.py:17-17
Timestamp: 2025-08-27T14:23:55.566Z
Learning: The TensorRT-LLM project requires Python 3.10+ as evidenced by the use of TypeAlias from typing module, match/case statements, and union type | syntax throughout the codebase, despite some documentation still mentioning Python 3.8+.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-21T02:39:12.009Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:12.009Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-15T06:46:53.813Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.813Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.

Applied to files:

  • docs/source/release-notes.md
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (2)
docs/source/release-notes.md (2)

98-101: Docs section reads well — nice clarity on stability.


102-104: Use published TritonServer trtllm-python image (tags confirmed)

Both NGC tags nvcr.io/nvidia/pytorch:25.06-py3 and nvcr.io/nvidia/tritonserver:25.06-trtllm-python-py3 are public; update the backend base image to nvcr.io/nvidia/tritonserver:25.06-trtllm-python-py3.

 - The base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:25.06-py3`.
-- The base Docker image for TensorRT-LLM Backend is updated to `nvcr.io/nvidia/tritonserver:25.06-py3`.
+- The base Docker image for TensorRT-LLM Backend is updated to `nvcr.io/nvidia/tritonserver:25.06-trtllm-python-py3`.

nv-guomingz and others added 4 commits September 17, 2025 14:18
Add relnote for multi-node single GPU per node exception

Signed-off-by: Sharan Chetlur <[email protected]>
Signed-off-by: nv-guomingz <[email protected]>
@nv-guomingz
Collaborator Author

/bot skip --comment "docs only changed"

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (2)
docs/source/release-notes.md (2)

13-17: Polish model names and phrasing (Model Support).

Consistency/readability fixes.

-  - Add Mistral3.1 VLM model support
-  - Add TensorRT-Engine Qwen3 (dense) model support
+  - Add Mistral 3.1 VLM model support
+  - Add Qwen3 (dense) model support for the TensorRT engine
-  - Add phi-4-multimodal model support
+  - Add Phi-4-Multimodal model support
-  - Add EXAONE 4.0 model support
+  - Add EXAONE 4.0 model support
-  - Add Qwen3 MoE support to TensorRT backend
+  - Add Qwen3 MoE support in the TensorRT backend

103-104: Use a published TritonServer image tag.

Replace the generic 25.06-py3 with the published 25.06-trtllm-python-py3 tag.

- - The base Docker image for TensorRT-LLM Backend is updated to `nvcr.io/nvidia/tritonserver:25.06-py3`.
+ - The base Docker image for TensorRT-LLM Backend is updated to `nvcr.io/nvidia/tritonserver:25.06-trtllm-python-py3`.

Confirmed tags: `nvcr.io/nvidia/pytorch:25.06-py3` and `nvcr.io/nvidia/tritonserver:25.06-trtllm-python-py3` (both public on NGC).

🧹 Nitpick comments (6)
docs/source/release-notes.md (6)

9-9: Product name/style in intro.

Use project name consistently and avoid numerals in prose.

- TensorRT LLM 1.0 brings 2 major changes: the PyTorch-based architecture is now stable and the default experience, and the LLM API is now stable. For more details on new developments in 1.0, please see below.
+ TensorRT-LLM 1.0 brings two major changes: the PyTorch-based architecture is now stable and the default experience, and the LLM API is now stable. For more details on new developments in 1.0, please see below.

20-41: Typos/capitalization/terminology pass (Features, part 1).

Standardize acronyms, API names, and grammar.

-  - Add support for sm121
+  - Add support for SM121
   - Add LoRA support for Gemma3
   - Support PyTorch LoRA adapter eviction
-  - Add LoRA support for PyTorch backend in trtllm-serve 
-  - Add support of scheduling attention dp request
+  - Add LoRA support for the PyTorch backend in trtllm-serve
+  - Add support for scheduling attention DP requests
   - Remove padding of FusedMoE in attention DP
-  - Support torch compile for attention dp
-  - Add KV events support for sliding window attention
-  - Add TRTLLM MoE nvfp4 cubins for mid-high concurrency; attention_dp for TRTLLM MoE
-  - Add Piecewise CUDA Graph support for MLA
-  - Support mutliCtasKvMode for high-throughput MLA kernels
-  - Enable kvcache to be reused during request generation
+  - Support torch.compile for attention DP
+  - Add KV events support for sliding-window attention
+  - Add TRTLLM MoE NVFP4 CUBINs for mid‑ to high‑concurrency; attention DP for TRTLLM MoE
+  - Add piecewise CUDA Graph support for MLA
+  - Support multiCtasKvMode for high‑throughput MLA kernels
+  - Enable KV cache to be reused during request generation
   - Add ADP schedule balance optimization
   - Add chunked prefill support for MLA (Blackwell)
-  - Enable Multi-block mode for Hopper spec dec XQA kernel
+  - Enable multi‑block mode for Hopper speculative‑decoding XQA kernel
   - Add vLLM KV Pool support for XQA kernel
   - Allow sending more than 2GiB through MPI by using mpi4py.util.pkl5
   - Add support for fused gate_up_proj scales for FP8 blockwise
-  - Support FP8 row-wise dense GEMM in torch flow
-  - Enable fp8 SwiGLU to minimize host overhead
-  - Add Deepseek R1 FP8 Support on Blackwell
-  - Add support for MXFP8xMXFP4 in pytorch
+  - Support FP8 row‑wise dense GEMM in the PyTorch workflow
+  - Enable FP8 SwiGLU to minimize host overhead
+  - Add DeepSeek R1 FP8 support on Blackwell
+  - Add support for MXFP8xMXFP4 in PyTorch

Also applies to: 43-52
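
On the mpi4py.util.pkl5 bullet in the diff above, a small sketch of the drop-in communicator wrapper that lifts the roughly 2 GiB pickle message limit follows; the payload here is a tiny illustrative stand-in for the multi-GiB objects the feature targets.

```python
# Sketch: broadcasting large Python objects over MPI via mpi4py.util.pkl5,
# which uses pickle protocol 5 out-of-band buffers to avoid the ~2 GiB limit.
from mpi4py import MPI
from mpi4py.util import pkl5

comm = pkl5.Intracomm(MPI.COMM_WORLD)  # drop-in wrapper around the default comm
rank = comm.Get_rank()

# Illustrative payload; in practice this could be a multi-GiB tensor or state dict.
payload = {"rank": 0, "blob": bytearray(1024)} if rank == 0 else None
payload = comm.bcast(payload, root=0)  # same API as a regular communicator
print(f"rank {rank} received blob of {len(payload['blob'])} bytes")
```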


42-57: Typos/capitalization/terminology pass (Features, part 2).

Continue consistency fixes.

-  - Support nvfp4 model and fp8 kv cache for MLA chunked prefill (Blackwell)
-  - Opensource MOE MXFP8-MXFP4 implementation
-  - Add support for Modelopt fp8_pb_wo quantization scheme
-  - Support deepEP fp4 post quant all2all dispatch
+  - Support NVFP4 model and FP8 KV cache for MLA chunked prefill (Blackwell)
+  - Open‑source MOE MXFP8‑MXFP4 implementation
+  - Add support for ModelOpt fp8_pb_wo quantization scheme
+  - Support DeepEP FP4 post‑quant AllToAll dispatch
   - Fuse w4a8 moe pre-quant scale on Hopper
   - Support Weight-Only-Quantization in PyTorch Workflow
-  - Add support for per expert activation scaling factors
+  - Add support for per‑expert activation scaling factors
   - Add ReDrafter support for Qwen
   - Enable CUDA Graph for Nemotron-H
   - Add support for YARN in NemotronNAS models
   - Switch to internal version of MMProjector in Gemma3
-  - Disable add special tokens for Llama3.3 70B
-  - Auto-enable ngram with concurrency <= 32
+  - Disable adding special tokens for Llama 3.3 70B
+  - Auto‑enable n‑gram with concurrency <= 32
   - Support turning on/off spec decoding dynamically
   - Support structural tag in C++ runtime and upgrade xgrammar to 0.1.21
   - Add support for external multimodal embeddings
-  - Add support for disaggregation with pp with pytorch backend
+  - Add support for disaggregation with pipeline parallelism (PP) with the PyTorch backend
   - Add status tags to LLM API reference
   - Support JSON Schema in OpenAI-Compatible API
   - Support chunked prefill on spec decode 2 model
-  - Add KV cache reuse support for multimodal models 
+  - Add KV cache reuse support for multimodal models
   - Support nanobind bindings
   - Add support for two-model engine KV cache reuse
-  - Add Eagle-3 support for qwen3 dense model
+  - Add Eagle‑3 support for Qwen3 dense model
   - Migrate Eagle-3 and draft/target speculation to Drafter
   - Enable guided decoding with overlap scheduler
-  - Support n-gram speculative decoding with disagg
-  - Add beam search support to the PyTorch Workflow
-  - Add LLGuidance Support for PyTorch Backend
-  - Add NGrams V2 support
+  - Support n‑gram speculative decoding with disaggregated serving
+  - Add beam search support to the PyTorch workflow
+  - Add LLGuidance support for PyTorch backend
+  - Add n‑grams v2 support
   - Add MTP support for Online EPLB
   - Support disaggregated serving in TRTLLM Sampler
   - Add core infrastructure to enable loading of custom checkpoint formats
-  - Support TRTLLM_DEEP_EP_TOKEN_LIMIT to allow run deep-ep on memory-constrained GPUs
-  - Use huge page mapping for host accessible memory on GB200
+  - Support TRTLLM_DEEP_EP_TOKEN_LIMIT to allow running DeepEP on memory-constrained GPUs
+  - Use huge pages for host-accessible memory on GB200
   - Add user-provided speculative decoding support
   - Add streaming scaffolding_llm.generate_async support
   - Detokenize option in /v1/completions request
-  - Integrate TRT-LLM Gen FP4 block scale MoE with Pytorch workflow kernel autotuner
-  - Remove support for llmapi + TRT backend in Triton
-  - Add request_perf_metrics to triton LLMAPI backend 
+  - Integrate TRT-LLM Gen FP4 block scale MoE with PyTorch workflow kernel autotuner
+  - Remove support for LLM API + TensorRT backend in Triton
+  - Add request_perf_metrics to the Triton LLM API backend
   - Add support for Triton request cancellation

Also applies to: 58-66, 68-83
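
Among the items above, "Support JSON Schema in OpenAI-Compatible API" is the one most readers will exercise directly. The sketch below is a hedged example, not code from this PR: it assumes a TRT-LLM OpenAI-compatible server is already running locally, and the base URL, model name, and schema are placeholder assumptions that follow the standard OpenAI response_format shape.

```python
# Hedged sketch: requesting schema-constrained JSON from an
# OpenAI-compatible endpoint. Base URL, model name, and schema are
# placeholder assumptions; the request shape follows the OpenAI spec.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["city", "population"],
}

response = client.chat.completions.create(
    model="placeholder-model",  # replace with the served model name
    messages=[{"role": "user", "content": "Name a city and its population as JSON."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "city_info", "schema": schema},
    },
)
print(response.choices[0].message.content)
```

If the server exposes the structured-output option under a different field, only the response_format argument should need to change.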


93-95: Spell/case “PyTorch”.

-  - Update trtllm-bench to support new Pytorch default
+  - Update trtllm-bench to support new PyTorch default

112-121: API naming/casing consistency (PyTorch, LlmArgs, LLM API).

- - **BREAKING CHANGE** Unify KvCacheConfig in LLM class for pytorch backend
+ - **BREAKING CHANGE** Unify KvCacheConfig in LLM class for the PyTorch backend
- - Add back allreduce_strategy parameter into TorchLlmArgs
+ - Add back allreduce_strategy parameter to TorchLlmArgs
- - Add LLmArgs option to force using dynamic quantization
+ - Add LlmArgs option to force using dynamic quantization
- - Add request_perf_metrics to LLMAPI
+ - Add request_perf_metrics to the LLM API
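
Since the KvCacheConfig unification above is a breaking change, a minimal usage sketch may help reviewers picture the new shape. The example below is an illustration under stated assumptions rather than code from this PR: the model path and numeric values are placeholders, and field availability should be checked against the LLM API reference for release/1.0.

```python
# Minimal sketch (placeholders, not from this PR): passing the unified
# KvCacheConfig to the LLM class when using the PyTorch backend.
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheConfig

kv_cache_config = KvCacheConfig(
    free_gpu_memory_fraction=0.85,  # fraction of free GPU memory reserved for KV cache
    enable_block_reuse=True,        # reuse cached KV blocks across requests
)

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder checkpoint
    kv_cache_config=kv_cache_config,
)

outputs = llm.generate(
    ["Summarize what a KV cache does in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```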

126-177: Proofread “Fixed Issues” for proper names and clearer phrasing.

Selected fixes below; apply similar casing (acronyms, model names, SMxx) across the list.

- - Fix nemotronNAS loading for TP>1 (#6447)
+ - Fix NemotronNAS loading for TP>1 (#6447)
- - Fix eagle3 two model disaggregated serving test (#6014)
+ - Fix Eagle-3 two-model disaggregated serving test (#6014)
- - Fix GEMM+AR fusion on blackwell (#5563)
+ - Fix GEMM+AR fusion on Blackwell (#5563)
- - Fix llama4 multimodal support (#5809)
+ - Fix Llama 4 multimodal support (#5809)
- - Fix Llama4 Scout FP4 crash issue (#5925)
+ - Fix Llama 4 Scout FP4 crash issue (#5925)
- - Fix the illegal memory access issue in moe gemm on SM120 (#5636)
+ - Fix the illegal memory access issue in MoE GEMM on SM120 (#5636)
- - Fix tileN cannot % 16==0 & support sm89 deepgemm bmm (#5531)
+ - Fix the case where tileN is not divisible by 16; add SM89 DeepGEMM BMM support (#5531)
- - Fix mistral unit tests due to transformers upgrade (#5904)
+ - Fix Mistral unit tests due to transformers upgrade (#5904)
- - Fix alltoall for llama4 (apply_router_weight_on_input=True) (#5902)
+ - Fix AllToAll for Llama 4 (apply_router_weight_on_input=True) (#5902)
- - Fix cancel request logic (#5800)
+ - Fix request cancellation logic (#5800)
- - Fix broken cyclic reference detect (#5417) 
+ - Fix broken cyclic reference detection (#5417)
- - Fix permission for local user issues in NGC docker container. (#5373)
+ - Fix permission issues for local users in the NGC Docker container (#5373)
- - Fix mPtrExpertCounts allocation in MoE TRT-LLM backend (nvfp4) (#5519) 
+ - Fix mPtrExpertCounts allocation in MoE TRT-LLM backend (NVFP4) (#5519)
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1aebc2c and c5d0c92.

📒 Files selected for processing (1)
  • docs/source/release-notes.md (1 hunks)
🧰 Additional context used
🧠 Learnings (13)
📓 Common learnings
Learnt from: farshadghodsian
PR: NVIDIA/TensorRT-LLM#7101
File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36
Timestamp: 2025-08-21T00:16:56.457Z
Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.
📚 Learning: 2025-08-27T17:50:13.264Z
Learnt from: venkywonka
PR: NVIDIA/TensorRT-LLM#6029
File: .github/pull_request_template.md:45-53
Timestamp: 2025-08-27T17:50:13.264Z
Learning: For PR templates in TensorRT-LLM, avoid suggesting changes that would increase developer overhead, such as converting plain bullets to mandatory checkboxes. The team prefers guidance-style bullets that don't require explicit interaction to reduce friction in the PR creation process.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-22T01:54:35.850Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h:999-1000
Timestamp: 2025-08-22T01:54:35.850Z
Learning: The `internal_cutlass_kernels` directory in TensorRT-LLM is a mirror of an internal NVIDIA repository and maintains its own implementation and API that may diverge from the public `cutlass_kernels` version. API inconsistencies between these two directories are intentional and by design, not bugs to be fixed.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-11T20:09:24.389Z
Learnt from: achartier
PR: NVIDIA/TensorRT-LLM#6763
File: tests/integration/defs/triton_server/conftest.py:16-22
Timestamp: 2025-08-11T20:09:24.389Z
Learning: In the TensorRT-LLM test infrastructure, the team prefers simple, direct solutions (like hard-coding directory traversal counts) over more complex but robust approaches when dealing with stable directory structures. They accept the maintenance cost of updating tests if the layout changes.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-14T15:38:01.771Z
Learnt from: MatthiasKohl
PR: NVIDIA/TensorRT-LLM#6904
File: cpp/tensorrt_llm/pybind/thop/bindings.cpp:55-57
Timestamp: 2025-08-14T15:38:01.771Z
Learning: In TensorRT-LLM Python bindings, tensor parameter collections like mla_tensor_params and spec_decoding_tensor_params are kept as required parameters without defaults to maintain API consistency, even when it might affect backward compatibility.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-21T00:16:56.457Z
Learnt from: farshadghodsian
PR: NVIDIA/TensorRT-LLM#7101
File: docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md:36-36
Timestamp: 2025-08-21T00:16:56.457Z
Learning: TensorRT-LLM container release tags in documentation should only reference published NGC container images. The README badge version may be ahead of the actual published container versions.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-07-17T09:01:27.402Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-27T14:23:55.566Z
Learnt from: ixlmar
PR: NVIDIA/TensorRT-LLM#7294
File: tensorrt_llm/_torch/modules/rms_norm.py:17-17
Timestamp: 2025-08-27T14:23:55.566Z
Learning: The TensorRT-LLM project requires Python 3.10+ as evidenced by the use of TypeAlias from typing module, match/case statements, and union type | syntax throughout the codebase, despite some documentation still mentioning Python 3.8+.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-21T02:39:12.009Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:12.009Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • docs/source/release-notes.md
📚 Learning: 2025-08-15T06:46:53.813Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.813Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.

Applied to files:

  • docs/source/release-notes.md
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (3)
docs/source/release-notes.md (3)

98-101: Docs improvements LGTM.

Nice clarity about PyTorch workflow focus and API stability guarantees.


105-108: Version alignment check (ModelOpt/transformers).

Ensure these match release/1.0 requirements and CI images.

  • ModelOpt “0.33” looks current per upstream release notes (0.33.x). (github.com)
  • transformers “4.53.1” was acknowledged in prior review comments.

If any drift exists, please sync the release notes with the pinned versions in release/1.0.


178-181: Known Issues scope: confirm PP divisibility item.

A prior suggestion proposed documenting “model num_layers must be divisible by PP” as a known issue, with a note that it’s fixed on main (PR #7117). Please confirm whether the fix is included in release/1.0; if not, consider adding the bullet here, otherwise omit to avoid confusion.

@tensorrt-cicd
Copy link
Collaborator

PR_Github #18908 [ skip ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #18908 [ skip ] completed with state SUCCESS
Skipping testing for commit c5d0c92

@nv-guomingz nv-guomingz merged commit de66d37 into NVIDIA:release/1.0 Sep 17, 2025
5 checks passed
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Sep 17, 2025
Signed-off-by: nv-guomingz <[email protected]>
Signed-off-by: pcastonguay <[email protected]>
Signed-off-by: Sharan Chetlur <[email protected]>
Co-authored-by: pcastonguay <[email protected]>
Co-authored-by: Sharan Chetlur <[email protected]>
chzblych pushed a commit that referenced this pull request Sep 22, 2025
Signed-off-by: nv-guomingz <[email protected]>
Signed-off-by: pcastonguay <[email protected]>
Signed-off-by: Sharan Chetlur <[email protected]>
Co-authored-by: pcastonguay <[email protected]>
Co-authored-by: Sharan Chetlur <[email protected]>
Signed-off-by: Wangshanshan <[email protected]>
JunyiXu-nv pushed a commit to JunyiXu-nv/TensorRT-LLM that referenced this pull request Sep 22, 2025
Signed-off-by: nv-guomingz <[email protected]>
Signed-off-by: pcastonguay <[email protected]>
Signed-off-by: Sharan Chetlur <[email protected]>
Co-authored-by: pcastonguay <[email protected]>
Co-authored-by: Sharan Chetlur <[email protected]>
Signed-off-by: Wangshanshan <[email protected]>
nv-lschneider pushed a commit to nv-lschneider/TensorRT-LLM that referenced this pull request Sep 22, 2025
Signed-off-by: nv-guomingz <[email protected]>
Signed-off-by: pcastonguay <[email protected]>
Signed-off-by: Sharan Chetlur <[email protected]>
Co-authored-by: pcastonguay <[email protected]>
Co-authored-by: Sharan Chetlur <[email protected]>
Signed-off-by: Wangshanshan <[email protected]>
@nv-guomingz nv-guomingz deleted the user/guomingz/1.0_rn branch September 30, 2025 08:01
Labels
Release Blocker: PRs that block the final release build or branching out of the release branch