[TRTLLM-6994][feat] FP8 Context MLA integration (Cherry-pick https://github.com/NVIDIA/TensorRT-LLM/pull/6059 from release/1.1.0rc2) #7610

yuxianq · 2025-09-08T08:56:36Z

Summary by CodeRabbit

New Features
- Enable MLA with FP8-context attention, unlocking additional configuration options.
- Expand FP8 MLA support to more GPU architectures.
Bug Fixes
- Correct output dtype selection in FP8 MLA paths for better compatibility.
- Add a clear assertion when prompt length exceeds the configured maximum.
Refactor
- Centralize quantization flags and scale propagation for attention, improving consistency across execution paths.
Chores
- Improve initialization error logs with deeper tracebacks.
- Enrich kernel diagnostic messages with detailed dtype and architecture info.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option is always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Supported backends: [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

yuxianq · 2025-09-08T08:57:04Z

/bot run --disable-fail-fast

tensorrt-cicd · 2025-09-08T09:03:04Z

PR_Github #18020 [ run ] triggered by Bot

coderabbitai · 2025-09-08T09:06:49Z

📝 Walkthrough

Enables FP8-context MLA by relaxing checks and removing forced output dtype; adds mFP8ContextMLA and KV cache quant mode plumbing; switches Runner allocations to shared_ptr; expands FMHA kernel hash info; centralizes quant scale/out_scale handling in PyTorch Attention; increases initialization traceback depth; adds a max prompt length assertion; updates tests with Hopper gating and MOE backend selection.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| AttentionOp core (MLA gating & dtype): `cpp/tensorrt_llm/common/attentionOp.cpp`, `cpp/tensorrt_llm/common/attentionOp.h` | Allows MLA with FP8-context FMHA by only forbidding dense FMHA; removes forced E4M3 output dtype for FP8-context MLA; extends AttentionOp::data() tuple to include `mFP8ContextMLA`. |
| THOP AttentionOp wiring & allocation: `cpp/tensorrt_llm/thop/attentionOp.cpp` | Replaces raw new/reset with `std::make_shared` for Runner creation; introduces and initializes `mKVCacheQuantMode`; broadens `mFP8ContextMLA` SM gating (SM 100 or 120); removes duplicate KV quant mode in nvfp4 path. |
| FMHA kernel info string: `cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/fmhaKernels.h` | Extends `hashFromRunnerParams` info string to include `dtypeQ`, `dtypeKv`, `dtypeOut`, and `sm` along with `qkvLayout`. |
| PyTorch Attention quant plumbing: `tensorrt_llm/_torch/modules/attention.py` | Adds `has_quant_scale` and `out_scale` (Attention/MLA); ensures `o_proj.create_weights()` is called; centralizes FP8/NVFP4 gating via `has_quant_scale`; threads `self.out_scale` through all forward paths. |
| Executor engine logging: `tensorrt_llm/_torch/pyexecutor/model_engine.py` | Increases traceback depth in fallback init logging from 1 to 10 frames. |
| Worker prompt length check: `tensorrt_llm/executor/worker.py` | Adds assertion in `_deduce_max_tokens` that `len(prompt_token_ids) <= executor_config.max_seq_len`. |
| Integration tests (gating + MOE backend): `tests/integration/defs/accuracy/test_llm_api_pytorch.py` | Replaces `@skip_no_hopper`/device-specific skips with `@skip_pre_hopper`; injects `MoeConfig` with backend `"DEEPGEMM"` for SM ≥ 100 else `"CUTLASS"` into PyTorch configs. |
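
For readers skimming the change list, the worker prompt length check can be pictured roughly as below. This is a standalone sketch, not the code in `worker.py`: the function name `deduce_max_tokens`, its parameters, and the assertion message are assumptions made for illustration. Only the comparison against `max_seq_len` and the `cp_size` split of the prompt are taken from this thread.

```python
def deduce_max_tokens(prompt_token_ids, max_seq_len, cp_size=1, query_token_len=0):
    """Hypothetical stand-in for the worker's max-token deduction."""
    prompt_len = len(prompt_token_ids)
    # Fail early with a readable message instead of deriving a negative budget.
    assert prompt_len <= max_seq_len, (
        f"prompt length ({prompt_len}) exceeds max_seq_len ({max_seq_len})")
    # With context parallelism, each rank holds roughly 1/cp_size of the prompt.
    split_prompt_len = prompt_len // cp_size
    return max_seq_len - split_prompt_len - query_token_len
```

With a 4-token prompt and `max_seq_len=10`, the sketch leaves a budget of 6 tokens; an 11-token prompt trips the assertion before any arithmetic happens.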

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant User
  participant TorchAttention as Attention (PyTorch)
  participant OProj as o_proj
  participant Backend as TRT-LLM AttentionOp

  Note over TorchAttention,OProj: Weight creation and quant flag discovery
  User->>TorchAttention: create_weights()
  TorchAttention->>OProj: create_weights()
  OProj-->>TorchAttention: returns (quant flags, out_scale)
  TorchAttention->>TorchAttention: has_quant_scale = (FP8/NVFP4 flags)\nout_scale = o_proj.out_scale

  Note over TorchAttention,Backend: Forward path with centralized out_scale
  User->>TorchAttention: forward(...)
  TorchAttention->>Backend: attention(..., out_scale=self.out_scale)
  Backend-->>TorchAttention: outputs
  TorchAttention-->>User: result
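
The flow in this diagram can be mimicked with plain Python objects. The sketch below is only a toy model under stated assumptions: `ToyLinear`, `ToyAttention`, the `quant_algo` labels, and the placeholder scale value are all hypothetical and do not mirror the real `Linear` or `Attention` classes. It shows the one idea the walkthrough describes: quant flags and `out_scale` are derived once at weight creation and then reused by every forward call.

```python
class ToyLinear:
    """Hypothetical stand-in for the output projection (o_proj)."""

    def __init__(self, quant_algo=None):
        self.quant_algo = quant_algo  # None, "FP8", or "NVFP4" (illustrative labels)
        self.out_scale = None

    def create_weights(self):
        # Quantized projections expose a scale; unquantized ones leave it as None.
        if self.quant_algo in ("FP8", "NVFP4"):
            self.out_scale = 0.5  # placeholder value for illustration only


class ToyAttention:
    """Hypothetical attention module holding the centralized quant state."""

    def __init__(self, o_proj):
        self.o_proj = o_proj
        self.has_quant_scale = False
        self.out_scale = None

    def create_weights(self):
        # Discover quant flags once, after o_proj has created its weights.
        self.o_proj.create_weights()
        self.has_quant_scale = self.o_proj.quant_algo in ("FP8", "NVFP4")
        self.out_scale = self.o_proj.out_scale if self.has_quant_scale else None

    def forward(self, q, backend):
        # Every forward path passes the same centrally stored scale.
        return backend(q, out_scale=self.out_scale)
```

Keeping the flag and the scale on the module means individual forward paths no longer need to re-derive them, which is the consistency benefit the summary mentions.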

sequenceDiagram
  autonumber
  participant Host as Host Init
  participant AttnOp as AttentionOp
  participant Runner as FMHA/MLA Runner

  Note over AttnOp: MLA enablement with FP8-context
  Host->>AttnOp: initialize(...)
  AttnOp->>AttnOp: mFP8ContextMLA = (SM in {100,120} && KvCache supports FP8)
  AttnOp->>AttnOp: if MLA enabled ensure !DenseContextFMHA
  AttnOp->>Runner: construct (shared_ptr)
  Runner-->>AttnOp: ready
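
The two conditions in this diagram reduce to a pair of small predicates. The Python sketch below restates them for illustration only; the actual logic lives in C++ (`attentionOp.cpp`), and the function names and the use of `ValueError` here are assumptions rather than the project's API.

```python
def fp8_context_mla_enabled(sm, kv_cache_is_fp8):
    # Per the walkthrough: gated on SM 100 or 120 plus an FP8-capable KV cache.
    return sm in (100, 120) and kv_cache_is_fp8


def check_mla_config(is_mla, dense_context_fmha):
    # MLA is rejected only when dense context FMHA is requested;
    # FP8-context FMHA is no longer a blocker.
    if is_mla and dense_context_fmha:
        raise ValueError("MLA does not currently support dense FMHA")
```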

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

[TRTLLM-6994][feat] FP8 Context MLA integration. #7581: Similar enablement of FP8-context with MLA and tuple/data updates in AttentionOp.
[https://nvbugs/5427801][fix] Torch compile support for Llama4 and Ea… #6978: Adjusts Attention forward path to thread output scales/quant flags.
test: skip llama3.3 70b test on cg4 #6293: Updates test decorators to Hopper-based gating akin to the test changes here.

Suggested labels

Release Blocker

Suggested reviewers

PerkzZheng
litaotju
Superjomn
QiJune


coderabbitai

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tensorrt_llm/_torch/modules/attention.py (1)
317-327: Fix potential dtype mismatch under torch.compile path (custom op disables NVFP4 output).

create_output may allocate an FP8 output tensor when has_quant_scale and FP8/FP4 KV cache are enabled. However, attn_custom_op_inplace invokes _attn_impl(..., enable_attn_nvfp4_output=False) under torch.compile, which can yield BF16 output into an FP8 buffer. Make dtype selection conditional on the same enable flag.

Apply:
-    def create_output(self, q: torch.Tensor):
+    def create_output(self, q: torch.Tensor, enable_attn_nvfp4_output: bool = True):
         num_tokens = q.shape[0]
         hidden_size = self.o_proj.in_features
         out_dtype = q.dtype
 
-        if self.attn_backend == "TRTLLM":
+        if self.attn_backend == "TRTLLM" and enable_attn_nvfp4_output:
             if self.has_quant_scale and (self.attn.has_fp8_kv_cache
                                          or self.attn.has_fp4_kv_cache):
                 out_dtype = torch.float8_e4m3fn
         output = q.new_empty([num_tokens, hidden_size], dtype=out_dtype)
         return output
And update the compile path call site:
# In forward_impl(), inside if use_custom_inplace_op:
-    output = self.create_output(q)
+    output = self.create_output(q, enable_attn_nvfp4_output=False)
This keeps buffer dtype aligned with the execution path.

🧹 Nitpick comments (13)

tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
1-1: Missing NVIDIA Apache-2.0 header (2025).

Per guidelines, prepend the NVIDIA Apache-2.0 header with the current year.

Apply:
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#     http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
1009-1013: Traceback limit increased to 10: consider log level.

Good for debugging, but verbose for INFO. Suggest logging the full traceback at DEBUG, or gate by an env flag.
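
One possible shape for this suggestion, sketched with the standard library only. The logger name, the helper `log_init_failure`, and the environment variable `DEMO_FULL_TRACEBACK` are hypothetical; the project's own logger wrapper may expose a different interface.

```python
import logging
import os
import traceback

logger = logging.getLogger("init_fallback_demo")  # hypothetical logger name


def log_init_failure(flag_env="DEMO_FULL_TRACEBACK"):
    """Log the active exception; call this from inside an except block."""
    # Short summary at INFO so normal logs stay compact.
    logger.info("initialization failed, falling back:\n%s",
                traceback.format_exc(limit=1))
    # Deeper traceback at DEBUG, promoted to INFO only when the flag is set.
    level = logging.INFO if os.environ.get(flag_env) == "1" else logging.DEBUG
    logger.log(level, "full traceback:\n%s", traceback.format_exc(limit=10))
```

Usage would look like `try: ... except Exception: log_init_failure()`, so the 10-frame traceback only reaches the console when someone has opted in.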
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/fmhaKernels.h (2)
1-15: Update copyright year.

Header shows 2020–2023; update to include 2025.
- * Copyright (c) 2020-2023, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2020-2025, NVIDIA CORPORATION. All rights reserved.
544-547: Log readability: print dtype names instead of ints.

The info string logs dtypeQ/Kv/Out as integers. Prefer symbolic names for quicker debugging.

Example:
-        std::string info = "dtypeQ=" + std::to_string(static_cast<int>(mDtypeQ)) + ", dtypeKv="
-            + std::to_string(static_cast<int>(mDtypeKv)) + ", dtypeOut=" + std::to_string(static_cast<int>(mDtypeOut))
+        std::string info = "dtypeQ=" + toString(mDtypeQ) + ", dtypeKv="
+            + toString(mDtypeKv) + ", dtypeOut=" + toString(mDtypeOut)
             + ", sm=" + std::to_string(mSM) + ", qkvLayout=" + std::to_string(static_cast<int>(params.mQkvLayout))
(Add a small toString(Data_type) helper if not present.)
tensorrt_llm/executor/worker.py (1)
519-525: Minor: variable name typo.

Consider renaming splited_prompt_len → split_prompt_len for clarity (optional).
-            splited_prompt_len = int(len(prompt_token_ids) / cp_size)
-            default_max_tokens = max_seq_len - splited_prompt_len - query_token_len
+            split_prompt_len = int(len(prompt_token_ids) / cp_size)
+            default_max_tokens = max_seq_len - split_prompt_len - query_token_len
tests/integration/defs/accuracy/test_llm_api_pytorch.py (4)
1239-1241: Avoid repetition: factor MOE backend selection into a helper.

The "DEEPGEMM if SM>=100 else CUTLASS" logic is duplicated across tests. Suggest a small helper to keep tests DRY.

Example (place near the top of this file):
def _moe_backend_for_ci():
    return "DEEPGEMM" if get_sm_version() >= 100 else "CUTLASS"
Then here:
moe_config=MoeConfig(backend=_moe_backend_for_ci())
1329-1331: Same refactor applies here.

Use the shared helper to choose the MOE backend.

1353-1355: Same refactor applies here.

Use the shared helper to choose the MOE backend.

1397-1399: Same refactor applies here.

Use the shared helper to choose the MOE backend.
tensorrt_llm/_torch/modules/attention.py (1)
298-299: Guard against re-initialization of o_proj weights.

Attention.create_weights() now unconditionally calls self.o_proj.create_weights(). If `__init__` already created the weights (the default path), this second call may reinitialize them or conflict with the existing ones unless Linear.create_weights is idempotent.

Please confirm Linear.create_weights is idempotent (e.g., via an internal _weights_created guard). If not, guard:
-        self.o_proj.create_weights()
+        if not getattr(self.o_proj, "_weights_created", False):
+            self.o_proj.create_weights()
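
If the guard were placed inside the layer instead of at the call site, an idempotent `create_weights` could look like the sketch below. `GuardedLinear` and its fields are hypothetical; whether the real `Linear` already tracks a `_weights_created` flag is exactly the open question raised above.

```python
class GuardedLinear:
    """Hypothetical layer whose create_weights() is safe to call repeatedly."""

    def __init__(self):
        self._weights_created = False
        self.weight = None
        self.create_calls = 0  # counts real allocations, for illustration

    def create_weights(self):
        # Return early on repeat calls so existing weights are never replaced.
        if self._weights_created:
            return
        self.weight = [0.0] * 4  # placeholder allocation
        self.create_calls += 1
        self._weights_created = True
```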
cpp/tensorrt_llm/common/attentionOp.cpp (1)
2573-2574: Fix wording in user-visible error message

Change “currently not support dense fmha” to “does not currently support dense FMHA” for clarity.
-        TLLM_CHECK_WITH_INFO(!mDenseContextFMHA, "MLA(Deepseek v2) currently not support dense fmha");
+        TLLM_CHECK_WITH_INFO(!mDenseContextFMHA, "MLA (Deepseek v2) does not currently support dense FMHA");
cpp/tensorrt_llm/thop/attentionOp.cpp (2)
709-723: Validate workspace dtype and size in bytes

The check uses numel() (elements) against workspace_size (bytes). If a caller passes a non-Byte tensor, the comparison and resize logic become inconsistent. Guard for dtype Byte or compute in bytes.
-    if (workspace_.has_value())
+    if (workspace_.has_value())
     {
-        if (workspace_.value().numel() < workspace_size)
+        auto ws = workspace_.value();
+        TORCH_CHECK(ws.dtype() == torch::kByte, "workspace must be a torch.uint8 (Byte) tensor");
+        if (ws.numel() < workspace_size) // numel == bytes for Byte tensors
         {
             TLLM_LOG_WARNING("Attention workspace size is not enough, increase the size from %ld bytes to %ld bytes",
-                workspace_.value().numel(), workspace_size);
-            workspace_.value().resize_({workspace_size});
+                ws.numel(), workspace_size);
+            ws.resize_({workspace_size});
         }
-        workspace = workspace_.value();
+        workspace = ws;
     }
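
To see why the element count and the byte count can disagree, here is a small arithmetic sketch. The helper names are made up for illustration; in PyTorch the corresponding quantities would come from `Tensor.numel()` and `Tensor.element_size()`.

```python
def workspace_bytes(numel, element_size):
    # Total buffer size in bytes: element count times bytes per element.
    return numel * element_size


def needs_resize(numel, element_size, required_bytes):
    # Compare like with like: bytes against bytes.
    return workspace_bytes(numel, element_size) < required_bytes


# A 1024-element float32 buffer (4 bytes per element) holds 4096 bytes.
# Comparing numel (1024) to a 2048-byte requirement would wrongly ask for a resize.
naive_says_resize = 1024 < 2048
bytes_says_resize = needs_resize(1024, 4, 2048)
```

For a Byte tensor the element size is 1, so the two comparisons agree, which is why the dtype check in the suggestion is sufficient.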
1-16: Header year nit

Guidelines ask to prepend the NVIDIA Apache-2.0 header with the current year; file shows 1993-2024. Consider updating to include 2025 where applicable.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 219e955 and 7961672.

📒 Files selected for processing (8)

cpp/tensorrt_llm/common/attentionOp.cpp (1 hunks)
cpp/tensorrt_llm/common/attentionOp.h (1 hunks)
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/fmhaKernels.h (1 hunks)
cpp/tensorrt_llm/thop/attentionOp.cpp (3 hunks)
tensorrt_llm/_torch/modules/attention.py (9 hunks)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1 hunks)
tensorrt_llm/executor/worker.py (1 hunks)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (9 hunks)

🧰 Additional context used

📓 Path-based instructions (8)

**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}