[TRTLLM-5966][feat] Helix: add full MLA support for Helix #8104
base: main
Conversation
Signed-off-by: Matthias Jouanneaux <[email protected]>
/bot run
📝 Walkthrough
Adds a Helix post-processing GPU kernel and Torch op, integrates the CP Helix flow in attention with post-processing and optional position offsets in MLA RoPE, refactors distributed allgather and exposes cp_allgather, tweaks GEMM runner selection and error message, updates the build, and adds comprehensive unit tests.
Sequence Diagram(s)
sequenceDiagram
autonumber
actor Py as PyTorch
participant Op as trtllm::helix_post_process (Torch Op)
participant K as helixPostProcess<T> (Host)
participant GPU as helix_postprocess_kernel<T> (CUDA)
Py->>Op: helix_post_process(gathered_o, gathered_stats, scale)
Op->>Op: Validate shapes/dtypes/alignment
Op->>K: Build HelixPostProcParams<T>, launch on stream
K->>GPU: Configure grid/block, launch
GPU->>GPU: Warp-reduce corrected sums
GPU->>GPU: Accumulate per-token/head blocks
GPU-->>K: Write output [num_tokens, num_heads*kv_lora_rank]
K-->>Op: Kernel complete
Op->>Op: Optional scale multiply
Op-->>Py: Return output tensor
note over GPU,K: New Helix post-processing pathway
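For illustration, a minimal sketch of how the new Torch op in the diagram above could be invoked. The op name and argument order (gathered_o, gathered_stats, scale) come from this PR; the concrete shapes and the float32 (max, sum) stats layout are assumptions here, not a verified spec.

```python
# Hedged usage sketch; requires TensorRT-LLM with the trtllm ops registered and a CUDA device.
import torch

cp_size, num_tokens, num_heads, kv_lora_rank = 4, 2, 8, 512

# Assumed layout: per-CP-rank partial outputs and per-head (max, sum) softmax stats.
gathered_o = torch.randn(cp_size, num_tokens, num_heads * kv_lora_rank,
                         dtype=torch.float16, device="cuda")
gathered_stats = torch.randn(cp_size, num_tokens, num_heads, 2,
                             dtype=torch.float32, device="cuda")

# The op merges the CP shards; the output drops the leading cp_size dimension.
output = torch.ops.trtllm.helix_post_process(gathered_o, gathered_stats, 1.0)
assert output.shape == (num_tokens, num_heads * kv_lora_rank)
```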
sequenceDiagram
autonumber
participant Attn as Attention/MLA Forward
participant Rope as applyMLARopeAndAssignQKVKernelOptContext
participant Pos as helix_position_offsets
Attn->>Rope: Launch kernel(..., helix_position_offsets)
alt offsets provided
Rope->>Pos: Read offset[global_token_idx]
Rope-->>Attn: Use offset for RoPE
else no offsets
Rope-->>Attn: Use local_token_idx for RoPE
end
note over Rope: Modified position id selection
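The position-id selection shown in the diagram can be sketched in plain PyTorch as follows. This is not the CUDA kernel itself; the function and tensor names are illustrative.

```python
from typing import Optional

import torch


def select_rope_positions(num_tokens: int,
                          helix_position_offsets: Optional[torch.Tensor]) -> torch.Tensor:
    """Illustrative per-token position-id selection mirroring the diagram above."""
    if helix_position_offsets is not None:
        # Offsets provided: each token uses its globally consistent position.
        return helix_position_offsets[:num_tokens]
    # No offsets: fall back to the local token index.
    return torch.arange(num_tokens)
```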
sequenceDiagram
autonumber
participant Attn as Attention (CP Helix)
participant Dist as alltoall_helix / cp_allgather
participant Op as helix_post_process
Attn->>Dist: Exchange per-CP shard outputs/stats
Dist-->>Attn: Gathered O and stats
Attn->>Op: helix_post_process(gathered_o, gathered_stats, scale)
Op-->>Attn: Post-processed O
Attn-->>Attn: Continue projection/output mapping
note over Attn,Op: New CP Helix data exchange and post-process
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 4
🧹 Nitpick comments (5)
cpp/tensorrt_llm/kernels/helixKernels.h (1)
30-44: Document new public interfaces
HelixPostProcParams and helixPostProcess are new exported symbols; our header rules require Doxygen comments describing their contract. Please add //! documentation blocks so downstream users know how to populate the params and what the launcher does. As per coding guidelines.
cpp/tensorrt_llm/thop/helixPostProcessOp.cpp (1)
72-77: Ensure macro hygiene and consider using a function
The CALL_CPP_OP macro creates local variables and invokes a function, which could lead to name collisions or unexpected behavior if used multiple times. Consider converting this to a templated helper function for better type safety and to avoid potential macro pitfalls.
Consider replacing the macro with a templated function:
    template <typename T>
    void invokeHelixPostProcess(torch::Tensor& output, torch::Tensor const& gathered_o,
        torch::Tensor const& gathered_stats, int cp_size, int num_tokens, int num_heads, int kv_lora_rank,
        cudaStream_t stream)
    {
        tensorrt_llm::kernels::HelixPostProcParams<T> params{reinterpret_cast<T*>(output.mutable_data_ptr()),
            reinterpret_cast<T const*>(gathered_o.data_ptr()),
            reinterpret_cast<float2 const*>(gathered_stats.data_ptr()), static_cast<int>(cp_size),
            static_cast<int>(num_tokens), static_cast<int>(num_heads), static_cast<int>(kv_lora_rank)};
        tensorrt_llm::kernels::helixPostProcess(params, stream);
    }

Then replace lines 79-90:
    -#define CALL_CPP_OP(T)                                                                                         \
    -    tensorrt_llm::kernels::HelixPostProcParams<T> params{reinterpret_cast<T*>(output.mutable_data_ptr()),      \
    -        reinterpret_cast<T const*>(gathered_o.data_ptr()), reinterpret_cast<float2 const*>(gathered_stats.data_ptr()), \
    -        static_cast<int>(cp_size), static_cast<int>(num_tokens), static_cast<int>(num_heads),                  \
    -        static_cast<int>(kv_lora_rank)};                                                                       \
    -    tensorrt_llm::kernels::helixPostProcess(params, stream);
     if (gathered_o.scalar_type() == at::ScalarType::Half) {
    -    CALL_CPP_OP(__half);
    +    invokeHelixPostProcess<__half>(output, gathered_o, gathered_stats, cp_size, num_tokens, num_heads, kv_lora_rank, stream);
     } else if (gathered_o.scalar_type() == at::ScalarType::BFloat16) {
     #ifdef ENABLE_BF16
    -    CALL_CPP_OP(__nv_bfloat16);
    +    invokeHelixPostProcess<__nv_bfloat16>(output, gathered_o, gathered_stats, cp_size, num_tokens, num_heads, kv_lora_rank, stream);
     #else
         TLLM_THROW("BFloat16 must be enabled to use helix_post_process with bf16 tensors.");
     #endif
     }

tests/unittest/_torch/thop/parallel/test_helix_postprocess.py (1)
175-201: Handle unused variable in alignment test correctly
The static analysis tool flags line 197's output variable as unused, but this is a false positive. The variable is assigned to verify that the operation succeeds without raising an error. The current pattern is acceptable, though you could make the intent clearer.
Consider making the intent more explicit by assigning to _ or adding a comment:

     try:
    -    output = torch.ops.trtllm.helix_post_process(
    +    _ = torch.ops.trtllm.helix_post_process(
             gathered_o, gathered_stats, 1.0)
    -    # Should not raise an error
    +    # Success: Should not raise an error for valid alignment
     except RuntimeError as e:
         pytest.fail(f"Should not raise error for valid alignment: {e}")

tensorrt_llm/_torch/modules/attention.py (2)
823-823: Document the TODO for CP-aware weight loading
The TODO comment on line 823 notes that weight loading needs to be CP-aware for splitting v_b_proj. This is an important future task.
The TODO at line 823 indicates that weight loading for v_b_proj needs CP awareness. This could lead to incorrect behavior if weights are not split according to cp_size.
Would you like me to open a new issue to track implementing CP-aware weight loading for v_b_proj?
1467-1469: Unused parameters in forward_generation signature
Static analysis correctly identifies that the compressed_kv and k_pe parameters are unused in forward_generation. These parameters are passed for consistency with forward_context but are not used in the generation path, where q_nope and q_pe are derived directly from q.
Consider removing the unused parameters or adding a comment explaining why they're in the signature:

     def forward_generation(
         self,
         q: torch.Tensor,
    -    compressed_kv: torch.Tensor,
    -    k_pe: torch.Tensor,
    +    compressed_kv: torch.Tensor,  # Unused: q already contains all needed information
    +    k_pe: torch.Tensor,  # Unused: q already contains all needed information
         position_ids: torch.Tensor,

Or, if the parameters are vestigial, consider removing them entirely and updating all call sites.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
cpp/tensorrt_llm/kernels/helixKernels.cu (1 hunks)
cpp/tensorrt_llm/kernels/helixKernels.h (1 hunks)
cpp/tensorrt_llm/kernels/mlaKernels.cu (3 hunks)
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/KernelRunner.cpp (2 hunks)
cpp/tensorrt_llm/thop/CMakeLists.txt (1 hunks)
cpp/tensorrt_llm/thop/helixPostProcessOp.cpp (1 hunks)
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (1 hunks)
tensorrt_llm/_torch/distributed/__init__.py (1 hunks)
tensorrt_llm/_torch/distributed/ops.py (4 hunks)
tensorrt_llm/_torch/modules/attention.py (30 hunks)
tests/unittest/_torch/modules/test_mla_helix.py (1 hunks)
tests/unittest/_torch/thop/parallel/test_helix_postprocess.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (8)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}
: Namespace closing braces must include a trailing comment with the namespace name (e.g., '} // namespace foo').
Prefer const or constexpr variables over #define for constants.
Declare variables that are not modified after initialization as const.
Avoid magic literals in code; except for 0, nullptr, true, false. Use named constants for comparisons and logic.
Use Allman brace style for formatting.
Place the semicolon of an empty for/while loop on a new line.
Bodies of switch/while/do-while/for must be compound statements (brace-delimited), and if/else must always be followed by brace-delimited statements.
Type names (e.g., classes) must be CamelCase starting with an uppercase letter (e.g., FooBar).
Local variables, methods, and namespaces use lowerCamelCase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not in an anonymous namespace must be lowerCamelCase prefixed with 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number globals that are static or in an anonymous namespace use lowerCamelCase prefixed with 's' (e.g., sMutableStaticGlobal).
Locally visible static variables use lowerCamelCase with 's' prefix (e.g., static std::once_flag sFlag).
Private/protected member variables use 'm' prefix with CamelCase (e.g., mNbFooValues). Public members may omit, but 'm' is encouraged for clarity.
Constants (enums, global constants, static constants, and function-scope magic/literal constants) use uppercase SNAKE_CASE with 'k' prefix (e.g., kDIGIT_NUM).
Function-scope constants that are not magic numbers or literals are named like non-constant variables (e.g., bool const pass = a && b).
If macros are necessary, name them in UPPER_SNAKE_CASE (e.g., FOO_VERSION) and prefer constants over #define.
Use LLVM clang-format; wrap lines at a maximum of 120 columns; use '// clang-format off/on' sparingly with justification.
Use smart pointers for heap allocations; prefer unique_ptr for sole ownership, shared_ptr for shared...
Files:
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/KernelRunner.cpp
cpp/tensorrt_llm/thop/helixPostProcessOp.cpp
cpp/tensorrt_llm/kernels/mlaKernels.cu
cpp/tensorrt_llm/kernels/helixKernels.h
cpp/tensorrt_llm/kernels/helixKernels.cu
**/*.{cpp,cxx,cc,cu,h,hpp,hh,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
C++ filenames should be lowerCamelCase (first letter lowercase) and must be case-insensitive unique within a compilation target.
Files:
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/KernelRunner.cpp
cpp/tensorrt_llm/thop/helixPostProcessOp.cpp
cpp/tensorrt_llm/kernels/mlaKernels.cu
cpp/tensorrt_llm/kernels/helixKernels.h
cpp/tensorrt_llm/kernels/helixKernels.cu
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use only spaces, no tabs; indent with 4 spaces.
Files:
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/KernelRunner.cpp
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
cpp/tensorrt_llm/thop/helixPostProcessOp.cpp
cpp/tensorrt_llm/kernels/mlaKernels.cu
tests/unittest/_torch/thop/parallel/test_helix_postprocess.py
tensorrt_llm/_torch/distributed/__init__.py
cpp/tensorrt_llm/kernels/helixKernels.h
tensorrt_llm/_torch/distributed/ops.py
cpp/tensorrt_llm/kernels/helixKernels.cu
tensorrt_llm/_torch/modules/attention.py
tests/unittest/_torch/modules/test_mla_helix.py
**/*.{h,hpp,hh,hxx,cpp,cxx,cc}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc}
: Prefer anonymous namespaces over 'static' for internal linkage of functions.
All templates (class/function/member/static) must be instantiated at least once; non-POD classes should have private data members.
Files:
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/KernelRunner.cpp
cpp/tensorrt_llm/thop/helixPostProcessOp.cpp
cpp/tensorrt_llm/kernels/helixKernels.h
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).
Files:
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/KernelRunner.cpp
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
cpp/tensorrt_llm/thop/helixPostProcessOp.cpp
cpp/tensorrt_llm/kernels/mlaKernels.cu
tests/unittest/_torch/thop/parallel/test_helix_postprocess.py
tensorrt_llm/_torch/distributed/__init__.py
cpp/tensorrt_llm/kernels/helixKernels.h
tensorrt_llm/_torch/distributed/ops.py
cpp/tensorrt_llm/kernels/helixKernels.cu
tensorrt_llm/_torch/modules/attention.py
tests/unittest/_torch/modules/test_mla_helix.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.
Files:
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
tests/unittest/_torch/thop/parallel/test_helix_postprocess.py
tensorrt_llm/_torch/distributed/__init__.py
tensorrt_llm/_torch/distributed/ops.py
tensorrt_llm/_torch/modules/attention.py
tests/unittest/_torch/modules/test_mla_helix.py
**/*.{h,hpp,hh,hxx}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Document new class interfaces and function prototypes with Doxygen; use //! for single-line and //!< for members.
Files:
cpp/tensorrt_llm/kernels/helixKernels.h
**/*.{h,hpp,hh,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use include guards named 'TRTLLM_<FILE_NAME_IN_CAPS_WITH_UNDERSCORES>_H' (no leading or trailing underscore; directory names excluded).
Files:
cpp/tensorrt_llm/kernels/helixKernels.h
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/unittest/_torch/thop/parallel/test_helix_postprocess.py
🧬 Code graph analysis (7)
cpp/tensorrt_llm/kernels/mlaKernels.cu (1)
  cpp/tensorrt_llm/kernels/mlaKernels.h (1)
    helix_position_offsets (106-107)
tensorrt_llm/_torch/distributed/__init__.py (1)
  tensorrt_llm/_torch/distributed/ops.py (2)
    allgather (233-239)
    cp_allgather (242-248)
cpp/tensorrt_llm/kernels/helixKernels.h (1)
  cpp/tensorrt_llm/kernels/helixKernels.cu (4)
    void (40-68)
    void (82-207)
    helixPostProcess (210-234)
    helixPostProcess (210-210)
tensorrt_llm/_torch/distributed/ops.py (2)
  cpp/tensorrt_llm/thop/allgatherOp.cpp (4)
    input (108-111)
    input (108-108)
    allgather (122-137)
    allgather (122-122)
  tensorrt_llm/mapping.py (6)
    rank (328-329)
    rank (332-339)
    tp_group (368-369)
    tp_rank (342-343)
    cp_group (376-377)
    cp_rank (351-353)
cpp/tensorrt_llm/kernels/helixKernels.cu (1)
  cpp/tensorrt_llm/common/envUtils.cpp (2)
    getEnvEnablePDL (246-261)
    getEnvEnablePDL (246-246)
tensorrt_llm/_torch/modules/attention.py (5)
  tensorrt_llm/_torch/attention_backend/interface.py (6)
    AttentionBackend (552-630)
    PositionalEmbeddingParams (506-524)
    PredefinedAttentionMask (530-539)
    AttentionMetadata (40-336)
    forward (591-614)
    num_tokens (267-268)
  tensorrt_llm/_torch/attention_backend/utils.py (2)
    create_attention (27-79)
    get_attention_backend (10-24)
  tensorrt_llm/_torch/distributed/ops.py (1)
    alltoall_helix (251-286)
  tensorrt_llm/mapping.py (4)
    has_cp_ulysses (410-412)
    rank (328-329)
    rank (332-339)
    cp_group (376-377)
  cpp/tensorrt_llm/thop/helixPostProcessOp.cpp (2)
    helix_post_process (27-98)
    helix_post_process (27-27)
tests/unittest/_torch/modules/test_mla_helix.py (6)
  tensorrt_llm/_torch/attention_backend/interface.py (9)
    AttentionMetadata (40-336)
    RopeParams (350-502)
    seq_lens (167-168)
    seq_lens (171-192)
    num_contexts (195-196)
    num_contexts (199-202)
    create_rope_const_params (426-502)
    create_cuda_graph_metadata (275-317)
    from_config (372-424)
  tensorrt_llm/_torch/distributed/ops.py (1)
    cp_allgather (242-248)
  tensorrt_llm/_torch/pyexecutor/resource_manager.py (5)
    get_buffers (693-702)
    shutdown (81-82)
    shutdown (368-369)
    shutdown (1072-1077)
    shutdown (1223-1224)
  tensorrt_llm/_torch/utils.py (1)
    model_extra_attrs (58-64)
  tensorrt_llm/_utils.py (2)
    str_dtype_to_binding (216-219)
    torch_dtype_to_str (225-226)
  tensorrt_llm/mapping.py (4)
    CpType (21-29)
    Mapping (32-519)
    rank (328-329)
    rank (332-339)
🪛 Ruff (0.13.1)
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py
518-518: Unused function argument: gathered_stats
(ARG001)
518-518: Unused function argument: scale
(ARG001)
tests/unittest/_torch/thop/parallel/test_helix_postprocess.py
197-197: Local variable output is assigned to but never used. Remove assignment to unused variable output.
(F841)
tensorrt_llm/_torch/modules/attention.py
1467-1467: Unused method argument: compressed_kv
(ARG002)
1468-1468: Unused method argument: k_pe
(ARG002)
tests/unittest/_torch/modules/test_mla_helix.py
794-794: Consider moving this statement to an else block
(TRY300)
795-795: Do not catch blind exception: Exception
(BLE001)
798-798: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
798-798: Create your own exception
(TRY002)
798-798: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (29)
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/KernelRunner.cpp (1)
140-140: Nice improvement to error diagnostics
Including the return code in the GEMM failure message makes triaging easier.
tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (1)
517-519: LGTM! Fake op registration is correct
The unused parameters gathered_stats and scale flagged by static analysis are expected and correct for a fake op registration. Fake ops only provide shape and dtype inference for TorchScript compilation; they don't execute the actual computation. The return shape correctly drops the first dimension (cp_size) from gathered_o.
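For readers unfamiliar with fake ops, a registration of this kind typically looks like the following minimal sketch. This is hypothetical illustration, not the actual contents of cpp_custom_ops.py, and it assumes PyTorch >= 2.4 for torch.library.register_fake and that the op is registered under "trtllm::helix_post_process".

```python
import torch


@torch.library.register_fake("trtllm::helix_post_process")
def _helix_post_process_fake(gathered_o, gathered_stats, scale):
    # Only shape/dtype inference happens here: drop the leading cp_size dimension.
    # gathered_stats and scale are intentionally unused.
    return gathered_o.new_empty(gathered_o.shape[1:])
```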
tests/unittest/_torch/modules/test_mla_helix.py (7)
1-39: LGTM! Imports and MPI setup are appropriate
The imports are well-organized and include all necessary dependencies for distributed MLA testing. The cloudpickle registration for MPI serialization is correctly configured to handle custom types across process boundaries.
42-146: LGTM! Well-structured test configuration
The Scenario and RopeConfig dataclasses are well-designed with appropriate defaults, frozen for immutability, and kw_only for clarity. The max_position_embeddings property correctly ensures sufficient capacity for all test scenarios.
148-416: LGTM! Helper functions are well-implemented
The helper functions provide comprehensive support for the distributed test:
- KV cache and metadata setup is correctly configured for MLA with Helix parallelism
- Weight initialization uses appropriate techniques (Kaiming uniform, block scaling)
- The inverse RoPE transformation in _make_latent_cache_gen correctly recovers original values from the embedded cache
- Error reporting provides detailed diagnostics for debugging
418-604: LGTM! Distributed execution logic is correct
The _run_mla_distributed function correctly orchestrates the Helix-distributed MLA execution:
- Properly distributes weights across CP ranks
- Correctly handles context and generation phases
- CUDA graph capture and replay are implemented correctly with proper warmup
- Latent cache generation for non-last ranks is appropriately handled
- Thorough validation against reference outputs with detailed error reporting
606-785: LGTM! Multi-GPU test orchestration is correct
The _full_test_multi_gpu function properly orchestrates the complete test:
- Rank 0 generates reference output with single-GPU execution
- Reference output is correctly broadcast to all ranks via cp_allgather
- Both reference and distributed paths support CUDA graph for performance measurement
- Test parameters are properly constructed and distributed
787-799: Exception handling is acceptable for MPI context
The broad exception catch and re-raise pattern is appropriate here for distributed MPI execution, where we need to capture and propagate exceptions across process boundaries. The preserved traceback helps with debugging distributed failures.
While static analysis suggests improvements, the current pattern is reasonable for this MPI testing context where exception details must cross process boundaries.
802-840: LGTM! Test function and benchmarking script are well-structured
The pytest test function is correctly parameterized with test scenarios and validates mismatch ratios appropriately. The main block provides a useful benchmarking script for performance measurement across scenarios.
cpp/tensorrt_llm/kernels/helixKernels.cu (3)
38-68: LGTM! Warp reduction is correctly implemented
The warpReduceCorrectedSum function correctly implements numerically stable warp-level reduction for softmax normalization. The SM100-specific redux instruction provides an optimized path, with a proper fallback for older architectures.
70-207: LGTM! Kernel implementation is well-optimized
The helix_postprocess_kernel is well-designed with several optimization strategies:
- Warp specialization (warp 0 for correction, others for pre-loading) maximizes parallelism
- Pre-loading and pipelining reduce memory latency
- SM90+ programmatic stream serialization primitives are correctly guarded
- Memory accesses are coalesced via vectorized loads/stores
- Shared memory usage is efficient
209-243: LGTM! Host launcher is correctly implemented
The helixPostProcess function properly:
- Validates alignment requirements for vectorized memory access
- Checks size constraints against kernel limits
- Configures launch parameters with correct grid/block dimensions
- Enables PDL (Programmatic Dependent Launch) based on environment variable for SM90+ optimization
- Instantiates templates for supported types (__half, __nv_bfloat16)
cpp/tensorrt_llm/thop/helixPostProcessOp.cpp (3)
37-42: LGTM! Shape derivation from inputs
The kv_lora_rank is correctly derived from the gathered_o shape and num_heads. The validation ensures that the dimension is evenly divisible, preventing potential issues downstream.
58-63: Good alignment checks for async operations
The 16-byte alignment requirements for gathered_o and the constraint that kv_lora_rank * sizeof(data_type) must be a multiple of 16 are correctly enforced. These checks ensure safe async memcpy operations, as noted in the inline comment.
110-110: Namespace closing comment already present
The coding guidelines require namespace closing braces to carry a trailing comment with the namespace name; line 110 already has } // namespace torch_ext, so no change is needed.
tests/unittest/_torch/thop/parallel/test_helix_postprocess.py (3)
25-43: Baseline reference implementation looks correct
The baseline function implements the expected Helix post-processing logic in PyTorch for verification. The implementation correctly does the following (a rough sketch of this merge follows the list):
- Computes global max and corrected statistics
- Applies scaling and exponential correction
- Performs reduction and normalization
- Handles dtype casting appropriately
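The listed steps correspond to the standard split-KV softmax merge. The sketch below is a rough PyTorch rendering of that logic, under the assumption (not verified against the test source) that each CP rank's partial output is already normalized by its local softmax sum; shapes and names are illustrative.

```python
import torch


def helix_merge_reference(gathered_o: torch.Tensor,   # [cp, tokens, heads, kv_lora_rank]
                          stats_max: torch.Tensor,    # [cp, tokens, heads]
                          stats_sum: torch.Tensor,    # [cp, tokens, heads]
                          scale: float = 1.0) -> torch.Tensor:
    # Global max across CP ranks for numerical stability.
    global_max = stats_max.max(dim=0, keepdim=True).values
    # Exponential correction of each rank's softmax sum.
    corrected = stats_sum * torch.exp(stats_max - global_max)
    total = corrected.sum(dim=0, keepdim=True)
    # Re-weight each rank's partial output and reduce over the CP dimension.
    weights = (corrected / total).unsqueeze(-1)         # [cp, tokens, heads, 1]
    merged = (gathered_o.float() * weights).sum(dim=0)  # [tokens, heads, kv_lora_rank]
    # Optional scale and cast back to the input dtype.
    return (scale * merged).to(gathered_o.dtype)
```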
214-217: Excellent large-input test coverage
Testing with larger inputs (16 cp_size, 64 heads, 512 kv_lora_rank) for both float16 and bfloat16 helps ensure the operator performs correctly and efficiently at scale.
74-78: Gathered_stats layout verified – no changes required
The C++ struct's float2 holds max in the first component and sum in the second, so using indices 0 and 1 in the test is correct.
tensorrt_llm/_torch/modules/attention.py (11)
209-217: CP size properly integrated into world_size calculation
The world_size calculation now includes cp_size, and the Mapping is constructed with cp_size and cp_config. This ensures distributed operations account for context parallelism ranks.
607-614: latent_cache_gen parameter added to MLA inplace op
The custom op signature is updated to accept latent_cache_gen, enabling generation-time control over which latent cache is used. This aligns with the TODO comments (lines 1145-1149) about using the next-rank latent cache in CP Helix scenarios.
732-733: CP Ulysses not yet supported for MLA
The early NotImplementedError when CP Ulysses is detected is appropriate. The error message is clear and informative.
746-748: Verify head count divisibility by tp_size * cp_size
The assertion requires self.num_heads % (tp_size * cp_size) == 0, ensuring heads can be evenly distributed across tensor-parallel and context-parallel ranks. This is critical for correctness.
750-750: Robust RMS norm epsilon retrieval
Using getattr with a default fallback (1e-6) ensures compatibility when rms_norm_eps is not present in the config. This is a good defensive coding practice.
832-851: Creative mapping_o construction for CP Helix output projection
The mapping_o treats tp_size * cp_size as the effective tp_size while setting cp_size=1. This allows the o_proj to perform row-wise tensor parallelism across the combined TP and CP dimensions, which is necessary after Helix post-processing reduces across CP ranks. This is a clever approach.
1003-1049: CP Helix post-processing integration looks correct
The _attn_forward method now:
- Allocates softmax_stats for tracking partial attention statistics
- Calls attention with helix_position_offsets (position_ids)
- Splits partial outputs and stats by cp_size
- Performs alltoall_helix to gather chunks across CP ranks
- Calls helix_post_process to merge and normalize results
This aligns with the Helix attention algorithm. The scale=1.0 parameter suggests no additional scaling is needed.
1145-1152: TODO documents latent_cache_gen usage for CP Helix
The TODO correctly identifies that in CP Helix generation, ranks other than the last should use the latent cache from the next logical rank's first token. The latent_cache_gen parameter enables this workaround.
1194-1194: helix_position_offsets passed when cp_size > 1
The helix_position_offsets parameter is set to position_ids when CP is enabled, allowing the attention kernel to apply position-based adjustments during generation when tokens have different positions than cached KV values.
1590-1596: Output slicing for CP Helix compatibility
When cp_size > 1, the output is sliced to num_heads_tp_cp * v_head_dim to match the o_proj input expectations after post-processing. The comment clarifies this is for testing Helix parallelism compatibility.
694-694: Verify MLA assertion constraint
Ensure enforcing num_heads == num_key_value_heads is intentional for MLA (i.e., that grouped-query or multi-query attention patterns are not supported); if so, update the module docstring to clarify this limitation.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/KernelRunner.cpp (outdated comment, resolved)
PR_Github #20403 [ run ] triggered by Bot
PR_Github #20403 [ run ] completed with state
Signed-off-by: Matthias Jouanneaux <[email protected]>
/bot run
PR_Github #20446 [ run ] triggered by Bot
Signed-off-by: Matthias Jouanneaux <[email protected]>
/bot run --disable-fail-fast
PR_Github #20452 [ run ] triggered by Bot
PR_Github #20446 [ run ] completed with state
/bot run
PR_Github #20452 [ run ] completed with state
Signed-off-by: Matthias Jouanneaux <[email protected]>
/bot run
PR_Github #20466 [ run ] triggered by Bot
PR_Github #20466 [ run ] completed with state
Description
This PR adds full Helix parallelism support to the MLA attention module:
Test Coverage
tests/unittest/_torch/modules/test_mla_helix.py: Full Helix MLA test
tests/unittest/_torch/thop/parallel/test_helix_postprocess.py: Helix post-process unit test
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
[ x ] Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message.
See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.
--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test"
is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.