Conversation


@limin2021 limin2021 commented Jul 31, 2025

[TRTLLM-6898] [feat] Add CuTe DSL fp8 gemm, bmm and group gemm ops on Blackwell arch

  • New Features

    • Added CuTe DSL FP8 block-scaling ops (GEMM, batched GEMM, grouped GEMM) for the Blackwell architecture.
    • Added new options to the model constructors and configuration to control use of the new FP8 block-scaling ops; they are disabled by default.
  • Tests

    • Added comprehensive unit and integration tests to validate the new FP8 GEMM and BMM operators, including multi-GPU and Blackwell-specific scenarios.
    • Updated and expanded fused MoE test coverage to support both legacy and new CuTe DSL kernels.
    • Added a model accuracy test.
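
# op unit test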
pytest -s -o log_cli=true tests/unittest/_torch/thop/test_fp8_block_scale_gemm.py -k "test_cute_dsl_fp8_block_scale_gemm"

pytest -s -o log_cli=true tests/unittest/_torch/thop/test_fp8_block_scale_gemm.py -k "test_cute_dsl_fp8_block_scale_bmm"

pytest -s -o log_cli=true tests/unittest/_torch/thop/test_fp8_block_scale_group_gemm.py -k "test_cute_dsl_fp8_block_scale_group_gemm"

# module test
pytest -s -o log_cli=true tests/unittest/_torch/modules/test_fused_moe.py -k "test_fused_moe_fp8_blockwise_cute_dsl"

pytest -s -o log_cli=true tests/unittest/_torch/modules/test_fused_moe.py -k "test_fused_moe_fp8_blockwise_cute_dsl_multi_gpu"

# model end-to-end run (quickstart example)
python3 examples/llm-api/quickstart_advanced.py --model_dir /home/scratch.trt_llm_data/llm-models/DeepSeek-V3-Lite/fp8 --tp_size 1 --moe_tp_size 1 --moe_ep_size 1 --moe_backend CUTEDSL --use_cute_dsl_blockscaling_mm --use_cute_dsl_blockscaling_bmm

# model accuracy test
pytest -s -o log_cli=true "tests/integration/defs/accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_cute_dsl_fp8_block_scales"

  • Todo
    • Update the group GEMM autotuning after Yukun's supporting MR is merged (so the autotune module supports creating tensors with data dependencies).
    • Verify performance on the DeepSeek model and make the CuTe DSL implementations the default for ops where they perform better.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
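
For example, hypothetical invocations composed only of the flags documented above:

/bot run --disable-fail-fast --stage-list "A10-PyTorch-1"
/bot run --gpu-type "A30, H100_PCIe" --add-multi-gpu-test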

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

@limin2021 limin2021 requested a review from a team as a code owner July 31, 2025 10:35
@limin2021 limin2021 requested review from byshiue and hyukn July 31, 2025 10:35

coderabbitai bot commented Jul 31, 2025

📝 Walkthrough

Walkthrough

This change introduces new FP8 GEMM and batched GEMM operators using the CuTe DSL and CUTLASS libraries for the NVIDIA Blackwell architecture, with corresponding integration in model, module, and test code. It adds environment-variable-driven flags to enable these kernels, updates weight-loading and quantization logic, and extends test coverage for both single-GPU and multi-GPU scenarios.
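
A minimal sketch of the flag-driven dispatch this walkthrough describes, assembled from op and attribute names that appear in the diffs later in this thread; the preparation of the FP8 activation and its block-scaling factors is elided, and the helper name is illustrative rather than the PR's exact code:

import torch

def fp8_block_scale_gemm_dispatch(module, act_input_fp8, act_input_sf):
    # Prefer the new CuTe DSL Blackwell kernel only when the module flag is set
    # and the custom op has actually been registered; otherwise fall back to the
    # existing block-scaling GEMM. (A review comment below additionally suggests
    # gating on get_sm_version() >= 100.)
    use_cute_dsl = getattr(module, "use_cute_dsl_fp8_block_scale_gemm", False)
    if use_cute_dsl and hasattr(torch.ops.trtllm, "cute_dsl_fp8_gemm_blackwell"):
        return torch.ops.trtllm.cute_dsl_fp8_gemm_blackwell(
            act_input_fp8, module.weight, act_input_sf, module.weight_scale)
    return torch.ops.trtllm.fp8_block_scaling_gemm(
        act_input_fp8, module.weight, act_input_sf, module.weight_scale)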

Changes

Cohort / File(s) Change Summary
Custom FP8 GEMM Operators (CuTe/CUTLASS/Blackwell)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
Adds CuteDSLFp8BlackwellLinear and CuteDSLFp8BlackwellBmm classes for FP8 GEMM/BMM using CuTe DSL and CUTLASS; registers new custom CUDA operators and supporting JIT functions.
Linear and GatedMLP Module Integration
tensorrt_llm/_torch/modules/linear.py, tensorrt_llm/_torch/modules/gated_mlp.py
Adds use_cute_dsl_blockscaling_mm flag to constructors and propagates to submodules; conditionally uses new GEMM operator in Linear; updates print/debug output.
Attention and MLA Module Integration
tensorrt_llm/_torch/modules/attention.py
Adds use_cute_dsl_blockscaling_mm and use_cute_dsl_blockscaling_bmm flags from environment; conditionally uses new BMM operator in forward pass and weight creation.
DeepSeekV3 Model Integration
tensorrt_llm/_torch/models/modeling_deepseekv3.py
Adds environment flag-driven kernel selection to all relevant classes; updates weight loading logic for Blackwell/DeepGEMM; propagates flags to submodules; adjusts debug output.
Fused MoE Integration and Quantization
tensorrt_llm/_torch/modules/fused_moe/quantization.py, tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py, tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py
Adds a new quantization method class for DeepGEMM; updates method selection logic; adds debug prints; no logic changes in Cute DSL MoE.
Unit and Integration Test Additions/Updates
tests/unittest/_torch/thop/test_fp8_block_scale_gemm.py, tests/unittest/_torch/modules/test_fused_moe.py, tests/integration/defs/accuracy/test_llm_api_pytorch.py
Adds/updates tests for new FP8 GEMM/BMM kernels (including multi-GPU); parameterizes tests over new flags; updates decorators for Blackwell architecture; extends reference models to support new kernel.
Miscellaneous Formatting/Import Updates
tensorrt_llm/llmapi/llm.py, tensorrt_llm/evaluate/lm_eval.py
Minor formatting and import refactoring; no logic changes.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Model
    participant Linear
    participant CustomOp (CuteDSLFp8BlackwellLinear)
    participant CUDA Kernel

    User->>Model: Forward(input)
    Model->>Linear: Forward(input)
    Linear->>CustomOp: cute_dsl_fp8_gemm_blackwell(input, weight, ...)
    CustomOp->>CUDA Kernel: Launch FP8 GEMM kernel
    CUDA Kernel-->>CustomOp: Output tensor
    CustomOp-->>Linear: Output tensor
    Linear-->>Model: Output
    Model-->>User: Output
sequenceDiagram
    participant User
    participant MLA
    participant CustomOp (CuteDSLFp8BlackwellBmm)
    participant CUDA Kernel

    User->>MLA: forward_generation(...)
    MLA->>CustomOp: cute_dsl_fp8_bmm_blackwell(input, weight, ...)
    CustomOp->>CUDA Kernel: Launch FP8 BMM kernel
    CUDA Kernel-->>CustomOp: Output tensor
    CustomOp-->>MLA: Output tensor
    MLA-->>User: Output

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

Community want to contribute

Suggested reviewers

  • FrankD412
  • syuoni
  • hlu1
  • nv-guomingz

📜 Recent review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7924f93 and 88e9e2e.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/modules/fused_moe/quantization.py (1 hunks)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py (9 hunks)
  • tests/unittest/_torch/modules/test_fused_moe.py (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • tensorrt_llm/_torch/modules/fused_moe/quantization.py
  • tests/unittest/_torch/modules/test_fused_moe.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (4)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (4)

11-29: Remove commented imports to improve code clarity.

The file contains both commented and active imports for the same modules. The active imports using absolute paths are correct. Please remove the commented relative import statements to avoid confusion.

-# from ..attention_backend.interface import AttentionInputType
-# from ..autotuner import (AutoTuner, ConstraintSpec, DynamicTensorSpec,
-#                          OptimizationProfile, TunableRunner, TuningConfig)
-# from ..utils import (compute_swizzled_sf_shape, fp4_scale_infer_shape,
-#                      get_last_power_of_2_num_tokens_buckets,
-#                      last_positive_power_of_2)
 from tensorrt_llm._torch.attention_backend.interface import AttentionInputType
 from tensorrt_llm._torch.autotuner import (AutoTuner, ConstraintSpec,
                                            DynamicTensorSpec,
                                            OptimizationProfile, TunableRunner,
                                            TuningConfig)
-# from tensorrt_llm._torch.custom_ops.cute_dsl_kernels.blackwell.blockwise_gemm import \
-#     BlockwiseGemmKernel
 from tensorrt_llm._torch.custom_ops.cute_dsl_kernels.blackwell.blockwise_gemm_release import \
     BlockwiseGemmKernel

1173-1182: Remove commented code in append_ones_wrapper function.

The function contains a commented line that duplicates the functionality. Clean up the code by removing it.

 @cute.jit
 def append_ones_wrapper(a: cute.Tensor):
-    # return cute.append_ones(tensor)
     a_layout = a.layout
     a_layout = cute.append(a_layout,
                            cute.make_layout(1, stride=1),
                            up_to_rank=3)
     new_a = cute.make_tensor(a.iterator, a_layout)
     return new_a

1254-1285: Remove commented "before opt" code section.

The extensive commented code block labeled "before opt" should be removed as it clutters the codebase and the optimized version is already implemented.

Remove lines 1254-1285 entirely to improve code readability.


1372-1406: Consider implementing autotuning or documenting why it's disabled.

The function contains extensive commented code for autotuning support. Either:

  1. Implement the autotuning functionality if it's ready
  2. Remove the commented code and add a TODO comment explaining the plan
  3. Document why fixed parameters are currently preferred

If autotuning is not ready, replace the commented code with:

def cute_dsl_fp8_gemm_blackwell(
    input: torch.Tensor,
    weight: torch.Tensor,
    input_scale: torch.Tensor,
    weight_scale: torch.Tensor,
) -> torch.Tensor:
    # TODO: Implement autotuning support for optimal tactic selection
    # Currently using fixed parameters for initial implementation
    cute_dsl_fp8_gemm_blackwell_runner = CuteDSLFp8BlackwellLinear()
    return cute_dsl_fp8_gemm_blackwell_runner(
        inputs=[input, weight, input_scale, weight_scale],
        tactic=0,
        use_2cta_instrs=False,
        mma_tiler_mn=(128, 128),
        cluster_shape_mn=(1, 1),
    )
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d38c26b and 8cb6c2c.

📒 Files selected for processing (4)
  • examples/llm-api/quickstart_advanced.py (1 hunks)
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (2 hunks)
  • tensorrt_llm/_torch/modules/linear.py (3 hunks)
  • tests/unittest/_torch/thop/test_fp8_block_scale_gemm.py (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
tests/unittest/_torch/thop/test_fp8_block_scale_gemm.py (1)

Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

🪛 Ruff (0.12.2)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py

1203-1203: Ambiguous variable name: l

(E741)

🔇 Additional comments (4)
examples/llm-api/quickstart_advanced.py (1)

50-54: LGTM!

The addition of 'CUTEDSL' to the MoE backend choices properly exposes the new CuTe DSL FP8 GEMM backend option to users.

tests/unittest/_torch/thop/test_fp8_block_scale_gemm.py (1)

60-70: LGTM!

The test addition properly validates the new CuTe DSL FP8 GEMM kernel on Blackwell architecture with appropriate tolerances for FP8 precision.

tensorrt_llm/_torch/modules/linear.py (2)

1451-1451: LGTM! Well-structured parameter addition.

The new boolean parameter is properly positioned, has a descriptive name, and maintains backward compatibility with the default value.


1498-1498: LGTM! Proper instance attribute assignment.

The parameter is correctly stored as an instance attribute following the established pattern.

m = inputs[0].shape[0]
n = inputs[1].shape[0]
k = inputs[0].shape[1]
l = 1

⚠️ Potential issue

Use a more descriptive variable name instead of 'l'.

Single-letter variable names like 'l' are ambiguous and can be confused with '1' or 'I'. Consider using a more descriptive name.

-        l = 1
+        batch_size = 1

Also update the reference to this variable in line 1221:

-            l,
+            batch_size,
🧰 Tools
🪛 Ruff (0.12.2)

1203-1203: Ambiguous variable name: l

(E741)

🤖 Prompt for AI Agents
In tensorrt_llm/_torch/custom_ops/torch_custom_ops.py at line 1203, replace the
single-letter variable name 'l' with a more descriptive name that clearly
indicates its purpose. Also, update all references to this variable, including
the one at line 1221, to use the new descriptive name to maintain consistency
and clarity.

Comment on lines 575 to 580
if module.use_cute_dsl_fp8_block_scale_gemm:
output = torch.ops.trtllm.cute_dsl_fp8_gemm_blackwell(
act_input_fp8, module.weight, act_input_sf, module.weight_scale)
else:
output = torch.ops.trtllm.fp8_block_scaling_gemm(
act_input_fp8, module.weight, act_input_sf, module.weight_scale)

💡 Verification agent

🧩 Analysis chain

Consider adding runtime validation for the new operator.

The conditional logic is implemented correctly, but consider adding validation to ensure the new operator is available and compatible with the current hardware architecture, especially since the operator name suggests Blackwell-specific functionality.


🏁 Script executed:

#!/bin/bash
# Search for any existing hardware/architecture validation patterns in the codebase
rg -A 5 -B 5 "blackwell|Blackwell" --type py

Length of output: 221000


🏁 Script executed:

#!/bin/bash
# Look for operator availability checks or similar validation patterns
rg -A 3 -B 3 "torch\.ops\.trtllm\." --type py | rg -A 3 -B 3 "(available|exists|check)"

Length of output: 13154


Add runtime checks for Blackwell-specific operator

Wrap the call to cute_dsl_fp8_gemm_blackwell in tensorrt_llm/_torch/modules/linear.py (around lines 575–580) with:

  • an SM-version guard (e.g. only if get_sm_version() >= 100)
  • an operator-availability check (hasattr(torch.ops.trtllm, "cute_dsl_fp8_gemm_blackwell"))

Otherwise, fall back to fp8_block_scaling_gemm. This ensures you won’t attempt a Blackwell-only op on unsupported hardware or when the operator isn’t loaded.

Locations to update:
• tensorrt_llm/_torch/modules/linear.py (around the if module.use_cute_dsl_fp8_block_scale_gemm: block)

Example diff:

-        if module.use_cute_dsl_fp8_block_scale_gemm:
-            output = torch.ops.trtllm.cute_dsl_fp8_gemm_blackwell(
-                act_input_fp8, module.weight, act_input_sf, module.weight_scale)
-        else:
-            output = torch.ops.trtllm.fp8_block_scaling_gemm(
-                act_input_fp8, module.weight, act_input_sf, module.weight_scale)
+        is_blackwell = get_sm_version() >= 100
+        has_cute_op = hasattr(torch.ops.trtllm, "cute_dsl_fp8_gemm_blackwell")
+        if module.use_cute_dsl_fp8_block_scale_gemm and is_blackwell and has_cute_op:
+            output = torch.ops.trtllm.cute_dsl_fp8_gemm_blackwell(
+                act_input_fp8, module.weight, act_input_sf, module.weight_scale)
+        else:
+            output = torch.ops.trtllm.fp8_block_scaling_gemm(
+                act_input_fp8, module.weight, act_input_sf, module.weight_scale)

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In tensorrt_llm/_torch/modules/linear.py around lines 575 to 580, the call to
the Blackwell-specific operator cute_dsl_fp8_gemm_blackwell lacks runtime
checks. Add a guard to call this operator only if get_sm_version() is at least
100 and if torch.ops.trtllm has the attribute cute_dsl_fp8_gemm_blackwell. If
either condition is not met, fall back to calling fp8_block_scaling_gemm. This
prevents attempting to run the Blackwell-only operator on unsupported hardware
or when the operator is unavailable.

Signed-off-by: Mindy Li <[email protected]>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

♻️ Duplicate comments (1)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)

1207-1207: Use a more descriptive variable name instead of 'l'

Single-letter variable names like 'l' are ambiguous and can be confused with '1' or 'I'. Consider using a more descriptive name like 'batch_size'.

🧹 Nitpick comments (6)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (6)

12-17: Remove commented-out code

These commented imports are redundant as they've been replaced with proper absolute imports below. Remove them to improve code clarity.

-# from ..attention_backend.interface import AttentionInputType
-# from ..autotuner import (AutoTuner, ConstraintSpec, DynamicTensorSpec,
-#                          OptimizationProfile, TunableRunner, TuningConfig)
-# from ..utils import (compute_swizzled_sf_shape, fp4_scale_infer_shape,
-#                      get_last_power_of_2_num_tokens_buckets,
-#                      last_positive_power_of_2)

23-25: Remove commented-out import

This commented import for BlockwiseGemmKernel should be removed as it's been replaced with the actual import below.

-# from tensorrt_llm._torch.custom_ops.cute_dsl_kernels.blackwell.blockwise_gemm import \
-#     BlockwiseGemmKernel

1257-1286: Remove large block of commented-out code

This extensive commented code block appears to be an earlier implementation. It should be removed to improve code readability and maintainability.

-        # before opt
-        """
-        a, b, a_sf, b_sf = inputs
-        m, n, k = a.shape[0], b.shape[0], a.shape[1]
-        w_n, w_k = b_sf.shape[0], b_sf.shape[1]
-        c = torch.empty(*(m, n), dtype=torch.bfloat16, device="cuda")
-
-        # torch_tensor -> cute.tensor
-        a_tmp = a.as_strided((m, k, 1), (k, 1, m * k)).view(torch.uint8)
-        b_tmp = b.as_strided((n, k, 1), (k, 1, n * k)).view(torch.uint8)
-        c_tmp = c.as_strided((m, n, 1), (n, 1, m * n))
-        weight_scale_tmp = b_sf.as_strided((w_n, w_k, 1), (w_k, 1, w_n * w_k))
-        # [xx, m]
-        input_scale_tmp = a_sf.permute(1, 0).as_strided(
-            (m, w_k, 1), (1, m, w_k * m)
-        )
-
-        mA = from_dlpack(a_tmp, assumed_align=16).mark_layout_dynamic(leading_dim=1)
-        mB = from_dlpack(b_tmp, assumed_align=16).mark_layout_dynamic(leading_dim=1)
-        mC = from_dlpack(c_tmp, assumed_align=16).mark_layout_dynamic(leading_dim=1)
-        mA.element_type = cutlass.Float8E4M3FN
-        mB.element_type = cutlass.Float8E4M3FN
-
-        # TODO: mSFA is column major
-        mSFA = from_dlpack(input_scale_tmp, assumed_align=16).mark_layout_dynamic(
-            leading_dim=0
-        )
-        mSFB = from_dlpack(weight_scale_tmp, assumed_align=16).mark_layout_dynamic(
-            leading_dim=1
-        )
-        """

1372-1404: Remove commented autotuner code or add TODO

The commented autotuner configuration suggests this was intended to be tunable. Either remove this code if not needed, or add a TODO comment explaining when autotuning will be implemented.

-# @autotuner.tuning_config(
-#     name="trtllm::cute_dsl_fp8_gemm_blackwell::gemm",
-#     dynamic_tensor_specs=(DynamicTensorSpec(
-#         0, 0, get_last_power_of_2_num_tokens_buckets,
-#         last_positive_power_of_2), ),
-#     constraint_specs=(ConstraintSpec(
-#         2, 1, cute_dsl_fp8_linear_scale_infer_shape_blackwell), ),
-#     configs={
-#         "use_2cta_instrs": [False],
-#         "mma_tiler_mn": [(128, 128)],
-#         "cluster_shape_mn": [(1, 1), (1, 2), (1, 4), (2, 1), (2, 2), (4, 1),
-#                              (4, 4)],
-#     },
-# )
def cute_dsl_fp8_gemm_blackwell(
    input: torch.Tensor,
    weight: torch.Tensor,
    input_scale: torch.Tensor,
    weight_scale: torch.Tensor,
) -> torch.Tensor:
-    # tuner = AutoTuner.get()
-    # # allocate workspace for profiling
-    # cute_dsl_fp8_gemm_blackwell_runner = CuteDSLFp8BlackwellLinear()
-    # _, best_tactic, best_config = tuner.choose_one(
-    #     "trtllm::cute_dsl_fp8_gemm_blackwell::gemm",
-    #     [cute_dsl_fp8_gemm_blackwell_runner],
-    #     [input, weight, input_scale, weight_scale],
-    # )
-    # return cute_dsl_fp8_gemm_blackwell_runner(
-    #     inputs=[input, weight, input_scale, weight_scale],
-    #     tactic=best_tactic,
-    #     **best_config,
-    # )
-
+    # TODO: Implement autotuning when performance profiling is complete
    cute_dsl_fp8_gemm_blackwell_runner = CuteDSLFp8BlackwellLinear()

1503-1507: Remove commented validation and alternative implementation code

These commented sections should be removed to improve code clarity.

-        # if c.dtype != torch.bfloat16:
-        #     assert False, "c.dtype != bf16"
-        # if c.shape != (l, m, n):
-        #     assert False, "c.shape != (l, m, n)"
-
-        # """
         # torch_tensor -> cute.tensor
         a_tmp = a.permute(1, 2, 0).view(torch.uint8)
         ...
-        # """
-        # a_tmp = a.view(torch.uint8)
-        # b_tmp = b.view(torch.uint8)
-        # mA = from_dlpack(
-        #     a_tmp, assumed_align=16).mark_layout_dynamic(leading_dim=2)
-        # mB = from_dlpack(
-        #     b_tmp, assumed_align=16).mark_layout_dynamic(leading_dim=2)
-        # mC = from_dlpack(
-        #     c, assumed_align=16).mark_layout_dynamic(leading_dim=2)
-        # mA.element_type = cutlass.Float8E4M3FN
-        # mB.element_type = cutlass.Float8E4M3FN
-
-        # # Note: mSFA is column major
-        # mSFA = from_dlpack(
-        #     a_sf, assumed_align=16).mark_layout_dynamic(leading_dim=2)
-        # mSFB = from_dlpack(
-        #     b_sf, assumed_align=16).mark_layout_dynamic(leading_dim=2)

Also applies to: 1543-1560


1619-1636: Remove commented autotuner configuration

Similar to the GEMM operator, remove the commented autotuner code or add a TODO if this is planned for future implementation.

-# @autotuner.tuning_config(
-#     name="trtllm::cute_dsl_fp8_bmm_blackwell::gemm",
-#     dynamic_tensor_specs=(DynamicTensorSpec(
-#         0, 1, get_last_power_of_2_num_tokens_buckets,
-#         last_positive_power_of_2), ),
-#     # constraint_specs=(
-#     #     ConstraintSpec(2, 0, cute_dsl_fp8_bmm_scale_infer_shape_blackwell),
-#     # ),
-#     constraint_specs=(ConstraintSpec(
-#         2, 2, cute_dsl_fp8_bmm_scale_infer_shape_blackwell), ),
-#     configs={
-#         "use_2cta_instrs": [False],
-#         "mma_tiler_mn": [(128, 128)],
-#         "cluster_shape_mn": [(1, 1), (1, 2), (1, 4), (2, 1), (2, 2), (4, 1),
-#                              (4, 4)],
-#         # 'cluster_shape_mn': [(1, 1)],
-#     },
-# )
def cute_dsl_fp8_bmm_blackwell(
    input: torch.Tensor,
    weight: torch.Tensor,
    input_scale: torch.Tensor,
    weight_scale: torch.Tensor,
    out: torch.Tensor,
) -> None:
-    # tuner = AutoTuner.get()
-
-    # # allocate workspace for profiling
-    # cute_dsl_fp8_bmm_blackwell_runner = CuteDSLFp8BlackwellBmm()
-
-    # _, best_tactic, best_config = tuner.choose_one(
-    #     "trtllm::cute_dsl_fp8_bmm_blackwell::gemm",
-    #     [cute_dsl_fp8_bmm_blackwell_runner],
-    #     [input, weight, input_scale, weight_scale, out],
-    # )
-    # cute_dsl_fp8_bmm_blackwell_runner(
-    #     inputs=[input, weight, input_scale, weight_scale, out],
-    #     tactic=best_tactic,
-    #     **best_config,
-    # )
-
+    # TODO: Implement autotuning when performance profiling is complete
    cute_dsl_fp8_bmm_blackwell_runner = CuteDSLFp8BlackwellBmm()

Also applies to: 1644-1659

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8cb6c2c and 5606444.

📒 Files selected for processing (2)
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (2 hunks)
  • tests/unittest/_torch/thop/test_fp8_block_scale_gemm.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/unittest/_torch/thop/test_fp8_block_scale_gemm.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile = ...).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL = ...).
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
🧠 Learnings (1)
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Applied to files:

  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py

1207-1207: Ambiguous variable name: l

(E741)


1448-1448: Ambiguous variable name: l

(E741)


1501-1501: Ambiguous variable name: l

(E741)


1501-1501: Local variable n is assigned to but never used

Remove assignment to unused variable n

(F841)


1501-1501: Local variable k is assigned to but never used

Remove assignment to unused variable k

(F841)


1502-1502: Local variable w_n is assigned to but never used

Remove assignment to unused variable w_n

(F841)


1678-1678: Ambiguous variable name: l

(E741)


1678-1678: Local variable k is assigned to but never used

Remove assignment to unused variable k

(F841)

Comment on lines 1 to 10
from functools import lru_cache
from typing import List, Optional, Tuple

import cutlass
import cutlass.cute as cute
import nvtx
import torch
from cuda import cuda
from cutlass.cute.runtime import from_dlpack


💡 Verification agent

❓ Verification inconclusive

Add NVIDIA copyright header

According to the coding guidelines, all TensorRT-LLM source files must contain an NVIDIA copyright header that includes the current year.


🏁 Script executed:

#!/bin/bash
# Check if other Python files in the repository have copyright headers
head -n 20 tensorrt_llm/_torch/custom_ops/*.py | grep -E "(Copyright|NVIDIA)" | head -10

Length of output: 366


Add missing NVIDIA copyright header
Per the coding guidelines, every source file must include an NVIDIA copyright header with the current year.

• File needing update:

  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py — insert header at the very top

Suggested header snippet (adjust to match your repo’s exact template):

+ # Copyright (c) 2025 NVIDIA CORPORATION. All rights reserved.
+ # SPDX-License-Identifier: Apache-2.0
+
from functools import lru_cache
...
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/custom_ops/torch_custom_ops.py at the very top before any
imports, add the NVIDIA copyright header with the current year as required by
coding guidelines. Insert the standard header snippet used in your repo to
ensure compliance.

Comment on lines 1448 to 1450
l, m, k = inputs[0].shape[0], inputs[0].shape[1], inputs[0].shape[2]
# [b, n, k]
n = inputs[1].shape[1]

⚠️ Potential issue

Use descriptive variable name instead of 'l'

The variable 'l' is ambiguous. Use 'batch_size' for clarity.

-        l, m, k = inputs[0].shape[0], inputs[0].shape[1], inputs[0].shape[2]
+        batch_size, m, k = inputs[0].shape[0], inputs[0].shape[1], inputs[0].shape[2]
         n = inputs[1].shape[1]

Also update the reference at line 1467:

-            l,
+            batch_size,
🧰 Tools
🪛 Ruff (0.12.2)

1448-1448: Ambiguous variable name: l

(E741)

🤖 Prompt for AI Agents
In tensorrt_llm/_torch/custom_ops/torch_custom_ops.py around lines 1448 to 1450,
rename the variable 'l' to 'batch_size' to improve code clarity and
descriptiveness. Also, update all references to 'l' accordingly, including the
one at line 1467, to maintain consistency and avoid errors.

Comment on lines 1678 to 1683
l, m, k = mat_a.shape[0], mat_a.shape[1], mat_a.shape[2]
n = mat_b.shape[1]
if out.dtype != torch.bfloat16:
assert False, "out.dtype != bf16"
if out.shape != (l, m, n):
assert False, "out.shape != (l, m, n)"

⚠️ Potential issue

Fix variable naming and remove unused variable

Use descriptive variable names and remove the unused variable 'k'. Also, use proper exceptions instead of assert False.

-    l, m, k = mat_a.shape[0], mat_a.shape[1], mat_a.shape[2]
+    batch_size, m = mat_a.shape[0], mat_a.shape[1]
     n = mat_b.shape[1]
     if out.dtype != torch.bfloat16:
-        assert False, "out.dtype != bf16"
-    if out.shape != (l, m, n):
-        assert False, "out.shape != (l, m, n)"
+        raise ValueError("out.dtype must be torch.bfloat16")
+    if out.shape != (batch_size, m, n):
+        raise ValueError(f"out.shape must be ({batch_size}, {m}, {n}), got {out.shape}")
🧰 Tools
🪛 Ruff (0.12.2)

1678-1678: Ambiguous variable name: l

(E741)


1678-1678: Local variable k is assigned to but never used

Remove assignment to unused variable k

(F841)

🤖 Prompt for AI Agents
In tensorrt_llm/_torch/custom_ops/torch_custom_ops.py around lines 1678 to 1683,
rename variables to more descriptive names, remove the unused variable 'k', and
replace the assert False statements with proper exception raising such as
ValueError with clear error messages for dtype and shape mismatches.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (5)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (5)

1-10: Add missing NVIDIA copyright header

According to the coding guidelines, all TensorRT-LLM source files must contain an NVIDIA copyright header with the current year.

Add the copyright header at the very beginning of the file:

+# Copyright (c) 2025 NVIDIA CORPORATION. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
from functools import lru_cache

998-998: Use descriptive variable name instead of 'l'

Single-letter variable 'l' is ambiguous and violates coding standards.

-        l = 1
+        batch_size = 1

Also update the reference at line 1016:

-            l,
+            batch_size,

1239-1241: Fix variable naming and remove unused variables

Use descriptive variable names and remove unused variables.

-        l, m, k = inputs[0].shape[0], inputs[0].shape[1], inputs[0].shape[2]
+        batch_size, m, k = inputs[0].shape[0], inputs[0].shape[1], inputs[0].shape[2]
         n = inputs[1].shape[1]

Also update references at lines 1255, 1258, 1259:

-            l,
+            batch_size,

1292-1293: Remove unused variables in forward method

Variables n, k, and w_n are assigned but never used.

-        l, m, n, k = a.shape[0], a.shape[1], b.shape[1], b.shape[2]
-        w_n, w_k = b_sf.shape[1], b_sf.shape[2]
+        l, m = a.shape[0], a.shape[1]
+        w_k = b_sf.shape[2]

1469-1474: Fix variable naming, remove unused variable, and use proper exceptions

Use descriptive variable names, remove unused variables, and replace assert statements with proper exceptions.

-    l, m, k = mat_a.shape[0], mat_a.shape[1], mat_a.shape[2]
+    batch_size, m = mat_a.shape[0], mat_a.shape[1]
     n = mat_b.shape[1]
     if out.dtype != torch.bfloat16:
-        assert False, "out.dtype != bf16"
-    if out.shape != (l, m, n):
-        assert False, "out.shape != (l, m, n)"
+        raise ValueError("out.dtype must be torch.bfloat16")
+    if out.shape != (batch_size, m, n):
+        raise ValueError(f"out.shape must be ({batch_size}, {m}, {n}), got {out.shape}")
🧹 Nitpick comments (7)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (7)

12-26: Remove commented imports and document the release kernel usage

The commented imports create unnecessary clutter. If switching between debug and release kernels is needed, consider using a configuration flag instead.

-# from ..attention_backend.interface import AttentionInputType
-# from ..autotuner import (AutoTuner, ConstraintSpec, DynamicTensorSpec,
-#                          OptimizationProfile, TunableRunner, TuningConfig)
-# from ..utils import (compute_swizzled_sf_shape, fp4_scale_infer_shape,
-#                      get_last_power_of_2_num_tokens_buckets,
-#                      last_positive_power_of_2)
from tensorrt_llm._torch.autotuner import (AutoTuner, ConstraintSpec,
                                           DynamicTensorSpec,
                                           OptimizationProfile, TunableRunner,
                                           TuningConfig)
-# from tensorrt_llm._torch.custom_ops.cute_dsl_kernels.blackwell.blockwise_gemm import \
-#     BlockwiseGemmKernel
+# Using release version of BlockwiseGemmKernel for production
from tensorrt_llm._torch.custom_ops.cute_dsl_kernels.blackwell.blockwise_gemm_release import \
    BlockwiseGemmKernel

944-946: Add a docstring to the utility function

The type hints are already present; add a brief docstring for clarity and IDE support.

-def pad_up(x: int, y: int) -> int:
-    return ((x + y - 1) // y) * y
+def pad_up(x: int, y: int) -> int:
+    """Round up x to the nearest multiple of y."""
+    return ((x + y - 1) // y) * y
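
For example, pad_up(130, 128) returns 256.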

969-978: Clean up commented code in append_ones_wrapper

Remove the commented line and improve documentation.

@cute.jit
def append_ones_wrapper(a: cute.Tensor):
-    # return cute.append_ones(tensor)
+    """Append ones to tensor layout up to rank 3."""
    a_layout = a.layout
    a_layout = cute.append(a_layout,
                           cute.make_layout(1, stride=1),
                           up_to_rank=3)
    new_a = cute.make_tensor(a.iterator, a_layout)
    return new_a

1047-1078: Remove large block of commented code

The "before opt" commented code should be removed. If needed for reference, it should be in version control history or documentation.

-        # before opt
-        """
-        a, b, a_sf, b_sf = inputs
-        m, n, k = a.shape[0], b.shape[0], a.shape[1]
-        w_n, w_k = b_sf.shape[0], b_sf.shape[1]
-        c = torch.empty(*(m, n), dtype=torch.bfloat16, device="cuda")
-
-        # torch_tensor -> cute.tensor
-        a_tmp = a.as_strided((m, k, 1), (k, 1, m * k)).view(torch.uint8)
-        b_tmp = b.as_strided((n, k, 1), (k, 1, n * k)).view(torch.uint8)
-        c_tmp = c.as_strided((m, n, 1), (n, 1, m * n))
-        weight_scale_tmp = b_sf.as_strided((w_n, w_k, 1), (w_k, 1, w_n * w_k))
-        # [xx, m]
-        input_scale_tmp = a_sf.permute(1, 0).as_strided(
-            (m, w_k, 1), (1, m, w_k * m)
-        )
-
-        mA = from_dlpack(a_tmp, assumed_align=16).mark_layout_dynamic(leading_dim=1)
-        mB = from_dlpack(b_tmp, assumed_align=16).mark_layout_dynamic(leading_dim=1)
-        mC = from_dlpack(c_tmp, assumed_align=16).mark_layout_dynamic(leading_dim=1)
-        mA.element_type = cutlass.Float8E4M3FN
-        mB.element_type = cutlass.Float8E4M3FN
-
-        # TODO: mSFA is column major
-        mSFA = from_dlpack(input_scale_tmp, assumed_align=16).mark_layout_dynamic(
-            leading_dim=0
-        )
-        mSFB = from_dlpack(weight_scale_tmp, assumed_align=16).mark_layout_dynamic(
-            leading_dim=1
-        )
-        """
-
        a, b, a_sf, b_sf = inputs

1163-1195: Remove commented autotuner code and document why it's disabled

Large blocks of commented code should be removed. If autotuning was intentionally disabled, please document the reason.

-# @autotuner.tuning_config(
-#     name="trtllm::cute_dsl_fp8_gemm_blackwell::gemm",
-#     dynamic_tensor_specs=(DynamicTensorSpec(
-#         0, 0, get_last_power_of_2_num_tokens_buckets,
-#         last_positive_power_of_2), ),
-#     constraint_specs=(ConstraintSpec(
-#         2, 1, cute_dsl_fp8_linear_scale_infer_shape_blackwell), ),
-#     configs={
-#         "use_2cta_instrs": [False],
-#         "mma_tiler_mn": [(128, 128)],
-#         "cluster_shape_mn": [(1, 1), (1, 2), (1, 4), (2, 1), (2, 2), (4, 1),
-#                              (4, 4)],
-#     },
-# )
def cute_dsl_fp8_gemm_blackwell(
    input: torch.Tensor,
    weight: torch.Tensor,
    input_scale: torch.Tensor,
    weight_scale: torch.Tensor,
) -> torch.Tensor:
-    # tuner = AutoTuner.get()
-    # # allocate workspace for profiling
-    # cute_dsl_fp8_gemm_blackwell_runner = CuteDSLFp8BlackwellLinear()
-    # _, best_tactic, best_config = tuner.choose_one(
-    #     "trtllm::cute_dsl_fp8_gemm_blackwell::gemm",
-    #     [cute_dsl_fp8_gemm_blackwell_runner],
-    #     [input, weight, input_scale, weight_scale],
-    # )
-    # return cute_dsl_fp8_gemm_blackwell_runner(
-    #     inputs=[input, weight, input_scale, weight_scale],
-    #     tactic=best_tactic,
-    #     **best_config,
-    # )
-
+    # TODO: Autotuning is temporarily disabled. Using fixed configuration.
    cute_dsl_fp8_gemm_blackwell_runner = CuteDSLFp8BlackwellLinear()

Was autotuning disabled for performance reasons or is this temporary? Consider adding a flag to enable/disable autotuning rather than commenting out the code.


1335-1350: Remove commented alternative implementation

Remove the large block of commented code to improve readability.

-        # """
-        # a_tmp = a.view(torch.uint8)
-        # b_tmp = b.view(torch.uint8)
-        # mA = from_dlpack(
-        #     a_tmp, assumed_align=16).mark_layout_dynamic(leading_dim=2)
-        # mB = from_dlpack(
-        #     b_tmp, assumed_align=16).mark_layout_dynamic(leading_dim=2)
-        # mC = from_dlpack(
-        #     c, assumed_align=16).mark_layout_dynamic(leading_dim=2)
-        # mA.element_type = cutlass.Float8E4M3FN
-        # mB.element_type = cutlass.Float8E4M3FN
-
-        # # Note: mSFA is column major
-        # mSFA = from_dlpack(
-        #     a_sf, assumed_align=16).mark_layout_dynamic(leading_dim=2)
-        # mSFB = from_dlpack(
-        #     b_sf, assumed_align=16).mark_layout_dynamic(leading_dim=2)
-
         # get stream
         stream = cuda.CUstream(torch.cuda.current_stream().cuda_stream)

1410-1449: Clean up commented autotuner configuration

Remove the large block of commented autotuner code for better readability.

-# @autotuner.tuning_config(
-#     name="trtllm::cute_dsl_fp8_bmm_blackwell::gemm",
-#     dynamic_tensor_specs=(DynamicTensorSpec(
-#         0, 1, get_last_power_of_2_num_tokens_buckets,
-#         last_positive_power_of_2), ),
-#     # constraint_specs=(
-#     #     ConstraintSpec(2, 0, cute_dsl_fp8_bmm_scale_infer_shape_blackwell),
-#     # ),
-#     constraint_specs=(ConstraintSpec(
-#         2, 2, cute_dsl_fp8_bmm_scale_infer_shape_blackwell), ),
-#     configs={
-#         "use_2cta_instrs": [False],
-#         "mma_tiler_mn": [(128, 128)],
-#         "cluster_shape_mn": [(1, 1), (1, 2), (1, 4), (2, 1), (2, 2), (4, 1),
-#                              (4, 4)],
-#         # 'cluster_shape_mn': [(1, 1)],
-#     },
-# )
def cute_dsl_fp8_bmm_blackwell(
    input: torch.Tensor,
    weight: torch.Tensor,
    input_scale: torch.Tensor,
    weight_scale: torch.Tensor,
    out: torch.Tensor,
) -> None:
-    # tuner = AutoTuner.get()
-
-    # # allocate workspace for profiling
-    # cute_dsl_fp8_bmm_blackwell_runner = CuteDSLFp8BlackwellBmm()
-
-    # _, best_tactic, best_config = tuner.choose_one(
-    #     "trtllm::cute_dsl_fp8_bmm_blackwell::gemm",
-    #     [cute_dsl_fp8_bmm_blackwell_runner],
-    #     [input, weight, input_scale, weight_scale, out],
-    # )
-    # cute_dsl_fp8_bmm_blackwell_runner(
-    #     inputs=[input, weight, input_scale, weight_scale, out],
-    #     tactic=best_tactic,
-    #     **best_config,
-    # )
-
+    # TODO: Autotuning is temporarily disabled. Using fixed configuration.
    cute_dsl_fp8_bmm_blackwell_runner = CuteDSLFp8BlackwellBmm()
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5606444 and 733d253.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (2 hunks)
  • tensorrt_llm/_torch/modules/linear.py (3 hunks)
  • tests/unittest/_torch/thop/test_fp8_block_scale_gemm.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/unittest/_torch/thop/test_fp8_block_scale_gemm.py
  • tensorrt_llm/_torch/modules/linear.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile = ...).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL = ...).
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
🧠 Learnings (3)
📚 Learning: applies to **/*.{cpp,h,hpp,cc,cxx,cu,py} : all tensorrt-llm open source software code should contain...
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-04T02:12:17.550Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx,cu,py} : All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Applied to files:

  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
📚 Learning: applies to **/*.py : the code developed for tensorrt-llm should conform to python 3.8+....
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-04T02:12:17.550Z
Learning: Applies to **/*.py : The code developed for TensorRT-LLM should conform to Python 3.8+.

Applied to files:

  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Applied to files:

  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py

998-998: Ambiguous variable name: l

(E741)


1239-1239: Ambiguous variable name: l

(E741)


1292-1292: Ambiguous variable name: l

(E741)


1292-1292: Local variable n is assigned to but never used

Remove assignment to unused variable n

(F841)


1292-1292: Local variable k is assigned to but never used

Remove assignment to unused variable k

(F841)


1293-1293: Local variable w_n is assigned to but never used

Remove assignment to unused variable w_n

(F841)


1469-1469: Ambiguous variable name: l

(E741)


1469-1469: Local variable k is assigned to but never used

Remove assignment to unused variable k

(F841)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

Signed-off-by: Mindy Li <[email protected]>
@limin2021 limin2021 requested review from a team as code owners August 5, 2025 02:29
@limin2021 limin2021 requested a review from Superjomn August 5, 2025 02:29
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 8

🧹 Nitpick comments (6)
tensorrt_llm/_torch/modules/attention.py (2)

223-224: Clean up commented code

Remove commented code fragments to improve code readability.

            allreduce_strategy=config.allreduce_strategy,
-            force_dynamic_quantization=config.force_dynamic_quantization,
-        )
-        # use_cute_dsl_fp8_block_scale_gemm=config.use_cute_dsl_fp8_block_scale_gemm)
+            force_dynamic_quantization=config.force_dynamic_quantization)

895-896: Remove TODO comment

Remove debug TODO comment.

-            # TODO:
-            # print(f"limin: attention.py, get_sm_version(): {get_sm_version()}, self.use_cute_dsl_fp8_block_scale_bmm: {self.use_cute_dsl_fp8_block_scale_bmm}")
tests/unittest/_torch/modules/test_fused_moe.py (1)

678-678: Clean up commented code or restore validation logic.

The commented debug prints should be removed, but consider restoring the validation comparisons between different implementations to ensure correctness.

-        # output_origin = fused_moe_origin.forward(x, router_logits)
-    # print("limin: output = ", output)
-    # print("limin: ref_output = ", ref_output)
-    # torch.testing.assert_close(output_origin, output, rtol=1e-2, atol=0.1)
-    # torch.testing.assert_close(output_origin, ref_output, rtol=1e-2, atol=0.1)

Also applies to: 683-686

tensorrt_llm/_torch/models/modeling_deepseekv3.py (3)

1130-1130: Remove debug print statement.

This debug print statement should be removed from production code. It also exceeds the 120-character line length limit.

-        # print(f"limin: DeepseekV3ForCausalLM, model_config.use_cute_dsl_fp8_block_scale_bmm: {model_config.use_cute_dsl_fp8_block_scale_bmm}, model_config.use_cute_dsl_fp8_block_scale_gemm: {model_config.use_cute_dsl_fp8_block_scale_gemm}")

1359-1362: Replace debug prints with proper logging.

Debug print statements should be removed or replaced with proper logging using the logger object that's already imported.

-        moe_backend = self.model_config.moe_backend.upper()
-        print(
-            f"limin: DeepseekV3ForCausalLM load_weights, moe_backend: {moe_backend}"
-        )
+        moe_backend = self.model_config.moe_backend.upper()
+        logger.debug(f"DeepseekV3ForCausalLM load_weights, moe_backend: {moe_backend}")

1429-1448: Clean up commented dequantization code.

This large block of commented weight dequantization code should be either:

  1. Removed if no longer needed with the new FP8 implementation
  2. Properly integrated if still required for certain configurations

Leaving extensive commented code reduces maintainability.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 899b1bf and 3c8cca4.

📒 Files selected for processing (14)
  • examples/llm-api/quickstart_advanced.py (2 hunks)
  • tensorrt_llm/_torch/model_config.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py (11 hunks)
  • tensorrt_llm/_torch/modules/attention.py (13 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py (4 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (1 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/quantization.py (3 hunks)
  • tensorrt_llm/_torch/modules/gated_mlp.py (3 hunks)
  • tensorrt_llm/_torch/modules/linear.py (3 hunks)
  • tensorrt_llm/_torch/pyexecutor/config.py (1 hunks)
  • tensorrt_llm/llmapi/llm.py (1 hunks)
  • tensorrt_llm/llmapi/llm_args.py (2 hunks)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py (3 hunks)
  • tests/unittest/_torch/modules/test_fused_moe.py (6 hunks)
✅ Files skipped from review due to trivial changes (5)
  • examples/llm-api/quickstart_advanced.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
  • tensorrt_llm/_torch/pyexecutor/config.py
  • tensorrt_llm/_torch/model_config.py
  • tensorrt_llm/llmapi/llm_args.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tensorrt_llm/_torch/modules/linear.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile = ...).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL = ...).
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tensorrt_llm/_torch/modules/gated_mlp.py
  • tensorrt_llm/llmapi/llm.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py
  • tensorrt_llm/_torch/modules/fused_moe/quantization.py
  • tensorrt_llm/_torch/modules/attention.py
  • tests/unittest/_torch/modules/test_fused_moe.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tensorrt_llm/_torch/modules/gated_mlp.py
  • tensorrt_llm/llmapi/llm.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py
  • tensorrt_llm/_torch/modules/fused_moe/quantization.py
  • tensorrt_llm/_torch/modules/attention.py
  • tests/unittest/_torch/modules/test_fused_moe.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
🧠 Learnings (2)
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tensorrt_llm/llmapi/llm.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
📚 Learning: applies to **/*.py : the code developed for tensorrt-llm should conform to python 3.8+....
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-04T02:12:17.582Z
Learning: Applies to **/*.py : The code developed for TensorRT-LLM should conform to Python 3.8+.

Applied to files:

  • tensorrt_llm/llmapi/llm.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/attention.py

204-204: Line too long (191 > 120)

(E501)


671-671: Line too long (214 > 120)

(E501)


896-896: Line too long (161 > 120)

(E501)

tensorrt_llm/_torch/models/modeling_deepseekv3.py

1130-1130: Line too long (242 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (26)
tensorrt_llm/llmapi/llm.py (1)

137-144: LGTM! Code formatting improvement with future feature preparation.

The reformatting from inline to multi-line list improves readability. The commented code indicates planned support for FP8 block scale GEMM and BMM flags, which aligns with the PR's objective of adding CuTe DSL FP8 operators.

tensorrt_llm/_torch/modules/gated_mlp.py (4)

40-41: LGTM! Proper parameter addition with backward compatibility.

The new use_cute_dsl_fp8_block_scale_gemm parameter is well-structured with appropriate type annotation and safe default value.


47-47: LGTM! Proper instance attribute assignment.

Correctly stores the parameter as an instance attribute following established patterns.


79-81: LGTM! Correct parameter propagation to gate_up_proj.

The FP8 block scale GEMM flag is properly passed to the Linear module constructor.


99-100: LGTM! Consistent parameter propagation to down_proj.

The FP8 block scale GEMM flag is consistently passed to both Linear modules in the GatedMLP.

tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py (4)

12-13: LGTM! Necessary import additions for quantization method selection.

The new imports are required for the _get_quant_method implementation and are properly organized.


36-37: LGTM! Useful debug instrumentation for development.

The commented debug print statements provide valuable tensor shape information for debugging without affecting production performance.


145-155: LGTM! Robust quantization method selection with proper error handling.

The _get_quant_method correctly handles different quantization scenarios:

  • FP8 block scales: returns appropriate CuTe DSL method
  • Unsupported modes: raises informative ValueError
  • No quantization: returns unquantized method

The error handling and method selection logic are well-implemented.
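
For readers unfamiliar with the pattern, here is a minimal self-contained sketch of this selection logic; the enum values and class names below are stand-ins for illustration, not the real TensorRT-LLM types.

# Illustrative sketch only -- enum values and class names are stand-ins, not the real API.
from enum import Enum, auto


class QuantModeSketch(Enum):
    NONE = auto()
    FP8_BLOCK_SCALES = auto()
    FP8_PER_TENSOR = auto()  # stands in for any unsupported mode


class CuteDslFp8BlockScalesMethodSketch:
    """Stand-in for the CuTe DSL FP8 block-scales MoE method."""


class UnquantizedMethodSketch:
    """Stand-in for the unquantized MoE method."""


def get_quant_method(quant_mode: QuantModeSketch):
    """Mirrors the selection behavior described above."""
    if quant_mode is QuantModeSketch.NONE:
        return UnquantizedMethodSketch()
    if quant_mode is QuantModeSketch.FP8_BLOCK_SCALES:
        return CuteDslFp8BlockScalesMethodSketch()
    raise ValueError(f"Unsupported quantization mode for the CuTe DSL backend: {quant_mode}")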


227-230: LGTM! Additional debug instrumentation for FP8 quantization pipeline.

The debug print statements provide valuable tensor shape information for debugging the FP8 quantization forward pass without affecting production code.

tests/integration/defs/accuracy/test_llm_api_pytorch.py (5)

988-988: Hardware targeting correctly updated to Blackwell.

The change from @skip_no_hopper to @skip_pre_blackwell correctly reflects that the new CuTe DSL FP8 block scale kernels require Blackwell architecture, consistent with the PR's introduction of Blackwell-optimized operators.


1016-1017: CUDA graph configuration modernization is correct.

The transition from the boolean use_cuda_graph flag to the structured cuda_graph_config object aligns with the modern API pattern used throughout the codebase, providing better configuration flexibility.
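
For context, a hedged example of the configuration pattern being referred to; the import path and model directory below are assumptions for illustration, not taken from this diff.

# Hedged illustration; CudaGraphConfig usage and the model path are assumptions, not from this diff.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import CudaGraphConfig

llm = LLM(
    model="/path/to/DeepSeek-V3-Lite/fp8",   # placeholder model directory
    cuda_graph_config=CudaGraphConfig(),      # structured object replaces the old boolean flag
)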


1147-1147: Hardware targeting consistently updated across test variants.

The decorator change matches the single-GPU test variant, ensuring consistent Blackwell architecture targeting across both test methods.


1183-1184: CUDA graph configuration consistently updated across test variants.

The configuration change matches the single-GPU test variant, maintaining API consistency across both test methods.


1020-1021: Do not enable CuTe DSL block‐scale flags in the integration test
The Pydantic MoeConfig used by the LLM API does not include use_cute_dsl_fp8_block_scale_bmm or …_gemm (they’re still commented out in tensorrt_llm/llmapi/llm_args.py), and the API’s valid_keys check only allows its defined fields (e.g. backend, load_balancer). As a result, uncommenting those lines in the test would break argument validation rather than turn on block‐scale operators. Leave them commented until first‐class support is added to the LLM API.

Likely an incorrect or invalid review comment.

tensorrt_llm/_torch/modules/fused_moe/quantization.py (1)

557-601: Well-structured refactoring for SM 100 support

The refactoring cleanly separates the base functionality from SM 100-specific weight resmoothing and scale transformation logic. This design allows the Cute DSL backend to use the simpler base class while maintaining backward compatibility.
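
A hedged sketch of the layering this describes; the class and method names are assumptions chosen for illustration, not the actual classes in quantization.py.

# Hedged sketch of the described layering; class and method names are assumptions.
import torch


class Fp8BlockScalesMoEMethodBase:
    """Base behavior: plain weight loading, usable by the CuTe DSL backend as-is."""

    def load_weights(self, module, weights: dict) -> None:
        module.w3_w1_weight = weights["w3_w1_weight"]
        module.w2_weight = weights["w2_weight"]


class Fp8BlockScalesMoEMethodSm100(Fp8BlockScalesMoEMethodBase):
    """SM 100 variant: adds weight resmoothing / scale transformation on top of the base."""

    def load_weights(self, module, weights: dict) -> None:
        super().load_weights(module, weights)
        # Blackwell-specific post-processing; the real transformation is elided here.
        module.w3_w1_weight = self._resmooth(module.w3_w1_weight)

    def _resmooth(self, weight: torch.Tensor) -> torch.Tensor:
        return weight  # placeholder for the actual resmoothing kernel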

tensorrt_llm/_torch/modules/attention.py (1)

1408-1428: Correct implementation of CuTe DSL FP8 BMM path

The conditional logic properly selects between the new cute_dsl_fp8_bmm_blackwell operator and the legacy fp8_block_scaling_bmm_out based on the configuration flag.
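
As a hedged sketch of this dispatch: the op names come from the comment above, but the namespaces and argument lists are assumptions, not the actual signatures in attention.py.

# Hedged sketch; op names are from the review comment, namespaces/signatures are assumptions.
import torch


def fp8_bmm_dispatch(q_fp8, kv_fp8, q_scale, kv_scale, out,
                     use_cute_dsl_fp8_block_scale_bmm: bool):
    if use_cute_dsl_fp8_block_scale_bmm:
        # New CuTe DSL Blackwell kernel path.
        return torch.ops.trtllm.cute_dsl_fp8_bmm_blackwell(q_fp8, kv_fp8, q_scale, kv_scale)
    # Legacy path writes into a preallocated output tensor.
    torch.ops.trtllm.fp8_block_scaling_bmm_out(q_fp8, kv_fp8, q_scale, kv_scale, out)
    return out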

tests/unittest/_torch/modules/test_fused_moe.py (6)

562-563: LGTM! Architecture requirement correctly updated.

The change from @skip_non_hopper_unittest to @skip_pre_blackwell correctly reflects that the new CuTe DSL FP8 operations target Blackwell architecture.


569-570: Verify the restriction of sequence length parameterization.

The test now only runs with seq_len=[256] instead of the full range. This significantly reduces test coverage. Please verify if this restriction is:

  1. Temporary for initial development
  2. Due to known issues with other sequence lengths
  3. Performance optimization

Consider re-enabling the full range once the implementation is stable.


575-580: LGTM! Function name appropriately updated.

The function name change to test_fused_moe_fp8_blockwise_cute_dsl correctly reflects the new CuTe DSL functionality being tested.


671-671: LGTM! Reference model correctly configured.

Adding use_cute_dsl_fp8_block_scale_gemm=True properly configures the reference model to use the new CuTe DSL implementation for consistent testing.


696-696: LGTM! Multi-GPU test correctly updated.

The function name and call are properly updated to use the new CuTe DSL test function, maintaining consistency between single-GPU and multi-GPU tests.

Also applies to: 700-700


974-975: LGTM! Parameter propagation correctly implemented.

The use_cute_dsl_fp8_block_scale_gemm parameter is properly added to the constructor, stored as an instance variable, and correctly passed to the GatedMLP instances, ensuring consistent configuration throughout the reference model.

Also applies to: 985-985, 994-995

tensorrt_llm/_torch/models/modeling_deepseekv3.py (4)

193-194: LGTM! Parameter correctly added to DeepseekV3Linear.

The use_cute_dsl_fp8_block_scale_gemm parameter is properly added to the constructor and correctly passed to the parent Linear class, maintaining the expected interface.

Also applies to: 209-210


1363-1364: LGTM! Conditional logic correctly implemented.

The conditional check for moe_backend == "DEEPGEMM" with FP8 block scales and SM version 100 correctly targets the Blackwell architecture scenarios where the new functionality is needed.
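
A small hedged helper restating that gate; the function and argument names are placeholders, not the file's actual helpers.

# Hedged restatement of the gating condition; the helper and its arguments are placeholders.
def needs_blackwell_deepgemm_path(moe_backend: str,
                                  has_fp8_block_scales: bool,
                                  sm_version: int) -> bool:
    return (moe_backend.upper() == "DEEPGEMM"
            and has_fp8_block_scales
            and sm_version == 100)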


1273-1274: Verify removal of contiguous() calls after transpose.

The contiguous() calls have been removed after transpose operations. Please verify that downstream code doesn't require contiguous tensors, as this could cause performance issues or errors.

If the contiguous memory layout isn't required, the change is fine. Otherwise, the calls should be restored.

Also applies to: 1315-1316
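
One way to verify locally, as a minimal sketch with placeholder tensors: restore .contiguous() only if a downstream kernel actually requires a contiguous layout.

# Minimal sketch with placeholder tensors.
import torch

w = torch.randn(8, 16)
w_t = w.transpose(0, 1)          # transpose returns a non-contiguous view
assert not w_t.is_contiguous()
w_t = w_t.contiguous()           # materializes a contiguous copy
assert w_t.is_contiguous()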


1496-1497: LGTM! Consistent conditional logic implementation.

The conditional check for DEEPGEMM backend with FP8 block scales and SM version 100 is consistent with the earlier implementation and correctly targets the appropriate hardware/software configuration.

Comment on lines 481 to 483
reduce_output=False,
)
# use_cute_dsl_fp8_block_scale_gemm=model_config.use_cute_dsl_fp8_block_scale_gemm)
Contributor

🛠️ Refactor suggestion

Complete the parameter implementation in Deepseekv3MoE.

The use_cute_dsl_fp8_block_scale_gemm parameter is commented out in the GatedMLP constructor. This should be completed for consistency.

         reduce_output=False,
-        )
-        # use_cute_dsl_fp8_block_scale_gemm=model_config.use_cute_dsl_fp8_block_scale_gemm)
+        use_cute_dsl_fp8_block_scale_gemm=model_config.use_cute_dsl_fp8_block_scale_gemm,
+        )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
reduce_output=False,
)
# use_cute_dsl_fp8_block_scale_gemm=model_config.use_cute_dsl_fp8_block_scale_gemm)
reduce_output=False,
use_cute_dsl_fp8_block_scale_gemm=model_config.use_cute_dsl_fp8_block_scale_gemm,
)
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/models/modeling_deepseekv3.py around lines 481 to 483,
the parameter use_cute_dsl_fp8_block_scale_gemm is commented out in the GatedMLP
constructor call. Uncomment this parameter and pass
model_config.use_cute_dsl_fp8_block_scale_gemm to ensure the parameter is
properly implemented and consistent with the model configuration.

Comment on lines 670 to 679
        self.mlp = GatedMLP(
            hidden_size=config.hidden_size,
            intermediate_size=config.intermediate_size,
            bias=False,
            dtype=config.torch_dtype,
            config=model_config,
            overridden_tp_size=self.mlp_tp_size,
            reduce_output=True,
        )
        # use_cute_dsl_fp8_block_scale_gemm=self.model_config.use_cute_dsl_fp8_block_scale_gemm)
Contributor

🛠️ Refactor suggestion

Complete the parameter implementation in DeepseekV3DecoderLayer.

The use_cute_dsl_fp8_block_scale_gemm parameter is commented out in the GatedMLP constructor. For consistency across the codebase, this should be properly implemented.

         reduce_output=True,
-        )
-        # use_cute_dsl_fp8_block_scale_gemm=self.model_config.use_cute_dsl_fp8_block_scale_gemm)
+        use_cute_dsl_fp8_block_scale_gemm=self.model_config.use_cute_dsl_fp8_block_scale_gemm,
+        )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
        self.mlp = GatedMLP(
            hidden_size=config.hidden_size,
            intermediate_size=config.intermediate_size,
            bias=False,
            dtype=config.torch_dtype,
            config=model_config,
            overridden_tp_size=self.mlp_tp_size,
            reduce_output=True,
        )
        # use_cute_dsl_fp8_block_scale_gemm=self.model_config.use_cute_dsl_fp8_block_scale_gemm)
        self.mlp = GatedMLP(
            hidden_size=config.hidden_size,
            intermediate_size=config.intermediate_size,
            bias=False,
            dtype=config.torch_dtype,
            config=model_config,
            overridden_tp_size=self.mlp_tp_size,
            reduce_output=True,
            use_cute_dsl_fp8_block_scale_gemm=self.model_config.use_cute_dsl_fp8_block_scale_gemm,
        )
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/models/modeling_deepseekv3.py around lines 670 to 679,
the parameter use_cute_dsl_fp8_block_scale_gemm is commented out in the GatedMLP
constructor. To fix this, uncomment the parameter and pass
self.model_config.use_cute_dsl_fp8_block_scale_gemm to the GatedMLP
initialization to ensure consistent parameter usage across the codebase.

Comment on lines 202 to 209
# self.use_cute_dsl_fp8_block_scale_bmm = config.use_cute_dsl_fp8_block_scale_bmm
# self.use_cute_dsl_fp8_block_scale_gemm = config.use_cute_dsl_fp8_block_scale_gemm
# print(f"limin: attention.py, use_cute_dsl_fp8_block_scale_bmm: {self.use_cute_dsl_fp8_block_scale_bmm}, use_cute_dsl_fp8_block_scale_gemm: {self.use_cute_dsl_fp8_block_scale_gemm}")

# limin-todo: WARNING:
self.use_cute_dsl_fp8_block_scale_bmm = True
self.use_cute_dsl_fp8_block_scale_gemm = True

Contributor

⚠️ Potential issue

Replace hardcoded flags with proper configuration

The hardcoded flags bypass configuration and force all instances to use the CuTe DSL implementation. This should be configurable through the model config.

-        # self.use_cute_dsl_fp8_block_scale_bmm = config.use_cute_dsl_fp8_block_scale_bmm
-        # self.use_cute_dsl_fp8_block_scale_gemm = config.use_cute_dsl_fp8_block_scale_gemm
-        # print(f"limin: attention.py, use_cute_dsl_fp8_block_scale_bmm: {self.use_cute_dsl_fp8_block_scale_bmm}, use_cute_dsl_fp8_block_scale_gemm: {self.use_cute_dsl_fp8_block_scale_gemm}")
-
-        # limin-todo: WARNING:
-        self.use_cute_dsl_fp8_block_scale_bmm = True
-        self.use_cute_dsl_fp8_block_scale_gemm = True
+        self.use_cute_dsl_fp8_block_scale_bmm = getattr(config, 'use_cute_dsl_fp8_block_scale_bmm', False)
+        self.use_cute_dsl_fp8_block_scale_gemm = getattr(config, 'use_cute_dsl_fp8_block_scale_gemm', False)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# self.use_cute_dsl_fp8_block_scale_bmm = config.use_cute_dsl_fp8_block_scale_bmm
# self.use_cute_dsl_fp8_block_scale_gemm = config.use_cute_dsl_fp8_block_scale_gemm
# print(f"limin: attention.py, use_cute_dsl_fp8_block_scale_bmm: {self.use_cute_dsl_fp8_block_scale_bmm}, use_cute_dsl_fp8_block_scale_gemm: {self.use_cute_dsl_fp8_block_scale_gemm}")
# limin-todo: WARNING:
self.use_cute_dsl_fp8_block_scale_bmm = True
self.use_cute_dsl_fp8_block_scale_gemm = True
self.use_cute_dsl_fp8_block_scale_bmm = getattr(config, 'use_cute_dsl_fp8_block_scale_bmm', False)
self.use_cute_dsl_fp8_block_scale_gemm = getattr(config, 'use_cute_dsl_fp8_block_scale_gemm', False)
🧰 Tools
🪛 Ruff (0.12.2)

204-204: Line too long (191 > 120)

(E501)

🤖 Prompt for AI Agents
In tensorrt_llm/_torch/modules/attention.py around lines 202 to 209, the flags
use_cute_dsl_fp8_block_scale_bmm and use_cute_dsl_fp8_block_scale_gemm are
hardcoded to True, bypassing the model configuration. Modify the code to assign
these flags based on the corresponding values from the config object instead of
hardcoding them, ensuring the behavior is configurable per model instance.

Comment on lines 671 to 679
# print(f"limin: attention.py, MLA, config.use_cute_dsl_fp8_block_scale_bmm: {config.use_cute_dsl_fp8_block_scale_bmm}, config.use_cute_dsl_fp8_block_scale_gemm: {config.use_cute_dsl_fp8_block_scale_gemm}")

# self.use_cute_dsl_fp8_block_scale_bmm = config.use_cute_dsl_fp8_block_scale_bmm
# self.use_cute_dsl_fp8_block_scale_gemm = config.use_cute_dsl_fp8_block_scale_gemm

## limin-TODO: WARNING:
self.use_cute_dsl_fp8_block_scale_bmm = True
self.use_cute_dsl_fp8_block_scale_gemm = True

Contributor

⚠️ Potential issue

Replace hardcoded flags with proper configuration in MLA class

The hardcoded flags bypass configuration. This should be configurable through the model config.

-        # print(f"limin: attention.py, MLA, config.use_cute_dsl_fp8_block_scale_bmm: {config.use_cute_dsl_fp8_block_scale_bmm}, config.use_cute_dsl_fp8_block_scale_gemm: {config.use_cute_dsl_fp8_block_scale_gemm}")
-
-        # self.use_cute_dsl_fp8_block_scale_bmm = config.use_cute_dsl_fp8_block_scale_bmm
-        # self.use_cute_dsl_fp8_block_scale_gemm = config.use_cute_dsl_fp8_block_scale_gemm
-
-        ## limin-TODO: WARNING:
-        self.use_cute_dsl_fp8_block_scale_bmm = True
-        self.use_cute_dsl_fp8_block_scale_gemm = True
+        self.use_cute_dsl_fp8_block_scale_bmm = getattr(config, 'use_cute_dsl_fp8_block_scale_bmm', False)
+        self.use_cute_dsl_fp8_block_scale_gemm = getattr(config, 'use_cute_dsl_fp8_block_scale_gemm', False)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# print(f"limin: attention.py, MLA, config.use_cute_dsl_fp8_block_scale_bmm: {config.use_cute_dsl_fp8_block_scale_bmm}, config.use_cute_dsl_fp8_block_scale_gemm: {config.use_cute_dsl_fp8_block_scale_gemm}")
# self.use_cute_dsl_fp8_block_scale_bmm = config.use_cute_dsl_fp8_block_scale_bmm
# self.use_cute_dsl_fp8_block_scale_gemm = config.use_cute_dsl_fp8_block_scale_gemm
## limin-TODO: WARNING:
self.use_cute_dsl_fp8_block_scale_bmm = True
self.use_cute_dsl_fp8_block_scale_gemm = True
self.use_cute_dsl_fp8_block_scale_bmm = getattr(config, 'use_cute_dsl_fp8_block_scale_bmm', False)
self.use_cute_dsl_fp8_block_scale_gemm = getattr(config, 'use_cute_dsl_fp8_block_scale_gemm', False)
🧰 Tools
🪛 Ruff (0.12.2)

671-671: Line too long (214 > 120)

(E501)

🤖 Prompt for AI Agents
In tensorrt_llm/_torch/modules/attention.py around lines 671 to 679, the flags
use_cute_dsl_fp8_block_scale_bmm and use_cute_dsl_fp8_block_scale_gemm are
hardcoded to True, bypassing the model configuration. To fix this, remove the
hardcoded assignments and instead assign these flags from the corresponding
attributes in the config object, ensuring they are properly configurable through
the model config.

Signed-off-by: Mindy Li <[email protected]>
@limin2021 limin2021 changed the title add cute dsl deepgemm fp8 blowise op [TRTLLM-6898][feat] add cute_dsl_fp8_gemm/bmm_group_mm ops Aug 5, 2025
Signed-off-by: Mindy Li <[email protected]>
@limin2021 limin2021 changed the title [TRTLLM-6898][feat] add cute_dsl_fp8_gemm/bmm_group_mm ops [TRTLLM-6898][feat] add cute_dsl_fp8_gemm/bmm/group_mm ops Aug 5, 2025
@limin2021 limin2021 changed the title [TRTLLM-6898][feat] add cute_dsl_fp8_gemm/bmm/group_mm ops [TRTLLM-6898][feat] add cute_dsl_fp8 gemm, bmm and group_mm ops Aug 5, 2025
@limin2021 limin2021 changed the title [TRTLLM-6898][feat] Add CuteDSL fp8 gemm, bmm and group_mm ops [TRTLLM-6898][feat] Add CuteDSL fp8 gemm, bmm and group gemm ops Aug 13, 2025
@limin2021 limin2021 changed the title [TRTLLM-6898][feat] Add CuteDSL fp8 gemm, bmm and group gemm ops [TRTLLM-6898] [feat] Add CuteDSL fp8 gemm, bmm and group gemm ops Aug 13, 2025
@limin2021 limin2021 changed the title [TRTLLM-6898] [feat] Add CuteDSL fp8 gemm, bmm and group gemm ops [TRTLLM-6898] [feat] Add CuteDSL fp8 gemm, bmm and group gemm ops Aug 13, 2025
Collaborator

@hyukn hyukn left a comment

Just added some suggestions on how to apply the autotuner to cute_dsl_fp8_bmm_blackwell. You can use tactics to pack all the valid tunable configs.

Comment on lines 1022 to 1040
        is_valid = BlockwiseGemmKernel.can_implement(
            cutlass.Float8E4M3FN,  # ab_dtype,
            cutlass.Float32,  # acc_dtype,
            cutlass.BFloat16,  # c_dtype,
            use_2cta_instrs,
            mma_tiler_mn,
            cluster_shape_mn,
            m,
            n,
            k,
            l,
            a_major,
            b_major,
            c_major,
        )
        if is_valid:
            return [0]
        else:
            return []
Collaborator

Suggested change
        is_valid = BlockwiseGemmKernel.can_implement(
            cutlass.Float8E4M3FN,  # ab_dtype,
            cutlass.Float32,  # acc_dtype,
            cutlass.BFloat16,  # c_dtype,
            use_2cta_instrs,
            mma_tiler_mn,
            cluster_shape_mn,
            m,
            n,
            k,
            l,
            a_major,
            b_major,
            c_major,
        )
        if is_valid:
            return [0]
        else:
            return []
        # This will return all the valid config combinations for tuning.
        use_2cta_instrs_candi = [False]
        mma_tiler_mn_candi = [(128, 128)]
        cluster_shape_mn_candi = [(1, 1), (1, 2), (1, 4), (2, 1), (2, 2), (4, 1), (4, 4)]
        return [(use_2cta_instrs, mma_tiler_mn, cluster_shape_mn)
                for use_2cta_instrs in use_2cta_instrs_candi
                for mma_tiler_mn in mma_tiler_mn_candi
                for cluster_shape_mn in cluster_shape_mn_candi
                if BlockwiseGemmKernel.can_implement(
                    cutlass.Float8E4M3FN,  # ab_dtype,
                    cutlass.Float32,  # acc_dtype,
                    cutlass.BFloat16,  # c_dtype,
                    use_2cta_instrs,
                    mma_tiler_mn,
                    cluster_shape_mn,
                    m,
                    n,
                    k,
                    l,
                    a_major,
                    b_major,
                    c_major,
                )]

Signed-off-by: Mindy Li <[email protected]>
@limin2021 limin2021 requested a review from a team as a code owner August 14, 2025 06:02
Signed-off-by: Mindy Li <[email protected]>
Collaborator

@Superjomn Superjomn left a comment

Overall LGTM on the llmapi changes.

Signed-off-by: Mindy Li <[email protected]>
max_active_clusters: cutlass.Constexpr,
stream: cuda.CUstream,
):
a = append_ones_wrapper(a)

Do we still need this wrapper given it's already inside a "jit" function?

Collaborator Author

This wrapper is used to permute the tensors to the layout our kernel expects.

Collaborator Author

@limin2021 limin2021 Aug 15, 2025

They are not inside a "jit" function, so I added a jit wrapper for them.

@fengxie fengxie Aug 15, 2025

This particular use I'm commenting on is inside a "jit" function, I believe?

Collaborator Author

Sorry, I misunderstood what you meant. You mean we could move the @cute.jit in the following code?

@cute.jit
def append_ones_wrapper(a: cute.Tensor):
    a_layout = a.layout
    a_layout = cute.append(a_layout,
                           cute.make_layout(1, stride=1),
                           up_to_rank=3)
    new_a = cute.make_tensor(a.iterator, a_layout)
    return new_a

Collaborator

@hyukn hyukn left a comment

Thanks for the work! Most of the changes associated with the tuning part look good to me.
The tunable runner CuteDSLFp8BlackwellGroupGemm still requires the same modifications to be compatible with the autotuner.
It would also be better to add some comments in get_valid_tactics and forward of each runner for detailed explanations.
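
For illustration, a minimal commented runner sketch along those lines; the class shape, method signatures, and config choices are assumptions, not the project's actual autotuner interface.

# Illustrative sketch only; base class, signatures, and config choices are assumptions.
from typing import Any, List, Tuple


class CuteDslFp8GroupGemmRunnerSketch:
    """Stand-in tunable runner showing where the requested comments would go."""

    # Candidate kernel configs (tactics) exposed to the autotuner.
    _CLUSTER_SHAPES: List[Tuple[int, int]] = [(1, 1), (1, 2), (2, 1), (2, 2)]

    def get_valid_tactics(self, inputs: List[Any]) -> List[int]:
        # Return the indices of configs the kernel can implement for this problem
        # shape; the autotuner times each returned tactic and caches the best one.
        return list(range(len(self._CLUSTER_SHAPES)))

    def forward(self, inputs: List[Any], tactic: int = -1) -> Any:
        # tactic == -1 conventionally selects the default (untuned) configuration.
        cluster_shape = self._CLUSTER_SHAPES[tactic] if tactic >= 0 else (1, 1)
        raise NotImplementedError(f"kernel launch with {cluster_shape} elided in this sketch")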

@limin2021 limin2021 changed the title [TRTLLM-6898] [feat] Add CuteDSL fp8 gemm, bmm and group gemm ops [TRTLLM-6898] [feat] Add CuteDSL fp8 gemm, bmm and group gemm ops on Blackwell arch Aug 15, 2025
@limin2021 limin2021 changed the title [TRTLLM-6898] [feat] Add CuteDSL fp8 gemm, bmm and group gemm ops on Blackwell arch [TRTLLM-6898] [feat] Add CuTe DSL fp8 gemm, bmm and group gemm ops on Blackwell architecture Aug 15, 2025
@limin2021
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #15365 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15365 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #11586 completed with status: 'FAILURE'

@chzblych chzblych removed their request for review September 2, 2025 05:16