Conversation


@limin2021 limin2021 commented Jul 31, 2025

[TRTLLM-6898] [feat] Add CuTe DSL fp8 gemm, bmm and group gemm ops on Blackwell arch

  • New Features

    • Added CuTe DSL FP8 block-scaling ops (GEMM, batched GEMM, grouped GEMM) for the Blackwell architecture.
    • Added new options to the model constructors and configuration to control use of the new FP8 block-scaling ops; they are disabled by default.
  • Tests

    • Added comprehensive unit and integration tests to validate the new FP8 GEMM and BMM operators, including multi-GPU and Blackwell-specific scenarios.
    • Updated and expanded fused MoE test coverage to support both legacy and new CuTe DSL kernels.
    • Added a model accuracy test.
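
# op unit test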
pytest -s -o log_cli=true tests/unittest/_torch/thop/test_fp8_block_scale_gemm.py -k "test_cute_dsl_fp8_block_scale_gemm"

pytest -s -o log_cli=true tests/unittest/_torch/thop/test_fp8_block_scale_gemm.py -k "test_cute_dsl_fp8_block_scale_bmm"

pytest -s -o log_cli=true tests/unittest/_torch/thop/test_fp8_block_scale_group_gemm.py -k "test_cute_dsl_fp8_block_scale_group_gemm"

# module test
pytest -s -o log_cli=true tests/unittest/_torch/modules/test_fused_moe.py -k "test_fused_moe_fp8_blockwise_cute_dsl"

pytest -s -o log_cli=true tests/unittest/_torch/modules/test_fused_moe.py -k "test_fused_moe_fp8_blockwise_cute_dsl_multi_gpu"

# model end-to-end run (quickstart example)
python3 examples/llm-api/quickstart_advanced.py --model_dir /home/scratch.trt_llm_data/llm-models/DeepSeek-V3-Lite/fp8 --tp_size 1 --moe_tp_size 1 --moe_ep_size 1 --moe_backend CUTEDSL --use_cute_dsl_blockscaling_mm --use_cute_dsl_blockscaling_bmm

# model accuracy test
pytest -s -o log_cli=true "tests/integration/defs/accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_cute_dsl_fp8_block_scales"

  • Todo
    • Update the group GEMM autotuning after Yukun's supporting MR is merged (so the autotune module supports creating tensors with data dependencies).
    • Verify performance on the DeepSeek model and make the CuTe DSL implementations the default for ops where they perform better.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
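
For example, hypothetical invocations composed only of the flags documented above:

/bot run --disable-fail-fast --stage-list "A10-PyTorch-1"
/bot run --gpu-type "A30, H100_PCIe" --add-multi-gpu-test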

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

@limin2021 limin2021 requested a review from a team as a code owner July 31, 2025 10:35
@limin2021 limin2021 requested review from byshiue and hyukn July 31, 2025 10:35

coderabbitai bot commented Jul 31, 2025

📝 Walkthrough

Walkthrough

This change introduces new FP8 GEMM and batched GEMM operators using the CuTe DSL and CUTLASS libraries for the NVIDIA Blackwell architecture, with corresponding integration in model, module, and test code. It adds environment-variable-driven flags to enable these kernels, updates weight-loading and quantization logic, and extends test coverage for both single-GPU and multi-GPU scenarios.
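
A minimal sketch of the flag-driven dispatch this walkthrough describes, assembled from op and attribute names that appear in the diffs later in this thread; the preparation of the FP8 activation and its block-scaling factors is elided, and the helper name is illustrative rather than the PR's exact code:

import torch

def fp8_block_scale_gemm_dispatch(module, act_input_fp8, act_input_sf):
    # Prefer the new CuTe DSL Blackwell kernel only when the module flag is set
    # and the custom op has actually been registered; otherwise fall back to the
    # existing block-scaling GEMM. (A review comment below additionally suggests
    # gating on get_sm_version() >= 100.)
    use_cute_dsl = getattr(module, "use_cute_dsl_fp8_block_scale_gemm", False)
    if use_cute_dsl and hasattr(torch.ops.trtllm, "cute_dsl_fp8_gemm_blackwell"):
        return torch.ops.trtllm.cute_dsl_fp8_gemm_blackwell(
            act_input_fp8, module.weight, act_input_sf, module.weight_scale)
    return torch.ops.trtllm.fp8_block_scaling_gemm(
        act_input_fp8, module.weight, act_input_sf, module.weight_scale)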

Changes

Cohort / File(s) Change Summary
Custom FP8 GEMM Operators (CuTe/CUTLASS/Blackwell)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
Adds CuteDSLFp8BlackwellLinear and CuteDSLFp8BlackwellBmm classes for FP8 GEMM/BMM using CuTe DSL and CUTLASS; registers new custom CUDA operators and supporting JIT functions.
Linear and GatedMLP Module Integration
tensorrt_llm/_torch/modules/linear.py, tensorrt_llm/_torch/modules/gated_mlp.py
Adds use_cute_dsl_blockscaling_mm flag to constructors and propagates to submodules; conditionally uses new GEMM operator in Linear; updates print/debug output.
Attention and MLA Module Integration
tensorrt_llm/_torch/modules/attention.py
Adds use_cute_dsl_blockscaling_mm and use_cute_dsl_blockscaling_bmm flags from environment; conditionally uses new BMM operator in forward pass and weight creation.
DeepSeekV3 Model Integration
tensorrt_llm/_torch/models/modeling_deepseekv3.py
Adds environment flag-driven kernel selection to all relevant classes; updates weight loading logic for Blackwell/DeepGEMM; propagates flags to submodules; adjusts debug output.
Fused MoE Integration and Quantization
tensorrt_llm/_torch/modules/fused_moe/quantization.py, tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py, tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py
Adds a new quantization method class for DeepGEMM; updates method selection logic; adds debug prints; no logic changes in Cute DSL MoE.
Unit and Integration Test Additions/Updates
tests/unittest/_torch/thop/test_fp8_block_scale_gemm.py, tests/unittest/_torch/modules/test_fused_moe.py, tests/integration/defs/accuracy/test_llm_api_pytorch.py
Adds/updates tests for new FP8 GEMM/BMM kernels (including multi-GPU); parameterizes tests over new flags; updates decorators for Blackwell architecture; extends reference models to support new kernel.
Miscellaneous Formatting/Import Updates
tensorrt_llm/llmapi/llm.py, tensorrt_llm/evaluate/lm_eval.py
Minor formatting and import refactoring; no logic changes.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Model
    participant Linear
    participant CustomOp (CuteDSLFp8BlackwellLinear)
    participant CUDA Kernel

    User->>Model: Forward(input)
    Model->>Linear: Forward(input)
    Linear->>CustomOp: cute_dsl_fp8_gemm_blackwell(input, weight, ...)
    CustomOp->>CUDA Kernel: Launch FP8 GEMM kernel
    CUDA Kernel-->>CustomOp: Output tensor
    CustomOp-->>Linear: Output tensor
    Linear-->>Model: Output
    Model-->>User: Output
sequenceDiagram
    participant User
    participant MLA
    participant CustomOp (CuteDSLFp8BlackwellBmm)
    participant CUDA Kernel

    User->>MLA: forward_generation(...)
    MLA->>CustomOp: cute_dsl_fp8_bmm_blackwell(input, weight, ...)
    CustomOp->>CUDA Kernel: Launch FP8 BMM kernel
    CUDA Kernel-->>CustomOp: Output tensor
    CustomOp-->>MLA: Output tensor
    MLA-->>User: Output

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

Community want to contribute

Suggested reviewers

  • FrankD412
  • syuoni
  • hlu1
  • nv-guomingz

📜 Recent review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7924f93 and 88e9e2e.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/modules/fused_moe/quantization.py (1 hunks)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py (9 hunks)
  • tests/unittest/_torch/modules/test_fused_moe.py (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • tensorrt_llm/_torch/modules/fused_moe/quantization.py
  • tests/unittest/_torch/modules/test_fused_moe.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (4)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (4)

11-29: Remove commented imports to improve code clarity.

The file contains both commented and active imports for the same modules. The active imports using absolute paths are correct. Please remove the commented relative import statements to avoid confusion.

-# from ..attention_backend.interface import AttentionInputType
-# from ..autotuner import (AutoTuner, ConstraintSpec, DynamicTensorSpec,
-#                          OptimizationProfile, TunableRunner, TuningConfig)
-# from ..utils import (compute_swizzled_sf_shape, fp4_scale_infer_shape,
-#                      get_last_power_of_2_num_tokens_buckets,
-#                      last_positive_power_of_2)
 from tensorrt_llm._torch.attention_backend.interface import AttentionInputType
 from tensorrt_llm._torch.autotuner import (AutoTuner, ConstraintSpec,
                                            DynamicTensorSpec,
                                            OptimizationProfile, TunableRunner,
                                            TuningConfig)
-# from tensorrt_llm._torch.custom_ops.cute_dsl_kernels.blackwell.blockwise_gemm import \
-#     BlockwiseGemmKernel
 from tensorrt_llm._torch.custom_ops.cute_dsl_kernels.blackwell.blockwise_gemm_release import \
     BlockwiseGemmKernel

1173-1182: Remove commented code in append_ones_wrapper function.

The function contains a commented line that duplicates the functionality. Clean up the code by removing it.

 @cute.jit
 def append_ones_wrapper(a: cute.Tensor):
-    # return cute.append_ones(tensor)
     a_layout = a.layout
     a_layout = cute.append(a_layout,
                            cute.make_layout(1, stride=1),
                            up_to_rank=3)
     new_a = cute.make_tensor(a.iterator, a_layout)
     return new_a

1254-1285: Remove commented "before opt" code section.

The extensive commented code block labeled "before opt" should be removed as it clutters the codebase and the optimized version is already implemented.

Remove lines 1254-1285 entirely to improve code readability.


1372-1406: Consider implementing autotuning or documenting why it's disabled.

The function contains extensive commented code for autotuning support. Either:

  1. Implement the autotuning functionality if it's ready
  2. Remove the commented code and add a TODO comment explaining the plan
  3. Document why fixed parameters are currently preferred

If autotuning is not ready, replace the commented code with:

def cute_dsl_fp8_gemm_blackwell(
    input: torch.Tensor,
    weight: torch.Tensor,
    input_scale: torch.Tensor,
    weight_scale: torch.Tensor,
) -> torch.Tensor:
    # TODO: Implement autotuning support for optimal tactic selection
    # Currently using fixed parameters for initial implementation
    cute_dsl_fp8_gemm_blackwell_runner = CuteDSLFp8BlackwellLinear()
    return cute_dsl_fp8_gemm_blackwell_runner(
        inputs=[input, weight, input_scale, weight_scale],
        tactic=0,
        use_2cta_instrs=False,
        mma_tiler_mn=(128, 128),
        cluster_shape_mn=(1, 1),
    )
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d38c26b and 8cb6c2c.

📒 Files selected for processing (4)
  • examples/llm-api/quickstart_advanced.py (1 hunks)
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (2 hunks)
  • tensorrt_llm/_torch/modules/linear.py (3 hunks)
  • tests/unittest/_torch/thop/test_fp8_block_scale_gemm.py (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
tests/unittest/_torch/thop/test_fp8_block_scale_gemm.py (1)

Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

🪛 Ruff (0.12.2)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py

1203-1203: Ambiguous variable name: l

(E741)

🔇 Additional comments (4)
examples/llm-api/quickstart_advanced.py (1)

50-54: LGTM!

The addition of 'CUTEDSL' to the MoE backend choices properly exposes the new CuTe DSL FP8 GEMM backend option to users.

tests/unittest/_torch/thop/test_fp8_block_scale_gemm.py (1)

60-70: LGTM!

The test addition properly validates the new CuTe DSL FP8 GEMM kernel on Blackwell architecture with appropriate tolerances for FP8 precision.

tensorrt_llm/_torch/modules/linear.py (2)

1451-1451: LGTM! Well-structured parameter addition.

The new boolean parameter is properly positioned, has a descriptive name, and maintains backward compatibility with the default value.


1498-1498: LGTM! Proper instance attribute assignment.

The parameter is correctly stored as an instance attribute following the established pattern.

m = inputs[0].shape[0]
n = inputs[1].shape[0]
k = inputs[0].shape[1]
l = 1

⚠️ Potential issue

Use a more descriptive variable name instead of 'l'.

Single-letter variable names like 'l' are ambiguous and can be confused with '1' or 'I'. Consider using a more descriptive name.

-        l = 1
+        batch_size = 1

Also update the reference to this variable in line 1221:

-            l,
+            batch_size,
🧰 Tools
🪛 Ruff (0.12.2)

1203-1203: Ambiguous variable name: l

(E741)

🤖 Prompt for AI Agents
In tensorrt_llm/_torch/custom_ops/torch_custom_ops.py at line 1203, replace the
single-letter variable name 'l' with a more descriptive name that clearly
indicates its purpose. Also, update all references to this variable, including
the one at line 1221, to use the new descriptive name to maintain consistency
and clarity.

Comment on lines 575 to 580
if module.use_cute_dsl_fp8_block_scale_gemm:
output = torch.ops.trtllm.cute_dsl_fp8_gemm_blackwell(
act_input_fp8, module.weight, act_input_sf, module.weight_scale)
else:
output = torch.ops.trtllm.fp8_block_scaling_gemm(
act_input_fp8, module.weight, act_input_sf, module.weight_scale)

💡 Verification agent

🧩 Analysis chain

Consider adding runtime validation for the new operator.

The conditional logic is implemented correctly, but consider adding validation to ensure the new operator is available and compatible with the current hardware architecture, especially since the operator name suggests Blackwell-specific functionality.


🏁 Script executed:

#!/bin/bash
# Search for any existing hardware/architecture validation patterns in the codebase
rg -A 5 -B 5 "blackwell|Blackwell" --type py

Length of output: 221000


🏁 Script executed:

#!/bin/bash
# Look for operator availability checks or similar validation patterns
rg -A 3 -B 3 "torch\.ops\.trtllm\." --type py | rg -A 3 -B 3 "(available|exists|check)"

Length of output: 13154


Add runtime checks for Blackwell-specific operator

Wrap the call to cute_dsl_fp8_gemm_blackwell in tensorrt_llm/_torch/modules/linear.py (around lines 575–580) with:

  • an SM-version guard (e.g. only if get_sm_version() >= 100)
  • an operator-availability check (hasattr(torch.ops.trtllm, "cute_dsl_fp8_gemm_blackwell"))

Otherwise, fall back to fp8_block_scaling_gemm. This ensures you won’t attempt a Blackwell-only op on unsupported hardware or when the operator isn’t loaded.

Locations to update:
• tensorrt_llm/_torch/modules/linear.py (around the if module.use_cute_dsl_fp8_block_scale_gemm: block)

Example diff:

-        if module.use_cute_dsl_fp8_block_scale_gemm:
-            output = torch.ops.trtllm.cute_dsl_fp8_gemm_blackwell(
-                act_input_fp8, module.weight, act_input_sf, module.weight_scale)
-        else:
-            output = torch.ops.trtllm.fp8_block_scaling_gemm(
-                act_input_fp8, module.weight, act_input_sf, module.weight_scale)
+        is_blackwell = get_sm_version() >= 100
+        has_cute_op = hasattr(torch.ops.trtllm, "cute_dsl_fp8_gemm_blackwell")
+        if module.use_cute_dsl_fp8_block_scale_gemm and is_blackwell and has_cute_op:
+            output = torch.ops.trtllm.cute_dsl_fp8_gemm_blackwell(
+                act_input_fp8, module.weight, act_input_sf, module.weight_scale)
+        else:
+            output = torch.ops.trtllm.fp8_block_scaling_gemm(
+                act_input_fp8, module.weight, act_input_sf, module.weight_scale)

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In tensorrt_llm/_torch/modules/linear.py around lines 575 to 580, the call to
the Blackwell-specific operator cute_dsl_fp8_gemm_blackwell lacks runtime
checks. Add a guard to call this operator only if get_sm_version() is at least
100 and if torch.ops.trtllm has the attribute cute_dsl_fp8_gemm_blackwell. If
either condition is not met, fall back to calling fp8_block_scaling_gemm. This
prevents attempting to run the Blackwell-only operator on unsupported hardware
or when the operator is unavailable.

Signed-off-by: Mindy Li <[email protected]>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

♻️ Duplicate comments (1)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)

1207-1207: Use a more descriptive variable name instead of 'l'

Single-letter variable names like 'l' are ambiguous and can be confused with '1' or 'I'. Consider using a more descriptive name like 'batch_size'.

🧹 Nitpick comments (6)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (6)

12-17: Remove commented-out code

These commented imports are redundant as they've been replaced with proper absolute imports below. Remove them to improve code clarity.

-# from ..attention_backend.interface import AttentionInputType
-# from ..autotuner import (AutoTuner, ConstraintSpec, DynamicTensorSpec,
-#                          OptimizationProfile, TunableRunner, TuningConfig)
-# from ..utils import (compute_swizzled_sf_shape, fp4_scale_infer_shape,
-#                      get_last_power_of_2_num_tokens_buckets,
-#                      last_positive_power_of_2)

23-25: Remove commented-out import

This commented import for BlockwiseGemmKernel should be removed as it's been replaced with the actual import below.

-# from tensorrt_llm._torch.custom_ops.cute_dsl_kernels.blackwell.blockwise_gemm import \
-#     BlockwiseGemmKernel

1257-1286: Remove large block of commented-out code

This extensive commented code block appears to be an earlier implementation. It should be removed to improve code readability and maintainability.

-        # before opt
-        """
-        a, b, a_sf, b_sf = inputs
-        m, n, k = a.shape[0], b.shape[0], a.shape[1]
-        w_n, w_k = b_sf.shape[0], b_sf.shape[1]
-        c = torch.empty(*(m, n), dtype=torch.bfloat16, device="cuda")
-
-        # torch_tensor -> cute.tensor
-        a_tmp = a.as_strided((m, k, 1), (k, 1, m * k)).view(torch.uint8)
-        b_tmp = b.as_strided((n, k, 1), (k, 1, n * k)).view(torch.uint8)
-        c_tmp = c.as_strided((m, n, 1), (n, 1, m * n))
-        weight_scale_tmp = b_sf.as_strided((w_n, w_k, 1), (w_k, 1, w_n * w_k))
-        # [xx, m]
-        input_scale_tmp = a_sf.permute(1, 0).as_strided(
-            (m, w_k, 1), (1, m, w_k * m)
-        )
-
-        mA = from_dlpack(a_tmp, assumed_align=16).mark_layout_dynamic(leading_dim=1)
-        mB = from_dlpack(b_tmp, assumed_align=16).mark_layout_dynamic(leading_dim=1)
-        mC = from_dlpack(c_tmp, assumed_align=16).mark_layout_dynamic(leading_dim=1)
-        mA.element_type = cutlass.Float8E4M3FN
-        mB.element_type = cutlass.Float8E4M3FN
-
-        # TODO: mSFA is column major
-        mSFA = from_dlpack(input_scale_tmp, assumed_align=16).mark_layout_dynamic(
-            leading_dim=0
-        )
-        mSFB = from_dlpack(weight_scale_tmp, assumed_align=16).mark_layout_dynamic(
-            leading_dim=1
-        )
-        """

1372-1404: Remove commented autotuner code or add TODO

The commented autotuner configuration suggests this was intended to be tunable. Either remove this code if not needed, or add a TODO comment explaining when autotuning will be implemented.

-# @autotuner.tuning_config(
-#     name="trtllm::cute_dsl_fp8_gemm_blackwell::gemm",
-#     dynamic_tensor_specs=(DynamicTensorSpec(
-#         0, 0, get_last_power_of_2_num_tokens_buckets,
-#         last_positive_power_of_2), ),
-#     constraint_specs=(ConstraintSpec(
-#         2, 1, cute_dsl_fp8_linear_scale_infer_shape_blackwell), ),
-#     configs={
-#         "use_2cta_instrs": [False],
-#         "mma_tiler_mn": [(128, 128)],
-#         "cluster_shape_mn": [(1, 1), (1, 2), (1, 4), (2, 1), (2, 2), (4, 1),
-#                              (4, 4)],
-#     },
-# )
def cute_dsl_fp8_gemm_blackwell(
    input: torch.Tensor,
    weight: torch.Tensor,
    input_scale: torch.Tensor,
    weight_scale: torch.Tensor,
) -> torch.Tensor:
-    # tuner = AutoTuner.get()
-    # # allocate workspace for profiling
-    # cute_dsl_fp8_gemm_blackwell_runner = CuteDSLFp8BlackwellLinear()
-    # _, best_tactic, best_config = tuner.choose_one(
-    #     "trtllm::cute_dsl_fp8_gemm_blackwell::gemm",
-    #     [cute_dsl_fp8_gemm_blackwell_runner],
-    #     [input, weight, input_scale, weight_scale],
-    # )
-    # return cute_dsl_fp8_gemm_blackwell_runner(
-    #     inputs=[input, weight, input_scale, weight_scale],
-    #     tactic=best_tactic,
-    #     **best_config,
-    # )
-
+    # TODO: Implement autotuning when performance profiling is complete
    cute_dsl_fp8_gemm_blackwell_runner = CuteDSLFp8BlackwellLinear()

1503-1507: Remove commented validation and alternative implementation code

These commented sections should be removed to improve code clarity.

-        # if c.dtype != torch.bfloat16:
-        #     assert False, "c.dtype != bf16"
-        # if c.shape != (l, m, n):
-        #     assert False, "c.shape != (l, m, n)"
-
-        # """
         # torch_tensor -> cute.tensor
         a_tmp = a.permute(1, 2, 0).view(torch.uint8)
         ...
-        # """
-        # a_tmp = a.view(torch.uint8)
-        # b_tmp = b.view(torch.uint8)
-        # mA = from_dlpack(
-        #     a_tmp, assumed_align=16).mark_layout_dynamic(leading_dim=2)
-        # mB = from_dlpack(
-        #     b_tmp, assumed_align=16).mark_layout_dynamic(leading_dim=2)
-        # mC = from_dlpack(
-        #     c, assumed_align=16).mark_layout_dynamic(leading_dim=2)
-        # mA.element_type = cutlass.Float8E4M3FN
-        # mB.element_type = cutlass.Float8E4M3FN
-
-        # # Note: mSFA is column major
-        # mSFA = from_dlpack(
-        #     a_sf, assumed_align=16).mark_layout_dynamic(leading_dim=2)
-        # mSFB = from_dlpack(
-        #     b_sf, assumed_align=16).mark_layout_dynamic(leading_dim=2)

Also applies to: 1543-1560


1619-1636: Remove commented autotuner configuration

Similar to the GEMM operator, remove the commented autotuner code or add a TODO if this is planned for future implementation.

-# @autotuner.tuning_config(
-#     name="trtllm::cute_dsl_fp8_bmm_blackwell::gemm",
-#     dynamic_tensor_specs=(DynamicTensorSpec(
-#         0, 1, get_last_power_of_2_num_tokens_buckets,
-#         last_positive_power_of_2), ),
-#     # constraint_specs=(
-#     #     ConstraintSpec(2, 0, cute_dsl_fp8_bmm_scale_infer_shape_blackwell),
-#     # ),
-#     constraint_specs=(ConstraintSpec(
-#         2, 2, cute_dsl_fp8_bmm_scale_infer_shape_blackwell), ),
-#     configs={
-#         "use_2cta_instrs": [False],
-#         "mma_tiler_mn": [(128, 128)],
-#         "cluster_shape_mn": [(1, 1), (1, 2), (1, 4), (2, 1), (2, 2), (4, 1),
-#                              (4, 4)],
-#         # 'cluster_shape_mn': [(1, 1)],
-#     },
-# )
def cute_dsl_fp8_bmm_blackwell(
    input: torch.Tensor,
    weight: torch.Tensor,
    input_scale: torch.Tensor,
    weight_scale: torch.Tensor,
    out: torch.Tensor,
) -> None:
-    # tuner = AutoTuner.get()
-
-    # # allocate workspace for profiling
-    # cute_dsl_fp8_bmm_blackwell_runner = CuteDSLFp8BlackwellBmm()
-
-    # _, best_tactic, best_config = tuner.choose_one(
-    #     "trtllm::cute_dsl_fp8_bmm_blackwell::gemm",
-    #     [cute_dsl_fp8_bmm_blackwell_runner],
-    #     [input, weight, input_scale, weight_scale, out],
-    # )
-    # cute_dsl_fp8_bmm_blackwell_runner(
-    #     inputs=[input, weight, input_scale, weight_scale, out],
-    #     tactic=best_tactic,
-    #     **best_config,
-    # )
-
+    # TODO: Implement autotuning when performance profiling is complete
    cute_dsl_fp8_bmm_blackwell_runner = CuteDSLFp8BlackwellBmm()

Also applies to: 1644-1659

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8cb6c2c and 5606444.

📒 Files selected for processing (2)
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (2 hunks)
  • tests/unittest/_torch/thop/test_fp8_block_scale_gemm.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/unittest/_torch/thop/test_fp8_block_scale_gemm.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile = ...).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL = ...).
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
🧠 Learnings (1)
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Applied to files:

  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py

1207-1207: Ambiguous variable name: l

(E741)


1448-1448: Ambiguous variable name: l

(E741)


1501-1501: Ambiguous variable name: l

(E741)


1501-1501: Local variable n is assigned to but never used

Remove assignment to unused variable n

(F841)


1501-1501: Local variable k is assigned to but never used

Remove assignment to unused variable k

(F841)


1502-1502: Local variable w_n is assigned to but never used

Remove assignment to unused variable w_n

(F841)


1678-1678: Ambiguous variable name: l

(E741)


1678-1678: Local variable k is assigned to but never used

Remove assignment to unused variable k

(F841)

Comment on lines 1 to 10
from functools import lru_cache
from typing import List, Optional, Tuple

import cutlass
import cutlass.cute as cute
import nvtx
import torch
from cuda import cuda
from cutlass.cute.runtime import from_dlpack


💡 Verification agent

❓ Verification inconclusive

Add NVIDIA copyright header

According to the coding guidelines, all TensorRT-LLM source files must contain an NVIDIA copyright header that includes the current year.


🏁 Script executed:

#!/bin/bash
# Check if other Python files in the repository have copyright headers
head -n 20 tensorrt_llm/_torch/custom_ops/*.py | grep -E "(Copyright|NVIDIA)" | head -10

Length of output: 366


Add missing NVIDIA copyright header
Per the coding guidelines, every source file must include an NVIDIA copyright header with the current year.

• File needing update:

  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py — insert header at the very top

Suggested header snippet (adjust to match your repo’s exact template):

+ # Copyright (c) 2025 NVIDIA CORPORATION. All rights reserved.
+ # SPDX-License-Identifier: Apache-2.0
+
from functools import lru_cache
...
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/custom_ops/torch_custom_ops.py at the very top before any
imports, add the NVIDIA copyright header with the current year as required by
coding guidelines. Insert the standard header snippet used in your repo to
ensure compliance.

Comment on lines 1448 to 1450
l, m, k = inputs[0].shape[0], inputs[0].shape[1], inputs[0].shape[2]
# [b, n, k]
n = inputs[1].shape[1]

⚠️ Potential issue

Use descriptive variable name instead of 'l'

The variable 'l' is ambiguous. Use 'batch_size' for clarity.

-        l, m, k = inputs[0].shape[0], inputs[0].shape[1], inputs[0].shape[2]
+        batch_size, m, k = inputs[0].shape[0], inputs[0].shape[1], inputs[0].shape[2]
         n = inputs[1].shape[1]

Also update the reference at line 1467:

-            l,
+            batch_size,
🧰 Tools
🪛 Ruff (0.12.2)

1448-1448: Ambiguous variable name: l

(E741)

🤖 Prompt for AI Agents
In tensorrt_llm/_torch/custom_ops/torch_custom_ops.py around lines 1448 to 1450,
rename the variable 'l' to 'batch_size' to improve code clarity and
descriptiveness. Also, update all references to 'l' accordingly, including the
one at line 1467, to maintain consistency and avoid errors.

Comment on lines 1678 to 1683
l, m, k = mat_a.shape[0], mat_a.shape[1], mat_a.shape[2]
n = mat_b.shape[1]
if out.dtype != torch.bfloat16:
assert False, "out.dtype != bf16"
if out.shape != (l, m, n):
assert False, "out.shape != (l, m, n)"

⚠️ Potential issue

Fix variable naming and remove unused variable

Use descriptive variable names and remove the unused variable 'k'. Also, use proper exceptions instead of assert False.

-    l, m, k = mat_a.shape[0], mat_a.shape[1], mat_a.shape[2]
+    batch_size, m = mat_a.shape[0], mat_a.shape[1]
     n = mat_b.shape[1]
     if out.dtype != torch.bfloat16:
-        assert False, "out.dtype != bf16"
-    if out.shape != (l, m, n):
-        assert False, "out.shape != (l, m, n)"
+        raise ValueError("out.dtype must be torch.bfloat16")
+    if out.shape != (batch_size, m, n):
+        raise ValueError(f"out.shape must be ({batch_size}, {m}, {n}), got {out.shape}")
🧰 Tools
🪛 Ruff (0.12.2)

1678-1678: Ambiguous variable name: l

(E741)


1678-1678: Local variable k is assigned to but never used

Remove assignment to unused variable k

(F841)

🤖 Prompt for AI Agents
In tensorrt_llm/_torch/custom_ops/torch_custom_ops.py around lines 1678 to 1683,
rename variables to more descriptive names, remove the unused variable 'k', and
replace the assert False statements with proper exception raising such as
ValueError with clear error messages for dtype and shape mismatches.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (5)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (5)

1-10: Add missing NVIDIA copyright header

According to the coding guidelines, all TensorRT-LLM source files must contain an NVIDIA copyright header with the current year.

Add the copyright header at the very beginning of the file:

+# Copyright (c) 2025 NVIDIA CORPORATION. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
from functools import lru_cache

998-998: Use descriptive variable name instead of 'l'

Single-letter variable 'l' is ambiguous and violates coding standards.

-        l = 1
+        batch_size = 1

Also update the reference at line 1016:

-            l,
+            batch_size,

1239-1241: Fix variable naming and remove unused variables

Use descriptive variable names and remove unused variables.

-        l, m, k = inputs[0].shape[0], inputs[0].shape[1], inputs[0].shape[2]
+        batch_size, m, k = inputs[0].shape[0], inputs[0].shape[1], inputs[0].shape[2]
         n = inputs[1].shape[1]

Also update references at lines 1255, 1258, 1259:

-            l,
+            batch_size,

1292-1293: Remove unused variables in forward method

Variables n, k, and w_n are assigned but never used.

-        l, m, n, k = a.shape[0], a.shape[1], b.shape[1], b.shape[2]
-        w_n, w_k = b_sf.shape[1], b_sf.shape[2]
+        l, m = a.shape[0], a.shape[1]
+        w_k = b_sf.shape[2]

1469-1474: Fix variable naming, remove unused variable, and use proper exceptions

Use descriptive variable names, remove unused variables, and replace assert statements with proper exceptions.

-    l, m, k = mat_a.shape[0], mat_a.shape[1], mat_a.shape[2]
+    batch_size, m = mat_a.shape[0], mat_a.shape[1]
     n = mat_b.shape[1]
     if out.dtype != torch.bfloat16:
-        assert False, "out.dtype != bf16"
-    if out.shape != (l, m, n):
-        assert False, "out.shape != (l, m, n)"
+        raise ValueError("out.dtype must be torch.bfloat16")
+    if out.shape != (batch_size, m, n):
+        raise ValueError(f"out.shape must be ({batch_size}, {m}, {n}), got {out.shape}")
🧹 Nitpick comments (7)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (7)

12-26: Remove commented imports and document the release kernel usage

The commented imports create unnecessary clutter. If switching between debug and release kernels is needed, consider using a configuration flag instead.

-# from ..attention_backend.interface import AttentionInputType
-# from ..autotuner import (AutoTuner, ConstraintSpec, DynamicTensorSpec,
-#                          OptimizationProfile, TunableRunner, TuningConfig)
-# from ..utils import (compute_swizzled_sf_shape, fp4_scale_infer_shape,
-#                      get_last_power_of_2_num_tokens_buckets,
-#                      last_positive_power_of_2)
from tensorrt_llm._torch.autotuner import (AutoTuner, ConstraintSpec,
                                           DynamicTensorSpec,
                                           OptimizationProfile, TunableRunner,
                                           TuningConfig)
-# from tensorrt_llm._torch.custom_ops.cute_dsl_kernels.blackwell.blockwise_gemm import \
-#     BlockwiseGemmKernel
+# Using release version of BlockwiseGemmKernel for production
from tensorrt_llm._torch.custom_ops.cute_dsl_kernels.blackwell.blockwise_gemm_release import \
    BlockwiseGemmKernel

944-946: Add a docstring to the utility function

The type hints are already present; add a brief docstring for clarity and IDE support.

-def pad_up(x: int, y: int) -> int:
-    return ((x + y - 1) // y) * y
+def pad_up(x: int, y: int) -> int:
+    """Round up x to the nearest multiple of y."""
+    return ((x + y - 1) // y) * y
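
For example, pad_up(130, 128) returns 256.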

969-978: Clean up commented code in append_ones_wrapper

Remove the commented line and improve documentation.

@cute.jit
def append_ones_wrapper(a: cute.Tensor):
-    # return cute.append_ones(tensor)
+    """Append ones to tensor layout up to rank 3."""
    a_layout = a.layout
    a_layout = cute.append(a_layout,
                           cute.make_layout(1, stride=1),
                           up_to_rank=3)
    new_a = cute.make_tensor(a.iterator, a_layout)
    return new_a

1047-1078: Remove large block of commented code

The "before opt" commented code should be removed. If needed for reference, it should be in version control history or documentation.

-        # before opt
-        """
-        a, b, a_sf, b_sf = inputs
-        m, n, k = a.shape[0], b.shape[0], a.shape[1]
-        w_n, w_k = b_sf.shape[0], b_sf.shape[1]
-        c = torch.empty(*(m, n), dtype=torch.bfloat16, device="cuda")
-
-        # torch_tensor -> cute.tensor
-        a_tmp = a.as_strided((m, k, 1), (k, 1, m * k)).view(torch.uint8)
-        b_tmp = b.as_strided((n, k, 1), (k, 1, n * k)).view(torch.uint8)
-        c_tmp = c.as_strided((m, n, 1), (n, 1, m * n))
-        weight_scale_tmp = b_sf.as_strided((w_n, w_k, 1), (w_k, 1, w_n * w_k))
-        # [xx, m]
-        input_scale_tmp = a_sf.permute(1, 0).as_strided(
-            (m, w_k, 1), (1, m, w_k * m)
-        )
-
-        mA = from_dlpack(a_tmp, assumed_align=16).mark_layout_dynamic(leading_dim=1)
-        mB = from_dlpack(b_tmp, assumed_align=16).mark_layout_dynamic(leading_dim=1)
-        mC = from_dlpack(c_tmp, assumed_align=16).mark_layout_dynamic(leading_dim=1)
-        mA.element_type = cutlass.Float8E4M3FN
-        mB.element_type = cutlass.Float8E4M3FN
-
-        # TODO: mSFA is column major
-        mSFA = from_dlpack(input_scale_tmp, assumed_align=16).mark_layout_dynamic(
-            leading_dim=0
-        )
-        mSFB = from_dlpack(weight_scale_tmp, assumed_align=16).mark_layout_dynamic(
-            leading_dim=1
-        )
-        """
-
        a, b, a_sf, b_sf = inputs

1163-1195: Remove commented autotuner code and document why it's disabled

Large blocks of commented code should be removed. If autotuning was intentionally disabled, please document the reason.

-# @autotuner.tuning_config(
-#     name="trtllm::cute_dsl_fp8_gemm_blackwell::gemm",
-#     dynamic_tensor_specs=(DynamicTensorSpec(
-#         0, 0, get_last_power_of_2_num_tokens_buckets,
-#         last_positive_power_of_2), ),
-#     constraint_specs=(ConstraintSpec(
-#         2, 1, cute_dsl_fp8_linear_scale_infer_shape_blackwell), ),
-#     configs={
-#         "use_2cta_instrs": [False],
-#         "mma_tiler_mn": [(128, 128)],
-#         "cluster_shape_mn": [(1, 1), (1, 2), (1, 4), (2, 1), (2, 2), (4, 1),
-#                              (4, 4)],
-#     },
-# )
def cute_dsl_fp8_gemm_blackwell(
    input: torch.Tensor,
    weight: torch.Tensor,
    input_scale: torch.Tensor,
    weight_scale: torch.Tensor,
) -> torch.Tensor:
-    # tuner = AutoTuner.get()
-    # # allocate workspace for profiling
-    # cute_dsl_fp8_gemm_blackwell_runner = CuteDSLFp8BlackwellLinear()
-    # _, best_tactic, best_config = tuner.choose_one(
-    #     "trtllm::cute_dsl_fp8_gemm_blackwell::gemm",
-    #     [cute_dsl_fp8_gemm_blackwell_runner],
-    #     [input, weight, input_scale, weight_scale],
-    # )
-    # return cute_dsl_fp8_gemm_blackwell_runner(
-    #     inputs=[input, weight, input_scale, weight_scale],
-    #     tactic=best_tactic,
-    #     **best_config,
-    # )
-
+    # TODO: Autotuning is temporarily disabled. Using fixed configuration.
    cute_dsl_fp8_gemm_blackwell_runner = CuteDSLFp8BlackwellLinear()

Was autotuning disabled for performance reasons or is this temporary? Consider adding a flag to enable/disable autotuning rather than commenting out the code.


1335-1350: Remove commented alternative implementation

Remove the large block of commented code to improve readability.

-        # """
-        # a_tmp = a.view(torch.uint8)
-        # b_tmp = b.view(torch.uint8)
-        # mA = from_dlpack(
-        #     a_tmp, assumed_align=16).mark_layout_dynamic(leading_dim=2)
-        # mB = from_dlpack(
-        #     b_tmp, assumed_align=16).mark_layout_dynamic(leading_dim=2)
-        # mC = from_dlpack(
-        #     c, assumed_align=16).mark_layout_dynamic(leading_dim=2)
-        # mA.element_type = cutlass.Float8E4M3FN
-        # mB.element_type = cutlass.Float8E4M3FN
-
-        # # Note: mSFA is column major
-        # mSFA = from_dlpack(
-        #     a_sf, assumed_align=16).mark_layout_dynamic(leading_dim=2)
-        # mSFB = from_dlpack(
-        #     b_sf, assumed_align=16).mark_layout_dynamic(leading_dim=2)
-
         # get stream
         stream = cuda.CUstream(torch.cuda.current_stream().cuda_stream)

1410-1449: Clean up commented autotuner configuration

Remove the large block of commented autotuner code for better readability.

-# @autotuner.tuning_config(
-#     name="trtllm::cute_dsl_fp8_bmm_blackwell::gemm",
-#     dynamic_tensor_specs=(DynamicTensorSpec(
-#         0, 1, get_last_power_of_2_num_tokens_buckets,
-#         last_positive_power_of_2), ),
-#     # constraint_specs=(
-#     #     ConstraintSpec(2, 0, cute_dsl_fp8_bmm_scale_infer_shape_blackwell),
-#     # ),
-#     constraint_specs=(ConstraintSpec(
-#         2, 2, cute_dsl_fp8_bmm_scale_infer_shape_blackwell), ),
-#     configs={
-#         "use_2cta_instrs": [False],
-#         "mma_tiler_mn": [(128, 128)],
-#         "cluster_shape_mn": [(1, 1), (1, 2), (1, 4), (2, 1), (2, 2), (4, 1),
-#                              (4, 4)],
-#         # 'cluster_shape_mn': [(1, 1)],
-#     },
-# )
def cute_dsl_fp8_bmm_blackwell(
    input: torch.Tensor,
    weight: torch.Tensor,
    input_scale: torch.Tensor,
    weight_scale: torch.Tensor,
    out: torch.Tensor,
) -> None:
-    # tuner = AutoTuner.get()
-
-    # # allocate workspace for profiling
-    # cute_dsl_fp8_bmm_blackwell_runner = CuteDSLFp8BlackwellBmm()
-
-    # _, best_tactic, best_config = tuner.choose_one(
-    #     "trtllm::cute_dsl_fp8_bmm_blackwell::gemm",
-    #     [cute_dsl_fp8_bmm_blackwell_runner],
-    #     [input, weight, input_scale, weight_scale, out],
-    # )
-    # cute_dsl_fp8_bmm_blackwell_runner(
-    #     inputs=[input, weight, input_scale, weight_scale, out],
-    #     tactic=best_tactic,
-    #     **best_config,
-    # )
-
+    # TODO: Autotuning is temporarily disabled. Using fixed configuration.
    cute_dsl_fp8_bmm_blackwell_runner = CuteDSLFp8BlackwellBmm()
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5606444 and 733d253.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (2 hunks)
  • tensorrt_llm/_torch/modules/linear.py (3 hunks)
  • tests/unittest/_torch/thop/test_fp8_block_scale_gemm.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/unittest/_torch/thop/test_fp8_block_scale_gemm.py
  • tensorrt_llm/_torch/modules/linear.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile = ...).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL = ...).
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
🧠 Learnings (3)
📚 Learning: applies to **/*.{cpp,h,hpp,cc,cxx,cu,py} : all tensorrt-llm open source software code should contain...
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-04T02:12:17.550Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx,cu,py} : All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Applied to files:

  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
📚 Learning: applies to **/*.py : the code developed for tensorrt-llm should conform to python 3.8+....
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-04T02:12:17.550Z
Learning: Applies to **/*.py : The code developed for TensorRT-LLM should conform to Python 3.8+.

Applied to files:

  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Applied to files:

  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py

998-998: Ambiguous variable name: l

(E741)


1239-1239: Ambiguous variable name: l

(E741)


1292-1292: Ambiguous variable name: l

(E741)


1292-1292: Local variable n is assigned to but never used

Remove assignment to unused variable n

(F841)


1292-1292: Local variable k is assigned to but never used

Remove assignment to unused variable k

(F841)


1293-1293: Local variable w_n is assigned to but never used

Remove assignment to unused variable w_n

(F841)


1469-1469: Ambiguous variable name: l

(E741)


1469-1469: Local variable k is assigned to but never used

Remove assignment to unused variable k

(F841)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

Signed-off-by: Mindy Li <[email protected]>
@limin2021 limin2021 requested review from a team as code owners August 5, 2025 02:29
@limin2021 limin2021 requested a review from Superjomn August 5, 2025 02:29
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 8

🧹 Nitpick comments (6)
tensorrt_llm/_torch/modules/attention.py (2)

223-224: Clean up commented code

Remove commented code fragments to improve code readability.

            allreduce_strategy=config.allreduce_strategy,
-            force_dynamic_quantization=config.force_dynamic_quantization,
-        )
-        # use_cute_dsl_fp8_block_scale_gemm=config.use_cute_dsl_fp8_block_scale_gemm)
+            force_dynamic_quantization=config.force_dynamic_quantization)

895-896: Remove TODO comment

Remove debug TODO comment.

-            # TODO:
-            # print(f"limin: attention.py, get_sm_version(): {get_sm_version()}, self.use_cute_dsl_fp8_block_scale_bmm: {self.use_cute_dsl_fp8_block_scale_bmm}")
tests/unittest/_torch/modules/test_fused_moe.py (1)

678-678: Clean up commented code or restore validation logic.

The commented debug prints should be removed, but consider restoring the validation comparisons between different implementations to ensure correctness.

-        # output_origin = fused_moe_origin.forward(x, router_logits)
-    # print("limin: output = ", output)
-    # print("limin: ref_output = ", ref_output)
-    # torch.testing.assert_close(output_origin, output, rtol=1e-2, atol=0.1)
-    # torch.testing.assert_close(output_origin, ref_output, rtol=1e-2, atol=0.1)

Also applies to: 683-686

tensorrt_llm/_torch/models/modeling_deepseekv3.py (3)

1130-1130: Remove debug print statement.

This debug print statement should be removed from production code. It also exceeds the 120-character line length limit.

-        # print(f"limin: DeepseekV3ForCausalLM, model_config.use_cute_dsl_fp8_block_scale_bmm: {model_config.use_cute_dsl_fp8_block_scale_bmm}, model_config.use_cute_dsl_fp8_block_scale_gemm: {model_config.use_cute_dsl_fp8_block_scale_gemm}")

1359-1362: Replace debug prints with proper logging.

Debug print statements should be removed or replaced with proper logging using the logger object that's already imported.

-        moe_backend = self.model_config.moe_backend.upper()
-        print(
-            f"limin: DeepseekV3ForCausalLM load_weights, moe_backend: {moe_backend}"
-        )
+        moe_backend = self.model_config.moe_backend.upper()
+        logger.debug(f"DeepseekV3ForCausalLM load_weights, moe_backend: {moe_backend}")

1429-1448: Clean up commented dequantization code.

This large block of commented weight dequantization code should be either:

  1. Removed if no longer needed with the new FP8 implementation
  2. Properly integrated if still required for certain configurations

Leaving extensive commented code reduces maintainability.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 899b1bf and 3c8cca4.

📒 Files selected for processing (14)
  • examples/llm-api/quickstart_advanced.py (2 hunks)
  • tensorrt_llm/_torch/model_config.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py (11 hunks)
  • tensorrt_llm/_torch/modules/attention.py (13 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py (4 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (1 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/quantization.py (3 hunks)
  • tensorrt_llm/_torch/modules/gated_mlp.py (3 hunks)
  • tensorrt_llm/_torch/modules/linear.py (3 hunks)
  • tensorrt_llm/_torch/pyexecutor/config.py (1 hunks)
  • tensorrt_llm/llmapi/llm.py (1 hunks)
  • tensorrt_llm/llmapi/llm_args.py (2 hunks)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py (3 hunks)
  • tests/unittest/_torch/modules/test_fused_moe.py (6 hunks)
✅ Files skipped from review due to trivial changes (5)
  • examples/llm-api/quickstart_advanced.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
  • tensorrt_llm/_torch/pyexecutor/config.py
  • tensorrt_llm/_torch/model_config.py
  • tensorrt_llm/llmapi/llm_args.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tensorrt_llm/_torch/modules/linear.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile = ...).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL = ...).
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tensorrt_llm/_torch/modules/gated_mlp.py
  • tensorrt_llm/llmapi/llm.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py
  • tensorrt_llm/_torch/modules/fused_moe/quantization.py
  • tensorrt_llm/_torch/modules/attention.py
  • tests/unittest/_torch/modules/test_fused_moe.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tensorrt_llm/_torch/modules/gated_mlp.py
  • tensorrt_llm/llmapi/llm.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py
  • tensorrt_llm/_torch/modules/fused_moe/quantization.py
  • tensorrt_llm/_torch/modules/attention.py
  • tests/unittest/_torch/modules/test_fused_moe.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
🧠 Learnings (2)
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tensorrt_llm/llmapi/llm.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
📚 Learning: applies to **/*.py : the code developed for tensorrt-llm should conform to python 3.8+....
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-04T02:12:17.582Z
Learning: Applies to **/*.py : The code developed for TensorRT-LLM should conform to Python 3.8+.

Applied to files:

  • tensorrt_llm/llmapi/llm.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/attention.py

204-204: Line too long (191 > 120)

(E501)


671-671: Line too long (214 > 120)

(E501)


896-896: Line too long (161 > 120)

(E501)

tensorrt_llm/_torch/models/modeling_deepseekv3.py

1130-1130: Line too long (242 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (26)
tensorrt_llm/llmapi/llm.py (1)

137-144: LGTM! Code formatting improvement with future feature preparation.

The reformatting from inline to multi-line list improves readability. The commented code indicates planned support for FP8 block scale GEMM and BMM flags, which aligns with the PR's objective of adding CuTe DSL FP8 operators.

tensorrt_llm/_torch/modules/gated_mlp.py (4)

40-41: LGTM! Proper parameter addition with backward compatibility.

The new use_cute_dsl_fp8_block_scale_gemm parameter is well-structured with appropriate type annotation and safe default value.


47-47: LGTM! Proper instance attribute assignment.

Correctly stores the parameter as an instance attribute following established patterns.


79-81: LGTM! Correct parameter propagation to gate_up_proj.

The FP8 block scale GEMM flag is properly passed to the Linear module constructor.


99-100: LGTM! Consistent parameter propagation to down_proj.

The FP8 block scale GEMM flag is consistently passed to both Linear modules in the GatedMLP.

tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py (4)

12-13: LGTM! Necessary import additions for quantization method selection.

The new imports are required for the _get_quant_method implementation and are properly organized.


36-37: LGTM! Useful debug instrumentation for development.

The commented debug print statements provide valuable tensor shape information for debugging without affecting production performance.


145-155: LGTM! Robust quantization method selection with proper error handling.

The _get_quant_method correctly handles different quantization scenarios:

  • FP8 block scales: returns appropriate CuTe DSL method
  • Unsupported modes: raises informative ValueError
  • No quantization: returns unquantized method

The error handling and method selection logic are well-implemented.
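
For readers unfamiliar with the pattern, here is a minimal self-contained sketch of this selection logic; the enum values and class names below are stand-ins for illustration, not the real TensorRT-LLM types.

# Illustrative sketch only -- enum values and class names are stand-ins, not the real API.
from enum import Enum, auto


class QuantModeSketch(Enum):
    NONE = auto()
    FP8_BLOCK_SCALES = auto()
    FP8_PER_TENSOR = auto()  # stands in for any unsupported mode


class CuteDslFp8BlockScalesMethodSketch:
    """Stand-in for the CuTe DSL FP8 block-scales MoE method."""


class UnquantizedMethodSketch:
    """Stand-in for the unquantized MoE method."""


def get_quant_method(quant_mode: QuantModeSketch):
    """Mirrors the selection behavior described above."""
    if quant_mode is QuantModeSketch.NONE:
        return UnquantizedMethodSketch()
    if quant_mode is QuantModeSketch.FP8_BLOCK_SCALES:
        return CuteDslFp8BlockScalesMethodSketch()
    raise ValueError(f"Unsupported quantization mode for the CuTe DSL backend: {quant_mode}")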


227-230: LGTM! Additional debug instrumentation for FP8 quantization pipeline.

The debug print statements provide valuable tensor shape information for debugging the FP8 quantization forward pass without affecting production code.

tests/integration/defs/accuracy/test_llm_api_pytorch.py (5)

988-988: Hardware targeting correctly updated to Blackwell.

The change from @skip_no_hopper to @skip_pre_blackwell correctly reflects that the new CuTe DSL FP8 block scale kernels require Blackwell architecture, consistent with the PR's introduction of Blackwell-optimized operators.


1016-1017: CUDA graph configuration modernization is correct.

The transition from the boolean use_cuda_graph flag to the structured cuda_graph_config object aligns with the modern API pattern used throughout the codebase, providing better configuration flexibility.
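
For context, a hedged example of the configuration pattern being referred to; the import path and model directory below are assumptions for illustration, not taken from this diff.

# Hedged illustration; CudaGraphConfig usage and the model path are assumptions, not from this diff.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import CudaGraphConfig

llm = LLM(
    model="/path/to/DeepSeek-V3-Lite/fp8",   # placeholder model directory
    cuda_graph_config=CudaGraphConfig(),      # structured object replaces the old boolean flag
)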


1147-1147: Hardware targeting consistently updated across test variants.

The decorator change matches the single-GPU test variant, ensuring consistent Blackwell architecture targeting across both test methods.


1183-1184: CUDA graph configuration consistently updated across test variants.

The configuration change matches the single-GPU test variant, maintaining API consistency across both test methods.


1020-1021: Do not enable CuTe DSL block‐scale flags in the integration test
The Pydantic MoeConfig used by the LLM API does not include use_cute_dsl_fp8_block_scale_bmm or …_gemm (they’re still commented out in tensorrt_llm/llmapi/llm_args.py), and the API’s valid_keys check only allows its defined fields (e.g. backend, load_balancer). As a result, uncommenting those lines in the test would break argument validation rather than turn on block‐scale operators. Leave them commented until first‐class support is added to the LLM API.

Likely an incorrect or invalid review comment.

tensorrt_llm/_torch/modules/fused_moe/quantization.py (1)

557-601: Well-structured refactoring for SM 100 support

The refactoring cleanly separates the base functionality from SM 100-specific weight resmoothing and scale transformation logic. This design allows the Cute DSL backend to use the simpler base class while maintaining backward compatibility.
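
A hedged sketch of the layering this describes; the class and method names are assumptions chosen for illustration, not the actual classes in quantization.py.

# Hedged sketch of the described layering; class and method names are assumptions.
import torch


class Fp8BlockScalesMoEMethodBase:
    """Base behavior: plain weight loading, usable by the CuTe DSL backend as-is."""

    def load_weights(self, module, weights: dict) -> None:
        module.w3_w1_weight = weights["w3_w1_weight"]
        module.w2_weight = weights["w2_weight"]


class Fp8BlockScalesMoEMethodSm100(Fp8BlockScalesMoEMethodBase):
    """SM 100 variant: adds weight resmoothing / scale transformation on top of the base."""

    def load_weights(self, module, weights: dict) -> None:
        super().load_weights(module, weights)
        # Blackwell-specific post-processing; the real transformation is elided here.
        module.w3_w1_weight = self._resmooth(module.w3_w1_weight)

    def _resmooth(self, weight: torch.Tensor) -> torch.Tensor:
        return weight  # placeholder for the actual resmoothing kernel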

tensorrt_llm/_torch/modules/attention.py (1)

1408-1428: Correct implementation of CuTe DSL FP8 BMM path

The conditional logic properly selects between the new cute_dsl_fp8_bmm_blackwell operator and the legacy fp8_block_scaling_bmm_out based on the configuration flag.
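
As a hedged sketch of this dispatch: the op names come from the comment above, but the namespaces and argument lists are assumptions, not the actual signatures in attention.py.

# Hedged sketch; op names are from the review comment, namespaces/signatures are assumptions.
import torch


def fp8_bmm_dispatch(q_fp8, kv_fp8, q_scale, kv_scale, out,
                     use_cute_dsl_fp8_block_scale_bmm: bool):
    if use_cute_dsl_fp8_block_scale_bmm:
        # New CuTe DSL Blackwell kernel path.
        return torch.ops.trtllm.cute_dsl_fp8_bmm_blackwell(q_fp8, kv_fp8, q_scale, kv_scale)
    # Legacy path writes into a preallocated output tensor.
    torch.ops.trtllm.fp8_block_scaling_bmm_out(q_fp8, kv_fp8, q_scale, kv_scale, out)
    return out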

tests/unittest/_torch/modules/test_fused_moe.py (6)

562-563: LGTM! Architecture requirement correctly updated.

The change from @skip_non_hopper_unittest to @skip_pre_blackwell correctly reflects that the new CuTe DSL FP8 operations target Blackwell architecture.


569-570: Verify the restriction of sequence length parameterization.

The test now only runs with seq_len=[256] instead of the full range. This significantly reduces test coverage. Please verify if this restriction is:

  1. Temporary for initial development
  2. Due to known issues with other sequence lengths
  3. Performance optimization

Consider re-enabling the full range once the implementation is stable.


575-580: LGTM! Function name appropriately updated.

The function name change to test_fused_moe_fp8_blockwise_cute_dsl correctly reflects the new CuTe DSL functionality being tested.


671-671: LGTM! Reference model correctly configured.

Adding use_cute_dsl_fp8_block_scale_gemm=True properly configures the reference model to use the new CuTe DSL implementation for consistent testing.


696-696: LGTM! Multi-GPU test correctly updated.

The function name and call are properly updated to use the new CuTe DSL test function, maintaining consistency between single-GPU and multi-GPU tests.

Also applies to: 700-700


974-975: LGTM! Parameter propagation correctly implemented.

The use_cute_dsl_fp8_block_scale_gemm parameter is properly added to the constructor, stored as an instance variable, and correctly passed to the GatedMLP instances, ensuring consistent configuration throughout the reference model.

Also applies to: 985-985, 994-995

tensorrt_llm/_torch/models/modeling_deepseekv3.py (4)

193-194: LGTM! Parameter correctly added to DeepseekV3Linear.

The use_cute_dsl_fp8_block_scale_gemm parameter is properly added to the constructor and correctly passed to the parent Linear class, maintaining the expected interface.

Also applies to: 209-210


1363-1364: LGTM! Conditional logic correctly implemented.

The conditional check for moe_backend == "DEEPGEMM" with FP8 block scales and SM version 100 correctly targets the Blackwell architecture scenarios where the new functionality is needed.
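
A small hedged helper restating that gate; the function and argument names are placeholders, not the file's actual helpers.

# Hedged restatement of the gating condition; the helper and its arguments are placeholders.
def needs_blackwell_deepgemm_path(moe_backend: str,
                                  has_fp8_block_scales: bool,
                                  sm_version: int) -> bool:
    return (moe_backend.upper() == "DEEPGEMM"
            and has_fp8_block_scales
            and sm_version == 100)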


1273-1274: Verify removal of contiguous() calls after transpose.

The contiguous() calls have been removed after transpose operations. Please verify that downstream code doesn't require contiguous tensors, as this could cause performance issues or errors.

If the contiguous memory layout isn't required, the change is fine. Otherwise, the calls should be restored.

Also applies to: 1315-1316
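
One way to verify locally, as a minimal sketch with placeholder tensors: restore .contiguous() only if a downstream kernel actually requires a contiguous layout.

# Minimal sketch with placeholder tensors.
import torch

w = torch.randn(8, 16)
w_t = w.transpose(0, 1)          # transpose returns a non-contiguous view
assert not w_t.is_contiguous()
w_t = w_t.contiguous()           # materializes a contiguous copy
assert w_t.is_contiguous()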


1496-1497: LGTM! Consistent conditional logic implementation.

The conditional check for DEEPGEMM backend with FP8 block scales and SM version 100 is consistent with the earlier implementation and correctly targets the appropriate hardware/software configuration.

Comment on lines 481 to 483
reduce_output=False,
)
# use_cute_dsl_fp8_block_scale_gemm=model_config.use_cute_dsl_fp8_block_scale_gemm)
Contributor

🛠️ Refactor suggestion

Complete the parameter implementation in Deepseekv3MoE.

The use_cute_dsl_fp8_block_scale_gemm parameter is commented out in the GatedMLP constructor. This should be completed for consistency.

         reduce_output=False,
-        )
-        # use_cute_dsl_fp8_block_scale_gemm=model_config.use_cute_dsl_fp8_block_scale_gemm)
+        use_cute_dsl_fp8_block_scale_gemm=model_config.use_cute_dsl_fp8_block_scale_gemm,
+        )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
reduce_output=False,
)
# use_cute_dsl_fp8_block_scale_gemm=model_config.use_cute_dsl_fp8_block_scale_gemm)
reduce_output=False,
use_cute_dsl_fp8_block_scale_gemm=model_config.use_cute_dsl_fp8_block_scale_gemm,
)
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/models/modeling_deepseekv3.py around lines 481 to 483,
the parameter use_cute_dsl_fp8_block_scale_gemm is commented out in the GatedMLP
constructor call. Uncomment this parameter and pass
model_config.use_cute_dsl_fp8_block_scale_gemm to ensure the parameter is
properly implemented and consistent with the model configuration.

Comment on lines 670 to 679
        self.mlp = GatedMLP(
            hidden_size=config.hidden_size,
            intermediate_size=config.intermediate_size,
            bias=False,
            dtype=config.torch_dtype,
            config=model_config,
            overridden_tp_size=self.mlp_tp_size,
            reduce_output=True,
        )
        # use_cute_dsl_fp8_block_scale_gemm=self.model_config.use_cute_dsl_fp8_block_scale_gemm)
Contributor

🛠️ Refactor suggestion

Complete the parameter implementation in DeepseekV3DecoderLayer.

The use_cute_dsl_fp8_block_scale_gemm parameter is commented out in the GatedMLP constructor. For consistency across the codebase, this should be properly implemented.

         reduce_output=True,
-        )
-        # use_cute_dsl_fp8_block_scale_gemm=self.model_config.use_cute_dsl_fp8_block_scale_gemm)
+        use_cute_dsl_fp8_block_scale_gemm=self.model_config.use_cute_dsl_fp8_block_scale_gemm,
+        )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
        self.mlp = GatedMLP(
            hidden_size=config.hidden_size,
            intermediate_size=config.intermediate_size,
            bias=False,
            dtype=config.torch_dtype,
            config=model_config,
            overridden_tp_size=self.mlp_tp_size,
            reduce_output=True,
        )
        # use_cute_dsl_fp8_block_scale_gemm=self.model_config.use_cute_dsl_fp8_block_scale_gemm)
        self.mlp = GatedMLP(
            hidden_size=config.hidden_size,
            intermediate_size=config.intermediate_size,
            bias=False,
            dtype=config.torch_dtype,
            config=model_config,
            overridden_tp_size=self.mlp_tp_size,
            reduce_output=True,
            use_cute_dsl_fp8_block_scale_gemm=self.model_config.use_cute_dsl_fp8_block_scale_gemm,
        )
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/models/modeling_deepseekv3.py around lines 670 to 679,
the parameter use_cute_dsl_fp8_block_scale_gemm is commented out in the GatedMLP
constructor. To fix this, uncomment the parameter and pass
self.model_config.use_cute_dsl_fp8_block_scale_gemm to the GatedMLP
initialization to ensure consistent parameter usage across the codebase.

Comment on lines 202 to 209
# self.use_cute_dsl_fp8_block_scale_bmm = config.use_cute_dsl_fp8_block_scale_bmm
# self.use_cute_dsl_fp8_block_scale_gemm = config.use_cute_dsl_fp8_block_scale_gemm
# print(f"limin: attention.py, use_cute_dsl_fp8_block_scale_bmm: {self.use_cute_dsl_fp8_block_scale_bmm}, use_cute_dsl_fp8_block_scale_gemm: {self.use_cute_dsl_fp8_block_scale_gemm}")

# limin-todo: WARNING:
self.use_cute_dsl_fp8_block_scale_bmm = True
self.use_cute_dsl_fp8_block_scale_gemm = True

Contributor

⚠️ Potential issue

Replace hardcoded flags with proper configuration

The hardcoded flags bypass configuration and force all instances to use the CuTe DSL implementation. This should be configurable through the model config.

-        # self.use_cute_dsl_fp8_block_scale_bmm = config.use_cute_dsl_fp8_block_scale_bmm
-        # self.use_cute_dsl_fp8_block_scale_gemm = config.use_cute_dsl_fp8_block_scale_gemm
-        # print(f"limin: attention.py, use_cute_dsl_fp8_block_scale_bmm: {self.use_cute_dsl_fp8_block_scale_bmm}, use_cute_dsl_fp8_block_scale_gemm: {self.use_cute_dsl_fp8_block_scale_gemm}")
-
-        # limin-todo: WARNING:
-        self.use_cute_dsl_fp8_block_scale_bmm = True
-        self.use_cute_dsl_fp8_block_scale_gemm = True
+        self.use_cute_dsl_fp8_block_scale_bmm = getattr(config, 'use_cute_dsl_fp8_block_scale_bmm', False)
+        self.use_cute_dsl_fp8_block_scale_gemm = getattr(config, 'use_cute_dsl_fp8_block_scale_gemm', False)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# self.use_cute_dsl_fp8_block_scale_bmm = config.use_cute_dsl_fp8_block_scale_bmm
# self.use_cute_dsl_fp8_block_scale_gemm = config.use_cute_dsl_fp8_block_scale_gemm
# print(f"limin: attention.py, use_cute_dsl_fp8_block_scale_bmm: {self.use_cute_dsl_fp8_block_scale_bmm}, use_cute_dsl_fp8_block_scale_gemm: {self.use_cute_dsl_fp8_block_scale_gemm}")
# limin-todo: WARNING:
self.use_cute_dsl_fp8_block_scale_bmm = True
self.use_cute_dsl_fp8_block_scale_gemm = True
self.use_cute_dsl_fp8_block_scale_bmm = getattr(config, 'use_cute_dsl_fp8_block_scale_bmm', False)
self.use_cute_dsl_fp8_block_scale_gemm = getattr(config, 'use_cute_dsl_fp8_block_scale_gemm', False)
🧰 Tools
🪛 Ruff (0.12.2)

204-204: Line too long (191 > 120)

(E501)

🤖 Prompt for AI Agents
In tensorrt_llm/_torch/modules/attention.py around lines 202 to 209, the flags
use_cute_dsl_fp8_block_scale_bmm and use_cute_dsl_fp8_block_scale_gemm are
hardcoded to True, bypassing the model configuration. Modify the code to assign
these flags based on the corresponding values from the config object instead of
hardcoding them, ensuring the behavior is configurable per model instance.

Comment on lines 671 to 679
# print(f"limin: attention.py, MLA, config.use_cute_dsl_fp8_block_scale_bmm: {config.use_cute_dsl_fp8_block_scale_bmm}, config.use_cute_dsl_fp8_block_scale_gemm: {config.use_cute_dsl_fp8_block_scale_gemm}")

# self.use_cute_dsl_fp8_block_scale_bmm = config.use_cute_dsl_fp8_block_scale_bmm
# self.use_cute_dsl_fp8_block_scale_gemm = config.use_cute_dsl_fp8_block_scale_gemm

## limin-TODO: WARNING:
self.use_cute_dsl_fp8_block_scale_bmm = True
self.use_cute_dsl_fp8_block_scale_gemm = True

Contributor

⚠️ Potential issue

Replace hardcoded flags with proper configuration in MLA class

The hardcoded flags bypass configuration. This should be configurable through the model config.

-        # print(f"limin: attention.py, MLA, config.use_cute_dsl_fp8_block_scale_bmm: {config.use_cute_dsl_fp8_block_scale_bmm}, config.use_cute_dsl_fp8_block_scale_gemm: {config.use_cute_dsl_fp8_block_scale_gemm}")
-
-        # self.use_cute_dsl_fp8_block_scale_bmm = config.use_cute_dsl_fp8_block_scale_bmm
-        # self.use_cute_dsl_fp8_block_scale_gemm = config.use_cute_dsl_fp8_block_scale_gemm
-
-        ## limin-TODO: WARNING:
-        self.use_cute_dsl_fp8_block_scale_bmm = True
-        self.use_cute_dsl_fp8_block_scale_gemm = True
+        self.use_cute_dsl_fp8_block_scale_bmm = getattr(config, 'use_cute_dsl_fp8_block_scale_bmm', False)
+        self.use_cute_dsl_fp8_block_scale_gemm = getattr(config, 'use_cute_dsl_fp8_block_scale_gemm', False)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# print(f"limin: attention.py, MLA, config.use_cute_dsl_fp8_block_scale_bmm: {config.use_cute_dsl_fp8_block_scale_bmm}, config.use_cute_dsl_fp8_block_scale_gemm: {config.use_cute_dsl_fp8_block_scale_gemm}")
# self.use_cute_dsl_fp8_block_scale_bmm = config.use_cute_dsl_fp8_block_scale_bmm
# self.use_cute_dsl_fp8_block_scale_gemm = config.use_cute_dsl_fp8_block_scale_gemm
## limin-TODO: WARNING:
self.use_cute_dsl_fp8_block_scale_bmm = True
self.use_cute_dsl_fp8_block_scale_gemm = True
self.use_cute_dsl_fp8_block_scale_bmm = getattr(config, 'use_cute_dsl_fp8_block_scale_bmm', False)
self.use_cute_dsl_fp8_block_scale_gemm = getattr(config, 'use_cute_dsl_fp8_block_scale_gemm', False)
🧰 Tools
🪛 Ruff (0.12.2)

671-671: Line too long (214 > 120)

(E501)

🤖 Prompt for AI Agents
In tensorrt_llm/_torch/modules/attention.py around lines 671 to 679, the flags
use_cute_dsl_fp8_block_scale_bmm and use_cute_dsl_fp8_block_scale_gemm are
hardcoded to True, bypassing the model configuration. To fix this, remove the
hardcoded assignments and instead assign these flags from the corresponding
attributes in the config object, ensuring they are properly configurable through
the model config.

Signed-off-by: Mindy Li <[email protected]>
@limin2021 limin2021 changed the title add cute dsl deepgemm fp8 blowise op [TRTLLM-6898][feat] add cute_dsl_fp8_gemm/bmm_group_mm ops Aug 5, 2025
Signed-off-by: Mindy Li <[email protected]>
@limin2021 limin2021 changed the title [TRTLLM-6898][feat] add cute_dsl_fp8_gemm/bmm_group_mm ops [TRTLLM-6898][feat] add cute_dsl_fp8_gemm/bmm/group_mm ops Aug 5, 2025
@limin2021 limin2021 changed the title [TRTLLM-6898][feat] add cute_dsl_fp8_gemm/bmm/group_mm ops [TRTLLM-6898][feat] add cute_dsl_fp8 gemm, bmm and group_mm ops Aug 5, 2025
@limin2021 limin2021 changed the title [TRTLLM-6898][feat] Add CuteDSL fp8 gemm, bmm and group_mm ops [TRTLLM-6898][feat] Add CuteDSL fp8 gemm, bmm and group gemm ops Aug 13, 2025
@limin2021 limin2021 changed the title [TRTLLM-6898][feat] Add CuteDSL fp8 gemm, bmm and group gemm ops [TRTLLM-6898] [feat] Add CuteDSL fp8 gemm, bmm and group gemm ops Aug 13, 2025
@limin2021 limin2021 changed the title [TRTLLM-6898] [feat] Add CuteDSL fp8 gemm, bmm and group gemm ops [TRTLLM-6898] [feat] Add CuteDSL fp8 gemm, bmm and group gemm ops Aug 13, 2025
Collaborator

@hyukn hyukn left a comment

Just added some suggestions on how to apply the autotuner to cute_dsl_fp8_bmm_blackwell. You can use tactics to pack all the valid tunable configs.

Comment on lines 1022 to 1040
        is_valid = BlockwiseGemmKernel.can_implement(
            cutlass.Float8E4M3FN,  # ab_dtype,
            cutlass.Float32,  # acc_dtype,
            cutlass.BFloat16,  # c_dtype,
            use_2cta_instrs,
            mma_tiler_mn,
            cluster_shape_mn,
            m,
            n,
            k,
            l,
            a_major,
            b_major,
            c_major,
        )
        if is_valid:
            return [0]
        else:
            return []
Collaborator

Suggested change
        is_valid = BlockwiseGemmKernel.can_implement(
            cutlass.Float8E4M3FN,  # ab_dtype,
            cutlass.Float32,  # acc_dtype,
            cutlass.BFloat16,  # c_dtype,
            use_2cta_instrs,
            mma_tiler_mn,
            cluster_shape_mn,
            m,
            n,
            k,
            l,
            a_major,
            b_major,
            c_major,
        )
        if is_valid:
            return [0]
        else:
            return []
        # This will return all the valid config combinations for tuning.
        use_2cta_instrs_candi = [False]
        mma_tiler_mn_candi = [(128, 128)]
        cluster_shape_mn_candi = [(1, 1), (1, 2), (1, 4), (2, 1), (2, 2), (4, 1), (4, 4)]
        return [(use_2cta_instrs, mma_tiler_mn, cluster_shape_mn)
                for use_2cta_instrs in use_2cta_instrs_candi
                for mma_tiler_mn in mma_tiler_mn_candi
                for cluster_shape_mn in cluster_shape_mn_candi
                if BlockwiseGemmKernel.can_implement(
                    cutlass.Float8E4M3FN,  # ab_dtype,
                    cutlass.Float32,  # acc_dtype,
                    cutlass.BFloat16,  # c_dtype,
                    use_2cta_instrs,
                    mma_tiler_mn,
                    cluster_shape_mn,
                    m,
                    n,
                    k,
                    l,
                    a_major,
                    b_major,
                    c_major,
                )]

Signed-off-by: Mindy Li <[email protected]>
@limin2021 limin2021 requested a review from a team as a code owner August 14, 2025 06:02
Signed-off-by: Mindy Li <[email protected]>
Collaborator

@Superjomn Superjomn left a comment

Overall LGTM on the llmapi changes.

Signed-off-by: Mindy Li <[email protected]>
max_active_clusters: cutlass.Constexpr,
stream: cuda.CUstream,
):
a = append_ones_wrapper(a)

Do we still need this wrapper given it's already inside a "jit" function?

Collaborator Author

This wrapper is used to permute the tensors to the layout our kernel expects.

Collaborator Author

@limin2021 limin2021 Aug 15, 2025

They are not inside a "jit" function, so I added a jit wrapper for them.

@fengxie fengxie Aug 15, 2025

This particular use I'm commenting on is inside a "jit" function, I believe?

Collaborator Author

Sorry, I misunderstood what you meant. You mean we could move the @cute.jit in the following code?

@cute.jit
def append_ones_wrapper(a: cute.Tensor):
    a_layout = a.layout
    a_layout = cute.append(a_layout,
                           cute.make_layout(1, stride=1),
                           up_to_rank=3)
    new_a = cute.make_tensor(a.iterator, a_layout)
    return new_a

Collaborator

@hyukn hyukn left a comment

Thanks for the work! Most of the changes associated with the tuning part look good to me.
The tunable runner CuteDSLFp8BlackwellGroupGemm still requires the same modifications to be compatible with the autotuner.
It would also be better to add some comments in get_valid_tactics and forward of each runner for detailed explanations.
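
For illustration, a minimal commented runner sketch along those lines; the class shape, method signatures, and config choices are assumptions, not the project's actual autotuner interface.

# Illustrative sketch only; base class, signatures, and config choices are assumptions.
from typing import Any, List, Tuple


class CuteDslFp8GroupGemmRunnerSketch:
    """Stand-in tunable runner showing where the requested comments would go."""

    # Candidate kernel configs (tactics) exposed to the autotuner.
    _CLUSTER_SHAPES: List[Tuple[int, int]] = [(1, 1), (1, 2), (2, 1), (2, 2)]

    def get_valid_tactics(self, inputs: List[Any]) -> List[int]:
        # Return the indices of configs the kernel can implement for this problem
        # shape; the autotuner times each returned tactic and caches the best one.
        return list(range(len(self._CLUSTER_SHAPES)))

    def forward(self, inputs: List[Any], tactic: int = -1) -> Any:
        # tactic == -1 conventionally selects the default (untuned) configuration.
        cluster_shape = self._CLUSTER_SHAPES[tactic] if tactic >= 0 else (1, 1)
        raise NotImplementedError(f"kernel launch with {cluster_shape} elided in this sketch")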

@limin2021 limin2021 changed the title [TRTLLM-6898] [feat] Add CuteDSL fp8 gemm, bmm and group gemm ops [TRTLLM-6898] [feat] Add CuteDSL fp8 gemm, bmm and group gemm ops on Blackwell arch Aug 15, 2025
@limin2021 limin2021 changed the title [TRTLLM-6898] [feat] Add CuteDSL fp8 gemm, bmm and group gemm ops on Blackwell arch [TRTLLM-6898] [feat] Add CuTe DSL fp8 gemm, bmm and group gemm ops on Blackwell architecture Aug 15, 2025
@limin2021
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #15365 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15365 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #11586 completed with status: 'FAILURE'

@chzblych chzblych removed their request for review September 2, 2025 05:16