[TRTLLM-6898] [feat] Add CuTe DSL fp8 gemm, bmm and group gemm ops on Blackwell architecture #6514
base: main
Conversation
Signed-off-by: Mindy Li <[email protected]>
Signed-off-by: Mindy Li <[email protected]>
📝 Walkthrough
This change introduces new FP8 GEMM and batched GEMM operators using the CuTe DSL and CUTLASS libraries for the NVIDIA Blackwell architecture, with corresponding integration in model, module, and test code. It adds environment-variable-driven flags to enable these kernels, updates weight loading and quantization logic, and extends test coverage for both single-GPU and multi-GPU scenarios.
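As a rough illustration of the flag-gated dispatch the walkthrough describes, the sketch below shows how an opt-in switch might select the new Blackwell op and fall back to the existing block-scaling GEMM. The environment-variable name and helper function are hypothetical, not taken from the PR; only the two operator names are mentioned in this review.

```python
import os
import torch

# Hypothetical opt-in flag; the PR's actual flag/env-variable names live in its config changes.
_USE_CUTE_DSL_FP8_GEMM = os.environ.get("TRTLLM_USE_CUTE_DSL_FP8_GEMM", "0") == "1"


def fp8_gemm_dispatch(act_fp8, weight, act_sf, weight_sf):
    """Pick the CuTe DSL Blackwell kernel when enabled and available, else the existing path."""
    if _USE_CUTE_DSL_FP8_GEMM and hasattr(torch.ops.trtllm, "cute_dsl_fp8_gemm_blackwell"):
        return torch.ops.trtllm.cute_dsl_fp8_gemm_blackwell(act_fp8, weight, act_sf, weight_sf)
    return torch.ops.trtllm.fp8_block_scaling_gemm(act_fp8, weight, act_sf, weight_sf)
```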
Sequence Diagram(s)
sequenceDiagram
participant User
participant Model
participant Linear
participant CustomOp (CuteDSLFp8BlackwellLinear)
participant CUDA Kernel
User->>Model: Forward(input)
Model->>Linear: Forward(input)
Linear->>CustomOp: cute_dsl_fp8_gemm_blackwell(input, weight, ...)
CustomOp->>CUDA Kernel: Launch FP8 GEMM kernel
CUDA Kernel-->>CustomOp: Output tensor
CustomOp-->>Linear: Output tensor
Linear-->>Model: Output
Model-->>User: Output
sequenceDiagram
participant User
participant MLA
participant CustomOp (CuteDSLFp8BlackwellBmm)
participant CUDA Kernel
User->>MLA: forward_generation(...)
MLA->>CustomOp: cute_dsl_fp8_bmm_blackwell(input, weight, ...)
CustomOp->>CUDA Kernel: Launch FP8 BMM kernel
CUDA Kernel-->>CustomOp: Output tensor
CustomOp-->>MLA: Output tensor
MLA-->>User: Output
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 2
🧹 Nitpick comments (4)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (4)
11-29: Remove commented imports to improve code clarity.
The file contains both commented and active imports for the same modules. The active imports using absolute paths are correct. Please remove the commented relative import statements to avoid confusion.

-# from ..attention_backend.interface import AttentionInputType
-# from ..autotuner import (AutoTuner, ConstraintSpec, DynamicTensorSpec,
-#                          OptimizationProfile, TunableRunner, TuningConfig)
-# from ..utils import (compute_swizzled_sf_shape, fp4_scale_infer_shape,
-#                      get_last_power_of_2_num_tokens_buckets,
-#                      last_positive_power_of_2)
 from tensorrt_llm._torch.attention_backend.interface import AttentionInputType
 from tensorrt_llm._torch.autotuner import (AutoTuner, ConstraintSpec,
                                            DynamicTensorSpec,
                                            OptimizationProfile,
                                            TunableRunner, TuningConfig)
-# from tensorrt_llm._torch.custom_ops.cute_dsl_kernels.blackwell.blockwise_gemm import \
-#     BlockwiseGemmKernel
 from tensorrt_llm._torch.custom_ops.cute_dsl_kernels.blackwell.blockwise_gemm_release import \
     BlockwiseGemmKernel
1173-1182: Remove commented code in append_ones_wrapper function.
The function contains a commented line that duplicates the functionality. Clean up the code by removing it.

 @cute.jit
 def append_ones_wrapper(a: cute.Tensor):
-    # return cute.append_ones(tensor)
     a_layout = a.layout
     a_layout = cute.append(a_layout, cute.make_layout(1, stride=1), up_to_rank=3)
     new_a = cute.make_tensor(a.iterator, a_layout)
     return new_a
1254-1285: Remove commented "before opt" code section.
The extensive commented code block labeled "before opt" should be removed as it clutters the codebase and the optimized version is already implemented.

Remove lines 1254-1285 entirely to improve code readability.
1372-1406: Consider implementing autotuning or documenting why it's disabled.
The function contains extensive commented code for autotuning support. Either:
- Implement the autotuning functionality if it's ready
- Remove the commented code and add a TODO comment explaining the plan
- Document why fixed parameters are currently preferred
If autotuning is not ready, replace the commented code with:
def cute_dsl_fp8_gemm_blackwell(
    input: torch.Tensor,
    weight: torch.Tensor,
    input_scale: torch.Tensor,
    weight_scale: torch.Tensor,
) -> torch.Tensor:
    # TODO: Implement autotuning support for optimal tactic selection
    # Currently using fixed parameters for initial implementation
    cute_dsl_fp8_gemm_blackwell_runner = CuteDSLFp8BlackwellLinear()
    return cute_dsl_fp8_gemm_blackwell_runner(
        inputs=[input, weight, input_scale, weight_scale],
        tactic=0,
        use_2cta_instrs=False,
        mma_tiler_mn=(128, 128),
        cluster_shape_mn=(1, 1),
    )
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- examples/llm-api/quickstart_advanced.py (1 hunks)
- tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (2 hunks)
- tensorrt_llm/_torch/modules/linear.py (3 hunks)
- tests/unittest/_torch/thop/test_fp8_block_scale_gemm.py (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
tests/unittest/_torch/thop/test_fp8_block_scale_gemm.py (1)
Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
1203-1203: Ambiguous variable name: l
(E741)
🔇 Additional comments (4)
examples/llm-api/quickstart_advanced.py (1)
50-54: LGTM!
The addition of 'CUTEDSL' to the MoE backend choices properly exposes the new CuTe DSL FP8 GEMM backend option to users.
tests/unittest/_torch/thop/test_fp8_block_scale_gemm.py (1)
60-70: LGTM!
The test addition properly validates the new CuTe DSL FP8 GEMM kernel on Blackwell architecture with appropriate tolerances for FP8 precision.
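For orientation only, a minimal sketch of the kind of tolerance-based comparison such a test relies on. The helper name is illustrative; the tolerances mirror the rtol/atol values quoted elsewhere in this review, and the actual reference path lives in test_fp8_block_scale_gemm.py.

```python
import torch


def assert_fp8_gemm_close(actual: torch.Tensor, reference: torch.Tensor) -> None:
    """Compare an FP8 block-scaled GEMM result against a higher-precision reference."""
    # FP8 needs looser bounds than FP16/BF16 comparisons.
    torch.testing.assert_close(actual.float(), reference.float(), rtol=1e-2, atol=0.1)
```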
tensorrt_llm/_torch/modules/linear.py (2)
1451-1451: LGTM! Well-structured parameter addition.
The new boolean parameter is properly positioned, has a descriptive name, and maintains backward compatibility with the default value.

1498-1498: LGTM! Proper instance attribute assignment.
The parameter is correctly stored as an instance attribute following the established pattern.
m = inputs[0].shape[0]
n = inputs[1].shape[0]
k = inputs[0].shape[1]
l = 1
Use a more descriptive variable name instead of 'l'.
Single-letter variable names like 'l' are ambiguous and can be confused with '1' or 'I'. Consider using a more descriptive name.
- l = 1
+ batch_size = 1

Also update the reference to this variable in line 1221:
- l,
+ batch_size,
🧰 Tools
🪛 Ruff (0.12.2)
1203-1203: Ambiguous variable name: l
(E741)
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/custom_ops/torch_custom_ops.py at line 1203, replace the
single-letter variable name 'l' with a more descriptive name that clearly
indicates its purpose. Also, update all references to this variable, including
the one at line 1221, to use the new descriptive name to maintain consistency
and clarity.
if module.use_cute_dsl_fp8_block_scale_gemm:
    output = torch.ops.trtllm.cute_dsl_fp8_gemm_blackwell(
        act_input_fp8, module.weight, act_input_sf, module.weight_scale)
else:
    output = torch.ops.trtllm.fp8_block_scaling_gemm(
        act_input_fp8, module.weight, act_input_sf, module.weight_scale)
💡 Verification agent
🧩 Analysis chain
Consider adding runtime validation for the new operator.
The conditional logic is implemented correctly, but consider adding validation to ensure the new operator is available and compatible with the current hardware architecture, especially since the operator name suggests Blackwell-specific functionality.
🏁 Script executed:
#!/bin/bash
# Search for any existing hardware/architecture validation patterns in the codebase
rg -A 5 -B 5 "blackwell|Blackwell" --type py

Length of output: 221000
🏁 Script executed:
#!/bin/bash
# Look for operator availability checks or similar validation patterns
rg -A 3 -B 3 "torch\.ops\.trtllm\." --type py | rg -A 3 -B 3 "(available|exists|check)"

Length of output: 13154
Add runtime checks for Blackwell-specific operator
Wrap the call to cute_dsl_fp8_gemm_blackwell in tensorrt_llm/_torch/modules/linear.py (around lines 575–580) with:
- an SM-version guard (e.g. only if get_sm_version() >= 100)
- an operator-availability check (hasattr(torch.ops.trtllm, "cute_dsl_fp8_gemm_blackwell"))
Otherwise, fall back to fp8_block_scaling_gemm. This ensures you won’t attempt a Blackwell-only op on unsupported hardware or when the operator isn’t loaded.
Locations to update:
• tensorrt_llm/_torch/modules/linear.py (around the if module.use_cute_dsl_fp8_block_scale_gemm: block)
Example diff:
- if module.use_cute_dsl_fp8_block_scale_gemm:
- output = torch.ops.trtllm.cute_dsl_fp8_gemm_blackwell(
- act_input_fp8, module.weight, act_input_sf, module.weight_scale)
- else:
- output = torch.ops.trtllm.fp8_block_scaling_gemm(
- act_input_fp8, module.weight, act_input_sf, module.weight_scale)
+ is_blackwell = get_sm_version() >= 100
+ has_cute_op = hasattr(torch.ops.trtllm, "cute_dsl_fp8_gemm_blackwell")
+ if module.use_cute_dsl_fp8_block_scale_gemm and is_blackwell and has_cute_op:
+ output = torch.ops.trtllm.cute_dsl_fp8_gemm_blackwell(
+ act_input_fp8, module.weight, act_input_sf, module.weight_scale)
+ else:
+ output = torch.ops.trtllm.fp8_block_scaling_gemm(
+ act_input_fp8, module.weight, act_input_sf, module.weight_scale)

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/modules/linear.py around lines 575 to 580, the call to
the Blackwell-specific operator cute_dsl_fp8_gemm_blackwell lacks runtime
checks. Add a guard to call this operator only if get_sm_version() is at least
100 and if torch.ops.trtllm has the attribute cute_dsl_fp8_gemm_blackwell. If
either condition is not met, fall back to calling fp8_block_scaling_gemm. This
prevents attempting to run the Blackwell-only operator on unsupported hardware
or when the operator is unavailable.
Signed-off-by: Mindy Li <[email protected]>
Actionable comments posted: 4
♻️ Duplicate comments (1)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1)
1207-1207: Use a more descriptive variable name instead of 'l'
Single-letter variable names like 'l' are ambiguous and can be confused with '1' or 'I'. Consider using a more descriptive name like 'batch_size'.
🧹 Nitpick comments (6)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (6)
12-17: Remove commented-out code
These commented imports are redundant as they've been replaced with proper absolute imports below. Remove them to improve code clarity.

-# from ..attention_backend.interface import AttentionInputType
-# from ..autotuner import (AutoTuner, ConstraintSpec, DynamicTensorSpec,
-#                          OptimizationProfile, TunableRunner, TuningConfig)
-# from ..utils import (compute_swizzled_sf_shape, fp4_scale_infer_shape,
-#                      get_last_power_of_2_num_tokens_buckets,
-#                      last_positive_power_of_2)
23-25: Remove commented-out import
This commented import for BlockwiseGemmKernel should be removed as it's been replaced with the actual import below.

-# from tensorrt_llm._torch.custom_ops.cute_dsl_kernels.blackwell.blockwise_gemm import \
-#     BlockwiseGemmKernel
1257-1286: Remove large block of commented-out code
This extensive commented code block appears to be an earlier implementation. It should be removed to improve code readability and maintainability.

-    # before opt
-    """
-    a, b, a_sf, b_sf = inputs
-    m, n, k = a.shape[0], b.shape[0], a.shape[1]
-    w_n, w_k = b_sf.shape[0], b_sf.shape[1]
-    c = torch.empty(*(m, n), dtype=torch.bfloat16, device="cuda")
-
-    # torch_tensor -> cute.tensor
-    a_tmp = a.as_strided((m, k, 1), (k, 1, m * k)).view(torch.uint8)
-    b_tmp = b.as_strided((n, k, 1), (k, 1, n * k)).view(torch.uint8)
-    c_tmp = c.as_strided((m, n, 1), (n, 1, m * n))
-    weight_scale_tmp = b_sf.as_strided((w_n, w_k, 1), (w_k, 1, w_n * w_k))
-    # [xx, m]
-    input_scale_tmp = a_sf.permute(1, 0).as_strided(
-        (m, w_k, 1), (1, m, w_k * m)
-    )
-
-    mA = from_dlpack(a_tmp, assumed_align=16).mark_layout_dynamic(leading_dim=1)
-    mB = from_dlpack(b_tmp, assumed_align=16).mark_layout_dynamic(leading_dim=1)
-    mC = from_dlpack(c_tmp, assumed_align=16).mark_layout_dynamic(leading_dim=1)
-    mA.element_type = cutlass.Float8E4M3FN
-    mB.element_type = cutlass.Float8E4M3FN
-
-    # TODO: mSFA is column major
-    mSFA = from_dlpack(input_scale_tmp, assumed_align=16).mark_layout_dynamic(
-        leading_dim=0
-    )
-    mSFB = from_dlpack(weight_scale_tmp, assumed_align=16).mark_layout_dynamic(
-        leading_dim=1
-    )
-    """
1372-1404: Remove commented autotuner code or add TODO
The commented autotuner configuration suggests this was intended to be tunable. Either remove this code if not needed, or add a TODO comment explaining when autotuning will be implemented.

-# @autotuner.tuning_config(
-#     name="trtllm::cute_dsl_fp8_gemm_blackwell::gemm",
-#     dynamic_tensor_specs=(DynamicTensorSpec(
-#         0, 0, get_last_power_of_2_num_tokens_buckets,
-#         last_positive_power_of_2), ),
-#     constraint_specs=(ConstraintSpec(
-#         2, 1, cute_dsl_fp8_linear_scale_infer_shape_blackwell), ),
-#     configs={
-#         "use_2cta_instrs": [False],
-#         "mma_tiler_mn": [(128, 128)],
-#         "cluster_shape_mn": [(1, 1), (1, 2), (1, 4), (2, 1), (2, 2), (4, 1),
-#                              (4, 4)],
-#     },
-# )
 def cute_dsl_fp8_gemm_blackwell(
     input: torch.Tensor,
     weight: torch.Tensor,
     input_scale: torch.Tensor,
     weight_scale: torch.Tensor,
 ) -> torch.Tensor:
-    # tuner = AutoTuner.get()
-    # # allocate workspace for profiling
-    # cute_dsl_fp8_gemm_blackwell_runner = CuteDSLFp8BlackwellLinear()
-    # _, best_tactic, best_config = tuner.choose_one(
-    #     "trtllm::cute_dsl_fp8_gemm_blackwell::gemm",
-    #     [cute_dsl_fp8_gemm_blackwell_runner],
-    #     [input, weight, input_scale, weight_scale],
-    # )
-    # return cute_dsl_fp8_gemm_blackwell_runner(
-    #     inputs=[input, weight, input_scale, weight_scale],
-    #     tactic=best_tactic,
-    #     **best_config,
-    # )
-
+    # TODO: Implement autotuning when performance profiling is complete
     cute_dsl_fp8_gemm_blackwell_runner = CuteDSLFp8BlackwellLinear()
1503-1507: Remove commented validation and alternative implementation code
These commented sections should be removed to improve code clarity.

-        # if c.dtype != torch.bfloat16:
-        #     assert False, "c.dtype != bf16"
-        # if c.shape != (l, m, n):
-        #     assert False, "c.shape != (l, m, n)"
-
-        # """
         # torch_tensor -> cute.tensor
         a_tmp = a.permute(1, 2, 0).view(torch.uint8)
         ...
-        # """
-        # a_tmp = a.view(torch.uint8)
-        # b_tmp = b.view(torch.uint8)
-        # mA = from_dlpack(
-        #     a_tmp, assumed_align=16).mark_layout_dynamic(leading_dim=2)
-        # mB = from_dlpack(
-        #     b_tmp, assumed_align=16).mark_layout_dynamic(leading_dim=2)
-        # mC = from_dlpack(
-        #     c, assumed_align=16).mark_layout_dynamic(leading_dim=2)
-        # mA.element_type = cutlass.Float8E4M3FN
-        # mB.element_type = cutlass.Float8E4M3FN
-
-        # # Note: mSFA is column major
-        # mSFA = from_dlpack(
-        #     a_sf, assumed_align=16).mark_layout_dynamic(leading_dim=2)
-        # mSFB = from_dlpack(
-        #     b_sf, assumed_align=16).mark_layout_dynamic(leading_dim=2)

Also applies to: 1543-1560
1619-1636: Remove commented autotuner configuration
Similar to the GEMM operator, remove the commented autotuner code or add a TODO if this is planned for future implementation.

-# @autotuner.tuning_config(
-#     name="trtllm::cute_dsl_fp8_bmm_blackwell::gemm",
-#     dynamic_tensor_specs=(DynamicTensorSpec(
-#         0, 1, get_last_power_of_2_num_tokens_buckets,
-#         last_positive_power_of_2), ),
-#     # constraint_specs=(
-#     #     ConstraintSpec(2, 0, cute_dsl_fp8_bmm_scale_infer_shape_blackwell),
-#     # ),
-#     constraint_specs=(ConstraintSpec(
-#         2, 2, cute_dsl_fp8_bmm_scale_infer_shape_blackwell), ),
-#     configs={
-#         "use_2cta_instrs": [False],
-#         "mma_tiler_mn": [(128, 128)],
-#         "cluster_shape_mn": [(1, 1), (1, 2), (1, 4), (2, 1), (2, 2), (4, 1),
-#                              (4, 4)],
-#         # 'cluster_shape_mn': [(1, 1)],
-#     },
-# )
 def cute_dsl_fp8_bmm_blackwell(
     input: torch.Tensor,
     weight: torch.Tensor,
     input_scale: torch.Tensor,
     weight_scale: torch.Tensor,
     out: torch.Tensor,
 ) -> None:
-    # tuner = AutoTuner.get()
-
-    # # allocate workspace for profiling
-    # cute_dsl_fp8_bmm_blackwell_runner = CuteDSLFp8BlackwellBmm()
-
-    # _, best_tactic, best_config = tuner.choose_one(
-    #     "trtllm::cute_dsl_fp8_bmm_blackwell::gemm",
-    #     [cute_dsl_fp8_bmm_blackwell_runner],
-    #     [input, weight, input_scale, weight_scale, out],
-    # )
-    # cute_dsl_fp8_bmm_blackwell_runner(
-    #     inputs=[input, weight, input_scale, weight_scale, out],
-    #     tactic=best_tactic,
-    #     **best_config,
-    # )
-
+    # TODO: Implement autotuning when performance profiling is complete
     cute_dsl_fp8_bmm_blackwell_runner = CuteDSLFp8BlackwellBmm()

Also applies to: 1644-1659
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (2 hunks)
- tests/unittest/_torch/thop/test_fp8_block_scale_gemm.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/unittest/_torch/thop/test_fp8_block_scale_gemm.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile = ...).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL = ...).
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
🧠 Learnings (1)
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Applied to files:
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
1207-1207: Ambiguous variable name: l
(E741)
1448-1448: Ambiguous variable name: l
(E741)
1501-1501: Ambiguous variable name: l
(E741)
1501-1501: Local variable n is assigned to but never used
Remove assignment to unused variable n
(F841)
1501-1501: Local variable k is assigned to but never used
Remove assignment to unused variable k
(F841)
1502-1502: Local variable w_n is assigned to but never used
Remove assignment to unused variable w_n
(F841)
1678-1678: Ambiguous variable name: l
(E741)
1678-1678: Local variable k is assigned to but never used
Remove assignment to unused variable k
(F841)
from functools import lru_cache
from typing import List, Optional, Tuple

import cutlass
import cutlass.cute as cute
import nvtx
import torch
from cuda import cuda
from cutlass.cute.runtime import from_dlpack
💡 Verification agent
❓ Verification inconclusive
Add NVIDIA copyright header
According to the coding guidelines, all TensorRT-LLM source files must contain an NVIDIA copyright header that includes the current year.
🏁 Script executed:
#!/bin/bash
# Check if other Python files in the repository have copyright headers
head -n 20 tensorrt_llm/_torch/custom_ops/*.py | grep -E "(Copyright|NVIDIA)" | head -10

Length of output: 366
Add missing NVIDIA copyright header
Per the coding guidelines, every source file must include an NVIDIA copyright header with the current year.
• File needing update:
- tensorrt_llm/_torch/custom_ops/torch_custom_ops.py — insert header at the very top
Suggested header snippet (adjust to match your repo’s exact template):
+ # Copyright (c) 2025 NVIDIA CORPORATION. All rights reserved.
+ # SPDX-License-Identifier: Apache-2.0
+
from functools import lru_cache
...
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/custom_ops/torch_custom_ops.py at the very top before any
imports, add the NVIDIA copyright header with the current year as required by
coding guidelines. Insert the standard header snippet used in your repo to
ensure compliance.
l, m, k = inputs[0].shape[0], inputs[0].shape[1], inputs[0].shape[2]
# [b, n, k]
n = inputs[1].shape[1]
Use descriptive variable name instead of 'l'
The variable 'l' is ambiguous. Use 'batch_size' for clarity.
- l, m, k = inputs[0].shape[0], inputs[0].shape[1], inputs[0].shape[2]
+ batch_size, m, k = inputs[0].shape[0], inputs[0].shape[1], inputs[0].shape[2]
n = inputs[1].shape[1]

Also update the reference at line 1467:
- l,
+ batch_size,
🧰 Tools
🪛 Ruff (0.12.2)
1448-1448: Ambiguous variable name: l
(E741)
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/custom_ops/torch_custom_ops.py around lines 1448 to 1450,
rename the variable 'l' to 'batch_size' to improve code clarity and
descriptiveness. Also, update all references to 'l' accordingly, including the
one at line 1467, to maintain consistency and avoid errors.
l, m, k = mat_a.shape[0], mat_a.shape[1], mat_a.shape[2]
n = mat_b.shape[1]
if out.dtype != torch.bfloat16:
    assert False, "out.dtype != bf16"
if out.shape != (l, m, n):
    assert False, "out.shape != (l, m, n)"
Fix variable naming and remove unused variable
Use descriptive variable names and remove the unused variable 'k'. Also, use proper exceptions instead of assert False.
- l, m, k = mat_a.shape[0], mat_a.shape[1], mat_a.shape[2]
+ batch_size, m = mat_a.shape[0], mat_a.shape[1]
n = mat_b.shape[1]
if out.dtype != torch.bfloat16:
- assert False, "out.dtype != bf16"
- if out.shape != (l, m, n):
- assert False, "out.shape != (l, m, n)"
+ raise ValueError("out.dtype must be torch.bfloat16")
+ if out.shape != (batch_size, m, n):
+ raise ValueError(f"out.shape must be ({batch_size}, {m}, {n}), got {out.shape}")📝 Committable suggestion
🧰 Tools
🪛 Ruff (0.12.2)
1678-1678: Ambiguous variable name: l
(E741)
1678-1678: Local variable k is assigned to but never used
Remove assignment to unused variable k
(F841)
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/custom_ops/torch_custom_ops.py around lines 1678 to 1683,
rename variables to more descriptive names, remove the unused variable 'k', and
replace the assert False statements with proper exception raising such as
ValueError with clear error messages for dtype and shape mismatches.
…mm-blowise-op Signed-off-by: Mindy Li <[email protected]>
Signed-off-by: Mindy Li <[email protected]>
Actionable comments posted: 0
♻️ Duplicate comments (5)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (5)
1-10: Add missing NVIDIA copyright header
According to the coding guidelines, all TensorRT-LLM source files must contain an NVIDIA copyright header with the current year.
Add the copyright header at the very beginning of the file:
+# Copyright (c) 2025 NVIDIA CORPORATION. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
 from functools import lru_cache
998-998: Use descriptive variable name instead of 'l'
Single-letter variable 'l' is ambiguous and violates coding standards.

- l = 1
+ batch_size = 1

Also update the reference at line 1016:

- l,
+ batch_size,
1239-1241: Fix variable naming and remove unused variables
Use descriptive variable names and remove unused variables.

- l, m, k = inputs[0].shape[0], inputs[0].shape[1], inputs[0].shape[2]
+ batch_size, m, k = inputs[0].shape[0], inputs[0].shape[1], inputs[0].shape[2]
  n = inputs[1].shape[1]

Also update references at lines 1255, 1258, 1259:

- l,
+ batch_size,
1292-1293: Remove unused variables in forward method
Variables n, k, and w_n are assigned but never used.

- l, m, n, k = a.shape[0], a.shape[1], b.shape[1], b.shape[2]
- w_n, w_k = b_sf.shape[1], b_sf.shape[2]
+ l, m = a.shape[0], a.shape[1]
+ w_k = b_sf.shape[2]
1469-1474: Fix variable naming, remove unused variable, and use proper exceptions
Use descriptive variable names, remove unused variables, and replace assert statements with proper exceptions.

-    l, m, k = mat_a.shape[0], mat_a.shape[1], mat_a.shape[2]
+    batch_size, m = mat_a.shape[0], mat_a.shape[1]
     n = mat_b.shape[1]
     if out.dtype != torch.bfloat16:
-        assert False, "out.dtype != bf16"
-    if out.shape != (l, m, n):
-        assert False, "out.shape != (l, m, n)"
+        raise ValueError("out.dtype must be torch.bfloat16")
+    if out.shape != (batch_size, m, n):
+        raise ValueError(f"out.shape must be ({batch_size}, {m}, {n}), got {out.shape}")
🧹 Nitpick comments (7)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (7)
12-26: Remove commented imports and document the release kernel usage
The commented imports create unnecessary clutter. If switching between debug and release kernels is needed, consider using a configuration flag instead.

-# from ..attention_backend.interface import AttentionInputType
-# from ..autotuner import (AutoTuner, ConstraintSpec, DynamicTensorSpec,
-#                          OptimizationProfile, TunableRunner, TuningConfig)
-# from ..utils import (compute_swizzled_sf_shape, fp4_scale_infer_shape,
-#                      get_last_power_of_2_num_tokens_buckets,
-#                      last_positive_power_of_2)
 from tensorrt_llm._torch.autotuner import (AutoTuner, ConstraintSpec,
                                            DynamicTensorSpec,
                                            OptimizationProfile,
                                            TunableRunner, TuningConfig)
-# from tensorrt_llm._torch.custom_ops.cute_dsl_kernels.blackwell.blockwise_gemm import \
-#     BlockwiseGemmKernel
+# Using release version of BlockwiseGemmKernel for production
 from tensorrt_llm._torch.custom_ops.cute_dsl_kernels.blackwell.blockwise_gemm_release import \
     BlockwiseGemmKernel
944-946: Add type hints to utility function
Add type hints for better code clarity and IDE support.

-def pad_up(x: int, y: int) -> int:
-    return ((x + y - 1) // y) * y
+def pad_up(x: int, y: int) -> int:
+    """Round up x to the nearest multiple of y."""
+    return ((x + y - 1) // y) * y
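For instance, this helper gives pad_up(5, 4) == 8 and pad_up(8, 4) == 8, i.e. it rounds up to the nearest multiple of y as the suggested docstring states.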
969-978: Clean up commented code in append_ones_wrapper
Remove the commented line and improve documentation.

 @cute.jit
 def append_ones_wrapper(a: cute.Tensor):
-    # return cute.append_ones(tensor)
+    """Append ones to tensor layout up to rank 3."""
     a_layout = a.layout
     a_layout = cute.append(a_layout, cute.make_layout(1, stride=1), up_to_rank=3)
     new_a = cute.make_tensor(a.iterator, a_layout)
     return new_a
1047-1078: Remove large block of commented code
The "before opt" commented code should be removed. If needed for reference, it should be in version control history or documentation.

-    # before opt
-    """
-    a, b, a_sf, b_sf = inputs
-    m, n, k = a.shape[0], b.shape[0], a.shape[1]
-    w_n, w_k = b_sf.shape[0], b_sf.shape[1]
-    c = torch.empty(*(m, n), dtype=torch.bfloat16, device="cuda")
-
-    # torch_tensor -> cute.tensor
-    a_tmp = a.as_strided((m, k, 1), (k, 1, m * k)).view(torch.uint8)
-    b_tmp = b.as_strided((n, k, 1), (k, 1, n * k)).view(torch.uint8)
-    c_tmp = c.as_strided((m, n, 1), (n, 1, m * n))
-    weight_scale_tmp = b_sf.as_strided((w_n, w_k, 1), (w_k, 1, w_n * w_k))
-    # [xx, m]
-    input_scale_tmp = a_sf.permute(1, 0).as_strided(
-        (m, w_k, 1), (1, m, w_k * m)
-    )
-
-    mA = from_dlpack(a_tmp, assumed_align=16).mark_layout_dynamic(leading_dim=1)
-    mB = from_dlpack(b_tmp, assumed_align=16).mark_layout_dynamic(leading_dim=1)
-    mC = from_dlpack(c_tmp, assumed_align=16).mark_layout_dynamic(leading_dim=1)
-    mA.element_type = cutlass.Float8E4M3FN
-    mB.element_type = cutlass.Float8E4M3FN
-
-    # TODO: mSFA is column major
-    mSFA = from_dlpack(input_scale_tmp, assumed_align=16).mark_layout_dynamic(
-        leading_dim=0
-    )
-    mSFB = from_dlpack(weight_scale_tmp, assumed_align=16).mark_layout_dynamic(
-        leading_dim=1
-    )
-    """
-    a, b, a_sf, b_sf = inputs
1163-1195: Remove commented autotuner code and document why it's disabled
Large blocks of commented code should be removed. If autotuning was intentionally disabled, please document the reason.

-# @autotuner.tuning_config(
-#     name="trtllm::cute_dsl_fp8_gemm_blackwell::gemm",
-#     dynamic_tensor_specs=(DynamicTensorSpec(
-#         0, 0, get_last_power_of_2_num_tokens_buckets,
-#         last_positive_power_of_2), ),
-#     constraint_specs=(ConstraintSpec(
-#         2, 1, cute_dsl_fp8_linear_scale_infer_shape_blackwell), ),
-#     configs={
-#         "use_2cta_instrs": [False],
-#         "mma_tiler_mn": [(128, 128)],
-#         "cluster_shape_mn": [(1, 1), (1, 2), (1, 4), (2, 1), (2, 2), (4, 1),
-#                              (4, 4)],
-#     },
-# )
 def cute_dsl_fp8_gemm_blackwell(
     input: torch.Tensor,
     weight: torch.Tensor,
     input_scale: torch.Tensor,
     weight_scale: torch.Tensor,
 ) -> torch.Tensor:
-    # tuner = AutoTuner.get()
-    # # allocate workspace for profiling
-    # cute_dsl_fp8_gemm_blackwell_runner = CuteDSLFp8BlackwellLinear()
-    # _, best_tactic, best_config = tuner.choose_one(
-    #     "trtllm::cute_dsl_fp8_gemm_blackwell::gemm",
-    #     [cute_dsl_fp8_gemm_blackwell_runner],
-    #     [input, weight, input_scale, weight_scale],
-    # )
-    # return cute_dsl_fp8_gemm_blackwell_runner(
-    #     inputs=[input, weight, input_scale, weight_scale],
-    #     tactic=best_tactic,
-    #     **best_config,
-    # )
-
+    # TODO: Autotuning is temporarily disabled. Using fixed configuration.
     cute_dsl_fp8_gemm_blackwell_runner = CuteDSLFp8BlackwellLinear()

Was autotuning disabled for performance reasons or is this temporary? Consider adding a flag to enable/disable autotuning rather than commenting out the code.
1335-1350: Remove commented alternative implementation
Remove the large block of commented code to improve readability.

-        # """
-        # a_tmp = a.view(torch.uint8)
-        # b_tmp = b.view(torch.uint8)
-        # mA = from_dlpack(
-        #     a_tmp, assumed_align=16).mark_layout_dynamic(leading_dim=2)
-        # mB = from_dlpack(
-        #     b_tmp, assumed_align=16).mark_layout_dynamic(leading_dim=2)
-        # mC = from_dlpack(
-        #     c, assumed_align=16).mark_layout_dynamic(leading_dim=2)
-        # mA.element_type = cutlass.Float8E4M3FN
-        # mB.element_type = cutlass.Float8E4M3FN
-
-        # # Note: mSFA is column major
-        # mSFA = from_dlpack(
-        #     a_sf, assumed_align=16).mark_layout_dynamic(leading_dim=2)
-        # mSFB = from_dlpack(
-        #     b_sf, assumed_align=16).mark_layout_dynamic(leading_dim=2)
-
         # get stream
         stream = cuda.CUstream(torch.cuda.current_stream().cuda_stream)
1410-1449: Clean up commented autotuner configuration
Remove the large block of commented autotuner code for better readability.

-# @autotuner.tuning_config(
-#     name="trtllm::cute_dsl_fp8_bmm_blackwell::gemm",
-#     dynamic_tensor_specs=(DynamicTensorSpec(
-#         0, 1, get_last_power_of_2_num_tokens_buckets,
-#         last_positive_power_of_2), ),
-#     # constraint_specs=(
-#     #     ConstraintSpec(2, 0, cute_dsl_fp8_bmm_scale_infer_shape_blackwell),
-#     # ),
-#     constraint_specs=(ConstraintSpec(
-#         2, 2, cute_dsl_fp8_bmm_scale_infer_shape_blackwell), ),
-#     configs={
-#         "use_2cta_instrs": [False],
-#         "mma_tiler_mn": [(128, 128)],
-#         "cluster_shape_mn": [(1, 1), (1, 2), (1, 4), (2, 1), (2, 2), (4, 1),
-#                              (4, 4)],
-#         # 'cluster_shape_mn': [(1, 1)],
-#     },
-# )
 def cute_dsl_fp8_bmm_blackwell(
     input: torch.Tensor,
     weight: torch.Tensor,
     input_scale: torch.Tensor,
     weight_scale: torch.Tensor,
     out: torch.Tensor,
 ) -> None:
-    # tuner = AutoTuner.get()
-
-    # # allocate workspace for profiling
-    # cute_dsl_fp8_bmm_blackwell_runner = CuteDSLFp8BlackwellBmm()
-
-    # _, best_tactic, best_config = tuner.choose_one(
-    #     "trtllm::cute_dsl_fp8_bmm_blackwell::gemm",
-    #     [cute_dsl_fp8_bmm_blackwell_runner],
-    #     [input, weight, input_scale, weight_scale, out],
-    # )
-    # cute_dsl_fp8_bmm_blackwell_runner(
-    #     inputs=[input, weight, input_scale, weight_scale, out],
-    #     tactic=best_tactic,
-    #     **best_config,
-    # )
-
+    # TODO: Autotuning is temporarily disabled. Using fixed configuration.
     cute_dsl_fp8_bmm_blackwell_runner = CuteDSLFp8BlackwellBmm()
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (2 hunks)
- tensorrt_llm/_torch/modules/linear.py (3 hunks)
- tests/unittest/_torch/thop/test_fp8_block_scale_gemm.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/unittest/_torch/thop/test_fp8_block_scale_gemm.py
- tensorrt_llm/_torch/modules/linear.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile = ...).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL = ...).
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
🧠 Learnings (3)
📚 Learning: applies to **/*.{cpp,h,hpp,cc,cxx,cu,py} : all tensorrt-llm open source software code should contain...
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-04T02:12:17.550Z
Learning: Applies to **/*.{cpp,h,hpp,cc,cxx,cu,py} : All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Applied to files:
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
📚 Learning: applies to **/*.py : the code developed for tensorrt-llm should conform to python 3.8+....
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-04T02:12:17.550Z
Learning: Applies to **/*.py : The code developed for TensorRT-LLM should conform to Python 3.8+.
Applied to files:
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Applied to files:
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
998-998: Ambiguous variable name: l
(E741)
1239-1239: Ambiguous variable name: l
(E741)
1292-1292: Ambiguous variable name: l
(E741)
1292-1292: Local variable n is assigned to but never used
Remove assignment to unused variable n
(F841)
1292-1292: Local variable k is assigned to but never used
Remove assignment to unused variable k
(F841)
1293-1293: Local variable w_n is assigned to but never used
Remove assignment to unused variable w_n
(F841)
1469-1469: Ambiguous variable name: l
(E741)
1469-1469: Local variable k is assigned to but never used
Remove assignment to unused variable k
(F841)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
Signed-off-by: Mindy Li <[email protected]>
Signed-off-by: Mindy Li <[email protected]>
Signed-off-by: Mindy Li <[email protected]>
Actionable comments posted: 8
🧹 Nitpick comments (6)
tensorrt_llm/_torch/modules/attention.py (2)
223-224: Clean up commented code
Remove commented code fragments to improve code readability.

             allreduce_strategy=config.allreduce_strategy,
-            force_dynamic_quantization=config.force_dynamic_quantization,
-        )
-        # use_cute_dsl_fp8_block_scale_gemm=config.use_cute_dsl_fp8_block_scale_gemm)
+            force_dynamic_quantization=config.force_dynamic_quantization)
895-896: Remove TODO comment
Remove debug TODO comment.

-        # TODO:
-        # print(f"limin: attention.py, get_sm_version(): {get_sm_version()}, self.use_cute_dsl_fp8_block_scale_bmm: {self.use_cute_dsl_fp8_block_scale_bmm}")

tests/unittest/_torch/modules/test_fused_moe.py (1)
678-678: Clean up commented code or restore validation logic.
The commented debug prints should be removed, but consider restoring the validation comparisons between different implementations to ensure correctness.

- # output_origin = fused_moe_origin.forward(x, router_logits)
- # print("limin: output = ", output)
- # print("limin: ref_output = ", ref_output)
- # torch.testing.assert_close(output_origin, output, rtol=1e-2, atol=0.1)
- # torch.testing.assert_close(output_origin, ref_output, rtol=1e-2, atol=0.1)

Also applies to: 683-686
tensorrt_llm/_torch/models/modeling_deepseekv3.py (3)
1130-1130: Remove debug print statement.
This debug print statement should be removed from production code. It also exceeds the 120-character line length limit.
- # print(f"limin: DeepseekV3ForCausalLM, model_config.use_cute_dsl_fp8_block_scale_bmm: {model_config.use_cute_dsl_fp8_block_scale_bmm}, model_config.use_cute_dsl_fp8_block_scale_gemm: {model_config.use_cute_dsl_fp8_block_scale_gemm}")
1359-1362: Replace debug prints with proper logging.
Debug print statements should be removed or replaced with proper logging using the logger object that's already imported.

-        moe_backend = self.model_config.moe_backend.upper()
-        print(
-            f"limin: DeepseekV3ForCausalLM load_weights, moe_backend: {moe_backend}"
-        )
+        moe_backend = self.model_config.moe_backend.upper()
+        logger.debug(f"DeepseekV3ForCausalLM load_weights, moe_backend: {moe_backend}")
1429-1448: Clean up commented dequantization code.
This large block of commented weight dequantization code should be either:
- Removed if no longer needed with the new FP8 implementation
- Properly integrated if still required for certain configurations
Leaving extensive commented code reduces maintainability.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (14)
- examples/llm-api/quickstart_advanced.py (2 hunks)
- tensorrt_llm/_torch/model_config.py (1 hunks)
- tensorrt_llm/_torch/models/modeling_deepseekv3.py (11 hunks)
- tensorrt_llm/_torch/modules/attention.py (13 hunks)
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py (4 hunks)
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (1 hunks)
- tensorrt_llm/_torch/modules/fused_moe/quantization.py (3 hunks)
- tensorrt_llm/_torch/modules/gated_mlp.py (3 hunks)
- tensorrt_llm/_torch/modules/linear.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/config.py (1 hunks)
- tensorrt_llm/llmapi/llm.py (1 hunks)
- tensorrt_llm/llmapi/llm_args.py (2 hunks)
- tests/integration/defs/accuracy/test_llm_api_pytorch.py (3 hunks)
- tests/unittest/_torch/modules/test_fused_moe.py (6 hunks)
✅ Files skipped from review due to trivial changes (5)
- examples/llm-api/quickstart_advanced.py
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py
- tensorrt_llm/_torch/pyexecutor/config.py
- tensorrt_llm/_torch/model_config.py
- tensorrt_llm/llmapi/llm_args.py
🚧 Files skipped from review as they are similar to previous changes (1)
- tensorrt_llm/_torch/modules/linear.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile = ...).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL = ...).
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
tensorrt_llm/_torch/modules/gated_mlp.py
tensorrt_llm/llmapi/llm.py
tests/integration/defs/accuracy/test_llm_api_pytorch.py
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py
tensorrt_llm/_torch/modules/fused_moe/quantization.py
tensorrt_llm/_torch/modules/attention.py
tests/unittest/_torch/modules/test_fused_moe.py
tensorrt_llm/_torch/models/modeling_deepseekv3.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
tensorrt_llm/_torch/modules/gated_mlp.py
tensorrt_llm/llmapi/llm.py
tests/integration/defs/accuracy/test_llm_api_pytorch.py
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py
tensorrt_llm/_torch/modules/fused_moe/quantization.py
tensorrt_llm/_torch/modules/attention.py
tests/unittest/_torch/modules/test_fused_moe.py
tensorrt_llm/_torch/models/modeling_deepseekv3.py
🧠 Learnings (2)
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tensorrt_llm/llmapi/llm.py
tests/integration/defs/accuracy/test_llm_api_pytorch.py
📚 Learning: applies to **/*.py : the code developed for tensorrt-llm should conform to python 3.8+....
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-04T02:12:17.582Z
Learning: Applies to **/*.py : The code developed for TensorRT-LLM should conform to Python 3.8+.
Applied to files:
tensorrt_llm/llmapi/llm.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/attention.py
204-204: Line too long (191 > 120)
(E501)
671-671: Line too long (214 > 120)
(E501)
896-896: Line too long (161 > 120)
(E501)
tensorrt_llm/_torch/models/modeling_deepseekv3.py
1130-1130: Line too long (242 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (26)
tensorrt_llm/llmapi/llm.py (1)
137-144: LGTM! Code formatting improvement with future feature preparation.
The reformatting from inline to multi-line list improves readability. The commented code indicates planned support for FP8 block scale GEMM and BMM flags, which aligns with the PR's objective of adding CuTe DSL FP8 operators.
tensorrt_llm/_torch/modules/gated_mlp.py (4)
40-41: LGTM! Proper parameter addition with backward compatibility.
The new use_cute_dsl_fp8_block_scale_gemm parameter is well-structured with appropriate type annotation and safe default value.
47-47: LGTM! Proper instance attribute assignment.
Correctly stores the parameter as an instance attribute following established patterns.
79-81: LGTM! Correct parameter propagation to gate_up_proj.
The FP8 block scale GEMM flag is properly passed to the Linear module constructor.
99-100: LGTM! Consistent parameter propagation to down_proj.
The FP8 block scale GEMM flag is consistently passed to both Linear modules in the GatedMLP.
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py (4)
12-13: LGTM! Necessary import additions for quantization method selection.
The new imports are required for the _get_quant_method implementation and are properly organized.
36-37: LGTM! Useful debug instrumentation for development.
The commented debug print statements provide valuable tensor shape information for debugging without affecting production performance.
145-155: LGTM! Robust quantization method selection with proper error handling.
The _get_quant_method correctly handles different quantization scenarios:
- FP8 block scales: returns appropriate CuTe DSL method
- Unsupported modes: raises informative ValueError
- No quantization: returns unquantized method
The error handling and method selection logic are well-implemented.
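A compact sketch of the dispatch shape being described; the class names below are illustrative stand-ins, not the module's actual quantization-method classes:

```python
class UnquantizedFusedMoEMethodStub:  # illustrative stand-in
    pass


class CuteDslFp8BlockScalesMoEMethodStub:  # illustrative stand-in
    pass


def get_quant_method(quant_config):
    """Mirror of the described selection logic: CuTe DSL for FP8 block scales,
    an informative error for other quant modes, and the unquantized path otherwise."""
    if quant_config is None or not quant_config.layer_quant_mode.has_any_quant():
        return UnquantizedFusedMoEMethodStub()
    if quant_config.layer_quant_mode.has_fp8_block_scales():
        return CuteDslFp8BlockScalesMoEMethodStub()
    raise ValueError(f"Unsupported quantization mode: {quant_config.layer_quant_mode}")
```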
227-230: LGTM! Additional debug instrumentation for FP8 quantization pipeline.
The debug print statements provide valuable tensor shape information for debugging the FP8 quantization forward pass without affecting production code.
tests/integration/defs/accuracy/test_llm_api_pytorch.py (5)
988-988: Hardware targeting correctly updated to Blackwell.
The change from @skip_no_hopper to @skip_pre_blackwell correctly reflects that the new CuTe DSL FP8 block scale kernels require Blackwell architecture, consistent with the PR's introduction of Blackwell-optimized operators.
1016-1017: CUDA graph configuration modernization is correct.
The transition from the boolean use_cuda_graph flag to the structured cuda_graph_config object aligns with the modern API pattern used throughout the codebase, providing better configuration flexibility.
1147-1147: Hardware targeting consistently updated across test variants.
The decorator change matches the single-GPU test variant, ensuring consistent Blackwell architecture targeting across both test methods.
1183-1184: CUDA graph configuration consistently updated across test variants.
The configuration change matches the single-GPU test variant, maintaining API consistency across both test methods.
1020-1021: Do not enable CuTe DSL block-scale flags in the integration test
The Pydantic MoeConfig used by the LLM API does not include use_cute_dsl_fp8_block_scale_bmm or …_gemm (they're still commented out in tensorrt_llm/llmapi/llm_args.py), and the API's valid_keys check only allows its defined fields (e.g. backend, load_balancer). As a result, uncommenting those lines in the test would break argument validation rather than turn on block-scale operators. Leave them commented until first-class support is added to the LLM API.

Likely an incorrect or invalid review comment.
tensorrt_llm/_torch/modules/fused_moe/quantization.py (1)
557-601: Well-structured refactoring for SM 100 support
The refactoring cleanly separates the base functionality from SM 100-specific weight resmoothing and scale transformation logic. This design allows the Cute DSL backend to use the simpler base class while maintaining backward compatibility.
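A bare-bones sketch of the split being praised here, under the assumption that the SM 100 specifics live in a subclass; the class and method names are illustrative, not the file's actual definitions:

```python
class Fp8BlockScalesMoEMethodBase:
    """Illustrative base: plain FP8 block-scale weight handling only."""

    def process_weights(self, weights):
        # Load FP8 weights and block scales as stored in the checkpoint.
        return weights


class Fp8BlockScalesMoEMethodSm100(Fp8BlockScalesMoEMethodBase):
    """Illustrative SM 100 variant: adds resmoothing and scale transformation."""

    def process_weights(self, weights):
        weights = super().process_weights(weights)
        # Additionally resmooth weights and transform scales for the SM 100 kernels.
        return weights
```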
tensorrt_llm/_torch/modules/attention.py (1)
1408-1428: Correct implementation of CuTe DSL FP8 BMM path
The conditional logic properly selects between the new cute_dsl_fp8_bmm_blackwell operator and the legacy fp8_block_scaling_bmm_out based on the configuration flag.

tests/unittest/_torch/modules/test_fused_moe.py (6)
562-563: LGTM! Architecture requirement correctly updated. The change from @skip_non_hopper_unittest to @skip_pre_blackwell correctly reflects that the new CuTe DSL FP8 operations target Blackwell architecture.
569-570: Verify the restriction of sequence length parameterization. The test now only runs with seq_len=[256] instead of the full range. This significantly reduces test coverage. Please verify if this restriction is:
- Temporary for initial development
- Due to known issues with other sequence lengths
- Performance optimization
Consider re-enabling the full range once the implementation is stable.
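If the restriction does turn out to be temporary, restoring the sweep is a one-line parametrize change; the values below are placeholders, since the original list is not visible in this excerpt:

import pytest

@pytest.mark.parametrize("seq_len", [1, 32, 256, 2048])  # placeholder sweep
def test_seq_len_sweep_placeholder(seq_len):
    # Stand-in body; the real test builds the fused MoE module and compares
    # its output against the reference implementation for each sequence length.
    assert seq_len > 0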
575-580: LGTM! Function name appropriately updated. The function name change to test_fused_moe_fp8_blockwise_cute_dsl correctly reflects the new CuTe DSL functionality being tested.
671-671: LGTM! Reference model correctly configured. Adding use_cute_dsl_fp8_block_scale_gemm=True properly configures the reference model to use the new CuTe DSL implementation for consistent testing.
696-696: LGTM! Multi-GPU test correctly updated. The function name and call are properly updated to use the new CuTe DSL test function, maintaining consistency between single-GPU and multi-GPU tests.
Also applies to: 700-700
974-975: LGTM! Parameter propagation correctly implemented. The use_cute_dsl_fp8_block_scale_gemm parameter is properly added to the constructor, stored as an instance variable, and correctly passed to the GatedMLP instances, ensuring consistent configuration throughout the reference model.
Also applies to: 985-985, 994-995
tensorrt_llm/_torch/models/modeling_deepseekv3.py (4)
193-194: LGTM! Parameter correctly added to DeepseekV3Linear. The use_cute_dsl_fp8_block_scale_gemm parameter is properly added to the constructor and correctly passed to the parent Linear class, maintaining the expected interface.
Also applies to: 209-210
1363-1364: LGTM! Conditional logic correctly implemented. The conditional check for moe_backend == "DEEPGEMM" with FP8 block scales and SM version 100 correctly targets the Blackwell architecture scenarios where the new functionality is needed.
1273-1274: Verify removal of contiguous() calls after transpose. The contiguous() calls have been removed after transpose operations. Please verify that downstream code doesn't require contiguous tensors, as this could cause performance issues or errors. If the contiguous memory layout isn't required, the change is fine. Otherwise, the calls should be restored.
Also applies to: 1315-1316
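The concern is easy to reproduce in isolation: transpose in PyTorch only swaps strides, so the result is non-contiguous until contiguous() materializes it.

import torch

w = torch.randn(128, 64)
wt = w.transpose(0, 1)
print(wt.is_contiguous())    # False: only the strides changed
wt_c = wt.contiguous()       # copies the data into the permuted layout
print(wt_c.is_contiguous())  # True
# Kernels or ops that assume a row-major contiguous input need the second form.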
1496-1497: LGTM! Consistent conditional logic implementation. The conditional check for DEEPGEMM backend with FP8 block scales and SM version 100 is consistent with the earlier implementation and correctly targets the appropriate hardware/software configuration.
reduce_output=False,
)
# use_cute_dsl_fp8_block_scale_gemm=model_config.use_cute_dsl_fp8_block_scale_gemm)
🛠️ Refactor suggestion
Complete the parameter implementation in Deepseekv3MoE.
The use_cute_dsl_fp8_block_scale_gemm parameter is commented out in the GatedMLP constructor. This should be completed for consistency.
reduce_output=False,
- )
- # use_cute_dsl_fp8_block_scale_gemm=model_config.use_cute_dsl_fp8_block_scale_gemm)
+ use_cute_dsl_fp8_block_scale_gemm=model_config.use_cute_dsl_fp8_block_scale_gemm,
+ )
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Original:
reduce_output=False,
)
# use_cute_dsl_fp8_block_scale_gemm=model_config.use_cute_dsl_fp8_block_scale_gemm)
Suggested:
reduce_output=False,
use_cute_dsl_fp8_block_scale_gemm=model_config.use_cute_dsl_fp8_block_scale_gemm,
)
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/models/modeling_deepseekv3.py around lines 481 to 483,
the parameter use_cute_dsl_fp8_block_scale_gemm is commented out in the GatedMLP
constructor call. Uncomment this parameter and pass
model_config.use_cute_dsl_fp8_block_scale_gemm to ensure the parameter is
properly implemented and consistent with the model configuration.
self.mlp = GatedMLP(
    hidden_size=config.hidden_size,
    intermediate_size=config.intermediate_size,
    bias=False,
    dtype=config.torch_dtype,
    config=model_config,
    overridden_tp_size=self.mlp_tp_size,
    reduce_output=True,
)
# use_cute_dsl_fp8_block_scale_gemm=self.model_config.use_cute_dsl_fp8_block_scale_gemm)
🛠️ Refactor suggestion
Complete the parameter implementation in DeepseekV3DecoderLayer.
The use_cute_dsl_fp8_block_scale_gemm parameter is commented out in the GatedMLP constructor. For consistency across the codebase, this should be properly implemented.
reduce_output=True,
- )
- # use_cute_dsl_fp8_block_scale_gemm=self.model_config.use_cute_dsl_fp8_block_scale_gemm)
+ use_cute_dsl_fp8_block_scale_gemm=self.model_config.use_cute_dsl_fp8_block_scale_gemm,
+ )
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Original:
self.mlp = GatedMLP(
    hidden_size=config.hidden_size,
    intermediate_size=config.intermediate_size,
    bias=False,
    dtype=config.torch_dtype,
    config=model_config,
    overridden_tp_size=self.mlp_tp_size,
    reduce_output=True,
)
# use_cute_dsl_fp8_block_scale_gemm=self.model_config.use_cute_dsl_fp8_block_scale_gemm)
Suggested:
self.mlp = GatedMLP(
    hidden_size=config.hidden_size,
    intermediate_size=config.intermediate_size,
    bias=False,
    dtype=config.torch_dtype,
    config=model_config,
    overridden_tp_size=self.mlp_tp_size,
    reduce_output=True,
    use_cute_dsl_fp8_block_scale_gemm=self.model_config.use_cute_dsl_fp8_block_scale_gemm,
)
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/models/modeling_deepseekv3.py around lines 670 to 679,
the parameter use_cute_dsl_fp8_block_scale_gemm is commented out in the GatedMLP
constructor. To fix this, uncomment the parameter and pass
self.model_config.use_cute_dsl_fp8_block_scale_gemm to the GatedMLP
initialization to ensure consistent parameter usage across the codebase.
# self.use_cute_dsl_fp8_block_scale_bmm = config.use_cute_dsl_fp8_block_scale_bmm
# self.use_cute_dsl_fp8_block_scale_gemm = config.use_cute_dsl_fp8_block_scale_gemm
# print(f"limin: attention.py, use_cute_dsl_fp8_block_scale_bmm: {self.use_cute_dsl_fp8_block_scale_bmm}, use_cute_dsl_fp8_block_scale_gemm: {self.use_cute_dsl_fp8_block_scale_gemm}")

# limin-todo: WARNING:
self.use_cute_dsl_fp8_block_scale_bmm = True
self.use_cute_dsl_fp8_block_scale_gemm = True
Replace hardcoded flags with proper configuration
The hardcoded flags bypass configuration and force all instances to use the CuTe DSL implementation. This should be configurable through the model config.
- # self.use_cute_dsl_fp8_block_scale_bmm = config.use_cute_dsl_fp8_block_scale_bmm
- # self.use_cute_dsl_fp8_block_scale_gemm = config.use_cute_dsl_fp8_block_scale_gemm
- # print(f"limin: attention.py, use_cute_dsl_fp8_block_scale_bmm: {self.use_cute_dsl_fp8_block_scale_bmm}, use_cute_dsl_fp8_block_scale_gemm: {self.use_cute_dsl_fp8_block_scale_gemm}")
-
- # limin-todo: WARNING:
- self.use_cute_dsl_fp8_block_scale_bmm = True
- self.use_cute_dsl_fp8_block_scale_gemm = True
+ self.use_cute_dsl_fp8_block_scale_bmm = getattr(config, 'use_cute_dsl_fp8_block_scale_bmm', False)
+ self.use_cute_dsl_fp8_block_scale_gemm = getattr(config, 'use_cute_dsl_fp8_block_scale_gemm', False)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Original:
# self.use_cute_dsl_fp8_block_scale_bmm = config.use_cute_dsl_fp8_block_scale_bmm
# self.use_cute_dsl_fp8_block_scale_gemm = config.use_cute_dsl_fp8_block_scale_gemm
# print(f"limin: attention.py, use_cute_dsl_fp8_block_scale_bmm: {self.use_cute_dsl_fp8_block_scale_bmm}, use_cute_dsl_fp8_block_scale_gemm: {self.use_cute_dsl_fp8_block_scale_gemm}")
# limin-todo: WARNING:
self.use_cute_dsl_fp8_block_scale_bmm = True
self.use_cute_dsl_fp8_block_scale_gemm = True
Suggested:
self.use_cute_dsl_fp8_block_scale_bmm = getattr(config, 'use_cute_dsl_fp8_block_scale_bmm', False)
self.use_cute_dsl_fp8_block_scale_gemm = getattr(config, 'use_cute_dsl_fp8_block_scale_gemm', False)
🧰 Tools
🪛 Ruff (0.12.2)
204-204: Line too long (191 > 120)
(E501)
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/modules/attention.py around lines 202 to 209, the flags
use_cute_dsl_fp8_block_scale_bmm and use_cute_dsl_fp8_block_scale_gemm are
hardcoded to True, bypassing the model configuration. Modify the code to assign
these flags based on the corresponding values from the config object instead of
hardcoding them, ensuring the behavior is configurable per model instance.
| # print(f"limin: attention.py, MLA, config.use_cute_dsl_fp8_block_scale_bmm: {config.use_cute_dsl_fp8_block_scale_bmm}, config.use_cute_dsl_fp8_block_scale_gemm: {config.use_cute_dsl_fp8_block_scale_gemm}") | ||
|
|
||
| # self.use_cute_dsl_fp8_block_scale_bmm = config.use_cute_dsl_fp8_block_scale_bmm | ||
| # self.use_cute_dsl_fp8_block_scale_gemm = config.use_cute_dsl_fp8_block_scale_gemm | ||
|
|
||
| ## limin-TODO: WARNING: | ||
| self.use_cute_dsl_fp8_block_scale_bmm = True | ||
| self.use_cute_dsl_fp8_block_scale_gemm = True | ||
|
|
Replace hardcoded flags with proper configuration in MLA class
The hardcoded flags bypass configuration. This should be configurable through the model config.
- # print(f"limin: attention.py, MLA, config.use_cute_dsl_fp8_block_scale_bmm: {config.use_cute_dsl_fp8_block_scale_bmm}, config.use_cute_dsl_fp8_block_scale_gemm: {config.use_cute_dsl_fp8_block_scale_gemm}")
-
- # self.use_cute_dsl_fp8_block_scale_bmm = config.use_cute_dsl_fp8_block_scale_bmm
- # self.use_cute_dsl_fp8_block_scale_gemm = config.use_cute_dsl_fp8_block_scale_gemm
-
- ## limin-TODO: WARNING:
- self.use_cute_dsl_fp8_block_scale_bmm = True
- self.use_cute_dsl_fp8_block_scale_gemm = True
+ self.use_cute_dsl_fp8_block_scale_bmm = getattr(config, 'use_cute_dsl_fp8_block_scale_bmm', False)
+ self.use_cute_dsl_fp8_block_scale_gemm = getattr(config, 'use_cute_dsl_fp8_block_scale_gemm', False)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # print(f"limin: attention.py, MLA, config.use_cute_dsl_fp8_block_scale_bmm: {config.use_cute_dsl_fp8_block_scale_bmm}, config.use_cute_dsl_fp8_block_scale_gemm: {config.use_cute_dsl_fp8_block_scale_gemm}") | |
| # self.use_cute_dsl_fp8_block_scale_bmm = config.use_cute_dsl_fp8_block_scale_bmm | |
| # self.use_cute_dsl_fp8_block_scale_gemm = config.use_cute_dsl_fp8_block_scale_gemm | |
| ## limin-TODO: WARNING: | |
| self.use_cute_dsl_fp8_block_scale_bmm = True | |
| self.use_cute_dsl_fp8_block_scale_gemm = True | |
| self.use_cute_dsl_fp8_block_scale_bmm = getattr(config, 'use_cute_dsl_fp8_block_scale_bmm', False) | |
| self.use_cute_dsl_fp8_block_scale_gemm = getattr(config, 'use_cute_dsl_fp8_block_scale_gemm', False) |
🧰 Tools
🪛 Ruff (0.12.2)
671-671: Line too long (214 > 120)
(E501)
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/modules/attention.py around lines 671 to 679, the flags
use_cute_dsl_fp8_block_scale_bmm and use_cute_dsl_fp8_block_scale_gemm are
hardcoded to True, bypassing the model configuration. To fix this, remove the
hardcoded assignments and instead assign these flags from the corresponding
attributes in the config object, ensuring they are properly configurable through
the model config.
Signed-off-by: Mindy Li <[email protected]>
Signed-off-by: Mindy Li <[email protected]>
Signed-off-by: Mindy Li <[email protected]>
Signed-off-by: Mindy Li <[email protected]>
Signed-off-by: Mindy Li <[email protected]>
…mm-blowise-op Signed-off-by: Mindy Li <[email protected]>
…owise-op Signed-off-by: Mindy Li <[email protected]>
Just added some suggestions on how to apply the autotuner to cute_dsl_fp8_bmm_blackwell. You can use tactics to pack all the valid tunable configs.
is_valid = BlockwiseGemmKernel.can_implement(
    cutlass.Float8E4M3FN,  # ab_dtype,
    cutlass.Float32,  # acc_dtype,
    cutlass.BFloat16,  # c_dtype,
    use_2cta_instrs,
    mma_tiler_mn,
    cluster_shape_mn,
    m,
    n,
    k,
    l,
    a_major,
    b_major,
    c_major,
)
if is_valid:
    return [0]
else:
    return []
Original:
is_valid = BlockwiseGemmKernel.can_implement(
    cutlass.Float8E4M3FN,  # ab_dtype,
    cutlass.Float32,  # acc_dtype,
    cutlass.BFloat16,  # c_dtype,
    use_2cta_instrs,
    mma_tiler_mn,
    cluster_shape_mn,
    m,
    n,
    k,
    l,
    a_major,
    b_major,
    c_major,
)
if is_valid:
    return [0]
else:
    return []
Suggested:
# This will return all the valid config combinations for tuning
use_2cta_instrs_candi = [False]
mma_tiler_mn_candi = [(128, 128)]
cluster_shape_mn_candi = [(1, 1), (1, 2), (1, 4), (2, 1), (2, 2), (4, 1), (4, 4)]
return [(use_2cta_instrs, mma_tiler_mn, cluster_shape_mn)
        for use_2cta_instrs in [False]
        for mma_tiler_mn in [(128, 128)]
        for cluster_shape_mn in [(1, 1), (1, 2), (1, 4), (2, 1), (2, 2), (4, 1), (4, 4)]
        if BlockwiseGemmKernel.can_implement(
            cutlass.Float8E4M3FN,  # ab_dtype,
            cutlass.Float32,  # acc_dtype,
            cutlass.BFloat16,  # c_dtype,
            use_2cta_instrs,
            mma_tiler_mn,
            cluster_shape_mn,
            m,
            n,
            k,
            l,
            a_major,
            b_major,
            c_major,
        )]
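Roughly, the idea is that get_valid_tactics returns the filtered (use_2cta_instrs, mma_tiler_mn, cluster_shape_mn) tuples, the autotuner times each one, and forward unpacks whichever tuple wins. A standalone sketch of that pattern follows; it uses assumed method signatures and a made-up default config, not the actual TunableRunner interface from tensorrt_llm/_torch/autotuner.py:

class PackedTacticRunnerSketch:
    # Illustrative only: shows the tuple-packing idea from the suggestion above.
    DEFAULT = (False, (128, 128), (1, 1))  # assumed fallback config

    def get_valid_tactics(self, m, n, k, l):
        # Each tactic is a complete kernel config; invalid combinations are
        # filtered by a can_implement-style check (stubbed out here).
        candidates = [(use_2cta, tiler, cluster)
                      for use_2cta in [False]
                      for tiler in [(128, 128)]
                      for cluster in [(1, 1), (1, 2), (2, 1), (2, 2)]]
        return [c for c in candidates if self._can_implement(c, m, n, k, l)]

    def forward(self, inputs, tactic=-1):
        # The autotuner passes back one element of get_valid_tactics(),
        # or -1 when the default path should be used.
        use_2cta_instrs, mma_tiler_mn, cluster_shape_mn = (
            self.DEFAULT if tactic == -1 else tactic)
        return self._launch(inputs, use_2cta_instrs, mma_tiler_mn,
                            cluster_shape_mn)

    def _can_implement(self, config, m, n, k, l):
        return True  # stand-in for BlockwiseGemmKernel.can_implement

    def _launch(self, inputs, use_2cta_instrs, mma_tiler_mn, cluster_shape_mn):
        raise NotImplementedError("kernel launch elided in this sketch")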
Signed-off-by: Mindy Li <[email protected]>
Signed-off-by: Mindy Li <[email protected]>
Signed-off-by: Mindy Li <[email protected]>
Signed-off-by: Mindy Li <[email protected]>
Signed-off-by: Mindy Li <[email protected]>
Signed-off-by: Mindy Li <[email protected]>
Signed-off-by: Mindy Li <[email protected]>
Signed-off-by: Mindy Li <[email protected]>
Signed-off-by: Mindy Li <[email protected]>
Overall LGTM on the llmapi changes.
Signed-off-by: Mindy Li <[email protected]>
    max_active_clusters: cutlass.Constexpr,
    stream: cuda.CUstream,
):
    a = append_ones_wrapper(a)
Do we still need this wrapper, given it's already inside a "jit" function?
This wrapper is used to permute the tensors to the layout our kernel expects.
They are not inside a "jit" function, so I added a jit wrapper for them.
This particular use I'm commenting on is inside a "jit" function, I believe?
Sorry, I misunderstood what you meant. You mean we could move the @cute.jit in the following code?
@cute.jit
def append_ones_wrapper(a: cute.Tensor):
    a_layout = a.layout
    a_layout = cute.append(a_layout,
                           cute.make_layout(1, stride=1),
                           up_to_rank=3)
    new_a = cute.make_tensor(a.iterator, a_layout)
    return new_a
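For readers less used to the CuTe layout algebra, the torch analogue of what the wrapper appears to do (inferred from the snippet above, not from the kernel source) is adding a trailing unit dimension so a rank-2 (M, K) operand can be viewed as a rank-3 (M, K, 1) batched tensor without a copy:

import torch

a = torch.empty(64, 128)               # rank-2 (M, K) operand, illustrative shape
a3 = a.unsqueeze(-1)                   # rank-3 (M, K, 1): same storage, extra unit mode
assert a3.shape == (64, 128, 1)
assert a3.data_ptr() == a.data_ptr()   # no copy, just a new view
# append_ones_wrapper does the analogous thing at the CuTe layout level via
# cute.append(..., up_to_rank=3), so the kernel always sees an (m, k, l) shape.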
Thanks for the work! Most changes associated with the tuning part are good for me.
The tunable runner CuteDSLFp8BlackwellGroupGemm still requires the same modifications to be compatible with the autotuner.
It would also be better to add some comments in get_valid_tactics and forward of each runner with detailed explanations.
Signed-off-by: Mindy Li <[email protected]>
…owise-op Signed-off-by: Mindy Li <[email protected]>
/bot run
PR_Github #15365 [ run ] triggered by Bot
PR_Github #15365 [ run ] completed with state
[TRTLLM-6898] [feat] Add CuTe DSL fp8 gemm, bmm and group gemm ops on Blackwell arch
New Features
Tests
Description
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug (experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
- --reuse-test (optional)pipeline-id (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- --disable-reuse-test (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- --disable-fail-fast (OPTIONAL): Disable fail fast on build/tests/infra failures.
- --skip-test (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- --stage-list "A10-PyTorch-1, xxx" (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- --gpu-type "A30, H100_PCIe" (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- --test-backend "pytorch, cpp" (OPTIONAL): Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- --only-multi-gpu-test (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- --disable-multi-gpu-test (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- --add-multi-gpu-test (OPTIONAL): Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
- --post-merge (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL): Run the ordinary L0 pre-merge pipeline and the specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- --detailed-log (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- --debug (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.