[OMNIML-2336][feat] Add NVFP4 x FP8 #6809

Conversation
Important: Review skipped. More than 25% of the files were skipped due to the max-files limit, and the review is being skipped to prevent a low-quality review: 36 of the 144 files are above the max-files limit of 100. Upgrading to the Pro plan raises these limits, and this status message can be disabled in the CodeRabbit settings.

📝 Walkthrough

Expands GEMM kernel APIs and data layouts (bias, SF tiling, matrix layouts), refactors kernel-parameter builders and the memory-chunk model, adds a runtime PDL flag to GemmInterface/KernelRunner, introduces W4A8 NVFP4+FP8 quant paths and tests, and updates many cubin LFS pointers and build/test wiring.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Caller
    participant Runner as TrtllmGenGemmRunner
    participant GI as GemmInterface
    participant KPS as KernelParamsSetup
    participant Kernel as DeviceKernel
    participant MC as ModuleCache
    Caller->>Runner: run(gemmData, options)
    Runner->>GI: run(config, workspace, data, stream, mpCount, usePdl, MC)
    GI->>KPS: setKernelParams(..., ptrSfA, ptrSfB, ptrBias, ...)
    KPS-->>GI: KernelParams
    GI->>Kernel: launch(params) -- grid-wait gated by usePdl
    Kernel-->>GI: complete
    GI-->>Runner: status
    Runner-->>Caller: status
```

```mermaid
sequenceDiagram
    autonumber
    participant Model
    participant QM as QuantMode
    participant Linear as LinearMethod
    participant GEMM as nvfp4_fp8_gemm_trtllmgen
    Model->>QM: select quant mode
    QM-->>Model: has_w4a8_nvfp4_fp8 == true
    Model->>Linear: create_weights(...FP8 paths...)
    Linear->>GEMM: run(A_fp4, sf, B_fp8, scales, usePdl)
    GEMM-->>Linear: C
    Linear-->>Model: output
```
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
Actionable comments posted: 49
🔭 Outside diff range comments (3)
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Bfloat16_E4m3E4m3_Fp32_t128x8x128u2_s4_et64x8_m64x8x32_cga1x1x1_16dp256b_TN_transOut_noShflA_dsFp8_schedS_sm100a_cubin.cpp (1)
Lines 1-4: .cpp file contains a non-C++ LFS pointer; ensure it is not compiled and address the license-header policy.
- The file has a .cpp extension but contains a Git LFS pointer (non-C++). If any CMake glob or target includes this directory for compilation, it will break the build.
- Coding guidelines require an NVIDIA copyright header in all .cpp files. We cannot prepend a header without breaking the LFS pointer format.
Recommended actions:
- Prefer storing these prebuilt cubins with a non-source extension (e.g., .cubin) and ensure all code paths reference the new extension. Alternatively, keep them under .cpp but guarantee they are excluded from all compilation units via CMake.
- If .cpp extension is retained, document an explicit exemption to the license-header requirement for LFS pointer assets in this directory, or move a license header to a companion README at the directory level.
Also note: filename casing (leading uppercase “Gemm_…”) does not follow the “camel case with first letter lowercase” rule for C++ filenames; if these are assets rather than source, aligning the extension can resolve this inconsistency too.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/Enums.h (1)
Lines 102-111: Replace macros with explicit inline helpers; also avoid defining macros inside a namespace. Macros here are unnecessary and violate the guideline to limit macro usage and keep #define/#undef at global scope. Use explicit inline functions instead.
Apply this diff to remove the macros and define the helpers directly:

```diff
-#define SPLIT_K_FUNCTION(Mode)                                                                     \
-    inline bool doesSplitKUse##Mode(SplitK mode) {                                                 \
-        return (mode == SplitK::Mode);                                                             \
-    }
-
-SPLIT_K_FUNCTION(Gmem)
-SPLIT_K_FUNCTION(Dsmem)
-
-#undef SPLIT_K_FUNCTION
+inline bool doesSplitKUseGmem(SplitK mode)
+{
+    return mode == SplitK::Gmem;
+}
+
+inline bool doesSplitKUseDsmem(SplitK mode)
+{
+    return mode == SplitK::Dsmem;
+}
@@
-#define BIAS_TYPE_FUNCTION(Mode)                                                                   \
-    inline bool isBiasType##Mode(BiasType type) {                                                  \
-        return (type == BiasType::Mode);                                                           \
-    }
-
-BIAS_TYPE_FUNCTION(None)
-BIAS_TYPE_FUNCTION(N)
-BIAS_TYPE_FUNCTION(M)
-BIAS_TYPE_FUNCTION(Mn)
-
-#undef BIAS_TYPE_FUNCTION
+inline bool isBiasTypeNone(BiasType type)
+{
+    return type == BiasType::None;
+}
+inline bool isBiasTypeN(BiasType type)
+{
+    return type == BiasType::N;
+}
+inline bool isBiasTypeM(BiasType type)
+{
+    return type == BiasType::M;
+}
+inline bool isBiasTypeMn(BiasType type)
+{
+    return type == BiasType::Mn;
+}
```

Also applies to: 116-127
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/GemmInterface.h (1)
Lines 597-601: Missing closing brace for namespace gemm. The file structure shows two opening braces for namespace gemm but only one closing brace.
Apply this diff to fix the namespace structure:

```diff
 } // namespace gemm

 ////////////////////////////////////////////////////////////////////////////////////////////////////

 } // namespace gemm
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
```
♻️ Duplicate comments (36)
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_E4m3_E2m1E4m3_castE4m3_Fp32_sfBlk32_t128x64x256u2_s2_et128x64_m128x64x32_cga1x1x1_16dp256b_TN_transOut_schedP_sm100a_cubin.cpp (1)
Lines 1-3: Same LFS pointer considerations apply here. Please ensure this .cpp LFS pointer is covered by the LFS rule, excluded from compilation, and consider using a non-C++ extension to avoid header-policy violations and accidental compilation.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_E4m3_E4m3E4m3_Fp32_t128x128x256u2_s3_et128x128_m128x128x32_cga1x1x1_16dp256b_TN_schedS_sm100a_cubin.cpp (1)
Lines 1-3: Same LFS pointer considerations apply here. Ensure LFS tracking and build exclusion, and consider renaming to a non-C++ extension to avoid license-header policy conflicts.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_E4m3_E4m3E4m3_Fp32_t128x8x128u2_s4_et64x8_m64x8x32_cga1x1x1_16dp256b_TN_transOut_noShflA_dsFp8_schedS_sm100a_cubin.cpp (1)
Lines 1-3: Same LFS pointer considerations apply here. Confirm LFS filter coverage, exclude the file from C++ compilation, and consider renaming away from .cpp to reflect a binary asset.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Fp16_MxE2m1MxE4m3_Fp32_t128x8x512u2_s3_et128x8_m128x8x32_cga1x1x1_16dp256b_TN_transOut_schedS_sm100a_cubin.cpp (1)
Lines 1-3: Same LFS pointer considerations apply here. Treat this as a binary asset: enforce LFS, avoid compilation, and consider a non-.cpp extension to align with policies.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Fp16_E2m1E4m3_castE4m3_Fp32_sfBlk32_t128x128x256_s2_et128x128_m128x128x32_cga1x1x1_16dp256b_TN_transOut_schedP_sm100a_cubin.cpp (1)
Lines 1-3: Same LFS pointer considerations apply here. Ensure LFS filter coverage and build exclusion, and consider renaming to avoid .cpp for non-source assets.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_MxE4m3_MxE2m1MxE4m3_Fp32_t128x8x512u2_s3_et128x8_m128x8x32_cga1x1x1_16dp256b_TN_transOut_schedS_sm100a_cubin.cpp (1)
Lines 1-3: Same LFS pointer considerations apply here. Please verify LFS tracking, ensure CMake does not try to compile it, and consider a non-.cpp extension to avoid policy conflicts.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_E4m3_E4m3E4m3_Fp32_t128x128x128u2_s4_et64x128_m64x128x32_cga1x1x1_16dp256b_TN_transOut_noShflA_dsFp8_schedS_sm100a_cubin.cpp (1)
Lines 1-3: Repeat: the LFS pointer addition is OK; ensure LFS and the build are wired. Same concerns as noted on the first cubin pointer: verify .gitattributes, CI LFS fetch, and CMake references/exclusions for these sources.
Refer to the shell script attached in the first cubin file’s comment to validate LFS setup and integration across all added pointers.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Bfloat16_E4m3E4m3_Fp32_t128x8x512u2_s3_et128x8_m128x8x32_cga1x1x1_16dp256b_TN_transOut_schedS_sm100a_cubin.cpp (1)
Lines 1-4: Same concerns as the first cubin pointer: .cpp extension with non-C++ content and header policy. See the comments on file:
- cpp/tensorrt_llm/.../Gemm_Bfloat16_E4m3E4m3_Fp32_t128x8x128u2_..._dsFp8_schedS_sm100a_cubin.cpp
Ensure exclusion from compilation and address the license-header requirement or switch to a non-source extension.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Fp16_E2m1E4m3_castE4m3_Fp32_sfBlk32_t128x32x512u2_s2_et128x32_m128x32x32_cga1x1x2_16dp256b_splitK2_TN_transOut_schedS_sm100a_cubin.cpp (1)
Lines 1-4: Same concerns as the first cubin pointer: .cpp extension with non-C++ content and header policy. Refer to the detailed guidance in the first file's review and apply the same resolution here.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Fp16_E2m1E4m3_castE4m3_Fp32_sfBlk32_t128x32x512_s2_et128x32_m128x32x32_cga1x1x2_16dp256b_splitK2_TN_transOut_schedS_sm100a_cubin.cpp (1)
Lines 1-4: Same concerns as the first cubin pointer: .cpp extension with non-C++ content and header policy. Please ensure the same build and policy safeguards are applied.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_E4m3_E2m1E4m3_castE4m3_Fp32_sfBlk32_t128x64x256_s2_et128x64_m128x64x32_cga1x1x1_16dp256b_TN_transOut_schedP_sm100a_cubin.cpp (1)
Lines 1-4: Same concerns as the first cubin pointer: .cpp extension with non-C++ content and header policy. Apply the same remediation as suggested in the first file.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Bfloat16_E4m3E4m3_Fp32_t128x32x128_s4_et64x32_m64x32x32_cga1x1x1_16dp256b_TN_transOut_noShflA_dsFp8_schedS_sm100a_cubin.cpp (1)
Lines 1-4: Same concerns as the first cubin pointer: .cpp extension with non-C++ content and header policy. Ensure exclusion from compilation and license-policy handling, or switch to a non-source extension.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Fp16_E2m1E4m3_castE4m3_Fp32_sfBlk32_t128x64x256u2_s2_et128x64_m128x64x32_cga1x1x1_16dp256b_TN_transOut_schedP_sm100a_cubin.cpp (1)
Lines 1-4: Same concerns as the first cubin pointer: .cpp extension with non-C++ content and header policy. The same actions are recommended (exclude from compile, resolve the license-header requirement, or change the extension).
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Bfloat16_E2m1E4m3_castE4m3_Fp32_sfBlk32_t128x64x256u2_s2_et128x64_m128x64x32_cga1x1x1_16dp256b_TN_transOut_schedP_sm100a_cubin.cpp (1)
Lines 1-4: Same concerns as the first cubin pointer: .cpp extension with non-C++ content and header policy. Please apply the same safeguards and decisions here as well.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Fp16_E4m3E4m3_Fp32_t128x32x128u2_s4_et64x32_m64x32x32_cga1x1x1_16dp256b_TN_transOut_noShflA_dsFp8_schedS_sm100a_cubin.cpp (1)
Lines 1-3: Same build-exclusion concern for this LFS pointer .cpp file. Ensure this file is not part of any compilation unit. Refer to the verification script provided in the first cubin pointer comment.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Bfloat16_E4m3E4m3_Fp32_t128x16x128_s4_et64x16_m64x16x32_cga1x1x1_16dp256b_TN_transOut_noShflA_dsFp8_schedS_sm100a_cubin.cpp (1)
Lines 1-3: LFS pointer with .cpp extension; confirm exclusion from C++ targets. This is not compilable C++. Validate the exclusion via CMake (see the earlier script).
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Fp32_MxE2m1MxE4m3_Fp32_t128x8x512u2_s3_et128x8_m128x8x32_cga1x1x1_16dp256b_TN_transOut_schedS_sm100a_cubin.cpp (1)
Lines 1-3: Non-compilable .cpp (LFS pointer); ensure it's never compiled. Please verify it's not referenced by any build targets.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Fp16_E4m3E4m3_Fp32_t128x64x128_s4_et64x64_m64x64x32_cga1x1x1_16dp256b_TN_transOut_noShflA_dsFp8_schedS_sm100a_cubin.cpp (1)
Lines 1-3: Confirm CMake ignores this .cpp LFS pointer. Same rationale as above.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_E4m3_E2m1E4m3_castE4m3_Fp32_sfBlk32_t128x16x512_s2_et128x16_m128x16x32_cga1x1x2_16dp256b_splitK2_TN_transOut_schedS_sm100a_cubin.cpp (1)
Lines 1-3: LFS pointer in a .cpp file; verify exclusion from targets and maintainability. To reiterate: ensure no target compiles this, and prefer renaming to a data extension to avoid tooling confusion.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_E4m3_E4m3E4m3_Fp32_t128x64x128u2_s4_et64x64_m64x64x32_cga1x1x1_16dp256b_TN_transOut_noShflA_dsFp8_schedS_sm100a_cubin.cpp (1)
Lines 1-3: Build-safety check for this LFS pointer .cpp. Please confirm it's excluded from all compilation units.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_E4m3_E4m3E4m3_Fp32_t128x32x128_s4_et64x32_m64x32x32_cga1x1x1_16dp256b_TN_transOut_noShflA_dsFp8_schedS_sm100a_cubin.cpp (1)
Lines 1-3: Ensure this .cpp LFS pointer is not compiled; prefer a non-code extension. Same concern as the other cubin pointers in this PR.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Fp16_E4m3E4m3_Fp32_t128x16x128_s4_et64x16_m64x16x32_cga1x1x1_16dp256b_TN_transOut_noShflA_dsFp8_schedS_sm100a_cubin.cpp (1)
Lines 1-3: Same concerns as the first cubin pointer: ensure build exclusion and a policy exemption. Refer to my comment on Gemm_Bfloat16_E2m1E4m3_castE4m3_*_cubin.cpp for:
- excluding .cpp LFS pointers from compilation,
- documenting/automating header-check exemptions, and
- verifying LFS tracking via .gitattributes.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Fp16_E2m1E4m3_castE4m3_Fp32_sfBlk32_t128x8x512u2_s2_et128x8_m128x8x32_cga1x1x2_16dp256b_splitK2_TN_transOut_schedS_sm100a_cubin.cpp (1)
Lines 1-3: Same concerns as the first cubin pointer: ensure build exclusion and a policy exemption. Refer to my comment on Gemm_Bfloat16_E2m1E4m3_castE4m3_*_cubin.cpp for:
- excluding .cpp LFS pointers from compilation,
- documenting/automating header-check exemptions, and
- verifying LFS tracking via .gitattributes.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_E4m3_E4m3E4m3_Fp32_t128x32x128u2_s4_et64x32_m64x32x32_cga1x1x1_16dp256b_TN_transOut_noShflA_dsFp8_schedS_sm100a_cubin.cpp (1)
Lines 1-3: Same concerns as the first cubin pointer: ensure build exclusion and a policy exemption. Refer to my comment on Gemm_Bfloat16_E2m1E4m3_castE4m3_*_cubin.cpp for:
- excluding .cpp LFS pointers from compilation,
- documenting/automating header-check exemptions, and
- verifying LFS tracking via .gitattributes.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Fp16_E4m3E4m3_Fp32_t128x8x512u2_s3_et128x8_m128x8x32_cga1x1x1_16dp256b_TN_transOut_schedS_sm100a_cubin.cpp (1)
Lines 1-3: Same concerns as the first cubin pointer: ensure build exclusion and a policy exemption. Refer to my comment on Gemm_Bfloat16_E2m1E4m3_castE4m3_*_cubin.cpp for:
- excluding .cpp LFS pointers from compilation,
- documenting/automating header-check exemptions, and
- verifying LFS tracking via .gitattributes.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_E4m3_E2m1E4m3_castE4m3_Fp32_sfBlk32_t128x8x512_s2_et128x8_m128x8x32_cga1x1x2_16dp256b_splitK2_TN_transOut_schedS_sm100a_cubin.cpp (1)
Lines 1-3: Same concerns as the first cubin pointer: ensure build exclusion and a policy exemption. Refer to my comment on Gemm_Bfloat16_E2m1E4m3_castE4m3_*_cubin.cpp for:
- excluding .cpp LFS pointers from compilation,
- documenting/automating header-check exemptions, and
- verifying LFS tracking via .gitattributes.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Bfloat16_E4m3E4m3_Fp32_t128x128x128u2_s4_et64x128_m64x128x32_cga1x1x1_16dp256b_TN_transOut_noShflA_dsFp8_schedS_sm100a_cubin.cpp (1)
Lines 1-3: Same concerns as the first cubin pointer: ensure build exclusion and a policy exemption. Refer to my comment on Gemm_Bfloat16_E2m1E4m3_castE4m3_*_cubin.cpp for:
- excluding .cpp LFS pointers from compilation,
- documenting/automating header-check exemptions, and
- verifying LFS tracking via .gitattributes.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Bfloat16_E2m1E4m3_castE4m3_Fp32_sfBlk32_t128x8x512_s2_et128x8_m128x8x32_cga1x1x2_16dp256b_splitK2_TN_transOut_schedS_sm100a_cubin.cpp (1)
Lines 1-3: Same concerns as the first cubin pointer: ensure build exclusion and a policy exemption. Refer to my comment on Gemm_Bfloat16_E2m1E4m3_castE4m3_*_cubin.cpp for:
- excluding .cpp LFS pointers from compilation,
- documenting/automating header-check exemptions, and
- verifying LFS tracking via .gitattributes.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Fp16_E4m3E4m3_Fp32_t128x128x128u2_s4_et64x128_m64x128x32_cga1x1x1_16dp256b_TN_transOut_noShflA_dsFp8_schedS_sm100a_cubin.cpp (1)
Lines 1-3: Repeat: LFS pointer under .cpp; avoid compilation and header-lint conflicts. Same concerns as noted for the other cubin pointer:
- Prefer renaming away from .cpp, or guarantee CMake excludes this path and header-lint skips it.
- Verify LFS tracking via .gitattributes and git lfs.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Bfloat16_E4m3E4m3_Fp32_t128x64x128u2_s4_et64x64_m64x64x32_cga1x1x1_16dp256b_TN_transOut_noShflA_dsFp8_schedS_sm100a_cubin.cpp (1)
Lines 1-3: Repeat: use a non-source extension for LFS pointers and confirm the LFS/CMake config. To prevent accidental compilation and header checks on LFS pointers:
- Rename away from .cpp or explicitly exclude in CMake and lint.
- Ensure LFS tracking is configured for this path.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Fp16_E4m3E4m3_Fp32_t128x32x128_s4_et64x32_m64x32x32_cga1x1x1_16dp256b_TN_transOut_noShflA_dsFp8_schedS_sm100a_cubin.cpp (1)
Lines 1-3: Repeat: LFS pointer stored as .cpp; fix the extension or exclude it from build/lint. Apply the same resolution as suggested for the other files to avoid compilation and header-lint issues, and verify LFS tracking.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Bfloat16_E2m1E2m1_Fp32_t128x128x256u2_s3_et128x128_m128x128x64_cga1x1x1_16dp256b_TN_schedS_sm100a_cubin.cpp (1)
Lines 1-3: Repeat: non-C++ content in a .cpp file; address the extension and tooling implications. As above, please rename from .cpp or ensure robust exclusions in CMake and header-lint, and verify Git LFS tracking for this path.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_E4m3_E4m3E4m3_Fp32_t128x16x128u2_s4_et64x16_m64x16x32_cga1x1x1_16dp256b_TN_transOut_noShflA_dsFp8_schedS_sm100a_cubin.cpp (1)
Lines 1-3: Repeat: LFS pointer files should not use the .cpp extension. Same action items:
- Rename to a non-source extension or exclude from builds and header checks.
- Confirm LFS tracking with .gitattributes and git lfs.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Bfloat16_E4m3E4m3_Fp32_t128x128x128_s4_et64x128_m64x128x32_cga1x1x1_16dp256b_TN_transOut_noShflA_dsFp8_schedS_sm100a_cubin.cpp (1)
Lines 1-3: Repeat: .cpp extension for an LFS pointer; rename or exclude, and verify LFS and CMake. Please apply the same resolution as for the other cubin pointer files.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_E4m3_E4m3E4m3_Fp32_t128x8x512u2_s3_et128x8_m128x8x32_cga1x1x1_16dp256b_TN_transOut_schedS_sm100a_cubin.cpp (1)
Lines 1-3: Repeat: ensure LFS pointers aren't treated as C++ sources. Consistent with the previous files:
- Prefer non-source extension or ensure build/lint exclusions.
- Confirm LFS tracking and absence from any compilation units.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_E4m3_E2m1E4m3_castE4m3_Fp32_sfBlk32_t128x128x256_s2_et128x128_m128x128x32_cga1x1x1_16dp256b_TN_transOut_schedP_sm100a_cubin.cpp (1)
Lines 1-3: Same LFS pointer/build-exclusion concerns as above. Apply the same verification and build-exclusion steps as noted for the other cubin pointer file to ensure these .cpp LFS pointers do not get compiled and are properly tracked by LFS.
🧹 Nitpick comments (41)
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Bfloat16_E4m3E4m3_Fp32_t128x128x256u2_s2_et128x128_m128x128x32_cga1x1x1_16dp256b_TN_schedS_sm100a_cubin.cpp (1)
Lines 1-3: The filename and location appear to violate C++ file-naming conventions. Guideline: “C++ filenames should use camel case with the first letter lowercase.” This asset-style name with underscores and capitals is likely generated. If this remains a code file, align the naming; if it's a binary asset, prefer a data-oriented extension (e.g., .cubin) and exempt it from C++ filename rules by not treating it as source.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_E4m3_E4m3E4m3_Fp32_t128x16x128_s4_et64x16_m64x16x32_cga1x1x1_16dp256b_TN_transOut_noShflA_dsFp8_schedS_sm100a_cubin.cpp (2)
Lines 1-3: The filename violates the repository's C++ filename convention. Guideline: “C++ filenames should use camel case with the first letter lowercase.” This auto-generated/asset filename is all-caps and extremely long. If practical, consider:
- Renaming to a non-source asset extension (e.g., .cubin) and moving generation-specific metadata to build system or manifest files.
- If a wrapper .cpp is needed, give the wrapper a compliant lowerCamelCase name and keep the long, descriptive name only for the binary asset.
Lines 1-3: Packaging and distribution considerations for large LFS cubin assets.
- Ensure LFS assets are not accidentally included in PyPI wheels or source distributions unless required (wheel bloat).
- Confirm runtime gating so this sm100a cubin is only selected on compatible GPUs.
- Document or record the toolchain and compiler flags used to produce this cubin for reproducibility.
If helpful, I can draft CMake/Bazel snippets to:
- Exclude LFS pointer .cpp files from compilation and
- Install .cubin assets alongside the library with proper runtime lookup.
Would you like me to propose those?
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_E4m3_E2m1E4m3_castE4m3_Fp32_sfBlk32_t128x16x512u2_s2_et128x16_m128x16x32_cga1x1x2_16dp256b_splitK2_TN_transOut_schedS_sm100a_cubin.cpp (2)
Lines 1-3: The required NVIDIA copyright header is missing (C++ file). Per project policy, all .cpp files must start with an NVIDIA copyright header including the current year. Because this is an LFS-managed file, the actual content might have the header while the pointer does not. Please confirm the fetched .cpp content begins with the required header; if not, update the source that generated this asset.
Lines 1-3: File-naming and path-length risk; consider a stable short name plus a manifest mapping. The filename is extremely long, deviates from the camelCase convention for C++ files, and risks hitting path-length limits on Windows/macOS environments. Consider:
Storing the asset under a shorter, stable hashed name (e.g., cubins/<hash>.cpp or <hash>.cubin), and
- Maintaining a manifest (JSON/TOML) mapping kernel metadata → filename,
- Or add an explicit exception to naming rules for generated cubin assets.
This reduces path-length issues and improves portability while keeping metadata discoverable.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Fp16_E4m3E4m3_Fp32_t128x128x256u2_s2_et128x128_m128x128x32_cga1x1x1_16dp256b_TN_schedS_sm100a_cubin.cpp (1)
Lines 1-3: The filename violates the C++ naming guideline for source files. Guidelines specify C++ filenames use camelCase starting lowercase. This name is UpperCamel with underscores and a very long, config-encoded identifier. If this remains a source file, please rename it accordingly or justify/document an exception for generated/asset files. If it's an asset, prefer a .cubin extension with the architecture/config encoded as part of the filename (common for precompiled kernels), and keep code files minimal and convention-compliant.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Bfloat16_E2m1E4m3_castE4m3_Fp32_sfBlk32_t128x128x256u2_s2_et128x128_m128x128x32_cga1x1x1_16dp256b_TN_transOut_schedP_sm100a_cubin.cpp (1)
Line 2: Provenance and runtime gating for the sm100a cubin. The OID points to a prebuilt sm100a cubin. Ensure there is runtime gating so this asset is only loaded on compatible GPUs and a fallback exists for others. Consider adding a small adjacent metadata file (e.g., .json) capturing toolchain, commit, and build flags for reproducibility.
I can generate a JSON metadata template and loader snippet that validates SM compatibility before selecting the cubin. Want me to draft that?
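As a starting point, here is a minimal sketch of such a gate (not the repo's actual loader; the assumption that sm100a corresponds to compute capability 10.0 is taken from the filename suffix and should be verified):

```cpp
#include <cuda_runtime.h>

// Hedged sketch: gate selection of a prebuilt sm100a cubin on the device's
// compute capability before attempting to load it. Adjust the major/minor
// values to the project's real architecture-dispatch logic.
inline bool isSm100aCompatible(int deviceId)
{
    cudaDeviceProp prop{};
    if (cudaGetDeviceProperties(&prop, deviceId) != cudaSuccess)
    {
        return false; // Treat query failure as incompatible and fall back.
    }
    return prop.major == 10 && prop.minor == 0;
}
```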
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Bfloat16_E2m1E4m3_castE4m3_Fp32_sfBlk32_t128x128x256_s2_et128x128_m128x128x32_cga1x1x1_16dp256b_TN_transOut_schedP_sm100a_cubin.cpp (3)
Lines 1-3: The .cpp extension for LFS cubin payloads is risky; consider renaming the files or excluding them in the build. With a .cpp extension, any GLOB_RECURSE for *.cpp could try to compile these assets. Prefer to:
- Rename assets to .cubin (or place under a non-source suffix), and/or
- Ensure CMake explicitly excludes this directory from any *.cpp globs.
If renaming isn’t feasible due to downstream tooling, add an explicit exclusion in CMake and a README in this directory documenting that these are LFS-served binary assets, not C++ sources.
Lines 1-3: Missing NVIDIA copyright header (guideline); request an exemption for LFS pointers. Guidelines state all .cpp/.cu/.py files must include an NVIDIA copyright header. For LFS pointer files, adding headers would break the pointer format. Please:
- Document an explicit exemption for these LFS cubin assets in CONTRIBUTING or a README here, or
- Move them to a binary-friendly extension and keep source-only files under .cpp/.cu.
This avoids false positives in audits and keeps us compliant with policy intent.
Lines 1-3: Confirm LFS tracking and CI checkout behavior for cubin assets.
- .gitattributes contains the entry `*cubin.cpp filter=lfs diff=lfs merge=lfs -text` (line 8), which covers all *_cubin.cpp files under LFS.
- No CMakeLists.txt uses a GLOB or GLOB_RECURSE for *.cpp nor references the cubins folder, so these stub files won't be picked up by the build.
- Action required: ensure your CI pipelines actually fetch LFS objects before the build steps.
  - On GitHub Actions: use `- uses: actions/checkout@v3` with `lfs: true`.
  - On Jenkins/other systems: add a `git lfs pull` (or fetch) after checkout.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Fp16_E4m3E4m3_Fp32_t128x8x128_s4_et64x8_m64x8x32_cga1x1x1_16dp256b_TN_transOut_noShflA_dsFp8_schedS_sm100a_cubin.cpp (1)
Lines 1-3: The filename does not follow the C++ file-naming guideline. C++ filenames should use camel case with a lowercase first letter; the current autogenerated name violates this. If this is an auto-generated cubin asset, consider:
- Renaming to a data filename (e.g., .cubin) and exclude from compilation targets; or
- Introducing a generation step that maps kernel config to a compliant source filename and stores the config string inside the translation unit as a constant.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Bfloat16_E2m1E4m3_castE4m3_Fp32_sfBlk32_t128x64x256_s2_et128x64_m128x64x32_cga1x1x1_16dp256b_TN_transOut_schedP_sm100a_cubin.cpp (2)
Lines 1-3: Filename/style exceptions or alternative packaging. The filename violates the usual C++ filename guideline (lowerCamelCase). If these are treated as assets rather than source, consider:
- Renaming to .cubin and loading at runtime (cudaLibraryLoadFromFile, or the driver API's cuModuleLoad); see the sketch below.
- Or keeping the naming as-is but excluding these files from clang-format/style checks and compilation units, and generating thin, well-named C++ wrappers that register the kernels.
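A minimal sketch of the runtime-loading option, assuming the asset ships as a standalone .cubin file; the path and kernel-name arguments are illustrative placeholders, not names from this PR:

```cpp
#include <cuda.h>

// Load a .cubin from disk with the CUDA driver API and fetch one kernel.
// cuModuleLoad fails (e.g. CUDA_ERROR_NO_BINARY_FOR_GPU) if the cubin does
// not match the device architecture, which doubles as an arch check.
CUfunction loadCubinKernel(char const* cubinPath, char const* kernelName)
{
    CUmodule module = nullptr;
    if (cuModuleLoad(&module, cubinPath) != CUDA_SUCCESS)
    {
        return nullptr;
    }
    CUfunction function = nullptr;
    if (cuModuleGetFunction(&function, module, kernelName) != CUDA_SUCCESS)
    {
        cuModuleUnload(module);
        return nullptr;
    }
    return function;
}
```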
Lines 1-3: Optional: a build-time embedding pattern to avoid committing huge C++ arrays. If you choose to store the actual .cubin in LFS, you can embed it at build time. Example CMake approach:
- Add to CMakeLists.txt:

```cmake
# Embed cubin into an object file and link it
add_custom_command(
    OUTPUT ${CMAKE_CURRENT_BINARY_DIR}/gemm_cubin.o
    COMMAND ${CMAKE_OBJCOPY} -I binary -O elf64-x86-64 -B i386:x86-64
            ${CMAKE_CURRENT_SOURCE_DIR}/cubins/Gemm_..._sm100a.cubin
            ${CMAKE_CURRENT_BINARY_DIR}/gemm_cubin.o
    DEPENDS ${CMAKE_CURRENT_SOURCE_DIR}/cubins/Gemm_..._sm100a.cubin
    COMMENT "Embedding GEMM cubin"
)
add_library(gemm_cubin_obj OBJECT ${CMAKE_CURRENT_BINARY_DIR}/gemm_cubin.o)
target_link_libraries(your_target PRIVATE gemm_cubin_obj)
```
- In C++ you can declare the symbols emitted by objcopy (start/end/size) and pass the bytes to the loader; see the sketch below.
This keeps the repo clean and satisfies header/style rules in actual C++ sources.
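For the wrapper side, a sketch of consuming the symbols objcopy emits; the exact `_binary_*` names are derived from the input path by objcopy, so the names below are examples, not the real ones:

```cpp
#include <cstddef>

// objcopy -I binary emits _binary_<mangled_input_path>_start/_end symbols,
// where the mangling replaces '/' and '.' with '_'. Names are illustrative.
extern "C" unsigned char const _binary_gemm_cubin_start[];
extern "C" unsigned char const _binary_gemm_cubin_end[];

// The embedded bytes can then be handed to cuModuleLoadData instead of
// loading a file from disk.
inline unsigned char const* gemmCubinData()
{
    return _binary_gemm_cubin_start;
}

inline std::size_t gemmCubinSize()
{
    return static_cast<std::size_t>(_binary_gemm_cubin_end - _binary_gemm_cubin_start);
}
```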
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_E4m3_E4m3E4m3_Fp32_t128x8x512u2_s3_et128x8_m128x8x32_cga1x1x2_16dp256b_splitK2_TN_transOut_schedS_sm100a_cubin.cpp (1)
Lines 1-3: OK to merge; follow existing naming despite style-guide exceptions. The filename deviates from the camelCase guideline, but these assets appear auto-generated and consistent with existing patterns in this subtree. If style linters run globally, add an exception for this directory.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Fp16_E4m3E4m3_Fp32_t128x16x128u2_s4_et64x16_m64x16x32_cga1x1x1_16dp256b_TN_transOut_noShflA_dsFp8_schedS_sm100a_cubin.cpp (1)
Lines 1-3: No functional code here; verify CI pulls LFS before any compile step. Some build environments skip LFS by default. Ensure CI jobs (and end-user docs) include “git lfs install && git lfs pull” before build or package steps to avoid missing sources.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Bfloat16_E2m1E4m3_castE4m3_Fp32_sfBlk32_t128x32x512u2_s2_et128x32_m128x32x32_cga1x1x2_16dp256b_splitK2_TN_transOut_schedS_sm100a_cubin.cpp (1)
Lines 1-3: Large asset pointer added; consider a manifest/consistency check. Given the volume of cubins, a manifest (JSON) mapping filename → oid/size can help detect drift and ensure runtime registration, plus a small CI check to compare against the directory contents.
If helpful, I can draft a simple Python tool and CI step to validate the manifest against this directory.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Bfloat16_E2m1E4m3_castE4m3_Fp32_sfBlk32_t128x8x512u2_s2_et128x8_m128x8x32_cga1x1x2_16dp256b_splitK2_TN_transOut_schedS_sm100a_cubin.cpp (1)
Lines 1-3: Confirm these .cpp LFS pointers are excluded from compilation and packaging. These are Git LFS pointer stubs to prebuilt cubins but use a .cpp extension. Many build systems glob .cpp under src/ and attempt to compile them, which would fail. Either:
- rename to a non-compilable extension (e.g., .cubin.lfs or .lfs), or
- explicitly exclude this cubins/ directory (or pattern) from add_library/add_executable globs/targets in CMake.
If renaming isn’t feasible, add a CMake exclusion to avoid accidental compilation and to prevent these files from being misinterpreted by tooling.
I can suggest the exact CMake exclusion lines once I see the build files that collect sources.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Bfloat16_E4m3E4m3_Fp32_t128x8x512u2_s3_et128x8_m128x8x32_cga1x1x2_16dp256b_splitK2_TN_transOut_schedS_sm100a_cubin.cpp (1)
Lines 1-3: Ensure LFS pointer files are not treated as C++ sources.
The file under cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_…_cubin.cpp is a Git LFS pointer, not valid C++ code. To prevent accidental compilation and lint failures, please:
• Rename these files with a non-.cpp extension (e.g. .cubin, .bin, or .lfs)
• Or update your CMakeLists.txt to explicitly exclude the entire cubins/ directory from any targets
• Update your header-lint configuration to skip cubins/ files
• Manually verify that these assets are tracked by LFS (git lfs ls-files) and that your .gitattributes includes patterns covering cpp/tensorrt_llm/.../cubins/
I couldn’t confirm LFS tracking or CMake exclusions in this environment—please verify those before merging.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Bfloat16_E4m3E4m3_Fp32_t128x64x128_s4_et64x64_m64x64x32_cga1x1x1_16dp256b_TN_transOut_noShflA_dsFp8_schedS_sm100a_cubin.cpp (1)
Lines 1-3: Confirm Git LFS pointer validity and upload status. Please verify that the corresponding LFS object (oid sha256:0c6578ab...) is uploaded to the remote LFS store; otherwise consumers will fetch broken pointers. Double-check via your CI or with git lfs push --all prior to merging. If not already present, consider a pre-merge CI step that runs git lfs fsck to fail fast on missing LFS objects.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_E4m3_E2m1E4m3_castE4m3_Fp32_sfBlk32_t128x32x512_s2_et128x32_m128x32x32_cga1x1x2_16dp256b_splitK2_TN_transOut_schedS_sm100a_cubin.cpp (1)
Lines 1-3: Asset provenance and a manifest suggestion. Prebuilt cubins are security- and reproducibility-sensitive. Add or update a manifest (JSON/TOML) mapping:
- filename → kernel config (types, tile shapes, scheduler), SM arch, build toolchain version, source commit, and hash.
This helps users and CI validate they’re loading the correct artifact and eases regeneration.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_E4m3_E4m3E4m3_Fp32_t128x128x128_s4_et64x128_m64x128x32_cga1x1x1_16dp256b_TN_transOut_noShflA_dsFp8_schedS_sm100a_cubin.cpp (1)
Lines 1-3: Consider relocating LFS pointers to a data-only extension or directory. Keeping these as .cpp creates avoidable coupling with C++ policies (header check, formatter) and risks accidental compilation. A data extension (.cubin) or a data-assets tree with clear ignore rules can reduce maintenance overhead.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_E4m3_E2m1E4m3_castE4m3_Fp32_sfBlk32_t128x128x256u2_s2_et128x128_m128x128x32_cga1x1x1_16dp256b_TN_transOut_schedP_sm100a_cubin.cpp (1)
Lines 1-3: CI hygiene: exclude cubins/ from format/lint hooks. clang-format and header checks should skip these pointer files. Ensure the clang-format exclude and header-lint configs ignore cpp/**/cubins/*.cpp to avoid noisy CI failures.
If helpful, I can draft a minimal .clang-format/.cmake-lint config snippet to exclude this path.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Fp32_E2m1E2m1_Fp32_t128x128x256u2_s3_et128x128_m128x128x64_cga1x1x1_16dp256b_TN_schedS_sm100a_cubin.cpp (1)
Lines 1-3: Size-growth awareness and distribution strategy. Multiple large LFS assets increase clone time and require LFS availability. Consider:
- Publishing these cubins as a versioned release artifact and downloading at build/test time, or
- Keeping only the minimum set required by default and providing an optional pack for extended configs.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/Enums.h (2)
Lines 77-86: BiasType comments appear swapped (M vs N). The comments say “M = bias per N” and “N = bias per M (row)”, which reads inverted relative to the names. Please correct them to avoid user confusion; a sketch of the corrected enum follows the suggested fix below.
Suggested fix in comments:
- M: “One bias value per row M of the output tensor.”
- N: “One bias value per column N of the output tensor.”
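Applied to the enum, the corrected comments would read roughly as follows. The enumerator set (None, M, N, Mn) is taken from this review's other comments; treat this as a sketch, not the file's exact contents:

```cpp
enum class BiasType
{
    // No bias.
    None,
    // One bias value per row M of the output tensor.
    M,
    // One bias value per column N of the output tensor.
    N,
    // One bias value per output element (M x N).
    Mn,
};
```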
Lines 25-35: Brace style and indentation drift from the Allman + 4-space rule. The current style places opening braces on the same line and uses mixed indentation. Please reformat per project guidelines (Allman, 4 spaces) using the repo's clang-format configuration.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/config.json (1)
Lines 417-424: Comment mismatch: it says “TileN 64” but the values are 128. The second Nvfp4x Throughput config has mmaN/tileN/epilogueTileN = 128, but the comment reads “TileN 64”. Fix the comment to prevent confusion.
Apply this diff:
- "_comment": "TileN 64", + "_comment": "TileN 128",cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/KernelRunner.h (2)
Line 2: Update the copyright year to 2025. The copyright header should include the year 2025 since this is a new change in 2025.
Line 35: Add documentation for the new dtypeMmaA field. The new public member dtypeMmaA should be documented to explain its purpose and usage, particularly its default value and its relationship to kernel-configuration selection. Add a comment above the field:

```diff
     bool transposeMmaOutput{false};
+    // MMA A-side dtype override for kernel config selection. Void means no restriction.
     gemm::trtllm::gen::Dtype dtypeMmaA{gemm::trtllm::gen::Dtype::Void};
```
tests/unittest/_torch/thop/test_fp4_gemm_quantize.py (2)
Lines 130-138: Consider enabling the larger test case. The test has a commented-out larger test case [1024, 1024, 1024]. Either remove it if it is not needed or enable it to improve test coverage.

```diff
     @parameterized.expand(
         list(
             [
-                # [1024, 1024, 1024],
+                [1024, 1024, 1024],
                 [128, 1, 256],
             ]
         ),
         name_func=unittest_name_func,
     )
```
Lines 148-151: Document the quantization parameters. The magic numbers used in quantization (sf_vec_size=32, scaling factor 200.0) should be documented or made into named constants for clarity. Add comments or constants:

```diff
+    # Scaling factor vector size for FP4 quantization
     sf_vec_size = 32
+    # Global scaling factor chosen to fit within FP8 range
+    FP8_SCALE_FACTOR = 200.0
     a_fp4, a_sf, rep_float = torch.ops.tensorrt_llm.float_to_e2m1_and_ufp8sf_scale(
-        a*a_global_sf, sf_vec_size, 1, True
+        a * a_global_sf, sf_vec_size, 1, True
     )
```
2-2
: Update copyright year to 2025The copyright header should include 2025 for new changes.
cpp/tensorrt_llm/thop/nvfp4xFp8GemmTrtllmGen.cpp (2)
78-79
: Improve error message for dimension mismatchThe error message could be clearer about the expected relationship between mat1 and mat2 dimensions.
- TORCH_CHECK(mat1.sizes()[1] == mat2.sizes()[1] * 2, "mat1 and mat2 shapes cannot be multiplied (", mat1.sizes()[0], - "x", mat1.sizes()[1], " and ", mat2.sizes()[0], "x", mat2.sizes()[1], ")"); + TORCH_CHECK(mat1.sizes()[1] == mat2.sizes()[1] * 2, + "mat1.sizes()[1] must be twice mat2.sizes()[1] for FP4 packing. Got mat1: (", mat1.sizes()[0], + "x", mat1.sizes()[1], ") and mat2: (", mat2.sizes()[0], "x", mat2.sizes()[1], ")");
85-90
: Consider making the default output dtype configurableThe default output dtype is hardcoded to FP16. Consider making this configurable or documenting why FP16 is the preferred default.
Add a comment explaining the default choice:
if (!out_dtype) { + // Default to FP16 for best compatibility and performance balance out_dtype = torch::kHalf; }
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/GemmInterface.h (2)
Lines 410-411: Remove unused-parameter warnings in non-PDL builds. The parameters usePdl and moduleCache are marked as potentially unused but should be properly handled based on the build configuration. Consider using the [[maybe_unused]] attribute instead of the cast-to-void pattern:

```diff
-    // Might be used.
-    (void) usePdl;
-    (void) moduleCache;
+    [[maybe_unused]] auto pdl = usePdl;
+    [[maybe_unused]] auto cache = moduleCache;
```
Lines 528-530: Improve the gating condition for launchKernel. The condition for usePdl is complex and could benefit from better readability. Consider extracting the complex condition into a named variable:

```diff
+    bool const requiresPdl = config.mOptions.mGridWaitForPrimaryEarlyExit
+        | config.mOptions.mGridWaitForPrimaryA | config.mOptions.mGridWaitForPrimaryB;
     auto result = trtllm::gen::launchKernel((void*)&kernelParams, cudaStream, config.mSharedMemSize, cuFunction,
         block3, grid3, cluster3,
-        usePdl && (config.mOptions.mGridWaitForPrimaryEarlyExit | config.mOptions.mGridWaitForPrimaryA
-            | config.mOptions.mGridWaitForPrimaryB));
+        usePdl && requiresPdl);
```
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/TmaDescriptor.h (1)
Lines 303-305: Missing closing-brace comment for the nested namespace. According to coding guidelines, closing braces of namespaces should have a comment.

```diff
 } // namespace gemm
 } // namespace gemm
+
```
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/KernelParamsDecl.h (1)
Lines 315-317: Missing closing-brace comment for the namespace. According to coding guidelines, closing braces of namespaces should have a comment.

```diff
 } // namespace gemm
 } // namespace gemm
+
```
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/KernelParams.h (1)
Lines 370-376: Missing closing-brace comment for the namespace.

```diff
 }; // namespace KernelParamsSetup

 ////////////////////////////////////////////////////////////////////////////////////////////////////

 } // namespace gemm
 } // namespace gemm
+
```
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/KernelTraits.h (2)
Lines 96-107: Consider using structured logging instead of printf. The print() method uses printf for debugging output; consider using a logging framework for better control. A stream-based approach:

```cpp
void print(std::ostream& os = std::cout) const
{
    for (size_t ii = 0; ii < mNumBytesAndAlignmentPerSmemChunk.size(); ++ii)
    {
        os << "Chunk " << ii << " " << mSmemChunkNames[ii] << ": "
           << mNumBytesAndAlignmentPerSmemChunk[ii].first << " bytes, "
           << mNumBytesAndAlignmentPerSmemChunk[ii].second << " alignment, reuse "
           << (mFirstChunkReuse[ii] ? "true" : "false") << ", offset " << getChunkOffset(ii) << "\n";
    }
}
```
Lines 660-662: Missing closing-brace comment for the namespace.

```diff
 } // namespace gemm
 } // namespace gemm
+
```
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Fp16_E2m1E4m3_castE4m3_Fp32_sfBlk32_t128x128x256u2_s2_et128x128_m128x128x32_cga1x1x1_16dp256b_TN_transOut_schedP_sm100a_cubin.cpp (1)
Lines 1-3: Offer: provide a build-time embedding pattern. If you opt to embed binaries into a valid .cpp, consider a generator step. Example outline:
- Keep the binary tracked via LFS as *.cubin.
- Add a custom command to transform .cubin to .cpp (using xxd -i), prepend the license header, and compile that generated .cpp.
- Exclude raw .cubin and generated files from clang-format/lint as needed.
I can draft an example CMake rule and a small C++ wrapper that exposes:
- extern const unsigned char k[];
- extern const size_t kSize;
to your existing loader code.
Review continued from previous batch...
Actionable comments posted: 9
🔭 Outside diff range comments (4)
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/KernelRunner.h (2)
Lines 55-58: Missing include for mPassingConfigIndices (the header should be self-contained). This header declares std::vector members but does not include <vector>, relying on transitive includes. Make the header self-contained.

```diff
 #include "trtllmGen_gemm_export/trtllm/gen/DtypeDecl.h"
 #include <optional>
+#include <vector>
```
Line 17: Follow the project guideline: use include guards instead of pragma once. Coding guidelines require header guards named TRTLLM_<FILENAME>_H. Replace pragma once with an include guard:

```diff
-#pragma once
+#ifndef TRTLLM_KERNELRUNNER_H
+#define TRTLLM_KERNELRUNNER_H
 ...
 } // namespace tensorrt_llm
+
+#endif // TRTLLM_KERNELRUNNER_H
```
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/TmaDescriptor.h (1)
Line 17: Follow the project guideline: use include guards instead of pragma once. Per coding guidelines, headers must use include guards. Replace pragma once with a TRTLLM_TMADESCRIPTOR_H guard.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/KernelTraits.h (1)
Line 17: Follow the project guideline: use include guards instead of pragma once. Please replace pragma once with a TRTLLM_KERNELTRAITS_H include guard per the coding guideline.
♻️ Duplicate comments (3)
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/TmaDescriptor.h (2)
Lines 153-170: Validate the tileShapes[0] range and positivity to avoid underflow/overflow in boxDim[0]. tileShapes[0] can be 0 or negative (int32_t), leading to boxDim[0]=0 or a large uint32_t after sign conversion. Also, we currently only clamp; we don't report out-of-range values. Validate tileShapes[0] > 0 and <= numEltsIn128B before use.

```diff
-    auto const numEltsInClampedFastestTileSize = std::min(numEltsIn128B, tileShapes[0]);
+    auto const numEltsInClampedFastestTileSize
+        = std::min(numEltsIn128B, static_cast<int32_t>(tileShapes[0]));
+    if (tileShapes[0] <= 0 || tileShapes[0] > numEltsIn128B)
+    {
+        std::cerr << "buildNdTmaDescriptor: tileShapes[0] " << tileShapes[0]
+                  << " is out of range (1.." << numEltsIn128B << ")" << std::endl;
+        assert(false);
+    }
@@
     for (size_t ii = 1; ii < tileShapes.size(); ++ii)
     {
-        if (tileShapes[ii] > 256)
+        if (tileShapes[ii] == 0 || tileShapes[ii] > 256)
         {
-            std::cerr << "buildNdTmaDescriptor: boxDim too large " << tileShapes[ii] << std::endl;
+            std::cerr << "buildNdTmaDescriptor: boxDim[" << ii << "] invalid: " << tileShapes[ii] << std::endl;
             assert(false);
         }
         else
         {
             boxDim[ii] = tileShapes[ii];
         }
     }
```
Lines 288-306: Type truncation in the diagnostic loops (uint32_t vs uint64_t). shapes and stridesInBytes are uint64_t vectors, but the range-for variables are uint32_t, risking truncation in logs. Use uint64_t (or auto/const refs).

```diff
     ss << "shape:";
-    for (uint32_t shape_i : shapes)
+    for (uint64_t shape_i : shapes)
     {
         ss << " " << shape_i;
     }
@@
     ss << "stridesInBytes:";
-    for (uint32_t stride_i : stridesInBytes)
+    for (uint64_t stride_i : stridesInBytes)
     {
         ss << " " << stride_i;
     }
```
306-311
: Duplicate “Miscellaneous parameters” section headerThis second “Miscellaneous parameters” header looks like an accidental copy and should be removed to avoid confusion.
Apply this diff:
- ////////////////////////////////////////////////////////////////////////////////////////////////// - // - // Miscellaneous parameters. - // - ////////////////////////////////////////////////////////////////////////////////////////////////// -
🧹 Nitpick comments (15)
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/KernelRunner.h (1)
31-36
: A/B dtype split is good; document Void semantics and ensure callers set both when neededThe eltTypeA/eltTypeB split fixes the prior limitation. Please add a short comment clarifying that eltTypeB=Void acts as a wildcard (no restriction) so future readers don’t assume it’s an error. Also ensure call sites that require B filtering explicitly set eltTypeB.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/TmaDescriptor.h (2)
91-112
: Swizzle selection: add explicit handling for 16B-aligned FP16/BF16 tiles or improve diagnosticThe 16B path is currently only allowed for UE8m0/E4m3. If other dtypes ever produce 16B tiles (e.g., small tiles with Fp16/Bfloat16), this asserts. Consider relaxing to NONE swizzle when multiple-of-16 for all dtypes that are TMA-legal, or at least print dtype in the error to aid triage.
- else if ((fastestDimTileSizeBytes % 16) == 0 && (dtype == tg::Dtype::UE8m0 || dtype == tg::Dtype::E4m3)) + else if ((fastestDimTileSizeBytes % 16) == 0 && (dtype == tg::Dtype::UE8m0 || dtype == tg::Dtype::E4m3)) { swizzleType = CU_TENSOR_MAP_SWIZZLE_NONE; } else { - std::cerr << "buildNdTmaDescriptor: unexpected fastestDimTileSizeBytes " << fastestDimTileSizeBytes - << std::endl; + std::cerr << "buildNdTmaDescriptor: unexpected fastestDimTileSizeBytes " << fastestDimTileSizeBytes + << " for dtype " << tg::dtypeToString(dtype) << std::endl; assert(false); }
184-221
: Improve error diagnostics: include CUDA error stringYou fetch errorString but don’t print it. Add it for faster triage.
- char const* errorString; - cuGetErrorString(result, &errorString); + char const* errorString = nullptr; + cuGetErrorString(result, &errorString); @@ - ss << "Error: Failed to initialize the TMA descriptor " << result << std::endl; + ss << "Error: Failed to initialize the TMA descriptor (" << result << "): " + << (errorString ? errorString : "unknown") << std::endl;cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/config.json (2)
251-279
: New GemmFp4xFp8 template: add basic constraints doc and defaultsNice addition. Consider documenting the interplay of dtypeA=e2m1 with dtypeMmaA=e4m3 and sf* fields (block size/layout expectations), and whether transposeMmaOutput must be true for all low-latency variants. This helps future edits avoid invalid combinations.
352-358
: Test coverage for GemmFp4xFp8 across tileN setConfigs now span tileN 8..128. Please add/extend unit tests to exercise at least a representative subset (e.g., N=8, 32, 128 with dtypeC in [bf16, fp16, e4m3]) to guard against layout/descriptor regressions in the new path.
I can draft a small pytest/torch harness that invokes the TRT-LLM op with these configs and compares against a reference matmul. Want me to open a follow-up PR with tests?
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/KernelTraits.h (4)
Lines 93-104: The reuse-flag accessor should return bool. getFirstChunkReuseFlagByName returns int while the flags are bool. Prefer bool for clarity (or document the reason if an int is required by downstream ABI).

```diff
-    int getFirstChunkReuseFlagByName(std::string const& name) const
+    bool getFirstChunkReuseFlagByName(std::string const& name) const
```
Lines 410-427: smemConstSfBuf 512-byte magic constant; consider a constexpr with a brief rationale. A hard-coded 512 may surprise future maintainers. Introduce a constexpr (e.g., kSMEM_CONST_SF_BUF_BYTES = 512) with a short comment on why 512 is sufficient (e.g., for 2x64 blocks x lanes); see the sketch below.
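For example (the name is a suggestion, and the stated rationale for 512 bytes is an assumption from this review to be verified against the kernel):

```cpp
// Shared-memory bytes reserved for the constant scale-factor buffer.
// Assumed sufficient for the SF tiles consumed per stage (e.g., 2x64 blocks
// x lanes); verify against the actual kernel layout before relying on this.
constexpr int kSmemConstSfBufBytes = 512;
```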
Lines 472-478: A in TMEM when casting or using slice-K: validate the denominator and comment the intent. Good: using TMEM when casting A or with slice-K. Add a brief comment explaining the division by (numSlicesForSliceK * 32/bitWidth), and assert numSlicesForSliceK >= 1 to avoid an accidental zero division if callers pass 0.
Lines 641-646: The return type of isSmemAbRepurposedToGmemC should align with the underlying flag. The helper currently returns int; consider bool for readability unless an int is required by the device ABI where you consume it.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/KernelRunner.cpp (2)
Lines 49-54: The config filter now respects eltTypeA/B; refine the error message and consider logging counts. The gating looks correct. Update the error text below to reflect that no kernel matches the A/B/output types (not just outputType). Optionally, log how many configs were filtered out for each criterion to aid debugging.

```diff
-    TLLM_CHECK_WITH_INFO(mPassingConfigIndices.size() != 0, "No kernel found for the given output type");
+    TLLM_CHECK_WITH_INFO(mPassingConfigIndices.size() != 0,
+        "No kernel found for the given A/B/output dtype combination and flags (transposeMmaOutput, deepSeekFp8)");
```
Lines 118-120: Passing usePDL: add an inline comment for readability. Add a parameter comment so future maintainers don't misread the boolean.

```diff
-    auto const err = gemm.run(config, workspace, gemmData, static_cast<void*>(stream), multiProcessorCount,
-        tensorrt_llm::common::getEnvEnablePDL(), globalTrtllmGenGemmModuleCache);
+    auto const err = gemm.run(config, workspace, gemmData, static_cast<void*>(stream), multiProcessorCount,
+        /*usePdl=*/tensorrt_llm::common::getEnvEnablePDL(), globalTrtllmGenGemmModuleCache);
```
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/GemmInterface.h (3)
Lines 516-518: Prefer logical OR over bitwise OR on booleans. Use || instead of | for boolean composition. The current expression compiles but is non-idiomatic and could mask intent. Apply this diff:

```diff
-        usePdl
-            && (config.mOptions.mGridWaitForPrimaryEarlyExit | config.mOptions.mGridWaitForPrimaryA
-                | config.mOptions.mGridWaitForPrimaryB));
+        usePdl
+            && (config.mOptions.mGridWaitForPrimaryEarlyExit || config.mOptions.mGridWaitForPrimaryA
+                || config.mOptions.mGridWaitForPrimaryB));
```
Lines 164-168: Type safety for mPtrScaleC: avoid void* and C-style casts. mPtrScaleC is documented as a single float on device. Define it as float const* in GemmData::InputBuffers and drop the cast at the call site. Apply this diff here:

```diff
-    void* mPtrScaleC{nullptr};
+    float const* mPtrScaleC{nullptr};
```

And this diff at the call site:

```diff
-        (float*) data.mInputBuffers.mPtrScaleC, dSplitKSlices, data.mAllReduceBuffers.mPtrTileBars,
+        const_cast<float*>(data.mInputBuffers.mPtrScaleC), dSplitKSlices, data.mAllReduceBuffers.mPtrTileBars,
```

Rationale: this aligns types across the API surface and removes unsafe casts.
Also applies to: 442-447
Lines 412-415: (Nit) Remove the unused-variable casts where the parameters are actually used under TLLM_GEN_EXPORT_INTERFACE. (void) usePdl; and (void) moduleCache; are immediately used under #ifdef TLLM_GEN_EXPORT_INTERFACE. Consider moving these casts inside the #else branch to silence warnings only when the symbols are unused. Example:

```diff
-    (void) usePdl;
-    (void) moduleCache;
+#ifndef TLLM_GEN_EXPORT_INTERFACE
+    (void) usePdl;
+    (void) moduleCache;
+#endif
```
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/KernelParams.h (1)
25-28: "Dependency free" note is misleading in this file
This header includes TmaDescriptor.h, which pulls in CUDA-dependent types. The "keep this code dependency free... NVRTC" note applies to KernelParamsDecl.h, not this file. Consider removing or rewording to avoid confusion.
Suggested wording: "KernelParamsDecl.h is kept dependency-free for NVRTC inclusion. This header builds descriptors and requires host-side CUDA headers."
Actionable comments posted: 2
🧹 Nitpick comments (4)
tensorrt_llm/quantization/mode.py (2)
238-255: from_description: expose the new flag in error messages.
Add use_w4a8_nvfp4_fp8 to the raise_error f-string to simplify debugging unsupported combinations.
Apply this patch within the f-string:
f"{use_nvfp4=}, " + f"{use_w4a8_nvfp4_fp8=}, " f"{use_w4a8_qserve=}, "
433-466: Expose the new mode in to_dict for downstream configs.
to_dict currently lacks an 'enable_w4a8_nvfp4_fp8' key. Several config/dumping flows rely on this structure; add it to maintain feature parity with other modes.
You can update to_dict like this (outside the changed hunk):
def to_dict(self):
    return {
        # ...
        'enable_nvfp4': self.has_nvfp4(),
        'enable_w4a8_nvfp4_fp8': self.has_w4a8_nvfp4_fp8(),  # add this
        'enable_w4a8_mxfp4_fp8': self.has_w4a8_mxfp4_fp8(),
        'enable_w4a8_mxfp4_mxfp8': self.has_w4a8_mxfp4_mxfp8(),
        'enable_w4a16_mxfp4': self.has_w4a16_mxfp4(),
        # ...
    }
916-918: Line too long (E501): wrap comment onto a separate line.
Minor style fix flagged by Ruff.
Apply:
- assert ws.dtype == torch.float8_e4m3fn  # TODO: or e8m0 for mxfp4 recipe?
+ # TODO: or e8m0 for mxfp4 recipe?
+ assert ws.dtype == torch.float8_e4m3fn
1-3: Missing NVIDIA SPDX license header at file top.
Other files include the SPDX header. Please prepend the standard NVIDIA header to this source file.
For consistency, add:
# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
tensorrt_llm/_torch/modules/linear.py (2 hunks)
tensorrt_llm/quantization/mode.py (7 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/quantization/mode.py
tensorrt_llm/_torch/modules/linear.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/quantization/mode.py
tensorrt_llm/_torch/modules/linear.py
🧬 Code Graph Analysis (1)
tensorrt_llm/_torch/modules/linear.py (8)
  tensorrt_llm/_utils.py (2)
    dtype (935-936)
    dtype (943-953)
  tensorrt_llm/runtime/generation.py (2)
    dtype (818-819)
    dtype (1221-1222)
  tensorrt_llm/python_plugin.py (2)
    dtype (258-259)
    dtype (262-274)
  tensorrt_llm/quantization/utils/fp4_utils.py (3)
    pad_up (22-23)
    shuffle_matrix_a (149-158)
    shuffle_matrix_sf_a (183-206)
  tensorrt_llm/module.py (1)
    register_parameter (186-190)
  tensorrt_llm/_torch/distributed/communicator.py (2)
    tp_size (46-47)
    tp_rank (54-55)
  tensorrt_llm/mapping.py (1)
    tp_rank (338-339)
  tensorrt_llm/quantization/mode.py (1)
    has_w4a8_nvfp4_fp8 (185-186)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/linear.py
876-876: Undefined name model (F821)
897-897: Line too long (132 > 120) (E501)
🔇 Additional comments (9)
tensorrt_llm/quantization/mode.py (6)
94-96: New QuantMode flag added: verify bit ordering stays in sync with C++.
Adding W4A8_NVFP4_FP8 between existing flags shifts subsequent bit positions if C++ relies on ordinal order. Confirm the C++ IntFlag definitions are updated identically.
Same script as above should help spot any mismatch in the cpp/tensorrt_llm/common/quantization.h.
185-187: New predicate is correct and consistent.
QuantMode.has_w4a8_nvfp4_fp8() is implemented consistently with other predicates.
204-216: Include W4A8_NVFP4_FP8 in has_any_quant: LGTM.
This keeps feature detection coherent for the new mode.
324-326: Flag-setting for the new mode is straightforward.
Setting QuantMode.W4A8_NVFP4_FP8 when requested is correct.
413-415: from_quant_algo routing: LGTM.
QuantAlgo.W4A8_NVFP4_FP8 is mapped to from_description(use_w4a8_nvfp4_fp8=True) as expected.
43-45: Ensure C++ QuantAlgo enum is synchronized
I wasn't able to find an enum class QuantAlgo in the C++ headers under:
- cpp/include/tensorrt_llm/common/quantization.h
- cpp/tensorrt_llm/kernels/quantization.h
Please manually verify that:
- The C++ QuantAlgo enum includes the new members in the same order:
  • W4A8_NVFP4_FP8
  • W4A8_MXFP4_FP8
  • W4A8_MXFP4_MXFP8
- Any serialization/deserialization logic (e.g., switch-cases, mappings) is updated to align integer values across Python and C++.
tensorrt_llm/_torch/modules/linear.py (3)
823-863: New W4A8NVFP4FP8LinearMethod scaffolding reads clean.
The state carried (epilogue_tile_m, scaling_vector_size, weight, weight_scale, input_scale, weight_scale_2, alpha, bias) matches the intended FP4xFP8 path. Good use of pad_up for scale sizing.
932-955: Weight/scale shuffling and parameter copies look consistent.
Shuffle paths via shuffle_matrix_a and shuffle_matrix_sf_a and subsequent copies are aligned with the kernel expectations (epilogue_tile_m and scaling_vector_size). The fused paths also perform the required concatenation before shuffling.
1668-1669: Routing to the new quant method added correctly.
get_quant_method now dispatches to W4A8NVFP4FP8LinearMethod when the flag is set.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/quantization/mode.py (1)
94-101: Critical: Sync and Preserve Bit Positions for W4A8_NVFP4_FP8
tensorrt_llm/quantization/mode.py
• Move
  # W4A8 NVFP4
  W4A8_NVFP4_FP8 = auto()
so it sits immediately before
  # The smallest power-of-two that is not used by a flag.
  COUNT = auto()
and do not call auto() after that line. This ensures it's assigned the next free bit instead of shifting the existing flags.
cpp/include/tensorrt_llm/common/quantization.h
• Insert between nvfp4() (1u << 12) and w4a8Mxfp4Fp8() (1u << 14):
  // W4A8 NVFP4
  static constexpr QuantMode w4a8Nvfp4Fp8() noexcept
  {
      return QuantMode(BaseType(1u) << 13);
  }
  constexpr bool hasW4a8Nvfp4Fp8() const noexcept
  {
      return isSet(w4a8Nvfp4Fp8());
  }
• Update all overloads of fromDescription (and any helpers) to accept a useW4a8Nvfp4Fp8 bool and pass it into bit 13.
After these changes, bit assignments will be:
12 = nvfp4(), 13 = w4a8Nvfp4Fp8(), 14 = w4a8Mxfp4Fp8(), 15 = w4a8Mxfp4Mxfp8(), 16 = w4a16Mxfp4() — matching Python's IntFlag and preserving backward compatibility.
🧹 Nitpick comments (4)
tensorrt_llm/quantization/mode.py (1)
238-255: Add missing argument to raise_error() for better debuggability.
use_w4a8_nvfp4_fp8 is not listed in the error details, making invalid-combo debugging harder.
 def raise_error():
     raise ValueError(f"Unsupported combination of QuantMode args: "
@@
                      f"{use_fp8_rowwise=}, "
                      f"{use_nvfp4=}, "
+                     f"{use_w4a8_nvfp4_fp8=}, "
                      f"{use_w4a8_qserve=}, "
tensorrt_llm/_torch/modules/linear.py (3)
1667-1670: Prefer the specific W4A8_NVFP4_FP8 path over plain NVFP4 when both flags are set.
If a caller sets both NVFP4 and W4A8_NVFP4_FP8 bits (possible via from_description), the current ordering routes to NVFP4LinearMethod, which is less specific. Route to W4A8NVFP4FP8LinearMethod first.
- if quant_config.layer_quant_mode.has_nvfp4():
-     return NVFP4LinearMethod()
- if quant_config.layer_quant_mode.has_w4a8_nvfp4_fp8():
-     return W4A8NVFP4FP8LinearMethod()
+ if quant_config.layer_quant_mode.has_w4a8_nvfp4_fp8():
+     return W4A8NVFP4FP8LinearMethod()
+ if quant_config.layer_quant_mode.has_nvfp4():
+     return NVFP4LinearMethod()
Also applies to: 1661-1668
902-929: Long line exceeds style limit flagged by Ruff — wrap assertion.
Line length >120 (E501). Wrap the assertion message for readability and to satisfy linters.
- assert input_scale == w["input_scale"][...], "The input_scale should be same for all the weights"
+ assert input_scale == w["input_scale"][...], (
+     "The input_scale should be same for all the weights"
+ )
Repeat similarly for the weight_scale_2 assertion below.
1-13: Missing NVIDIA SPDX header at file top.
Per repo guidelines, prepend the SPDX and copyright header.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
Place these lines before any imports.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
tensorrt_llm/_torch/modules/linear.py (3 hunks)
tensorrt_llm/quantization/mode.py (8 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/_torch/modules/linear.py
tensorrt_llm/quantization/mode.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/_torch/modules/linear.py
tensorrt_llm/quantization/mode.py
🧬 Code Graph Analysis (2)
tensorrt_llm/_torch/modules/linear.py (2)
  tensorrt_llm/quantization/utils/fp4_utils.py (3)
    pad_up (22-23)
    shuffle_matrix_a (149-158)
    shuffle_matrix_sf_a (183-206)
  tensorrt_llm/quantization/mode.py (1)
    has_w4a8_nvfp4_fp8 (185-186)
tensorrt_llm/quantization/mode.py (2)
  tensorrt_llm/_torch/modules/linear.py (1)
    has_w4a8_nvfp4_fp8 (1825-1829)
  cpp/include/tensorrt_llm/common/quantization.h (2)
    QuantMode (28-440)
    QuantMode (34-35)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/linear.py
895-895: Line too long (132 > 120) (E501)
🔇 Additional comments (8)
tensorrt_llm/quantization/mode.py (5)
185-187: Predicate addition looks good.
has_w4a8_nvfp4_fp8 aligns with the new flag and mirrors existing helpers.
212-215: Include the new mode in has_any_quant — good.
Including W4A8_NVFP4_FP8 in has_any_quant keeps feature detection consistent.
324-326: Setting the new flag in from_description is fine.
No conflicts with existing constraints. Note that setting both use_nvfp4=True and use_w4a8_nvfp4_fp8=True will set both flags, which impacts get_quant_method routing (see comment in linear.py).
Would you ever call from_description with both use_nvfp4=True and use_w4a8_nvfp4_fp8=True? If yes, we should prefer the "more specific" path in get_quant_method (see proposed change there).
413-415: from_quant_algo mapping added — good.
The new QuantAlgo.W4A8_NVFP4_FP8 is recognized and routed via from_description(use_w4a8_nvfp4_fp8=True), avoiding accidental NVFP4 bit setting.
453-455: to_dict entry added — good.
This keeps surfaces (serialization/inspection) in sync.
tensorrt_llm/_torch/modules/linear.py (3)
823-887: New W4A8NVFP4FP8 path: overall structure is sound.
- Properly allocates NVFP4-packed weights and per-block scales.
- Tracks input_scale, weight_scale_2, alpha; dynamic path recomputes alpha.
Two follow-ups are addressed in comments below (weight_scale dtype handling and routing precedence in get_quant_method).
880-883: Kernel input expects FP8-view of weight_scale — OK at apply-time.
Viewing module.weight_scale as torch.float8_e4m3fn here makes sense if the underlying parameter stores the 4-bit scale payload (see load path below).
864-877: Alpha handling for dynamic quantization — double-check dtype and device.
alpha = module.weight_scale_2 * input_scale.to(torch.float32) assumes both are on the same device. If weight_scale_2 lives on CUDA and input_scale is produced on CPU in some paths, add a .to(module.weight_scale_2.device) to avoid a device mismatch.
Do you expect torch.ops.tensorrt_llm.quantize_e4m3_per_tensor(input) to return the scale on the same device as input (CUDA)? If yes, we're safe. Otherwise, we should align devices.
Actionable comments posted: 2
🧹 Nitpick comments (4)
tensorrt_llm/quantization/mode.py (2)
237-255: Include new argument in raise_error diagnostic
from_description added use_w4a8_nvfp4_fp8, but raise_error omits it from the diagnostics, making debugging harder.
Apply:
 def raise_error():
     raise ValueError(f"Unsupported combination of QuantMode args: "
                      f"{quantize_weights=}, "
                      f"{quantize_activations=}, "
                      f"{per_token=}, "
                      f"{per_channel=}, "
                      f"{per_group=}, "
                      f"{use_int4_weights=}, "
                      f"{use_int8_kv_cache=}, "
                      f"{use_fp8_kv_cache=}, "
                      f"{use_fp8_qdq=}, "
                      f"{use_fp8_block_scales=}, "
                      f"{use_fp8_rowwise=}, "
                      f"{use_nvfp4=}, "
+                     f"{use_w4a8_nvfp4_fp8=}, "
                      f"{use_w4a8_qserve=}, "
                      f"{use_w4a8_mxfp4_fp8=}, "
                      f"{use_w4a8_mxfp4_mxfp8=}, "
                      f"{use_w4a16_mxfp4=}")
413-415: from_quant_algo mapping is minimal; consider encoding W/A intent
Optional: for introspection tools that rely on INT4_WEIGHTS/ACTIVATIONS flags, consider mapping W4A8_NVFP4_FP8 to a mode that also sets quantize_weights=True, quantize_activations=True, use_int4_weights=True. Current routing works since get_quant_method checks the new flag directly, but auxiliary logic (e.g., is_weight_only()) will not reflect W4A8.
Example:
- elif quant_algo == QuantAlgo.W4A8_NVFP4_FP8:
-     quant_mode = QuantMode.from_description(use_w4a8_nvfp4_fp8=True)
+ elif quant_algo == QuantAlgo.W4A8_NVFP4_FP8:
+     quant_mode = QuantMode.from_description(
+         quantize_weights=True,
+         quantize_activations=True,
+         use_int4_weights=True,
+         use_w4a8_nvfp4_fp8=True,
+     )
tensorrt_llm/_torch/modules/linear.py (2)
825-863: Create-weights shape/dtype choices look sound; add a short docstring for scale semantics
Initialization matches kernel expectations (fp4_e2m1x2 weights, padded float4_sf for per-block scales, scalar FP32 scales). Consider adding a brief docstring on the expected meanings/units of input_scale, weight_scale, weight_scale_2, and alpha to avoid confusion with the NVFP4 path (which uses 448*6/amax conventions).
888-933: Tighten long lines and clarify invariants in load_weight_scales
- The divide-by-6.0 normalization is justified by the kernel’s nvfp4->e4m3 conversion; thanks for the inline comment.
- Two lines exceed 120 characters; reflow to satisfy Ruff E501 without changing behavior.
Apply:
- assert input_scale == w["input_scale"][...], "The input_scale should be same for all the weights"
+ assert input_scale == w["input_scale"][...], (
+     "The input_scale should be same for all the weights"
+ )
@@
- assert weight_scale_2 == w["weight_scale_2"][...] * 6.0, (
-     f"The weight_scale_2 should be same for all the weights: {weight_scale_2} vs. {w['weight_scale_2']}*6"
- )
+ assert weight_scale_2 == w["weight_scale_2"][...] * 6.0, (
+     "The weight_scale_2 should be same for all the weights: "
+     f"{weight_scale_2} vs. {w['weight_scale_2']}*6"
+ )
Additionally, if you observe minor numerical jitter across shards, consider switching to torch.allclose for these equality checks.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
tensorrt_llm/_torch/modules/linear.py (3 hunks)
tensorrt_llm/quantization/mode.py (8 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/quantization/mode.py
tensorrt_llm/_torch/modules/linear.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/quantization/mode.py
tensorrt_llm/_torch/modules/linear.py
🧬 Code Graph Analysis (2)
tensorrt_llm/quantization/mode.py (2)
  tensorrt_llm/_torch/modules/linear.py (1)
    has_w4a8_nvfp4_fp8 (1826-1829)
  cpp/include/tensorrt_llm/common/quantization.h (2)
    QuantMode (28-440)
    QuantMode (34-35)
tensorrt_llm/_torch/modules/linear.py (2)
  tensorrt_llm/quantization/utils/fp4_utils.py (3)
    pad_up (22-23)
    shuffle_matrix_a (149-158)
    shuffle_matrix_sf_a (183-206)
  tensorrt_llm/quantization/mode.py (1)
    has_w4a8_nvfp4_fp8 (185-186)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/linear.py
895-895: Line too long (132 > 120) (E501)
925-925: Line too long (126 > 120) (E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (11)
tensorrt_llm/quantization/mode.py (5)
43-43: New QuantAlgo W4A8_NVFP4_FP8 looks good
The enumeration addition is consistent with the rest of the QuantAlgo options.
185-187: Public accessor has_w4a8_nvfp4_fp8 is clear
Accessor naming and semantics align with existing helpers.
204-213: has_any_quant correctly includes the new flag
Including W4A8_NVFP4_FP8 in the has_any_quant mask is necessary for routing.
324-326: Bit wiring in from_description is correct
Setting the W4A8_NVFP4_FP8 flag independently is fine given that get_quant_method routes on this flag.
453-455: Expose enable_w4a8_nvfp4_fp8 in to_dict
Public serialization reflecting the new flag is good.
tensorrt_llm/_torch/modules/linear.py (6)
864-887: Apply path is consistent; dynamic alpha handling is correct
- Uses static FP8 quant when input_scale is provided; falls back to dynamic and recomputes alpha = weight_scale_2 * input_scale.
- Reinterprets weight_scale to FP8 only at call site, leaving storage as float4_sf_dtype.
934-956: Vanilla load path and swizzling are correct
Weight shuffle and scale shuffling align with epilogue_tile_m and scaling_vector_size; alpha/input_scale propagation is consistent.
957-982: Fused QKV: concatenation and scale shuffling LGTM
Correct order (q, k, v) and shuffling after concatenation before storing.
983-1006: Fused Gate-Up: weight shuffling and scale handling LGTM
Mirrors the QKV path appropriately.
1669-1671: Wire-up in get_quant_method is correct and ordered safely
The check comes after NVFP4 and before MXFP4 branches, matching exclusivity assumptions.
1825-1829: New Linear.has_w4a8_nvfp4_fp8 property is consistent
Matches other boolean helpers and is used by get_quant_method.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/attention_backend/trtllm.py (1)
1069-1082: Initialize all quant flags unconditionally to avoid AttributeError when quant_config is None.
has_fp8_block_wise, has_fp8_rowwise, and has_w4a8_nvfp4_fp8 are only set inside the quant_config block. If quant_config is None, accessing these in forward will raise. Initialize them to False alongside the other defaults.
 def update_quant_config(self, new_quant_config: Optional[QuantConfig]):
     self.quant_config = new_quant_config
     self.wrapper.update_quant_config(self.quant_config)
-    self.has_fp8_qdq = self.has_fp8_kv_cache = self.has_nvfp4 = False
+    self.has_fp8_qdq = self.has_fp8_kv_cache = self.has_nvfp4 = False
+    self.has_fp8_block_wise = False
+    self.has_fp8_rowwise = False
+    self.has_w4a8_nvfp4_fp8 = False
     if self.quant_config is not None:
         self.has_fp8_kv_cache = self.quant_config.layer_quant_mode.has_fp8_kv_cache()
         self.has_fp8_qdq = self.quant_config.layer_quant_mode.has_fp8_qdq()
         self.has_fp8_block_wise = self.quant_config.layer_quant_mode.has_fp8_block_scales()
         self.has_fp8_rowwise = self.quant_config.layer_quant_mode.has_fp8_rowwise()
         self.has_nvfp4 = self.quant_config.layer_quant_mode.has_nvfp4()
         self.has_w4a8_nvfp4_fp8 = self.quant_config.layer_quant_mode.has_w4a8_nvfp4_fp8()
🧹 Nitpick comments (5)
tensorrt_llm/_torch/modules/attention.py (1)
1-1: Add NVIDIA copyright header.
Per repository guidelines, prepend the current-year NVIDIA copyright header to Python sources.
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
tensorrt_llm/_torch/attention_backend/trtllm.py (1)
1-1: Add NVIDIA copyright header.
Per repository guidelines, prepend the current-year NVIDIA copyright header to Python sources.
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
tensorrt_llm/_torch/modules/linear.py (3)
1-1: Add NVIDIA copyright header.
Per repository guidelines, prepend the current-year NVIDIA copyright header to Python sources.
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
867-885: Dynamic-quant path computes alpha on-the-fly — OK; minor readability tweak.
The computation is correct. Consider splitting the long call for readability.
- fp8_input, input_scale = torch.ops.tensorrt_llm.quantize_e4m3_per_tensor(
-     input)
+ fp8_input, input_scale = (
+     torch.ops.tensorrt_llm.quantize_e4m3_per_tensor(input)
+ )
898-898: Wrap overlong lines to satisfy linters (Ruff E501).
Two lines exceed 120 chars. Wrap them to avoid CI noise.
- ws = (ws.to(torch.float32) / 6.0).to(torch.float8_e4m3fn)
+ ws = (
+     ws.to(torch.float32).div(6.0).to(torch.float8_e4m3fn)
+ )
- assert weight_scale_2 == w["weight_scale_2"][...] * 6.0, (
-     f"The weight_scale_2 should be same for all the weights: {weight_scale_2} vs. {w['weight_scale_2']}*6"
- )
+ expected = w["weight_scale_2"][...].mul(6.0)
+ assert weight_scale_2 == expected, (
+     "The weight_scale_2 should be same for all the weights"
+ )
Also applies to: 928-928
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
tensorrt_llm/_torch/attention_backend/trtllm.py (2 hunks)
tensorrt_llm/_torch/modules/attention.py (2 hunks)
tensorrt_llm/_torch/modules/linear.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/_torch/attention_backend/trtllm.py
tensorrt_llm/_torch/modules/attention.py
tensorrt_llm/_torch/modules/linear.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/_torch/attention_backend/trtllm.py
tensorrt_llm/_torch/modules/attention.py
tensorrt_llm/_torch/modules/linear.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/linear.py
898-898: Line too long (132 > 120) (E501)
928-928: Line too long (126 > 120) (E501)
🔇 Additional comments (5)
tensorrt_llm/_torch/modules/attention.py (1)
342-352: Correct out_scale logic for new quant case.
Adding has_fp8_rowwise and has_w4a8_nvfp4_fp8 to the has_quant_scale check is consistent with create_output. No issues here.
tensorrt_llm/_torch/attention_backend/trtllm.py (1)
1182-1191: Extend FP8 out-dtype condition — looks good.
Including has_w4a8_nvfp4_fp8 and has_fp8_rowwise in the FP8-output decision, still gated by has_fp8_kv_cache, matches the module-side gating.
tensorrt_llm/_torch/modules/linear.py (3)
823-890: New W4A8NVFP4FP8LinearMethod scaffold — OK.
Weight creation parameters, dtype choices, and basic apply path wiring look consistent with the new NVFP4 x FP8 kernel.
1673-1674: Quant method routing includes W4A8_NVFP4_FP8 — OK.
The factory correctly returns W4A8NVFP4FP8LinearMethod when enabled.
1829-1834: Public flag has_w4a8_nvfp4_fp8 — OK.
The property aligns with attention-side gating.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/modules/linear.py (1)
1-1: Add NVIDIA copyright header (2025) at file top.
Coding guidelines require the NVIDIA copyright header to be prepended to all source files. This file has none.
Apply:
+################################################################################
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+# Licensed under the NVIDIA Source Code License.
+################################################################################
 from __future__ import annotations
🧹 Nitpick comments (4)
tensorrt_llm/_torch/modules/linear.py (4)
823-866: New NVFP4×FP8 linear: shapes and packing look correct; consider safer defaults for scales.
- Weight packing (E2M1x2) and block-scale buffer (float4_sf_dtype) sizing using pad_up(..., 128/4) align with kernel expectations. Good.
- Minor: input_scale and weight_scale_2 are created with uninitialized values (torch.empty). If a caller mistakenly runs apply() before load_weights, alpha and static quantization could use garbage. Prefer initializing to 1.0 or setting them to None and gating usage.
Example:
- module.input_scale = Parameter(torch.empty([1], dtype=torch.float32),
+ module.input_scale = Parameter(torch.ones([1], dtype=torch.float32),
                                 requires_grad=False)
- module.weight_scale_2 = Parameter(torch.empty([1], dtype=torch.float32),
+ module.weight_scale_2 = Parameter(torch.ones([1], dtype=torch.float32),
                                    requires_grad=False)
- module.alpha = Parameter(torch.empty([1], dtype=torch.float32),
+ module.alpha = Parameter(torch.ones([1], dtype=torch.float32),
                           requires_grad=False)
937-960: Guard reciprocal to avoid division-by-zero for inv_input_scale.
If an exported checkpoint has zeros or denormals in input_scale, 1.0 / input_scale can produce inf. Clamp to eps.
Apply:
- copy_weight(module.inv_input_scale, 1.0 / input_scale)
+ eps = torch.finfo(torch.float32).eps
+ copy_weight(module.inv_input_scale, 1.0 / torch.clamp(input_scale, min=eps))
961-987: Fused QKV: same inv_input_scale robustness as vanilla; weight swizzle order LGTM.
- Swizzle and shuffle order (concat → shuffle_matrix_sf_a for scales, shuffle_matrix_a for weights) is consistent.
- Recommend the same clamp as above for inv_input_scale.
Apply:
- copy_weight(module.inv_input_scale, 1.0 / input_scale)
+ eps = torch.finfo(torch.float32).eps
+ copy_weight(module.inv_input_scale, 1.0 / torch.clamp(input_scale, min=eps))
988-1013: Fused Gate-Up: mirror the inv_input_scale clamp.
Keeps attention FP8 output scaling numerically safe.
Apply:
- copy_weight(module.inv_input_scale, 1.0 / input_scale)
+ eps = torch.finfo(torch.float32).eps
+ copy_weight(module.inv_input_scale, 1.0 / torch.clamp(input_scale, min=eps))
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
tensorrt_llm/_torch/attention_backend/trtllm.py (2 hunks)
tensorrt_llm/_torch/modules/attention.py (2 hunks)
tensorrt_llm/_torch/modules/linear.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- tensorrt_llm/_torch/modules/attention.py
- tensorrt_llm/_torch/attention_backend/trtllm.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/_torch/modules/linear.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/_torch/modules/linear.py
🧠 Learnings (1)
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
tensorrt_llm/_torch/modules/linear.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/linear.py
895-895: Line too long (132 > 120) (E501)
925-925: Line too long (126 > 120) (E501)
🔇 Additional comments (2)
tensorrt_llm/_torch/modules/linear.py (2)
1831-1836: Property accessor aligns with dispatch path.
The accessor matches the new dispatch branch and asserts weights are created before use. LGTM.
1675-1676: Here's a quick script to print out the underlying bit-mask (or enum) definitions for both flags. That will let us confirm whether they share any bits or are truly independent:
#!/bin/bash
# Find the mode definition file
mode_file=$(find tensorrt_llm/quantization -type f -name mode.py)
echo "--> $mode_file"
# Show NVFP4 definition
grep -n 'NVFP4[[:space:]]*=' "$mode_file"
# Show W4A8_NVFP4_FP8 definition
grep -n 'W4A8_NVFP4_FP8[[:space:]]*=' "$mode_file"
If the two constants map to distinct values, you're safe: the has_nvfp4() check cannot shadow has_w4a8_nvfp4_fp8(). Otherwise, we'll need to reorder or split them out.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/GemmInterface.h (2)
339-345: Integer truncation: std::accumulate init should be size_t
The init value 0 makes accumulate compute in int and return int, risking overflow on large workspaces.
- auto size = std::accumulate(workspaceSizes.begin(), workspaceSizes.end(), 0);
+ auto size = std::accumulate(workspaceSizes.begin(), workspaceSizes.end(), size_t{0});
371-377: Potential overflow in numEltsSplitK computation
numEltsSplitK is inferred as int; the product can exceed 32 bits. Promote to int64_t explicitly.
- auto numEltsSplitK = options.mNumSlicesForSplitK * numTilesM * numTilesN * options.mTileM * options.mTileN
-     * options.mNumSlicesForSliceK;
+ int64_t numEltsSplitK = static_cast<int64_t>(options.mNumSlicesForSplitK)
+     * static_cast<int64_t>(numTilesM) * static_cast<int64_t>(numTilesN)
+     * static_cast<int64_t>(options.mTileM) * static_cast<int64_t>(options.mTileN)
+     * static_cast<int64_t>(options.mNumSlicesForSliceK);
83-86: Accidental nested namespace gemm causes symbol qualification issues
There are two consecutive namespace gemm declarations, yielding gemm::gemm. This breaks qualified calls like gemm::toString and gemm::divUp from within this file and misplaces public symbols under gemm::gemm. Remove the extra namespace and one closing brace.
-namespace gemm
-{
-
-namespace gemm
-{
+namespace gemm
+{
@@
-} // namespace gemm
+// (removed extra closing; only one gemm namespace remains)
Optionally, simplify qualified calls inside the same namespace (see the divUpMul comment).
Also applies to: 1351-1352
♻️ Duplicate comments (10)
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/config.json (1)
296-306: Split-K variants retained despite earlier guidance
There's still a LowLatency FP8 block with numSlicesForSplitK=2 and clusterDimZ=2. A prior review asked to set both to 1 due to a known split-K issue, and you replied "Done." If the issue isn't fully resolved, consider removing the split-K=2 entry for now to avoid flaky perf/correctness.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/GemmInterface.h (2)
251-253: Parameter name mismatch in declaration vs definition
The declaration uses GemmData const& options, the definition GemmData const& data. Keep parameter names consistent per guidelines.
- int32_t run(GemmConfig const& config, void* workspace, GemmData const& options, void* cudaStream,
+ int32_t run(GemmConfig const& config, void* workspace, GemmData const& data, void* cudaStream,
      int32_t multiProcessorCount, bool usePdl = true,
      std::optional<std::reference_wrapper<ModuleCache>> moduleCache = std::nullopt) const;
442-449: Symbol resolution: KernelParamsSetup namespace doesn't match its declaration
This calls gemm::KernelParamsSetup::setKernelParams, but KernelParamsSetup is currently nested under gemm::gemm in KernelParams.h. Either flatten the namespaces in KernelParams.h/Decl.h or qualify the call as gemm::gemm::KernelParamsSetup here.
Minimal local fix:
- auto kernelParams = gemm::KernelParamsSetup::setKernelParams(options, data.mInputBuffers.mPtrA,
+ auto kernelParams = gemm::gemm::KernelParamsSetup::setKernelParams(options, data.mInputBuffers.mPtrA,
Preferred: remove the redundant inner namespace gemm in the KernelParams headers so callers remain gemm::KernelParamsSetup.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/TmaDescriptor.h (2)
154-170: Validate tileShapes[0] and avoid silent clamping
boxDim[0] is clamped to <= numEltsIn128B. Silent clamping can hide misconfigurations and produce unexpected perf. Validate that tileShapes[0] is in (0, numEltsIn128B], and fail fast if not.
- auto const numEltsInClampedFastestTileSize = std::min(numEltsIn128B, tileShapes[0]);
+ if (tileShapes.empty() || tileShapes[0] <= 0 || tileShapes[0] > numEltsIn128B) {
+     std::cerr << "buildNdTmaDescriptor: tileShapes[0] (" << (tileShapes.empty() ? -1 : tileShapes[0])
+               << ") must be in (0, " << numEltsIn128B << "]" << std::endl;
+     assert(false);
+ }
+ auto const numEltsInClampedFastestTileSize = tileShapes[0];
295-306: Type mismatch in diagnostic loops (shapes/strides use uint64_t, not uint32_t)
Both loops truncate 64-bit values into 32 bits. Use uint64_t (or const auto&).
- ss << "shape:";
- for (uint32_t shape_i : shapes)
+ ss << "shape:";
+ for (uint64_t shape_i : shapes)
  {
      ss << " " << shape_i;
  }
@@
- ss << "stridesInBytes:";
- for (uint32_t stride_i : stridesInBytes)
+ ss << "stridesInBytes:";
+ for (uint64_t stride_i : stridesInBytes)
  {
      ss << " " << stride_i;
  }
32-34: Duplicate namespace gemm nesting — breaks symbol lookup
This header declares namespace gemm inside namespace gemm, placing symbols under gemm::gemm. Call sites (e.g., GemmInterface) use gemm::. Remove the redundant inner namespace to flatten to a single gemm.
-namespace gemm
-{
-
-namespace gemm
-{
+namespace gemm
+{
@@
-} // namespace gemm
-
-} // namespace gemm
+} // namespace gemm
Also applies to: 338-343
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/KernelTraits.h (2)
80-91: Name-based chunk lookup: add metadata consistency checks; avoid throwing across ABI
Add asserts that the vectors are sized consistently; consider returning a sentinel (-1) instead of throwing in headers, to avoid exceptions crossing library boundaries.
 int32_t getChunkOffsetByName(std::string const& name) const
 {
+    assert(mNumBytesAndAlignmentPerSmemChunk.size() == mFirstChunkReuse.size());
+    assert(mSmemChunkNames.size() == mFirstChunkReuse.size());
     for (size_t ii = 0; ii < mSmemChunkNames.size(); ++ii)
     {
         if (mSmemChunkNames[ii] == name)
         {
             return getChunkOffset(ii);
         }
     }
-    throw std::runtime_error("Name not found: " + name);
+    throw std::runtime_error("SMEM chunk name not found: " + name);
 }
124-136: Guard against infinite recursion when the first chunk is marked reusable
If mFirstChunkReuse[0] is ever true, getChunkOffset(0) recurses. Add a guard.
- if (mFirstChunkReuse[ii])
+ if (mFirstChunkReuse[ii])
  {
-     // Reuse the offset of the 0th chunk.
-     return getChunkOffset(0);
+     // Reuse the offset of the 0th chunk.
+     if (ii == 0)
+     {
+         assert(false && "First chunk cannot be marked reusable");
+         return 0;
+     }
+     return getChunkOffset(0);
  }
54-77: LOG macros unconditionally return false and cause unintended early returns in export builds
As previously noted, TLLM_LOG_INFO and TLLM_LOG_WARNING expand to TLLM_CHECK_WARNING(false, ...), which returns false from checkAndUpdateGemmOptions on informative messages. This breaks flows where you intend to warn and auto-correct (e.g., fixing MmaK/MmaM, setting dtypeC, epilogue tiles). Rework the LOG macros to only print; reserve the CHECK macros for control flow.
Apply:
 #define TLLM_CHECK_ERROR(cond, ...) \
     if (!(cond)) \
     { \
         printArgs(__VA_ARGS__); \
         printArgs("\n"); \
         return false; \
     }

-#define TLLM_LOG_ERROR(...) TLLM_CHECK_ERROR(false, __VA_ARGS__)
-
 #define TLLM_CHECK_ERROR_FMT(cond, ...) TLLM_CHECK_ERROR(cond, __VA_ARGS__)

-#define TLLM_CHECK_WARNING(cond, ...) \
-    if (!(cond)) \
-    { \
-        printArgs(__VA_ARGS__); \
-        printArgs("\n"); \
-        return false; \
-    }
-
-#define TLLM_LOG_WARNING(...) TLLM_CHECK_WARNING(false, __VA_ARGS__)
-
-#define TLLM_LOG_INFO(...) TLLM_CHECK_WARNING(false, __VA_ARGS__)
+#define TLLM_CHECK_WARNING(cond, ...) \
+    if (!(cond)) \
+    { \
+        printArgs(__VA_ARGS__); \
+        printArgs("\n"); \
+        /* no early return on warnings */ \
+    }
+
+#define TLLM_LOG_ERROR(...) do { printArgs(__VA_ARGS__); printArgs("\n"); } while (0)
+#define TLLM_LOG_WARNING(...) do { printArgs(__VA_ARGS__); printArgs("\n"); } while (0)
+#define TLLM_LOG_INFO(...) do { printArgs(__VA_ARGS__); printArgs("\n"); } while (0)
1279-1284: Bug: condition checks mLayoutB twice; should check A and B
The intent is to gate features when not both matrices are K-major. The current code repeats mLayoutB.
- if (options.mLayoutB != MatrixLayout::MajorK || options.mLayoutB != MatrixLayout::MajorK)
+ if (options.mLayoutA != MatrixLayout::MajorK || options.mLayoutB != MatrixLayout::MajorK)
  {
      TLLM_CHECK_ERROR(isBlackwell, "Non K-major layouts are only supported on Blackwell");
      TLLM_CHECK_ERROR(options.mSplitK == SplitK::None, "Non K-major layouts do not support split K");
  }
🧹 Nitpick comments (23)
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Bfloat16_E2m1E4m3_castE4m3_Fp32_sfBlk32_t128x8x512_s2_et128x8_m128x8x32_cga1x1x1_16dp256b_TN_transOut_schedS_sm100a_cubin.cpp (2)
1-3: Fail fast in configure/build if LFS pointers are not hydrated.
To avoid confusing compiler errors when the working tree still contains LFS pointers, add a CMake-time guard that detects and errors out if any cubin .cpp is a pointer.
Example CMake addition (adjust the glob as needed):
+## Ensure Git LFS is installed and cubin sources are hydrated
+message(STATUS "Checking Git LFS and cubin sources...")
+find_program(GIT_EXE git REQUIRED)
+execute_process(COMMAND "${GIT_EXE}" lfs version
+    RESULT_VARIABLE LFS_RC
+    OUTPUT_QUIET ERROR_QUIET)
+if(NOT LFS_RC EQUAL 0)
+    message(FATAL_ERROR "Git LFS is required to build cubin-wrapped sources. Please install Git LFS.")
+endif()
+
+# Detect unhydrated LFS pointers among cubin .cpp sources
+file(GLOB_RECURSE TRTLLMGEN_CUBIN_SRCS
+    "${CMAKE_SOURCE_DIR}/cpp/tensorrt_llm/kernels/trtllmGenKernels/**/cubins/*.cpp")
+foreach(_src IN LISTS TRTLLMGEN_CUBIN_SRCS)
+    file(READ "${_src}" _content LIMIT 100)
+    if("${_content}" MATCHES "https://git-lfs.github.com/spec/v1")
+        message(FATAL_ERROR
+            "Unhydrated Git LFS pointer detected: ${_src}\n"
+            "Run: git lfs pull && git lfs checkout")
+    endif()
+endforeach()
1-3: Ensure .gitattributes tracks these cubin wrapper .cpp files via LFS.
If not already present, add/verify a dedicated LFS rule so these large generated .cpp wrappers don't bloat the Git object store and are consistently smudged.
Proposed .gitattributes addition:
+# TensorRT-LLM generated cubin wrapper sources (store via Git LFS)
+cpp/tensorrt_llm/kernels/trtllmGenKernels/**/cubins/*.cpp filter=lfs diff=lfs merge=lfs -text
Then run:
- git lfs track "cpp/tensorrt_llm/kernels/trtllmGenKernels/**/cubins/*.cpp"
- git add .gitattributes && git add -u
- git commit -m "Track cubin wrapper .cpp via Git LFS"
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Bfloat16_E2m1E4m3_castE4m3_Fp32_sfBlk32_t128x32x512_s2_et128x32_m128x32x32_cga1x1x1_16dp256b_TN_transOut_schedS_sm100a_cubin.cpp (2)
1-3: Filename style vs. guideline — acceptable exception for generated artifacts?
The repo guideline prefers lowerCamelCase filenames for C++, while this generated cubin wrapper uses a long, underscore-and-capitalized schema. If this mirrors existing generated cubins, consider documenting an explicit exception in a README within the cubins directory.
If this is a one-off deviation, consider aligning with the established naming or adding a short rationale comment at the top of the generated file.
1-3: Compilation footprint considerations (optional).
Large embedded arrays (~475 KB here; many such files added) can increase compile time and object size. If build times regress, consider:
- Packaging as external assets loaded at runtime for relevant architectures.
- Grouping multiple cubins into a single translation unit with linkonce symbols to reduce overhead.
- Compressing byte arrays with a lightweight runtime decompressor (zstd) if startup cost is acceptable.
No action required if current CI times remain acceptable.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/config.json (3)
252-279: New template GemmFp4xFp8: surface looks coherent; check A/B SF semantics and defaults
- dtypeA=e2m1 with dtypeMmaA=e4m3 and sfBlockSizeA=32 plus sfLayoutA="128x4" matches the R128c4 path implemented in KernelParamsSetup::makeTmaShapeStrideSfAb.
- dtypeB=e4m3 (non-block) implies no tmaSfB — that’s consistent.
Two small asks:
- Please confirm that the options default for mSfLayoutB is ignored on this template (since B is non-block), to avoid confusion.
- Consider adding a brief _comment field documenting expected constraints (e.g., K % 64, M % 128 for A’s R128c4).
352-358: Config sweep for GemmFp4xFp8: set is sensible; add a guardrail for tileK on tileN=128
The (tileN, tileK) grid is reasonable, and lowering tileK to 256 at tileN=128 is prudent. Please also cap epilogueTileN to <= tileN in the validator to prevent misconfiguration if templates evolve.
If helpful, I can add a minimal validator rule in the generator to enforce epilogueTileN <= tileN and tileK ∈ {256,512}.
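A sketch of such a rule, reusing the TLLM_CHECK_ERROR macro from GemmOptions.h (member names mirror the config keys and are assumptions; placement inside checkAndUpdateGemmOptions is also an assumption):
// Reject misconfigured templates early instead of failing inside the kernel.
TLLM_CHECK_ERROR(options.mEpilogueTileN <= options.mTileN, "epilogueTileN (", options.mEpilogueTileN,
    ") must not exceed tileN (", options.mTileN, ")");
TLLM_CHECK_ERROR(options.mTileK == 256 || options.mTileK == 512, "tileK must be 256 or 512 for GemmFp4xFp8");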
349-351: MxE2m1×MxE4m3: dtypeC list includes fp32/mxe4m3 — confirm store paths
Including "fp32" and "mxe4m3" for C implies the TMA store must handle these, with mPtrSfC populated for mxe4m3 and two-shot AR promotion for fp32 when needed. Please confirm both paths are exercised in tests.
I can scaffold a small pytest to exercise C={fp16,bf16,fp32,mxe4m3}.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/GemmInterface.h (2)
516-519: Use logical OR for boolean gating
Using bitwise OR on bools works but is non-idiomatic and evaluates all operands. Prefer short-circuit logical OR.
- usePdl
-     && (config.mOptions.mGridWaitForPrimaryEarlyExit | config.mOptions.mGridWaitForPrimaryA
-         | config.mOptions.mGridWaitForPrimaryB));
+ usePdl
+     && (config.mOptions.mGridWaitForPrimaryEarlyExit || config.mOptions.mGridWaitForPrimaryA
+         || config.mOptions.mGridWaitForPrimaryB));
468-496: Module cache key might collide across different cubins with identical function names
The key is context + function name. If two different cubins expose the same function name in the same context, we could fetch a mismatched CUfunction. Include a fingerprint of config.mData (e.g., address bytes) to disambiguate.
- auto const moduleKey = ctxName + funcName;
+ std::string moduleKey = ctxName + funcName;
+ moduleKey.append(reinterpret_cast<char const*>(&config.mData), sizeof(config.mData));
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/TmaDescriptor.h (3)
21-27: Missing include for std::stringstream
std::stringstream is used below; include it explicitly to avoid relying on transitive includes.
 #include <iostream>
+#include <sstream>
188-221: Error path: include CUDA error string in diagnostics
You fetch errorString but don't print it for buildNdTmaDescriptor; include it for readability (you already do this for the SF descriptor).
- ss << "Error: Failed to initialize the TMA descriptor " << result << std::endl;
+ ss << "Error: Failed to initialize the TMA descriptor (" << result << "): " << errorString << std::endl;
309-325: Minor: const auto& in loops; unify case label diagnostics
Non-blocking: prefer const auto& for loops over tileShapes/tileStrides; also consider logging both decimal and hex for swizzleType.
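A sketch of both suggestions (the stream name ss follows the surrounding diagnostics code; the hex formatting is illustrative):
// Iterate by const reference to avoid copies and accidental narrowing.
for (auto const& tileShape : tileShapes)
{
    ss << " " << tileShape;
}
// Log the swizzle type in both decimal and hex for easier cross-referencing.
ss << "swizzleType: " << static_cast<int>(swizzleType) << " (0x" << std::hex
   << static_cast<int>(swizzleType) << std::dec << ")" << std::endl;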
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/KernelParams.h (2)
216-334: KernelParamsSetup: overall API is sound; minor consistency nits
- Nice: explicit passing of ptrBias and per-token SFs, and auto-selecting dtypeC for two-shot AR.
- Nit: consider constexpr for MatrixType alias, and marking helpers static inline to avoid ODR issues in headers compiled by multiple TUs. Not critical.
25-28: "Dependency free and NVRTC-compilable" note vs exceptions
This header is included by device code per the comment. Throwing std::runtime_error in device-includable headers can be problematic for NVRTC. If these code paths never reach NVRTC (guarded by TLLM_ENABLE_CUDA or host-only), clarify with comments or guard the throws with host-side macros.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/KernelTraits.h (1)
566-571: Avoid copying KernelTraits by value in small helpers
Passing KernelTraits by value to getters copies the allocator helpers; switch to const&.
-inline int32_t getSmemOffsetLoadA(KernelTraits traits)
+inline int32_t getSmemOffsetLoadA(KernelTraits const& traits)
@@
-inline int32_t getSmemOffsetLoadB(KernelTraits traits)
+inline int32_t getSmemOffsetLoadB(KernelTraits const& traits)
@@
-inline int32_t getSmemOffsetLoadShuffleB(KernelTraits traits)
+inline int32_t getSmemOffsetLoadShuffleB(KernelTraits const& traits)
@@
-inline int32_t getSmemOffsetGmemC(KernelTraits traits, int resIdx = 0)
+inline int32_t getSmemOffsetGmemC(KernelTraits const& traits, int resIdx = 0)
@@
-inline int32_t getSmemOffsetRowMax(KernelTraits traits)
+inline int32_t getSmemOffsetRowMax(KernelTraits const& traits)
@@
-inline int32_t getSmemOffsetSliceK(KernelTraits traits)
+inline int32_t getSmemOffsetSliceK(KernelTraits const& traits)
@@
-inline int32_t getSmemOffsetPerTokenSf(KernelTraits traits)
+inline int32_t getSmemOffsetPerTokenSf(KernelTraits const& traits)
@@
-inline int32_t getSmemOffsetBias(KernelTraits traits)
+inline int32_t getSmemOffsetBias(KernelTraits const& traits)
@@
-inline int32_t getSmemOffsetBlockAmax(KernelTraits traits)
+inline int32_t getSmemOffsetBlockAmax(KernelTraits const& traits)
@@
-inline int32_t getSmemOffsetConstSfBuf(KernelTraits traits)
+inline int32_t getSmemOffsetConstSfBuf(KernelTraits const& traits)
@@
-inline int32_t getTmemOffsetD(KernelTraits traits)
+inline int32_t getTmemOffsetD(KernelTraits const& traits)
@@
-inline int32_t getTmemOffsetA(KernelTraits traits)
+inline int32_t getTmemOffsetA(KernelTraits const& traits)
@@
-inline int32_t getTmemOffsetSfA(KernelTraits traits)
+inline int32_t getTmemOffsetSfA(KernelTraits const& traits)
@@
-inline int32_t getTmemOffsetSfB(KernelTraits traits)
+inline int32_t getTmemOffsetSfB(KernelTraits const& traits)
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/GemmOptions.h (8)
529-533: Unnecessary self-qualification; may mask nested-namespace mistakes
Within namespace gemm, calling gemm::divUp is redundant and, with the accidental nested gemm, can break lookup. Keep it simple:
 inline T divUpMul(T a, T b)
 {
-    return gemm::divUp(a, b) * b;
+    return divUp(a, b) * b;
 }
590-597: Comment and code disagree on supported A-cast paths
The comment says only {MxFp4, NvFp4} -> Bf16, but the code also allows NvFp4 (E2m1) -> Fp8 (E4m3). Update the comment to match the behavior so downstream API users aren't confused.
- // Currently, we only support {MxFp4, NvFp4} -> Bf16.
+ // Currently supported A casts:
+ //   {MxFp4 (MxE2m1), NvFp4 (E2m1)} -> Bf16
+ //   NvFp4 (E2m1) -> Fp8 (E4m3)
774-781: Typos in comments ("Nvida", "expertiment")
Polish the user-facing comments.
- // SfBlockSizeA can also support 64 and 128, although they are not officially supported Nvida
+ // SfBlockSizeA can also support 64 and 128, although they are not officially supported NVIDIA
@@
- // TODO Leaving this as an option for now in case we want to expertiment with other block sizes
+ // TODO Leaving this as an option for now in case we want to experiment with other block sizes
17-17
: Add include guards per repository guideline (keep #pragma once if desired)Headers must use TRTLLM__H guards. Suggest adding guards around the file content.
 #pragma once
+#ifndef TRTLLM_GEMMOPTIONS_H
+#define TRTLLM_GEMMOPTIONS_H
@@
 } // namespace gemm
-
+#endif // TRTLLM_GEMMOPTIONS_H
Please ensure adding guards doesn’t conflict with existing include order. If other headers already use guards, this will be consistent across the codebase.
Also applies to: 1364-1365
494-503
: dumpOptions: std::nullopt formatting nit
The current output prints “std::nullopt, ” with an extra space before the comma on the next line. Consider consistent comma-spacing for machine-parsable dumps.
- ss << "mSfBlockSizeA=" - << "std::nullopt" - << ", " << std::endl; + ss << "mSfBlockSizeA=std::nullopt," << std::endl;
101-187
: Constructor explosion: consider a builder or struct-of-args to reduce call-site errors
The GemmOptions constructor accepts >40 parameters, many of them booleans. This is error-prone and hurts readability. Consider a fluent builder, or pass a small config struct for logically grouped fields (layouts, dtypes, SF options).
No immediate change required, but worth a follow-up refactor.
I can sketch a GemmOptionsBuilder that preserves defaults and emits compile-time errors for missing required fields.
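A minimal sketch of the idea, kept self-contained with a toy options struct; every name here is illustrative, not the real GemmOptions API, and the compile-time enforcement of required fields is omitted for brevity:

#include <cstdint>
#include <optional>

// Toy stand-in so the sketch compiles on its own; the real GemmOptions has
// far more fields and lives in GemmOptions.h.
struct GemmOptions
{
    int32_t mTileM{128};
    int32_t mTileN{8};
    int32_t mTileK{128};
    std::optional<int32_t> mSfBlockSizeA{};
    bool mTransposeMmaOutput{false};
};

// Hypothetical fluent builder: defaults stay intact, and each call site names
// the field it sets instead of threading 40+ positional arguments.
class GemmOptionsBuilder
{
public:
    GemmOptionsBuilder& tiles(int32_t m, int32_t n, int32_t k)
    {
        mOpts.mTileM = m;
        mOpts.mTileN = n;
        mOpts.mTileK = k;
        return *this;
    }

    GemmOptionsBuilder& sfBlockSizeA(int32_t blockSize)
    {
        mOpts.mSfBlockSizeA = blockSize;
        return *this;
    }

    GemmOptionsBuilder& transposeMmaOutput(bool enable)
    {
        mOpts.mTransposeMmaOutput = enable;
        return *this;
    }

    GemmOptions build() const
    {
        return mOpts;
    }

private:
    GemmOptions mOpts{};
};

// Usage: auto options = GemmOptionsBuilder().tiles(128, 8, 128).sfBlockSizeA(32).build();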
488-494
: Optional: consider always dumping mSfBlockSizeA (print nullopt explicitly)
You already handle nullopt; the formatting tweak above will make downstream parsing easier.
359-377
: GemmConfig: new mHash field
Looks fine. If you have a config dump elsewhere, consider including mHash for traceability.
Want me to add a dumpConfig(GemmConfig const&) helper mirroring dumpOptions?
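A minimal sketch of what such a helper could look like; the toy GemmOptions/GemmConfig stand-ins below only exist to keep the snippet compilable, and the real member layout may differ:

#include <cstdint>
#include <sstream>
#include <string>

// Toy stand-ins; the real types come from the gemm headers under review.
struct GemmOptions
{
    int32_t mTileM{128};
};

inline std::string dumpOptions(GemmOptions const& options)
{
    std::stringstream ss;
    ss << "mTileM=" << options.mTileM << "," << std::endl;
    return ss.str();
}

struct GemmConfig
{
    GemmOptions mOptions{};
    uint64_t mHash{0};
};

// The actual suggestion: mirror dumpOptions and prepend the hash for traceability.
inline std::string dumpConfig(GemmConfig const& config)
{
    std::stringstream ss;
    ss << "mHash=" << config.mHash << "," << std::endl;
    ss << dumpOptions(config.mOptions);
    return ss.str();
}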
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/KernelRunner.cpp (2)
112-114
: Check CUDA return code when querying SM count
An unchecked cudaDeviceGetAttribute can leave multiProcessorCount undefined on failure. Guard with TLLM_CUDA_CHECK and initialize the variable.
Apply:
-    int32_t multiProcessorCount;
-    cudaDeviceGetAttribute(&multiProcessorCount, cudaDevAttrMultiProcessorCount, device);
+    int32_t multiProcessorCount = 0;
+    TLLM_CUDA_CHECK(cudaDeviceGetAttribute(&multiProcessorCount, cudaDevAttrMultiProcessorCount, device));
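For reference, a generic sketch of the pattern such a check macro implements; the real TLLM_CUDA_CHECK lives in TensorRT-LLM's common headers and reports errors through its own logging/exception machinery, whereas this standalone version just aborts:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime_api.h>

// Illustrative only; not the actual TLLM_CUDA_CHECK implementation.
#define CUDA_CHECK_SKETCH(call)                                                \
    do                                                                         \
    {                                                                          \
        cudaError_t const status = (call);                                     \
        if (status != cudaSuccess)                                             \
        {                                                                      \
            std::fprintf(stderr, "CUDA error %s at %s:%d\n",                   \
                cudaGetErrorString(status), __FILE__, __LINE__);               \
            std::abort();                                                      \
        }                                                                      \
    } while (0)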
146-185
: Fix comparator to satisfy strict weak ordering; use std::abs to avoid overload pitfalls
- Returning true when all keys are equal violates strict weak ordering and can yield undefined behavior in std::sort. End with return false.
- Prefer std::abs and precompute distances for clarity; include the proper header if not already pulled transitively.
Apply:
-        [&configs, &gemmData](int32_t idx0, int32_t idx1)
+        [&configs, &gemmData](int32_t idx0, int32_t idx1)
         {
             auto const& optionsA = configs[idx0].mOptions;
             auto const& optionsB = configs[idx1].mOptions;
@@
-            if (optionsA.mTileN != optionsB.mTileN)
-            {
-                return abs(gemmData.mProblemDimensions.mN - optionsA.mTileN)
-                    < abs(gemmData.mProblemDimensions.mN - optionsB.mTileN);
-            }
+            if (optionsA.mTileN != optionsB.mTileN)
+            {
+                auto const n = gemmData.mProblemDimensions.mN;
+                auto const dA = std::abs(n - optionsA.mTileN);
+                auto const dB = std::abs(n - optionsB.mTileN);
+                return dA < dB;
+            }
@@
-            return true;
+            return false; // Equal preference; maintain strict weak ordering.
         });
Additionally, ensure the correct overload is visible:
// At top of file with other includes:
#include <cstdlib> // for std::abs on integral types
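If it helps to see the failure mode in isolation, here is a tiny self-contained repro of why ending the comparator with return true is dangerous (the bad comparator is left commented out):

#include <algorithm>
#include <vector>

int main()
{
    std::vector<int> v(64, 42); // All elements tie on every sort key.

    // Violates strict weak ordering: comp(a, b) and comp(b, a) are both true,
    // so std::sort may read out of bounds or loop forever; libstdc++'s
    // _GLIBCXX_DEBUG mode and MSVC's debug iterators assert on it.
    // std::sort(v.begin(), v.end(), [](int, int) { return true; });

    // Correct: ties compare false, preserving strict weak ordering.
    std::sort(v.begin(), v.end(), [](int a, int b) { return a < b; });
    return 0;
}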
♻️ Duplicate comments (2)
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/KernelRunner.cpp (2)
45-55
: Document B-type “Void as wildcard” filter; avoid copying options
Two small improvements:
- Clarify in code that mOptions.eltTypeB == Dtype::Void acts as a wildcard.
- Bind options by const reference to avoid an unnecessary copy.
Apply:
-    auto const options = configs[i].mOptions;
+    auto const& options = configs[i].mOptions;
@@
-    // When we include low-latency kernels we can set transposeMmaOutput via constructor
+    // When we include low-latency kernels we can set transposeMmaOutput via constructor.
+    // For B-type, Dtype::Void is treated as a wildcard (no restriction on options.mDtypeB).
     if (options.mDtypeA == mOptions.eltTypeA && options.mDtypeC == mOptions.outputType
         && options.mUseDeepSeekFp8 == mOptions.deepSeekFp8
         && options.mTransposeMmaOutput == mOptions.transposeMmaOutput
         && (mOptions.eltTypeB == gemm::trtllm::gen::Dtype::Void || options.mDtypeB == mOptions.eltTypeB))
118-120
: Good switch to env-gated PDL; minor readability tweak
Nice follow-up adopting getEnvEnablePDL() instead of a bare boolean. For readability at call sites, consider naming the parameter with an inline comment.
Apply:
-    auto const err = gemm.run(config, workspace, gemmData, static_cast<void*>(stream), multiProcessorCount,
-        tensorrt_llm::common::getEnvEnablePDL(), globalTrtllmGenGemmModuleCache);
+    bool const usePdl = tensorrt_llm::common::getEnvEnablePDL();
+    auto const err = gemm.run(
+        config, workspace, gemmData, static_cast<void*>(stream), multiProcessorCount, /*usePdl=*/usePdl,
+        globalTrtllmGenGemmModuleCache);
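For readers unfamiliar with the helper: getEnvEnablePDL() is declared in tensorrt_llm/common/envUtils.h. A generic sketch of the cached env-gate pattern such helpers typically follow; the variable name and the caching are assumptions here, not the actual implementation:

#include <cstdlib>
#include <string_view>

// Illustrative once-cached environment toggle; NOT the real getEnvEnablePDL.
inline bool envEnablePdlSketch()
{
    static bool const sEnabled = []
    {
        char const* value = std::getenv("TRTLLM_ENABLE_PDL"); // Variable name assumed.
        return value != nullptr && std::string_view(value) == "1";
    }();
    return sEnabled;
}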
🧹 Nitpick comments (2)
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/KernelRunner.cpp (2)
59-60
: Make the failure message reflect all applied filters
The mismatch can be due to multiple gates (A/B dtype, output type, transposeMmaOutput, deepSeekFp8). Tighten the wording.
Apply:
-    TLLM_CHECK_WITH_INFO(mPassingConfigIndices.size() != 0, "No kernel found for the given output type");
+    TLLM_CHECK_WITH_INFO(!mPassingConfigIndices.empty(),
+        "No GEMM kernel found matching eltTypeA/eltTypeB/outputType and transposeMmaOutput/deepSeekFp8 filters.");
151-176
: Align heuristic priority with the comments (latency vs throughput)
The comment says tileN closeness is for the low-latency (transposeMmaOutput) case, while tileM governs throughput (non-transpose). The comparator currently prioritizes tileN closeness unconditionally, which may be suboptimal in throughput mode. Consider branching the heuristic on mOptions.transposeMmaOutput.
Apply:
-    std::sort(sortedIndices.begin(), sortedIndices.end(),
-        [&configs, &gemmData](int32_t idx0, int32_t idx1)
+    std::sort(sortedIndices.begin(), sortedIndices.end(),
+        [this, &configs, &gemmData](int32_t idx0, int32_t idx1)
         {
             auto const& optionsA = configs[idx0].mOptions;
             auto const& optionsB = configs[idx1].mOptions;
-            // Choose the tileN that is closest to the problem N
-            // This is the batch size dimension for low latency (transposeMmaOutput) case;
-            if (optionsA.mTileN != optionsB.mTileN)
-            {
-                auto const n = gemmData.mProblemDimensions.mN;
-                auto const dA = std::abs(n - optionsA.mTileN);
-                auto const dB = std::abs(n - optionsB.mTileN);
-                return dA < dB;
-            }
+            // For low latency, prefer tileN close to problem N; for throughput, prioritize larger tileM.
+            if (mOptions.transposeMmaOutput)
+            {
+                if (optionsA.mTileN != optionsB.mTileN)
+                {
+                    auto const n = gemmData.mProblemDimensions.mN;
+                    auto const dA = std::abs(n - optionsA.mTileN);
+                    auto const dB = std::abs(n - optionsB.mTileN);
+                    return dA < dB;
+                }
+            }
+            else
+            {
+                if (optionsA.mTileM != optionsB.mTileM)
+                {
+                    return optionsA.mTileM > optionsB.mTileM;
+                }
+            }
If you have perf data suggesting unconditional tileN closeness still wins, feel free to ignore; otherwise this aligns the code with the stated intent.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/KernelRunner.cpp
(5 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/KernelRunner.cpp
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}
: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do/for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/KernelRunner.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/KernelRunner.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/KernelRunner.cpp
🧠 Learnings (4)
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/KernelRunner.cpp
📚 Learning: 2025-08-11T20:09:24.389Z
Learnt from: achartier
PR: NVIDIA/TensorRT-LLM#6763
File: tests/integration/defs/triton_server/conftest.py:16-22
Timestamp: 2025-08-11T20:09:24.389Z
Learning: In the TensorRT-LLM test infrastructure, the team prefers simple, direct solutions (like hard-coding directory traversal counts) over more complex but robust approaches when dealing with stable directory structures. They accept the maintenance cost of updating tests if the layout changes.
Applied to files:
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/KernelRunner.cpp
📚 Learning: 2025-08-17T15:07:01.420Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#6968
File: cpp/tensorrt_llm/thop/loraOp.cpp:133-141
Timestamp: 2025-08-17T15:07:01.420Z
Learning: In TensorRT-LLM's LoRA implementation, the LoraImpl::run() method handles setStream() internally in _runGemm(), along with setWorkspace(). Both stream and workspace are passed as arguments to run(), so there's no need to call setStream() explicitly in loraOp.cpp - this avoids redundancy and follows the intended architectural separation.
Applied to files:
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/KernelRunner.cpp
📚 Learning: 2025-08-17T15:07:01.420Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#6968
File: cpp/tensorrt_llm/thop/loraOp.cpp:133-141
Timestamp: 2025-08-17T15:07:01.420Z
Learning: In TensorRT-LLM's LoRA implementation, the LoraImpl::run() method handles setStream() internally in _runGemm() (line 51 in lora.cpp), along with setWorkspace(). The stream parameter flows from loraOp.cpp through LoraImpl::run() to _runGemm() where setStream() is called appropriately. Adding setStream() in loraOp.cpp would be redundant and goes against the intended architectural design.
Applied to files:
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/KernelRunner.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (1)
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/KernelRunner.cpp (1)
21-21
: Env-gated PDL include is correct and scoped to this TU
Including tensorrt_llm/common/envUtils.h here is appropriate for the PDL runtime toggle used below. No issues.
Force-pushed from df1273b to e29565d
/bot run
PR_Github #17399 [ run ] triggered by Bot
PR_Github #17399 [ run ] completed with state
/bot run
PR_Github #17415 [ run ] triggered by Bot
PR_Github #17415 [ run ] completed with state
Signed-off-by: Shiyang Chen <[email protected]>
temp
Signed-off-by: Shiyang Chen <[email protected]>
debug print
Signed-off-by: Shiyang Chen <[email protected]>
export again with updated config
Signed-off-by: Shiyang Chen <[email protected]>
updated according to reviewer comments
use eltTypeA/B instead of dtypeMmaA
rename files
pdl value from env
Signed-off-by: Shiyang Chen <[email protected]>
fix bug
Signed-off-by: Shiyang Chen <[email protected]>
small update based on comments
Signed-off-by: Shiyang Chen <[email protected]>
improve unittest
Signed-off-by: Shiyang Chen <[email protected]>
run pre-commit
Signed-off-by: Shiyang Chen <[email protected]>
export again
Signed-off-by: Shiyang Chen <[email protected]>
export again
Signed-off-by: Shiyang Chen <[email protected]>
Signed-off-by: Shiyang Chen <[email protected]>
Signed-off-by: Shiyang Chen <[email protected]>
Add rules based on tileN and tileM.
Signed-off-by: Shiyang Chen <[email protected]>
Signed-off-by: Shiyang Chen <[email protected]>
Signed-off-by: Shiyang Chen <[email protected]>
Force-pushed from c6177ca to 44e32b6
/bot run
PR_Github #17486 [ run ] triggered by Bot
PR_Github #17486 [ run ] completed with state
/bot run
PR_Github #17556 [ run ] triggered by Bot
PR_Github #17556 [ run ] completed with state
Signed-off-by: Shiyang Chen <[email protected]>
Force-pushed from 44e32b6 to 1e654c3
/bot run
PR_Github #17569 [ run ] triggered by Bot
PR_Github #17569 [ run ] completed with state
/bot run
PR_Github #17575 [ run ] triggered by Bot
PR_Github #17575 [ run ] completed with state
Signed-off-by: Shiyang Chen <[email protected]>
Summary by CodeRabbit
New Features
Tests
Chores
Description
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user-friendly way for developers to interact with a Jenkins server.
Run
/bot [-h|--help]
to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional) pipeline-id
(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test
(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
--disable-fail-fast
(OPTIONAL) : Disable fail fast on build/tests/infra failures.
--skip-test
(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx"
(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe"
(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp"
(OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test
(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test
(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test
(OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
--post-merge
(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"
(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log
(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug
(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
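For example, combining the flags documented above (the stage name is reused from the examples), a typical invocation might look like:

/bot run --disable-fail-fast --stage-list "A10-PyTorch-1"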
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test"
is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
pr-checklist