
Conversation


@sychen52 sychen52 commented Aug 12, 2025

Summary by CodeRabbit

  • New Features

    • Added W4A8 NVFP4+FP8 and MX FP4+FP8 quantization options with automatic selection.
    • GEMM enhancements: bias support, richer matrix-layout/MMA dtype controls, per-token scaling (SF tiling/reshape), and a new FP4xFP8 configuration.
    • Kernel launch can now be gated by an environment-controlled PDL flag.
  • Tests

    • New unit test validating NVFP4/FP8 GEMM against an FP32 reference.
  • Chores

    • Updated prebuilt kernel assets and build configuration; added a translation unit to the common build.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
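
For instance, commenting /bot run --stage-list "A10-PyTorch-1" --disable-fail-fast on the PR would run only that stage with fail-fast disabled (stage name reused from the example above; substitute a real stage for your change).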

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since skipping validation without care can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since reusing a pipeline without care and validation can break the top of tree.

pr-checklist


coderabbitai bot commented Aug 12, 2025

Important

Review skipped

More than 25% of the files were skipped due to the max files limit. The review is being skipped to prevent a low-quality review.

36 of the 144 files are above the max files limit of 100. Please upgrade to the Pro plan to get higher limits.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Walkthrough

Expands GEMM kernel APIs and data layouts (bias, SF tiling, matrix layouts), refactors kernel-parameter builders and memory-chunk model, adds a runtime PDL flag to GemmInterface/KernelRunner, introduces W4A8 NVFP4+FP8 quant paths and tests, and updates many cubin LFS pointers and build/test wiring.

Changes

Cohort / File(s) Summary
Runner & options wiring
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/KernelRunner.h, cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/KernelRunner.cpp
Split eltType into eltTypeA/eltTypeB in options; update constructor filtering; add env include; GEMM invocation now passes an extra usePdl bool and calls the env helper.
Public GEMM interface & data layouts
.../trtllmGen_gemm_export/GemmInterface.h, .../trtllmGen_gemm_export/Enums.h
Add enums (MatrixLayout, SplitK, BiasType, TileScheduler); GemmData::InputBuffers gains mPtrBias; GemmInterface::run adds bool usePdl parameter; kernel-param builder usage updated; bias semantics documented.
Kernel params refactor & declaration
.../trtllmGen_gemm_export/KernelParams.h, .../KernelParamsDecl.h
Introduce gemm::KernelParamsSetup helpers and public alias; expand setKernelParams to accept SF pointers and ptrBias; add dependency-free device-facing gemm::gemm::KernelParams (KernelParamsDecl.h).
TMA descriptor & traits
.../trtllmGen_gemm_export/TmaDescriptor.h, .../KernelTraits.h
buildNdTmaDescriptor accepts tileShapes vector, broader dtype/swizzle handling, stronger diagnostics; KernelTraits moves to name-based SMEM/TMEM chunk offsets, exposes chunk lookup/total-size, adds bias/const-SF chunks; constructor extended for dtypeMmaA/B and BiasType.
GemmOptions & config
.../trtllmGen_gemm_export/GemmOptions.h, .../config.json
Expand GemmOptions public surface (BiasType, blockK, MatrixLayout A/B, mDtypeMmaA/mDtypeMmaB, SF options, etc.); constructor/signature updated; config.json renames numMmaStages to numStagesMma, removes useMetaFp8 entries, and adds a GemmFp4xFp8 template + variants.
Kernel traits / exports
.../trtllmGen_gemm_export/KernelParamsDecl.h, .../KernelParams.h, .../KernelTraits.h, .../TmaDescriptor.h
New exported device-friendly KernelParams struct; KernelParamsSetup exposes shape/stride builders and SF support; KernelTraits API updated for named-chunk accessors and bias/const SF offsets.
Cubins (LFS assets)
.../trtllmGen_gemm_export/cubins/*_cubin.cpp
Bulk additions of many new cubin Git LFS pointer files and removals of legacy cubin pointer files (metadata/asset changes only).
PyTorch quantization & linear methods
tensorrt_llm/quantization/mode.py, tensorrt_llm/_torch/modules/linear.py
Add W4A8_NVFP4_FP8 enum/flag and accessor in QuantMode; add W4A8NVFP4FP8LinearMethod and W4A8MXFP4FP8LinearMethod classes; Linear gains has_w4a8_nvfp4_fp8 property and selection path.
Attention backend propagation & tests
tensorrt_llm/_torch/attention_backend/trtllm.py, tensorrt_llm/_torch/modules/attention.py, tests/unittest/_torch/thop/test_fp4_gemm_quantize.py
Propagate has_w4a8_nvfp4_fp8 into FP8 gating for attention output selection; add unit test test_nvfp4_fp8_gemm_trtllmgen.
Build file change
cpp/tensorrt_llm/thop/CMakeLists.txt
Add nvfp4xFp8GemmTrtllmGen.cpp to th_common sources.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Caller
  participant Runner as TrtllmGenGemmRunner
  participant GI as GemmInterface
  participant KPS as KernelParamsSetup
  participant Kernel as DeviceKernel
  participant MC as ModuleCache

  Caller->>Runner: run(gemmData, options)
  Runner->>GI: run(config, workspace, data, stream, mpCount, usePdl, MC)
  GI->>KPS: setKernelParams(..., ptrSfA, ptrSfB, ptrBias, ...)
  KPS-->>GI: KernelParams
  GI->>Kernel: launch(params)  -- grid-wait gated by usePdl
  Kernel-->>GI: complete
  GI-->>Runner: status
  Runner-->>Caller: status
sequenceDiagram
  autonumber
  participant Model
  participant QM as QuantMode
  participant Linear as LinearMethod
  participant GEMM as nvfp4_fp8_gemm_trtllmgen

  Model->>QM: select quant mode
  QM-->>Model: has_w4a8_nvfp4_fp8 == true
  Model->>Linear: create_weights(...FP8 paths...)
  Linear->>GEMM: run(A_fp4, sf, B_fp8, scales, usePdl)
  GEMM-->>Linear: C
  Linear-->>Model: output

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

  • Add @coderabbitai ignore or @coderabbit ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai or @coderabbitai title anywhere in the PR title to generate the title automatically.

Status, Documentation and Community

  • Visit our Status Page to check the current availability of CodeRabbit.
  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@svc-trtllm-gh-bot svc-trtllm-gh-bot added the Community want to contribute (PRs initiated from Community) label on Aug 12, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 49

🔭 Outside diff range comments (3)
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Bfloat16_E4m3E4m3_Fp32_t128x8x128u2_s4_et64x8_m64x8x32_cga1x1x1_16dp256b_TN_transOut_noShflA_dsFp8_schedS_sm100a_cubin.cpp (1)

1-4: .cpp file contains non-C++ LFS pointer; ensure it is not compiled and address license header policy

  • The file has a .cpp extension but contains a Git LFS pointer (non-C++). If any CMake glob or target includes this directory for compilation, it will break the build.
  • Coding guidelines require an NVIDIA copyright header in all .cpp files. We cannot prepend a header without breaking the LFS pointer format.

Recommended actions:

  • Prefer storing these prebuilt cubins with a non-source extension (e.g., .cubin) and ensure all code paths reference the new extension. Alternatively, keep them under .cpp but guarantee they are excluded from all compilation units via CMake.
  • If .cpp extension is retained, document an explicit exemption to the license-header requirement for LFS pointer assets in this directory, or move a license header to a companion README at the directory level.

Also note: filename casing (leading uppercase “Gemm_…”) does not follow the “camel case with first letter lowercase” rule for C++ filenames; if these are assets rather than source, aligning the extension can resolve this inconsistency too.

cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/Enums.h (1)

102-111: Replace macros with explicit inline helpers; also avoid defining macros inside a namespace

Macros here are unnecessary and violate the guideline to limit macro usage and keep #define/#undef at global scope. Use explicit inline functions instead.

Apply this diff to remove macros and define the helpers directly:

-#define SPLIT_K_FUNCTION(Mode)                                                                     \
-  inline bool doesSplitKUse##Mode(SplitK mode) {                                                   \
-    return (mode == SplitK::Mode);                                                                 \
-  }
-
-SPLIT_K_FUNCTION(Gmem)
-SPLIT_K_FUNCTION(Dsmem)
-
-#undef SPLIT_K_FUNCTION
+inline bool doesSplitKUseGmem(SplitK mode)
+{
+    return mode == SplitK::Gmem;
+}
+
+inline bool doesSplitKUseDsmem(SplitK mode)
+{
+    return mode == SplitK::Dsmem;
+}
@@
-#define BIAS_TYPE_FUNCTION(Mode)                                                                   \
-  inline bool isBiasType##Mode(BiasType type) {                                                    \
-    return (type == BiasType::Mode);                                                               \
-  }
-
-BIAS_TYPE_FUNCTION(None)
-BIAS_TYPE_FUNCTION(N)
-BIAS_TYPE_FUNCTION(M)
-BIAS_TYPE_FUNCTION(Mn)
-
-#undef BIAS_TYPE_FUNCTION
+inline bool isBiasTypeNone(BiasType type)
+{
+    return type == BiasType::None;
+}
+inline bool isBiasTypeN(BiasType type)
+{
+    return type == BiasType::N;
+}
+inline bool isBiasTypeM(BiasType type)
+{
+    return type == BiasType::M;
+}
+inline bool isBiasTypeMn(BiasType type)
+{
+    return type == BiasType::Mn;
+}

Also applies to: 116-127

cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/GemmInterface.h (1)

597-601: Missing closing brace for namespace gemm.

The file structure shows two opening braces for namespace gemm but only one closing brace.

Apply this diff to fix the namespace structure:

 } // namespace gemm

 ////////////////////////////////////////////////////////////////////////////////////////////////////

 } // namespace gemm
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
♻️ Duplicate comments (36)
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_E4m3_E2m1E4m3_castE4m3_Fp32_sfBlk32_t128x64x256u2_s2_et128x64_m128x64x32_cga1x1x1_16dp256b_TN_transOut_schedP_sm100a_cubin.cpp (1)

1-3: Same LFS pointer considerations apply here

Please ensure this .cpp LFS pointer is covered by the LFS rule, excluded from compilation, and consider using a non-C++ extension to avoid header policy violations and accidental compilation.

cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_E4m3_E4m3E4m3_Fp32_t128x128x256u2_s3_et128x128_m128x128x32_cga1x1x1_16dp256b_TN_schedS_sm100a_cubin.cpp (1)

1-3: Same LFS pointer considerations apply here

Ensure LFS tracking, build exclusion, and consider renaming to a non-C++ extension to avoid license header policy conflicts.

cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_E4m3_E4m3E4m3_Fp32_t128x8x128u2_s4_et64x8_m64x8x32_cga1x1x1_16dp256b_TN_transOut_noShflA_dsFp8_schedS_sm100a_cubin.cpp (1)

1-3: Same LFS pointer considerations apply here

Confirm LFS filter coverage, exclude from C++ compilation, and consider renaming away from .cpp to reflect a binary asset.

cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Fp16_MxE2m1MxE4m3_Fp32_t128x8x512u2_s3_et128x8_m128x8x32_cga1x1x1_16dp256b_TN_transOut_schedS_sm100a_cubin.cpp (1)

1-3: Same LFS pointer considerations apply here

Treat this as a binary asset: enforce LFS, avoid compilation, and consider non-.cpp extension to align with policies.

cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Fp16_E2m1E4m3_castE4m3_Fp32_sfBlk32_t128x128x256_s2_et128x128_m128x128x32_cga1x1x1_16dp256b_TN_transOut_schedP_sm100a_cubin.cpp (1)

1-3: Same LFS pointer considerations apply here

Ensure LFS filter + build exclusion, and consider renaming to avoid .cpp for non-source assets.

cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_MxE4m3_MxE2m1MxE4m3_Fp32_t128x8x512u2_s3_et128x8_m128x8x32_cga1x1x1_16dp256b_TN_transOut_schedS_sm100a_cubin.cpp (1)

1-3: Same LFS pointer considerations apply here

Please verify LFS tracking, ensure CMake does not try to compile it, and consider a non-.cpp extension to avoid policy conflicts.

cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_E4m3_E4m3E4m3_Fp32_t128x128x128u2_s4_et64x128_m64x128x32_cga1x1x1_16dp256b_TN_transOut_noShflA_dsFp8_schedS_sm100a_cubin.cpp (1)

1-3: Repeat: LFS pointer addition is OK; ensure LFS and build are wired.

Same concerns as noted on the first cubin pointer: verify .gitattributes, CI LFS fetch, and CMake references/exclusions for these sources.

Refer to the shell script attached in the first cubin file’s comment to validate LFS setup and integration across all added pointers.

cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Bfloat16_E4m3E4m3_Fp32_t128x8x512u2_s3_et128x8_m128x8x32_cga1x1x1_16dp256b_TN_transOut_schedS_sm100a_cubin.cpp (1)

1-4: Same concerns as first cubin pointer: .cpp extension with non-C++ content and header policy

See the comments on file:

  • cpp/tensorrt_llm/.../Gemm_Bfloat16_E4m3E4m3_Fp32_t128x8x128u2_..._dsFp8_schedS_sm100a_cubin.cpp

Ensure exclusion from compilation and address the license-header requirement or switch to a non-source extension.

cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Fp16_E2m1E4m3_castE4m3_Fp32_sfBlk32_t128x32x512u2_s2_et128x32_m128x32x32_cga1x1x2_16dp256b_splitK2_TN_transOut_schedS_sm100a_cubin.cpp (1)

1-4: Same concerns as first cubin pointer: .cpp extension with non-C++ content and header policy

Refer to the detailed guidance in the first file’s review. Apply the same resolution here.

cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Fp16_E2m1E4m3_castE4m3_Fp32_sfBlk32_t128x32x512_s2_et128x32_m128x32x32_cga1x1x2_16dp256b_splitK2_TN_transOut_schedS_sm100a_cubin.cpp (1)

1-4: Same concerns as first cubin pointer: .cpp extension with non-C++ content and header policy

Please ensure the same build and policy safeguards are applied.

cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_E4m3_E2m1E4m3_castE4m3_Fp32_sfBlk32_t128x64x256_s2_et128x64_m128x64x32_cga1x1x1_16dp256b_TN_transOut_schedP_sm100a_cubin.cpp (1)

1-4: Same concerns as first cubin pointer: .cpp extension with non-C++ content and header policy

Apply the same remediation as suggested in the first file.

cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Bfloat16_E4m3E4m3_Fp32_t128x32x128_s4_et64x32_m64x32x32_cga1x1x1_16dp256b_TN_transOut_noShflA_dsFp8_schedS_sm100a_cubin.cpp (1)

1-4: Same concerns as first cubin pointer: .cpp extension with non-C++ content and header policy

Ensure exclusion from compilation and license policy handling or switch to a non-source extension.

cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Fp16_E2m1E4m3_castE4m3_Fp32_sfBlk32_t128x64x256u2_s2_et128x64_m128x64x32_cga1x1x1_16dp256b_TN_transOut_schedP_sm100a_cubin.cpp (1)

1-4: Same concerns as first cubin pointer: .cpp extension with non-C++ content and header policy

Same actions recommended (exclude from compile, resolve license-header requirement or change extension).

cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Bfloat16_E2m1E4m3_castE4m3_Fp32_sfBlk32_t128x64x256u2_s2_et128x64_m128x64x32_cga1x1x1_16dp256b_TN_transOut_schedP_sm100a_cubin.cpp (1)

1-4: Same concerns as first cubin pointer: .cpp extension with non-C++ content and header policy

Please apply the same safeguards and decisions here as well.

cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Fp16_E4m3E4m3_Fp32_t128x32x128u2_s4_et64x32_m64x32x32_cga1x1x1_16dp256b_TN_transOut_noShflA_dsFp8_schedS_sm100a_cubin.cpp (1)

1-3: Same build-exclusion concern for LFS pointer .cpp file.

Ensure this file is not part of any compilation unit. Refer to the verification script provided in the first cubin pointer comment.

cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Bfloat16_E4m3E4m3_Fp32_t128x16x128_s4_et64x16_m64x16x32_cga1x1x1_16dp256b_TN_transOut_noShflA_dsFp8_schedS_sm100a_cubin.cpp (1)

1-3: LFS pointer with .cpp extension — confirm exclusion from C++ targets.

This is not compilable C++. Validate exclusion via CMake (see earlier script).

cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Fp32_MxE2m1MxE4m3_Fp32_t128x8x512u2_s3_et128x8_m128x8x32_cga1x1x1_16dp256b_TN_transOut_schedS_sm100a_cubin.cpp (1)

1-3: Non-compilable .cpp (LFS pointer) — ensure it’s never compiled.

Please verify it’s not referenced by any build targets.

cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Fp16_E4m3E4m3_Fp32_t128x64x128_s4_et64x64_m64x64x32_cga1x1x1_16dp256b_TN_transOut_noShflA_dsFp8_schedS_sm100a_cubin.cpp (1)

1-3: Confirm CMake ignores this .cpp LFS pointer.

Same rationale as above.

cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_E4m3_E2m1E4m3_castE4m3_Fp32_sfBlk32_t128x16x512_s2_et128x16_m128x16x32_cga1x1x2_16dp256b_splitK2_TN_transOut_schedS_sm100a_cubin.cpp (1)

1-3: LFS pointer in a .cpp file — verify exclusion from targets and maintainability.

Reiterate: ensure no target compiles this; prefer renaming to a data extension to avoid tooling confusion.

cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_E4m3_E4m3E4m3_Fp32_t128x64x128u2_s4_et64x64_m64x64x32_cga1x1x1_16dp256b_TN_transOut_noShflA_dsFp8_schedS_sm100a_cubin.cpp (1)

1-3: Build safety check for LFS pointer .cpp.

Please confirm it’s excluded from all compilation units.

cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_E4m3_E4m3E4m3_Fp32_t128x32x128_s4_et64x32_m64x32x32_cga1x1x1_16dp256b_TN_transOut_noShflA_dsFp8_schedS_sm100a_cubin.cpp (1)

1-3: Ensure .cpp LFS pointer is not compiled; prefer a non-code extension.

Same concern as the other cubin pointers in this PR.

cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Fp16_E4m3E4m3_Fp32_t128x16x128_s4_et64x16_m64x16x32_cga1x1x1_16dp256b_TN_transOut_noShflA_dsFp8_schedS_sm100a_cubin.cpp (1)

1-3: Same concerns as the first cubin pointer: ensure build exclusion and policy exemption.

Refer to my comment on Gemm_Bfloat16_E2m1E4m3_castE4m3_*_cubin.cpp for:

  • excluding .cpp LFS pointers from compilation,
  • documenting/automating header-check exemptions, and
  • verifying LFS tracking via .gitattributes.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Fp16_E2m1E4m3_castE4m3_Fp32_sfBlk32_t128x8x512u2_s2_et128x8_m128x8x32_cga1x1x2_16dp256b_splitK2_TN_transOut_schedS_sm100a_cubin.cpp (1)

1-3: Same concerns as the first cubin pointer: ensure build exclusion and policy exemption.

Refer to my comment on Gemm_Bfloat16_E2m1E4m3_castE4m3_*_cubin.cpp for:

  • excluding .cpp LFS pointers from compilation,
  • documenting/automating header-check exemptions, and
  • verifying LFS tracking via .gitattributes.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_E4m3_E4m3E4m3_Fp32_t128x32x128u2_s4_et64x32_m64x32x32_cga1x1x1_16dp256b_TN_transOut_noShflA_dsFp8_schedS_sm100a_cubin.cpp (1)

1-3: Same concerns as the first cubin pointer: ensure build exclusion and policy exemption.

Refer to my comment on Gemm_Bfloat16_E2m1E4m3_castE4m3_*_cubin.cpp for:

  • excluding .cpp LFS pointers from compilation,
  • documenting/automating header-check exemptions, and
  • verifying LFS tracking via .gitattributes.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Fp16_E4m3E4m3_Fp32_t128x8x512u2_s3_et128x8_m128x8x32_cga1x1x1_16dp256b_TN_transOut_schedS_sm100a_cubin.cpp (1)

1-3: Same concerns as the first cubin pointer: ensure build exclusion and policy exemption.

Refer to my comment on Gemm_Bfloat16_E2m1E4m3_castE4m3_*_cubin.cpp for:

  • excluding .cpp LFS pointers from compilation,
  • documenting/automating header-check exemptions, and
  • verifying LFS tracking via .gitattributes.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_E4m3_E2m1E4m3_castE4m3_Fp32_sfBlk32_t128x8x512_s2_et128x8_m128x8x32_cga1x1x2_16dp256b_splitK2_TN_transOut_schedS_sm100a_cubin.cpp (1)

1-3: Same concerns as the first cubin pointer: ensure build exclusion and policy exemption.

Refer to my comment on Gemm_Bfloat16_E2m1E4m3_castE4m3_*_cubin.cpp for:

  • excluding .cpp LFS pointers from compilation,
  • documenting/automating header-check exemptions, and
  • verifying LFS tracking via .gitattributes.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Bfloat16_E4m3E4m3_Fp32_t128x128x128u2_s4_et64x128_m64x128x32_cga1x1x1_16dp256b_TN_transOut_noShflA_dsFp8_schedS_sm100a_cubin.cpp (1)

1-3: Same concerns as the first cubin pointer: ensure build exclusion and policy exemption.

Refer to my comment on Gemm_Bfloat16_E2m1E4m3_castE4m3_*_cubin.cpp for:

  • excluding .cpp LFS pointers from compilation,
  • documenting/automating header-check exemptions, and
  • verifying LFS tracking via .gitattributes.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Bfloat16_E2m1E4m3_castE4m3_Fp32_sfBlk32_t128x8x512_s2_et128x8_m128x8x32_cga1x1x2_16dp256b_splitK2_TN_transOut_schedS_sm100a_cubin.cpp (1)

1-3: Same concerns as the first cubin pointer: ensure build exclusion and policy exemption.

Refer to my comment on Gemm_Bfloat16_E2m1E4m3_castE4m3_*_cubin.cpp for:

  • excluding .cpp LFS pointers from compilation,
  • documenting/automating header-check exemptions, and
  • verifying LFS tracking via .gitattributes.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Fp16_E4m3E4m3_Fp32_t128x128x128u2_s4_et64x128_m64x128x32_cga1x1x1_16dp256b_TN_transOut_noShflA_dsFp8_schedS_sm100a_cubin.cpp (1)

1-3: Repeat: LFS pointer under .cpp — avoid compilation and header-lint conflicts

Same concerns as noted for the other cubin pointer:

  • Prefer renaming away from .cpp, or guarantee CMake excludes this path and header-lint skips it.
  • Verify LFS tracking via .gitattributes and git lfs.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Bfloat16_E4m3E4m3_Fp32_t128x64x128u2_s4_et64x64_m64x64x32_cga1x1x1_16dp256b_TN_transOut_noShflA_dsFp8_schedS_sm100a_cubin.cpp (1)

1-3: Repeat: Use non-source extension for LFS pointers and confirm LFS/CMake config

To prevent accidental compilation and header checks on LFS pointers:

  • Rename away from .cpp or explicitly exclude in CMake and lint.
  • Ensure LFS tracking is configured for this path.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Fp16_E4m3E4m3_Fp32_t128x32x128_s4_et64x32_m64x32x32_cga1x1x1_16dp256b_TN_transOut_noShflA_dsFp8_schedS_sm100a_cubin.cpp (1)

1-3: Repeat: LFS pointer stored as .cpp — fix extension or exclude from build/lint

Apply the same resolution as suggested for the other files to avoid compilation and header-lint issues, and verify LFS tracking.

cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Bfloat16_E2m1E2m1_Fp32_t128x128x256u2_s3_et128x128_m128x128x64_cga1x1x1_16dp256b_TN_schedS_sm100a_cubin.cpp (1)

1-3: Repeat: Non-C++ content in .cpp file — address extension and tooling implications

As above, please rename from .cpp or ensure robust exclusions in CMake and header-lint; verify Git LFS tracking for this path.

cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_E4m3_E4m3E4m3_Fp32_t128x16x128u2_s4_et64x16_m64x16x32_cga1x1x1_16dp256b_TN_transOut_noShflA_dsFp8_schedS_sm100a_cubin.cpp (1)

1-3: Repeat: LFS pointer file should not use .cpp extension

Same action items:

  • Rename to a non-source extension or exclude from builds and header checks.
  • Confirm LFS tracking with .gitattributes and git lfs.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Bfloat16_E4m3E4m3_Fp32_t128x128x128_s4_et64x128_m64x128x32_cga1x1x1_16dp256b_TN_transOut_noShflA_dsFp8_schedS_sm100a_cubin.cpp (1)

1-3: Repeat: .cpp extension for LFS pointer — rename or exclude; verify LFS and CMake

Please apply the same resolution as the other cubin pointer files.

cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_E4m3_E4m3E4m3_Fp32_t128x8x512u2_s3_et128x8_m128x8x32_cga1x1x1_16dp256b_TN_transOut_schedS_sm100a_cubin.cpp (1)

1-3: Repeat: Ensure LFS pointers aren’t treated as C++ sources

Consistent with previous files:

  • Prefer non-source extension or ensure build/lint exclusions.
  • Confirm LFS tracking and absence from any compilation units.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_E4m3_E2m1E4m3_castE4m3_Fp32_sfBlk32_t128x128x256_s2_et128x128_m128x128x32_cga1x1x1_16dp256b_TN_transOut_schedP_sm100a_cubin.cpp (1)

1-3: Same LFS pointer/build-exclusion concerns as above

Apply the same verification and build exclusion steps as noted for the other cubin pointer file to ensure these .cpp LFS pointers do not get compiled and are properly tracked by LFS.

🧹 Nitpick comments (41)
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Bfloat16_E4m3E4m3_Fp32_t128x128x256u2_s2_et128x128_m128x128x32_cga1x1x1_16dp256b_TN_schedS_sm100a_cubin.cpp (1)

1-3: Filename and location appear to violate C++ file naming conventions.

Guideline: “C++ filenames should use camel case with the first letter lowercase.” This asset-style name with underscores and capitals is likely generated. If this remains a code file, align naming; if it’s a binary asset, prefer a data-oriented extension (e.g., .cubin) and exempt from C++ filename rules by not treating it as source.

cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_E4m3_E4m3E4m3_Fp32_t128x16x128_s4_et64x16_m64x16x32_cga1x1x1_16dp256b_TN_transOut_noShflA_dsFp8_schedS_sm100a_cubin.cpp (2)

1-3: Filename violates repository C++ filename convention.

Guideline: “C++ filenames should use camel case with the first letter lowercase.” This auto-generated asset filename uses underscore-separated, uppercase-led segments and is extremely long. If practical, consider:

  • Renaming to a non-source asset extension (e.g., .cubin) and moving generation-specific metadata to build system or manifest files.
  • If a wrapper .cpp is needed, give the wrapper a compliant lowerCamelCase name and keep the long, descriptive name only for the binary asset.

1-3: Packaging and distribution considerations for large LFS cubin assets.

  • Ensure LFS assets are not accidentally included in PyPI wheels or source distributions unless required (wheel bloat).
  • Confirm runtime gating so this sm100a cubin is only selected on compatible GPUs.
  • Document or record the toolchain and compiler flags used to produce this cubin for reproducibility.

If helpful, I can draft CMake/Bazel snippets to:

  • Exclude LFS pointer .cpp files from compilation and
  • Install .cubin assets alongside the library with proper runtime lookup.
    Would you like me to propose those?
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_E4m3_E2m1E4m3_castE4m3_Fp32_sfBlk32_t128x16x512u2_s2_et128x16_m128x16x32_cga1x1x2_16dp256b_splitK2_TN_transOut_schedS_sm100a_cubin.cpp (2)

1-3: Required NVIDIA copyright header is missing (C++ file).

Per project policy, all .cpp files must start with an NVIDIA copyright header including the current year. Because this is an LFS-managed file, the actual content might have the header while the pointer does not. Please confirm the fetched .cpp content begins with the required header; if not, update the source that generated this asset.


1-3: File naming and path length risk — consider a stable short-name + manifest mapping.

The filename is extremely long, deviates from the camelCase convention for C++ files, and risks hitting path length limits on Windows/macOS environments. Consider:

  • Storing the asset under a shorter, stable hashed name (e.g., cubins/<hash>.cpp or <hash>.cubin) and
  • Maintaining a manifest (JSON/TOML) mapping kernel metadata → filename,
  • Or add an explicit exception to naming rules for generated cubin assets.

This reduces path-length issues and improves portability while keeping metadata discoverable.

cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Fp16_E4m3E4m3_Fp32_t128x128x256u2_s2_et128x128_m128x128x32_cga1x1x1_16dp256b_TN_schedS_sm100a_cubin.cpp (1)

1-3: Filename violates C++ naming guideline for source files.

Guidelines specify C++ filenames use camelCase starting lowercase. This name is UpperCamel with underscores and a very long, config-encoded identifier. If this remains a source file, please rename accordingly or justify/document an exception for generated/asset files. If it’s an asset, prefer a .cubin extension with the architecture/config encoded as part of the filename (common for precompiled kernels), and keep code files minimal and convention-compliant.

cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Bfloat16_E2m1E4m3_castE4m3_Fp32_sfBlk32_t128x128x256u2_s2_et128x128_m128x128x32_cga1x1x1_16dp256b_TN_transOut_schedP_sm100a_cubin.cpp (1)

2-2: Provenance and runtime gating for sm100a cubin.

The OID points to a prebuilt sm100a cubin. Ensure there is runtime gating so this asset is only loaded on compatible GPUs and a fallback exists for others. Consider adding a small adjacent metadata file (e.g., .json) capturing toolchain, commit, and build flags for reproducibility.

I can generate a JSON metadata template and loader snippet that validates SM compatibility before selecting the cubin. Want me to draft that?
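
A minimal sketch of such a compatibility check, assuming sm100a corresponds to compute capability 10.x (the helper name and the mapping are illustrative placeholders, not the repo's actual loader):

#include <cuda_runtime.h>

// Hypothetical gate: only select sm100a cubins when the device's compute
// capability matches; callers should fall back to another kernel otherwise.
inline bool deviceSupportsSm100a(int deviceId)
{
    cudaDeviceProp prop{};
    if (cudaGetDeviceProperties(&prop, deviceId) != cudaSuccess)
    {
        return false;
    }
    return prop.major == 10;
}

Pairing this with the proposed metadata file would let the loader cross-check the recorded SM arch against the device before loading the cubin.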

cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Bfloat16_E2m1E4m3_castE4m3_Fp32_sfBlk32_t128x128x256_s2_et128x128_m128x128x32_cga1x1x1_16dp256b_TN_transOut_schedP_sm100a_cubin.cpp (3)

1-3: .cpp extension for LFS cubin payloads is risky; consider renaming or excluding in build.

With a .cpp extension, any GLOB_RECURSE for *.cpp could try to compile these assets. Prefer:

  • Rename assets to .cubin (or place under a non-source suffix), and/or
  • Ensure CMake explicitly excludes this directory from any *.cpp globs.

If renaming isn’t feasible due to downstream tooling, add an explicit exclusion in CMake and a README in this directory documenting that these are LFS-served binary assets, not C++ sources.


1-3: Missing NVIDIA copyright header (guideline) — request an exemption for LFS pointers.

Guidelines state all .cpp/.cu/.py must include an NVIDIA copyright header. For LFS pointer files, adding headers would break the pointer format. Please:

  • Document an explicit exemption for these LFS cubin assets in CONTRIBUTING or a README here, or
  • Move them to a binary-friendly extension and keep source-only files under .cpp/.cu.

This avoids false positives in audits and keeps us compliant with policy intent.


1-3: Confirm LFS tracking and CI checkout behavior for cubin assets

  • .gitattributes contains an entry
    *cubin.cpp filter=lfs diff=lfs merge=lfs -text (line 8)
    which covers all *_cubin.cpp files under LFS.
  • No CMakeLists.txt uses a GLOB or GLOB_RECURSE for *.cpp nor references the cubins folder, so these stub files won’t be picked up by the build.
  • Action required: Ensure your CI pipelines actually fetch LFS objects before the build steps.
    • On GitHub Actions: use
      - uses: actions/checkout@v3
        with:
          lfs: true
    • On Jenkins/other systems: add a git lfs pull (or fetch) after checkout.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Fp16_E4m3E4m3_Fp32_t128x8x128_s4_et64x8_m64x8x32_cga1x1x1_16dp256b_TN_transOut_noShflA_dsFp8_schedS_sm100a_cubin.cpp (1)

1-3: Filename does not follow C++ file naming guideline

C++ filenames should use camel case with a lowercase first letter. The current autogenerated name violates this. If this is an auto-generated cubin asset, consider:

  • Renaming to a data filename (e.g., .cubin) and exclude from compilation targets; or
  • Introducing a generation step that maps kernel config to a compliant source filename and stores the config string inside the translation unit as a constant.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Bfloat16_E2m1E4m3_castE4m3_Fp32_sfBlk32_t128x64x256_s2_et128x64_m128x64x32_cga1x1x1_16dp256b_TN_transOut_schedP_sm100a_cubin.cpp (2)

1-3: Filename/style exceptions or alternative packaging.

The filename violates the usual C++ filename guideline (lowerCamelCase). If these are treated as assets rather than source, consider:

  • Renaming to .cubin and loading at runtime (cudaModuleLoad/cudaLibraryLoad or driver API).
  • Or keeping the naming as-is but exclude these from clang-format/style checks and compilation units, and generate thin, well-named C++ wrappers that register kernels.

1-3: Optional: Build-time embedding pattern to avoid committing huge C++ arrays.

If you choose to store actual .cubin in LFS, you can embed it at build time. Example CMake approach:

  • Add to CMakeLists.txt:
# Embed cubin into an object file and link it
add_custom_command(
    OUTPUT ${CMAKE_CURRENT_BINARY_DIR}/gemm_cubin.o
    COMMAND ${CMAKE_OBJCOPY} -I binary -O elf64-x86-64 -B i386:x86-64
            ${CMAKE_CURRENT_SOURCE_DIR}/cubins/Gemm_..._sm100a.cubin
            ${CMAKE_CURRENT_BINARY_DIR}/gemm_cubin.o
    DEPENDS ${CMAKE_CURRENT_SOURCE_DIR}/cubins/Gemm_..._sm100a.cubin
    COMMENT "Embedding GEMM cubin"
)
add_library(gemm_cubin_obj OBJECT ${CMAKE_CURRENT_BINARY_DIR}/gemm_cubin.o)
target_link_libraries(your_target PRIVATE gemm_cubin_obj)
  • In C++ you can declare the symbols emitted by objcopy (start/end/size) and pass the bytes to the loader.

This keeps the repo clean and satisfies header/style rules in actual C++ sources.
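
On the C++ side, a minimal sketch of consuming the objcopy-emitted symbols might look like the following (the _binary_* identifiers are derived by objcopy from the input path, so the exact names below are assumptions to adjust):

#include <cstddef>

// Symbols emitted by `objcopy -I binary`; actual names depend on the input file path.
extern "C" const unsigned char _binary_gemm_cubin_start[];
extern "C" const unsigned char _binary_gemm_cubin_end[];

// Hand the embedded cubin bytes to the existing loader code.
inline unsigned char const* gemmCubinData()
{
    return _binary_gemm_cubin_start;
}

inline std::size_t gemmCubinSize()
{
    return static_cast<std::size_t>(_binary_gemm_cubin_end - _binary_gemm_cubin_start);
}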

cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_E4m3_E4m3E4m3_Fp32_t128x8x512u2_s3_et128x8_m128x8x32_cga1x1x2_16dp256b_splitK2_TN_transOut_schedS_sm100a_cubin.cpp (1)

1-3: OK to merge; follow existing naming despite style guide exceptions.

Filename deviates from the camelCase guideline, but these assets appear auto-generated and consistent with existing patterns in this subtree. If style linters run globally, add an exception for this directory.

cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Fp16_E4m3E4m3_Fp32_t128x16x128u2_s4_et64x16_m64x16x32_cga1x1x1_16dp256b_TN_transOut_noShflA_dsFp8_schedS_sm100a_cubin.cpp (1)

1-3: No functional code here; verify CI pulls LFS before any compile step.

Some build environments skip LFS by default. Ensure CI jobs (and end-user docs) include “git lfs install && git lfs pull” before build or package steps to avoid missing sources.

cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Bfloat16_E2m1E4m3_castE4m3_Fp32_sfBlk32_t128x32x512u2_s2_et128x32_m128x32x32_cga1x1x2_16dp256b_splitK2_TN_transOut_schedS_sm100a_cubin.cpp (1)

1-3: Large asset pointer added; consider a manifest/consistency check.

Given the volume of cubins, a manifest (JSON) mapping filename → oid/size can help detect drift and ensure runtime registration, plus a small CI check to compare against the directory contents.

If helpful, I can draft a simple Python tool and CI step to validate the manifest against this directory.

cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Bfloat16_E2m1E4m3_castE4m3_Fp32_sfBlk32_t128x8x512u2_s2_et128x8_m128x8x32_cga1x1x2_16dp256b_splitK2_TN_transOut_schedS_sm100a_cubin.cpp (1)

1-3: Confirm these .cpp LFS pointers are excluded from compilation and packaging.

These are Git LFS pointer stubs to prebuilt cubins but use a .cpp extension. Many build systems glob .cpp under src/ and attempt to compile them, which would fail. Either:

  • rename to a non-compilable extension (e.g., .cubin.lfs or .lfs), or
  • explicitly exclude this cubins/ directory (or pattern) from add_library/add_executable globs/targets in CMake.

If renaming isn’t feasible, add a CMake exclusion to avoid accidental compilation and to prevent these files from being misinterpreted by tooling.

I can suggest the exact CMake exclusion lines once I see the build files that collect sources.

cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Bfloat16_E4m3E4m3_Fp32_t128x8x512u2_s3_et128x8_m128x8x32_cga1x1x2_16dp256b_splitK2_TN_transOut_schedS_sm100a_cubin.cpp (1)

1-3: Ensure LFS pointer files are not treated as C++ sources
The file under
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_…_cubin.cpp
is a Git LFS pointer, not valid C++ code. To prevent accidental compilation and lint failures, please:

• Rename these files with a non-.cpp extension (e.g. .cubin, .bin or .lfs)
• Or update your CMakeLists.txt to explicitly exclude the entire cubins/ directory from any targets
• Update your header-lint configuration to skip cubins/ files
• Manually verify that these assets are tracked by LFS (git lfs ls-files) and that your .gitattributes includes patterns covering cpp/tensorrt_llm/.../cubins/

I couldn’t confirm LFS tracking or CMake exclusions in this environment—please verify those before merging.

cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Bfloat16_E4m3E4m3_Fp32_t128x64x128_s4_et64x64_m64x64x32_cga1x1x1_16dp256b_TN_transOut_noShflA_dsFp8_schedS_sm100a_cubin.cpp (1)

1-3: Confirm Git LFS pointer validity and upload status

Please verify that the corresponding LFS object (oid sha256:0c6578ab...) is uploaded to the remote LFS store; otherwise consumers will fetch broken pointers. Double-check via your CI or with git lfs push --all prior to merging.

If not already present, consider a pre-merge CI step that runs git lfs fsck to fail fast on missing LFS objects.

cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_E4m3_E2m1E4m3_castE4m3_Fp32_sfBlk32_t128x32x512_s2_et128x32_m128x32x32_cga1x1x2_16dp256b_splitK2_TN_transOut_schedS_sm100a_cubin.cpp (1)

1-3: Asset provenance and manifest suggestion

Prebuilt cubins are security- and reproducibility-sensitive. Add or update a manifest (JSON/TOML) mapping:

  • filename → kernel config (types, tile shapes, scheduler), SM arch, build toolchain version, source commit, and hash.

This helps users and CI validate they’re loading the correct artifact and eases regeneration.

cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_E4m3_E4m3E4m3_Fp32_t128x128x128_s4_et64x128_m64x128x32_cga1x1x1_16dp256b_TN_transOut_noShflA_dsFp8_schedS_sm100a_cubin.cpp (1)

1-3: Consider relocating LFS pointers to a data-only extension or directory

Keeping these as .cpp creates avoidable coupling with C++ policies (header check, formatter) and risks accidental compilation. A data extension (.cubin) or a data assets tree with clear ignore rules can reduce maintenance overhead.

cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_E4m3_E2m1E4m3_castE4m3_Fp32_sfBlk32_t128x128x256u2_s2_et128x128_m128x128x32_cga1x1x1_16dp256b_TN_transOut_schedP_sm100a_cubin.cpp (1)

1-3: CI hygiene: exclude cubins/ from format/lint hooks

clang-format and header checks should skip these pointer files. Ensure the clang-format exclude and header-lint configs ignore cpp/**/cubins/*.cpp to avoid noisy CI failures.

If helpful, I can draft a minimal .clang-format/.cmake-lint config snippet to exclude this path.

cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Fp32_E2m1E2m1_Fp32_t128x128x256u2_s3_et128x128_m128x128x64_cga1x1x1_16dp256b_TN_schedS_sm100a_cubin.cpp (1)

1-3: Size growth awareness and distribution strategy

Multiple large LFS assets increase clone time and require LFS availability. Consider:

  • Publishing these cubins as a versioned release artifact and downloading at build/test time, or
  • Keeping only the minimum set required by default and providing an optional pack for extended configs.
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/Enums.h (2)

77-86: BiasType comments appear swapped (M vs N)

The comments say “M = bias per N” and “N = bias per M (row)”, which reads inverted relative to the names. Please correct to avoid user confusion.

Suggested fix in comments:

  • M: “One bias value per row M of the output tensor.”
  • N: “One bias value per column N of the output tensor.”

25-35: Brace style and indentation drift from Allman + 4-space rule

Current style places opening braces on the same line and uses mixed indentation. Please reformat per project guidelines (Allman, 4 spaces) using the repo’s clang-format configuration.

cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/config.json (1)

417-424: Comment mismatch: says “TileN 64” but values are 128

The second Nvfp4x Throughput config has mmaN/tileN/epilogueTileN = 128, but the comment reads “TileN 64”. Fix the comment to prevent confusion.

Apply this diff:

-            "_comment": "TileN 64",
+            "_comment": "TileN 128",
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/KernelRunner.h (2)

2-2: Update copyright year to 2025

The copyright header should include the year 2025 since this is a new change in 2025.


35-35: Add documentation for the new dtypeMmaA field

The new public member dtypeMmaA should be documented to explain its purpose and usage, particularly its default value and relationship to kernel configuration selection.

Add a comment above the field:

    bool transposeMmaOutput{false};
+    // MMA A-side dtype override for kernel config selection. Void means no restriction.
    gemm::trtllm::gen::Dtype dtypeMmaA{gemm::trtllm::gen::Dtype::Void};
tests/unittest/_torch/thop/test_fp4_gemm_quantize.py (2)

130-138: Consider enabling the larger test case

The test has a commented-out larger test case [1024, 1024, 1024]. Consider either removing it if not needed or enabling it to improve test coverage.

    @parameterized.expand(
        list(
            [
-#                [1024, 1024, 1024],
+                [1024, 1024, 1024],
                [128, 1, 256],
            ]
        ),
        name_func=unittest_name_func,
    )

148-151: Document the quantization parameters

The magic numbers used in quantization (sf_vec_size=32, scaling factor 200.0) should be documented or made into named constants for clarity.

Add comments or constants:

+        # Scaling factor vector size for FP4 quantization
         sf_vec_size = 32
+        # Global scaling factor chosen to fit within FP8 range
+        FP8_SCALE_FACTOR = 200.0
         a_fp4, a_sf, rep_float = torch.ops.tensorrt_llm.float_to_e2m1_and_ufp8sf_scale(
-            a*a_global_sf, sf_vec_size, 1, True
+            a * a_global_sf, sf_vec_size, 1, True
         )
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/KernelRunner.cpp (1)

2-2: Update copyright year to 2025

The copyright header should include 2025 for new changes.

cpp/tensorrt_llm/thop/nvfp4xFp8GemmTrtllmGen.cpp (2)

78-79: Improve error message for dimension mismatch

The error message could be clearer about the expected relationship between mat1 and mat2 dimensions.

-    TORCH_CHECK(mat1.sizes()[1] == mat2.sizes()[1] * 2, "mat1 and mat2 shapes cannot be multiplied (", mat1.sizes()[0],
-        "x", mat1.sizes()[1], " and ", mat2.sizes()[0], "x", mat2.sizes()[1], ")");
+    TORCH_CHECK(mat1.sizes()[1] == mat2.sizes()[1] * 2, 
+        "mat1.sizes()[1] must be twice mat2.sizes()[1] for FP4 packing. Got mat1: (", mat1.sizes()[0],
+        "x", mat1.sizes()[1], ") and mat2: (", mat2.sizes()[0], "x", mat2.sizes()[1], ")");

85-90: Consider making the default output dtype configurable

The default output dtype is hardcoded to FP16. Consider making this configurable or documenting why FP16 is the preferred default.

Add a comment explaining the default choice:

     if (!out_dtype)
     {
+        // Default to FP16 for best compatibility and performance balance
         out_dtype = torch::kHalf;
     }
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/GemmInterface.h (2)

410-411: Remove unused parameter warnings in non-PDL builds.

The parameters usePdl and moduleCache are marked as potentially unused but should be properly handled based on the build configuration.

Consider using [[maybe_unused]] attribute instead of the cast to void pattern:

-  // Might be used.
-  (void)usePdl;
-  (void)moduleCache;
+  [[maybe_unused]] auto pdl = usePdl;
+  [[maybe_unused]] auto cache = moduleCache;

528-530: Improve the gating condition for launchKernel.

The condition for usePdl is complex and could benefit from better readability.

Consider extracting the complex condition into a named variable:

+  bool const requiresPdl = config.mOptions.mGridWaitForPrimaryEarlyExit |
+                          config.mOptions.mGridWaitForPrimaryA |
+                          config.mOptions.mGridWaitForPrimaryB;
   auto result = trtllm::gen::launchKernel((void*)&kernelParams,
                                           cudaStream,
                                           config.mSharedMemSize,
                                           cuFunction,
                                           block3,
                                           grid3,
                                           cluster3,
-                                          usePdl && (config.mOptions.mGridWaitForPrimaryEarlyExit |
-                                                     config.mOptions.mGridWaitForPrimaryA |
-                                                     config.mOptions.mGridWaitForPrimaryB));
+                                          usePdl && requiresPdl);
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/TmaDescriptor.h (1)

303-305: Missing closing brace comment for nested namespace.

According to coding guidelines, closing braces of namespaces should have a comment.

 } // namespace gemm

 } // namespace gemm
+
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/KernelParamsDecl.h (1)

315-317: Missing closing brace comment for namespace.

According to coding guidelines, closing braces of namespaces should have a comment.

 } // namespace gemm

 } // namespace gemm
+
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/KernelParams.h (1)

370-376: Missing closing brace comment for namespace.

 }; // namespace KernelParamsSetup

 ////////////////////////////////////////////////////////////////////////////////////////////////////

 } // namespace gemm

 } // namespace gemm
+
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/KernelTraits.h (2)

96-107: Consider using structured logging instead of printf.

The print() method uses printf for debugging output. Consider using a logging framework for better control.

Consider using a stream-based approach:

void print(std::ostream& os = std::cout) const {
  for (size_t ii = 0; ii < mNumBytesAndAlignmentPerSmemChunk.size(); ++ii) {
    os << "Chunk " << ii << " " << mSmemChunkNames[ii] 
       << ": " << mNumBytesAndAlignmentPerSmemChunk[ii].first << " bytes, "
       << mNumBytesAndAlignmentPerSmemChunk[ii].second << " alignment, reuse "
       << (mFirstChunkReuse[ii] ? "true" : "false")
       << ", offset " << getChunkOffset(ii) << "\n";
  }
}

660-662: Missing closing brace comment for namespace.

 } // namespace gemm

 } // namespace gemm
+
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Fp16_E2m1E4m3_castE4m3_Fp32_sfBlk32_t128x128x256u2_s2_et128x128_m128x128x32_cga1x1x1_16dp256b_TN_transOut_schedP_sm100a_cubin.cpp (1)

1-3: Offer: Provide build-time embedding pattern

If you opt to embed binaries into a valid .cpp, consider a generator step. Example outline:

  • Keep the binary tracked via LFS as *.cubin.
  • Add a custom command to transform .cubin to .cpp (using xxd -i), prepend the license header, and compile that generated .cpp.
  • Exclude raw .cubin and generated files from clang-format/lint as needed.

I can draft an example CMake rule and a small C++ wrapper that exposes:

  • extern const unsigned char k[];
  • extern const size_t kSize;

to your existing loader code.


@coderabbitai coderabbitai bot left a comment


Review continued from previous batch...

@sychen52 sychen52 force-pushed the add_nvfp4xfp8 branch 2 times, most recently from aec13da to 1c0b4a0, on August 14, 2025 at 21:24
@sychen52 sychen52 changed the title [draft] Add nvfp4xfp8 [none][feat] Add NVFP4 x FP8 Aug 14, 2025
@sychen52 sychen52 changed the title [none][feat] Add NVFP4 x FP8 [None][feat] Add NVFP4 x FP8 Aug 14, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 9

🔭 Outside diff range comments (4)
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/KernelRunner.h (2)

55-58: Missing include for mPassingConfigIndices (header should be self-contained)

This header declares std::vector members but does not include , relying on transitive includes. Make the header self-contained.

 #include "trtllmGen_gemm_export/trtllm/gen/DtypeDecl.h"
 #include <optional>
+#include <vector>

17-17: Follow project guideline: use include guards instead of pragma once

Coding guidelines require header guards named TRTLLM_<FILENAME>_H. Replace pragma once with an include guard.

-#pragma once
+#ifndef TRTLLM_KERNELRUNNER_H
+#define TRTLLM_KERNELRUNNER_H
...
 } // namespace tensorrt_llm
+
+#endif // TRTLLM_KERNELRUNNER_H
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/TmaDescriptor.h (1)

17-17: Follow project guideline: use include guards instead of pragma once

Per coding guidelines, headers must use include guards. Replace pragma once with TRTLLM_TMADESCRIPTOR_H guard.

cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/KernelTraits.h (1)

17-17: Follow project guideline: use include guards instead of pragma once

Please replace pragma once with a TRTLLM_KERNELTRAITS_H include guard per the coding guideline.

♻️ Duplicate comments (3)
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/TmaDescriptor.h (2)

153-170: Validate tileShapes[0] range and positivity to avoid underflow/overflow in boxDim[0]

tileShapes[0] can be 0 or negative (int32_t), leading to boxDim[0]=0 or a large uint32_t after sign conversion. The current code also only clamps; it does not report out-of-range values. Validate tileShapes[0] > 0 and <= numEltsIn128B before use.

-    auto const numEltsInClampedFastestTileSize = std::min(numEltsIn128B, tileShapes[0]);
+    auto const numEltsInClampedFastestTileSize = std::min(numEltsIn128B, static_cast<int32_t>(tileShapes[0]));
+    if (tileShapes[0] <= 0 || tileShapes[0] > numEltsIn128B)
+    {
+        std::cerr << "buildNdTmaDescriptor: tileShapes[0] " << tileShapes[0]
+                  << " is out of range (1.." << numEltsIn128B << ")" << std::endl;
+        assert(false);
+    }
@@
-    for (size_t ii = 1; ii < tileShapes.size(); ++ii)
+    for (size_t ii = 1; ii < tileShapes.size(); ++ii)
     {
-        if (tileShapes[ii] > 256)
+        if (tileShapes[ii] == 0 || tileShapes[ii] > 256)
         {
-            std::cerr << "buildNdTmaDescriptor: boxDim too large " << tileShapes[ii] << std::endl;
+            std::cerr << "buildNdTmaDescriptor: boxDim[" << ii << "] invalid: " << tileShapes[ii] << std::endl;
             assert(false);
         }
         else
         {
             boxDim[ii] = tileShapes[ii];
         }
     }

288-306: Type truncation in diagnostic loops (uint32_t vs uint64_t)

shapes and stridesInBytes are uint64_t vectors but the range-for variables are uint32_t, risking truncation in logs. Use uint64_t or auto/const refs.

-        ss << "shape:";
-        for (uint32_t shape_i : shapes)
+        ss << "shape:";
+        for (uint64_t shape_i : shapes)
         {
             ss << " " << shape_i;
         }
@@
-        ss << "stridesInBytes:";
-        for (uint32_t stride_i : stridesInBytes)
+        ss << "stridesInBytes:";
+        for (uint64_t stride_i : stridesInBytes)
         {
             ss << " " << stride_i;
         }
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/KernelParamsDecl.h (1)

306-311: Duplicate “Miscellaneous parameters” section header

This second “Miscellaneous parameters” header looks like an accidental copy and should be removed to avoid confusion.

Apply this diff:

-    //////////////////////////////////////////////////////////////////////////////////////////////////
-    //
-    // Miscellaneous parameters.
-    //
-    //////////////////////////////////////////////////////////////////////////////////////////////////
-
🧹 Nitpick comments (15)
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/KernelRunner.h (1)

31-36: A/B dtype split is good; document Void semantics and ensure callers set both when needed

The eltTypeA/eltTypeB split fixes the prior limitation. Please add a short comment clarifying that eltTypeB=Void acts as a wildcard (no restriction) so future readers don’t assume it’s an error. Also ensure call sites that require B filtering explicitly set eltTypeB.
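
A minimal sketch of how that could be documented (the struct and field names are assumptions based on this comment, and the small enum is a stand-in for trtllm::gen::Dtype so the snippet is self-contained):

// Stand-in for trtllm::gen::Dtype, only so this sketch compiles on its own.
namespace tg { enum class Dtype { Void, E4m3, E2m1 }; }

struct GemmRunnerOptions
{
    // Dtype of operand A; always used to filter candidate kernel configs.
    tg::Dtype eltTypeA{tg::Dtype::E2m1};
    // Dtype of operand B. Dtype::Void acts as a wildcard (no restriction on B),
    // not an error; callers that need B filtering must set this explicitly.
    tg::Dtype eltTypeB{tg::Dtype::Void};
};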

cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/TmaDescriptor.h (2)

91-112: Swizzle selection: add explicit handling for 16B-aligned FP16/BF16 tiles or improve diagnostic

The 16B path is currently only allowed for UE8m0/E4m3. If other dtypes ever produce 16B tiles (e.g., small tiles with Fp16/Bfloat16), this asserts. Consider relaxing to the NONE swizzle whenever the fastest-dim tile size is a multiple of 16 bytes for any TMA-legal dtype, or at least print the dtype in the error to aid triage.

-        else if ((fastestDimTileSizeBytes % 16) == 0 && (dtype == tg::Dtype::UE8m0 || dtype == tg::Dtype::E4m3))
+        else if ((fastestDimTileSizeBytes % 16) == 0 && (dtype == tg::Dtype::UE8m0 || dtype == tg::Dtype::E4m3))
         {
             swizzleType = CU_TENSOR_MAP_SWIZZLE_NONE;
         }
         else
         {
-            std::cerr << "buildNdTmaDescriptor: unexpected fastestDimTileSizeBytes " << fastestDimTileSizeBytes
-                      << std::endl;
+            std::cerr << "buildNdTmaDescriptor: unexpected fastestDimTileSizeBytes " << fastestDimTileSizeBytes
+                      << " for dtype " << tg::dtypeToString(dtype) << std::endl;
             assert(false);
         }

184-221: Improve error diagnostics: include CUDA error string

You fetch errorString but don’t print it. Add it for faster triage.

-        char const* errorString;
-        cuGetErrorString(result, &errorString);
+        char const* errorString = nullptr;
+        cuGetErrorString(result, &errorString);
@@
-        ss << "Error: Failed to initialize the TMA descriptor " << result << std::endl;
+        ss << "Error: Failed to initialize the TMA descriptor (" << result << "): "
+           << (errorString ? errorString : "unknown") << std::endl;
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/config.json (2)

251-279: New GemmFp4xFp8 template: add basic constraints doc and defaults

Nice addition. Consider documenting the interplay of dtypeA=e2m1 with dtypeMmaA=e4m3 and sf* fields (block size/layout expectations), and whether transposeMmaOutput must be true for all low-latency variants. This helps future edits avoid invalid combinations.


352-358: Test coverage for GemmFp4xFp8 across tileN set

Configs now span tileN 8..128. Please add/extend unit tests to exercise at least a representative subset (e.g., N=8, 32, 128 with dtypeC in [bf16, fp16, e4m3]) to guard against layout/descriptor regressions in the new path.

I can draft a small pytest/torch harness that invokes the TRT-LLM op with these configs and compares against a reference matmul. Want me to open a follow-up PR with tests?

cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/KernelTraits.h (4)

93-104: Reuse flag accessor should return bool

getFirstChunkReuseFlagByName returns int while flags are bool. Prefer bool for clarity (or document reason if an int is required by downstream ABI).

-    int getFirstChunkReuseFlagByName(std::string const& name) const
+    bool getFirstChunkReuseFlagByName(std::string const& name) const

410-427: smemConstSfBuf 512B magic constant — consider constexpr with brief rationale

Hard-coded 512 may surprise future maintainers. Introduce a constexpr (e.g., kSMEM_CONST_SF_BUF_BYTES = 512) with a short comment why 512 is sufficient (e.g., for 2x64 blocks x lanes).
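
Something along these lines, using the name suggested above (the rationale comment is illustrative and should be checked against the kernel's actual SF layout before adopting it verbatim):

#include <cstdint>

// Bytes reserved in SMEM for the constant scaling-factor buffer.
// Assumed rationale: covers 2 x 64-element SF blocks per tile; verify against
// the kernel's real SF layout.
static constexpr int32_t kSMEM_CONST_SF_BUF_BYTES = 512;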


472-478: A in TMEM when casting or slice-K: validate denominator and comment intent

Good: using TMEM when casting A or with slice-K. Add a brief comment explaining the division by (numSlicesForSliceK * 32/bitWidth), and assert numSlicesForSliceK >= 1 to avoid accidental zero division if callers pass 0.


641-646: Return type of isSmemAbRepurposedToGmemC should align with underlying flag

The helper currently returns int; consider bool for readability unless an int is required by device ABI where you consume it.

cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/KernelRunner.cpp (2)

49-54: Config filter now respects eltTypeA/B — refine error message and consider logging counts

The gating looks correct. Update the error text below to reflect that no kernel matches A/B/output types (not just outputType). Optionally, log how many configs were filtered out for each criterion to aid debugging.

-    TLLM_CHECK_WITH_INFO(mPassingConfigIndices.size() != 0, "No kernel found for the given output type");
+    TLLM_CHECK_WITH_INFO(
+        mPassingConfigIndices.size() != 0,
+        "No kernel found for the given A/B/output dtype combination and flags (transposeMmaOutput, deepSeekFp8)");

118-120: Pass usePDL: add inline comment for readability

Add a param comment so future maintainers don’t misread the boolean.

-    auto const err = gemm.run(config, workspace, gemmData, static_cast<void*>(stream), multiProcessorCount,
-        tensorrt_llm::common::getEnvEnablePDL(), globalTrtllmGenGemmModuleCache);
+    auto const err = gemm.run(config, workspace, gemmData, static_cast<void*>(stream), multiProcessorCount,
+        /*usePdl=*/tensorrt_llm::common::getEnvEnablePDL(), globalTrtllmGenGemmModuleCache);
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/GemmInterface.h (3)

516-518: Prefer logical OR over bitwise OR on booleans

Use || instead of | for boolean composition. Current expression compiles but is non-idiomatic and could mask intent.

Apply this diff:

-        usePdl
-            && (config.mOptions.mGridWaitForPrimaryEarlyExit | config.mOptions.mGridWaitForPrimaryA
-                | config.mOptions.mGridWaitForPrimaryB));
+        usePdl
+            && (config.mOptions.mGridWaitForPrimaryEarlyExit || config.mOptions.mGridWaitForPrimaryA
+                || config.mOptions.mGridWaitForPrimaryB));

164-168: Type-safety for mPtrScaleC: avoid void* and C-style casts

mPtrScaleC is documented as a single float on device. Define it as float const* in GemmData::InputBuffers and drop the cast at call site.

Apply this diff here:

-        void* mPtrScaleC{nullptr};
+        float const* mPtrScaleC{nullptr};

And this diff at the call site:

-        (float*) data.mInputBuffers.mPtrScaleC, dSplitKSlices, data.mAllReduceBuffers.mPtrTileBars,
+        const_cast<float*>(data.mInputBuffers.mPtrScaleC), dSplitKSlices, data.mAllReduceBuffers.mPtrTileBars,

Rationale: aligns types across the API surface and replaces the C-style cast with an explicit const_cast.

Also applies to: 442-447


412-415: (Nit) Remove unused-variable casts when they are actually used under TLLM_GEN_EXPORT_INTERFACE

(void) usePdl; (void) moduleCache; are immediately used under #ifdef TLLM_GEN_EXPORT_INTERFACE. Consider moving these casts inside the #else branch to silence warnings only when the symbols are unused.

Example:

-    (void) usePdl;
-    (void) moduleCache;
+#ifndef TLLM_GEN_EXPORT_INTERFACE
+    (void) usePdl;
+    (void) moduleCache;
+#endif
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/KernelParams.h (1)

25-28: “Dependency free” note is misleading in this file

This header includes TmaDescriptor.h, which pulls in CUDA-dependent types. The “keep this code dependency free... NVRTC” note applies to KernelParamsDecl.h, not this file. Consider removing or rewording to avoid confusion.

Suggested wording: “KernelParamsDecl.h is kept dependency-free for NVRTC inclusion. This header builds descriptors and requires host-side CUDA headers.”

@sychen52 sychen52 requested a review from a team as a code owner August 15, 2025 20:57
@sychen52 sychen52 requested a review from hlu1 August 15, 2025 20:57
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (4)
tensorrt_llm/quantization/mode.py (2)

238-255: from_description: expose the new flag in error messages.

Add use_w4a8_nvfp4_fp8 to the raise_error f-string to simplify debugging unsupported combinations.

Apply this patch within the f-string:

              f"{use_nvfp4=}, "
+             f"{use_w4a8_nvfp4_fp8=}, "
              f"{use_w4a8_qserve=}, "

433-466: Expose the new mode in to_dict for downstream configs.

to_dict currently lacks an 'enable_w4a8_nvfp4_fp8' key. Several config/dumping flows rely on this structure; add it to maintain feature parity with other modes.

You can update to_dict like this (outside the changed hunk):

def to_dict(self):
    return {
        # ...
        'enable_nvfp4': self.has_nvfp4(),
        'enable_w4a8_nvfp4_fp8': self.has_w4a8_nvfp4_fp8(),  # add this
        'enable_w4a8_mxfp4_fp8': self.has_w4a8_mxfp4_fp8(),
        'enable_w4a8_mxfp4_mxfp8': self.has_w4a8_mxfp4_mxfp8(),
        'enable_w4a16_mxfp4': self.has_w4a16_mxfp4(),
        # ...
    }
tensorrt_llm/_torch/modules/linear.py (2)

916-918: Line too long (E501): wrap comment onto a separate line.

Minor style fix flagged by Ruff.

Apply:

-                assert ws.dtype == torch.float8_e4m3fn  # TODO: or e8m0 for mxfp4 recipe?
+                # TODO: or e8m0 for mxfp4 recipe?
+                assert ws.dtype == torch.float8_e4m3fn

1-3: Missing NVIDIA SPDX license header at file top.

Other files include the SPDX header. Please prepend the standard NVIDIA header to this source file.

For consistency, add:

# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 8540922 and 30e5faa.

📒 Files selected for processing (2)
  • tensorrt_llm/_torch/modules/linear.py (2 hunks)
  • tensorrt_llm/quantization/mode.py (7 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/quantization/mode.py
  • tensorrt_llm/_torch/modules/linear.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/quantization/mode.py
  • tensorrt_llm/_torch/modules/linear.py
🧬 Code Graph Analysis (1)
tensorrt_llm/_torch/modules/linear.py (8)
tensorrt_llm/_utils.py (2)
  • dtype (935-936)
  • dtype (943-953)
tensorrt_llm/runtime/generation.py (2)
  • dtype (818-819)
  • dtype (1221-1222)
tensorrt_llm/python_plugin.py (2)
  • dtype (258-259)
  • dtype (262-274)
tensorrt_llm/quantization/utils/fp4_utils.py (3)
  • pad_up (22-23)
  • shuffle_matrix_a (149-158)
  • shuffle_matrix_sf_a (183-206)
tensorrt_llm/module.py (1)
  • register_parameter (186-190)
tensorrt_llm/_torch/distributed/communicator.py (2)
  • tp_size (46-47)
  • tp_rank (54-55)
tensorrt_llm/mapping.py (1)
  • tp_rank (338-339)
tensorrt_llm/quantization/mode.py (1)
  • has_w4a8_nvfp4_fp8 (185-186)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/linear.py

876-876: Undefined name model

(F821)


897-897: Line too long (132 > 120)

(E501)

🔇 Additional comments (9)
tensorrt_llm/quantization/mode.py (6)

94-96: New QuantMode flag added: verify bit ordering stays in sync with C++.

Adding W4A8_NVFP4_FP8 between existing flags shifts subsequent bit positions if C++ relies on ordinal order. Confirm the C++ IntFlag definitions are updated identically.

Same script as above should help spot any mismatch in the cpp/tensorrt_llm/common/quantization.h.


185-187: New predicate is correct and consistent.

QuantMode.has_w4a8_nvfp4_fp8() is implemented consistently with other predicates.


204-216: Include W4A8_NVFP4_FP8 in has_any_quant: LGTM.

This keeps feature detection coherent for the new mode.


324-326: Flag-setting for the new mode is straightforward.

Setting QuantMode.W4A8_NVFP4_FP8 when requested is correct.


413-415: from_quant_algo routing: LGTM.

QuantAlgo.W4A8_NVFP4_FP8 is mapped to from_description(use_w4a8_nvfp4_fp8=True) as expected.


43-45: Ensure C++ QuantAlgo enum is synchronized

I wasn’t able to find an enum class QuantAlgo in the C++ headers under

  • cpp/include/tensorrt_llm/common/quantization.h
  • cpp/tensorrt_llm/kernels/quantization.h

Please manually verify that:

  • The C++ QuantAlgo enum includes the new members in the same order (see the sketch after this list):
    • W4A8_NVFP4_FP8
    • W4A8_MXFP4_FP8
    • W4A8_MXFP4_MXFP8
  • Any serialization/deserialization logic (e.g., switch-cases, mappings) is updated to align integer values across Python and C++.
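
For illustration only, a synchronized definition would keep the three new members adjacent and in the same relative order as the Python enum. The enum name, underlying type, and surrounding members below are assumptions, since the actual C++ definition was not located:

#include <cstdint>

// Hypothetical sketch of the expected ordering, not the actual header contents.
enum class QuantAlgo : int32_t
{
    // ... existing algorithms ...
    W4A8_NVFP4_FP8,
    W4A8_MXFP4_FP8,
    W4A8_MXFP4_MXFP8,
    // ... remaining algorithms ...
};
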
tensorrt_llm/_torch/modules/linear.py (3)

823-863: New W4A8NVFP4FP8LinearMethod scaffolding reads clean.

The state carried (epilogue_tile_m, scaling_vector_size, weight, weight_scale, input_scale, weight_scale_2, alpha, bias) matches the intended FP4xFP8 path. Good use of pad_up for scale sizing.


932-955: Weight/scale shuffling and parameter copies look consistent.

Shuffle paths via shuffle_matrix_a and shuffle_matrix_sf_a and subsequent copies are aligned with the kernel expectations (epilogue_tile_m and scaling_vector_size). The fused paths also perform the required concatenation before shuffling.


1668-1669: Routing to the new quant method added correctly.

get_quant_method now dispatches to W4A8NVFP4FP8LinearMethod when the flag is set.

@sychen52 sychen52 changed the title [None][feat] Add NVFP4 x FP8 [OMNIML-2336][feat] Add NVFP4 x FP8 Aug 19, 2025
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/quantization/mode.py (1)

94-101: Critical: Sync and Preserve Bit Positions for W4A8_NVFP4_FP8

  • tensorrt_llm/quantization/mode.py
    • Move

    # W4A8 NVFP4
    W4A8_NVFP4_FP8 = auto()

    so it sits immediately before

    # The smallest power-of-two that is not used by a flag. Do not call auto() after that line.
    COUNT = auto()

    This ensures it’s assigned the next free bit instead of shifting the existing flags.

  • cpp/include/tensorrt_llm/common/quantization.h
    • Insert between nvfp4() (1u << 12) and w4a8Mxfp4Fp8() (1u << 14):

    // W4A8 NVFP4
    static constexpr QuantMode w4a8Nvfp4Fp8() noexcept {
        return QuantMode(BaseType(1u) << 13);
    }
    constexpr bool hasW4a8Nvfp4Fp8() const noexcept {
        return isSet(w4a8Nvfp4Fp8());
    }

    • Update all overloads of fromDescription (and any helpers) to accept a useW4a8Nvfp4Fp8 bool and pass it into bit 13.

After these changes, bit assignments will be:
12 = nvfp4(), 13 = w4a8Nvfp4Fp8(), 14 = w4a8Mxfp4Fp8(), 15 = w4a8Mxfp4Mxfp8(), 16 = w4a16Mxfp4() – matching Python’s IntFlag and preserving backward compatibility.

🧹 Nitpick comments (4)
tensorrt_llm/quantization/mode.py (1)

238-255: Add missing argument to raise_error() for better debuggability.

use_w4a8_nvfp4_fp8 is not listed in the error details, making invalid-combo debugging harder.

         def raise_error():
             raise ValueError(f"Unsupported combination of QuantMode args: "
@@
                              f"{use_fp8_rowwise=}, "
                              f"{use_nvfp4=}, "
+                             f"{use_w4a8_nvfp4_fp8=}, "
                              f"{use_w4a8_qserve=}, "
tensorrt_llm/_torch/modules/linear.py (3)

1667-1670: Prefer the specific W4A8_NVFP4_FP8 path over plain NVFP4 when both flags are set.

If a caller sets both NVFP4 and W4A8_NVFP4_FP8 bits (possible via from_description), the current ordering routes to NVFP4LinearMethod, which is less specific. Route to W4A8NVFP4FP8LinearMethod first.

-    if quant_config.layer_quant_mode.has_nvfp4():
-        return NVFP4LinearMethod()
-    if quant_config.layer_quant_mode.has_w4a8_nvfp4_fp8():
-        return W4A8NVFP4FP8LinearMethod()
+    if quant_config.layer_quant_mode.has_w4a8_nvfp4_fp8():
+        return W4A8NVFP4FP8LinearMethod()
+    if quant_config.layer_quant_mode.has_nvfp4():
+        return NVFP4LinearMethod()

Also applies to: 1661-1668


902-929: Long line exceeds style limit flagged by Ruff — wrap assertion.

Line length >120 (E501). Wrap the assertion message for readability and to satisfy linters.

-                    assert input_scale == w["input_scale"][...], "The input_scale should be same for all the weights"
+                    assert input_scale == w["input_scale"][...], (
+                        "The input_scale should be same for all the weights"
+                    )

Repeat similarly for the weight_scale_2 assertion below.


1-13: Missing NVIDIA SPDX header at file top.

Per repo guidelines, prepend the SPDX and copyright header.

+# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0

Place these lines before any imports.

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 30e5faa and 47fefb3.

📒 Files selected for processing (2)
  • tensorrt_llm/_torch/modules/linear.py (3 hunks)
  • tensorrt_llm/quantization/mode.py (8 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (2)
tensorrt_llm/_torch/modules/linear.py (2)
tensorrt_llm/quantization/utils/fp4_utils.py (3)
  • pad_up (22-23)
  • shuffle_matrix_a (149-158)
  • shuffle_matrix_sf_a (183-206)
tensorrt_llm/quantization/mode.py (1)
  • has_w4a8_nvfp4_fp8 (185-186)
tensorrt_llm/quantization/mode.py (2)
tensorrt_llm/_torch/modules/linear.py (1)
  • has_w4a8_nvfp4_fp8 (1825-1828)
cpp/include/tensorrt_llm/common/quantization.h (2)
  • QuantMode (28-440)
  • QuantMode (34-35)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/linear.py

895-895: Line too long (132 > 120)

(E501)

🔇 Additional comments (8)
tensorrt_llm/quantization/mode.py (5)

185-187: Predicate addition looks good.

has_w4a8_nvfp4_fp8 aligns with the new flag and mirrors existing helpers.


212-215: Include the new mode in has_any_quant — good.

Including W4A8_NVFP4_FP8 in has_any_quant keeps feature detection consistent.


324-326: Setting the new flag in from_description is fine.

No conflicts with existing constraints. Note that setting both use_nvfp4=True and use_w4a8_nvfp4_fp8=True will set both flags, which impacts get_quant_method routing (see comment in linear.py).

Would you ever call from_description with both use_nvfp4=True and use_w4a8_nvfp4_fp8=True? If yes, we should prefer the “more specific” path in get_quant_method (see proposed change there).


413-415: from_quant_algo mapping added — good.

The new QuantAlgo.W4A8_NVFP4_FP8 is recognized and routed via from_description(use_w4a8_nvfp4_fp8=True), avoiding accidental NVFP4 bit setting.


453-455: to_dict entry added — good.

This keeps surfaces (serialization/inspection) in sync.

tensorrt_llm/_torch/modules/linear.py (3)

823-887: New W4A8NVFP4FP8 path: overall structure is sound.

  • Properly allocates NVFP4-packed weights and per-block scales.
  • Tracks input_scale, weight_scale_2, alpha; dynamic path recomputes alpha.

Two follow-ups are addressed in comments below (weight_scale dtype handling and routing precedence in get_quant_method).


880-883: Kernel input expects FP8-view of weight_scale — OK at apply-time.

Viewing module.weight_scale as torch.float8_e4m3fn here makes sense if the underlying parameter stores the 4-bit scale payload (see load path below).


864-877: Alpha handling for dynamic quantization — double-check dtype and device.

alpha = module.weight_scale_2 * input_scale.to(torch.float32) assumes both are on the same device. If weight_scale_2 lives on CUDA and input_scale is produced on CPU in some paths, add a .to(module.weight_scale_2.device) to avoid device mismatch.

Do you expect torch.ops.tensorrt_llm.quantize_e4m3_per_tensor(input) to return the scale on the same device as input (CUDA)? If yes, we’re safe. Otherwise, we should align devices.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (4)
tensorrt_llm/quantization/mode.py (2)

237-255: Include new argument in raise_error diagnostic

from_description added use_w4a8_nvfp4_fp8 but raise_error omits it from the diagnostics, making debugging harder.

Apply:

         def raise_error():
             raise ValueError(f"Unsupported combination of QuantMode args: "
                              f"{quantize_weights=}, "
                              f"{quantize_activations=}, "
                              f"{per_token=}, "
                              f"{per_channel=}, "
                              f"{per_group=}, "
                              f"{use_int4_weights=}, "
                              f"{use_int8_kv_cache=}, "
                              f"{use_fp8_kv_cache=}, "
                              f"{use_fp8_qdq=}, "
                              f"{use_fp8_block_scales=}, "
                              f"{use_fp8_rowwise=}, "
                              f"{use_nvfp4=}, "
+                             f"{use_w4a8_nvfp4_fp8=}, "
                              f"{use_w4a8_qserve=}, "
                              f"{use_w4a8_mxfp4_fp8=}, "
                              f"{use_w4a8_mxfp4_mxfp8=}, "
                              f"{use_w4a16_mxfp4=}")

413-415: from_quant_algo mapping is minimal; consider encoding W/A intent

Optional: For introspection tools that rely on INT4_WEIGHTS/ACTIVATIONS flags, consider mapping W4A8_NVFP4_FP8 to a mode that also sets quantize_weights=True, quantize_activations=True, use_int4_weights=True. Current routing works since get_quant_method checks the new flag directly, but auxiliary logic (e.g., is_weight_only()) will not reflect W4A8.

Example:

-        elif quant_algo == QuantAlgo.W4A8_NVFP4_FP8:
-            quant_mode = QuantMode.from_description(use_w4a8_nvfp4_fp8=True)
+        elif quant_algo == QuantAlgo.W4A8_NVFP4_FP8:
+            quant_mode = QuantMode.from_description(
+                quantize_weights=True,
+                quantize_activations=True,
+                use_int4_weights=True,
+                use_w4a8_nvfp4_fp8=True,
+            )
tensorrt_llm/_torch/modules/linear.py (2)

825-863: Create-weights shape/dtype choices look sound; add a short docstring for scale semantics

Initialization matches kernel expectations (fp4_e2m1x2 weights, padded float4_sf for per-block scales, scalar FP32 scales). Consider adding a brief docstring on the expected meanings/units of input_scale, weight_scale, weight_scale_2, and alpha to avoid confusion with the NVFP4 path (which uses 448*6/amax conventions).


888-933: Tighten long lines and clarify invariants in load_weight_scales

  • The divide-by-6.0 normalization is justified by the kernel’s nvfp4->e4m3 conversion; thanks for the inline comment.
  • Two lines exceed 120 characters; reflow to satisfy Ruff E501 without changing behavior.

Apply:

-                    assert input_scale == w["input_scale"][...], "The input_scale should be same for all the weights"
+                    assert input_scale == w["input_scale"][...], (
+                        "The input_scale should be same for all the weights"
+                    )
@@
-                    assert weight_scale_2 == w["weight_scale_2"][...] * 6.0, (
-                        f"The weight_scale_2 should be same for all the weights: {weight_scale_2} vs. {w['weight_scale_2']}*6"
-                    )
+                    assert weight_scale_2 == w["weight_scale_2"][...] * 6.0, (
+                        "The weight_scale_2 should be same for all the weights: "
+                        f"{weight_scale_2} vs. {w['weight_scale_2']}*6"
+                    )

Additionally, if you observe minor numerical jitter across shards, consider switching to torch.allclose for these equality checks.

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between ef5ed4b and 996e9ec.

📒 Files selected for processing (2)
  • tensorrt_llm/_torch/modules/linear.py (3 hunks)
  • tensorrt_llm/quantization/mode.py (8 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (2)
tensorrt_llm/quantization/mode.py (2)
tensorrt_llm/_torch/modules/linear.py (1)
  • has_w4a8_nvfp4_fp8 (1826-1829)
cpp/include/tensorrt_llm/common/quantization.h (2)
  • QuantMode (28-440)
  • QuantMode (34-35)
tensorrt_llm/_torch/modules/linear.py (2)
tensorrt_llm/quantization/utils/fp4_utils.py (3)
  • pad_up (22-23)
  • shuffle_matrix_a (149-158)
  • shuffle_matrix_sf_a (183-206)
tensorrt_llm/quantization/mode.py (1)
  • has_w4a8_nvfp4_fp8 (185-186)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/linear.py

895-895: Line too long (132 > 120)

(E501)


925-925: Line too long (126 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (11)
tensorrt_llm/quantization/mode.py (5)

43-43: New QuantAlgo W4A8_NVFP4_FP8 looks good

Enumeration addition is consistent with the rest of QuantAlgo options.


185-187: Public accessor has_w4a8_nvfp4_fp8 is clear

Accessor naming and semantics align with existing helpers.


204-213: has_any_quant correctly includes the new flag

Including W4A8_NVFP4_FP8 in the has_any_quant mask is necessary for routing.


324-326: Bit wiring in from_description is correct

Setting the W4A8_NVFP4_FP8 flag independently is fine given get_quant_method routes on this flag.


453-455: Expose enable_w4a8_nvfp4_fp8 in to_dict

Public serialization reflecting the new flag is good.

tensorrt_llm/_torch/modules/linear.py (6)

864-887: Apply path is consistent; dynamic alpha handling is correct

  • Uses static FP8 quant when input_scale is provided; falls back to dynamic and recomputes alpha = weight_scale_2 * input_scale.
  • Reinterprets weight_scale to FP8 only at call site, leaving storage as float4_sf_dtype.

934-956: Vanilla load path and swizzling are correct

Weight shuffle and scale shuffling align with epilogue_tile_m and scaling_vector_size; alpha/input_scale propagation is consistent.


957-982: Fused QKV: concatenation and scale shuffling LGTM

Correct order (q, k, v) and shuffling after concatenation before storing.


983-1006: Fused Gate-Up: weight shuffling and scale handling LGTM

Mirrors the QKV path appropriately.


1669-1671: Wire-up in get_quant_method is correct and ordered safely

The check comes after NVFP4 and before MXFP4 branches, matching exclusivity assumptions.


1825-1829: New Linear.has_w4a8_nvfp4_fp8 property is consistent

Matches other boolean helpers and used by get_quant_method.

@sychen52 sychen52 requested a review from a team as a code owner August 21, 2025 20:40
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/attention_backend/trtllm.py (1)

1069-1082: Initialize all quant flags unconditionally to avoid AttributeError when quant_config is None.

has_fp8_block_wise, has_fp8_rowwise, and has_w4a8_nvfp4_fp8 are only set inside the quant_config block. If quant_config is None, accessing these in forward will raise. Initialize them to False alongside the other defaults.

     def update_quant_config(self, new_quant_config: Optional[QuantConfig]):
         self.quant_config = new_quant_config
         self.wrapper.update_quant_config(self.quant_config)
 
-        self.has_fp8_qdq = self.has_fp8_kv_cache = self.has_nvfp4 = False
+        self.has_fp8_qdq = self.has_fp8_kv_cache = self.has_nvfp4 = False
+        self.has_fp8_block_wise = False
+        self.has_fp8_rowwise = False
+        self.has_w4a8_nvfp4_fp8 = False
         if self.quant_config is not None:
             self.has_fp8_kv_cache = self.quant_config.layer_quant_mode.has_fp8_kv_cache(
             )
 
             self.has_fp8_qdq = self.quant_config.layer_quant_mode.has_fp8_qdq()
             self.has_fp8_block_wise = self.quant_config.layer_quant_mode.has_fp8_block_scales(
             )
             self.has_fp8_rowwise = self.quant_config.layer_quant_mode.has_fp8_rowwise(
             )
             self.has_nvfp4 = self.quant_config.layer_quant_mode.has_nvfp4()
             self.has_w4a8_nvfp4_fp8 = self.quant_config.layer_quant_mode.has_w4a8_nvfp4_fp8(
             )
🧹 Nitpick comments (5)
tensorrt_llm/_torch/modules/attention.py (1)

1-1: Add NVIDIA copyright header.

Per repository guidelines, prepend the current-year NVIDIA copyright header to Python sources.

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
tensorrt_llm/_torch/attention_backend/trtllm.py (1)

1-1: Add NVIDIA copyright header.

Per repository guidelines, prepend the current-year NVIDIA copyright header to Python sources.

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
tensorrt_llm/_torch/modules/linear.py (3)

1-1: Add NVIDIA copyright header.

Per repository guidelines, prepend the current-year NVIDIA copyright header to Python sources.

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.

867-885: Dynamic-quant path computes alpha on-the-fly — OK; minor readability tweak.

Computation is correct. Consider splitting the long call for readability.

-                fp8_input, input_scale = torch.ops.tensorrt_llm.quantize_e4m3_per_tensor(
-                    input)
+                fp8_input, input_scale = (
+                    torch.ops.tensorrt_llm.quantize_e4m3_per_tensor(input)
+                )

898-898: Wrap overlong lines to satisfy linters (Ruff E501).

Two lines exceed 120 chars. Wrap them to avoid CI noise.

-                ws = (ws.to(torch.float32) / 6.0).to(torch.float8_e4m3fn)
+                ws = (
+                    ws.to(torch.float32).div(6.0).to(torch.float8_e4m3fn)
+                )
-                    assert weight_scale_2 == w["weight_scale_2"][...] * 6.0, (
-                        f"The weight_scale_2 should be same for all the weights: {weight_scale_2} vs. {w['weight_scale_2']}*6"
-                    )
+                    expected = w["weight_scale_2"][...].mul(6.0)
+                    assert weight_scale_2 == expected, (
+                        "The weight_scale_2 should be same for all the weights"
+                    )

Also applies to: 928-928

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 996e9ec and 668a658.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/attention_backend/trtllm.py (2 hunks)
  • tensorrt_llm/_torch/modules/attention.py (2 hunks)
  • tensorrt_llm/_torch/modules/linear.py (3 hunks)
🧰 Additional context used
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/linear.py

898-898: Line too long (132 > 120)

(E501)


928-928: Line too long (126 > 120)

(E501)

🔇 Additional comments (5)
tensorrt_llm/_torch/modules/attention.py (1)

342-352: Correct out_scale logic for new quant case.

Adding has_fp8_rowwise and has_w4a8_nvfp4_fp8 to the has_quant_scale check is consistent with create_output. No issues here.

tensorrt_llm/_torch/attention_backend/trtllm.py (1)

1182-1191: Extend FP8 out-dtype condition — looks good.

Including has_w4a8_nvfp4_fp8 and has_fp8_rowwise in the FP8-output decision, still gated by has_fp8_kv_cache, matches the module-side gating.

tensorrt_llm/_torch/modules/linear.py (3)

823-890: New W4A8NVFP4FP8LinearMethod scaffold — OK.

Weight creation parameters, dtype choices, and basic apply path wiring look consistent with the new NVFP4 x FP8 kernel.


1673-1674: Quant method routing includes W4A8_NVFP4_FP8 — OK.

Factory correctly returns W4A8NVFP4FP8LinearMethod when enabled.


1829-1834: Public flag has_w4a8_nvfp4_fp8 — OK.

Property aligns with attention-side gating.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/modules/linear.py (1)

1-1: Add NVIDIA copyright header (2025) at file top.

Coding guidelines require the NVIDIA copyright header to be prepended to all source files. This file has none.

Apply:

+################################################################################
+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+# Licensed under the NVIDIA Source Code License.
+################################################################################
 from __future__ import annotations
🧹 Nitpick comments (4)
tensorrt_llm/_torch/modules/linear.py (4)

823-866: New NVFP4×FP8 linear: shapes and packing look correct; consider safer defaults for scales.

  • Weight packing (E2M1x2) and block-scale buffer (float4_sf_dtype) sizing using pad_up(..., 128/4) align with kernel expectations. Good.
  • Minor: input_scale and weight_scale_2 are created with uninitialized values (torch.empty). If a caller mistakenly runs apply() before load_weights, alpha and static quantization could use garbage. Prefer initializing to 1.0 or setting them to None and gating usage.

Example:

-        module.input_scale = Parameter(torch.empty([1], dtype=torch.float32),
+        module.input_scale = Parameter(torch.ones([1], dtype=torch.float32),
                                        requires_grad=False)
-        module.weight_scale_2 = Parameter(torch.empty([1], dtype=torch.float32),
+        module.weight_scale_2 = Parameter(torch.ones([1], dtype=torch.float32),
                                           requires_grad=False)
-        module.alpha = Parameter(torch.empty([1], dtype=torch.float32),
+        module.alpha = Parameter(torch.ones([1], dtype=torch.float32),
                                  requires_grad=False)

937-960: Guard reciprocal to avoid division-by-zero for inv_input_scale.

If an exported checkpoint has zeros or denormals in input_scale, 1.0 / input_scale can overflow to inf. Clamp to eps.

Apply:

-        copy_weight(module.inv_input_scale, 1.0 / input_scale)
+        eps = torch.finfo(torch.float32).eps
+        copy_weight(module.inv_input_scale, 1.0 / torch.clamp(input_scale, min=eps))

961-987: Fused QKV: same inv_input_scale robustness as vanilla; weight swizzle order LGTM.

  • Swizzle and shuffle order (concat → shuffle_matrix_sf_a for scales, shuffle_matrix_a for weights) is consistent.
  • Recommend the same clamp as above for inv_input_scale.

Apply:

-        copy_weight(module.inv_input_scale, 1.0 / input_scale)
+        eps = torch.finfo(torch.float32).eps
+        copy_weight(module.inv_input_scale, 1.0 / torch.clamp(input_scale, min=eps))

988-1013: Fused Gate-Up: mirror the inv_input_scale clamp.

Keeps attention FP8 output scaling numerically safe.

Apply:

-        copy_weight(module.inv_input_scale, 1.0 / input_scale)
+        eps = torch.finfo(torch.float32).eps
+        copy_weight(module.inv_input_scale, 1.0 / torch.clamp(input_scale, min=eps))
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 668a658 and 2825448.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/attention_backend/trtllm.py (2 hunks)
  • tensorrt_llm/_torch/modules/attention.py (2 hunks)
  • tensorrt_llm/_torch/modules/linear.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • tensorrt_llm/_torch/modules/attention.py
  • tensorrt_llm/_torch/attention_backend/trtllm.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • tensorrt_llm/_torch/modules/linear.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/linear.py

898-898: Line too long (132 > 120)

(E501)


928-928: Line too long (126 > 120)

(E501)

🔇 Additional comments (2)
tensorrt_llm/_torch/modules/linear.py (2)

1831-1836: Property accessor aligns with dispatch path.

Accessor matches the new dispatch branch and asserts weights created before use. LGTM.


1675-1676: Here’s a quick script to print out the underlying bit-mask (or enum) definitions for both flags. That will let us confirm whether they share any bits or are truly independent:

#!/bin/bash
# Find the mode definition file
mode_file=$(find tensorrt_llm/quantization -type f -name mode.py)
echo "--> $mode_file"

# Show NVFP4 definition
grep -n 'NVFP4[[:space:]]*=' "$mode_file"

# Show W4A8_NVFP4_FP8 definition
grep -n 'W4A8_NVFP4_FP8[[:space:]]*=' "$mode_file"

If the two constants map to distinct values, you’re safe—the has_nvfp4() check cannot shadow has_w4a8_nvfp4_fp8(). Otherwise, we’ll need to reorder or split them out.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/GemmInterface.h (2)

339-345: Integer truncation: std::accumulate init should be size_t

The init value 0 makes accumulate compute in int and return int, risking overflow on large workspaces.

-    auto size = std::accumulate(workspaceSizes.begin(), workspaceSizes.end(), 0);
+    auto size = std::accumulate(workspaceSizes.begin(), workspaceSizes.end(), size_t{0});

371-377: Potential overflow in numEltsSplitK computation

numEltsSplitK is inferred as int; the product can exceed 32-bit. Promote to int64_t explicitly.

-        auto numEltsSplitK = options.mNumSlicesForSplitK * numTilesM * numTilesN * options.mTileM * options.mTileN
-            * options.mNumSlicesForSliceK;
+        int64_t numEltsSplitK = static_cast<int64_t>(options.mNumSlicesForSplitK)
+            * static_cast<int64_t>(numTilesM) * static_cast<int64_t>(numTilesN)
+            * static_cast<int64_t>(options.mTileM) * static_cast<int64_t>(options.mTileN)
+            * static_cast<int64_t>(options.mNumSlicesForSliceK);
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/GemmOptions.h (1)

83-86: Accidental nested namespace gemm causes symbol qualification issues

There are two consecutive namespace gemm declarations, yielding gemm::gemm. This breaks qualified calls like gemm::toString and gemm::divUp from within this file and misplaces public symbols under gemm::gemm. Remove the extra namespace and one closing brace.

-namespace gemm
-{
-
-namespace gemm
-{
+namespace gemm
+{
@@
-} // namespace gemm
+// (removed extra closing; only one gemm namespace remains)

Optionally, simplify qualified calls inside the same namespace (see divUpMul comment).

Also applies to: 1351-1352

♻️ Duplicate comments (10)
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/config.json (1)

296-306: Split-K variants retained despite earlier guidance

There’s still a LowLatency FP8 block with numSlicesForSplitK=2 and clusterDimZ=2. A prior review asked to set both to 1 due to a known split-K issue and you replied “Done.” If the issue isn’t fully resolved, consider removing the split-K=2 entry for now to avoid flaky perf/correctness.

cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/GemmInterface.h (2)

251-253: Parameter name mismatch in declaration vs definition

Declaration uses GemmData const& options, definition uses GemmData const& data. Keep parameter names consistent per guidelines.

-    int32_t run(GemmConfig const& config, void* workspace, GemmData const& options, void* cudaStream,
+    int32_t run(GemmConfig const& config, void* workspace, GemmData const& data, void* cudaStream,
         int32_t multiProcessorCount, bool usePdl = true,
         std::optional<std::reference_wrapper<ModuleCache>> moduleCache = std::nullopt) const;

442-449: Symbol resolution: KernelParamsSetup namespace doesn’t match its declaration

This calls gemm::KernelParamsSetup::setKernelParams, but KernelParamsSetup is currently nested under gemm::gemm in KernelParams.h. Either flatten the namespaces in KernelParams.h/Decl.h or qualify the call as gemm::gemm::KernelParamsSetup here.

Minimal local fix:

-    auto kernelParams = gemm::KernelParamsSetup::setKernelParams(options, data.mInputBuffers.mPtrA,
+    auto kernelParams = gemm::gemm::KernelParamsSetup::setKernelParams(options, data.mInputBuffers.mPtrA,

Preferred: remove the redundant inner namespace gemm in the KernelParams headers so callers remain gemm::KernelParamsSetup.

cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/TmaDescriptor.h (2)

154-170: Validate tileShapes[0] and avoid silent clamping

boxDim[0] is clamped to <= numEltsIn128B. Silent clamping can hide misconfigurations and produce unexpected perf. Validate that tileShapes[0] is in (0, numEltsIn128B], and fail fast if not.

-    auto const numEltsInClampedFastestTileSize = std::min(numEltsIn128B, tileShapes[0]);
+    if (tileShapes.empty() || tileShapes[0] <= 0 || tileShapes[0] > numEltsIn128B) {
+        std::cerr << "buildNdTmaDescriptor: tileShapes[0] (" << (tileShapes.empty()? -1 : tileShapes[0])
+                  << ") must be in (0, " << numEltsIn128B << "]" << std::endl;
+        assert(false);
+    }
+    auto const numEltsInClampedFastestTileSize = tileShapes[0];

295-306: Type mismatch in diagnostic loops (shapes/strides use uint64_t, not uint32_t)

Both loops truncate 64-bit values into 32-bit. Use uint64_t (or const auto&).

-        ss << "shape:";
-        for (uint32_t shape_i : shapes)
+        ss << "shape:";
+        for (uint64_t shape_i : shapes)
         {
             ss << " " << shape_i;
         }
@@
-        ss << "stridesInBytes:";
-        for (uint32_t stride_i : stridesInBytes)
+        ss << "stridesInBytes:";
+        for (uint64_t stride_i : stridesInBytes)
         {
             ss << " " << stride_i;
         }
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/KernelParams.h (1)

32-34: Duplicate namespace gemm nesting — breaks symbol lookup

This header declares namespace gemm inside namespace gemm, placing symbols under gemm::gemm. Call sites (e.g., GemmInterface) use gemm::. Remove the redundant inner namespace to flatten to a single gemm.

-namespace gemm
-{
-
-namespace gemm
-{
+namespace gemm
+{
@@
-} // namespace gemm
-
-} // namespace gemm
+} // namespace gemm

Also applies to: 338-343

cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/KernelTraits.h (2)

80-91: Name-based chunk lookup: add metadata consistency checks; avoid throwing across ABI

Add asserts that vectors are sized consistently; consider returning a sentinel (-1) instead of throwing in headers to avoid exception crossing lib boundaries.

     int32_t getChunkOffsetByName(std::string const& name) const
     {
+        assert(mNumBytesAndAlignmentPerSmemChunk.size() == mFirstChunkReuse.size());
+        assert(mSmemChunkNames.size() == mFirstChunkReuse.size());
         for (size_t ii = 0; ii < mSmemChunkNames.size(); ++ii)
         {
             if (mSmemChunkNames[ii] == name)
             {
                 return getChunkOffset(ii);
             }
         }
-        throw std::runtime_error("Name not found: " + name);
+        throw std::runtime_error("SMEM chunk name not found: " + name);
     }

124-136: Guard against infinite recursion when the first chunk is marked reusable

If mFirstChunkReuse[0] is ever true, getChunkOffset(0) recurses. Add a guard.

-        if (mFirstChunkReuse[ii])
+        if (mFirstChunkReuse[ii])
         {
-            // Reuse the offset of the 0th chunk.
-            return getChunkOffset(0);
+            // Reuse the offset of the 0th chunk.
+            if (ii == 0)
+            {
+                assert(false && "First chunk cannot be marked reusable");
+                return 0;
+            }
+            return getChunkOffset(0);
         }
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/GemmOptions.h (2)

54-77: LOG macros unconditionally return false and cause unintended early returns in export builds

As previously noted, TLLM_LOG_INFO and TLLM_LOG_WARNING expand to TLLM_CHECK_WARNING(false, ...), which returns false from checkAndUpdateGemmOptions on informative messages. This breaks flows where you intend to warn and auto-correct (e.g., fixing MmaK/MmaM, setting dtypeC, epilogue tiles). Rework LOG macros to only print; reserve CHECK macros for control flow.

Apply:

 #define TLLM_CHECK_ERROR(cond, ...)                                                                                    \
     if (!(cond))                                                                                                       \
     {                                                                                                                  \
         printArgs(__VA_ARGS__);                                                                                        \
-        printArgs("\n");                                                                                               \
+        printArgs("\n");                                                                                               \
         return false;                                                                                                  \
     }
 
-#define TLLM_LOG_ERROR(...) TLLM_CHECK_ERROR(false, __VA_ARGS__)
-
 #define TLLM_CHECK_ERROR_FMT(cond, ...) TLLM_CHECK_ERROR(cond, __VA_ARGS__)
 
-#define TLLM_CHECK_WARNING(cond, ...)                                                                                  \
-    if (!(cond))                                                                                                       \
-    {                                                                                                                  \
-        printArgs(__VA_ARGS__);                                                                                        \
-        printArgs("\n");                                                                                               \
-        return false;                                                                                                  \
-    }
-
-#define TLLM_LOG_WARNING(...) TLLM_CHECK_WARNING(false, __VA_ARGS__)
-
-#define TLLM_LOG_INFO(...) TLLM_CHECK_WARNING(false, __VA_ARGS__)
+#define TLLM_CHECK_WARNING(cond, ...)                                                                                  \
+    if (!(cond))                                                                                                       \
+    {                                                                                                                  \
+        printArgs(__VA_ARGS__);                                                                                        \
+        printArgs("\n");                                                                                               \
+        /* no early return on warnings */                                                                              \
+    }
+
+#define TLLM_LOG_ERROR(...)    do { printArgs(__VA_ARGS__); printArgs("\n"); } while(0)
+#define TLLM_LOG_WARNING(...)  do { printArgs(__VA_ARGS__); printArgs("\n"); } while(0)
+#define TLLM_LOG_INFO(...)     do { printArgs(__VA_ARGS__); printArgs("\n"); } while(0)

1279-1284: Bug: condition checks mLayoutB twice; should check A and B

The intent is to gate these features when the matrices are not both K-major, but the current code checks mLayoutB twice.

-    if (options.mLayoutB != MatrixLayout::MajorK || options.mLayoutB != MatrixLayout::MajorK)
+    if (options.mLayoutA != MatrixLayout::MajorK || options.mLayoutB != MatrixLayout::MajorK)
     {
         TLLM_CHECK_ERROR(isBlackwell, "Non K-major layouts are only supported on Blackwell");
         TLLM_CHECK_ERROR(options.mSplitK == SplitK::None, "Non K-major layouts do not support split K");
     }
🧹 Nitpick comments (23)
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Bfloat16_E2m1E4m3_castE4m3_Fp32_sfBlk32_t128x8x512_s2_et128x8_m128x8x32_cga1x1x1_16dp256b_TN_transOut_schedS_sm100a_cubin.cpp (2)

1-3: Fail fast in configure/build if LFS pointers are not hydrated.

To avoid confusing compiler errors when the working tree still contains LFS pointers, add a CMake-time guard that detects and errors out if any cubin .cpp is a pointer.

Example CMake addition (adjust the glob as needed):

+## Ensure Git LFS is installed and cubin sources are hydrated
+message(STATUS "Checking Git LFS and cubin sources...")
+find_program(GIT_EXE git REQUIRED)
+execute_process(COMMAND "${GIT_EXE}" lfs version
+                RESULT_VARIABLE LFS_RC
+                OUTPUT_QUIET ERROR_QUIET)
+if(NOT LFS_RC EQUAL 0)
+  message(FATAL_ERROR "Git LFS is required to build cubin-wrapped sources. Please install Git LFS.")
+endif()
+
+# Detect unhydrated LFS pointers among cubin .cpp sources
+file(GLOB_RECURSE TRTLLMGEN_CUBIN_SRCS
+     "${CMAKE_SOURCE_DIR}/cpp/tensorrt_llm/kernels/trtllmGenKernels/**/cubins/*.cpp")
+foreach(_src IN LISTS TRTLLMGEN_CUBIN_SRCS)
+  file(READ "${_src}" _content LIMIT 100)
+  if("${_content}" MATCHES "https://git-lfs.github.com/spec/v1")
+    message(FATAL_ERROR
+      "Unhydrated Git LFS pointer detected: ${_src}\n"
+      "Run: git lfs pull && git lfs checkout")
+  endif()
+endforeach()

1-3: Ensure .gitattributes tracks these cubin wrapper .cpp files via LFS.

If not already present, add/verify a dedicated LFS rule so these large generated .cpp wrappers don’t bloat the Git object store and are consistently smudged.

Proposed .gitattributes addition:

+# TensorRT-LLM generated cubin wrapper sources (store via Git LFS)
+cpp/tensorrt_llm/kernels/trtllmGenKernels/**/cubins/*.cpp filter=lfs diff=lfs merge=lfs -text

Then run:

  • git lfs track "cpp/tensorrt_llm/kernels/trtllmGenKernels/**/cubins/*.cpp"
  • git add .gitattributes && git add -u
  • git commit -m "Track cubin wrapper .cpp via Git LFS"
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/cubins/Gemm_Bfloat16_E2m1E4m3_castE4m3_Fp32_sfBlk32_t128x32x512_s2_et128x32_m128x32x32_cga1x1x1_16dp256b_TN_transOut_schedS_sm100a_cubin.cpp (2)

1-3: Filename style vs. guideline — acceptable exception for generated artifacts?

The repo guideline prefers lowerCamelCase filenames for C++, while this generated cubin wrapper uses a long, underscore-and-capitalized schema. If this mirrors existing generated cubins, consider documenting an explicit exception in a README within the cubins directory.

If this is a one-off deviation, consider aligning with the established naming or adding a short rationale comment at the top of the generated file.


1-3: Compilation footprint considerations (optional).

Large embedded arrays (~475 KB here; many such files added) can increase compile time and object size. If build times regress, consider:

  • Packaging as external assets loaded at runtime for relevant architectures.
  • Grouping multiple cubins into a single translation unit with linkonce symbols to reduce overhead.
  • Compressing byte arrays with a lightweight runtime decompressor (zstd) if startup cost is acceptable; a minimal sketch follows this list.

No action required if current CI times remain acceptable.
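
If the zstd route were explored, the runtime path could be as small as the sketch below; decompressCubin and the embedded-symbol naming are illustrative, not part of this PR, and it assumes the build links against libzstd:

#include <cstdint>
#include <stdexcept>
#include <vector>
#include <zstd.h>

// Sketch only: decompress a zstd-compressed cubin blob once at load time.
inline std::vector<uint8_t> decompressCubin(uint8_t const* compressed, size_t compressedSize)
{
    unsigned long long const rawSize = ZSTD_getFrameContentSize(compressed, compressedSize);
    if (rawSize == ZSTD_CONTENTSIZE_ERROR || rawSize == ZSTD_CONTENTSIZE_UNKNOWN)
    {
        throw std::runtime_error("Invalid zstd frame for embedded cubin");
    }
    std::vector<uint8_t> out(static_cast<size_t>(rawSize));
    size_t const written = ZSTD_decompress(out.data(), out.size(), compressed, compressedSize);
    if (ZSTD_isError(written))
    {
        throw std::runtime_error(ZSTD_getErrorName(written));
    }
    out.resize(written);
    return out;
}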

cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/config.json (3)

252-279: New template GemmFp4xFp8: surface looks coherent; check A/B SF semantics and defaults

  • dtypeA=e2m1 with dtypeMmaA=e4m3 and sfBlockSizeA=32 plus sfLayoutA="128x4" matches the R128c4 path implemented in KernelParamsSetup::makeTmaShapeStrideSfAb.
  • dtypeB=e4m3 (non-block) implies no tmaSfB — that’s consistent.

Two small asks:

  • Please confirm that the default mSfLayoutB in the options is ignored for this template (since B is non-block), to avoid confusion.
  • Consider adding a brief _comment field documenting expected constraints (e.g., K % 64, M % 128 for A’s R128c4).

352-358: Config sweep for GemmFp4xFp8: set is sensible; add a guardrail for tileK on tileN=128

The (tileN,tileK) grid is reasonable, and lowering tileK to 256 at tileN=128 is prudent. Please also cap epilogueTileN to <= tileN in the validator to prevent misconfiguration if the templates evolve.

If helpful, I can add a minimal validator rule in the generator to enforce epilogueTileN <= tileN and tileK ∈ {256,512}.
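
For illustration, such a guardrail written in the TLLM_CHECK_ERROR style of checkAndUpdateGemmOptions could look like the following; the member names mEpilogueTileN, mTileN, and mTileK are assumed to follow the existing mPrefix convention:

    // Sketch only: guardrails for the GemmFp4xFp8 template sweep.
    TLLM_CHECK_ERROR(options.mEpilogueTileN <= options.mTileN,
        "epilogueTileN must not exceed tileN");
    TLLM_CHECK_ERROR(options.mTileK == 256 || options.mTileK == 512,
        "tileK must be 256 or 512 for the GemmFp4xFp8 template");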


349-351: MxE2m1×MxE4m3: dtypeC list includes fp32/mxe4m3 — confirm store paths

Including "fp32" and "mxe4m3" for C implies TMA store must handle these, with mPtrSfC populated for mxe4m3 and two-shot AR promotion for fp32 when needed. Please confirm both paths are exercised in tests.

I can scaffold a small pytest to exercise C={fp16,bf16,fp32,mxe4m3}.

cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/GemmInterface.h (2)

516-519: Use logical OR for boolean gating

Using bitwise OR on bools works but is non-idiomatic and evaluates all operands. Prefer short-circuit logical OR.

-        usePdl
-            && (config.mOptions.mGridWaitForPrimaryEarlyExit | config.mOptions.mGridWaitForPrimaryA
-                | config.mOptions.mGridWaitForPrimaryB));
+        usePdl
+            && (config.mOptions.mGridWaitForPrimaryEarlyExit || config.mOptions.mGridWaitForPrimaryA
+                || config.mOptions.mGridWaitForPrimaryB));

468-496: Module cache key might collide across different cubins with identical function names

The key is context + function name. If two different cubins expose the same function name in the same context, we could fetch a mismatched CUfunction. Include a fingerprint of config.mData (e.g., address bytes) to disambiguate.

-        auto const moduleKey = ctxName + funcName;
+        std::string moduleKey = ctxName + funcName;
+        moduleKey.append(reinterpret_cast<char const*>(&config.mData), sizeof(config.mData));
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/TmaDescriptor.h (3)

21-27: Missing include for std::stringstream

std::stringstream is used below; include explicitly to avoid relying on transitive includes.

 #include <iostream>
+#include <sstream>

188-221: Error path: include CUDA error string in diagnostics

You fetch errorString but don’t print it for buildNdTmaDescriptor; include it for readability (you already do this for SF descriptor).

-        ss << "Error: Failed to initialize the TMA descriptor " << result << std::endl;
+        ss << "Error: Failed to initialize the TMA descriptor (" << result << "): " << errorString << std::endl;

309-325: Minor: const auto& in loops; unify case label diagnostics

Non-blocking: prefer const auto& for loops over tileShapes/tileStrides; also consider logging both decimal and hex for swizzleType.
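
As an example of the swizzleType suggestion, assuming swizzleType is an enum that can be cast to an integer:

    // Sketch: print swizzleType in both decimal and hex for easier cross-referencing.
    ss << "swizzleType: " << static_cast<int32_t>(swizzleType) << " (0x" << std::hex
       << static_cast<int32_t>(swizzleType) << std::dec << ")" << std::endl;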

cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/KernelParams.h (2)

216-334: KernelParamsSetup: overall API looks sound; minor consistency nits

  • Nice: explicit passing of ptrBias and per-token SFs, and auto-selecting dtypeC for two-shot AR.
  • Nit: consider constexpr for MatrixType alias, and marking helpers static inline to avoid ODR issues in headers compiled by multiple TUs. Not critical.

25-28: “Dependency free and NVRTC-compilable” note vs exceptions

This header is included by device code per the comment. Throwing std::runtime_error in device-includable headers can be problematic for NVRTC. If these code paths never reach NVRTC (guarded by TLLM_ENABLE_CUDA or host-only), clarify with comments or guard throws with host-side macros.
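
One hedged option is to route the throws through a macro that degrades to a trap under NVRTC; TLLM_EXPORT_THROW is an illustrative name, not an existing macro, while __CUDACC_RTC__ is the macro NVRTC predefines:

// Sketch only: host builds keep the exact exception behavior, while NVRTC
// compilation no longer needs exception support (the message is dropped there).
#if defined(__CUDACC_RTC__)
#define TLLM_EXPORT_THROW(msg) assert(false)
#else
#include <stdexcept>
#define TLLM_EXPORT_THROW(msg) throw std::runtime_error(msg)
#endif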

cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/KernelTraits.h (1)

566-571: Avoid copying KernelTraits by value in small helpers

Passing KernelTraits by value to getters copies the allocator helpers; switch to const&.

-inline int32_t getSmemOffsetLoadA(KernelTraits traits)
+inline int32_t getSmemOffsetLoadA(KernelTraits const& traits)
@@
-inline int32_t getSmemOffsetLoadB(KernelTraits traits)
+inline int32_t getSmemOffsetLoadB(KernelTraits const& traits)
@@
-inline int32_t getSmemOffsetLoadShuffleB(KernelTraits traits)
+inline int32_t getSmemOffsetLoadShuffleB(KernelTraits const& traits)
@@
-inline int32_t getSmemOffsetGmemC(KernelTraits traits, int resIdx = 0)
+inline int32_t getSmemOffsetGmemC(KernelTraits const& traits, int resIdx = 0)
@@
-inline int32_t getSmemOffsetRowMax(KernelTraits traits)
+inline int32_t getSmemOffsetRowMax(KernelTraits const& traits)
@@
-inline int32_t getSmemOffsetSliceK(KernelTraits traits)
+inline int32_t getSmemOffsetSliceK(KernelTraits const& traits)
@@
-inline int32_t getSmemOffsetPerTokenSf(KernelTraits traits)
+inline int32_t getSmemOffsetPerTokenSf(KernelTraits const& traits)
@@
-inline int32_t getSmemOffsetBias(KernelTraits traits)
+inline int32_t getSmemOffsetBias(KernelTraits const& traits)
@@
-inline int32_t getSmemOffsetBlockAmax(KernelTraits traits)
+inline int32_t getSmemOffsetBlockAmax(KernelTraits const& traits)
@@
-inline int32_t getSmemOffsetConstSfBuf(KernelTraits traits)
+inline int32_t getSmemOffsetConstSfBuf(KernelTraits const& traits)
@@
-inline int32_t getTmemOffsetD(KernelTraits traits)
+inline int32_t getTmemOffsetD(KernelTraits const& traits)
@@
-inline int32_t getTmemOffsetA(KernelTraits traits)
+inline int32_t getTmemOffsetA(KernelTraits const& traits)
@@
-inline int32_t getTmemOffsetSfA(KernelTraits traits)
+inline int32_t getTmemOffsetSfA(KernelTraits const& traits)
@@
-inline int32_t getTmemOffsetSfB(KernelTraits traits)
+inline int32_t getTmemOffsetSfB(KernelTraits const& traits)

Also applies to: 573-576, 587-590, 594-597, 601-604, 608-611, 615-618, 622-625, 629-632, 636-639, 654-657, 661-664, 668-671, 675-678

cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/trtllmGen_gemm_export/GemmOptions.h (8)

529-533: Unnecessary self-qualification; may mask nested-namespace mistakes

Within namespace gemm, calling gemm::divUp is redundant and, with the accidental nested gemm, can break lookup. Keep it simple:

-inline T divUpMul(T a, T b)
+inline T divUpMul(T a, T b)
 {
-    return gemm::divUp(a, b) * b;
+    return divUp(a, b) * b;
 }

590-597: Comment and code disagree on supported A-cast paths

Comment says only {MxFp4, NvFp4} → Bf16, but code also allows NvFp4 (E2m1) → Fp8 (E4m3). Update the comment to match the behavior and avoid confusing downstream API users.

-    // Currently, we only support {MxFp4, NvFp4} -> Bf16.
+    // Currently supported A casts:
+    //   {MxFp4 (MxE2m1), NvFp4 (E2m1)} -> Bf16
+    //   NvFp4 (E2m1) -> Fp8 (E4m3)

774-781: Typos in comments (“Nvida”, “expertiment”)

Polish the user-facing comments.

-        // SfBlockSizeA can also support 64 and 128, although they are not officially supported Nvida
+        // SfBlockSizeA can also support 64 and 128, although they are not officially supported by NVIDIA
@@
-        // TODO Leaving this as an option for now in case we want to expertiment with other block sizes
+        // TODO Leaving this as an option for now in case we want to experiment with other block sizes

17-17: Add include guards per repository guideline (keep #pragma once if desired)

Headers must use TRTLLM_<FILENAME>_H include guards. Suggest adding guards around the file content.

 #pragma once
+#ifndef TRTLLM_GEMMOPTIONS_H
+#define TRTLLM_GEMMOPTIONS_H
@@
 } // namespace gemm
-
+#endif // TRTLLM_GEMMOPTIONS_H

Please ensure adding guards doesn’t conflict with existing include order. If other headers already use guards, this will be consistent across the codebase.

Also applies to: 1364-1365


494-503: dumpOptions: std::nullopt formatting nit

Current output prints “std::nullopt, ” with extra space before comma on the next line. Consider consistent comma-spacing for machine-parsable dumps.

-        ss << "mSfBlockSizeA="
-           << "std::nullopt"
-           << ", " << std::endl;
+        ss << "mSfBlockSizeA=std::nullopt," << std::endl;

101-187: Constructor explosion: consider a builder or struct-of-args to reduce call-site errors

The GemmOptions constructor accepts >40 parameters, many of them booleans. This is error-prone and hurts readability. Consider a fluent builder, or pass a small config struct for logically grouped fields (layouts, dtypes, SF options).

No immediate change required, but worth a follow-up refactor.

I can sketch a GemmOptionsBuilder that preserves defaults and emits compile-time errors for missing required fields.
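
As a rough illustration only (GemmOptionsBuilder, its setters, and the assumption that GemmOptions is default-constructible are not part of this PR):

// Sketch only: group related fields behind named setters instead of a 40+
// argument constructor. Member names follow the existing mPrefix convention.
class GemmOptionsBuilder
{
public:
    GemmOptionsBuilder& tiles(int32_t tileM, int32_t tileN, int32_t tileK)
    {
        mOptions.mTileM = tileM;
        mOptions.mTileN = tileN;
        mOptions.mTileK = tileK;
        return *this;
    }

    GemmOptionsBuilder& dtypes(trtllm::gen::Dtype dtypeA, trtllm::gen::Dtype dtypeB, trtllm::gen::Dtype dtypeC)
    {
        mOptions.mDtypeA = dtypeA;
        mOptions.mDtypeB = dtypeB;
        mOptions.mDtypeC = dtypeC;
        return *this;
    }

    GemmOptions build() const
    {
        return mOptions;
    }

private:
    GemmOptions mOptions{};
};

Call sites would then read roughly GemmOptionsBuilder{}.tiles(128, 8, 512).dtypes(...).build(), which is harder to get wrong than a long positional argument list.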


488-494: Optional: consider always dumping mSfBlockSizeA (print nullopt explicitly)

You already handle nullopt; formatting tweak above will make downstream parsing easier.


359-377: GemmConfig: new mHash field

Looks fine. If you have a config dump elsewhere, consider including mHash for traceability.

Want me to add a dumpConfig(GemmConfig const&) helper mirroring dumpOptions?
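
A possible shape for that helper, assuming dumpOptions(GemmOptions const&) returns a std::string and GemmConfig exposes mOptions next to the new mHash (sketch only):

// Sketch only: mirrors dumpOptions so configs can be traced by hash.
inline std::string dumpConfig(GemmConfig const& config)
{
    std::stringstream ss;
    ss << "mHash=" << config.mHash << ", " << std::endl;
    ss << dumpOptions(config.mOptions);
    return ss.str();
}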

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/KernelRunner.cpp (2)

112-114: Check CUDA return code when querying SM count

Un-checked cudaDeviceGetAttribute can leave multiProcessorCount undefined on failure. Guard with TLLM_CUDA_CHECK and initialize the variable.

Apply:

-    int32_t multiProcessorCount;
-    cudaDeviceGetAttribute(&multiProcessorCount, cudaDevAttrMultiProcessorCount, device);
+    int32_t multiProcessorCount = 0;
+    TLLM_CUDA_CHECK(cudaDeviceGetAttribute(&multiProcessorCount, cudaDevAttrMultiProcessorCount, device));

146-185: Fix comparator to satisfy strict weak ordering; use std::abs to avoid overload pitfalls

  • Returning true when all keys are equal violates strict weak ordering and can yield undefined behavior in std::sort. End with return false.
  • Prefer std::abs and precompute distances for clarity; include the proper header if not already pulled transitively.

Apply:

-        [&configs, &gemmData](int32_t idx0, int32_t idx1)
+        [&configs, &gemmData](int32_t idx0, int32_t idx1)
         {
             auto const& optionsA = configs[idx0].mOptions;
             auto const& optionsB = configs[idx1].mOptions;
@@
-            if (optionsA.mTileN != optionsB.mTileN)
-            {
-                return abs(gemmData.mProblemDimensions.mN - optionsA.mTileN)
-                    < abs(gemmData.mProblemDimensions.mN - optionsB.mTileN);
-            }
+            if (optionsA.mTileN != optionsB.mTileN)
+            {
+                auto const n = gemmData.mProblemDimensions.mN;
+                auto const dA = std::abs(n - optionsA.mTileN);
+                auto const dB = std::abs(n - optionsB.mTileN);
+                return dA < dB;
+            }
@@
-            return true;
+            return false; // Equal preference — maintain strict weak ordering.
         });

Additionally, ensure the correct overload is visible:

// At top of file with other includes
#include <cstdlib> // for std::abs on integral types
♻️ Duplicate comments (2)
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/KernelRunner.cpp (2)

45-55: Document B-type “Void as wildcard” filter; avoid copying options

Two small improvements:

  • Clarify in code that mOptions.eltTypeB == Dtype::Void acts as a wildcard.
  • Bind options by const reference to avoid an unnecessary copy.

Apply:

-        auto const options = configs[i].mOptions;
+        auto const& options = configs[i].mOptions;
@@
-        // When we include low-latency kernels we can set transposeMmaOutput via constructor
+        // When we include low-latency kernels we can set transposeMmaOutput via constructor.
+        // For B-type, Dtype::Void is treated as a wildcard (no restriction on options.mDtypeB).
         if (options.mDtypeA == mOptions.eltTypeA && options.mDtypeC == mOptions.outputType
             && options.mUseDeepSeekFp8 == mOptions.deepSeekFp8
             && options.mTransposeMmaOutput == mOptions.transposeMmaOutput
             && (mOptions.eltTypeB == gemm::trtllm::gen::Dtype::Void || options.mDtypeB == mOptions.eltTypeB))

118-120: Good switch to env-gated PDL; minor readability tweak

Nice follow-up adopting getEnvEnablePDL() instead of a bare boolean. For readability in call sites, consider naming the parameter with an inline comment.

Apply:

-    auto const err = gemm.run(config, workspace, gemmData, static_cast<void*>(stream), multiProcessorCount,
-        tensorrt_llm::common::getEnvEnablePDL(), globalTrtllmGenGemmModuleCache);
+    bool const usePdl = tensorrt_llm::common::getEnvEnablePDL();
+    auto const err = gemm.run(
+        config, workspace, gemmData, static_cast<void*>(stream), multiProcessorCount, /*usePdl=*/usePdl,
+        globalTrtllmGenGemmModuleCache);
🧹 Nitpick comments (2)
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/KernelRunner.cpp (2)

59-60: Make the failure message reflect all applied filters

The mismatch can be due to multiple gates (A/B dtype, output type, transposeMmaOutput, deepSeekFp8). Tighten the wording.

Apply:

-    TLLM_CHECK_WITH_INFO(mPassingConfigIndices.size() != 0, "No kernel found for the given output type");
+    TLLM_CHECK_WITH_INFO(!mPassingConfigIndices.empty(),
+        "No GEMM kernel found matching eltTypeA/eltTypeB/outputType and transposeMmaOutput/deepSeekFp8 filters.");

151-176: Align heuristic priority with the comments (latency vs throughput)

Comment says tileN closeness is for low-latency (transposeMmaOutput), while tileM governs throughput (non-transpose). The comparator currently prioritizes tileN closeness unconditionally, which may be suboptimal in throughput mode. Consider branching the heuristic based on mOptions.transposeMmaOutput.

Apply:

-    std::sort(sortedIndices.begin(), sortedIndices.end(),
-        [&configs, &gemmData](int32_t idx0, int32_t idx1)
+    std::sort(sortedIndices.begin(), sortedIndices.end(),
+        [this, &configs, &gemmData](int32_t idx0, int32_t idx1)
         {
             auto const& optionsA = configs[idx0].mOptions;
             auto const& optionsB = configs[idx1].mOptions;
 
-            // Choose the tileN that is closest to the problem N
-            // This is the batch size dimension for low latency (transposeMmaOutput) case;
-            if (optionsA.mTileN != optionsB.mTileN)
-            {
-                auto const n = gemmData.mProblemDimensions.mN;
-                auto const dA = std::abs(n - optionsA.mTileN);
-                auto const dB = std::abs(n - optionsB.mTileN);
-                return dA < dB;
-            }
+            // For low latency, prefer tileN close to problem N; for throughput, prioritize larger tileM.
+            if (mOptions.transposeMmaOutput)
+            {
+                if (optionsA.mTileN != optionsB.mTileN)
+                {
+                    auto const n = gemmData.mProblemDimensions.mN;
+                    auto const dA = std::abs(n - optionsA.mTileN);
+                    auto const dB = std::abs(n - optionsB.mTileN);
+                    return dA < dB;
+                }
+            }
+            else
+            {
+                if (optionsA.mTileM != optionsB.mTileM)
+                {
+                    return optionsA.mTileM > optionsB.mTileM;
+                }
+            }

If you have perf data suggesting unconditional tileN closeness still wins, feel free to ignore; otherwise this aligns code with the stated intent.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 8569794 and d90aebf.

📒 Files selected for processing (1)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/KernelRunner.cpp (5 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/KernelRunner.cpp
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/KernelRunner.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/KernelRunner.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/KernelRunner.cpp
🧠 Learnings (4)
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/KernelRunner.cpp
📚 Learning: 2025-08-11T20:09:24.389Z
Learnt from: achartier
PR: NVIDIA/TensorRT-LLM#6763
File: tests/integration/defs/triton_server/conftest.py:16-22
Timestamp: 2025-08-11T20:09:24.389Z
Learning: In the TensorRT-LLM test infrastructure, the team prefers simple, direct solutions (like hard-coding directory traversal counts) over more complex but robust approaches when dealing with stable directory structures. They accept the maintenance cost of updating tests if the layout changes.

Applied to files:

  • cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/KernelRunner.cpp
📚 Learning: 2025-08-17T15:07:01.420Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#6968
File: cpp/tensorrt_llm/thop/loraOp.cpp:133-141
Timestamp: 2025-08-17T15:07:01.420Z
Learning: In TensorRT-LLM's LoRA implementation, the LoraImpl::run() method handles setStream() internally in _runGemm(), along with setWorkspace(). Both stream and workspace are passed as arguments to run(), so there's no need to call setStream() explicitly in loraOp.cpp - this avoids redundancy and follows the intended architectural separation.

Applied to files:

  • cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/KernelRunner.cpp
📚 Learning: 2025-08-17T15:07:01.420Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#6968
File: cpp/tensorrt_llm/thop/loraOp.cpp:133-141
Timestamp: 2025-08-17T15:07:01.420Z
Learning: In TensorRT-LLM's LoRA implementation, the LoraImpl::run() method handles setStream() internally in _runGemm() (line 51 in lora.cpp), along with setWorkspace(). The stream parameter flows from loraOp.cpp through LoraImpl::run() to _runGemm() where setStream() is called appropriately. Adding setStream() in loraOp.cpp would be redundant and goes against the intended architectural design.

Applied to files:

  • cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/KernelRunner.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
cpp/tensorrt_llm/kernels/trtllmGenKernels/gemm/KernelRunner.cpp (1)

21-21: Env-gated PDL include is correct and scoped to this TU

Including tensorrt_llm/common/envUtils.h here is appropriate for the PDL runtime toggle used below. No issues.

@cjluo-nv
Collaborator

cjluo-nv commented Sep 2, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #17399 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #17399 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13076 completed with status: 'FAILURE'

@cjluo-nv
Collaborator

cjluo-nv commented Sep 2, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #17415 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #17415 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13089 completed with status: 'FAILURE'

Signed-off-by: Shiyang Chen <[email protected]>

temp

Signed-off-by: Shiyang Chen <[email protected]>

debug print

Signed-off-by: Shiyang Chen <[email protected]>

export again with updated config

Signed-off-by: Shiyang Chen <[email protected]>

updated according to reviewer comments

use eltTypeA/B instead of dtypeMmaA
rename files
pdl value from env

Signed-off-by: Shiyang Chen <[email protected]>

fix bug

Signed-off-by: Shiyang Chen <[email protected]>

small update based on comments

Signed-off-by: Shiyang Chen <[email protected]>

improve unittest

Signed-off-by: Shiyang Chen <[email protected]>

run pre-commit

Signed-off-by: Shiyang Chen <[email protected]>

export again

Signed-off-by: Shiyang Chen <[email protected]>

export again

Signed-off-by: Shiyang Chen <[email protected]>
Signed-off-by: Shiyang Chen <[email protected]>
Add rules based on tileN and tileM.

Signed-off-by: Shiyang Chen <[email protected]>
Signed-off-by: Shiyang Chen <[email protected]>
@cjluo-nv
Collaborator

cjluo-nv commented Sep 3, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #17486 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #17486 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #13141 completed with status: 'FAILURE'

@cjluo-nv
Collaborator

cjluo-nv commented Sep 3, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #17556 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #17556 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13198 completed with status: 'FAILURE'

Signed-off-by: Shiyang Chen <[email protected]>
@cjluo-nv
Collaborator

cjluo-nv commented Sep 3, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #17569 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #17569 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13210 completed with status: 'FAILURE'

@cjluo-nv
Collaborator

cjluo-nv commented Sep 3, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #17575 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #17575 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13215 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@QiJune QiJune merged commit 98a1bff into NVIDIA:main Sep 4, 2025
5 checks passed
Wong4j pushed a commit to Wong4j/TensorRT-LLM that referenced this pull request Sep 20, 2025