[TRTLLM-5863][feat] Support Weight-Only-Quantization in PyTorch Workflow #5850
Conversation
/bot run

PR_Github #11327 [ run ] triggered by Bot

PR_Github #11327 [ run ] completed with state
cf6bc94 to c35c81b (Compare)
/bot run

PR_Github #11718 [ run ] triggered by Bot
c35c81b to cc62513 (Compare)
PR_Github #11718 [ run ] completed with state

/bot run

PR_Github #11720 [ run ] triggered by Bot

PR_Github #11720 [ run ] completed with state
6c53025 to 7393aae (Compare)
/bot run

PR_Github #12070 [ run ] triggered by Bot

PR_Github #12070 [ run ] completed with state
7393aae to 2f9bd01 (Compare)
Walkthrough

A weight-only quantized GEMM (General Matrix Multiply) runner and operator were introduced, supporting INT8 and INT4 quantized weights with FP16/BF16 activations. This includes a CUDA/C++ implementation, Python bindings, a new linear method for weight-only quantization, and comprehensive unit tests for both the GEMM operator and the linear layer integration.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant PythonUser
    participant LinearModule
    participant WeightOnlyQuantLinearMethod
    participant WeightOnlyQuantGemmRunner
    participant TorchScriptClass
    participant CUDA_GEMM
    PythonUser->>LinearModule: forward(input)
    LinearModule->>WeightOnlyQuantLinearMethod: apply(input, bias)
    WeightOnlyQuantLinearMethod->>WeightOnlyQuantGemmRunner: forward(input, weight, weight_scale, tactic, to_userbuffers, out_dtype)
    WeightOnlyQuantGemmRunner->>TorchScriptClass: run_gemm(input, weight, weight_scale, tactic, to_userbuffers, out_dtype)
    TorchScriptClass->>CUDA_GEMM: launch CUDA GEMM kernel
    CUDA_GEMM-->>TorchScriptClass: output tensor
    TorchScriptClass-->>WeightOnlyQuantGemmRunner: output tensor
    WeightOnlyQuantGemmRunner-->>WeightOnlyQuantLinearMethod: output tensor
    WeightOnlyQuantLinearMethod-->>LinearModule: output tensor
    LinearModule-->>PythonUser: output tensor
```
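For orientation, here is a minimal pure-PyTorch sketch of the semantics behind this call chain: per-output-channel dequantization of an INT8 weight followed by a regular GEMM. The tensor layouts and the actual CUDA kernel in the PR differ; all names below are illustrative only.

```python
import torch

def weight_only_gemm_reference(activation: torch.Tensor,
                               qweight: torch.Tensor,
                               weight_scale: torch.Tensor) -> torch.Tensor:
    """Reference semantics of a weight-only-quantized GEMM.

    activation:   [M, K] float activations (fp16/bf16 in the real kernel)
    qweight:      [K, N] int8 quantized weight (int4 would be unpacked first)
    weight_scale: [N]    per-output-channel dequantization scale
    """
    # Dequantize the weight on the fly, then run a regular matmul.
    dequant = qweight.to(activation.dtype) * weight_scale.to(activation.dtype)
    return activation @ dequant

# Shapes-only usage example (float32 on CPU for portability; the PR's operator
# runs fp16/bf16 activations on GPU).
x = torch.randn(4, 64)
w = torch.randint(-128, 127, (64, 32), dtype=torch.int8)
s = torch.rand(32)
print(weight_only_gemm_reference(x, w, s).shape)  # torch.Size([4, 32])
```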
Estimated code review effort: 3 (120 minutes)
2f9bd01 to 4e55b3a (Compare)
Actionable comments posted: 0
♻️ Duplicate comments (2)
tensorrt_llm/_torch/modules/linear.py (2)
113-119: Verify the logic change from AWQ-specific to general weight-only quantization.

The change from `module.has_w4a16_awq` to `module.has_weight_only_quant` broadens the condition significantly. This could affect AWQ models if they don't follow the same preprocessing path. This relates to the previous review comment about whether `has_weight_only_quant` should include `has_w4a16_awq` or if they should remain separate checks.

Fix the line length issue:

```diff
-            weight = preprocess_weights_for_mixed_gemm(
-                weight.T.to(torch.int8).contiguous().cpu(), weight_dtype,
-                torch.float16).cuda().contiguous()
+            weight = preprocess_weights_for_mixed_gemm(
+                weight.T.to(torch.int8).contiguous().cpu(),
+                weight_dtype,
+                torch.float16
+            ).cuda().contiguous()
```
1056-1056: Good fix addressing previous review feedback.

This addresses the previous review comment about unnecessary `.to` and `.contiguous()` calls. The input should already be properly typed and contiguous when passed to the apply method.
🧹 Nitpick comments (2)
tensorrt_llm/_torch/modules/linear.py (2)
900-934: Clean implementation with minor improvement opportunity.

The `WeightOnlyQuantLinearMethod` class follows established patterns well. One minor suggestion for the `apply` method: the bias handling can be simplified:

```diff
-        bias = bias.contiguous() if bias is not None else None
+        if bias is not None:
+            bias = bias.contiguous()
```

This avoids unnecessary assignment when bias is None.
971-1014: Consider refactoring to reduce code duplication with AWQ method.

The fused weight loading methods work correctly but share significant code patterns with `W4A16_AWQ_LinearMethod`. Consider extracting common preprocessing logic into shared helper functions. The preprocessing logic in lines 979-981 and 1000-1002 is nearly identical to AWQ's implementation and could be refactored into a shared helper.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- cpp/tensorrt_llm/thop/CMakeLists.txt (1 hunks)
- cpp/tensorrt_llm/thop/weightOnlyQuantGemm.cpp (1 hunks)
- cpp/tensorrt_llm/thop/weightOnlyQuantGemm.h (1 hunks)
- tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1 hunks)
- tensorrt_llm/_torch/modules/linear.py (6 hunks)
- tensorrt_llm/quantization/functional.py (1 hunks)
- tests/unittest/_torch/thop/test_weight_only_quant_gemm.py (1 hunks)
- tests/unittest/_torch/thop/test_weight_only_quant_linear.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- cpp/tensorrt_llm/thop/CMakeLists.txt
🚧 Files skipped from review as they are similar to previous changes (6)
- tensorrt_llm/quantization/functional.py
- tests/unittest/_torch/thop/test_weight_only_quant_linear.py
- cpp/tensorrt_llm/thop/weightOnlyQuantGemm.h
- tests/unittest/_torch/thop/test_weight_only_quant_gemm.py
- cpp/tensorrt_llm/thop/weightOnlyQuantGemm.cpp
- tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
🧰 Additional context used
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/linear.py
114-114: Line too long (135 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (3)
tensorrt_llm/_torch/modules/linear.py (3)
173-192: LGTM! Well-implemented helper function.

The function provides a clean abstraction for determining weight dtype and packing ID based on quantization mode. The documentation is comprehensive and the error handling is appropriate. (A hedged sketch of such a helper follows this list of comments.)
1164-1166: LGTM! Correct factory method logic.

The condition properly checks for weight-only quantization without per-group scaling, and the placement before the more specific AWQ check follows the correct precedence order.

1288-1292: LGTM! Consistent property implementation.

The `has_weight_only_quant` property follows the established pattern of other quantization type properties in the class.
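As a rough sketch of the helper discussed above (lines 173-192 of linear.py), something like the following could map a quantization mode to a weight storage dtype and a packing ID. The mode-check method names and the ID values here are assumptions, and the real helper reportedly takes the Linear module itself rather than a bare quant-mode object.

```python
import torch
from types import SimpleNamespace

def get_weight_dtype_and_id(quant_mode):
    """Hypothetical sketch: map a weight-only quant mode to (dtype, packing id)."""
    if quant_mode.is_int8_weight_only():
        # INT8: one quantized value per int8 element.
        return torch.int8, 1       # packing id 1 is an assumed convention
    if quant_mode.is_int4_weight_only():
        # INT4: two values packed per byte, stored in int8 tensors.
        return torch.quint4x2, 2   # packing id 2 is an assumed convention
    raise NotImplementedError(f"Unsupported weight-only quant mode: {quant_mode}")

# Duck-typed usage example with a stand-in mode object.
int8_mode = SimpleNamespace(is_int8_weight_only=lambda: True,
                            is_int4_weight_only=lambda: False)
print(get_weight_dtype_and_id(int8_mode))  # (torch.int8, 1)
```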
/bot run

PR_Github #12371 [ run ] triggered by Bot

PR_Github #12371 [ run ] completed with state

/bot run

PR_Github #12379 [ run ] triggered by Bot

PR_Github #12379 [ run ] completed with state
4e55b3a to 3434e53 (Compare)
Signed-off-by: Yuening Li <[email protected]>
3434e53 to eb69ef9 (Compare)
Actionable comments posted: 0
🧹 Nitpick comments (1)
tensorrt_llm/_torch/modules/linear.py (1)
119-127: Good refactoring to use the consolidated weight-only quantization check.

The change from checking specific AWQ variants to the more general `has_weight_only_quant` property aligns well with the new architecture. The logic for determining activation dtype and using the new helper function is correct. However, there's a line length issue that should be addressed:

```diff
-    # NOTE: without the preprocess during the runtime, the gemm output nan's. in order to use the preprocess_weights_for_mixed_gemm
+    # NOTE: without the preprocess during the runtime, the gemm output nan's.
+    # In order to use the preprocess_weights_for_mixed_gemm
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- cpp/tensorrt_llm/thop/CMakeLists.txt (1 hunks)
- cpp/tensorrt_llm/thop/weightOnlyQuantGemm.cpp (1 hunks)
- cpp/tensorrt_llm/thop/weightOnlyQuantGemm.h (1 hunks)
- tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1 hunks)
- tensorrt_llm/_torch/modules/linear.py (5 hunks)
- tensorrt_llm/quantization/functional.py (1 hunks)
- tests/unittest/_torch/thop/test_weight_only_quant_gemm.py (1 hunks)
- tests/unittest/_torch/thop/test_weight_only_quant_linear.py (1 hunks)
✅ Files skipped from review due to trivial changes (2)
- cpp/tensorrt_llm/thop/CMakeLists.txt
- tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
🚧 Files skipped from review as they are similar to previous changes (5)
- tests/unittest/_torch/thop/test_weight_only_quant_linear.py
- cpp/tensorrt_llm/thop/weightOnlyQuantGemm.h
- cpp/tensorrt_llm/thop/weightOnlyQuantGemm.cpp
- tests/unittest/_torch/thop/test_weight_only_quant_gemm.py
- tensorrt_llm/quantization/functional.py
🧰 Additional context used
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/linear.py
120-120: Line too long (135 > 120)
(E501)
🔇 Additional comments (8)
tensorrt_llm/_torch/modules/linear.py (8)
180-199: Well-designed helper function with clear documentation and proper validation.

The function correctly determines weight dtype and packing ID for weight-only quantization modes. The assertions ensure proper usage context, and the error handling covers unsupported cases appropriately.

909-928: Properly implemented weight creation with correct tensor shapes.

The `create_weights` method correctly handles weight packing considerations (INT4 weights packed into INT8) and creates appropriate parameter tensors for weights and scales.
930-940: Clean implementation of the apply method using the new CUDA operator.

The method correctly determines weight dtype, ensures bias contiguity, and calls the new `weight_only_quant_gemm` operator with appropriate parameters (see the sketch after this list of comments).
942-966: Load weight scales method correctly handles fused QKV scenarios.

The method properly loads weight scales for all three components (Q, K, V) with appropriate tensor parallel handling.

978-996: Fused QKV weight loading with proper preprocessing.

The implementation correctly concatenates Q, K, V weights, applies necessary preprocessing using `preprocess_weights_for_mixed_gemm`, and handles weight scales appropriately.

998-1020: Fused gate/up weight loading follows the same solid pattern.

Consistent implementation with the QKV variant, properly handling weight concatenation, preprocessing, and scale loading.

1419-1421: Correct integration of the new weight-only quantization method.

The condition properly identifies weight-only quantization without per-group scaling and returns the appropriate method instance.

1546-1550: Well-implemented property following established patterns.

The `has_weight_only_quant` property correctly checks for weight-only quantization and maintains consistency with other quantization properties in the class.
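To make the reviewed apply() flow concrete, here is a self-contained sketch. Where the PR's method calls the new `weight_only_quant_gemm` CUDA operator, this sketch substitutes a pure-PyTorch dequantize-then-matmul reference; the class and attribute names are illustrative, not the PR's actual API.

```python
from types import SimpleNamespace
from typing import Optional

import torch

def weight_only_gemm_reference(activation, qweight, weight_scale):
    # Stand-in for the PR's weight_only_quant_gemm CUDA op.
    return activation @ (qweight.to(activation.dtype) * weight_scale.to(activation.dtype))

class WeightOnlyQuantLinearMethodSketch:
    """Illustrative apply() flow only; not the PR's actual class."""

    def apply(self, module, input: torch.Tensor,
              bias: Optional[torch.Tensor] = None) -> torch.Tensor:
        # The real method resolves the quantized weight dtype (int8 vs. packed
        # int4) and invokes the custom CUDA op; here we call the reference GEMM.
        output = weight_only_gemm_reference(input, module.weight, module.weight_scale)
        if bias is not None:
            # As the reviews note, bias only needs .contiguous() when present.
            output = output + bias.contiguous()
        return output

# Duck-typed usage with a stand-in "module" holding an int8 weight and scale.
mod = SimpleNamespace(weight=torch.randint(-128, 127, (64, 32), dtype=torch.int8),
                      weight_scale=torch.rand(32))
out = WeightOnlyQuantLinearMethodSketch().apply(mod, torch.randn(4, 64))
print(out.shape)  # torch.Size([4, 32])
```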
Actionable comments posted: 0
🧹 Nitpick comments (3)
tensorrt_llm/_torch/modules/linear.py (3)
119-126: Address formatting issue and verify preprocessing logic.

The conceptual improvements using `has_weight_only_quant` address previous review feedback well. However, there's a formatting issue and some questions about the preprocessing logic.

Line length violation: Line 120 exceeds the 120-character limit as flagged by static analysis.

Verify preprocessing consistency: The preprocessing involves multiple transformations (`.T.to(torch.int8).contiguous().cpu()` then back to `.cuda().contiguous()`). This seems inefficient; verify if all these operations are necessary.

```diff
-    # NOTE: without the preprocess during the runtime, the gemm output nan's. in order to use the preprocess_weights_for_mixed_gemm
-    # we need to cast the weight to int8 first.
+    # NOTE: without the preprocess during the runtime, the gemm output nan's.
+    # in order to use the preprocess_weights_for_mixed_gemm we need to cast the weight to int8 first.
```

Please verify if the CPU transfer and back to CUDA is intentional or can be optimized.
978-997: Consider refactoring complex fused QKV weight loading.

The fused QKV weight loading method has several complex operations that could benefit from extraction into helper methods for better maintainability.
The method performs:
- Weight concatenation
- Weight preprocessing with dtype conversion
- Scale loading and concatenation
Consider extracting the preprocessing logic into a separate method:
```python
def _preprocess_fused_weights(self, module: Linear, weights: torch.Tensor) -> torch.Tensor:
    weight_dtype, _ = get_weight_dtype_and_id(module)
    return preprocess_weights_for_mixed_gemm(
        weights.to(torch.int8).T.contiguous().cpu(),
        weight_dtype,
        torch.float16
    ).cuda().contiguous()
```

This would improve readability and reduce duplication between the fused methods.
998-1021: Similar complexity in fused gate/up weight loading.

This method has similar complexity to the QKV method and would benefit from the same refactoring approach mentioned above.
The preprocessing logic is duplicated here and could use the same helper method suggested for the QKV case.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- cpp/tensorrt_llm/thop/CMakeLists.txt (1 hunks)
- cpp/tensorrt_llm/thop/weightOnlyQuantGemm.cpp (1 hunks)
- cpp/tensorrt_llm/thop/weightOnlyQuantGemm.h (1 hunks)
- tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (1 hunks)
- tensorrt_llm/_torch/modules/linear.py (5 hunks)
- tensorrt_llm/quantization/functional.py (1 hunks)
- tests/unittest/_torch/thop/test_weight_only_quant_gemm.py (1 hunks)
- tests/unittest/_torch/thop/test_weight_only_quant_linear.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- cpp/tensorrt_llm/thop/CMakeLists.txt
🚧 Files skipped from review as they are similar to previous changes (6)
- tensorrt_llm/quantization/functional.py
- tests/unittest/_torch/thop/test_weight_only_quant_linear.py
- cpp/tensorrt_llm/thop/weightOnlyQuantGemm.h
- tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
- cpp/tensorrt_llm/thop/weightOnlyQuantGemm.cpp
- tests/unittest/_torch/thop/test_weight_only_quant_gemm.py
🧰 Additional context used
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/modules/linear.py
120-120: Line too long (135 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (5)
tensorrt_llm/_torch/modules/linear.py (5)
180-199: Well-designed helper function.

This utility function effectively extracts weight dtype and packing logic into a reusable, well-documented component. The input validation, clear return value documentation, and error handling are excellent.

907-929: Clean implementation of weight creation and initialization.

The `create_weights` method properly uses the helper function and follows established patterns for parameter creation and bias handling.

930-941: Verify bias handling in apply method.

The apply method looks correct but has a potential issue with bias handling. The bias is made contiguous unconditionally on line 934, but there's no guarantee that `bias` is not None. While this works because `None.contiguous()` would fail before the ternary operation, it's more explicit to check for None first:

```diff
-        bias = bias.contiguous() if bias is not None else None
+        bias = bias.contiguous() if bias is not None else None
```

Actually, the current code is correct; my concern was unfounded. The method looks good.

1419-1421: Clean addition to quantization method factory.

The new conditional logic properly identifies weight-only quantization without per-group scaling and returns the appropriate method instance. The placement and logic are correct.

1546-1550: Property consolidation addresses previous review feedback.

This new property provides a clean abstraction for weight-only quantization checks and addresses the consolidation concern raised in previous reviews. The implementation follows established patterns perfectly.
/bot run

PR_Github #12409 [ run ] triggered by Bot

PR_Github #12409 [ run ] completed with state
…low (NVIDIA#5850) Signed-off-by: Yuening Li <[email protected]> Co-authored-by: Yuening Li <[email protected]>
…low (NVIDIA#5850) Signed-off-by: Yuening Li <[email protected]> Co-authored-by: Yuening Li <[email protected]> Signed-off-by: Shreyas Misra <[email protected]>
…low (NVIDIA#5850) Signed-off-by: Yuening Li <[email protected]> Co-authored-by: Yuening Li <[email protected]> Signed-off-by: Ransiki Zhang <[email protected]>
Add support for both INT4 and INT8 weight-only quantization in the PyTorch workflow.
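As background for what "weight-only" means here, the following minimal PyTorch sketch shows per-output-channel INT8 quantization: only the weight is stored in INT8 plus a scale, while activations stay in floating point. This is a reference illustration, not the PR's actual packing or preprocessing code.

```python
import torch

def quantize_weight_only_int8(weight: torch.Tensor):
    """Symmetric per-output-channel INT8 quantization of a [K, N] weight."""
    # One scale per output channel (column), so the max magnitude maps to 127.
    scale = weight.abs().amax(dim=0) / 127.0
    qweight = torch.clamp(torch.round(weight / scale), -128, 127).to(torch.int8)
    return qweight, scale

# Float32 on CPU for portability; the PR targets fp16/bf16 activations on GPU.
w = torch.randn(64, 32)
x = torch.randn(4, 64)
qw, s = quantize_weight_only_int8(w)
y_ref = x @ w                      # full-precision GEMM
y_wo = x @ (qw.float() * s)        # dequantize-then-GEMM, the weight-only path
print((y_ref - y_wo).abs().max())  # small quantization error
```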
GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`

Provide a user-friendly way for developers to interact with a Jenkins server.

Run `/bot [-h|--help]` to print this help message. See details below for each supported subcommand.

run

`run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]`

Launch build/test pipelines. All previously running jobs will be killed.
- `--disable-fail-fast` (OPTIONAL): Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-1, xxx"` (OPTIONAL): Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--only-multi-gpu-test` (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL): Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-[Post-Merge]-1, xxx"` (OPTIONAL): Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md.

kill

`kill`

Kill all running builds associated with pull request.

skip

`skip --comment COMMENT`

Skip testing for latest commit on pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

`reuse-pipeline`

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
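For example, a typical interaction on a pull request could look like the following sequence of comments; the flags come from the help text above, and the stage name is just its placeholder:

```
/bot run --disable-fail-fast --stage-list "A10-1"
/bot kill
/bot run --gpu-type "A30, H100_PCIe"
```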
Summary by CodeRabbit
New Features
Bug Fixes
Tests