[Kernel] Increase precision of GPTQ/AWQ Marlin kernel #6795
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge). To run full CI, you can do one of these:
- Comment /ready on the PR
Do you have any data showcasing that this fixes the described accuracy issues? I'm guessing you could look at the examples referenced in the issue? And potentially run an evaluation of perplexity?
@casper-hansen We ran an internal test on an A10 with the GSM dataset to verify accuracy: with the old FP16 reduce the total accuracy is 58%, and with the new FP32 reduce it is 73% (as it is supposed to be). Also, in the unit tests, the maximum difference improved from about 1e-3 to about 1e-6 (roughly twice the decimal digits of precision).
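For context, the accuracy gap reported above comes from rounding to FP16 after every partial-sum addition. The standalone CUDA demo below (not from the vLLM test suite; the values, names, and setup are made up for illustration) accumulates the same inputs with a half accumulator and a float accumulator to show the drift; it assumes a GPU with native FP16 arithmetic (sm_53+).

```cuda
// Illustrative demo: FP16 vs FP32 accumulation of the same values.
// The half path rounds after every add; the float path rounds only if/when
// the final result is converted down.
#include <cuda_fp16.h>
#include <cstdio>

__global__ void reduce_demo(const float* vals, int n,
                            float* out_fp16_path, float* out_fp32_path) {
  half acc_h = __float2half(0.0f);   // simulates an FP16 global reduce
  float acc_f = 0.0f;                // simulates an FP32 global reduce
  for (int i = 0; i < n; ++i) {
    acc_h = __hadd(acc_h, __float2half(vals[i]));  // rounds to half each step
    acc_f += vals[i];                              // keeps full FP32 precision
  }
  *out_fp16_path = __half2float(acc_h);
  *out_fp32_path = acc_f;
}

int main() {
  const int n = 4096;
  float h_vals[n];
  for (int i = 0; i < n; ++i) h_vals[i] = 0.01f;   // nominal sum is 40.96

  float *d_vals, *d_out16, *d_out32;
  cudaMalloc(&d_vals, n * sizeof(float));
  cudaMalloc(&d_out16, sizeof(float));
  cudaMalloc(&d_out32, sizeof(float));
  cudaMemcpy(d_vals, h_vals, n * sizeof(float), cudaMemcpyHostToDevice);

  reduce_demo<<<1, 1>>>(d_vals, n, d_out16, d_out32);

  float r16, r32;
  cudaMemcpy(&r16, d_out16, sizeof(float), cudaMemcpyDeviceToHost);
  cudaMemcpy(&r32, d_out32, sizeof(float), cudaMemcpyDeviceToHost);
  printf("fp16 reduce: %f, fp32 reduce: %f\n", r16, r32);

  cudaFree(d_vals); cudaFree(d_out16); cudaFree(d_out32);
  return 0;
}
```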
Great work Alex, I'm glad it didn't require paring down split-k or any drastic changes.
Do you think this should be added to fp8 marlin as well? We could also just wait for @LucasWilkinson's type refactor.
I think it is not critical to add this to fp8 marlin, since we have not had as many reports of accuracy issues as we did for GPTQ, and especially AWQ (which is even more sensitive than GPTQ).
Sync with upstream change that improves the precision of the 'global_reduce' algorithm from FP16 to FP32. This solves some reported generation quality issues. Upstream issue/PR: vllm-project/vllm#6795
* upstream/main: (66 commits)
  - [Bugfix] Fix PaliGemma MMP (vllm-project#6930)
  - [TPU] Fix greedy decoding (vllm-project#6933)
  - [Kernel] Tuned int8 kernels for Ada Lovelace (vllm-project#6848)
  - [Kernel] Fix marlin divide-by-zero warnings (vllm-project#6904)
  - [ci] GHA workflow to remove ready label upon "/notready" comment (vllm-project#6921)
  - [Kernel] Remove unused variables in awq/gemm_kernels.cu (vllm-project#6908)
  - [Frontend] New `allowed_token_ids` decoding request parameter (vllm-project#6753)
  - [Bugfix] Allow vllm to still work if triton is not installed. (vllm-project#6786)
  - [TPU] Support tensor parallelism in async llm engine (vllm-project#6891)
  - [Kernel] Fix deprecation function warnings squeezellm quant_cuda_kernel (vllm-project#6901)
  - [Core] Reduce unnecessary compute when logprobs=None (vllm-project#6532)
  - [Kernel] Tuned FP8 Kernels for Ada Lovelace (vllm-project#6677)
  - [Model] Initialize support for InternVL2 series models (vllm-project#6514)
  - [Misc] Pass cutlass_fp8_supported correctly in fbgemm_fp8 (vllm-project#6871)
  - Add Nemotron to PP_SUPPORTED_MODELS (vllm-project#6863)
  - [Kernel] Increase precision of GPTQ/AWQ Marlin kernel (vllm-project#6795)
  - [TPU] Reduce compilation time & Upgrade PyTorch XLA version (vllm-project#6856)
  - [Docs] Add RunLLM chat widget (vllm-project#6857)
  - [Model] Initial support for BLIP-2 (vllm-project#5920)
  - [CI/Build][Doc] Update CI and Doc for VLM example changes (vllm-project#6860)
  - ...
Signed-off-by: Alvant <[email protected]>
Signed-off-by: LeiWang1999 <[email protected]>
FIX #5793
FIX #6258
This PR increases the precision of the Marlin kernel by modifying its "global_reduce" algorithm to use FP32 full-precision reduction instead of the FP16 half-precision reduction that was used originally. We were able to implement the new FP32 global reduce efficiently, so it introduces negligible overhead compared with the original FP16 reduce.
The key idea is to introduce a temporary FP32 C buffer for the FP32 reduction (and not use the original FP16 C buffer as before). The new FP32 C buffer is limited in size based on the batch size (M dimension) and the potential "max_par" that can be achieved for each specific execution. Internally, each kernel thread-block detects on which thread-column-block it operates, and based on that accesses the appropriate chunk of the temporary C buffer in a fully thread-aligned way (to avoid any bank conflicts or non-contiguous memory reads/stores).
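For illustration, here is a minimal sketch of the idea described above. It is not the actual Marlin global_reduce: the workspace layout, kernel name, launch configuration, and the per-slice reduction loop are simplified assumptions, shown only to make the FP32-buffer approach concrete.

```cuda
// Simplified sketch: split-K slices write FP32 partial results into a
// temporary float workspace; a final pass sums the slices in FP32 and
// converts to FP16 only once, at the very end.
#include <cuda_fp16.h>

// Hypothetical layout: slice s has written an m*n tile of FP32 partial sums
// into workspace at offset s * m * n.
__global__ void global_reduce_fp32_sketch(const float* __restrict__ workspace,
                                          half* __restrict__ c,
                                          int m, int n, int num_slices) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx >= m * n) return;

  float acc = 0.0f;                      // full-precision accumulator
  for (int s = 0; s < num_slices; ++s)
    acc += workspace[s * m * n + idx];   // FP32 adds, no intermediate rounding

  c[idx] = __float2half(acc);            // single rounding step to FP16
}

// The old FP16 path effectively did the equivalent of
//   half acc = c[idx];
//   acc = __hadd(acc, __float2half(partial));
// after every slice, rounding to half each time, which is where the reported
// accuracy loss accumulated.

int main() {
  const int m = 16, n = 64, num_slices = 4;
  float* workspace;
  half* c;
  cudaMalloc(&workspace, sizeof(float) * num_slices * m * n);
  cudaMalloc(&c, sizeof(half) * m * n);
  cudaMemset(workspace, 0, sizeof(float) * num_slices * m * n);  // demo data

  int threads = 256;
  int blocks = (m * n + threads - 1) / threads;
  global_reduce_fp32_sketch<<<blocks, threads>>>(workspace, c, m, n, num_slices);
  cudaDeviceSynchronize();

  cudaFree(workspace);
  cudaFree(c);
  return 0;
}
```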
Here are micro-benchmark results for gptq_marlin_gemm_fp16 vs gptq_marlin_gemm_fp32 (both compared against pytorch_gemm):
End-to-end performance verification on A100 shows at most a 5% penalty for Llama 3 8B and no penalty for Llama 3 70B.