-
Regarding GPTQModel/gptqmodel/utils/model.py, line 861 (commit 0aa0059): I see this line. Could you explain the reason for using float16 with `torch_dtype=auto` here? Do you have any plans to improve it?
-
The kernels run and return results as fp16 internally (mostly), so even if you ask it to output bf16, it just converts the result from fp16 to bf16 each time, and the fp16 to bf16 conversion is not lossless. Do you have a use case where having the gptq kernels output natively as bf16 would help improve model accuracy or speed?
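For example, a quick way to see that the fp16 → bf16 cast is lossy (plain PyTorch, independent of any GPTQ kernel):

```python
import torch

# A minimal sketch of why the fp16 -> bf16 conversion mentioned above is not
# lossless: bf16 has fewer mantissa bits than fp16, so some values that fp16
# represents exactly get rounded when cast to bf16.
x = torch.tensor([1.0009765625], dtype=torch.float16)  # 1 + 2**-10, exact in fp16
y = x.to(torch.bfloat16)                                # rounds to 1.0 in bf16
print(x.float().item(), y.float().item())               # 1.0009765625 vs 1.0
```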
-
@zzh-www We are working on this. All kernels will support bfloat16 for inference soon.
-
Just call the loader and pass bfloat16:
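A minimal sketch (the model id is a placeholder, and the exact loader signature may vary between GPTQModel versions):

```python
import torch
from gptqmodel import GPTQModel

# Pass bfloat16 to the loader via torch_dtype; model id below is hypothetical.
model = GPTQModel.load(
    "your-org/your-gptq-model",
    torch_dtype=torch.bfloat16,
)
```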
-
@zzh-www Auto also fixed #1421. Now if model.config has [...], we will print a [...]
-
@zzh-www The fast kernels have severe quality degradations under BF16. We might need to revert. If this is an issue, we will remove auto-bf16 but still allow manual bf16.
-
@zzh-www I have reverted the auto bf16 code. You can still run in bf16 by passing a `torch_dtype` override. However, based on my full kernel output testing, using BF16 (even with the marlin kernel) has 2x to 10x the degradation in raw accuracy drift unless you use the slower triton/torch kernels. This is very bad for model accuracy. I fully recommend running the model in FP16 mode even in vLLM.
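For reference, a minimal sketch of forcing FP16 in vLLM (the model id is a placeholder):

```python
from vllm import LLM, SamplingParams

# dtype="float16" overrides whatever dtype the model config specifies.
llm = LLM(model="your-org/your-gptq-model", dtype="float16")
outputs = llm.generate(["Hello, world"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```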
-
@zzh-www Good news. We added even more kernel accuracy tests, now comparing multiple shapes between FP16 and BF16, and all the results show that the final max-accuracy drift between FP16 and BF16 is the same. We may add average drift tests in the future to see which mode has better average accuracy.
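To illustrate the kind of comparison (this is only a sketch with plain matmuls, not the actual GPTQModel kernel test harness):

```python
import torch

# Compare max and mean output drift of FP16 vs BF16 matmuls against an FP32
# reference across a few example shapes. fp16 matmul on CPU may require a
# recent PyTorch; run on GPU otherwise.
device = "cuda" if torch.cuda.is_available() else "cpu"

def drift(out, ref):
    err = (out.float() - ref).abs()
    return err.max().item(), err.mean().item()

for m, k, n in [(1, 4096, 4096), (8, 4096, 11008)]:  # example shapes only
    a = torch.randn(m, k, device=device)
    b = torch.randn(k, n, device=device)
    ref = a @ b  # fp32 reference
    for dtype in (torch.float16, torch.bfloat16):
        out = a.to(dtype) @ b.to(dtype)
        print((m, k, n), dtype, "max/mean drift:", drift(out, ref))
```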