-
Regarding GPTQModel/gptqmodel/utils/model.py, line 861 (commit 0aa0059): I see this line. Could you explain the reason for using float16 with `torch_dtype=auto` here? Do you have any plans to improve it?
-
The kernels run and return results as fp16 internally (mostly), so even if you ask it to output bf16, it just converts the result from fp16 to bf16 each time, and the fp16 to bf16 conversion is not lossless. Do you have a use case where having the gptq kernels output natively as bf16 would help improve model accuracy or speed?
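For example, a quick way to see that the fp16 → bf16 cast is lossy (plain PyTorch, independent of any GPTQ kernel):

```python
import torch

# A minimal sketch of why the fp16 -> bf16 conversion mentioned above is not
# lossless: bf16 has fewer mantissa bits than fp16, so some values that fp16
# represents exactly get rounded when cast to bf16.
x = torch.tensor([1.0009765625], dtype=torch.float16)  # 1 + 2**-10, exact in fp16
y = x.to(torch.bfloat16)                                # rounds to 1.0 in bf16
print(x.float().item(), y.float().item())               # 1.0009765625 vs 1.0
```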
-
@zzh-www We are working on this. All kernels will support bfloat16 for inference soon.
-
Just call the loader and pass bfloat16:
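A minimal sketch (the model id is a placeholder, and the exact loader signature may vary between GPTQModel versions):

```python
import torch
from gptqmodel import GPTQModel

# Pass bfloat16 to the loader via torch_dtype; model id below is hypothetical.
model = GPTQModel.load(
    "your-org/your-gptq-model",
    torch_dtype=torch.bfloat16,
)
```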
-
@zzh-www Auto also fixed #1421. Now if model.config has [...], we will print a [...]
-
@zzh-www The fast kernels have severe quality degradations under BF16. We might need to revert. If this is an issue, we will remove auto-bf16 but still allow manual bf16.
-
@zzh-www I have reverted the auto bf16 code. You can still run in bf16 by passing a `torch_dtype` override. However, based on my full kernel output testing, using BF16 (even with the marlin kernel) has 2x to 10x the degradation in raw accuracy drift unless you use the slower triton/torch kernels. This is very bad for model accuracy. I fully recommend running the model in FP16 mode even in vLLM.
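For reference, a minimal sketch of forcing FP16 in vLLM (the model id is a placeholder):

```python
from vllm import LLM, SamplingParams

# dtype="float16" overrides whatever dtype the model config specifies.
llm = LLM(model="your-org/your-gptq-model", dtype="float16")
outputs = llm.generate(["Hello, world"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```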
-
@zzh-www Good news. We added even more kernel accuracy tests, now comparing multiple shapes between FP16 and BF16, and all the results show that the final max-accuracy drift between FP16 and BF16 is the same. We may add average drift tests in the future to see which mode has better average accuracy.
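To illustrate the kind of comparison (this is only a sketch with plain matmuls, not the actual GPTQModel kernel test harness):

```python
import torch

# Compare max and mean output drift of FP16 vs BF16 matmuls against an FP32
# reference across a few example shapes. fp16 matmul on CPU may require a
# recent PyTorch; run on GPU otherwise.
device = "cuda" if torch.cuda.is_available() else "cpu"

def drift(out, ref):
    err = (out.float() - ref).abs()
    return err.max().item(), err.mean().item()

for m, k, n in [(1, 4096, 4096), (8, 4096, 11008)]:  # example shapes only
    a = torch.randn(m, k, device=device)
    b = torch.randn(k, n, device=device)
    ref = a @ b  # fp32 reference
    for dtype in (torch.float16, torch.bfloat16):
        out = a.to(dtype) @ b.to(dtype)
        print((m, k, n), dtype, "max/mean drift:", drift(out, ref))
```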