Description
I found an article, Accelerating Generative AI with PyTorch II: GPT, Fast.
The optimizations used in that article are summarized below.
I gave gpt-fast a quick try, and the improvement is huge.
codellama-python-7b, 2×A10 (24 GB):

| inference setup | speed (tokens/s) |
|---|---|
| vLLM fp16 | 45.2 |
| gpt-fast fp16 | 66.5 |
| gpt-fast int8 | 105.1 |
| gpt-fast int4 | 145.9 |
PS: the output quality with int4 is terrible.
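For context on what the int8 row refers to: gpt-fast uses weight-only quantization, where weights are stored in low precision with per-output-channel scales and dequantized on the fly inside the matmul. Below is a minimal sketch of the int8 variant; the names `WeightOnlyInt8Linear` and `quantize_linears` are my own illustration, not gpt-fast's or vLLM's actual code.

```python
# Minimal sketch of int8 weight-only quantization (gpt-fast style).
# Illustrative only: not gpt-fast's or vLLM's actual implementation.
import torch
import torch.nn as nn


class WeightOnlyInt8Linear(nn.Module):
    def __init__(self, linear: nn.Linear):
        super().__init__()
        w = linear.weight.data                                   # (out_features, in_features)
        scales = (w.abs().amax(dim=1, keepdim=True) / 127.0).clamp(min=1e-8)
        self.register_buffer("weight", torch.round(w / scales).to(torch.int8))
        self.register_buffer("scales", scales.to(w.dtype))
        self.bias = linear.bias

    def forward(self, x):
        # Dequantize per output channel at matmul time. Weight memory
        # traffic drops roughly 2x vs fp16, which is where the
        # memory-bound decode speedup comes from.
        w = self.weight.to(x.dtype) * self.scales
        return nn.functional.linear(x, w, self.bias)


def quantize_linears(model: nn.Module) -> nn.Module:
    """Swap every nn.Linear in-place for the int8 weight-only version."""
    for name, child in model.named_children():
        if isinstance(child, nn.Linear):
            setattr(model, name, WeightOnlyInt8Linear(child))
        else:
            quantize_linears(child)
    return model
```

This sketch says nothing about the int4 case; 4-bit schemes need extra tricks (grouping, etc.) and are where the quality problems I saw presumably come from.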
I'm curious: can these optimizations be used in vLLM?
I can see some existing discussion about these optimizations, but it doesn't look like they will land in the short term (because of some issues specific to vLLM?):
- torch.compile (a minimal compile sketch follows this list)
  - +34% higher throughput?
  - Compiled the model with torch.compile, unfortunately without performance improvements
- quantization
  - Add GPTQ support (I tried a version before, but it didn't work well)
- Speculative Decoding (a toy verification-loop sketch is at the end of this post)
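As a concrete reference for the torch.compile item above, here is a minimal sketch of compiling a Hugging Face causal LM's forward pass. The checkpoint name and greedy generation call are illustrative and have nothing to do with vLLM's internals; in practice, dynamic shapes and KV-cache handling can cause graph breaks, which may be part of why the earlier attempt saw no improvement.

```python
# Minimal sketch: torch.compile on a HF causal LM's forward pass.
# Checkpoint name and generation settings are illustrative, not vLLM code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "codellama/CodeLlama-7b-Python-hf"  # assumed checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = (
    AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
    .cuda()
    .eval()
)

# "reduce-overhead" enables CUDA graphs, which is where much of the
# decode-time win in the gpt-fast writeup comes from. Note: graph breaks
# from dynamic KV-cache shapes can erase the benefit.
model.forward = torch.compile(model.forward, mode="reduce-overhead")

inputs = tok("def quicksort(arr):", return_tensors="pt").to("cuda")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
```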
vLLM is a great project!! I really hope to see these optimizations in vLLM, and I'd also like to understand the difficulties that still exist :)
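Since Speculative Decoding is listed above without detail, here is a toy, greedy-only sketch of the draft-then-verify loop. `draft_model`, `target_model`, and `speculative_step` are hypothetical names, the models are assumed to return raw logits of shape (batch, seq_len, vocab), there is no KV cache, and real implementations use probabilistic acceptance, so treat this purely as an illustration of the idea.

```python
# Toy, greedy-only speculative decoding step. Hypothetical names; the
# draft/target models are assumed to map token ids -> logits
# of shape (batch, seq_len, vocab). Not vLLM's or gpt-fast's API.
import torch


@torch.no_grad()
def speculative_step(draft_model, target_model, input_ids, k=4):
    """Propose k tokens with the draft model, then verify them with a
    single forward pass of the target model (greedy acceptance only)."""
    prompt_len = input_ids.shape[1]

    # 1) Draft k tokens autoregressively with the small model (no KV cache here).
    draft_ids = input_ids
    for _ in range(k):
        logits = draft_model(draft_ids)[:, -1, :]
        next_id = logits.argmax(dim=-1, keepdim=True)
        draft_ids = torch.cat([draft_ids, next_id], dim=-1)
    proposed = draft_ids[:, prompt_len:]                                  # (1, k)

    # 2) One target forward pass over prompt + proposed tokens.
    target_logits = target_model(draft_ids)                               # (1, L+k, V)
    # Target's greedy choice at each position covering a proposal.
    target_pred = target_logits[:, prompt_len - 1:-1, :].argmax(dim=-1)   # (1, k)

    # 3) Accept the longest prefix on which draft and target agree.
    matches = (proposed == target_pred)[0].long()
    n_accept = int(matches.cumprod(dim=0).sum().item())
    accepted = proposed[:, :n_accept]

    # 4) The target model contributes one extra token either way:
    #    its own prediction right after the accepted prefix.
    if n_accept < k:
        bonus = target_pred[:, n_accept:n_accept + 1]
    else:
        bonus = target_logits[:, -1:, :].argmax(dim=-1)
    return torch.cat([input_ids, accepted, bonus], dim=-1)
```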