🐛 Describe the bug
Hello,
I am running Llama-3-70B and Mixtral with vLLM on a bunch of different kinds of machines. I encountered wildly different output quality on A10 GPUs vs A100/H100 GPUs, but ONLY for GPTQ models with Marlin kernels. GPTQ with Marlin kernels is way faster than AWQ, but with AWQ I see roughly the same response to my test queries in either kind of GPU environment. For the A10 deployments, the only difference in the settings is that I use 2x A10 24GB GPUs instead of a single A100 or H100 (via the tensor parallelism param). I am running vllm == 0.4.2. The examples below are from Llama-3-70B on a very simple test query, but I saw the same flavor of quality issues with Mixtral-AWQ vs Mixtral-GPTQ, and on all of my more complicated RAG test queries as well. In general, the GPTQ models on A10s are psychotic and unusable and behave as if the model files are corrupted, which is why I tested two publicly available GPTQ quantizations of Llama-3.
I also tried to simulate the 2x A10 environment on a single A100 by setting --gpu-memory-utilization to 0.55, i.e. I restricted the VRAM available to the A100 deployment so as to shrink the KV-cache space. I got the same good output quality on the A100 that I always see, albeit with lower throughput since the KV cache was constrained. No psychotic responses like the ones I see on A10s.
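Roughly the offline-API equivalent of that "choked" A100 run, in case that's easier to reproduce (a sketch only; I'm assuming the server flags map one-to-one onto the `LLM` constructor / EngineArgs, and it uses one of the GPTQ checkpoints listed further down):

```python
from vllm import LLM

# Sketch of the "choked" single-A100 run: same flags as my Marlin deployments
# below, with only --gpu-memory-utilization lowered to 0.55 so the KV cache
# gets roughly the 2x A10 24GB memory budget.
llm = LLM(
    model="TechxGenus/Meta-Llama-3-70B-GPTQ",
    quantization="marlin",
    dtype="half",
    max_model_len=8192,
    gpu_memory_utilization=0.55,  # restrict VRAM -> smaller KV cache
    trust_remote_code=True,
    enforce_eager=True,
    kv_cache_dtype="fp8",
)
```

This setup still gives me the usual coherent responses, just with lower throughput.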
Here are my examples:
Test prompt: "what model are you and what quantization do you use?"
Model : https://huggingface.co/casperhansen/llama-3-70b-instruct-awq
Model Args: --max-model-len 8192 --quantization awq --dtype half --gpu-memory-utilization 0.99 --trust-remote-code --enforce-eager --kv-cache-dtype fp8
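For reference, this is roughly how I'd express the same AWQ flags with the offline `LLM` API (a sketch; I'm assuming the CLI flags map directly onto the constructor / EngineArgs):

```python
from vllm import LLM, SamplingParams

# AWQ deployment -- behaves the same on A10s and A100/H100s.
llm = LLM(
    model="casperhansen/llama-3-70b-instruct-awq",
    quantization="awq",
    dtype="half",
    max_model_len=8192,
    gpu_memory_utilization=0.99,
    trust_remote_code=True,
    enforce_eager=True,
    kv_cache_dtype="fp8",
    # tensor_parallel_size=2 on the 2x A10 machines, 1 on A100/H100
)

print(llm.generate(
    ["what model are you and what quantization do you use?"],
    SamplingParams(max_tokens=256),
)[0].outputs[0].text)
```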
Response on A10 or A100 (it doesn't matter where I deploy it, I get pretty much the same response):
"I'm an AI assistant, and I'm happy to share some information about my architecture.
I'm a large language model, specifically a transformer-based model, which is a type of neural network designed primarily for natural language processing tasks. My model is based on the BERT (Bidirectional Encoder Representations from Transformers) architecture, which was developed by Google in 2018.
Regarding quantization, I don't use traditional quantization techniques like fixed-point or floating-point quantization. Instead, my model uses a technique called "knowledge distillation" to reduce the precision of my weights and activations while maintaining the accuracy of my responses.
Knowledge distillation is a method that involves training a smaller, simpler model (the "student") to mimic the behavior of a larger, more complex model (the "teacher"). In my case, the teacher model is a large, floating-point BERT model, and the student model is a smaller, integer-based version of myself.
During training, the student model learns to approximate the behavior of the teacher model by minimizing the difference between their outputs. This process allows me to maintain high accuracy while using fewer computational resources and reducing my memory footprint.
So, to summarize: I'm a BERT-based language model that uses knowledge distillation to reduce precision and improve efficiency, rather than traditional quantization techniques."
Model : https://huggingface.co/TechxGenus/Meta-Llama-3-70B-GPTQ or https://huggingface.co/MaziyarPanahi/Meta-Llama-3-70B-Instruct-GPTQ (tested with both)
Model Args: --max-model-len 8192 --quantization marlin --dtype half --gpu-memory-utilization 0.99 --trust-remote-code --enforce-eager --kv-cache-dtype fp8
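And the rough offline-API equivalent of the GPTQ + Marlin flags (same assumption as above); the `tensor_parallel_size=2` argument is the only thing I change for the 2x A10 machines:

```python
from vllm import LLM, SamplingParams

# GPTQ checkpoint run with the Marlin kernels.
# Fine on a single A100/H100, garbage on 2x A10 with tensor parallelism.
llm = LLM(
    model="TechxGenus/Meta-Llama-3-70B-GPTQ",  # or MaziyarPanahi/Meta-Llama-3-70B-Instruct-GPTQ
    quantization="marlin",
    dtype="half",
    max_model_len=8192,
    gpu_memory_utilization=0.99,
    trust_remote_code=True,
    enforce_eager=True,
    kv_cache_dtype="fp8",
    tensor_parallel_size=2,  # 2x A10 24GB; 1 for the A100/H100 runs
)

print(llm.generate(
    ["what model are you and what quantization do you use?"],
    SamplingParams(max_tokens=256),
)[0].outputs[0].text)
```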
Response on A100:
"Thank you for asking!
I am a type of artificial intelligence language model, to be specific, I'm a transformer-based language model. My architecture is based on the transformer model introduced in the paper "Attention is All You Need" by Vaswani et al. in 2017.
As for quantization, I don't use traditional quantization methods like 8-bit or 16-bit integer quantization. Instead, I rely on a technique called "floating-point quantization" which allows me to represent my weights and activations using a combination of floating-point numbers and integers.
In more detail, my model uses a combination of the following techniques:
1. **Weight pruning**: This involves removing redundant or unnecessary weights from my neural network to reduce its size and computational requirements.
2. **Knowledge distillation**: This is a process where a smaller model (like me) is trained to mimic the behavior of a larger, more complex model. This helps to transfer knowledge from the larger model to me.
3. **Quantization-aware training**: During training, my weights and activations are simulated using lower precision data types (like integers) to mimic the effects of quantization. This helps me to learn how to represent information efficiently even with reduced precision.
These techniques allow me to maintain a good balance between accuracy and computational efficiency, I can provide helpful responses quickly while still being able to understand and respond accurately.
Please let me know if you have any other questions or if there's anything else I can help with!"
Response on 2 A10s (tensor parallelism):
"into the 2000s and the 2000s of the 2000s and the 2000s of the 2000s and the 2000s of the 2000s."
or something else equally psychotic and random (it produces a different response every time, so I have a feeling this has to do with nondeterminism). Here's another example of a response to the same prompt:
"Anita; and the last; and the first; and the last; and the first; and the last.
Anita;
and the last;
and the first;
and the last;
and the first;
and the last.
[Anita];
[the last];
[the first];
[the last];
[the first];
[the last].
Anita;
and the last;
and the first;
and the last;
and the first;
and the last.
[Anita]];
[the last]];
[[the first]];
[[the last]];
[[the first]];
[[the last]].
Anita:;
and [[the last]];
[[the first]]];
[[the last]]];
[[the first]]];
[[the last]].
Anita:;
[[last]]];
[[first]]];
[[last]]];
[[first]];
[[last]].
Anita:;
[[[last]]],
[[[first]]],
[[[last]]],
[[[first]]],
[[[last]]].
Anita:;
[[[last]],
[[[first]],
[[[last]],
{{{first}},
{{{last}}].
Anita:;
{{{last}},
{{{first}},
{{{last}},
{{{first}},
{{{last}}].
Anita:;
{{{last}},
{{{first}},
{{{last}},
{{{first}},
{{{last}}].
This is a series of nested brackets,2019. The output is a series of nested brackets."
Let me know what you all think. I am very confused and surprised by this bug, since both GPUs are Ampere architecture and I wasn't able to pin it down on anything EXCEPT the difference in deployment GPU. I've been deploying and testing many different LLMs on a wide variety of hardware setups and have always been able to trust the heuristic that response quality will be roughly the same and only inference speed will differ.