Qwen FP8 ModelOPT support #21978
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small, essential subset of tests runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
Code Review
This pull request adds FP8 support for several Qwen models from ModelOPT. The changes involve updating the model configuration parsing to recognize ModelOPT FP8 checkpoints, and adjusting the weight loading logic in Qwen2 and Qwen3-MoE models to handle FP8-specific parameter names and loader function signatures.
The changes in vllm/model_executor/models/qwen2.py and vllm/model_executor/models/qwen3_moe.py for weight loading are correct and improve robustness. However, I've identified a high-severity issue in vllm/config.py where the quantization configuration is completely overwritten, which could lead to the loss of important settings like kv_cache_quant_algo. I have provided a suggestion to fix this by preserving the original configuration while adding the necessary quant_method.
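For context, the shape of the fix described above might look roughly like this (a hypothetical sketch, not the actual vllm/config.py diff; the function name and the hf_quant_config argument are illustrative):

```python
# Hypothetical sketch of the suggestion above, not the actual vllm/config.py change.
# Idea: keep every field of the parsed ModelOPT quantization section (e.g.
# kv_cache_quant_algo) instead of replacing the dict, and only add quant_method.
def normalize_modelopt_quant_config(hf_quant_config: dict) -> dict:
    quant_config = dict(hf_quant_config)           # preserve existing keys
    algo = str(quant_config.get("quant_algo", "")).upper()
    if "FP8" in algo:
        quant_config["quant_method"] = "modelopt"  # route to the ModelOPT backend
    return quant_config
```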
Force-pushed from 79a8f94 to 60b89dd
@mgoin Please help review this PR when you get a chance. Thanks!
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from c5f542a to d5b6ffe
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 63002ce to 16a673e
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from f9a759f to 5b68a27
Signed-off-by: jingyu <[email protected]>
Force-pushed from 5b68a27 to 0b754e0
@jingyu-ml, in general, why does ModelOpt need to add support on a model-by-model basis? We don't have to do this for any other quantization integration. Can ModelOpt be improved to work generically? Having a patchwork of support is a suboptimal user experience.
These types of changes are unmaintainable over time.
@robertgshaw2-redhat This PR can be closed: ModelOpt Qwen NVFP4/FP8 support is already available after the following two PRs were merged: #20101, #19815.
In general, ModelOpt works out of the box for a wide range of popular models, meaning we can quantize and export them without model-specific tuning. You can explore our APIs here: ModelOpt LLM PTQ Examples. For vLLM deployment, only minor adjustments to scale naming are sometimes needed, similar to the changes in this PR (though these cases have already been handled in other PRs). This is because vLLM processes
I think there's a good opportunity for us to collaborate on improving quantization support in vLLM. cc @mgoin
In #11148 (one of many PRs that added the concept of remapping KV cache scales to all models), #20046, and #16803, generic functionality was introduced into vLLM to support a new feature. This is very different from making one-off changes to models.
#11148 modified the specific model file vllm/model_executor/models/starcoder2.py just like this PR.
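To make the generic remapping concrete: model load loops route checkpoint scale names through a shared helper instead of hard-coding each variant. A simplified sketch follows (the helper import path is taken from vLLM sources; the surrounding loop is illustrative, not the exact code of any one model file):

```python
# Simplified illustration of the shared KV-scale remapping pattern; not the exact
# code of qwen2.py, qwen3_moe.py, or starcoder2.py.
from vllm.model_executor.model_loader.weight_utils import maybe_remap_kv_scale_name

def load_scale_aware(params_dict, weights):
    for name, loaded_weight in weights:
        if name.endswith((".k_scale", ".v_scale", ".kv_scale")):
            remapped = maybe_remap_kv_scale_name(name, params_dict)
            if remapped is None:
                continue  # this attention backend does not need the scale; skip it
            name = remapped
        param = params_dict[name]
        weight_loader = getattr(param, "weight_loader", None)
        if weight_loader is not None:
            weight_loader(param, loaded_weight)
        else:
            param.data.copy_(loaded_weight)
```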
Essential Elements of an Effective PR Description Checklist
- (Optional) Documentation update, such as updating supported_models.md and examples for a new model.

Purpose
Add QwQ-32B/Qwen2.5/Qwen3/Qwen3-MoE FP8 support from ModelOPT.
The FP8 checkpoint can be generated with ModelOPT's llm_ptq example: https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/llm_ptq (a rough sketch of the flow is shown below).
Tested on QwQ-32B, Qwen2.5-14B, Qwen3-1.7B, and Qwen3-30B-A3B.
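The sketch below is not a substitute for the linked example; it only outlines the PTQ-then-export flow. API names such as mtq.FP8_DEFAULT_CFG and export_hf_checkpoint are recalled from ModelOPT's documentation and may differ across versions; the model name and calibration loop are placeholders.

```python
# Rough sketch only; follow the linked llm_ptq example for the supported workflow.
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-1.7B"  # placeholder: any of the models listed above
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="cuda")
tokenizer = AutoTokenizer.from_pretrained(model_id)

def forward_loop(m):
    # Calibration: run a few representative prompts through the model.
    batch = tokenizer(["Hello, world."], return_tensors="pt").to(m.device)
    m(**batch)

model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)  # insert FP8 quantizers and calibrate
export_hf_checkpoint(model, export_dir="qwen3-1.7b-fp8")        # write an HF-format FP8 checkpoint
```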
Test Plan
The following script was used to test it:
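The script body itself does not appear above; purely as an illustrative stand-in, a minimal smoke test with vLLM's offline API could look like the following. The checkpoint path is a placeholder, and quantization="modelopt" assumes a ModelOPT-exported checkpoint.

```python
# Illustrative stand-in only, not the author's original test script.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/path/to/Qwen3-1.7B-FP8-modelopt",  # placeholder: exported ModelOPT FP8 checkpoint
    quantization="modelopt",                   # select the ModelOPT quantization backend
)
params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["Explain FP8 quantization in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)
```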
Test Result
(Optional) Documentation Update