
Conversation

@jingyu-ml commented Jul 30, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

Add FP8 support for QwQ-32B/Qwen2.5/Qwen3/Qwen3-MoE checkpoints quantized with ModelOpt.
An FP8 checkpoint can be generated using ModelOpt's example: https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/llm_ptq.

Tested on QwQ-32B/Qwen2.5-14B/Qwen3-1.7B/Qwen3-30B-A3B
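
For reference, a rough sketch of the quantization flow that the linked llm_ptq example wraps is shown below (the calibration texts, the Qwen3-30B-A3B model id, and the export_hf_checkpoint helper are assumptions; the example script remains the authoritative recipe):

import torch
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint  # assumed export helper; may vary by ModelOpt version
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-30B-A3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Tiny stand-in calibration set; a real run would use a few hundred representative samples.
calib_texts = ["Hello, my name is", "The capital of France is"]

def forward_loop(model):
    for text in calib_texts:
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        with torch.no_grad():
            model(**inputs)

model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)  # insert FP8 quantizers and calibrate scales
export_hf_checkpoint(model, export_dir="Qwen3-30B-A3B-FP8")     # write an HF-style FP8 checkpoint for vLLM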

Test Plan

Use this script to test it:

from vllm import LLM, SamplingParams

def main():
    # model_id = "QwQ-32B"
    # model_id = "Qwen2.5-14B"
    # model_id = "Qwen3-1.7B"
    model_id = "Qwen3-30B-A3B"
    sampling_params = SamplingParams(temperature=0.8, top_p=0.9)

    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]

    # quantization="modelopt" selects vLLM's ModelOpt FP8 path for the pre-quantized checkpoint
    llm = LLM(model=model_id, quantization="modelopt")
    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")

if __name__ == "__main__":
    main()

Test Result

Prompt: 'Hello, my name is', Generated text: ' Chris. I need to make a form with a custom submit button. The problem'
Prompt: 'The president of the United States is', Generated text: ' not allowed to be a foreign national. The president must be a natural-born citizen'
Prompt: 'The capital of France is', Generated text: ' Paris. The capital of the United States is Washington, D.C. What is'
Prompt: 'The future of AI is', Generated text: ' about the emergence of super-intelligence, but the problem is that humans are not'

(Optional) Documentation Update


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

mergify bot added the qwen (Related to Qwen models) label Jul 30, 2025
@gemini-code-assist bot left a comment


Code Review

This pull request adds FP8 support for several Qwen models from ModelOPT. The changes involve updating the model configuration parsing to recognize ModelOPT FP8 checkpoints, and adjusting the weight loading logic in Qwen2 and Qwen3-MoE models to handle FP8-specific parameter names and loader function signatures.

The changes in vllm/model_executor/models/qwen2.py and vllm/model_executor/models/qwen3_moe.py for weight loading are correct and improve robustness. However, I've identified a high-severity issue in vllm/config.py where the quantization configuration is completely overwritten, which could lead to the loss of important settings like kv_cache_quant_algo. I have provided a suggestion to fix this by preserving the original configuration while adding the necessary quant_method.
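
For illustration only (hypothetical names, not the actual vllm/config.py code), the suggested pattern is to copy the parsed quantization dict and fill in quant_method, rather than replacing the dict outright, so sibling keys such as kv_cache_quant_algo survive:

def resolve_modelopt_quant_config(hf_quant_config: dict) -> dict:
    # Hypothetical sketch of the suggested fix; key names are assumptions, not vllm/config.py verbatim.
    quant_algo = hf_quant_config.get("quantization", {}).get("quant_algo", "")
    if quant_algo.upper() == "FP8":
        # Overwriting (e.g. returning {"quant_method": "modelopt"}) would drop settings
        # such as kv_cache_quant_algo. Preserve the original dict and only add the method.
        resolved = dict(hf_quant_config)
        resolved.setdefault("quant_method", "modelopt")
        return resolved
    return hf_quant_config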

@jingyu-ml force-pushed the jingyux/dev-qwen-fp8 branch from 79a8f94 to 60b89dd on July 30, 2025 23:01
@Edwardf0t1 (Contributor) commented:

@mgoin Please help review this PR when you get a chance. Thanks!


mergify bot commented Aug 5, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @jingyu-ml.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label Aug 5, 2025
@jingyu-ml force-pushed the jingyux/dev-qwen-fp8 branch from c5f542a to d5b6ffe on August 6, 2025 23:40
mergify bot removed the needs-rebase label Aug 7, 2025

mergify bot commented Aug 8, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @jingyu-ml.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@jingyu-ml requested a review from hmellor on August 8, 2025 21:12
@jingyu-ml force-pushed the jingyux/dev-qwen-fp8 branch 4 times, most recently from 63002ce to 16a673e, on August 8, 2025 22:48

mergify bot commented Aug 8, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @jingyu-ml.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the documentation, ci/build, deepseek, frontend, llama, multi-modality, new-model, performance, gpt-oss, rocm, speculative-decoding, v1, and tpu labels Aug 11, 2025

mergify bot commented Aug 11, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @jingyu-ml.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label Aug 11, 2025
@jingyu-ml force-pushed the jingyux/dev-qwen-fp8 branch from f9a759f to 5b68a27 on August 11, 2025 20:28
mergify bot removed the tpu (Related to Google TPUs) label Aug 11, 2025
Signed-off-by: jingyu <[email protected]>
@jingyu-ml force-pushed the jingyux/dev-qwen-fp8 branch from 5b68a27 to 0b754e0 on August 11, 2025 20:29
mergify bot removed the needs-rebase label Aug 11, 2025
@robertgshaw2-redhat (Collaborator) commented:

@jingyu-ml, in general, why does ModelOpt need to add support on a model-by-model basis? We don't have to do this for any other quantization integrations.

Can ModelOpt be improved to work generically? Having a patchwork of support is a suboptimal user experience.

@robertgshaw2-redhat (Collaborator) commented:

These types of changes are unmaintainable over time

@Edwardf0t1 (Contributor) commented:

> @jingyu-ml, in general, why does ModelOpt need to add support on a model-by-model basis? We don't have to do this for any other quantization integrations.
>
> Can ModelOpt be improved to work generically? Having a patchwork of support is a suboptimal user experience.

@robertgshaw2-redhat This PR can be closed — ModelOpt Qwen NVFP4/FP8 support is already available after the following two PRs were merged: #20101, #19815.

In general, ModelOpt works out of the box for a wide range of popular models, meaning we can quantize and export them without model-specific tuning. You can explore our APIs here: ModelOpt LLM PTQ Examples.

For vLLM deployment, only minor adjustments to scale naming are sometimes needed—similar to the changes in this PR (though these cases have already been handled in other PRs). This is because vLLM processes load_weights within each model class and applies name mapping there. I’ve also noticed some quantization-support PRs modify model files or handle model-specific configs, for example: #11148, #20046, #16803.
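
For context, a small illustrative sketch of that kind of scale-name remapping (the mapping table and helper below are hypothetical, not vLLM's exact implementation):

from typing import Iterable, Tuple
import torch

# Hypothetical map from ModelOpt-exported scale names to the names a vLLM attention module expects.
SCALE_NAME_MAP = {
    ".k_proj.k_scale": ".attn.k_scale",
    ".v_proj.v_scale": ".attn.v_scale",
}

def remap_scale_names(weights: Iterable[Tuple[str, torch.Tensor]]):
    """Rename checkpoint scale entries before the per-model load_weights loop consumes them."""
    for name, tensor in weights:
        for src, dst in SCALE_NAME_MAP.items():
            if name.endswith(src):
                name = name[: -len(src)] + dst
                break
        yield name, tensor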

I think there’s a good opportunity for us to collaborate on improving quantization support in vLLM.

cc @mgoin

@hmellor closed this Aug 12, 2025
@robertgshaw2-redhat (Collaborator) commented Aug 13, 2025

> > @jingyu-ml, in general, why does ModelOpt need to add support on a model-by-model basis? We don't have to do this for any other quantization integrations.
> >
> > Can ModelOpt be improved to work generically? Having a patchwork of support is a suboptimal user experience.
>
> @robertgshaw2-redhat This PR can be closed — ModelOpt Qwen NVFP4/FP8 support is already available after the following two PRs were merged: #20101, #19815.
>
> In general, ModelOpt works out of the box for a wide range of popular models, meaning we can quantize and export them without model-specific tuning. You can explore our APIs here: ModelOpt LLM PTQ Examples.
>
> For vLLM deployment, only minor adjustments to scale naming are sometimes needed—similar to the changes in this PR (though these cases have already been handled in other PRs). This is because vLLM processes load_weights within each model class and applies name mapping there. I've also noticed some quantization-support PRs modify model files or handle model-specific configs, for example: #11148, #20046, #16803.
>
> I think there's a good opportunity for us to collaborate on improving quantization support in vLLM.
>
> cc @mgoin

In #11148 (one of many PRs that added the concept of remapping KV cache scales to all models), #20046, and #16803, generic functionality was introduced into vLLM to support a new feature; this is very different from making one-off changes to models.

@Edwardf0t1 (Contributor) commented:

> > > @jingyu-ml, in general, why does ModelOpt need to add support on a model-by-model basis? We don't have to do this for any other quantization integrations.
> > > Can ModelOpt be improved to work generically? Having a patchwork of support is a suboptimal user experience.
> >
> > @robertgshaw2-redhat This PR can be closed — ModelOpt Qwen NVFP4/FP8 support is already available after the following two PRs were merged: #20101, #19815.
> > In general, ModelOpt works out of the box for a wide range of popular models, meaning we can quantize and export them without model-specific tuning. You can explore our APIs here: ModelOpt LLM PTQ Examples.
> > For vLLM deployment, only minor adjustments to scale naming are sometimes needed—similar to the changes in this PR (though these cases have already been handled in other PRs). This is because vLLM processes load_weights within each model class and applies name mapping there. I've also noticed some quantization-support PRs modify model files or handle model-specific configs, for example: #11148, #20046, #16803.
> > I think there's a good opportunity for us to collaborate on improving quantization support in vLLM.
> > cc @mgoin
>
> In #11148 (one of many PRs that added the concept of remapping KV cache scales to all models), #20046, and #16803, generic functionality was introduced into vLLM to support a new feature; this is very different from making one-off changes to models.

#11148 modified the specific model file vllm/model_executor/models/starcoder2.py just like this PR.
#20046 is a bugfix PR which includes a fix for Qwen2_5_VL.
#16803 is for Mistral-format support.
