
Conversation

@jingyu-ml commented Jul 30, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

Add FP8 support for QwQ-32B/Qwen2.5/Qwen3/Qwen3-MoE checkpoints quantized with ModelOpt.
An FP8 checkpoint can be generated using ModelOpt's example: https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/llm_ptq.

Tested on QwQ-32B/Qwen2.5-14B/Qwen3-1.7B/Qwen3-30B-A3B
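
For reference, a rough sketch of the quantization flow that the linked llm_ptq example wraps is shown below (the calibration texts, the Qwen3-30B-A3B model id, and the export_hf_checkpoint helper are assumptions; the example script remains the authoritative recipe):

import torch
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint  # assumed export helper; may vary by ModelOpt version
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-30B-A3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Tiny stand-in calibration set; a real run would use a few hundred representative samples.
calib_texts = ["Hello, my name is", "The capital of France is"]

def forward_loop(model):
    for text in calib_texts:
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        with torch.no_grad():
            model(**inputs)

model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)  # insert FP8 quantizers and calibrate scales
export_hf_checkpoint(model, export_dir="Qwen3-30B-A3B-FP8")     # write an HF-style FP8 checkpoint for vLLM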

Test Plan

Use this script to test it:

from vllm import LLM, SamplingParams

def main():
    # model_id = "QwQ-32B"
    # model_id = "Qwen2.5-14B"
    # model_id = "Qwen3-1.7B"
    model_id = "Qwen3-30B-A3B"
    sampling_params = SamplingParams(temperature=0.8, top_p=0.9)

    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]

    # quantization="modelopt" selects vLLM's ModelOpt FP8 path for the pre-quantized checkpoint
    llm = LLM(model=model_id, quantization="modelopt")
    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")

if __name__ == "__main__":
    main()

Test Result

Prompt: 'Hello, my name is', Generated text: ' Chris. I need to make a form with a custom submit button. The problem'
Prompt: 'The president of the United States is', Generated text: ' not allowed to be a foreign national. The president must be a natural-born citizen'
Prompt: 'The capital of France is', Generated text: ' Paris. The capital of the United States is Washington, D.C. What is'
Prompt: 'The future of AI is', Generated text: ' about the emergence of super-intelligence, but the problem is that humans are not'

(Optional) Documentation Update


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

mergify bot added the qwen (Related to Qwen models) label Jul 30, 2025
@gemini-code-assist bot left a comment


Code Review

This pull request adds FP8 support for several Qwen models from ModelOPT. The changes involve updating the model configuration parsing to recognize ModelOPT FP8 checkpoints, and adjusting the weight loading logic in Qwen2 and Qwen3-MoE models to handle FP8-specific parameter names and loader function signatures.

The changes in vllm/model_executor/models/qwen2.py and vllm/model_executor/models/qwen3_moe.py for weight loading are correct and improve robustness. However, I've identified a high-severity issue in vllm/config.py where the quantization configuration is completely overwritten, which could lead to the loss of important settings like kv_cache_quant_algo. I have provided a suggestion to fix this by preserving the original configuration while adding the necessary quant_method.
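
For illustration only (hypothetical names, not the actual vllm/config.py code), the suggested pattern is to copy the parsed quantization dict and fill in quant_method, rather than replacing the dict outright, so sibling keys such as kv_cache_quant_algo survive:

def resolve_modelopt_quant_config(hf_quant_config: dict) -> dict:
    # Hypothetical sketch of the suggested fix; key names are assumptions, not vllm/config.py verbatim.
    quant_algo = hf_quant_config.get("quantization", {}).get("quant_algo", "")
    if quant_algo.upper() == "FP8":
        # Overwriting (e.g. returning {"quant_method": "modelopt"}) would drop settings
        # such as kv_cache_quant_algo. Preserve the original dict and only add the method.
        resolved = dict(hf_quant_config)
        resolved.setdefault("quant_method", "modelopt")
        return resolved
    return hf_quant_config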

@jingyu-ml force-pushed the jingyux/dev-qwen-fp8 branch from 79a8f94 to 60b89dd on July 30, 2025 23:01
@Edwardf0t1 (Contributor) commented:

@mgoin Please help review this PR when you get a chance. Thanks!


mergify bot commented Aug 5, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @jingyu-ml.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label Aug 5, 2025
@jingyu-ml force-pushed the jingyux/dev-qwen-fp8 branch from c5f542a to d5b6ffe on August 6, 2025 23:40
mergify bot removed the needs-rebase label Aug 7, 2025

mergify bot commented Aug 8, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @jingyu-ml.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@jingyu-ml requested a review from hmellor on August 8, 2025 21:12
@jingyu-ml force-pushed the jingyux/dev-qwen-fp8 branch 4 times, most recently from 63002ce to 16a673e, on August 8, 2025 22:48

mergify bot commented Aug 8, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @jingyu-ml.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the documentation, ci/build, deepseek, frontend, llama, multi-modality, new-model, performance, gpt-oss, rocm, speculative-decoding, v1, and tpu labels Aug 11, 2025

mergify bot commented Aug 11, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @jingyu-ml.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label Aug 11, 2025
@jingyu-ml force-pushed the jingyux/dev-qwen-fp8 branch from f9a759f to 5b68a27 on August 11, 2025 20:28
mergify bot removed the tpu (Related to Google TPUs) label Aug 11, 2025
Signed-off-by: jingyu <[email protected]>
@jingyu-ml force-pushed the jingyux/dev-qwen-fp8 branch from 5b68a27 to 0b754e0 on August 11, 2025 20:29
mergify bot removed the needs-rebase label Aug 11, 2025
@robertgshaw2-redhat (Collaborator) commented:

@jingyu-ml, in general, why does ModelOpt need to add support on a model-by-model basis? We don't have to do this for any other quantization integrations.

Can ModelOpt be improved to work generically? Having a patchwork of support is a suboptimal user experience.

@robertgshaw2-redhat (Collaborator) commented:

These types of changes are unmaintainable over time

@Edwardf0t1 (Contributor) commented:

> @jingyu-ml, in general, why does ModelOpt need to add support on a model-by-model basis? We don't have to do this for any other quantization integrations.
>
> Can ModelOpt be improved to work generically? Having a patchwork of support is a suboptimal user experience.

@robertgshaw2-redhat This PR can be closed — ModelOpt Qwen NVFP4/FP8 support is already available after the following two PRs were merged: #20101, #19815.

In general, ModelOpt works out of the box for a wide range of popular models, meaning we can quantize and export them without model-specific tuning. You can explore our APIs here: ModelOpt LLM PTQ Examples.

For vLLM deployment, only minor adjustments to scale naming are sometimes needed—similar to the changes in this PR (though these cases have already been handled in other PRs). This is because vLLM processes load_weights within each model class and applies name mapping there. I’ve also noticed some quantization-support PRs modify model files or handle model-specific configs, for example: #11148, #20046, #16803.
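
For context, a small illustrative sketch of that kind of scale-name remapping (the mapping table and helper below are hypothetical, not vLLM's exact implementation):

from typing import Iterable, Tuple
import torch

# Hypothetical map from ModelOpt-exported scale names to the names a vLLM attention module expects.
SCALE_NAME_MAP = {
    ".k_proj.k_scale": ".attn.k_scale",
    ".v_proj.v_scale": ".attn.v_scale",
}

def remap_scale_names(weights: Iterable[Tuple[str, torch.Tensor]]):
    """Rename checkpoint scale entries before the per-model load_weights loop consumes them."""
    for name, tensor in weights:
        for src, dst in SCALE_NAME_MAP.items():
            if name.endswith(src):
                name = name[: -len(src)] + dst
                break
        yield name, tensor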

I think there’s a good opportunity for us to collaborate on improving quantization support in vLLM.

cc @mgoin

@hmellor closed this Aug 12, 2025
@robertgshaw2-redhat (Collaborator) commented Aug 13, 2025

> > @jingyu-ml, in general, why does ModelOpt need to add support on a model-by-model basis? We don't have to do this for any other quantization integrations.
> >
> > Can ModelOpt be improved to work generically? Having a patchwork of support is a suboptimal user experience.
>
> @robertgshaw2-redhat This PR can be closed — ModelOpt Qwen NVFP4/FP8 support is already available after the following two PRs were merged: #20101, #19815.
>
> In general, ModelOpt works out of the box for a wide range of popular models, meaning we can quantize and export them without model-specific tuning. You can explore our APIs here: ModelOpt LLM PTQ Examples.
>
> For vLLM deployment, only minor adjustments to scale naming are sometimes needed—similar to the changes in this PR (though these cases have already been handled in other PRs). This is because vLLM processes load_weights within each model class and applies name mapping there. I've also noticed some quantization-support PRs modify model files or handle model-specific configs, for example: #11148, #20046, #16803.
>
> I think there's a good opportunity for us to collaborate on improving quantization support in vLLM.
>
> cc @mgoin

In #11148 (one of many PRs that added the concept of remapping KV cache scales to all models), #20046, and #16803, generic functionality was introduced into vLLM to support a new feature; this is very different from making one-off changes to models.

@Edwardf0t1 (Contributor) commented:

> > > @jingyu-ml, in general, why does ModelOpt need to add support on a model-by-model basis? We don't have to do this for any other quantization integrations.
> > > Can ModelOpt be improved to work generically? Having a patchwork of support is a suboptimal user experience.
> >
> > @robertgshaw2-redhat This PR can be closed — ModelOpt Qwen NVFP4/FP8 support is already available after the following two PRs were merged: #20101, #19815.
> > In general, ModelOpt works out of the box for a wide range of popular models, meaning we can quantize and export them without model-specific tuning. You can explore our APIs here: ModelOpt LLM PTQ Examples.
> > For vLLM deployment, only minor adjustments to scale naming are sometimes needed—similar to the changes in this PR (though these cases have already been handled in other PRs). This is because vLLM processes load_weights within each model class and applies name mapping there. I've also noticed some quantization-support PRs modify model files or handle model-specific configs, for example: #11148, #20046, #16803.
> > I think there's a good opportunity for us to collaborate on improving quantization support in vLLM.
> > cc @mgoin
>
> In #11148 (one of many PRs that added the concept of remapping KV cache scales to all models), #20046, and #16803, generic functionality was introduced into vLLM to support a new feature; this is very different from making one-off changes to models.

#11148 modified the specific model file vllm/model_executor/models/starcoder2.py just like this PR.
#20046 is a bugfix PR which includes a fix for Qwen2_5_VL.
#16803 is for Mistral-format support.
