Add Nvidia ModelOpt config adaptation #19815
Conversation
Summary of Changes
Hello @Edwardf0t1, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request focuses on enhancing the clarity and flexibility of quantization configuration parsing within vLLM. My primary goal was to standardize the naming convention for quantization libraries while ensuring backward compatibility, and to integrate robust support for Nvidia TensorRT Model Optimizer (ModelOpt) quantized models. This allows vLLM to seamlessly load and utilize models quantized with ModelOpt's FP8 and NVFP4 schemes, adapting to different configuration formats.
Highlights
- Quantization Config Parsing Improvement: I've updated the quantization configuration parsing logic to prioritize the `quant_library` field for improved semantic clarity. This change also ensures backward compatibility by falling back to the `quant_method` field if `quant_library` is not present (a minimal sketch follows this list).
- Nvidia ModelOpt Integration: I've added comprehensive support for Nvidia TensorRT Model Optimizer (ModelOpt) quantization, specifically for both FP8 and NVFP4 formats. This includes recognizing these as valid quantization methods within vLLM.
- Automatic ModelOpt Detection and Format Adaptation: I've implemented `override_quantization_method` for ModelOpt FP8 and NVFP4 configurations. This allows vLLM to automatically detect and apply the correct ModelOpt quantization method based on the `hf_quant_config.json` file, supporting both traditional ModelOpt nested config structures and flattened `compressed-tensors` style formats.
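As a rough illustration of that fallback (a minimal sketch with an assumed helper name, not the actual vLLM parsing code):

```python
# Minimal sketch of the described fallback, assuming the quantization config
# has already been loaded into a plain dict; the helper name is illustrative.
def get_quant_method_name(quant_cfg: dict) -> str | None:
    # Prefer the newer, more descriptive "quant_library" field and fall back
    # to "quant_method" so existing checkpoints keep working.
    return quant_cfg.get("quant_library") or quant_cfg.get("quant_method")
```

With this shape, a legacy config that only defines `quant_method` resolves exactly as before, while new exports can switch to `quant_library` without breaking anything.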
Code Review
The pull request improves the semantic clarity of quantization configurations and adds adaptation for Nvidia ModelOpt configurations, including auto-detection. The changes involve updating config parsing to prioritize quant_library over quant_method while maintaining backward compatibility, and adding modelopt config adaptation. The code has been reviewed and suggestions have been provided to improve robustness and clarity.
Hi @mgoin @robertgshaw2-redhat, it was very nice meeting you and the team yesterday to discuss collaboration between NV ModelOpt and llm-compressor. Could you help review this PR as discussed?
The current implementation doesn't seem to be utilizing the structure of the CT format; instead it includes duplicate/fixed information through the "quant_algo" and "kv_cache_scheme" entries.
What I mean is, I would expect your Llama FP8 config to be more like this:
"quantization_config": {
"config_groups": {
"group_0": {
"input_activations": {
"dynamic": false,
"strategy": "tensor",
"num_bits": 8,
"type": "float"
},
"weights": {
"dynamic": false,
"strategy": "tensor",
"num_bits": 8,
"type": "float"
}
"targets": ["Linear"],
}
},
"ignore": [
"lm_head"
],
"kv_cache_scheme": {
"dynamic": false,
"strategy": "tensor",
"num_bits": 8,
"type": "float"
},
"quant_method": "compressed-tensors",
"producer": {
"name": "modelopt",
"version": "0.33.0"
}
}And then the vLLM modelopt backend to have matching checks for that "FP8" scheme based on the sub-configs. This is like the _is_fp4a4_nvfp4 style functions we have in CT
vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py (lines 239 to 257 in c18b3b8):

```python
def _is_fp4a4_nvfp4(self, weight_quant: BaseModel, input_quant: BaseModel):
    if weight_quant is None or input_quant is None:
        return False
    is_tensor_group_quant = (weight_quant.strategy
                             == QuantizationStrategy.TENSOR_GROUP.value
                             and input_quant.strategy
                             == QuantizationStrategy.TENSOR_GROUP.value)
    is_symmetric = weight_quant.symmetric and input_quant.symmetric
    is_group_size_16 = (weight_quant.group_size == 16
                        and input_quant.group_size == 16)
    is_float_type = (weight_quant.type == QuantizationType.FLOAT
                     and input_quant.type == QuantizationType.FLOAT.value)
    is_4_bits = weight_quant.num_bits == 4 and input_quant.num_bits == 4
    return (is_tensor_group_quant and is_float_type and is_4_bits
            and is_group_size_16 and is_symmetric)
```
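A matching check for the per-tensor FP8 scheme could follow the same pattern. The sketch below is only an illustration of that idea, written against plain config-group dicts rather than the CT Pydantic models; the function name and dict-based access are assumptions, not code from this PR.

```python
# Hedged sketch of an FP8 per-tensor check in the _is_fp4a4_nvfp4 style,
# operating on plain dicts so it stays self-contained; names are illustrative.
def _is_fp8_w8a8_per_tensor(weight_quant: dict | None,
                            input_quant: dict | None) -> bool:
    if weight_quant is None or input_quant is None:
        return False
    is_tensor_strategy = (weight_quant.get("strategy") == "tensor"
                          and input_quant.get("strategy") == "tensor")
    is_float_type = (weight_quant.get("type") == "float"
                     and input_quant.get("type") == "float")
    is_8_bits = (weight_quant.get("num_bits") == 8
                 and input_quant.get("num_bits") == 8)
    is_static = (not weight_quant.get("dynamic", False)
                 and not input_quant.get("dynamic", False))
    return is_tensor_strategy and is_float_type and is_8_bits and is_static
```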
It would also be good to add a small config unit test to make sure vLLM parses the expected format and dispatches to the quant method correctly, similar to the tests here:

```python
def test_compressed_tensors_w8a8_static_setup(vllm_runner, model_args):
```
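For reference, such a test might look roughly like the outline below; the checkpoint id is a hypothetical placeholder and the exact assertions would depend on the final config format, so treat this as a sketch rather than the suggested test itself.

```python
# Rough outline of a config/dispatch test; the model id below is a
# hypothetical placeholder for a public ModelOpt FP8 checkpoint.
import pytest

MODELS = ["nvidia/Llama-3.1-8B-Instruct-FP8"]  # placeholder checkpoint id

@pytest.mark.parametrize("model_id", MODELS)
def test_modelopt_fp8_config_dispatch(vllm_runner, model_id):
    # If the quantization config is parsed and dispatched correctly, the model
    # should load with the modelopt method and generate without errors.
    with vllm_runner(model_id, quantization="modelopt") as llm:
        outputs = llm.generate_greedy(["Hello my name is"], max_tokens=8)
        assert outputs
```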
Thank you for the feedback! @mgoin
For a list of arguments that are defined in each CT structure, you can refer to the Pydantic model here:
Thanks for the pointer, @dsikka. I was trying to find what arguments need to be explicitly defined in the model's quant config. Looks like
tests/quantization/test_modelopt.py
Do you want to skip this test for now until you have a public checkpoint? I think this will break the quantization test
Sounds good!
tests/quantization/test_modelopt.py
Why does this require V0?
Actually, I was aligning with the test here, which requires V0: https://github.com/vllm-project/vllm/blob/main/tests/quantization/test_compressed_tensors.py#L44
Do you know which types of module tests require V0?
This pull request has merge conflicts that must be resolved before it can be merged.
This PR is to add Nvidia TensorRT Model Optimizer (`modelopt`) config adaptation with auto detection. This is part of the effort to unify `modelopt`'s config format and `compressed-tensors`' config format.

- Update config parsing to look for `quant_library` instead of `quant_method`.
- Maintain backward compatibility by checking both field names.
- Add `modelopt` config adaptation to handle it as a quant method option (see the detection sketch after this list).
- Add `test_modelopt.py`.

It's essentially format standardization while preserving library-specific functionality.
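As a rough illustration of that detection (the function name and mapping below are assumptions, not this PR's actual `override_quantization_method` implementation; the returned strings assume vLLM's existing `modelopt` / `modelopt_fp4` method names):

```python
# Illustrative sketch of the auto-detection idea: inspect the parsed quant
# config and map ModelOpt algorithm names onto vLLM quantization methods.
def detect_modelopt_method(hf_quant_config: dict) -> str | None:
    # The traditional ModelOpt format nests settings under a "quantization"
    # key; the flattened compressed-tensors style keeps them at the top level.
    cfg = hf_quant_config.get("quantization", hf_quant_config)
    quant_algo = str(cfg.get("quant_algo", "")).upper()
    if quant_algo == "FP8":
        return "modelopt"        # per-tensor FP8 weights/activations
    if quant_algo in ("NVFP4", "FP4"):
        return "modelopt_fp4"    # NVFP4 4-bit float with group scales
    return None
```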
Test Plan
`Llama-3.1-8B-Instruct-FP8` is produced by `modelopt` with per-tensor FP8 weights and activations. It can be generated by running the following command under this directory:

```
python hf_ptq.py --pyt_ckpt_path meta-llama/Llama-3.1-8B-Instruct --qformat fp8 --export_fmt hf --export_path Llama-3.1-8B-Instruct-FP8 --trust_remote_code
```

The quantization config in the exported `config.json` would look like the following:
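The exported config itself is not reproduced here. Purely as an inferred illustration (pieced together from the `quant_algo`, `kv_cache_scheme`, and `producer` entries discussed earlier in this thread, not copied from the actual export), it would carry entries along these lines:

```python
# Inferred illustration only; not the actual config exported by hf_ptq.py.
example_quantization_config = {
    "quant_method": "modelopt",  # library identifier picked up by vLLM
    "quant_algo": "FP8",         # per-tensor FP8 weights and activations
    "kv_cache_scheme": {
        "dynamic": False,
        "strategy": "tensor",
        "num_bits": 8,
        "type": "float",
    },
    "ignore": ["lm_head"],
    "producer": {"name": "modelopt", "version": "0.33.0"},
}
```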
Test Result

(Optional) Documentation Update