[Bugfix] Simulate mxfp4 quark model execution on cdna4 until kernels are integrated #22355
Conversation
Code Review
This pull request modifies the mxfp4 quantization logic to simulate its execution on platforms that natively support it, such as CDNA4, until the kernels are fully integrated. This is a temporary measure as indicated by the TODO. The change correctly enables the simulation path for all platforms. However, this introduces code duplication, as both branches of the conditional now execute the same logic. I've suggested a simplification to remove this redundancy.
```diff
 if not current_platform.supports_mx():
     A = quant_dequant_mxfp4(A)
 else:
-    raise NotImplementedError()
+    # TODO: native mxfp4 is currently not integrated in vllm,
+    # so simulating even on devices supporting this data type natively.
+    A = quant_dequant_mxfp4(A)
```
Since both branches of the if/else statement now execute the same code (A = quant_dequant_mxfp4(A)), the conditional is redundant. You can simplify the code by removing the if/else block and keeping only the necessary logic and the explanatory comment.
```python
# TODO: native mxfp4 is currently not integrated in vllm,
# so simulating even on devices supporting this data type natively.
A = quant_dequant_mxfp4(A)
```
We're somewhat splitting hairs here, but I agree with the Gemini bot. I'm in favor of just deleting the if/else for now and adding it back when the two code paths diverge again. Let's leave the TODO, though.
@SageMoore fixed in 13e8bfd, thank you!
SageMoore left a comment:
The PR itself looks reasonable to me, but it would be nice to get some of these mxfp4 quark tests, which rely on having amd-quark installed, enabled in CI. Since importing quark.torch.kernel compiles kernels on first call, we would still want to lazily import it in the custom op, but I don't immediately see any issues with adding amd-quark to requirements/rocm.txt and the AMD Dockerfile. I am worried about it pulling in different versions of common dependencies like torch, numpy, and compressed_tensors, though. Let's save it for another PR, but something to think about.
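As a sketch of the lazy-import pattern being described, hypothetical class and method names, not vLLM's actual code:

```python
class MXFP4MoEMethod:  # placeholder name for the custom op wrapper
    def apply(self, layer, x):
        # Deferred import: importing quark.torch.kernel triggers kernel
        # compilation, so pay that cost on the first forward call rather
        # than at module import time.
        import quark.torch.kernel as quark_kernel  # noqa: F401
        ...  # dispatch into the quark kernel here (elided)
```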
Can you do an lm_eval run of an mxfp4 model on a machine that would go through the simulation logic, and post the results here?
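For reference, a minimal sketch of such a run via lm-evaluation-harness's Python API; the checkpoint name is a placeholder and the task choice is an assumption, neither is specified in this PR:

```python
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    # Placeholder: substitute the mxfp4 quark checkpoint under test.
    model_args="pretrained=<mxfp4-quark-checkpoint>,dtype=auto",
    tasks=["gsm8k"],  # assumed task choice
)
print(results["results"]["gsm8k"])
```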
Good point, I'll have a look. cc @BowenBao
Will do as well!
@SageMoore
@SageMoore do you need anything else from me to get this merged? Happy to update if needed!
cc @mgoin
Failing CI seems unrelated.
cc @mgoin
This pull request has merge conflicts that must be resolved before it can be merged.
Closing as fixed as part of #21166, see https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/fused_moe/utils.py |
As per title. The current logic relies on `current_platform.supports_mx()` to detect whether QDQ (quantize-dequantize) should be applied on activations, or whether activations should simply be quantized to the lower precision and passed to a kernel that takes that data type as input. As mxfp4 kernels are not yet integrated with quark models in vllm, and although CDNA4 supports native fp4 matrix core execution, let's simulate it for now as well.
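For illustration, here is a minimal self-contained sketch of what the QDQ simulation path means; this is an approximation written for this description, not vLLM's actual `quant_dequant_mxfp4`. Each 32-element block gets a shared power-of-two scale, every element is rounded to the nearest E2M1 (fp4) value, and the result is immediately dequantized back to the original dtype:

```python
import torch

# Non-negative magnitudes representable by an E2M1 (fp4) element.
FP4_MAGNITUDES = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quant_dequant_mxfp4_sketch(a: torch.Tensor, block: int = 32) -> torch.Tensor:
    """Round `a` to the mxfp4 grid and back; illustrative only."""
    assert a.numel() % block == 0, "sketch assumes block-aligned inputs"
    x = a.reshape(-1, block).float()
    # Shared per-block power-of-two scale (E8M0-style), chosen so the block
    # max lands near the top of the E2M1 range (largest magnitude 6.0).
    amax = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-30)
    scale = torch.exp2(torch.floor(torch.log2(amax)) - 2.0)
    # Round each scaled value to the nearest representable fp4 value.
    candidates = torch.cat([-FP4_MAGNITUDES.flip(0), FP4_MAGNITUDES]).to(x.device)
    idx = (x.div(scale).clamp(-6.0, 6.0).unsqueeze(-1) - candidates).abs().argmin(-1)
    return (candidates[idx] * scale).reshape(a.shape).to(a.dtype)
```

Applied to a bf16 activation tensor, this returns a bf16 tensor whose values have been snapped to the mxfp4 grid, which is exactly the precision loss the simulation path is meant to reproduce.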
Proper warnings are already shown to users:

- vllm/vllm/model_executor/layers/quantization/quark/quark_moe.py, lines 284 to 291 at 9edd1db
- vllm/vllm/model_executor/layers/quantization/quark/schemes/quark_w4a4_mxfp4.py, lines 45 to 52 at 9edd1db
This is ported from #21166 for faster merge.