Skip to content

Conversation

@fxmarty-amd
Copy link
Contributor

@fxmarty-amd fxmarty-amd commented Aug 6, 2025

As per title. The current logic relies on current_platform.supports_mx() to detect whether QDQ should be applied on activations, or simply quantization to lower precision passed to a kernel taking that as input.

As mxfp4 kernels are not yet integrated with quark models in vllm, and although CDNA4 supports native fp4 matrix core execution, let's simulate it for now as well.

Proper warnings are already shown to users:

else:
self.emulate = True
logger.warning_once(
"The current platform supports native MXFP4 "
"computation, but kernels are not yet integrated in vLLM. "
"Simulated weight dequantization and activation "
"QDQ (quantize and dequantize) will be used, with the linear "
"layers computed in high precision.")

else:
self.emulate = True
logger.warning_once(
"The current platform supports native MXFP4 "
"computation, but kernels are not yet integrated in vLLM. "
"Simulated weight dequantization and activation "
"QDQ (quantize and dequantize) will be used, with the linear "
"layers computed in high precision.")

This is ported from #21166 for faster merge.

@github-actions
Copy link

github-actions bot commented Aug 6, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request modifies the mxfp4 quantization logic to simulate its execution on platforms that natively support it, such as CDNA4, until the kernels are fully integrated. This is a temporary measure as indicated by the TODO. The change correctly enables the simulation path for all platforms. However, this introduces code duplication, as both branches of the conditional now execute the same logic. I've suggested a simplification to remove this redundancy.

Comment on lines 172 to 177
if not current_platform.supports_mx():
A = quant_dequant_mxfp4(A)
else:
raise NotImplementedError()
# TODO: native mxfp4 is currently not integrated in vllm,
# so simulating even on devices supporting this data type natively.
A = quant_dequant_mxfp4(A)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Since both branches of the if/else statement now execute the same code (A = quant_dequant_mxfp4(A)), the conditional is redundant. You can simplify the code by removing the if/else block and keeping only the necessary logic and the explanatory comment.

    # TODO: native mxfp4 is currently not integrated in vllm,
    # so simulating even on devices supporting this data type natively.
    A = quant_dequant_mxfp4(A)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're somewhat splitting hairs here, but I agree with this Gemini bot. I'm in favor of just deleting the if/else for now and adding it back when the two codepaths diverge again. Let's leave the todo, though

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SageMoore fixed in 13e8bfd, thank you!

fxmarty-amd added a commit to fxmarty-amd/vllm that referenced this pull request Aug 6, 2025
Copy link
Contributor

@SageMoore SageMoore left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR itself looks reasonable to me, but it would be nice to get some of these mxfp4 quark tests, which rely on having amd-quark installed, enabled in CI. Since importing quark.torch.kernel compiles kernels on first call we woulds still want to lazily import it in the custom op, but I don't immediately see any issues with adding amd-quark to requirements/rocm.txt and the AMD dockerfile . I am worried about it using different versions of common dependencies like torch, numpy, and compressed_tensors, though. Let's save it for another PR, but something to think about.

Can you do an lm_eval run of a mxfp4 model on a machine that would run through the simulator logic and post the results here.

@fxmarty-amd
Copy link
Contributor Author

it would be nice to get some of these mxfp4 quark tests, which rely on having amd-quark installed, enabled in CI

Good point, I'll have a look cc @BowenBao

Can you do an lm_eval run of a mxfp4 model on a machine that would run through the simulator logic and post the results here.

Will do as well!

@fxmarty-amd
Copy link
Contributor Author

@SageMoore pytest tests/quantization/test_quark.py -s -vvvvv -k "test_mxfp4_gsm8k_correctness" is passing fine on this branch, running on an mi355 machine.

Signed-off-by: Felix Marty <[email protected]>
@fxmarty-amd
Copy link
Contributor Author

@SageMoore do you need anything else from me to get this merged? Happy to update if needed!

@fxmarty-amd
Copy link
Contributor Author

cc @mgoin

@gshtras gshtras added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 23, 2025
Signed-off-by: Felix Marty <[email protected]>
@fxmarty-amd
Copy link
Contributor Author

Failing CI seems unrelated

@fxmarty-amd
Copy link
Contributor Author

cc @mgoin

@mergify
Copy link

mergify bot commented Oct 8, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @fxmarty-amd.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 8, 2025
@fxmarty-amd
Copy link
Contributor Author

@fxmarty-amd fxmarty-amd closed this Oct 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

needs-rebase ready ONLY add when PR is ready to merge/full CI is needed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants