[Bugfix] Simulate mxfp4 quark model execution on cdna4 until kernels are integrated #22355
Conversation
Code Review
This pull request modifies the mxfp4 quantization logic to simulate its execution on platforms that natively support it, such as CDNA4, until the kernels are fully integrated. This is a temporary measure as indicated by the TODO. The change correctly enables the simulation path for all platforms. However, this introduces code duplication, as both branches of the conditional now execute the same logic. I've suggested a simplification to remove this redundancy.
```diff
 if not current_platform.supports_mx():
     A = quant_dequant_mxfp4(A)
 else:
-    raise NotImplementedError()
+    # TODO: native mxfp4 is currently not integrated in vllm,
+    # so simulating even on devices supporting this data type natively.
+    A = quant_dequant_mxfp4(A)
```
Since both branches of the if/else statement now execute the same code (A = quant_dequant_mxfp4(A)), the conditional is redundant. You can simplify the code by removing the if/else block and keeping only the necessary logic and the explanatory comment.
```python
# TODO: native mxfp4 is currently not integrated in vllm,
# so simulating even on devices supporting this data type natively.
A = quant_dequant_mxfp4(A)
```
We're somewhat splitting hairs here, but I agree with the Gemini bot. I'm in favor of just deleting the if/else for now and adding it back when the two code paths diverge again. Let's leave the TODO, though.
@SageMoore fixed in 13e8bfd, thank you!
SageMoore left a comment:
The PR itself looks reasonable to me, but it would be nice to get some of these mxfp4 quark tests, which rely on having amd-quark installed, enabled in CI. Since importing quark.torch.kernel compiles kernels on first call, we would still want to lazily import it in the custom op, but I don't immediately see any issues with adding amd-quark to requirements/rocm.txt and the AMD Dockerfile. I am worried about it pulling in different versions of common dependencies like torch, numpy, and compressed_tensors, though. Let's save it for another PR, but something to think about.
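As a sketch of the lazy-import pattern being described, hypothetical class and method names, not vLLM's actual code:

```python
class MXFP4MoEMethod:  # placeholder name for the custom op wrapper
    def apply(self, layer, x):
        # Deferred import: importing quark.torch.kernel triggers kernel
        # compilation, so pay that cost on the first forward call rather
        # than at module import time.
        import quark.torch.kernel as quark_kernel  # noqa: F401
        ...  # dispatch into the quark kernel here (elided)
```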
Can you do an lm_eval run of an mxfp4 model on a machine that would go through the simulation logic, and post the results here?
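For reference, a minimal sketch of such a run via lm-evaluation-harness's Python API; the checkpoint name is a placeholder and the task choice is an assumption, neither is specified in this PR:

```python
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    # Placeholder: substitute the mxfp4 quark checkpoint under test.
    model_args="pretrained=<mxfp4-quark-checkpoint>,dtype=auto",
    tasks=["gsm8k"],  # assumed task choice
)
print(results["results"]["gsm8k"])
```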
Good point, I'll have a look. cc @BowenBao
Will do as well!
@SageMoore
@SageMoore do you need anything else from me to get this merged? Happy to update if needed!
cc @mgoin
Failing CI seems unrelated.
cc @mgoin
This pull request has merge conflicts that must be resolved before it can be merged.
Closing as fixed as part of #21166, see https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/fused_moe/utils.py |
As per title. The current logic relies on `current_platform.supports_mx()` to detect whether QDQ (quantize-dequantize) should be applied on activations, or whether activations should simply be quantized to the lower precision and passed to a kernel that takes that data type as input. As mxfp4 kernels are not yet integrated with quark models in vllm, and although CDNA4 supports native fp4 matrix core execution, let's simulate it for now as well.
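For illustration, here is a minimal self-contained sketch of what the QDQ simulation path means; this is an approximation written for this description, not vLLM's actual `quant_dequant_mxfp4`. Each 32-element block gets a shared power-of-two scale, every element is rounded to the nearest E2M1 (fp4) value, and the result is immediately dequantized back to the original dtype:

```python
import torch

# Non-negative magnitudes representable by an E2M1 (fp4) element.
FP4_MAGNITUDES = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quant_dequant_mxfp4_sketch(a: torch.Tensor, block: int = 32) -> torch.Tensor:
    """Round `a` to the mxfp4 grid and back; illustrative only."""
    assert a.numel() % block == 0, "sketch assumes block-aligned inputs"
    x = a.reshape(-1, block).float()
    # Shared per-block power-of-two scale (E8M0-style), chosen so the block
    # max lands near the top of the E2M1 range (largest magnitude 6.0).
    amax = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-30)
    scale = torch.exp2(torch.floor(torch.log2(amax)) - 2.0)
    # Round each scaled value to the nearest representable fp4 value.
    candidates = torch.cat([-FP4_MAGNITUDES.flip(0), FP4_MAGNITUDES]).to(x.device)
    idx = (x.div(scale).clamp(-6.0, 6.0).unsqueeze(-1) - candidates).abs().argmin(-1)
    return (candidates[idx] * scale).reshape(a.shape).to(a.dtype)
```

Applied to a bf16 activation tensor, this returns a bf16 tensor whose values have been snapped to the mxfp4 grid, which is exactly the precision loss the simulation path is meant to reproduce.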
Proper warnings are already shown to users:

- vllm/vllm/model_executor/layers/quantization/quark/quark_moe.py, lines 284 to 291 at 9edd1db
- vllm/vllm/model_executor/layers/quantization/quark/schemes/quark_w4a4_mxfp4.py, lines 45 to 52 at 9edd1db
This is ported from #21166 for faster merge.