
Conversation

@tlrmchlsmth (Member) commented Sep 24, 2025

Revert #24666 due to #25623, reapplied in #25696.

This reverts commit 6340025.

That commit is causing an illegal memory access when torch.compile is used with decode DBO:

(APIServer pid=1) (EngineCore_DP0 pid=276) ERROR 09-24 13:00:49 [core.py:708] RuntimeError: Failed: CUDA error /tmp/deepep/csrc/kernels/internode_ll.cu:391 'an illegal memory access was encountered'

Repro instructions:

I'm deploying vLLM using the llm-d WideEP well-lit path.

See the decoder manifest here:
https://github.com/llm-d/llm-d/blob/4970c7c2703dc23605719491c4fb380973b13517/guides/wide-ep-lws/manifests/modelserver/base/decode.yaml

In particular, this is the vLLM launch command:

              exec vllm serve \
                deepseek-ai/DeepSeek-R1-0528 \
                --port 8200 \
                --disable-uvicorn-access-log \
                --trust-remote-code \
                --enable-expert-parallel \
                --data-parallel-hybrid-lb \
                --tensor-parallel-size $TP_SIZE \
                --data-parallel-size $((LWS_GROUP_SIZE * DP_SIZE_LOCAL)) \
                --data-parallel-size-local $DP_SIZE_LOCAL \
                --data-parallel-address ${LWS_LEADER_ADDRESS} \
                --data-parallel-rpc-port 5555 \
                --data-parallel-start-rank $START_RANK \
                --enable-eplb \
                --eplb-config '{"window_size":"1000",
                                "step_interval":"3000",
                                "num_redundant_experts":"32",
                                "log_balancedness":"False"}' \
                --enable-dbo \
                --dbo-decode-token-threshold 32 \
                --kv_transfer_config '{"kv_connector":"NixlConnector",
                                        "kv_role":"kv_both"}'

From @LucasWilkinson's investigation:

The weird part is that it is failing in triton_poi_fused__to_copy_add_constant_pad_nd_mean_mul_pow_rsqrt_2, and DBO isn't even running. The fishy thing is that torch.compile appears to be rounding the input up to a multiple of 4:

        triton_poi_fused__to_copy_add_constant_pad_nd_mean_mul_pow_rsqrt_2_xnumel = 7168*s72 + 7168*(((-1)*s72) % 4)
        stream0 = get_raw_stream(0)
        triton_poi_fused__to_copy_add_constant_pad_nd_mean_mul_pow_rsqrt_2.run(buf17, buf13, buf12, arg4_1, buf14, arg7_1, s72, triton_poi_fused__to_copy_add_constant_pad_nd_mean_mul_pow_rsqrt_2_xnumel, stream=stream0)
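
As a sanity check (a minimal sketch, assuming 7168 is the model's hidden size and s72 the dynamic token count), that xnumel expression is exactly the hidden size times the token count rounded up to the next multiple of 4:

    # Minimal sketch: 7168*s72 + 7168*((-s72) % 4) equals 7168 times the
    # token count rounded up to a multiple of 4.
    # Assumptions: 7168 = hidden size, s72 = number of tokens.
    HIDDEN = 7168

    def xnumel(s72: int) -> int:
        return HIDDEN * s72 + HIDDEN * ((-s72) % 4)

    for s72 in range(1, 9):
        rounded_up = -(-s72 // 4) * 4  # ceil(s72 / 4) * 4
        assert xnumel(s72) == HIDDEN * rounded_up
    # e.g. s72 = 2 gives xnumel // HIDDEN == 4, matching the rounding above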

Possibly related to:

if self.is_hopper:
    # We pad unconditionally (even if shape is already divisible by 4)
    # to support dynamic shape for input_2d.shape[0] in torch.compile
    x = torch.nn.functional.pad(input_2d,
                                (0, 0, 0, -input_2d.shape[0] % 4))
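
For reference, here is a minimal standalone sketch of that padding arithmetic (the tensor below is a small stand-in for input_2d; the real hidden size is 7168): F.pad with a trailing pad of -n % 4 rows rounds the first dimension up to the next multiple of 4, and leaves the shape unchanged when it is already divisible.

    import torch
    import torch.nn.functional as F

    # Stand-in for input_2d: 2 tokens x hidden size 8 (shrunk from 7168
    # to keep the example small).
    input_2d = torch.randn(2, 8)

    # Same pad expression as above: (0, 0, 0, -n % 4) pads the end of the
    # first dimension so its size becomes a multiple of 4.
    padded = F.pad(input_2d, (0, 0, 0, -input_2d.shape[0] % 4))
    print(padded.shape)  # torch.Size([4, 8]) -- 2 tokens padded up to 4

    # When the row count is already a multiple of 4, -n % 4 == 0 and the
    # shape is unchanged.
    assert F.pad(torch.randn(4, 8), (0, 0, 0, -4 % 4)).shape == (4, 8)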

This is failing on a store:

Address Instruction
0x0000002ad1ea96c0 <+3520> PRMT R15, R8, 0x5410, R9
0x0000002ad1ea96d0 <+3536> @p2 EXIT
*> 0x0000002ad1ea96e0 <+3552> STG.E.128 desc[UR6][R26.64], R12
=> 0x0000002ad1ea96f0 <+3568> EXIT

@LucasWilkinson (Collaborator)

For posterity: this fails when capturing the first cudagraph with a token count that is not a multiple of 4 (in this case, 2).

This commit made it so that we started using the CUTLASS kernel instead of the Triton one, hence the introduction of padding.

tlrmchlsmth enabled auto-merge (squash) September 24, 2025 21:12
github-actions bot added the ready label Sep 24, 2025
@ProExpertProg (Collaborator) commented Sep 24, 2025

@tlrmchlsmth I assume this is on Hopper; can you post repro instructions?

> This commit made it so that we started using the CUTLASS kernel instead of the Triton one, hence the introduction of padding.

@LucasWilkinson are you saying we were dynamically dispatching to Triton based on num_tokens? AFAIU from the logic:

def apply_w8a8_block_fp8_linear(
    input: torch.Tensor,
    weight: torch.Tensor,
    block_size: list[int],
    weight_scale: torch.Tensor,
    input_scale: Optional[torch.Tensor] = None,
    bias: Optional[torch.Tensor] = None,
    cutlass_block_fp8_supported: bool = CUTLASS_BLOCK_FP8_SUPPORTED,
    use_aiter_and_is_supported: bool = False,
) -> torch.Tensor:
    w8a8_blockscale_func = dispatch_w8a8_blockscale_func(
        cutlass_block_fp8_supported, use_aiter_and_is_supported)
    if cutlass_block_fp8_supported:
        num_pad = 0
        if current_platform.is_device_capability(90):
            # pad first dimension to be divisible by 4 due to
            # cutlass blockwise gemm limitation for hopper
            num_pad = 4 - (input_2d.shape[0] % 4)

    quant()
    w8a8_blockscale_func() # is cutlass here
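
As an aside (plain arithmetic only, not vLLM code), the two pad expressions quoted in this thread only disagree when the row count is already a multiple of 4:

    # Pad amount from the excerpt above vs. the F.pad call quoted earlier.
    def num_pad_excerpt(n: int) -> int:
        return 4 - (n % 4)   # pads by 4 when n is already divisible by 4

    def num_pad_fpad(n: int) -> int:
        return -n % 4        # pads by 0 when n is already divisible by 4

    for n in (1, 2, 3, 4, 8):
        print(n, num_pad_excerpt(n), num_pad_fpad(n))
    # 2 -> 2, 2 (the failing case); 4 -> 4, 0; 8 -> 4, 0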

mergify bot commented Sep 24, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @tlrmchlsmth.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label Sep 24, 2025
Signed-off-by: Tyler Michael Smith <[email protected]>
mergify bot removed the needs-rebase label Sep 24, 2025
@ProExpertProg (Collaborator) commented Sep 25, 2025

EDIT: unrelated issue below

I extracted the issue into #25623. I got an IMA without using the fp8 block quant path and confirmed it still happens even with this revert:

vllm serve deepseek-ai/DeepSeek-V2-Lite --disable-uvicorn-access-log --trust-remote-code --enable-dbo --dbo-decode-token-threshold 32 --tensor-parallel 2

tlrmchlsmth merged commit 1260180 into vllm-project:main Sep 25, 2025 (49 checks passed)
Zhuul pushed a commit to Zhuul/vllm that referenced this pull request Sep 26, 2025
yewentao256 pushed a commit that referenced this pull request Oct 3, 2025
Labels: performance, ready