
Conversation

@tlrmchlsmth (Member) commented Sep 24, 2025

Revert #24666 due to #25623, reapplied in #25696.

This reverts commit 6340025.

That commit is causing an illegal memory access when torch.compile is used with decode DBO:

(APIServer pid=1) (EngineCore_DP0 pid=276) ERROR 09-24 13:00:49 [core.py:708] RuntimeError: Failed: CUDA error /tmp/deepep/csrc/kernels/internode_ll.cu:391 'an illegal memory access was encountered'

Repro instructions:

I'm deploying vLLM using the llm-d WideEP well-lit path.

See the decoder manifest here:
https://github.com/llm-d/llm-d/blob/4970c7c2703dc23605719491c4fb380973b13517/guides/wide-ep-lws/manifests/modelserver/base/decode.yaml

In particular, this is the vLLM launch command:

              exec vllm serve \
                deepseek-ai/DeepSeek-R1-0528 \
                --port 8200 \
                --disable-uvicorn-access-log \
                --trust-remote-code \
                --enable-expert-parallel \
                --data-parallel-hybrid-lb \
                --tensor-parallel-size $TP_SIZE \
                --data-parallel-size $((LWS_GROUP_SIZE * DP_SIZE_LOCAL)) \
                --data-parallel-size-local $DP_SIZE_LOCAL \
                --data-parallel-address ${LWS_LEADER_ADDRESS} \
                --data-parallel-rpc-port 5555 \
                --data-parallel-start-rank $START_RANK \
                --enable-eplb \
                --eplb-config '{"window_size":"1000",
                                "step_interval":"3000",
                                "num_redundant_experts":"32",
                                "log_balancedness":"False"}' \
                --enable-dbo \
                --dbo-decode-token-threshold 32 \
                --kv_transfer_config '{"kv_connector":"NixlConnector",
                                        "kv_role":"kv_both"}'

From @LucasWilkinson's investigation:

The weird part is that it is failing in triton_poi_fused__to_copy_add_constant_pad_nd_mean_mul_pow_rsqrt_2, and DBO isn't even running. The fishy thing is that torch.compile appears to be rounding the input up to a multiple of 4:

        triton_poi_fused__to_copy_add_constant_pad_nd_mean_mul_pow_rsqrt_2_xnumel = 7168*s72 + 7168*(((-1)*s72) % 4)
        stream0 = get_raw_stream(0)
        triton_poi_fused__to_copy_add_constant_pad_nd_mean_mul_pow_rsqrt_2.run(buf17, buf13, buf12, arg4_1, buf14, arg7_1, s72, triton_poi_fused__to_copy_add_constant_pad_nd_mean_mul_pow_rsqrt_2_xnumel, stream=stream0)
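
As a sanity check (a minimal sketch, assuming 7168 is the model's hidden size and s72 the dynamic token count), that xnumel expression is exactly the hidden size times the token count rounded up to the next multiple of 4:

    # Minimal sketch: 7168*s72 + 7168*((-s72) % 4) equals 7168 times the
    # token count rounded up to a multiple of 4.
    # Assumptions: 7168 = hidden size, s72 = number of tokens.
    HIDDEN = 7168

    def xnumel(s72: int) -> int:
        return HIDDEN * s72 + HIDDEN * ((-s72) % 4)

    for s72 in range(1, 9):
        rounded_up = -(-s72 // 4) * 4  # ceil(s72 / 4) * 4
        assert xnumel(s72) == HIDDEN * rounded_up
    # e.g. s72 = 2 gives xnumel // HIDDEN == 4, matching the rounding above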

Possibly related to:

if self.is_hopper:
    # We pad unconditionally (even if shape is already divisible by 4)
    # to support dynamic shape for input_2d.shape[0] in torch.compile
    x = torch.nn.functional.pad(input_2d,
                                (0, 0, 0, -input_2d.shape[0] % 4))
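
For reference, here is a minimal standalone sketch of that padding arithmetic (the tensor below is a small stand-in for input_2d; the real hidden size is 7168): F.pad with a trailing pad of -n % 4 rows rounds the first dimension up to the next multiple of 4, and leaves the shape unchanged when it is already divisible.

    import torch
    import torch.nn.functional as F

    # Stand-in for input_2d: 2 tokens x hidden size 8 (shrunk from 7168
    # to keep the example small).
    input_2d = torch.randn(2, 8)

    # Same pad expression as above: (0, 0, 0, -n % 4) pads the end of the
    # first dimension so its size becomes a multiple of 4.
    padded = F.pad(input_2d, (0, 0, 0, -input_2d.shape[0] % 4))
    print(padded.shape)  # torch.Size([4, 8]) -- 2 tokens padded up to 4

    # When the row count is already a multiple of 4, -n % 4 == 0 and the
    # shape is unchanged.
    assert F.pad(torch.randn(4, 8), (0, 0, 0, -4 % 4)).shape == (4, 8)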

This is failing on a store:

Address Instruction
0x0000002ad1ea96c0 <+3520> PRMT R15, R8, 0x5410, R9
0x0000002ad1ea96d0 <+3536> @p2 EXIT
*> 0x0000002ad1ea96e0 <+3552> STG.E.128 desc[UR6][R26.64], R12
=> 0x0000002ad1ea96f0 <+3568> EXIT

@LucasWilkinson (Collaborator)

For posterity: this fails when capturing the first cudagraph with a token count that is not a multiple of 4 (in this case, 2).

This commit made it so that we started using the CUTLASS kernel instead of the Triton one, hence the introduction of padding.

tlrmchlsmth enabled auto-merge (squash) September 24, 2025 21:12
github-actions bot added the ready label Sep 24, 2025
@ProExpertProg (Collaborator) commented Sep 24, 2025

@tlrmchlsmth I assume this is on Hopper; can you post repro instructions?

> This commit made it so that we started using the CUTLASS kernel instead of the Triton one, hence the introduction of padding.

@LucasWilkinson are you saying we were dynamically dispatching to Triton based on num_tokens? AFAIU from the logic:

def apply_w8a8_block_fp8_linear(
    input: torch.Tensor,
    weight: torch.Tensor,
    block_size: list[int],
    weight_scale: torch.Tensor,
    input_scale: Optional[torch.Tensor] = None,
    bias: Optional[torch.Tensor] = None,
    cutlass_block_fp8_supported: bool = CUTLASS_BLOCK_FP8_SUPPORTED,
    use_aiter_and_is_supported: bool = False,
) -> torch.Tensor:
    w8a8_blockscale_func = dispatch_w8a8_blockscale_func(
        cutlass_block_fp8_supported, use_aiter_and_is_supported)
    if cutlass_block_fp8_supported:
        num_pad = 0
        if current_platform.is_device_capability(90):
            # pad first dimension to be divisible by 4 due to
            # cutlass blockwise gemm limitation for hopper
            num_pad = 4 - (input_2d.shape[0] % 4)

    quant()
    w8a8_blockscale_func() # is cutlass here
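
As an aside (plain arithmetic only, not vLLM code), the two pad expressions quoted in this thread only disagree when the row count is already a multiple of 4:

    # Pad amount from the excerpt above vs. the F.pad call quoted earlier.
    def num_pad_excerpt(n: int) -> int:
        return 4 - (n % 4)   # pads by 4 when n is already divisible by 4

    def num_pad_fpad(n: int) -> int:
        return -n % 4        # pads by 0 when n is already divisible by 4

    for n in (1, 2, 3, 4, 8):
        print(n, num_pad_excerpt(n), num_pad_fpad(n))
    # 2 -> 2, 2 (the failing case); 4 -> 4, 0; 8 -> 4, 0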

mergify bot commented Sep 24, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @tlrmchlsmth.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label Sep 24, 2025
Signed-off-by: Tyler Michael Smith <[email protected]>
mergify bot removed the needs-rebase label Sep 24, 2025
@ProExpertProg (Collaborator) commented Sep 25, 2025

EDIT: unrelated issue below

I extracted the issue into #25623. I got an IMA without using the fp8 block quant path and confirmed it still happens even with this revert:

vllm serve deepseek-ai/DeepSeek-V2-Lite --disable-uvicorn-access-log --trust-remote-code --enable-dbo --dbo-decode-token-threshold 32 --tensor-parallel 2

tlrmchlsmth merged commit 1260180 into vllm-project:main Sep 25, 2025 (49 checks passed)
Zhuul pushed a commit to Zhuul/vllm that referenced this pull request Sep 26, 2025
yewentao256 pushed a commit that referenced this pull request Oct 3, 2025
Labels: performance, ready