
[Bug]: Torch SDPA path broken on AArch64 due to default chunked_prefill in vLLM Engine V1 #20622

@nikhil-arm

Description

Your current environment

The output of python collect_env.py
Collecting environment information...
==============================
        System Info
==============================
OS                           : Ubuntu 24.04.1 LTS (aarch64)
GCC version                  : (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version                : 18.1.3 (1ubuntu1)
CMake version                : version 3.31.4
Libc version                 : glibc-2.39

==============================
       PyTorch Info
==============================
PyTorch version              : 2.7.0+cpu
Is debug build               : False
CUDA used to build PyTorch   : None
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.3 (main, Feb  4 2025, 14:48:35) [GCC 13.3.0] (64-bit runtime)
Python platform              : Linux-6.8.0-1029-aws-aarch64-with-glibc2.39

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : False
CUDA runtime version         : No CUDA
CUDA_MODULE_LOADING set to   : N/A
GPU models and configuration : No CUDA
Nvidia driver version        : No CUDA
cuDNN version                : No CUDA
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                         aarch64
CPU op-mode(s):                       64-bit
Byte Order:                           Little Endian
CPU(s):                               192
On-line CPU(s) list:                  0-191
Vendor ID:                            ARM
Model name:                           Neoverse-V2
Model:                                1
Thread(s) per core:                   1
Core(s) per socket:                   96
Socket(s):                            2
Stepping:                             r0p1
BogoMIPS:                             2000.00
Flags:                                fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                            12 MiB (192 instances)
L1i cache:                            12 MiB (192 instances)
L2 cache:                             384 MiB (192 instances)
L3 cache:                             72 MiB (2 instances)
NUMA node(s):                         2
NUMA node0 CPU(s):                    0-95
NUMA node1 CPU(s):                    96-191
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; __user pointer sanitization
Vulnerability Spectre v2:             Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

==============================
Versions of relevant libraries
==============================
[pip3] numpy==2.2.5
[pip3] pynvml==11.5.3
[pip3] pyzmq==26.4.0
[pip3] torch==2.7.0+cpu
[pip3] torchaudio==2.7.0
[pip3] torchdata==0.7.1
[pip3] torchtune==0.5.0.dev20241218+cpu
[pip3] torchvision==0.22.0
[pip3] transformers==4.51.3
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
Neuron SDK Version           : N/A
vLLM Version                 : 0.9.2rc2.dev51+gbb043af7d (git sha: bb043af7d)
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
  Could not collect

==============================
     Environment Variables
==============================
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1

🐛 Describe the bug

Description:

After rebasing PR #17112 onto main, we encountered a new issue stemming from recent changes to vLLM's default engine behavior.

vLLM Engine V1 is now used by default for the LLM.generate() API, and this engine enables chunked_prefill by default.

As a result of this change, the Torch SDPA path is currently broken on the AArch64 backend.

Issue Details:
• The condition in the following line evaluates to False on AArch64:
🔗 torch_sdpa.py#L391
• As a result, the fallback path attempts to use Intel IPEX without checking the _use_ipex flag.
• There is currently no valid chunked_prefill attention path enabled for AArch64.
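The failure mode described above can be sketched as a minimal predicate. Note that `select_prefill_path` and its `is_x86`/`use_ipex` flags are hypothetical simplifications of the dispatch logic in torch_sdpa.py, not vLLM's actual identifiers:

```python
# Hypothetical simplification of the chunked-prefill dispatch in torch_sdpa.py.
# The names (select_prefill_path, is_x86, use_ipex) are illustrative only.

def select_prefill_path(is_x86: bool, use_ipex: bool) -> str:
    """Pick an attention path for chunked prefill."""
    if is_x86 and use_ipex:
        return "ipex"  # Intel IPEX kernels (x86-only)
    if is_x86:
        return "sdpa"  # native torch SDPA path
    # The reported bug: on non-x86 the guard evaluates to False and the
    # fallback tries IPEX without consulting use_ipex, so AArch64 ends up
    # with no valid chunked-prefill attention backend.
    raise RuntimeError("no chunked-prefill attention path for this arch")

# On AArch64 (is_x86=False, IPEX unavailable) selection fails:
try:
    select_prefill_path(is_x86=False, use_ipex=False)
except RuntimeError as e:
    print(e)  # no chunked-prefill attention path for this arch
```

A fix would need either to gate the IPEX fallback on `_use_ipex` or to provide a native SDPA chunked-prefill path for AArch64.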

Impact:
• Any LLM.generate() invocation under the V1 engine on AArch64 fails due to the lack of a valid attention backend.

How to Reproduce:
1. Apply PR #17112.
2. Convert the model using the script shared in the comment here:
#17112 (comment)
3. Run the model using the LLM.generate() API on an AArch64 machine:
• Ensure the model is run with the float32 (torch.float32) dtype.
• Confirm that vLLM Engine V1 is active (default behavior as of recent commits).
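The repro preconditions above can be expressed as a small check. The `bug_reproduces` helper is hypothetical, written only to make the three conditions explicit:

```python
import platform

def bug_reproduces(arch: str, v1_engine: bool, dtype: str) -> bool:
    """Hypothetical predicate: the failure surfaces only when all three
    conditions from the repro steps hold simultaneously."""
    return arch == "aarch64" and v1_engine and dtype == "float32"

# V1 is the default engine in recent vLLM builds, so on the affected
# machine this evaluates to True:
print(bug_reproduces(platform.machine(), v1_engine=True, dtype="float32"))
```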


Labels: bug