Closed
Labels: bug
Description
Your current environment
The output of `python collect_env.py`
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35
Python version: 3.12.7 (main, Oct 1 2024, 08:52:12) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.6.56+-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A100-SXM4-80GB
Nvidia driver version: 535.216.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) CPU @ 2.20GHz
CPU family: 6
Model: 85
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
Stepping: 7
BogoMIPS: 4400.37
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat avx512_vnni md_clear arch_capabilities
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 192 KiB (6 instances)
L1i cache: 192 KiB (6 instances)
L2 cache: 6 MiB (6 instances)
L3 cache: 38.5 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-11
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI SW loop, KVM SW loop
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Versions of relevant libraries:
[pip3] flashinfer==0.1.6+cu121torch2.4
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.77
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.45.2
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: N/A (dev)
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X 0-11 0 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NVIDIA_VISIBLE_DEVICES=all
NVIDIA_REQUIRE_CUDA=cuda>=12.4 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=525,driver<526 brand=unknown,driver>=525,driver<526 brand=nvidia,driver>=525,driver<526 brand=nvidiartx,driver>=525,driver<526 brand=geforce,driver>=525,driver<526 brand=geforcertx,driver>=525,driver<526 brand=quadro,driver>=525,driver<526 brand=quadrortx,driver>=525,driver<526 brand=titan,driver>=525,driver<526 brand=titanrtx,driver>=525,driver<526 brand=tesla,driver>=535,driver<536 brand=unknown,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=geforce,driver>=535,driver<536 brand=geforcertx,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=titan,driver>=535,driver<536 brand=titanrtx,driver>=535,driver<536
NVIDIA_DRIVER_CAPABILITIES=compute,utility
VLLM_USAGE_SOURCE=production-docker-image
CUDA_VERSION=12.4.1
VLLM_ENGINE_ITERATION_TIMEOUT_S=120
LD_LIBRARY_PATH=/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
CUDA_MODULE_LOADING=LAZY
Model Input Dumps
The engine also failed to dump the failing input to a pickle file:
INFO 12-19 07:19:24 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241219-071924.pkl...
WARNING 12-19 07:19:24 model_runner_base.py:143] Failed to pickle inputs of failed execution

🐛 Describe the bug
Using the vllm/vllm-openai:v0.6.3 Docker image with the openai.run_batch entrypoint, I encounter CUDA illegal memory access errors that appear to come from the flash attention kernels, and only with specific values of --max-num-seqs.
Some powers of 2 for --max-num-seqs, such as 256, trigger the failure, while 255 does not. I monitored the number of sequences running in parallel, and it does reach the configured maximum, since the inputs are small.
The model used is an AWQ model: casperhansen/llama-3-70b-instruct-awq. A minimal offline sketch of the same configuration is shown below.
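To take the batch entrypoint out of the picture, here is a minimal offline sketch of the same configuration. It assumes the failure also reproduces through the LLM API (the report only confirms it via run_batch), and the prompts are placeholders:

# Hypothetical offline reproduction sketch; the original failure was observed
# through openai.run_batch, so reproducing via the LLM API is an assumption.
from vllm import LLM, SamplingParams

llm = LLM(
    model="casperhansen/llama-3-70b-instruct-awq",
    max_model_len=8192,
    max_num_batched_tokens=8192,
    max_num_seqs=256,  # 256 crashes; 255 reportedly does not
    enforce_eager=True,
    gpu_memory_utilization=0.9,
)

# Many short prompts so the scheduler actually hits the max-num-seqs ceiling.
prompts = [f"Say the number {i}." for i in range(2048)]
outputs = llm.generate(prompts, SamplingParams(max_tokens=64))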
Steps to reproduce
python3 -m vllm.entrypoints.openai.run_batch \
-i <input jsonl> -o <output jsonl> \
--max-model-len 8192 \
--max-num-batched-tokens 8192 \
--max-num-seqs 256 \
--tensor-parallel-size 1 \
--model casperhansen/llama-3-70b-instruct-awq \
--enforce-eager \
--gpu-memory-utilization 0.9
You can try it with the provided test input file: test_input.jsonl.zip
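If you would rather generate inputs than download the attachment, the following sketch writes similar short requests in the OpenAI-style batch format that run_batch consumes. The prompt text and request count are placeholders, not the contents of the attached file:

import json

# Sketch: write short chat requests in the OpenAI-style batch format read by
# vllm.entrypoints.openai.run_batch. Placeholder prompts, not the attachment.
with open("test_inputs.jsonl", "w") as f:
    for i in range(2048):
        f.write(json.dumps({
            "custom_id": f"request-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "casperhansen/llama-3-70b-instruct-awq",
                "messages": [{"role": "user", "content": f"What is {i} + {i}?"}],
                "max_tokens": 64,
            },
        }) + "\n")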
Traceback
WARNING 12-19 07:19:24 model_runner_base.py:143] Failed to pickle inputs of failed execution: CUDA error: an illegal memory access was encountered
WARNING 12-19 07:19:24 model_runner_base.py:143] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
WARNING 12-19 07:19:24 model_runner_base.py:143]
ERROR 12-19 07:19:24 async_llm_engine.py:66] Engine background task failed
ERROR 12-19 07:19:24 async_llm_engine.py:66] Traceback (most recent call last):
ERROR 12-19 07:19:24 async_llm_engine.py:66] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
ERROR 12-19 07:19:24 async_llm_engine.py:66] return func(*args, **kwargs)
ERROR 12-19 07:19:24 async_llm_engine.py:66] ^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 07:19:24 async_llm_engine.py:66] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1665, in execute_model
ERROR 12-19 07:19:24 async_llm_engine.py:66] hidden_or_intermediate_states = model_executable(
ERROR 12-19 07:19:24 async_llm_engine.py:66] ^^^^^^^^^^^^^^^^^
ERROR 12-19 07:19:24 async_llm_engine.py:66] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 12-19 07:19:24 async_llm_engine.py:66] return self._call_impl(*args, **kwargs)
ERROR 12-19 07:19:24 async_llm_engine.py:66] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 07:19:24 async_llm_engine.py:66] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 12-19 07:19:24 async_llm_engine.py:66] return forward_call(*args, **kwargs)
ERROR 12-19 07:19:24 async_llm_engine.py:66] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 07:19:24 async_llm_engine.py:66] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 556, in forward
ERROR 12-19 07:19:24 async_llm_engine.py:66] model_output = self.model(input_ids, positions, kv_caches,
ERROR 12-19 07:19:24 async_llm_engine.py:66] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 07:19:24 async_llm_engine.py:66] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 12-19 07:19:24 async_llm_engine.py:66] return self._call_impl(*args, **kwargs)
ERROR 12-19 07:19:24 async_llm_engine.py:66] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 07:19:24 async_llm_engine.py:66] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 12-19 07:19:24 async_llm_engine.py:66] return forward_call(*args, **kwargs)
ERROR 12-19 07:19:24 async_llm_engine.py:66] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 07:19:24 async_llm_engine.py:66] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 345, in forward
ERROR 12-19 07:19:24 async_llm_engine.py:66] hidden_states, residual = layer(positions, hidden_states,
ERROR 12-19 07:19:24 async_llm_engine.py:66] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 07:19:24 async_llm_engine.py:66] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 12-19 07:19:24 async_llm_engine.py:66] return self._call_impl(*args, **kwargs)
ERROR 12-19 07:19:24 async_llm_engine.py:66] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 07:19:24 async_llm_engine.py:66] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 12-19 07:19:24 async_llm_engine.py:66] return forward_call(*args, **kwargs)
ERROR 12-19 07:19:24 async_llm_engine.py:66] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 07:19:24 async_llm_engine.py:66] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 257, in forward
ERROR 12-19 07:19:24 async_llm_engine.py:66] hidden_states = self.self_attn(positions=positions,
ERROR 12-19 07:19:24 async_llm_engine.py:66] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 07:19:24 async_llm_engine.py:66] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 12-19 07:19:24 async_llm_engine.py:66] return self._call_impl(*args, **kwargs)
ERROR 12-19 07:19:24 async_llm_engine.py:66] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 07:19:24 async_llm_engine.py:66] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 12-19 07:19:24 async_llm_engine.py:66] return forward_call(*args, **kwargs)
ERROR 12-19 07:19:24 async_llm_engine.py:66] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 07:19:24 async_llm_engine.py:66] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 187, in forward
ERROR 12-19 07:19:24 async_llm_engine.py:66] attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
ERROR 12-19 07:19:24 async_llm_engine.py:66] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 07:19:24 async_llm_engine.py:66] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 12-19 07:19:24 async_llm_engine.py:66] return self._call_impl(*args, **kwargs)
ERROR 12-19 07:19:24 async_llm_engine.py:66] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 07:19:24 async_llm_engine.py:66] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 12-19 07:19:24 async_llm_engine.py:66] return forward_call(*args, **kwargs)
ERROR 12-19 07:19:24 async_llm_engine.py:66] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 07:19:24 async_llm_engine.py:66] File "/usr/local/lib/python3.12/dist-packages/vllm/attention/layer.py", line 100, in forward
ERROR 12-19 07:19:24 async_llm_engine.py:66] return self.impl.forward(query,
ERROR 12-19 07:19:24 async_llm_engine.py:66] ^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 07:19:24 async_llm_engine.py:66] File "/usr/local/lib/python3.12/dist-packages/vllm/attention/backends/flash_attn.py", line 586, in forward
ERROR 12-19 07:19:24 async_llm_engine.py:66] output = torch.ops.vllm.unified_flash_attention(
ERROR 12-19 07:19:24 async_llm_engine.py:66] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 07:19:24 async_llm_engine.py:66] File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1061, in __call__
ERROR 12-19 07:19:24 async_llm_engine.py:66] return self_._op(*args, **(kwargs or {}))
ERROR 12-19 07:19:24 async_llm_engine.py:66] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 07:19:24 async_llm_engine.py:66] File "/usr/local/lib/python3.12/dist-packages/torch/_library/custom_ops.py", line 494, in adinplaceorview_impl
ERROR 12-19 07:19:24 async_llm_engine.py:66] return self._opoverload.redispatch(
ERROR 12-19 07:19:24 async_llm_engine.py:66] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 07:19:24 async_llm_engine.py:66] File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 672, in redispatch
ERROR 12-19 07:19:24 async_llm_engine.py:66] return self_._handle.redispatch_boxed(keyset, *args, **kwargs)
ERROR 12-19 07:19:24 async_llm_engine.py:66] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 07:19:24 async_llm_engine.py:66] File "/usr/local/lib/python3.12/dist-packages/torch/_library/custom_ops.py", line 236, in backend_impl
ERROR 12-19 07:19:24 async_llm_engine.py:66] result = self._backend_fns[device_type](*args, **kwargs)
ERROR 12-19 07:19:24 async_llm_engine.py:66] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 07:19:24 async_llm_engine.py:66] File "/usr/local/lib/python3.12/dist-packages/vllm/attention/backends/flash_attn.py", line 736, in unified_flash_attention
ERROR 12-19 07:19:24 async_llm_engine.py:66] decode_output = flash_attn_with_kvcache(
ERROR 12-19 07:19:24 async_llm_engine.py:66] ^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 07:19:24 async_llm_engine.py:66] File "/usr/local/lib/python3.12/dist-packages/vllm/vllm_flash_attn/flash_attn_interface.py", line 1296, in flash_attn_with_kvcache
ERROR 12-19 07:19:24 async_llm_engine.py:66] out, softmax_lse = torch.ops.vllm_flash_attn_c.fwd_kvcache(
ERROR 12-19 07:19:24 async_llm_engine.py:66] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 07:19:24 async_llm_engine.py:66] File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1061, in __call__
ERROR 12-19 07:19:24 async_llm_engine.py:66] return self_._op(*args, **(kwargs or {}))
ERROR 12-19 07:19:24 async_llm_engine.py:66] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 07:19:24 async_llm_engine.py:66] RuntimeError: CUDA error: an illegal memory access was encountered
ERROR 12-19 07:19:24 async_llm_engine.py:66] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ERROR 12-19 07:19:24 async_llm_engine.py:66]
ERROR 12-19 07:19:24 async_llm_engine.py:66]
ERROR 12-19 07:19:24 async_llm_engine.py:66] The above exception was the direct cause of the following exception:
ERROR 12-19 07:19:24 async_llm_engine.py:66]
ERROR 12-19 07:19:24 async_llm_engine.py:66] Traceback (most recent call last):
ERROR 12-19 07:19:24 async_llm_engine.py:66] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 56, in _log_task_completion
ERROR 12-19 07:19:24 async_llm_engine.py:66] return_value = task.result()
ERROR 12-19 07:19:24 async_llm_engine.py:66] ^^^^^^^^^^^^^
ERROR 12-19 07:19:24 async_llm_engine.py:66] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 853, in run_engine_loop
ERROR 12-19 07:19:24 async_llm_engine.py:66] result = task.result()
ERROR 12-19 07:19:24 async_llm_engine.py:66] ^^^^^^^^^^^^^
ERROR 12-19 07:19:24 async_llm_engine.py:66] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 776, in engine_step
ERROR 12-19 07:19:24 async_llm_engine.py:66] request_outputs = await self.engine.step_async(virtual_engine)
ERROR 12-19 07:19:24 async_llm_engine.py:66] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 07:19:24 async_llm_engine.py:66] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 348, in step_async
ERROR 12-19 07:19:24 async_llm_engine.py:66] outputs = await self.model_executor.execute_model_async(
ERROR 12-19 07:19:24 async_llm_engine.py:66] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 07:19:24 async_llm_engine.py:66] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/gpu_executor.py", line 189, in execute_model_async
ERROR 12-19 07:19:24 async_llm_engine.py:66] output = await make_async(self.driver_worker.execute_model
ERROR 12-19 07:19:24 async_llm_engine.py:66] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 07:19:24 async_llm_engine.py:66] File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
ERROR 12-19 07:19:24 async_llm_engine.py:66] result = self.fn(*self.args, **self.kwargs)
ERROR 12-19 07:19:24 async_llm_engine.py:66] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 07:19:24 async_llm_engine.py:66] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 327, in execute_model
ERROR 12-19 07:19:24 async_llm_engine.py:66] output = self.model_runner.execute_model(
ERROR 12-19 07:19:24 async_llm_engine.py:66] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 07:19:24 async_llm_engine.py:66] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 12-19 07:19:24 async_llm_engine.py:66] return func(*args, **kwargs)
ERROR 12-19 07:19:24 async_llm_engine.py:66] ^^^^^^^^^^^^^^^^^^^^^
ERROR 12-19 07:19:24 async_llm_engine.py:66] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner_base.py", line 146, in _wrapper
ERROR 12-19 07:19:24 async_llm_engine.py:66] raise type(err)(f"Error in model execution: "
ERROR 12-19 07:19:24 async_llm_engine.py:66] RuntimeError: Error in model execution: CUDA error: an illegal memory access was encountered
ERROR 12-19 07:19:24 async_llm_engine.py:66] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ERROR 12-19 07:19:24 async_llm_engine.py:66]
Exception in callback _log_task_completion(error_callback=<bound method...7e89fdedffb0>>)(<Task finishe...sertions.\n')>) at /usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py:46
handle: <Handle _log_task_completion(error_callback=<bound method...7e89fdedffb0>>)(<Task finishe...sertions.\n')>) at /usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py:46>
Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1665, in execute_model
hidden_or_intermediate_states = model_executable(
^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 556, in forward
model_output = self.model(input_ids, positions, kv_caches,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 345, in forward
hidden_states, residual = layer(positions, hidden_states,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 257, in forward
hidden_states = self.self_attn(positions=positions,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 187, in forward
attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/attention/layer.py", line 100, in forward
return self.impl.forward(query,
^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/attention/backends/flash_attn.py", line 586, in forward
output = torch.ops.vllm.unified_flash_attention(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1061, in __call__
return self_._op(*args, **(kwargs or {}))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/_library/custom_ops.py", line 494, in adinplaceorview_impl
return self._opoverload.redispatch(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 672, in redispatch
return self_._handle.redispatch_boxed(keyset, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/_library/custom_ops.py", line 236, in backend_impl
result = self._backend_fns[device_type](*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/attention/backends/flash_attn.py", line 736, in unified_flash_attention
decode_output = flash_attn_with_kvcache(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/vllm_flash_attn/flash_attn_interface.py", line 1296, in flash_attn_with_kvcache
out, softmax_lse = torch.ops.vllm_flash_attn_c.fwd_kvcache(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1061, in __call__
return self_._op(*args, **(kwargs or {}))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 56, in _log_task_completion
return_value = task.result()
^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 853, in run_engine_loop
result = task.result()
^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 776, in engine_step
request_outputs = await self.engine.step_async(virtual_engine)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 348, in step_async
outputs = await self.model_executor.execute_model_async(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/executor/gpu_executor.py", line 189, in execute_model_async
output = await make_async(self.driver_worker.execute_model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 327, in execute_model
output = self.model_runner.execute_model(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner_base.py", line 146, in _wrapper
raise type(err)(f"Error in model execution: "
RuntimeError: Error in model execution: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/lib/python3.12/asyncio/events.py", line 88, in _run
self._context.run(self._callback, *self._args)
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 68, in _log_task_completion
raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
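One way to localize the faulting kernel more precisely is to force synchronous kernel launches. A sketch, reusing the offline reproduction above (CUDA_LAUNCH_BLOCKING is a standard CUDA debugging variable; it must be set before CUDA initializes, hence before any torch/vllm import):

import os

# Sketch: CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so the
# illegal access is raised at the offending launch rather than at a later
# synchronization point. Expect a significant slowdown while it is set.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

from vllm import LLM, SamplingParams  # imported after the env var is set

llm = LLM(
    model="casperhansen/llama-3-70b-instruct-awq",
    max_num_seqs=256,  # the failing value from the report
    enforce_eager=True,
)
outputs = llm.generate(["test prompt"] * 512, SamplingParams(max_tokens=32))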
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.