
[Bug]: Florence2 example fails with UnboundLocalError: cannot access local variable 'key_cache' #21749

@osma

Description

Your current environment

The output of python collect_env.py
Collecting environment information...
==============================
        System Info
==============================
OS                           : AlmaLinux release 8.7 (Stone Smilodon) (x86_64)
GCC version                  : (GCC) 13.3.0
Clang version                : Could not collect
CMake version                : version 3.20.2
Libc version                 : glibc-2.28

==============================
       PyTorch Info
==============================
PyTorch version              : 2.7.1+cu126
Is debug build               : False
CUDA used to build PyTorch   : 12.6
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.3 (main, Jun  7 2024, 16:49:59) [GCC 13.3.0] (64-bit runtime)
Python platform              : Linux-4.18.0-372.9.1.el8.x86_64-x86_64-with-glibc2.28

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 12.3.52
CUDA_MODULE_LOADING set to   : LAZY
GPU models and configuration : GPU 0: NVIDIA A100-SXM4-80GB
Nvidia driver version        : 525.60.13
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Thread(s) per core:  2
Core(s) per socket:  32
Socket(s):           2
NUMA node(s):        8
Vendor ID:           AuthenticAMD
CPU family:          25
Model:               1
Model name:          AMD EPYC 7543 32-Core Processor
Stepping:            1
CPU MHz:             2794.789
BogoMIPS:            5589.57
Virtualization:      AMD-V
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            32768K
NUMA node0 CPU(s):   0-7,64-71
NUMA node1 CPU(s):   8-15,72-79
NUMA node2 CPU(s):   16-23,80-87
NUMA node3 CPU(s):   24-31,88-95
NUMA node4 CPU(s):   32-39,96-103
NUMA node5 CPU(s):   40-47,104-111
NUMA node6 CPU(s):   48-55,112-119
NUMA node7 CPU(s):   56-63,120-127
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca

==============================
Versions of relevant libraries
==============================
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.6.4.1
[pip3] nvidia-cuda-cupti-cu12==12.6.80
[pip3] nvidia-cuda-nvrtc-cu12==12.6.77
[pip3] nvidia-cuda-runtime-cu12==12.6.77
[pip3] nvidia-cudnn-cu12==9.5.1.17
[pip3] nvidia-cufft-cu12==11.3.0.4
[pip3] nvidia-cufile-cu12==1.11.1.6
[pip3] nvidia-curand-cu12==10.3.7.77
[pip3] nvidia-cusolver-cu12==11.7.1.2
[pip3] nvidia-cusparse-cu12==12.5.4.2
[pip3] nvidia-cusparselt-cu12==0.6.3
[pip3] nvidia-nccl-cu12==2.26.2
[pip3] nvidia-nvjitlink-cu12==12.6.85
[pip3] nvidia-nvtx-cu12==12.6.77
[pip3] pyzmq==27.0.0
[pip3] torch==2.7.1
[pip3] torchaudio==2.7.1
[pip3] torchvision==0.22.1
[pip3] transformers==4.54.0
[pip3] triton==3.3.1
[conda] _anaconda_depends         2023.09             py311_mkl_1  
[conda] blas                      1.0                         mkl  
[conda] mkl                       2023.1.0         h213fc3f_46343  
[conda] mkl-service               2.4.0           py311h5eee18b_1  
[conda] mkl_fft                   1.3.8           py311h5eee18b_0  
[conda] mkl_random                1.2.4           py311hdb19cb5_0  
[conda] numpy                     1.24.3          py311h08b1b3b_1  
[conda] numpy-base                1.24.3          py311hf175353_1  
[conda] numpydoc                  1.5.0           py311h06a4308_0  
[conda] pyzmq                     23.2.0          py311h6a678d5_0  
[conda] transformers              4.32.1          py311h06a4308_0  

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
Neuron SDK Version           : N/A
vLLM Version                 : 0.10.0
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
  	GPU0	mlx5_0	CPU Affinity	NUMA Affinity
GPU0	 X 	SYS	4-5,68-69	0-7
mlx5_0	SYS	 X 		

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

==============================
     Environment Variables
==============================
LD_LIBRARY_PATH=/wrk-vakka/appl/easybuild/opt/cuDNN/8.9.7.29-CUDA-12.3.0/lib:/wrk-vakka/appl/easybuild/opt/CUDA/12.3.0/nvvm/lib64:/wrk-vakka/appl/easybuild/opt/CUDA/12.3.0/extras/CUPTI/lib64:/wrk-vakka/appl/easybuild/opt/CUDA/12.3.0/lib:/appl/easybuild/opt/Python/3.12.3-GCCcore-13.3.0/lib:/wrk-vakka/appl/easybuild/opt/OpenSSL/3.0/lib:/appl/easybuild/opt/libffi/3.4.5-GCCcore-13.3.0/lib64:/appl/easybuild/opt/XZ/5.4.5-GCCcore-13.3.0/lib:/appl/easybuild/opt/SQLite/3.45.3-GCCcore-13.3.0/lib:/appl/easybuild/opt/Tcl/8.6.14-GCCcore-13.3.0/lib:/appl/easybuild/opt/libreadline/8.2-GCCcore-13.3.0/lib:/appl/easybuild/opt/ncurses/6.5-GCCcore-13.3.0/lib:/appl/easybuild/opt/bzip2/1.0.8-GCCcore-13.3.0/lib:/appl/easybuild/opt/binutils/2.42-GCCcore-13.3.0/lib:/appl/easybuild/opt/zlib/1.3.1-GCCcore-13.3.0/lib:/appl/easybuild/opt/GCCcore/13.3.0/lib64
CUDA_PATH=/wrk-vakka/appl/easybuild/opt/CUDA/12.3.0
CUDA_HOME=/wrk-vakka/appl/easybuild/opt/CUDA/12.3.0
CUDA_HOME=/wrk-vakka/appl/easybuild/opt/CUDA/12.3.0
CUDA_ROOT=/wrk-vakka/appl/easybuild/opt/CUDA/12.3.0
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY

🐛 Describe the bug

I installed vLLM v0.10.0 on an HPC cluster with access to an A100 GPU, because I want to experiment with Florence-2 and other multimodal LLMs. The installation was simply pip install vllm.

I tried running the Florence2 example, which is part of the encoder-decoder multimodal offline inference example script:

wget https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/examples/offline_inference/encoder_decoder_multimodal.py
python encoder_decoder_multimodal.py -m florence2

vLLM crashes during CUDA graph capture. The error is:

UnboundLocalError: cannot access local variable 'key_cache' where it is not associated with a value

Here is the full output:

INFO 07-28 14:14:45 [__init__.py:235] Automatically detected platform cuda.
INFO 07-28 14:14:56 [config.py:1604] Using max model len 4096
WARNING 07-28 14:14:56 [arg_utils.py:1690] ['Florence2ForConditionalGeneration', 'TransformersForMultimodalLM'] is not supported by the V1 Engine. Falling back to V0. 
INFO 07-28 14:14:56 [llm_engine.py:228] Initializing a V0 LLM engine (v0.10.0) with config: model='microsoft/Florence-2-large', speculative_config=None, tokenizer='Isotr0py/Florence-2-tokenizer', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=microsoft/Florence-2-large, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":[],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":8,"local_cache_dir":null}, use_cached_outputs=False, 
INFO 07-28 14:14:58 [cuda.py:398] Using Flash Attention backend.
INFO 07-28 14:14:58 [parallel_state.py:1102] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 07-28 14:14:58 [model_runner.py:1083] Starting to load model microsoft/Florence-2-large...
INFO 07-28 14:14:59 [weight_utils.py:296] Using model weights format ['*.bin']

Loading pt checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]

Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.36it/s]

Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.36it/s]

INFO 07-28 14:15:00 [default_loader.py:262] Loading weights took 0.43 seconds
INFO 07-28 14:15:00 [model_runner.py:1115] Model loading took 1.5467 GiB and 1.012831 seconds
INFO 07-28 14:15:03 [enc_dec_model_runner.py:314] Starting profile run for multi-modal models.
WARNING 07-28 14:15:03 [registry.py:325] Expected at least 640 dummy encoder tokens for profiling, but found 579 tokens instead.
INFO 07-28 14:15:05 [worker.py:295] Memory profiling takes 5.06 seconds
INFO 07-28 14:15:05 [worker.py:295] the current vLLM instance can use total_gpu_memory (79.18GiB) x gpu_memory_utilization (0.90) = 71.27GiB
INFO 07-28 14:15:05 [worker.py:295] model weights take 1.55GiB; non_torch_memory takes 0.10GiB; PyTorch activation peak memory takes 2.43GiB; the rest of the memory reserved for KV Cache is 67.20GiB.
INFO 07-28 14:15:05 [executor_base.py:113] # cuda blocks: 91744, # CPU blocks: 5461
INFO 07-28 14:15:05 [executor_base.py:118] Maximum concurrency for 4096 tokens per request: 358.38x
INFO 07-28 14:15:07 [model_runner.py:1385] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.

Capturing CUDA graph shapes:   0%|          | 0/4 [00:00<?, ?it/s]
Capturing CUDA graph shapes:   0%|          | 0/4 [00:00<?, ?it/s]
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/XXXX/encoder_decoder_multimodal.py", line 196, in <module>
[rank0]:     main(args)
[rank0]:   File "/home/XXXX/encoder_decoder_multimodal.py", line 163, in main
[rank0]:     llm = LLM(**engine_args)
[rank0]:           ^^^^^^^^^^^^^^^^^^
[rank0]:   File "/wrk-vakka/group/YYYY/vllm-venv/lib/python3.12/site-packages/vllm/entrypoints/llm.py", line 273, in __init__
[rank0]:     self.llm_engine = LLMEngine.from_engine_args(
[rank0]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/wrk-vakka/group/YYYY/vllm-venv/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 497, in from_engine_args
[rank0]:     return engine_cls.from_vllm_config(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/wrk-vakka/group/YYYY/vllm-venv/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 473, in from_vllm_config
[rank0]:     return cls(
[rank0]:            ^^^^
[rank0]:   File "/wrk-vakka/group/YYYY/vllm-venv/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 266, in __init__
[rank0]:     self._initialize_kv_caches()
[rank0]:   File "/wrk-vakka/group/YYYY/vllm-venv/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 422, in _initialize_kv_caches
[rank0]:     self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
[rank0]:   File "/wrk-vakka/group/YYYY/vllm-venv/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 124, in initialize_cache
[rank0]:     self.collective_rpc("initialize_cache",
[rank0]:   File "/wrk-vakka/group/YYYY/vllm-venv/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
[rank0]:     answer = run_method(self.driver_worker, method, args, kwargs)
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/wrk-vakka/group/YYYY/vllm-venv/lib/python3.12/site-packages/vllm/utils/__init__.py", line 2985, in run_method
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/wrk-vakka/group/YYYY/vllm-venv/lib/python3.12/site-packages/vllm/worker/worker.py", line 336, in initialize_cache
[rank0]:     self._warm_up_model()
[rank0]:   File "/wrk-vakka/group/YYYY/vllm-venv/lib/python3.12/site-packages/vllm/worker/worker.py", line 387, in _warm_up_model
[rank0]:     self.model_runner.capture_model(self.gpu_cache)
[rank0]:   File "/wrk-vakka/group/YYYY/vllm-venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/wrk-vakka/group/YYYY/vllm-venv/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1524, in capture_model
[rank0]:     graph_runner.capture(**capture_inputs)
[rank0]:   File "/wrk-vakka/group/YYYY/vllm-venv/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1926, in capture
[rank0]:     self.model(
[rank0]:   File "/wrk-vakka/group/YYYY/vllm-venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/wrk-vakka/group/YYYY/vllm-venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/wrk-vakka/group/YYYY/vllm-venv/lib/python3.12/site-packages/vllm/model_executor/models/florence2.py", line 1095, in forward
[rank0]:     hidden_states = self.language_model(input_ids,
[rank0]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/wrk-vakka/group/YYYY/vllm-venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/wrk-vakka/group/YYYY/vllm-venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/wrk-vakka/group/YYYY/vllm-venv/lib/python3.12/site-packages/vllm/model_executor/models/florence2.py", line 699, in forward
[rank0]:     return self.model(input_ids,
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/wrk-vakka/group/YYYY/vllm-venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/wrk-vakka/group/YYYY/vllm-venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/wrk-vakka/group/YYYY/vllm-venv/lib/python3.12/site-packages/vllm/model_executor/models/florence2.py", line 641, in forward
[rank0]:     encoder_hidden_states = self.encoder(input_ids=encoder_input_ids,
[rank0]:                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/wrk-vakka/group/YYYY/vllm-venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/wrk-vakka/group/YYYY/vllm-venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/wrk-vakka/group/YYYY/vllm-venv/lib/python3.12/site-packages/vllm/model_executor/models/bart.py", line 608, in forward
[rank0]:     hidden_states = encoder_layer(hidden_states=hidden_states)
[rank0]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/wrk-vakka/group/YYYY/vllm-venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/wrk-vakka/group/YYYY/vllm-venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/wrk-vakka/group/YYYY/vllm-venv/lib/python3.12/site-packages/vllm/model_executor/models/bart.py", line 408, in forward
[rank0]:     hidden_states = self.self_attn(hidden_states=hidden_states)
[rank0]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/wrk-vakka/group/YYYY/vllm-venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/wrk-vakka/group/YYYY/vllm-venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/wrk-vakka/group/YYYY/vllm-venv/lib/python3.12/site-packages/vllm/model_executor/models/bart.py", line 193, in forward
[rank0]:     attn_output = self.attn(q, k, v)
[rank0]:                   ^^^^^^^^^^^^^^^^^^
[rank0]:   File "/wrk-vakka/group/YYYY/vllm-venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/wrk-vakka/group/YYYY/vllm-venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/wrk-vakka/group/YYYY/vllm-venv/lib/python3.12/site-packages/vllm/attention/layer.py", line 275, in forward
[rank0]:     torch.ops.vllm.unified_attention_with_output(
[rank0]:   File "/wrk-vakka/group/YYYY/vllm-venv/lib/python3.12/site-packages/torch/_ops.py", line 1158, in __call__
[rank0]:     return self._op(*args, **(kwargs or {}))
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/wrk-vakka/group/YYYY/vllm-venv/lib/python3.12/site-packages/vllm/attention/layer.py", line 488, in unified_attention_with_output
[rank0]:     self.impl.forward(self,
[rank0]:   File "/wrk-vakka/group/YYYY/vllm-venv/lib/python3.12/site-packages/vllm/attention/backends/flash_attn.py", line 904, in forward
[rank0]:     descale_shape = (seq_lens_arg.shape[0], key_cache.shape[-2])
[rank0]:                                             ^^^^^^^^^
[rank0]: UnboundLocalError: cannot access local variable 'key_cache' where it is not associated with a value
[rank0]:[W728 14:15:07.625280450 ProcessGroupNCCL.cpp:1479] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())

The bug appears to be in vllm/attention/backends/flash_attn.py. Since this is an UnboundLocalError, it should be fairly easy to pin down. Looking at recent commits, one that seems suspicious is a597a57, part of PR #14570 (merged before v0.8.2): it adds the line (currently 904) where the crash happens, and it also initializes the variable key_cache, but only when fp8_attention is true, which apparently is not the case in the Florence2 example.
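
To illustrate my reading of the control flow, here is a minimal, self-contained sketch of the failure pattern. It is not the verbatim vLLM source; the fake_forward name and tensor shapes are made up for illustration.

import torch

def fake_forward(fp8_attention: bool):
    """Sketch of the suspected flash_attn.py control flow (v0.10.0)."""
    seq_lens_arg = torch.zeros(4, dtype=torch.int32)  # stand-in for the real seq_lens tensor
    if fp8_attention:
        # key_cache is only ever bound on this branch
        key_cache = torch.zeros(2, 16, 8, 64)  # hypothetical cache shape
    # Mirrors flash_attn.py line 904: key_cache is read unconditionally
    return (seq_lens_arg.shape[0], key_cache.shape[-2])

fake_forward(fp8_attention=True)   # fine
fake_forward(fp8_attention=False)  # UnboundLocalError, same message as above

If that reading is right, the fix is presumably either to bind key_cache on every path or to guard the descale_shape computation so it only runs on the fp8 path.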

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
