[Bug]: Number of available GPU blocks drop significantly for Phi3-vision #6124

@CatherineSue

Description

Your current environment

Two Docker containers, based on images built from vLLM source at commits 3de6e6a and 3f3b6b2.

🐛 Describe the bug

I passed the same model, Phi-3-vision-128k-instruct, to each Docker container:

--tensor-parallel-size=1 \
--model=/models/Phi-3-vision-128k-instruct \

For the version that needs VLMConfig, here are the parameters:

--image-input-type="pixel_values" \
--image-feature-size=1921 \
--image-token-id=32044 \
--image-input-shape="1, 3, 1008, 1344" 
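
Putting the fragments together, a minimal sketch of the full launch command; the OpenAI-compatible api_server entrypoint is an assumption, since the report only shows the flags:

python -m vllm.entrypoints.openai.api_server \
    --tensor-parallel-size=1 \
    --model=/models/Phi-3-vision-128k-instruct \
    --image-input-type="pixel_values" \
    --image-feature-size=1921 \
    --image-token-id=32044 \
    --image-input-shape="1, 3, 1008, 1344"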

With the container based on the more recent commit 3de6e6a, it raises an error:

INFO 07-04 01:04:14 gpu_executor.py:84] # GPU blocks: 5970, # CPU blocks: 682
[rank0]: ValueError: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (95520). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
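
(Assuming vLLM's default KV-cache block size of 16 tokens, 5970 blocks × 16 = 95,520 tokens, which matches the KV-cache capacity reported in the error and falls well short of the 131,072-token max seq len.)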

But the container based on 3f3b6b2 initializes successfully:

INFO 07-04 01:40:03 gpu_executor.py:83] # GPU blocks: 8825, # CPU blocks: 682
INFO 07-04 01:40:05 model_runner.py:906] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
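
By the same arithmetic, 8825 blocks × 16 = 141,200 tokens, which covers the 131,072-token context, so the older build passes the check. On the newer build, the error message suggests a stopgap; a minimal sketch with illustrative values (these are standard vLLM engine flags, the numbers are placeholders and do not address the underlying drop in available GPU blocks):

    --max-model-len=95000 \
    --gpu-memory-utilization=0.95 \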
