[Bug]: Number of available GPU blocks drop significantly for Phi3-vision #6124

@CatherineSue

Description

Your current environment

Two Docker containers, based on images built from vLLM source at commits 3de6e6a and 3f3b6b2.

🐛 Describe the bug

I passed the same model, Phi-3-vision-128k-instruct, to each Docker container:

--tensor-parallel-size=1 \
--model=/models/Phi-3-vision-128k-instruct \

For the version that needs VLMConfig, here are the parameters:

--image-input-type="pixel_values" \
--image-feature-size=1921 \
--image-token-id=32044 \
--image-input-shape="1, 3, 1008, 1344" 
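
Putting the fragments together, a minimal sketch of the full launch command; the OpenAI-compatible api_server entrypoint is an assumption, since the report only shows the flags:

python -m vllm.entrypoints.openai.api_server \
    --tensor-parallel-size=1 \
    --model=/models/Phi-3-vision-128k-instruct \
    --image-input-type="pixel_values" \
    --image-feature-size=1921 \
    --image-token-id=32044 \
    --image-input-shape="1, 3, 1008, 1344"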

With the container based on the more recent commit 3de6e6a, it raises an error:

INFO 07-04 01:04:14 gpu_executor.py:84] # GPU blocks: 5970, # CPU blocks: 682
[rank0]: ValueError: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (95520). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
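
(Assuming vLLM's default KV-cache block size of 16 tokens, 5970 blocks × 16 = 95,520 tokens, which matches the KV-cache capacity reported in the error and falls well short of the 131,072-token max seq len.)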

But the container based on 3f3b6b2 initializes successfully:

INFO 07-04 01:40:03 gpu_executor.py:83] # GPU blocks: 8825, # CPU blocks: 682
INFO 07-04 01:40:05 model_runner.py:906] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
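
By the same arithmetic, 8825 blocks × 16 = 141,200 tokens, which covers the 131,072-token context, so the older build passes the check. On the newer build, the error message suggests a stopgap; a minimal sketch with illustrative values (these are standard vLLM engine flags, the numbers are placeholders and do not address the underlying drop in available GPU blocks):

    --max-model-len=95000 \
    --gpu-memory-utilization=0.95 \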
