Your current environment
Two Docker containers based on images built from vLLM source at commits 3de6e6a and 3f3b6b2.
🐛 Describe the bug
I passed the same model, Phi-3-vision-128k-instruct, to each Docker container with the following arguments:
--tensor-parallel-size=1 \
--model=/models/Phi-3-vision-128k-instruct \
For the version that still requires VLMConfig, these are the additional parameters:
--image-input-type="pixel_values" \
--image-feature-size=1921 \
--image-token-id=32044 \
--image-input-shape="1, 3, 1008, 1344"
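For reference, a minimal sketch of the full launch command, assuming the OpenAI-compatible api_server entrypoint and that the model directory is mounted into the container at /models (the entrypoint itself is not shown in the report, and Phi-3-vision typically also needs --trust-remote-code):

python -m vllm.entrypoints.openai.api_server \
    --tensor-parallel-size=1 \
    --model=/models/Phi-3-vision-128k-instruct \
    --image-input-type="pixel_values" \
    --image-feature-size=1921 \
    --image-token-id=32044 \
    --image-input-shape="1, 3, 1008, 1344"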
With the container based on 3de6e6a (the more recent commit), it raises an error:
INFO 07-04 01:04:14 gpu_executor.py:84] # GPU blocks: 5970, # CPU blocks: 682
[rank0]: ValueError: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (95520). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
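The 95520 figure in the error is the KV cache capacity in tokens: 5970 GPU blocks × 16 tokens per block (vLLM's default block size) = 95520, which is smaller than the model's 131072-token max seq len. Following the error message's suggestions, either flag below lets the engine start; the specific values are illustrative and not taken from the report:

    --max-model-len=95000 \
    --gpu-memory-utilization=0.95 \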
But the container based on 3f3b6b2 starts up successfully (8825 blocks × 16 tokens/block = 141200 tokens, which does cover the 131072-token max seq len):
INFO 07-04 01:40:03 gpu_executor.py:83] # GPU blocks: 8825, # CPU blocks: 682
INFO 07-04 01:40:05 model_runner.py:906] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.