Open
Labels
bug (Something isn't working)
Description
Your current environment
H100
🐛 Describe the bug
I can launch the server with
vllm serve facebook/opt-125m --num_gpu_blocks_override=1
....
(EngineCore_DP0 pid=2717573) INFO 10-19 21:10:01 [kv_cache_utils.py:772] Overriding num_gpu_blocks=125643 with num_gpu_blocks_override=1
(EngineCore_DP0 pid=2717573) INFO 10-19 21:10:01 [kv_cache_utils.py:1201] GPU KV cache size: 16 tokens
(EngineCore_DP0 pid=2717573) INFO 10-19 21:10:01 [kv_cache_utils.py:1206] Maximum concurrency for 2,048 tokens per request: 0.01x
...
However, because there is only one block (16 tokens), no request longer than 16 tokens can ever be scheduled.
The expected behavior is to raise an error during initialization, like the existing check in
vllm/vllm/v1/core/kv_cache_utils.py
Lines 667 to 687 in f32bf75
    if needed_memory > available_memory:
        # Estimate the maximum model length that can fit in the available memory
        estimated_max_len = estimate_max_model_len(
            vllm_config, kv_cache_spec, available_memory
        )
        estimated_msg = ""
        if estimated_max_len > 0:
            estimated_msg = (
                "Based on the available memory, "
                f"the estimated maximum model length is {estimated_max_len}."
            )
        raise ValueError(
            f"To serve at least one request with the models's max seq len "
            f"({max_model_len}), ({needed_memory / GiB_bytes:.2f} GiB KV "
            f"cache is needed, which is larger than the available KV cache "
            f"memory ({available_memory / GiB_bytes:.2f} GiB). "
            f"{estimated_msg} "
            f"Try increasing `gpu_memory_utilization` or decreasing "
            f"`max_model_len` when initializing the engine."
        )
Hopefully this can be fixed while iterating on #26939, but I am creating a separate issue to track it.
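To make the missing check concrete, here is a minimal sketch of the kind of validation that could run after the override is applied. It is not vLLM's actual code: the function name `validate_kv_cache_capacity` and its parameters are hypothetical, the block size of 16 is inferred from the "GPU KV cache size: 16 tokens" log line with one block, and it assumes a single KV cache group where capacity is simply blocks times block size.

```python
def validate_kv_cache_capacity(
    num_gpu_blocks: int, block_size: int, max_model_len: int
) -> int:
    """Hypothetical check: return the KV cache capacity in tokens, or raise
    if a single request of max_model_len tokens cannot fit."""
    capacity_tokens = num_gpu_blocks * block_size
    if capacity_tokens < max_model_len:
        # Same ratio that the log reports as "Maximum concurrency".
        concurrency = capacity_tokens / max_model_len
        raise ValueError(
            f"KV cache holds only {capacity_tokens} tokens "
            f"({num_gpu_blocks} block(s) x {block_size} tokens), which is "
            f"less than max_model_len ({max_model_len}); maximum "
            f"concurrency would be {concurrency:.2f}x. Increase "
            f"num_gpu_blocks_override or decrease max_model_len."
        )
    return capacity_tokens


# Values taken from the log above: 1 block, 16 tokens, 2,048-token requests.
try:
    validate_kv_cache_capacity(num_gpu_blocks=1, block_size=16, max_model_len=2048)
except ValueError as exc:
    print(f"startup would fail: {exc}")
```

With the values from the log, the capacity is 1 x 16 = 16 tokens and 16 / 2048 rounds to the reported 0.01x, so a check of this shape would reject the configuration at startup instead of letting the server come up unable to schedule any request longer than 16 tokens.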