[Bug]: check_enough_kv_cache_memory didn't consider num_gpu_blocks_override #27181

@heheda12345

Your current environment

H100

🐛 Describe the bug

I can launch the server with

vllm serve facebook/opt-125m --num_gpu_blocks_override=1
....
(EngineCore_DP0 pid=2717573) INFO 10-19 21:10:01 [kv_cache_utils.py:772] Overriding num_gpu_blocks=125643 with num_gpu_blocks_override=1
(EngineCore_DP0 pid=2717573) INFO 10-19 21:10:01 [kv_cache_utils.py:1201] GPU KV cache size: 16 tokens
(EngineCore_DP0 pid=2717573) INFO 10-19 21:10:01 [kv_cache_utils.py:1206] Maximum concurrency for 2,048 tokens per request: 0.01x
...
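The numbers in these log lines follow directly from the override. As a quick sanity check of the arithmetic, the sketch below recomputes them; the block size of 16 is inferred from "GPU KV cache size: 16 tokens" with a single block and is not logged directly.

```python
# Values copied from the log above; block_size is inferred, not logged.
num_gpu_blocks = 1      # after --num_gpu_blocks_override=1
kv_cache_tokens = 16    # "GPU KV cache size: 16 tokens"
max_model_len = 2048    # "2,048 tokens per request"

# Tokens per block, assuming all blocks have the same size.
block_size = kv_cache_tokens // num_gpu_blocks

# Fraction of one max-length request that fits in the KV cache.
max_concurrency = kv_cache_tokens / max_model_len

print(f"block_size={block_size}")
print(f"Maximum concurrency: {max_concurrency:.2f}x")
```

A concurrency below 1.0x means that not even one request of `max_model_len` tokens fits in the KV cache.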

However, since there is only one block, no request longer than 16 tokens can be scheduled.
The expected behavior is to raise an error during initialization, like:

    if needed_memory > available_memory:
        # Estimate the maximum model length that can fit in the available memory
        estimated_max_len = estimate_max_model_len(
            vllm_config, kv_cache_spec, available_memory
        )
        estimated_msg = ""
        if estimated_max_len > 0:
            estimated_msg = (
                "Based on the available memory, "
                f"the estimated maximum model length is {estimated_max_len}."
            )
        raise ValueError(
            f"To serve at least one request with the models's max seq len "
            f"({max_model_len}), ({needed_memory / GiB_bytes:.2f} GiB KV "
            f"cache is needed, which is larger than the available KV cache "
            f"memory ({available_memory / GiB_bytes:.2f} GiB). "
            f"{estimated_msg} "
            f"Try increasing `gpu_memory_utilization` or decreasing "
            f"`max_model_len` when initializing the engine."
        )
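A block-count version of this check could look roughly like the sketch below. It is only an illustration, not vLLM code: the function names `blocks_needed` and `check_enough_blocks` are hypothetical, and it assumes a single full-attention KV cache group in which a request of `max_model_len` tokens needs `ceil(max_model_len / block_size)` blocks. Sliding-window or hybrid layouts would need a different estimate.

```python
def blocks_needed(max_model_len: int, block_size: int) -> int:
    """Blocks required to hold one request of max_model_len tokens.

    Hypothetical helper; assumes one full-attention KV cache group.
    """
    # Integer ceiling division.
    return (max_model_len + block_size - 1) // block_size


def check_enough_blocks(
    num_gpu_blocks: int, max_model_len: int, block_size: int
) -> None:
    """Raise ValueError if the (possibly overridden) block count is too small.

    Hypothetical check; not part of vLLM's API.
    """
    needed = blocks_needed(max_model_len, block_size)
    if num_gpu_blocks < needed:
        # Longest request that the available blocks can hold.
        max_len_that_fits = num_gpu_blocks * block_size
        raise ValueError(
            f"To serve at least one request with max seq len "
            f"{max_model_len}, {needed} KV cache blocks are needed, but "
            f"only {num_gpu_blocks} are available. At most "
            f"{max_len_that_fits} tokens fit. Try increasing "
            f"`num_gpu_blocks_override` or decreasing `max_model_len`."
        )


if __name__ == "__main__":
    # Values from the log in this issue: 1 block of 16 tokens, 2048 max len.
    try:
        check_enough_blocks(num_gpu_blocks=1, max_model_len=2048, block_size=16)
    except ValueError as exc:
        print(f"Initialization would fail: {exc}")
```

With the values from this report, the check fails because 128 blocks are needed and only 1 is available.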

Hopefully this can be fixed while iterating on #26939, but I am creating a separate issue to track it.
