[Bug]: `check_enough_kv_cache_memory` didn't consider `num_gpu_blocks_override` #27181

### Your current environment

H100

### 🐛 Describe the bug

I can launch the server with 
```
vllm serve facebook/opt-125m --num_gpu_blocks_override=1
....
(EngineCore_DP0 pid=2717573) INFO 10-19 21:10:01 [kv_cache_utils.py:772] Overriding num_gpu_blocks=125643 with num_gpu_blocks_override=1
(EngineCore_DP0 pid=2717573) INFO 10-19 21:10:01 [kv_cache_utils.py:1201] GPU KV cache size: 16 tokens
(EngineCore_DP0 pid=2717573) INFO 10-19 21:10:01 [kv_cache_utils.py:1206] Maximum concurrency for 2,048 tokens per request: 0.01x
...
```
However, since there is only one block, no request longer than 16 tokens can be scheduled.
The expected behavior is to raise an error during initialization, similar to the existing check at
https://github.com/vllm-project/vllm/blob/f32bf7582e74ef967e657a25fd93186b31a46bed/vllm/v1/core/kv_cache_utils.py#L667-L687

Hopefully this can be fixed while iterating on https://github.com/vllm-project/vllm/pull/26939, but I am creating a separate issue to track it.
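
For reference, the numbers in the log above follow from simple block arithmetic. Below is a minimal sketch, assuming a single full-attention KV cache group and a block size of 16 tokens (inferred from the "GPU KV cache size: 16 tokens" line, not from any config shown here). The helper names are hypothetical and are not part of vLLM:

```python
import math


def kv_cache_capacity_tokens(num_blocks: int, block_size: int) -> int:
    """Total tokens the KV cache can hold (hypothetical helper)."""
    return num_blocks * block_size


def max_concurrency(num_blocks: int, block_size: int, max_model_len: int) -> float:
    """How many max-length requests fit in the cache at once."""
    return kv_cache_capacity_tokens(num_blocks, block_size) / max_model_len


def blocks_needed(max_model_len: int, block_size: int) -> int:
    """Blocks required to hold one request of max_model_len tokens."""
    return math.ceil(max_model_len / block_size)


# Values taken from the log: 1 block after the override, 2,048 tokens per request.
print(kv_cache_capacity_tokens(1, 16))        # 16
print(f"{max_concurrency(1, 16, 2048):.2f}x")  # 0.01x
print(blocks_needed(2048, 16))                # 128
```

Under these assumptions, a single max-length request would need 128 blocks, so an override of 1 cannot serve it.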

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.

	if needed_memory > available_memory:
	    # Estimate the maximum model length that can fit in the available memory
	    estimated_max_len = estimate_max_model_len(
	        vllm_config, kv_cache_spec, available_memory
	    )
	    estimated_msg = ""
	    if estimated_max_len > 0:
	        estimated_msg = (
	            "Based on the available memory, "
	            f"the estimated maximum model length is {estimated_max_len}."
	        )

	    raise ValueError(
	        f"To serve at least one request with the models's max seq len "
	        f"({max_model_len}), ({needed_memory / GiB_bytes:.2f} GiB KV "
	        f"cache is needed, which is larger than the available KV cache "
	        f"memory ({available_memory / GiB_bytes:.2f} GiB). "
	        f"{estimated_msg} "
	        f"Try increasing `gpu_memory_utilization` or decreasing "
	        f"`max_model_len` when initializing the engine."
	    )
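
One possible shape for the missing check is sketched below. This is an illustration under stated assumptions, not the actual vLLM fix: `check_blocks_override` is a hypothetical function, and it assumes every request needs `ceil(max_model_len / block_size)` blocks, which ignores sliding-window and hybrid KV cache layouts.

```python
import math
from typing import Optional


def check_blocks_override(
    num_gpu_blocks_override: Optional[int],
    block_size: int,
    max_model_len: int,
) -> None:
    """Raise if the overridden block count cannot hold one max-length request.

    Hypothetical helper for illustration only; not part of vLLM.
    """
    if num_gpu_blocks_override is None:
        # No override set: the existing memory-based check applies instead.
        return

    needed_blocks = math.ceil(max_model_len / block_size)
    if num_gpu_blocks_override < needed_blocks:
        raise ValueError(
            f"To serve at least one request with max seq len {max_model_len}, "
            f"{needed_blocks} KV cache blocks of size {block_size} are needed, "
            f"but num_gpu_blocks_override={num_gpu_blocks_override}. "
            "Try increasing `num_gpu_blocks_override` or decreasing "
            "`max_model_len`."
        )


# With the values from this report, the check would fail at startup.
try:
    check_blocks_override(1, block_size=16, max_model_len=2048)
except ValueError as exc:
    print(f"rejected: {exc}")
```

Running the check during initialization, rather than leaving requests unschedulable at runtime, would match the behavior of the memory-based check quoted above.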
