🚀 The feature, motivation and pitch
I'm working on LLM inference using the LLM.beam_search function. One major usability issue I've encountered is the lack of progress visibility during inference: when running beam search, I don't know how long to wait or how far along the process is, which makes development feel opaque and unpredictable.
A similar request was previously raised in #11835 (comment), but it was closed. However, the need still exists and is especially relevant for anyone using beam search on long sequences or in production loops.
The beam search logic is currently implemented using a token-level loop, not per-instance logic, as seen in:
Line 605 in ca27f0f:

```python
for _ in range(max_tokens):
```

Lines 622 to 627 in ca27f0f:

```python
# only runs for one step
# we don't need to use tqdm here
output = self.generate(prompts_batch,
                       sampling_params=beam_search_params,
                       use_tqdm=False,
                       lora_request=lora_req_batch)
```
Because of this, implementing an instance-level progress bar (like the one in LLM.generate and LLM.chat) isn't straightforward. However, a token-level progress bar would still be a significant UX improvement, letting users see progress and estimate the remaining runtime.
This is especially important for debugging, monitoring batch inference jobs, or integrating into real-time systems.
Alternatives
The simplest approach is to wrap the range(max_tokens) loop with tqdm, like so:
```python
from tqdm import tqdm

for _ in tqdm(range(max_tokens)):
    ...  # existing per-step beam search logic
```
This could be conditionally enabled with a flag like use_tqdm=True in the function signature to keep it optional.
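A rough sketch of how this could look inside the existing loop, assuming a new use_tqdm parameter is added to beam_search (this flag does not exist today; max_tokens and the per-step body are taken from the code quoted above):

```python
from tqdm import tqdm

# `use_tqdm` is the proposed (currently non-existent) flag; defaulting it to
# False would keep today's behavior unchanged.
token_iter = range(max_tokens)
if use_tqdm:
    token_iter = tqdm(token_iter, desc="Beam search token steps")

for _ in token_iter:
    # ... existing per-step logic: one-step self.generate(...) call,
    # beam updates, and stopping-condition checks ...
    pass
```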
Lines 614 to 615 in ca27f0f:

```python
if len(all_beams) == 0:
    break
```
While the loop may terminate early due to stopping conditions, the progress bar can simply treat max_tokens as an upper bound on the number of steps. A small warning or log message can inform users that this is only an estimate and that the loop may finish earlier.
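For the early-exit case, one option (a sketch only, again assuming the proposed use_tqdm flag; all_beams and max_tokens come from the quoted code) is to drive the bar manually with max_tokens as the stated upper bound and close it explicitly, so an early stop still leaves the display in a sensible state:

```python
from tqdm import tqdm

# total=max_tokens is only an upper bound; the bar stops early if every beam
# finishes before max_tokens steps.
pbar = tqdm(total=max_tokens, desc="Beam search steps (upper bound)") if use_tqdm else None

for _ in range(max_tokens):
    if len(all_beams) == 0:
        break  # all beams finished before reaching max_tokens

    # ... existing single-step generate() call and beam bookkeeping ...

    if pbar is not None:
        pbar.update(1)

if pbar is not None:
    pbar.close()
```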
Additional context
This change would bring LLM.beam_search more in line with LLM.generate and LLM.chat, both of which already offer a use_tqdm progress bar. Users running long sequences (e.g., for summarization, translation, or chain-of-thought tasks) are especially impacted.
The implementation cost is low, but the value in usability and developer confidence is high. This also helps in CI/testing pipelines where inference time needs to be monitored.
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.