
[Feature]: Add token-level progress bar for LLM.beam_search inference #19300

@NekoMimiUnagi

Description


🚀 The feature, motivation and pitch

I'm running LLM inference with the LLM.beam_search function. One major usability issue I've encountered is the lack of progress visibility: while beam search is running, there is no way to tell how far along it is or how long it will take, which makes long runs hard to plan around and debug.

A similar request was raised earlier in #11835 (comment) but was closed. The need still exists, and it is especially relevant for anyone running beam search on long sequences or in production loops.

The beam search logic is currently implemented using a token-level loop, not per-instance logic, as seen in:

for _ in range(max_tokens):
    # only runs for one step
    # we don't need to use tqdm here
    output = self.generate(prompts_batch,
                           sampling_params=beam_search_params,
                           use_tqdm=False,
                           lora_request=lora_req_batch)

Because of this, implementing an instance-level progress bar (like the one in LLM.generate and LLM.chat) isn't straightforward. However, a token-level progress bar would still be a large usability win, letting users see progress and estimate runtime.

This is especially important for debugging, monitoring batch inference jobs, or integrating into real-time systems.

Alternatives

The simplest approach is to wrap the range(max_tokens) loop with tqdm, like so:

from tqdm import tqdm

for _ in tqdm(range(max_tokens)):
    ...  # existing loop body unchanged

This could be conditionally enabled with a flag like use_tqdm=True in the function signature to keep it optional.
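As a rough sketch (the helper below and its use_tqdm parameter are hypothetical, not part of the current beam_search API; the flag name just mirrors the one LLM.generate already accepts), the flag could simply decide whether the step iterator is wrapped:

from tqdm import tqdm

def _token_steps(max_tokens: int, use_tqdm: bool):
    # Hypothetical helper, not existing vLLM code.
    steps = range(max_tokens)
    if use_tqdm:
        # max_tokens is only an upper bound; the loop may stop earlier.
        steps = tqdm(steps, desc="beam_search steps", unit="tok")
    return steps

# Inside beam_search, only the loop header would change:
# for _ in _token_steps(max_tokens, use_tqdm):
#     output = self.generate(..., use_tqdm=False, ...)

With the flag defaulting to False, existing callers see no change in behavior.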

Note that the loop can also exit early once no beams remain:

if len(all_beams) == 0:
    break

Because of such stopping conditions the loop may terminate early, so the progress bar would only reflect an upper bound on the remaining work. A short warning or log message can tell users that the estimate is an upper bound and the loop may finish sooner.
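A minimal sketch of that behavior, assuming the bar is driven manually so it can be closed cleanly on an early exit (variable names are illustrative, not taken from the vLLM source):

from tqdm import tqdm

max_tokens = 128                  # illustrative value
all_beams = ["beam-0"]            # stand-in for the live beam list

pbar = tqdm(total=max_tokens, desc="beam_search steps", unit="tok")
try:
    for _ in range(max_tokens):
        # ... one decoding step that updates all_beams ...
        pbar.update(1)
        if len(all_beams) == 0:
            break                 # early exit: bar simply stops short of max_tokens
finally:
    pbar.close()                  # no dangling bar even if we break or raise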

Additional context

This change would bring LLM.beam_search more in line with LLM.generate and LLM.chat, both of which offer some level of progress monitoring. Users running long sequences (e.g., for summarization, translation, or CoT tasks) are especially impacted.

The implementation cost is low, but the value in usability and developer confidence is high. This also helps in CI/testing pipelines where inference time needs to be monitored.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
