🚀 The feature, motivation and pitch
I'm working on LLM inference using the LLM.beam_search function. One major usability issue I've encountered is the lack of progress visibility during inference: when running beam search, I don't know how long to wait or how far along the process is, which makes development feel opaque and unpredictable.
A similar request was previously raised in #11835 (comment), but it was closed. However, the need still exists and is especially relevant for anyone using beam search on long sequences or in production loops.
The beam search logic is currently implemented using a token-level loop, not per-instance logic, as seen in:
Line 605 in ca27f0f:

```python
for _ in range(max_tokens):
```

Lines 622 to 627 in ca27f0f:

```python
# only runs for one step
# we don't need to use tqdm here
output = self.generate(prompts_batch,
                       sampling_params=beam_search_params,
                       use_tqdm=False,
                       lora_request=lora_req_batch)
```
Because of this, implementing an instance-level progress bar (like the one in LLM.generate and LLM.chat) isn't straightforward. However, a token-level progress bar would still be a significant UX improvement, letting users see progress and estimate the remaining runtime.
This is especially important for debugging, monitoring batch inference jobs, or integrating into real-time systems.
Alternatives
The simplest approach is to wrap the range(max_tokens) loop with tqdm, like so:
```python
from tqdm import tqdm

for _ in tqdm(range(max_tokens)):
    ...  # existing per-step beam search logic
```
This could be conditionally enabled with a flag like use_tqdm=True in the function signature to keep it optional.
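A rough sketch of how this could look inside the existing loop, assuming a new use_tqdm parameter is added to beam_search (this flag does not exist today; max_tokens and the per-step body are taken from the code quoted above):

```python
from tqdm import tqdm

# `use_tqdm` is the proposed (currently non-existent) flag; defaulting it to
# False would keep today's behavior unchanged.
token_iter = range(max_tokens)
if use_tqdm:
    token_iter = tqdm(token_iter, desc="Beam search token steps")

for _ in token_iter:
    # ... existing per-step logic: one-step self.generate(...) call,
    # beam updates, and stopping-condition checks ...
    pass
```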
Lines 614 to 615 in ca27f0f:

```python
if len(all_beams) == 0:
    break
```
While the loop may terminate early due to stopping conditions, the progress bar can simply treat max_tokens as an upper bound on the number of steps. A small warning or log message can inform users that this is only an estimate and that the loop may finish earlier.
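For the early-exit case, one option (a sketch only, again assuming the proposed use_tqdm flag; all_beams and max_tokens come from the quoted code) is to drive the bar manually with max_tokens as the stated upper bound and close it explicitly, so an early stop still leaves the display in a sensible state:

```python
from tqdm import tqdm

# total=max_tokens is only an upper bound; the bar stops early if every beam
# finishes before max_tokens steps.
pbar = tqdm(total=max_tokens, desc="Beam search steps (upper bound)") if use_tqdm else None

for _ in range(max_tokens):
    if len(all_beams) == 0:
        break  # all beams finished before reaching max_tokens

    # ... existing single-step generate() call and beam bookkeeping ...

    if pbar is not None:
        pbar.update(1)

if pbar is not None:
    pbar.close()
```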
Additional context
This change would bring LLM.beam_search more in line with LLM.generate and LLM.chat, both of which already offer a use_tqdm progress bar. Users running long sequences (e.g., for summarization, translation, or chain-of-thought tasks) are especially impacted.
The implementation cost is low, but the value in usability and developer confidence is high. This also helps in CI/testing pipelines where inference time needs to be monitored.
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.