[Frontend] Optimize beam search loop by sorting and then splicing #19347
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small and essential subset of CI tests runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. 🚀
Hello @zhanggzh, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello! Gemini here, providing a summary of this pull request. This PR focuses on optimizing the beam search implementation within vLLM. The primary goal is to improve performance by replacing a potentially time-consuming nested loop structure used for processing beam search candidates with a more efficient, vectorized approach leveraging NumPy. The original implementation involved iterating through each beam and then iterating through its potential next tokens, which could be slow, especially with large beam widths and many candidates per step. This change aims to reduce the overhead associated with splicing tokens and calculating cumulative probabilities by collecting all candidates and processing them in bulk using NumPy's array operations and sorting/selection capabilities.
Highlights
- **Beam Search Optimization:** The core of this PR is an optimization to the beam search logic in `vllm/engine/protocol.py`.
- **Vectorized Processing with NumPy:** The previous nested loop structure for processing beam search candidates is replaced with a vectorized approach that collects token IDs and cumulative log probabilities into NumPy arrays, then uses NumPy's `argpartition` for efficient top-k selection.
- **Improved Performance:** The PR description includes test results suggesting a significant reduction in processing time for beam search requests, particularly when `n` (number of sequences) is large.
- **Separate EOS Handling:** Completed sequences (those ending with the EOS token) are now identified and processed separately before the top `beam_width` non-EOS candidates are selected using NumPy.
Changelog
- `vllm/engine/protocol.py`
  - Added an import for the `numpy` library (line 7).
  - Introduced a `logprobs_num` variable for clarity when setting `SamplingParams.logprobs` (lines 102-104).
  - Completely refactored the main loop that processes the output of each generation step (lines 144-206).
  - Collected all potential next token IDs and their cumulative log probabilities from all beams into the NumPy arrays `all_beams_token_id` and `all_beams_logprob` (lines 146-160).
  - Implemented separate handling for EOS tokens: identified their indices using `np.where`, added completed sequences to the `completed` list, and set their log probabilities to negative infinity in `all_beams_logprob` to exclude them from further selection (lines 162-185).
  - Used `np.argpartition` to efficiently find the indices of the top `beam_width` probabilities among the remaining candidates (lines 187-190).
  - Constructed the `new_beams` list by iterating through the indices returned by `np.argpartition` and creating `BeamSearchSequence` objects (lines 192-204).
  - Replaced the previous `sorted` call and slicing with the new NumPy-based selection logic (lines 176-177 in the original code vs. line 206 in the new code).
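To make the described flow concrete, here is a small self-contained sketch of the EOS masking and `argpartition` selection outlined above (the array names follow the changelog; the values and `eos_token_id` are made up for illustration, and this is not the PR's verbatim code):

```python
import numpy as np

eos_token_id = 2   # hypothetical EOS id for this example
beam_width = 3

# One flattened entry per (beam, candidate-token) pair.
all_beams_token_id = np.array([5, 2, 9, 11, 2, 7, 13, 4])
all_beams_logprob = np.array([-1.2, -0.4, -2.1, -0.9, -1.5, -3.0, -0.7, -2.5])

# Candidates ending in EOS are completed; mask them with -inf so the
# top-k selection cannot pick them as continuing beams.
eos_idx = np.where(all_beams_token_id == eos_token_id)[0]
all_beams_logprob[eos_idx] = -np.inf

# Indices of the beam_width highest remaining log probabilities.
topn_idx = np.argpartition(np.negative(all_beams_logprob),
                           beam_width)[:beam_width]
print(all_beams_token_id[topn_idx])  # tokens 13, 11, 5 in some order
```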
Code Review
This pull request introduces a significant optimization to the beam search algorithm by leveraging NumPy for more efficient processing of candidate tokens. The approach of batching token processing and using np.argpartition is a good strategy for performance improvement, as indicated by the test results.
I've identified a couple of areas that need attention: one critical issue related to the usage of np.argpartition which could lead to errors or incorrect results in certain edge cases, and a potential behavioral change concerning the handling of length_penalty.
Overall, the optimization is a valuable contribution. Addressing these points will help ensure the robustness and correctness of the new implementation.
Summary of Findings
- **Robustness of np.argpartition:** The current use of `np.argpartition` might be susceptible to errors if the number of candidate log probabilities is less than or equal to `beam_width`. This needs to be handled to prevent potential runtime errors or incorrect selections.
- **Handling of length_penalty:** The refactored beam pruning logic selects candidates based on cumulative log probability, differing from the previous approach that used a length-penalized score. This could change results if `length_penalty` is not 1.0.
- **Comment Style:** A comment on line 162 uses `##` instead of the standard `#`. This is a minor stylistic point.
Merge Readiness
The optimization to the beam search logic is a promising improvement for performance. However, there are a couple of important points to address before this PR is ready for merging:
- The usage of `np.argpartition` needs to be made more robust to handle edge cases where the number of candidates is small relative to `beam_width`. This is a high-severity issue that could lead to runtime errors.
- There's a potential change in behavior related to `length_penalty`. Clarification is needed on whether this change is intended, as it could affect the beam search path for non-default penalty values.
Once these issues are resolved, the PR should be in good shape. I am unable to approve pull requests, so please ensure these changes are reviewed and approved by other maintainers before merging.
vllm/engine/protocol.py
Outdated
The usage of np.argpartition here might lead to issues if all_beams_logprob.size <= beam_width.
Specifically, the kth argument (which is beam_width in the current code) in np.argpartition(a, kth) must satisfy 0 <= kth < a.shape[axis] for a non-empty array a. If all_beams_logprob.size <= beam_width, then beam_width can be an invalid kth argument, potentially causing an IndexError or unexpected behavior.
Could you consider a more robust way to select the top beam_width candidates? This would involve handling cases where all_beams_logprob.size == 0 or all_beams_logprob.size < beam_width explicitly.
Current:

```python
topn_idx = np.argpartition(np.negative(all_beams_logprob),
                           beam_width)[:beam_width]
```

Suggested change:

```python
num_candidates = all_beams_logprob.size
if num_candidates > 0:
    # We want to select min(beam_width, num_candidates) elements.
    num_to_select = min(beam_width, num_candidates)
    if num_to_select > 0:
        # np.argpartition requires kth < arr.size for the pivot.
        # kth is 0-indexed. To get num_to_select smallest items,
        # we partition around the (num_to_select - 1)-th element.
        kth_pivot_index = num_to_select - 1
        neg_logprobs = np.negative(all_beams_logprob)
        partitioned_indices = np.argpartition(neg_logprobs, kth_pivot_index)
        topn_idx = partitioned_indices[:num_to_select]
    else:
        # This case occurs if beam_width is 0.
        topn_idx = np.array([], dtype=int)
else:
    # No candidates to select from.
    topn_idx = np.array([], dtype=int)
```
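For reference, a quick standalone demonstration of the edge case the comment describes, with illustrative values:

```python
import numpy as np

all_beams_logprob = np.array([-0.5, -1.0])  # only 2 candidates
beam_width = 4

try:
    np.argpartition(np.negative(all_beams_logprob), beam_width)
except ValueError as err:
    # NumPy rejects kth values >= the array size along the partition axis.
    print("argpartition failed:", err)
```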
vllm/engine/protocol.py
Outdated
The previous beam pruning logic (sorting new_beams with sort_beams_key) incorporated length_penalty. The new logic selects the top beam_width candidates based on raw cum_logprob (using np.argpartition on all_beams_logprob at lines 189-190, and then assigning new_beams to all_beams at line 206).
If length_penalty == 1.0 (the default), this change is likely fine because score = cum_logprob / seq_len, and seq_len is constant for all candidates at a given step. Thus, maximizing cum_logprob is equivalent to maximizing the score.
However, if length_penalty != 1.0, the score function cum_logprob / (seq_len ** length_penalty) is not necessarily maximized by cum_logprob alone. This means the set of beams chosen at each step could differ from the original logic, potentially leading to different final results.
Could you clarify if this change in behavior for length_penalty != 1.0 is intended or if the beam search should still prune based on the length-penalized score at each step? If the latter, the selection logic might need to incorporate the sort_beams_key or an equivalent calculation before np.argpartition.
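As a rough sketch of what incorporating the length-penalized score might look like (illustrative values; `seq_len` and `length_penalty` are assumed to be in scope at this point in the loop, and this is not the PR's code):

```python
import numpy as np

all_beams_logprob = np.array([-1.2, -0.4, -2.1, -0.9, -1.5])
seq_len = 17           # candidate length at this step (assumed available)
length_penalty = 0.8   # non-default penalty (assumed available)
beam_width = 3

# Rank by the length-penalized score instead of raw cum_logprob, so pruning
# matches the ordering sort_beams_key would produce.
scores = all_beams_logprob / (seq_len ** length_penalty)

k = min(beam_width, scores.size)
topn_idx = (np.argpartition(np.negative(scores), k - 1)[:k]
            if k > 0 else np.array([], dtype=int))
```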
Signed-off-by: zhangguozhu <[email protected]>
Signed-off-by: zhangguozhu <[email protected]>
Signed-off-by: zhangguozhu <[email protected]>
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
This pull request has merge conflicts that must be resolved before it can be merged.
Is this PR still active?
Signed-off-by: mgoin <[email protected]>
aarnphm left a comment
LGTM
Signed-off-by: mgoin <[email protected]>
Signed-off-by: mgoin <[email protected]>
…lm-project#19347)
Signed-off-by: zhangguozhu <[email protected]>
Signed-off-by: mgoin <[email protected]>
Co-authored-by: zhangguozhu <[email protected]>
Co-authored-by: mgoin <[email protected]>
Signed-off-by: LuminolT <[email protected]>
Purpose
In the source code of vLLM's beam_search, a two-level nested loop splices the generated tokens and computes the cumulative probabilities. This splicing operation takes essentially the same order of magnitude of time as each execution of generate.
Code:
(screenshot of the original nested-loop implementation)
Time consumption:
(screenshot of timing measurements)
The above logic is optimized into the following approach: select the top candidates by sorting first, and only then splice token lists for the survivors, reducing the time spent on splicing.
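A minimal sketch of the idea (illustrative data and shapes, not the PR's verbatim code): rank all candidates on cumulative log probability first, then splice token lists only for the `beam_width` survivors:

```python
import numpy as np

beam_width = 2
beam_tokens = [[1, 4], [1, 6]]
beam_cum_logprob = np.array([-0.5, -0.9])
cand_tokens = np.array([[7, 9], [2, 5]])        # candidate tokens per beam
cand_logprob = np.array([[-1.1, -0.3], [-0.2, -2.0]])

# Flatten all (beam, candidate) pairs and score them in bulk.
all_logprob = (beam_cum_logprob[:, None] + cand_logprob).ravel()
all_token_id = cand_tokens.ravel()

# Select the top beam_width candidates before doing any splicing.
topn_idx = np.argpartition(np.negative(all_logprob), beam_width)[:beam_width]

# Splice token lists only for the survivors.
new_beams = []
for idx in topn_idx:
    beam_idx = idx // cand_tokens.shape[1]      # which beam this candidate came from
    new_beams.append((beam_tokens[beam_idx] + [int(all_token_id[idx])],
                      float(all_logprob[idx])))
print(new_beams)
```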
Test Plan
Set request parameters:
Test Result