
Conversation

@zhanggzh (Contributor) commented Jun 9, 2025

Purpose

In the source code of vLLM's beam search, a two-level nested loop splices the generated tokens onto each beam and computes the cumulative probabilities. This splicing operation takes roughly the same order of magnitude of time as each call to generate.

Code: (screenshot of the original nested-loop implementation, not reproduced here)

Time consumed: (profiling screenshot, not reproduced here)

This PR optimizes the above logic by selecting the top candidates first (via sorting) and then splicing only the surviving beams, reducing the work done per step.
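For illustration, here is a minimal runnable sketch of the before/after pattern. The `beams` structure, its field names, and all values are made up for the example and are not the actual vLLM data structures:

```python
import numpy as np

beam_width = 2
# Each beam carries its token prefix, cumulative logprob, and the
# candidate (token_id, logprob) pairs returned by the last generate step.
beams = [
    {"tokens": [1, 2], "cum_logprob": -0.3,
     "cands": [(5, -0.1), (6, -0.7), (7, -1.5)]},
    {"tokens": [1, 3], "cum_logprob": -0.9,
     "cands": [(5, -0.2), (8, -0.4), (9, -2.0)]},
]

# Original pattern: splice token lists for *every* candidate inside a
# nested loop, then sort by cumulative logprob.
slow = sorted(
    ((beam["tokens"] + [tok], beam["cum_logprob"] + lp)
     for beam in beams for tok, lp in beam["cands"]),
    key=lambda pair: pair[1], reverse=True)[:beam_width]

# Optimized pattern: gather scores into a flat NumPy array, pick the
# top-k indices first, and splice tokens only for the survivors.
logprobs = np.array([beam["cum_logprob"] + lp
                     for beam in beams for _, lp in beam["cands"]])
token_ids = [tok for beam in beams for tok, _ in beam["cands"]]
n_cands = len(beams[0]["cands"])  # candidates per beam (uniform here)
top = np.argpartition(-logprobs, beam_width - 1)[:beam_width]
fast = [(beams[i // n_cands]["tokens"] + [token_ids[i]], float(logprobs[i]))
        for i in top]

assert sorted(slow) == sorted(fast)
```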

Test Plan

Set request parameters:

payload = {
    "model": MODEL,
    "messages": [
        {
            "role": "user",
            "content": prompt
        }
    ],
    "use_beam_search": True,
    "n": 30,
    "max_tokens": 8
}
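For reference, a payload like this can be posted to a running vLLM OpenAI-compatible server; the URL, MODEL, and prompt below are placeholders:

```python
import requests

MODEL = "your-model-name"  # placeholder
prompt = "Write a haiku about the sea."  # placeholder

payload = {
    "model": MODEL,
    "messages": [{"role": "user", "content": prompt}],
    "use_beam_search": True,
    "n": 30,
    "max_tokens": 8,
}

# Assumes a server started with `vllm serve <model>` on localhost:8000.
resp = requests.post("http://localhost:8000/v1/chat/completions",
                     json=payload, timeout=300)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```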

Test Result

(benchmark screenshots showing a significant reduction in beam-search processing time, particularly for large n; not reproduced here)

github-actions bot commented Jun 9, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

gemini-code-assist bot (Contributor) left a comment

Hello @zhanggzh, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello! Gemini here, providing a summary of this pull request. This PR focuses on optimizing the beam search implementation within vLLM. The primary goal is to improve performance by replacing a potentially time-consuming nested loop structure used for processing beam search candidates with a more efficient, vectorized approach leveraging NumPy. The original implementation involved iterating through each beam and then iterating through its potential next tokens, which could be slow, especially with large beam widths and many candidates per step. This change aims to reduce the overhead associated with splicing tokens and calculating cumulative probabilities by collecting all candidates and processing them in bulk using NumPy's array operations and sorting/selection capabilities.

Highlights

  • Beam Search Optimization: The core of this PR is an optimization to the beam search logic in vllm/engine/protocol.py.
  • Vectorized Processing with NumPy: The previous nested loop structure for processing beam search candidates is replaced with a vectorized approach using NumPy arrays for collecting token IDs and cumulative log probabilities, and then using NumPy's argpartition for efficient top-k selection.
  • Improved Performance: The PR description includes test results suggesting a significant reduction in processing time for beam search requests, particularly when n (number of sequences) is large.
  • Separate EOS Handling: Completed sequences (those ending with the EOS token) are now identified and processed separately before the top beam_width non-EOS candidates are selected using NumPy.

Changelog

  • vllm/engine/protocol.py
    • Added import for the numpy library (line 7).
    • Introduced logprobs_num variable for clarity when setting SamplingParams.logprobs (lines 102-104).
    • Completely refactored the main loop that processes the output of each generation step (lines 144-206).
    • Collected all potential next token IDs and their cumulative log probabilities from all beams into NumPy arrays all_beams_token_id and all_beams_logprob (lines 146-160).
    • Implemented separate handling for EOS tokens: identified their indices using np.where, added completed sequences to the completed list, and set their log probabilities to negative infinity in all_beams_logprob to exclude them from further selection (lines 162-185); see the sketch after this changelog.
    • Used np.argpartition to efficiently find the indices of the top beam_width probabilities among the remaining candidates (lines 187-190).
    • Constructed the new_beams list by iterating through the indices returned by np.argpartition and creating BeamSearchSequence objects (lines 192-204).
    • Replaced the previous sorted call and slicing with the new NumPy-based selection logic (lines 176-177 in the original code vs. line 206 in the new code).
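As a concrete illustration of the EOS-handling and selection steps above, here is a simplified sketch. The variable names follow the changelog, but the values are made up, and the real code in vllm/engine/protocol.py carries additional per-beam bookkeeping (token prefixes, BeamSearchSequence objects, etc.):

```python
import numpy as np

eos_token_id = 2
beam_width = 3

# Flattened candidates from all beams at one step (illustrative values).
all_beams_token_id = np.array([5, 2, 7, 2, 9, 4])
all_beams_logprob = np.array([-0.4, -0.6, -1.1, -0.9, -1.8, -2.2])

completed = []

# Candidates that just emitted EOS are finished sequences.
eos_idx = np.where(all_beams_token_id == eos_token_id)[0]
for i in eos_idx:
    # The real code appends BeamSearchSequence objects; tuples stand in here.
    completed.append((int(all_beams_token_id[i]), float(all_beams_logprob[i])))

# Mask finished candidates so they cannot be selected again.
all_beams_logprob[eos_idx] = -np.inf

# Top-beam_width remaining candidates, using the PR's argpartition call
# (see the review note below about the size <= beam_width edge case).
topn_idx = np.argpartition(np.negative(all_beams_logprob),
                           beam_width)[:beam_width]
```
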
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces a significant optimization to the beam search algorithm by leveraging NumPy for more efficient processing of candidate tokens. The approach of batching token processing and using np.argpartition is a good strategy for performance improvement, as indicated by the test results.

I've identified a couple of areas that need attention: one critical issue related to the usage of np.argpartition which could lead to errors or incorrect results in certain edge cases, and a potential behavioral change concerning the handling of length_penalty.

Overall, the optimization is a valuable contribution. Addressing these points will help ensure the robustness and correctness of the new implementation.

Summary of Findings

  • Robustness of np.argpartition: The current use of np.argpartition might be susceptible to errors if the number of candidate log probabilities is less than or equal to beam_width. This needs to be handled to prevent potential runtime errors or incorrect selections.
  • Handling of length_penalty: The refactored beam pruning logic selects candidates based on cumulative log probability, differing from the previous approach that used a length-penalized score. This could change results if length_penalty is not 1.0.
  • Comment Style: A comment on line 162 uses ## instead of the standard #. This is a minor stylistic point.

Merge Readiness

The optimization to the beam search logic is a promising improvement for performance. However, there are a couple of important points to address before this PR is ready for merging:

  1. The usage of np.argpartition needs to be made more robust to handle edge cases where the number of candidates is small relative to beam_width. This is a high-severity issue that could lead to runtime errors.
  2. There's a potential change in behavior related to length_penalty. Clarification is needed on whether this change is intended, as it could affect the beam search path for non-default penalty values.

Once these issues are resolved, the PR should be in good shape. I am unable to approve pull requests, so please ensure these changes are reviewed and approved by other maintainers before merging.

gemini-code-assist bot (Contributor) left a review comment

Severity: high

The usage of np.argpartition here might lead to issues if all_beams_logprob.size <= beam_width.
Specifically, the kth argument (which is beam_width in the current code) in np.argpartition(a, kth) must satisfy 0 <= kth < a.shape[axis] for a non-empty array a. If all_beams_logprob.size <= beam_width, then beam_width can be an invalid kth argument, potentially causing an IndexError or unexpected behavior.

Could you consider a more robust way to select the top beam_width candidates? This would involve handling cases where all_beams_logprob.size == 0 or all_beams_logprob.size < beam_width explicitly.

Suggested change

```diff
-topn_idx = np.argpartition(np.negative(all_beams_logprob),
-                           beam_width)[:beam_width]
+num_candidates = all_beams_logprob.size
+if num_candidates > 0:
+    # We want to select min(beam_width, num_candidates) elements.
+    num_to_select = min(beam_width, num_candidates)
+    if num_to_select > 0:
+        # np.argpartition requires kth < arr.size for the pivot.
+        # kth is 0-indexed. To get num_to_select smallest items,
+        # we partition around the (num_to_select - 1)-th element.
+        kth_pivot_index = num_to_select - 1
+        neg_logprobs = np.negative(all_beams_logprob)
+        partitioned_indices = np.argpartition(neg_logprobs, kth_pivot_index)
+        topn_idx = partitioned_indices[:num_to_select]
+    else:
+        # This case occurs if beam_width is 0.
+        topn_idx = np.array([], dtype=int)
+else:
+    # No candidates to select from.
+    topn_idx = np.array([], dtype=int)
```

gemini-code-assist bot (Contributor) left a review comment

Severity: medium

The previous beam pruning logic (sorting new_beams with sort_beams_key) incorporated length_penalty. The new logic selects the top beam_width candidates based on raw cum_logprob (using np.argpartition on all_beams_logprob at lines 189-190, and then assigning new_beams to all_beams at line 206).

If length_penalty == 1.0 (the default), this change is likely fine because score = cum_logprob / seq_len, and seq_len is constant for all candidates at a given step. Thus, maximizing cum_logprob is equivalent to maximizing the score.

However, if length_penalty != 1.0, the score function cum_logprob / (seq_len ** length_penalty) is not necessarily maximized by cum_logprob alone. This means the set of beams chosen at each step could differ from the original logic, potentially leading to different final results.

Could you clarify if this change in behavior for length_penalty != 1.0 is intended or if the beam search should still prune based on the length-penalized score at each step? If the latter, the selection logic might need to incorporate the sort_beams_key or an equivalent calculation before np.argpartition.
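For illustration, here is one way to keep length-penalized pruning while still using vectorized top-k selection. This is only a sketch with stand-in names, and it relies on the fact noted above that all candidates at a given step share the same length:

```python
import numpy as np

def select_top_beams(all_beams_logprob: np.ndarray, seq_len: int,
                     length_penalty: float, beam_width: int) -> np.ndarray:
    """Return indices of the top-`beam_width` candidates ranked by the
    length-penalized score cum_logprob / seq_len ** length_penalty."""
    if all_beams_logprob.size == 0 or beam_width <= 0:
        return np.array([], dtype=int)
    scores = all_beams_logprob / (seq_len ** length_penalty)
    k = min(beam_width, scores.size)
    # Partition around the (k - 1)-th largest score; k - 1 < size holds.
    return np.argpartition(np.negative(scores), k - 1)[:k]
```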

zhangguozhu added 2 commits June 9, 2025 14:13
Signed-off-by: zhangguozhu <[email protected]>
Signed-off-by: zhangguozhu <[email protected]>
Signed-off-by: zhangguozhu <[email protected]>
github-actions bot commented Sep 8, 2025

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

@github-actions github-actions bot added the stale (Over 90 days of inactivity) label Sep 8, 2025
mergify bot commented Sep 11, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @zhanggzh.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 11, 2025
@github-actions github-actions bot added the unstale (Received activity after being labelled stale) label and removed the stale (Over 90 days of inactivity) label Sep 12, 2025
@denadai2 commented

Is this PR still active?

martinbomio added a commit to martinbomio/vllm that referenced this pull request Oct 23, 2025
@mergify mergify bot added frontend and removed needs-rebase labels Nov 18, 2025
@mergify mergify bot mentioned this pull request Nov 18, 2025
aarnphm (Collaborator) left a comment

LGTM

@mgoin mgoin changed the title [Frontend]Opt beam search [Frontend] Optimize beam search Nov 19, 2025
@mgoin mgoin changed the title [Frontend] Optimize beam search [Frontend] Optimize beam search loop by sorting and then splicing Nov 19, 2025
Signed-off-by: mgoin <[email protected]>
Signed-off-by: mgoin <[email protected]>
@mgoin mgoin added the performance (Performance-related issues) and ready (ONLY add when PR is ready to merge/full CI is needed) labels and removed the unstale (Received activity after being labelled stale) label Nov 20, 2025
@vllm-bot vllm-bot merged commit 56f45ed into vllm-project:main Nov 20, 2025
45 of 49 checks passed
LuminolT pushed a commit to LuminolT/vllm that referenced this pull request Nov 21, 2025
[Frontend] Optimize beam search loop by sorting and then splicing (vllm-project#19347)

Signed-off-by: zhangguozhu <[email protected]>
Signed-off-by: mgoin <[email protected]>
Co-authored-by: zhangguozhu <[email protected]>
Co-authored-by: mgoin <[email protected]>
Signed-off-by: LuminolT <[email protected]>

Labels

frontend, performance (Performance-related issues), ready (ONLY add when PR is ready to merge/full CI is needed)
