[Frontend] Optimize beam search loop by sorting and then splicing #19347
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small and essential subset of CI tests runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. 🚀
Hello @zhanggzh, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello! Gemini here, providing a summary of this pull request. This PR focuses on optimizing the beam search implementation within vLLM. The primary goal is to improve performance by replacing a potentially time-consuming nested loop structure used for processing beam search candidates with a more efficient, vectorized approach leveraging NumPy. The original implementation involved iterating through each beam and then iterating through its potential next tokens, which could be slow, especially with large beam widths and many candidates per step. This change aims to reduce the overhead associated with splicing tokens and calculating cumulative probabilities by collecting all candidates and processing them in bulk using NumPy's array operations and sorting/selection capabilities.
Highlights
- **Beam Search Optimization:** The core of this PR is an optimization to the beam search logic in `vllm/engine/protocol.py`.
- **Vectorized Processing with NumPy:** The previous nested loop structure for processing beam search candidates is replaced with a vectorized approach that collects token IDs and cumulative log probabilities into NumPy arrays, then uses NumPy's `argpartition` for efficient top-k selection.
- **Improved Performance:** The PR description includes test results suggesting a significant reduction in processing time for beam search requests, particularly when `n` (number of sequences) is large.
- **Separate EOS Handling:** Completed sequences (those ending with the EOS token) are now identified and processed separately before the top `beam_width` non-EOS candidates are selected using NumPy.
Changelog
- `vllm/engine/protocol.py`
  - Added an import for the `numpy` library (line 7).
  - Introduced a `logprobs_num` variable for clarity when setting `SamplingParams.logprobs` (lines 102-104).
  - Completely refactored the main loop that processes the output of each generation step (lines 144-206).
  - Collected all potential next token IDs and their cumulative log probabilities from all beams into the NumPy arrays `all_beams_token_id` and `all_beams_logprob` (lines 146-160).
  - Implemented separate handling for EOS tokens: identified their indices using `np.where`, added completed sequences to the `completed` list, and set their log probabilities to negative infinity in `all_beams_logprob` to exclude them from further selection (lines 162-185).
  - Used `np.argpartition` to efficiently find the indices of the top `beam_width` probabilities among the remaining candidates (lines 187-190).
  - Constructed the `new_beams` list by iterating through the indices returned by `np.argpartition` and creating `BeamSearchSequence` objects (lines 192-204).
  - Replaced the previous `sorted` call and slicing with the new NumPy-based selection logic (lines 176-177 in the original code vs. line 206 in the new code).
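To make the described flow concrete, here is a small self-contained sketch of the EOS masking and `argpartition` selection outlined above (the array names follow the changelog; the values and `eos_token_id` are made up for illustration, and this is not the PR's verbatim code):

```python
import numpy as np

eos_token_id = 2   # hypothetical EOS id for this example
beam_width = 3

# One flattened entry per (beam, candidate-token) pair.
all_beams_token_id = np.array([5, 2, 9, 11, 2, 7, 13, 4])
all_beams_logprob = np.array([-1.2, -0.4, -2.1, -0.9, -1.5, -3.0, -0.7, -2.5])

# Candidates ending in EOS are completed; mask them with -inf so the
# top-k selection cannot pick them as continuing beams.
eos_idx = np.where(all_beams_token_id == eos_token_id)[0]
all_beams_logprob[eos_idx] = -np.inf

# Indices of the beam_width highest remaining log probabilities.
topn_idx = np.argpartition(np.negative(all_beams_logprob),
                           beam_width)[:beam_width]
print(all_beams_token_id[topn_idx])  # tokens 13, 11, 5 in some order
```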
Code Review
This pull request introduces a significant optimization to the beam search algorithm by leveraging NumPy for more efficient processing of candidate tokens. The approach of batching token processing and using np.argpartition is a good strategy for performance improvement, as indicated by the test results.
I've identified a couple of areas that need attention: one critical issue related to the usage of np.argpartition which could lead to errors or incorrect results in certain edge cases, and a potential behavioral change concerning the handling of length_penalty.
Overall, the optimization is a valuable contribution. Addressing these points will help ensure the robustness and correctness of the new implementation.
Summary of Findings
- **Robustness of np.argpartition:** The current use of `np.argpartition` might be susceptible to errors if the number of candidate log probabilities is less than or equal to `beam_width`. This needs to be handled to prevent potential runtime errors or incorrect selections.
- **Handling of length_penalty:** The refactored beam pruning logic selects candidates based on cumulative log probability, differing from the previous approach that used a length-penalized score. This could change results if `length_penalty` is not 1.0.
- **Comment Style:** A comment on line 162 uses `##` instead of the standard `#`. This is a minor stylistic point.
Merge Readiness
The optimization to the beam search logic is a promising improvement for performance. However, there are a couple of important points to address before this PR is ready for merging:
- The usage of `np.argpartition` needs to be made more robust to handle edge cases where the number of candidates is small relative to `beam_width`. This is a high-severity issue that could lead to runtime errors.
- There's a potential change in behavior related to `length_penalty`. Clarification is needed on whether this change is intended, as it could affect the beam search path for non-default penalty values.
Once these issues are resolved, the PR should be in good shape. I am unable to approve pull requests, so please ensure these changes are reviewed and approved by other maintainers before merging.
vllm/engine/protocol.py
Outdated
The usage of np.argpartition here might lead to issues if all_beams_logprob.size <= beam_width.
Specifically, the kth argument (which is beam_width in the current code) in np.argpartition(a, kth) must satisfy 0 <= kth < a.shape[axis] for a non-empty array a. If all_beams_logprob.size <= beam_width, then beam_width can be an invalid kth argument, potentially causing an IndexError or unexpected behavior.
Could you consider a more robust way to select the top beam_width candidates? This would involve handling cases where all_beams_logprob.size == 0 or all_beams_logprob.size < beam_width explicitly.
Current:

```python
topn_idx = np.argpartition(np.negative(all_beams_logprob),
                           beam_width)[:beam_width]
```

Suggested change:

```python
num_candidates = all_beams_logprob.size
if num_candidates > 0:
    # We want to select min(beam_width, num_candidates) elements.
    num_to_select = min(beam_width, num_candidates)
    if num_to_select > 0:
        # np.argpartition requires kth < arr.size for the pivot.
        # kth is 0-indexed. To get num_to_select smallest items,
        # we partition around the (num_to_select - 1)-th element.
        kth_pivot_index = num_to_select - 1
        neg_logprobs = np.negative(all_beams_logprob)
        partitioned_indices = np.argpartition(neg_logprobs, kth_pivot_index)
        topn_idx = partitioned_indices[:num_to_select]
    else:
        # This case occurs if beam_width is 0.
        topn_idx = np.array([], dtype=int)
else:
    # No candidates to select from.
    topn_idx = np.array([], dtype=int)
```
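For reference, a quick standalone demonstration of the edge case the comment describes, with illustrative values:

```python
import numpy as np

all_beams_logprob = np.array([-0.5, -1.0])  # only 2 candidates
beam_width = 4

try:
    np.argpartition(np.negative(all_beams_logprob), beam_width)
except ValueError as err:
    # NumPy rejects kth values >= the array size along the partition axis.
    print("argpartition failed:", err)
```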
vllm/engine/protocol.py
Outdated
The previous beam pruning logic (sorting new_beams with sort_beams_key) incorporated length_penalty. The new logic selects the top beam_width candidates based on raw cum_logprob (using np.argpartition on all_beams_logprob at lines 189-190, and then assigning new_beams to all_beams at line 206).
If length_penalty == 1.0 (the default), this change is likely fine because score = cum_logprob / seq_len, and seq_len is constant for all candidates at a given step. Thus, maximizing cum_logprob is equivalent to maximizing the score.
However, if length_penalty != 1.0, the score function cum_logprob / (seq_len ** length_penalty) is not necessarily maximized by cum_logprob alone. This means the set of beams chosen at each step could differ from the original logic, potentially leading to different final results.
Could you clarify if this change in behavior for length_penalty != 1.0 is intended or if the beam search should still prune based on the length-penalized score at each step? If the latter, the selection logic might need to incorporate the sort_beams_key or an equivalent calculation before np.argpartition.
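As a rough sketch of what incorporating the length-penalized score might look like (illustrative values; `seq_len` and `length_penalty` are assumed to be in scope at this point in the loop, and this is not the PR's code):

```python
import numpy as np

all_beams_logprob = np.array([-1.2, -0.4, -2.1, -0.9, -1.5])
seq_len = 17           # candidate length at this step (assumed available)
length_penalty = 0.8   # non-default penalty (assumed available)
beam_width = 3

# Rank by the length-penalized score instead of raw cum_logprob, so pruning
# matches the ordering sort_beams_key would produce.
scores = all_beams_logprob / (seq_len ** length_penalty)

k = min(beam_width, scores.size)
topn_idx = (np.argpartition(np.negative(scores), k - 1)[:k]
            if k > 0 else np.array([], dtype=int))
```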
Signed-off-by: zhangguozhu <[email protected]>
Signed-off-by: zhangguozhu <[email protected]>
Signed-off-by: zhangguozhu <[email protected]>
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
This pull request has merge conflicts that must be resolved before it can be merged.
Is this PR still active?
Signed-off-by: mgoin <[email protected]>
aarnphm left a comment
LGTM
Signed-off-by: mgoin <[email protected]>
Signed-off-by: mgoin <[email protected]>
…lm-project#19347)
Signed-off-by: zhangguozhu <[email protected]>
Signed-off-by: mgoin <[email protected]>
Co-authored-by: zhangguozhu <[email protected]>
Co-authored-by: mgoin <[email protected]>
Signed-off-by: LuminolT <[email protected]>
Purpose
In the source code of vLLM's beam_search, a two-level nested loop splices the generated tokens and computes the cumulative probabilities. This splicing operation takes essentially the same order of magnitude of time as each execution of generate.
Code:
(screenshot of the original nested-loop implementation)
Time consumption:
(screenshot of timing measurements)
The above logic is optimized into the following approach: select the top candidates by sorting first, and only then splice token lists for the survivors, reducing the time spent on splicing.
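A minimal sketch of the idea (illustrative data and shapes, not the PR's verbatim code): rank all candidates on cumulative log probability first, then splice token lists only for the `beam_width` survivors:

```python
import numpy as np

beam_width = 2
beam_tokens = [[1, 4], [1, 6]]
beam_cum_logprob = np.array([-0.5, -0.9])
cand_tokens = np.array([[7, 9], [2, 5]])        # candidate tokens per beam
cand_logprob = np.array([[-1.1, -0.3], [-0.2, -2.0]])

# Flatten all (beam, candidate) pairs and score them in bulk.
all_logprob = (beam_cum_logprob[:, None] + cand_logprob).ravel()
all_token_id = cand_tokens.ravel()

# Select the top beam_width candidates before doing any splicing.
topn_idx = np.argpartition(np.negative(all_logprob), beam_width)[:beam_width]

# Splice token lists only for the survivors.
new_beams = []
for idx in topn_idx:
    beam_idx = idx // cand_tokens.shape[1]      # which beam this candidate came from
    new_beams.append((beam_tokens[beam_idx] + [int(all_token_id[idx])],
                      float(all_logprob[idx])))
print(new_beams)
```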
Test Plan
Set request parameters:
Test Result