perf: Avoid copying inputs_embeds tensors to GPU unless prompt_embeds is enabled #25739
Conversation
… is enabled Signed-off-by: Andrew Sansom <[email protected]>
Code Review
This pull request introduces a performance optimization by avoiding unnecessary GPU copies of inputs_embeds and is_token_ids when prompt embeddings are not enabled. The changes are implemented correctly by adding if self.enable_prompt_embeds guards. I've also pointed out a related pre-existing issue in the async scheduling path where is_token_ids is not fully updated, which would be good to address for complete correctness, though it might be out of scope for this specific performance-focused PR.
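The guard pattern the review describes can be sketched as follows. This is a hypothetical illustration, not vLLM's actual model-runner code: buffers are stood in for by plain lists and the host-to-device transfer is simulated by a list copy, with `ModelRunnerSketch`, `prepare_inputs`, and `gpu_copies` being invented names.

```python
# Hypothetical sketch of the PR's optimization: skip the host-to-device
# copy of prompt-embedding buffers unless the feature flag is enabled.
class ModelRunnerSketch:
    def __init__(self, enable_prompt_embeds: bool):
        self.enable_prompt_embeds = enable_prompt_embeds
        # CPU-side staging buffer; contents are garbage when the feature is off.
        self.inputs_embeds_cpu = [0.0] * 8
        self.inputs_embeds_gpu = None
        self.gpu_copies = 0  # counts simulated H2D transfers

    def prepare_inputs(self):
        # The guard added by the PR: only pay for the copy when needed.
        if self.enable_prompt_embeds:
            self.inputs_embeds_gpu = list(self.inputs_embeds_cpu)  # simulated copy
            self.gpu_copies += 1

off = ModelRunnerSketch(enable_prompt_embeds=False)
off.prepare_inputs()
on = ModelRunnerSketch(enable_prompt_embeds=True)
on.prepare_inputs()
print(off.gpu_copies, on.gpu_copies)  # 0 1
```

With the flag off, no copy happens at all, which is the regression fix: the unguarded code paid the transfer cost on every step even though the buffer contents were never used.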
```python
if self.enable_prompt_embeds:
    self.is_token_ids.gpu[:num_commmon_tokens] = True
```
This change correctly handles is_token_ids for the fast path in async scheduling. However, a similar update is missing for the slow path (the scatter_ case for input_ids around line 900). For correctness when prompt embeddings are enabled, is_token_ids for the scattered tokens should also be set to True, as they correspond to sampled token IDs. A complete fix would handle both paths.
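The reviewer's point about the two paths can be illustrated with a small, self-contained sketch (the function name `update_is_token_ids` and the list-based stand-in for the tensor are invented for illustration):

```python
# Hypothetical illustration of the reviewer's concern: in async scheduling,
# is_token_ids must be marked True both on the fast path (a contiguous
# common prefix) and on the slow path (sampled token IDs scattered to
# arbitrary positions, mirroring the scatter_ case around line 900).
def update_is_token_ids(is_token_ids, num_common_tokens, scatter_indices):
    # Fast path: the common prefix is contiguous, so a range update works.
    for i in range(num_common_tokens):
        is_token_ids[i] = True
    # Slow path: each scattered position holds a sampled token ID and must
    # be marked too -- the update the review says is currently missing.
    for idx in scatter_indices:
        is_token_ids[idx] = True
    return is_token_ids

print(update_is_token_ids([False] * 6, 2, [4, 5]))
# [True, True, False, False, True, True]
```

Handling only the fast path would leave the scattered positions marked `False` even though they hold real token IDs, which is the incompleteness the review flags.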
We can also disable this code?
Purpose
@njhill pointed out that #24278 introduced a performance regression when prompt embeds is disabled: `inputs_embeds` tensors are copied to GPU even though those tensors are either empty or filled with garbage when `--enable-prompt-embeds` is not set. We can guard those copies with `self.enable_prompt_embeds` to avoid them unless prompt embeds is enabled.

Test Plan
No new tests are needed. I have local scripts that send thousands of requests through vLLM with this change applied; everything works.
Test Result
Relevant local tests pass. Pending CI.
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.