[Refactor] Remove redundant TP gather/split in split_qkv in QwenVL #28271
Conversation
Code Review
This pull request refactors the split_qkv method in Qwen2_5_VisionAttention for the Qwen2.5-VL model. The change removes redundant tensor parallelism communication logic, specifically an all_gather_interleave operation followed by a re-slicing of the Q, K, and V tensors. As the model uses head-parallel attention where each tensor-parallel rank already computes its own subset of heads, these communication steps were unnecessary and introduced performance overhead. The removal of this logic is correct, simplifies the code, and should improve performance by eliminating redundant synchronization. The change is well-justified and looks good.
@tjtanaa Thanks for the reminder! I reviewed PRs #16907 and #16974 mentioned by @DarkLight1337. It seems that the part actually causing the breakage is the gather function we previously considered redundant. For now, we haven't observed this optimization causing any Omni-related breakage on Ascend. This PR also enables Omni to benefit from the optimization.
@ywang96 can you ask the Qwen team for comment?
@tjtanaa @DarkLight1337 I conducted a correctness verification for Qwen2.5-Omni-3B under tp=2; the command and results are shown below. It seems that this PR's removal of the gather and slice operations did not cause any breakage for Qwen2.5-Omni in multi-TP settings. Could you please take a look at this verification? I'm not sure whether it is sufficient. Do I need to run additional tests to cover scenarios that might potentially cause a break? Please feel free to let me know if you think further testing is needed. Thanks!
Command: uv run lm_eval --model vllm-vlm --model_args "pretrained=Qwen/Qwen2.5-Omni-3B,model=Qwen/Qwen2.5-Omni-3B,tensor_parallel_size=2" --tasks mmmu_val,chartqa --batch_size 32 --apply_chat_template
Before / After: (results tables)
DarkLight1337
left a comment
Thanks for testing this! After discussing offline with @ywang96, we think it is ok to merge this until proven otherwise.
@DarkLight1337 Sorry to bother you again. The CI is able to finish running, but there are 3 failing tests. At first glance, they don't seem related to this PR. Should we fix them?
Retrying the tests to see if they are flaky
Sadly, it failed again. I will try to find the root cause. Or should I maybe try rebasing this PR?
Yeah, can you rebase?
Signed-off-by: gcanlin <[email protected]>
Head branch was pushed to by a user without write access
Force-pushed from bfc2bd2 to 525f05a
Purpose
This code path uses head-parallel attention, where each rank holds full Q/K/V vectors for its own subset of heads. All attention backends operate per-rank on local heads. Therefore, the previous
tp_size > 1 communication logic (all_gather_interleave and re-slicing) is unnecessary and introduces redundant synchronization overhead. Removing it preserves correctness and improves performance.
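To make the head-parallel argument concrete, here is a small self-contained check (a sketch under assumed toy shapes and a contiguous head sharding, not vLLM's actual sharding scheme): attention computed on a rank's own slice of heads matches the corresponding slice of full multi-head attention, so no cross-rank gather of Q/K/V is needed before attention.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, num_heads, head_dim, tp_size = 16, 8, 32, 2  # assumed toy sizes

# Full Q/K/V for all heads: [num_heads, seq_len, head_dim]
q = torch.randn(num_heads, seq_len, head_dim)
k = torch.randn(num_heads, seq_len, head_dim)
v = torch.randn(num_heads, seq_len, head_dim)

# Reference: attention over all heads at once.
full_out = F.scaled_dot_product_attention(q, k, v)

# "Head-parallel": each rank attends only over its own contiguous chunk of heads.
per_rank_out = [
    F.scaled_dot_product_attention(q_r, k_r, v_r)
    for q_r, k_r, v_r in zip(q.chunk(tp_size), k.chunk(tp_size), v.chunk(tp_size))
]

# Concatenating the per-rank outputs reproduces the full result, because
# attention heads are independent of one another.
assert torch.allclose(full_out, torch.cat(per_rank_out, dim=0), atol=1e-6)
```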
Test Plan
It has been tested in vllm-ascend, and we indeed observed improved performance.
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.