[Bugfix] Honor --mm_encoder_attn_backend when used #27124
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small, essential subset of checks runs automatically, and you can ask your reviewers to trigger select CI tests on top of that. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request makes the FlashAttention backend upgrade for Vision Transformer (ViT) models an opt-in behavior, addressing an issue where it was unconditionally attempted, causing problems on AMD platforms. The change is implemented by introducing a try_switch_to_fa flag in maybe_get_vit_flash_attn_backend and updating the call sites in various models.
The overall approach is sound and correctly addresses the reported issue. However, I've identified a critical bug in the new implementation that could lead to crashes on platforms not supporting FlashAttention, like XPU. I've also pointed out a high-severity maintainability issue regarding the modification of function parameters, which could make the code harder to reason about. Addressing these points will improve the robustness and clarity of the code.
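For orientation, here is a self-contained, simplified sketch of the opt-in flag described above; the real helper in vllm/attention/layer.py takes more parameters and also resolves the flash-attention function, so treat this only as an illustration of the guard:

```python
from enum import Enum, auto


class _Backend(Enum):
    # Simplified stand-in for vLLM's backend enum, only for this sketch.
    FLASH_ATTN = auto()
    TORCH_SDPA = auto()


def is_fa_backend(backend: _Backend) -> bool:
    return backend == _Backend.FLASH_ATTN


def maybe_get_vit_flash_attn_backend(
    attn_backend: _Backend,
    try_switch_to_fa: bool = False,
) -> _Backend:
    # The upgrade to FLASH_ATTN only happens when the caller explicitly opts in.
    if try_switch_to_fa and not is_fa_backend(attn_backend):
        attn_backend = _Backend.FLASH_ATTN
    return attn_backend


# Without the opt-in, a user-selected backend such as TORCH_SDPA is left untouched.
assert maybe_get_vit_flash_attn_backend(_Backend.TORCH_SDPA) is _Backend.TORCH_SDPA
assert maybe_get_vit_flash_attn_backend(_Backend.TORCH_SDPA, try_switch_to_fa=True) is _Backend.FLASH_ATTN
```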
vllm/attention/layer.py (outdated)

    if try_switch_to_fa and not is_fa_backend(attn_backend):
        attn_backend = _Backend.FLASH_ATTN
The current logic unconditionally switches the backend to FLASH_ATTN if try_switch_to_fa is true. This can cause a crash on platforms that do not support FlashAttention, such as XPU, because the subsequent import of vllm.vllm_flash_attn will fail. The switch should be guarded to only occur on supported platforms (CUDA and ROCm).
Suggested change:

    if try_switch_to_fa and not is_fa_backend(attn_backend) and (
            current_platform.is_cuda() or current_platform.is_rocm()):
        attn_backend = _Backend.FLASH_ATTN
vllm/attention/layer.py (outdated)

    attn_backend == _Backend.FLASH_ATTN:
        # Always try upstream on ROCM.
        logger.info_once("maybe_get_vit_flash_attn_backend: forcing upstream FlashAttn on ROCM.")
        try_use_upstream_fa = True
Modifying an input parameter try_use_upstream_fa directly is confusing and can lead to unexpected side effects. It's better to use a local variable to track the state within the function. For example, you could introduce use_upstream_fa = try_use_upstream_fa at the beginning of the function and then modify and use use_upstream_fa.
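A minimal sketch of that suggestion, with small stubs standing in for vLLM's real imports (illustrative only, not the actual layer.py code):

```python
class _Backend:
    # Stub for vLLM's backend enum, just to keep this sketch runnable.
    FLASH_ATTN = "FLASH_ATTN"


class current_platform:  # stub mimicking vllm.platforms.current_platform
    @staticmethod
    def is_rocm() -> bool:
        return True


def maybe_get_vit_flash_attn_backend(attn_backend, try_use_upstream_fa):
    # Work on a local copy so the input parameter is never mutated.
    use_upstream_fa = try_use_upstream_fa
    if current_platform.is_rocm() and attn_backend == _Backend.FLASH_ATTN:
        # Always try upstream FlashAttention on ROCm.
        use_upstream_fa = True
    # ... the rest of the function would read use_upstream_fa from here on ...
    return attn_backend, use_upstream_fa
```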
💡 Codex Review
Here are some automated review suggestions for this pull request.
Summary: In vllm-project#26104, some changes were made in layer.py that resulted in always trying to switch to the FA backend for ViT, even when `VLLM_ATTENTION_BACKEND` is set. This broke Meta's internal AMD pipelines, as it is neither desired nor expected behavior. With this change, the models that were changed in the offending PR can explicitly opt in to this behavior. Differential Revision: D84946967
Force-pushed from cf6b734 to 3405c66 (compare)
Updated to try to mimic #26104 as closely as possible so that this is an equivalent change. Not sure the behavior in the original PR is good / should be preserved, though.
This PR also fixes existing AMD failures (example).
cc @tjtanaa
This logic is very confusing now; it would be good to get more context here and try to refactor this a bit more aggressively. It seems like the original intention of using upstream FA (#24347) was to use it for models with a head dim that is not supported by vLLM-FA but is supported by upstream-FA, i.e. anything that's a multiple of 8 but not a multiple of 32. Can we just make all this logic: if on CUDA and the head dim is not a multiple of 32, use upstream FA; otherwise just use vLLM-FA?
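If I read the intent correctly, that rule boils down to something like this sketch (a hypothetical helper, not existing vLLM code; the supported-head-dim claims are taken from the comment above):

```python
def should_use_upstream_fa(head_dim: int) -> bool:
    # Per the discussion above: vLLM's bundled FlashAttention handles head dims
    # that are multiples of 32, while upstream FlashAttention also accepts the
    # remaining multiples of 8.
    return head_dim % 8 == 0 and head_dim % 32 != 0


assert should_use_upstream_fa(72) is True    # 72 is a multiple of 8 but not of 32
assert should_use_upstream_fa(128) is False  # covered by vLLM's own FA build
```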
To add on to @LucasWilkinson's feedback: the logic that we should update is at lines 204 to 211 in c3a2c6a.
Right now, as long as we are on AMD Instinct, we assume that the ck-flash-attention library is installed (if we want to enable …). Another thing that I notice is that the set of attention backends supported by ViT (…) differs from the attention backends for the LLM backbone (…), so I would suggest reserving the …. Moreover, on the MI300 series, flash attention / AITER flash attention is recommended for ViT as it is the fastest; when torch.sdpa is selected, it is extremely slow, since it computes the attention output with a for loop in the majority of the vision models.
Heads up that we have decoupled the two backends in #27061 |
@DarkLight1337 thanks. Is …?
@LucasWilkinson @tjtanaa @bradleyhd FYI, in parallel to this PR I've also made #27061, which decouples the ViT attn backend from the LM attn backend (which is probably something we should've done from the get-go).
Force-pushed from 3405c66 to 2bcf680 (compare)
Summary: Pull Request resolved: vllm-project#27124 In vllm-project#26104, some changes were made in layer.py that resulted in always trying to switch to the FA backend for ViT, even when `VLLM_ATTENTION_BACKEND` is set. This broke Meta's internal AMD pipelines, as it is neither desired nor expected behavior. With this change, the models that were changed in the offending PR can explicitly opt in to this behavior. Reviewed By: Prowindy Differential Revision: D84946967
Force-pushed from 2bcf680 to a1250f0 (compare)
Alright folks, I've updated this to make use of the new --mm_encoder_attn_backend. When it is supplied, we won't auto-upgrade to FA. We need this ASAP to unblock, as it allows us to specify torch_sdpa usage. We can and should circle back for a more comprehensive refactor here.
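Conceptually, the honoring behavior described here reduces to something like the following sketch (hypothetical names; the real wiring lives in vllm/attention/layer.py and the individual model files):

```python
from typing import Optional


def resolve_try_switch_to_fa(mm_encoder_attn_backend: Optional[str]) -> bool:
    # Only allow the automatic upgrade to FlashAttention when the user did not
    # explicitly choose a multimodal encoder attention backend
    # (e.g. --mm_encoder_attn_backend torch_sdpa on the command line).
    return mm_encoder_attn_backend is None


assert resolve_try_switch_to_fa(None) is True            # no override: FA upgrade allowed
assert resolve_try_switch_to_fa("torch_sdpa") is False   # override is honored as-is
```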
Force-pushed from a1250f0 to d58c252 (compare)
Force-pushed from d58c252 to 41d4cd1 (compare)
Force-pushed from 41d4cd1 to 9adb7ec (compare)
Force-pushed from 9adb7ec to c036f06 (compare)
ywang96 left a comment:
LGTM. I think we do need to think about how to deal with the override in a better way (whether we should honor it strictly, with the risk of failure, or handle the fallback automatically).
Co-authored-by: Bradley D <[email protected]> Co-authored-by: Roger Wang <[email protected]> Signed-off-by: Alberto Perdomo <[email protected]>
Summary:
In #26104, some changes were made in layer.py that resulted in always trying to switch to the FA backend for ViT, even when `VLLM_ATTENTION_BACKEND` is set. This broke Meta's internal AMD pipelines, as it is neither desired nor expected behavior. With this change, the models that were changed in the offending PR can explicitly opt in to this behavior.
Differential Revision: D84946967