
Conversation

@bradleyhd
Contributor

Summary:
In #26104, some changes were made in layer.py that resulted in always trying to switch to FA backend for ViT, even when VLLM_ATTENTION_BACKEND is set.

This broke Meta's internal AMD pipelines, as this is neither desired nor expected behavior. With this change, the models touched in the offending PR can explicitly opt in to this behavior.

Differential Revision: D84946967
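
For context, a minimal sketch of the opt-in shape this change introduces, as discussed in the reviews below (illustrative only: the actual signature in layer.py is not reproduced here, the False default is an assumption, and the _Backend / is_fa_backend definitions are stand-ins):

```python
from enum import Enum, auto


class _Backend(Enum):  # stand-in for vllm.attention.backends.registry._Backend
    FLASH_ATTN = auto()
    TORCH_SDPA = auto()


def is_fa_backend(backend: _Backend) -> bool:  # stand-in helper
    return backend is _Backend.FLASH_ATTN


def maybe_get_vit_flash_attn_backend(attn_backend: _Backend,
                                     try_switch_to_fa: bool = False) -> _Backend:
    # Only attempt the FLASH_ATTN upgrade when the caller opts in, so an
    # explicitly configured backend (e.g. via VLLM_ATTENTION_BACKEND) is honored.
    if try_switch_to_fa and not is_fa_backend(attn_backend):
        attn_backend = _Backend.FLASH_ATTN
    return attn_backend


assert maybe_get_vit_flash_attn_backend(_Backend.TORCH_SDPA) is _Backend.TORCH_SDPA
assert maybe_get_vit_flash_attn_backend(
    _Backend.TORCH_SDPA, try_switch_to_fa=True) is _Backend.FLASH_ATTN
```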

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small but essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@mergify mergify bot added the qwen (Related to Qwen models) label Oct 17, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request makes the FlashAttention backend upgrade for Vision Transformer (ViT) models an opt-in behavior, addressing an issue where it was unconditionally attempted, causing problems on AMD platforms. The change is implemented by introducing a try_switch_to_fa flag in maybe_get_vit_flash_attn_backend and updating the call sites in various models.

The overall approach is sound and correctly addresses the reported issue. However, I've identified a critical bug in the new implementation that could lead to crashes on platforms not supporting FlashAttention, like XPU. I've also pointed out a high-severity maintainability issue regarding the modification of function parameters, which could make the code harder to reason about. Addressing these points will improve the robustness and clarity of the code.

Comment on lines 91 to 92
if try_switch_to_fa and not is_fa_backend(attn_backend):
    attn_backend = _Backend.FLASH_ATTN
Contributor


critical

The current logic unconditionally switches the backend to FLASH_ATTN if try_switch_to_fa is true. This can cause a crash on platforms that do not support FlashAttention, such as XPU, because the subsequent import of vllm.vllm_flash_attn will fail. The switch should be guarded to only occur on supported platforms (CUDA and ROCm).

Suggested change
-if try_switch_to_fa and not is_fa_backend(attn_backend):
-    attn_backend = _Backend.FLASH_ATTN
+if try_switch_to_fa and not is_fa_backend(attn_backend) and (
+        current_platform.is_cuda() or current_platform.is_rocm()):
+    attn_backend = _Backend.FLASH_ATTN

if attn_backend == _Backend.FLASH_ATTN:
    # Always try upstream on ROCM.
    logger.info_once("maybe_get_vit_flash_attn_backend: forcing upstream FlashAttn on ROCM.")
    try_use_upstream_fa = True
Contributor


high

Modifying an input parameter try_use_upstream_fa directly is confusing and can lead to unexpected side effects. It's better to use a local variable to track the state within the function. For example, you could introduce use_upstream_fa = try_use_upstream_fa at the beginning of the function and then modify and use use_upstream_fa.
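
A minimal sketch of the suggested pattern, with stand-in names and a simplified signature (this is not the actual vLLM implementation):

```python
def choose_vit_backend(attn_backend: str, try_use_upstream_fa: bool, on_rocm: bool):
    # Copy the input flag into a local variable so the parameter itself is
    # never reassigned; callers keep seeing exactly the value they passed in.
    use_upstream_fa = try_use_upstream_fa

    if on_rocm and attn_backend == "FLASH_ATTN":
        # Always try upstream FA on ROCm (mirrors the quoted snippet above).
        use_upstream_fa = True

    # ...the rest of the selection logic would consult use_upstream_fa...
    return attn_backend, use_upstream_fa
```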


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

bradleyhd added a commit to bradleyhd/vllm that referenced this pull request Oct 17, 2025
Summary:

In vllm-project#26104, some changes were made in layer.py that resulted in always trying to switch to FA backend for ViT, even when  `VLLM_ATTENTION_BACKEND` is set.

This broke Meta's internal AMD pipelines as it is not desired nor expected behavior. With this change, the models that were changed in the offending PR can explicitly opt-in to this behavior.

Differential Revision: D84946967
@bradleyhd
Contributor Author

Updated to try and mimic #26104 as closely as possible to make this an equivalent change. Not sure the behavior in the original PR is good / should be preserved, though.

@zhewenl zhewenl added the rocm (Related to AMD ROCm), ci/build, and ci-failure (Issue about an unexpected test failure in CI) labels Oct 17, 2025
@zhewenl
Collaborator

zhewenl commented Oct 17, 2025

This PR also fixes existing AMD failures (example):

(EngineCore_DP0 pid=50574)   File "/usr/local/lib/python3.12/dist-packages/vllm/attention/layers/cross_attention.py", line 168, in __init__
(EngineCore_DP0 pid=50574)     super().__init__(
(EngineCore_DP0 pid=50574)   File "/usr/local/lib/python3.12/dist-packages/vllm/attention/layer.py", line 236, in __init__
(EngineCore_DP0 pid=50574)     self.impl = impl_cls(
(EngineCore_DP0 pid=50574)                 ^^^^^^^^^
(EngineCore_DP0 pid=50574)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/triton_attn.py", line 248, in __init__
(EngineCore_DP0 pid=50574)     raise NotImplementedError(
(EngineCore_DP0 pid=50574) NotImplementedError: Encoder self-attention and encoder/decoder cross-attention are not implemented for TritonAttentionImpl
[rank0]:[W1017 04:54:49.728888986 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())

cc @Alexei-V-Ivanov-AMD

@DarkLight1337
Member

cc @tjtanaa

@LucasWilkinson
Collaborator

LucasWilkinson commented Oct 20, 2025

This logic is very confusing now; it would be good to get more context here and try to refactor this a bit more aggressively.

cc @wwl2755 @tjtanaa

Seems like the original intention of using upstream FA is #24347, i.e. use it for models with a head dim that is not supported by vLLM-FA but is supported by upstream FA: anything that's a multiple of 8 but not a multiple of 32.

Can we just make all this logic: if on CUDA and the head dim is not a multiple of 32, use upstream FA; otherwise just use get_attn_backend (like pre-#24347)?
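
A rough sketch of the rule being proposed here (purely illustrative; in the real code the non-FA path would go through vLLM's usual backend selection, e.g. get_attn_backend, which is not reproduced here):

```python
def prefers_upstream_fa(head_size: int, is_cuda: bool) -> bool:
    # Upstream flash-attn covers head dims that are a multiple of 8 but not of
    # 32, which vLLM-FA does not; everything else can use the normal backend
    # selection, as before #24347.
    return is_cuda and head_size % 8 == 0 and head_size % 32 != 0


assert prefers_upstream_fa(72, is_cuda=True)       # 72 = 8 * 9, not a multiple of 32
assert not prefers_upstream_fa(128, is_cuda=True)  # 128 is a multiple of 32
```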

@tjtanaa
Collaborator

tjtanaa commented Oct 21, 2025

> This logic is very confusing now; it would be good to get more context here and try to refactor this a bit more aggressively.
>
> cc @wwl2755 @tjtanaa
>
> Seems like the original intention of using upstream FA is #24347, i.e. use it for models with a head dim that is not supported by vLLM-FA but is supported by upstream FA: anything that's a multiple of 8 but not a multiple of 32.
>
> Can we just make all this logic: if on CUDA and the head dim is not a multiple of 32, use upstream FA; otherwise just use get_attn_backend (like pre-#24347)?

To add on to @LucasWilkinson's feedback:

The logic that we should update is:

vllm/vllm/platforms/rocm.py

Lines 204 to 211 in c3a2c6a

def get_vit_attn_backend(cls, head_size: int, dtype: torch.dtype) -> "_Backend":
    from vllm.attention.backends.registry import _Backend
    if envs.VLLM_ROCM_USE_AITER and envs.VLLM_ROCM_USE_AITER_MHA and on_gfx9():
        return _Backend.ROCM_AITER_FA
    if on_gfx9():
        return _Backend.FLASH_ATTN
    return _Backend.TORCH_SDPA

Right now, as long as we are on AMD Instinct, we assume that the ck-flash-attention library is installed. If we want to enable torch.sdpa, we can start by modifying this part first: if ck-flash-attention is not installed, or the head_dim is not supported by the specified backend, we fall back to torch.sdpa (see the sketch below).
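
A sketch of that fallback, under stated assumptions (the availability probe and the head-size check below are illustrative guesses, not the current rocm.py implementation, and the backend enum values are written as plain strings):

```python
import importlib.util


def _ck_flash_attn_available() -> bool:
    # Hypothetical helper: treat an importable flash_attn module as "installed".
    return importlib.util.find_spec("flash_attn") is not None


def pick_rocm_vit_backend(head_size: int, on_gfx9: bool, use_aiter: bool) -> str:
    if use_aiter and on_gfx9:
        return "ROCM_AITER_FA"
    if on_gfx9 and _ck_flash_attn_available() and head_size % 8 == 0:
        return "FLASH_ATTN"
    # Fall back to torch.sdpa when CK flash-attention is missing or the head
    # size is unsupported by the FA backends.
    return "TORCH_SDPA"
```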

Another thing that I notice is that the VLLM_ATTENTION_BACKEND semantics should be meant for the text-model backbone.

The set of attention backends supported by ViT is TORCH_SDPA, FLASH_ATTN, and ROCM_AITER_FA only.

However, the attention backends for the LLM backbone are TRITON_ATTN, ROCM_ATTN, ROCM_AITER_FA, or ROCM_AITER_UNIFIED_ATTN.

So, I would suggest reserving the VLLM_ATTENTION_BACKEND environment variable for LLM attention backend selection.

Moreover, on the MI300 series, flash attention / AITER flash attention is recommended for ViT as it is the fastest. When torch.sdpa is selected, it is extremely slow, as it uses a for loop to compute the attention output in the majority of the vision models.

@DarkLight1337
Member

Heads up that we have decoupled the two backends in #27061

@bradleyhd
Contributor Author

Heads up that we have decoupled the two backends in #27061

@DarkLight1337 thanks. Is maybe_get_vit_flash_attn_backend still needed in light of this PR?

@DarkLight1337
Member

cc @ywang96 @Isotr0py

@ywang96
Member

ywang96 commented Oct 22, 2025

@LucasWilkinson @tjtanaa @bradleyhd FYI, in parallel to this PR I've also made #27061, which decouples the ViT attn backend from the LM attn backend (which is probably something we should've done from the get-go).

@bradleyhd
Contributor Author

@tjtanaa curious, is upstream FA in this case expected to be FAv3? (looking at #24347)

@bradleyhd
Contributor Author

@ywang96 #27061 only works if we override to ROCM_AITER_FA, because it is exempt from the logic in maybe_get_vit_flash_attn_backend. If we set it to TORCH_SDPA, it just gets overwritten with FA because we have a module named flash_attn installed.
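
An illustrative sketch of the failure mode being described (not the actual code path; the import probe is an assumption about the mechanism, with backends written as plain strings):

```python
import importlib.util

attn_backend = "TORCH_SDPA"  # what the user explicitly asked for

if importlib.util.find_spec("flash_attn") is not None:
    # The mere presence of a flash_attn module triggers the upgrade,
    # silently discarding the explicit TORCH_SDPA override.
    attn_backend = "FLASH_ATTN"

print(attn_backend)
```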

bradleyhd added a commit to bradleyhd/vllm that referenced this pull request Oct 22, 2025
Summary:
Pull Request resolved: vllm-project#27124

In vllm-project#26104, some changes were made in layer.py that resulted in always trying to switch to FA backend for ViT, even when  `VLLM_ATTENTION_BACKEND` is set.

This broke Meta's internal AMD pipelines as it is not desired nor expected behavior. With this change, the models that were changed in the offending PR can explicitly opt-in to this behavior.

Reviewed By: Prowindy

Differential Revision: D84946967
bradleyhd added a commit to bradleyhd/vllm that referenced this pull request Oct 22, 2025
@bradleyhd bradleyhd changed the title make flash_attn ViT upgrade opt-in honor --mm_encoder_attn_backend when used Oct 22, 2025
@bradleyhd
Contributor Author

Alright folks, I've updated this to make use of the new --mm_encoder_attn_backend. When supplied, it won't auto-upgrade to FA. We need this ASAP to unblock, as it allows us to specify torch_sdpa usage. We can and should circle back for a more comprehensive refactor here.
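
A hypothetical usage sketch for illustration, assuming (unverified here) that the --mm_encoder_attn_backend CLI flag maps to an engine argument of the same name accepted by the LLM constructor; the model name is just a placeholder:

```python
from vllm import LLM

# Hypothetical: pin the ViT/encoder attention backend to torch.sdpa while the
# LM backbone keeps its own backend selection. The keyword argument name is
# assumed to mirror the --mm_encoder_attn_backend CLI flag.
llm = LLM(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    mm_encoder_attn_backend="TORCH_SDPA",
)
```

On the command line this corresponds to the --mm_encoder_attn_backend flag mentioned above.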

bradleyhd added a commit to bradleyhd/vllm that referenced this pull request Oct 22, 2025
bradleyhd added a commit to bradleyhd/vllm that referenced this pull request Oct 22, 2025
bradleyhd added a commit to bradleyhd/vllm that referenced this pull request Oct 22, 2025
Member

@ywang96 ywang96 left a comment


LGTM - I think we do need to think about how to deal with the override in a better way (whether we should honor it truly, with the risk of failure, or handle the fallback automatically).

@ywang96 ywang96 added the ready (ONLY add when PR is ready to merge/full CI is needed) label Oct 23, 2025
@DarkLight1337 DarkLight1337 changed the title honor --mm_encoder_attn_backend when used [Bugfix] Honor --mm_encoder_attn_backend when used Oct 23, 2025
@DarkLight1337 DarkLight1337 merged commit 570c3e1 into vllm-project:main Oct 23, 2025
56 checks passed
albertoperdomo2 pushed a commit to albertoperdomo2/vllm that referenced this pull request Oct 23, 2025
Co-authored-by: Bradley D <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
Signed-off-by: Alberto Perdomo <[email protected]>
kingsmad pushed a commit to kingsmad/vllm that referenced this pull request Oct 25, 2025
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
ilmarkov pushed a commit to neuralmagic/vllm that referenced this pull request Nov 7, 2025
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025