
[MM Encoder] ViT attention performance and consolidation #23880

@ywang96

Description


🚀 The feature, motivation and pitch

Today, many vision transformer (ViT) encoders in vLLM use the standard F.scaled_dot_product_attention to compute attention.
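
For reference, a minimal sketch of what such a mask-free ViT attention call typically looks like (the function name and tensor layout here are illustrative, not vLLM's actual code):

```python
import torch
import torch.nn.functional as F


def vit_sdpa_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Bidirectional (non-causal, mask-free) attention over ViT patch tokens.

    Expects q/k/v shaped (batch, num_heads, seq_len, head_dim).
    """
    # ViT encoders attend over all patches, so no mask and no KV cache are needed.
    return F.scaled_dot_product_attention(q, k, v)
```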

While there has already been some effort in vision.py to help developers choose which backend to use, it would be great if vLLM could consolidate the mask-free, cache-free MHA implementations across the different backends into a single interface that developers can easily plug in, as sketched below.
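
To make the pitch concrete, here is a hypothetical sketch of such a consolidated entry point: one mask-free, cache-free MHA function that dispatches to a chosen backend. The enum and function names are illustrative assumptions and do not reflect vLLM's actual vision.py API:

```python
from enum import Enum, auto

import torch
import torch.nn.functional as F


class ViTAttnBackend(Enum):
    # Hypothetical backend enum for illustration only.
    TORCH_SDPA = auto()
    XFORMERS = auto()
    FLASH_ATTN = auto()


def mha_no_mask(
    q: torch.Tensor,  # (batch, seq_len, num_heads, head_dim)
    k: torch.Tensor,
    v: torch.Tensor,
    backend: ViTAttnBackend,
) -> torch.Tensor:
    """Mask-free, cache-free multi-head attention dispatched to one backend."""
    if backend == ViTAttnBackend.TORCH_SDPA:
        # SDPA expects (batch, heads, seq, head_dim), so transpose in and out.
        out = F.scaled_dot_product_attention(
            q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)
        )
        return out.transpose(1, 2)
    if backend == ViTAttnBackend.XFORMERS:
        from xformers import ops as xops

        # xFormers takes (batch, seq, heads, head_dim) directly.
        return xops.memory_efficient_attention(q, k, v)
    if backend == ViTAttnBackend.FLASH_ATTN:
        from flash_attn import flash_attn_func

        # flash-attn also takes (batch, seq, heads, head_dim); non-causal by default.
        return flash_attn_func(q, k, v)
    raise ValueError(f"Unsupported ViT attention backend: {backend}")
```

A ViT encoder block could then call mha_no_mask(q, k, v, backend) and keep the backend selection (e.g. based on hardware or installed libraries) in one place instead of re-implementing the dispatch per model.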

We should also investigate integrating FlashAttention 3 (FA3) for a few of the vision models we already support and make sure there is no accuracy regression.

Alternatives

No response

Additional context

No response

Before submitting a new issue...

  • Make sure you have already searched for relevant issues and asked the chatbot at the bottom right corner of the documentation page, which can answer many frequently asked questions.
