[Attention] Flash Attention 3 - fp8 #14570
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
Signed-off-by: Mickael Seznec <[email protected]>
Signed-off-by: Mickael Seznec <[email protected]>
Signed-off-by: Mickael Seznec <[email protected]>
Signed-off-by: Mickael Seznec <[email protected]>
Force-pushed from bc909f9 to 2b985ed.
Signed-off-by: Mickael Seznec <[email protected]>
CI is failing. @robertgshaw2-redhat any idea how I should fix it? Just rename it in run-tpu-test.sh? (@NickLucche you moved the file)
This is a known issue; there's a PR addressing it here: #13898. It won't block your PR.
I see there's some other problem with building the image, but the CI likely just needs another spin.
@mickaelseznec apologies for the delay, vllm-project/flash-attention#50 (review) has been merged, so you can now point to vllm_flash_attn. We will need to populate the sccache on the server to get it through the CI; I can help with this once the tag is updated 👍
Thanks for the contribution! Looks clean 😄. I'll approve once we can get it updated to use vllm_flash_attn; I've added a couple of comments.
tests/kernels/test_flash_attn.py
Outdated
q_descale = q_scale.expand((num_seqs, num_kv_heads))
k_descale = k_scale.expand((num_seqs, num_kv_heads))
v_descale = v_scale.expand((num_seqs, num_kv_heads))
nit: could we maybe test per-head scales here too, i.e. also test with non-zero strides?
I can add tests here, but this type of scaling isn't supported by vLLM at the moment. I believe that whenever we add support for it, we can add tests as well.
Besides, there are already around 9k test combinations here; I don't want to make the duration explode if it's not 100% needed :D
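For reference, a minimal sketch of what such a per-head variant might exercise, assuming the kernel accepts (num_seqs, num_kv_heads)-shaped descale tensors as in the snippet above; the point is that .expand() yields zero strides while a materialized per-head tensor has non-zero strides (names and shapes below are illustrative, not the test's actual parametrization):

```python
import torch

# Shapes mirroring the snippet quoted above (values are illustrative).
num_seqs, num_kv_heads = 4, 8

# Per-tensor scale, as used today: expand() broadcasts it, so both strides are 0.
k_scale = torch.rand(1)
k_descale_broadcast = k_scale.expand((num_seqs, num_kv_heads))

# Hypothetical per-head variant: a materialized (num_seqs, num_kv_heads) tensor
# with distinct values, giving a non-zero stride along the head dimension.
k_descale_per_head = torch.rand(num_seqs, num_kv_heads)

print(k_descale_broadcast.stride())  # (0, 0)
print(k_descale_per_head.stride())   # (8, 1)
```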
| "Cannot use FlashAttention-2 backend for dtype other than " | ||
| "torch.float16 or torch.bfloat16.") | ||
| target_backend = _Backend.XFORMERS | ||
| elif kv_cache_dtype is not None and \ |
we should keep this check but restrict it to FA2, i.e. check get_flash_attn_version() != 2 (get_flash_attn_version() is in vllm/attention/backends/utils.py)
Agree that this might be improved, but I can't directly import get_flash_attn_version because of a circular dependency.
Would you prefer if I moved that function to another file? vllm/attention/backends/versions.py, for example?
Would you prefer if I moved that function to another file? vllm/attention/backends/versions.py, for example?
sure, maybe move it to:
vllm/attention/utils/fa_support.py
for now, since there is an is_flash_attn_mla_supported() function that may come in #14258, so this could be a spot for both of those
I had to move it to vllm/fa_utils.py because of how vllm/attention/__init__.py imports a bunch of stuff for convenience.
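For context, here is a condensed, hypothetical sketch of the version-gated fallback being discussed, with the FlashAttention version passed in explicitly instead of imported (in the PR it comes from get_flash_attn_version(), moved to vllm/fa_utils.py as noted above); the function and backend names are illustrative, not vLLM's actual selector code:

```python
from typing import Optional


def select_attention_backend(kv_cache_dtype: Optional[str], fa_version: int) -> str:
    """Condensed, hypothetical sketch of the selector branch discussed above.

    fa_version stands in for get_flash_attn_version(); it is passed as an
    argument here to keep the sketch self-contained.
    """
    is_fp8_kv_cache = kv_cache_dtype is not None and kv_cache_dtype.startswith("fp8")
    if is_fp8_kv_cache and fa_version == 2:
        # FA2 cannot consume an fp8 KV cache, so fall back to another backend;
        # FA3 keeps the FlashAttention backend and uses the fp8 path.
        return "XFORMERS"
    return "FLASH_ATTN"


assert select_attention_backend("fp8_e4m3", 2) == "XFORMERS"
assert select_attention_backend("fp8_e4m3", 3) == "FLASH_ATTN"
```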
Signed-off-by: Mickael Seznec <[email protected]>
This is needed to avoid circular dependencies now that we want to get the flash_attn_version directly in platforms/cuda.py to check if fp8 flash_attn is actually available. Signed-off-by: Mickael Seznec <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Mickael Seznec <[email protected]>
Apologies for the delay, the CI should be working now. There appear to be failing kernel tests.
increase flash_attention unit test tolerance Signed-off-by: Mickael Seznec <[email protected]>
Hi, it seems there have been no new nightly wheels since this PR. Is anything wrong? @LucasWilkinson
Signed-off-by: Mickael Seznec <[email protected]>
Signed-off-by: Mickael Seznec <[email protected]> Signed-off-by: Louis Ulmer <[email protected]>
Signed-off-by: Mickael Seznec <[email protected]>
Signed-off-by: Mickael Seznec <[email protected]> Signed-off-by: Mu Huai <[email protected]>
This PR adds support for FP8 KV cache with FlashAttention 3 (related PR in flash-attn here). cc @LucasWilkinson. Please do not merge this PR while it does not yet reference vllm-project/flash-attention.
FlashAttention (unlike FlashInfer) runs attention with all of Q, K and V in FP8.
Performance is usually better than both FlashInfer with an FP8 KV cache and FlashAttention 3 with bf16.
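To make the "all of Q, K and V in FP8" point concrete, here is a small reference-style sketch in plain PyTorch of how per-tensor descale factors recover the original magnitudes inside attention; the real FA3 kernel fuses this into its fp8 matmuls, and the shapes, scale choices, and names below are illustrative only:

```python
import torch


def fp8_ref_attention(q8, k8, v8, q_descale, k_descale, v_descale):
    """Reference-style attention over fp8 inputs (illustrative, not the kernel).

    q8, k8, v8: [num_tokens, num_heads, head_dim] tensors in torch.float8_e4m3fn;
    the *_descale scalars map fp8 values back to their original range.
    """
    # Dequantize for the reference math; the fused FA3 kernel instead folds
    # the descale factors into its fp8 matmuls.
    q = q8.float() * q_descale
    k = k8.float() * k_descale
    v = v8.float() * v_descale
    scale = q.shape[-1] ** -0.5
    attn = torch.einsum("qhd,khd->hqk", q, k) * scale  # [heads, q_len, k_len]
    attn = attn.softmax(dim=-1)
    return torch.einsum("hqk,khd->qhd", attn, v)       # [q_len, heads, head_dim]


torch.manual_seed(0)
q = torch.randn(16, 8, 64)
k = torch.randn(16, 8, 64)
v = torch.randn(16, 8, 64)
# Per-tensor quantization scales (448 ~ max magnitude of float8_e4m3fn).
q_scale = q.abs().max() / 448.0
k_scale = k.abs().max() / 448.0
v_scale = v.abs().max() / 448.0
q8 = (q / q_scale).to(torch.float8_e4m3fn)
k8 = (k / k_scale).to(torch.float8_e4m3fn)
v8 = (v / v_scale).to(torch.float8_e4m3fn)
out = fp8_ref_attention(q8, k8, v8, q_scale, k_scale, v_scale)
print(out.shape)  # torch.Size([16, 8, 64])
```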
I added support for both v0 and v1, plus some unit tests.
Note that I've added a trick for checkpoints that don't provide q_scale: we reuse k_scale instead (which is something TRTLLM does, fwiw).
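A hedged sketch of what that fallback could look like when resolving scales from a checkpoint; the helper name and dictionary layout are illustrative, not vLLM's actual weight-loading code:

```python
from typing import Dict, Tuple


def resolve_attn_scales(ckpt_scales: Dict[str, float]) -> Tuple[float, float, float]:
    """Hypothetical scale resolution: reuse k_scale when q_scale is missing."""
    k_scale = ckpt_scales.get("k_scale", 1.0)
    v_scale = ckpt_scales.get("v_scale", 1.0)
    # Many fp8 checkpoints only ship k_scale/v_scale; fall back to k_scale
    # as the query descale, mirroring the trick described above.
    q_scale = ckpt_scales.get("q_scale", k_scale)
    return q_scale, k_scale, v_scale


q, k, v = resolve_attn_scales({"k_scale": 0.02, "v_scale": 0.03})
assert q == 0.02  # q_scale reused from k_scale
```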
Also, I added a small quality-of-life improvement for debugging v1: workers now send back their traceback when they raise an exception.
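A rough, self-contained illustration of that pattern using a plain multiprocessing queue; the actual v1 worker plumbing differs, this only shows the idea of returning the formatted traceback to the driver instead of failing silently:

```python
import traceback
from multiprocessing import Process, Queue


def worker(task_queue, result_queue):
    """Toy worker loop: on failure, ship the formatted traceback to the driver."""
    func, args = task_queue.get()
    try:
        result_queue.put(("ok", func(*args)))
    except Exception:
        # Send the full worker-side traceback back instead of only raising
        # locally, so the driver can show *where* the worker failed.
        result_queue.put(("error", traceback.format_exc()))


def flaky_task(x):
    raise ValueError(f"bad input: {x}")


if __name__ == "__main__":
    tasks, results = Queue(), Queue()
    tasks.put((flaky_task, (42,)))
    p = Process(target=worker, args=(tasks, results))
    p.start()
    status, payload = results.get()
    p.join()
    print(status)   # "error"
    print(payload)  # full traceback from the worker process
```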