vllm/vllm/attention/backends/flash_attn.py, line 282 at commit 99caa49:
I noticed that in the flash-attn backend, forward_prefix and forward_decode seem to be executed serially. Does forward_decode wait for forward_prefix to finish before running? If so, can this still take advantage of the performance benefit that chunked prefill is supposed to provide? I mean the case where the prefill tokens are in the same batch as the decode tokens:
```python
if prefill_meta := attn_metadata.prefill_metadata:
    output[:num_prefill_tokens] = PagedAttention.forward_prefix(...)
if decode_meta := attn_metadata.decode_metadata:
    output[num_prefill_tokens:] = PagedAttention.forward_decode(...)
```
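For clarity, here is a minimal, self-contained sketch of what I understand the chunked-prefill batching to look like. This is not vLLM's actual code; `attention_forward`, `fake_prefill_kernel`, and `fake_decode_kernel` are hypothetical placeholders. The point is that prefill and decode tokens share one flat token batch, and only the attention layer splits them into two kernel launches (which run one after the other on the same stream), while every other layer processes the fused batch.

```python
# Hypothetical sketch, NOT vLLM's implementation.
import torch


def fake_prefill_kernel(q: torch.Tensor) -> torch.Tensor:
    # Placeholder standing in for PagedAttention.forward_prefix(...)
    return q * 2.0


def fake_decode_kernel(q: torch.Tensor) -> torch.Tensor:
    # Placeholder standing in for PagedAttention.forward_decode(...)
    return q * 3.0


def attention_forward(query: torch.Tensor, num_prefill_tokens: int) -> torch.Tensor:
    """Split the flat token batch: the first num_prefill_tokens rows go through
    the prefix-attention path, the remaining rows through the decode path.
    Both calls are issued sequentially, mirroring the two `if` branches above."""
    output = torch.empty_like(query)
    if num_prefill_tokens > 0:
        output[:num_prefill_tokens] = fake_prefill_kernel(query[:num_prefill_tokens])
    if num_prefill_tokens < query.shape[0]:
        output[num_prefill_tokens:] = fake_decode_kernel(query[num_prefill_tokens:])
    return output


# One chunked-prefill step: 5 prefill tokens from one request plus 3 decode
# tokens from three other requests, packed into a single flat batch.
num_prefill_tokens, num_decode_tokens, head_dim = 5, 3, 8
query = torch.randn(num_prefill_tokens + num_decode_tokens, head_dim)
out = attention_forward(query, num_prefill_tokens)
print(out.shape)  # torch.Size([8, 8]): one output row per token, prefill + decode
```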