[Attention] Refactor FA block_size limitations to hybrid models only
#29084
Conversation
```python
    )

    @classmethod
    @override
```
pre-commit is rightfully complaining, as this is not an override.
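For context, a minimal sketch of why the checker flags this, assuming the project uses `typing_extensions.override` (or `typing.override` on Python 3.12+); the class and method names below are hypothetical, not the actual vLLM code. The decorator is only valid on methods that redefine something from a base class:

```python
from typing_extensions import override


class Base:
    @staticmethod
    def supported_sizes() -> list[int]:
        return [16, 32]


class Child(Base):
    @staticmethod
    @override
    def supported_sizes() -> list[int]:
        # Fine: Base defines supported_sizes, so this is a real override.
        return [64]

    @classmethod
    @override
    def brand_new_helper(cls) -> list[int]:
        # Flagged by the type checker / pre-commit: no base class defines
        # brand_new_helper, so @override has nothing to verify.
        return [128]
```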
Code Review
This pull request refactors how supported block sizes for attention backends are determined, changing from a class variable to a static method. This allows for dynamic block size support, which is used here to apply specific block_size limitations only to hybrid models using the FlashAttention backend. The changes are consistent across all affected backend implementations and tests.
I've found one critical issue in vllm/v1/attention/backends/flashinfer.py where a method signature mismatch with its base class will cause a TypeError at runtime. Please see the detailed comment.
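To illustrate the kind of failure being described, here is a minimal, hypothetical sketch (class, method, and parameter names are illustrative, not vLLM's actual API): when a subclass's method signature drifts from the base class, callers written against the base interface fail at call time.

```python
class AttentionBackendBase:
    @staticmethod
    def supported_kernel_block_sizes() -> list[int]:
        return [16]


class FlashInferLikeBackend(AttentionBackendBase):
    # Signature drift: an extra required positional argument.
    @staticmethod
    def supported_kernel_block_sizes(use_trtllm_attention: bool) -> list[int]:
        return [32, 64] if use_trtllm_attention else [16]


def pick_block_size(backend_cls: type[AttentionBackendBase]) -> int:
    # Callers code against the base-class signature...
    return max(backend_cls.supported_kernel_block_sizes())


try:
    pick_block_size(FlashInferLikeBackend)
except TypeError as exc:
    # supported_kernel_block_sizes() missing 1 required positional argument
    print(exc)
```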
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 35a7704 to 351e050
tdoublep left a comment:
Thanks for doing this!
MatthewBonanni left a comment:
Thanks! I have no objections to making this a static method.
Addressed your suggestions, thanks for the quick review @tdoublep @MatthewBonanni
tdoublep left a comment:
LGTM
Signed-off-by: NickLucche <[email protected]>
Force-pushed from 14d82fc to 1d3a4b3
This PR limits the block_size changes introduced in #27753 to hybrid models only.
Since non-hybrid models are not affected by the issues reported in that PR, disallowing an otherwise supported physical block_size can hinder the performance of KV cache transfers. For example, with PD disaggregation over NIXL we are not bandwidth-limited, so reducing the block size prevents optimal saturation of the transfer medium.
To do so I have soft-reverted some of the changes introduced in the #24794 refactor, namely supported_kernel_block_size being a class attribute rather than a staticmethod. Happy to discuss other options, but I could not find a cleaner solution with classvars.
cc @tdoublep for hybrid models
cc @MatthewBonanni for attn metadata refactor
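A minimal sketch of the pattern this describes (names, signatures, and block-size values are illustrative assumptions, not the exact vLLM API): a static method can choose the supported sizes per call, e.g. depending on whether the model is hybrid, which a fixed ClassVar cannot express.

```python
class AttentionBackendBase:
    # Before (#24794 style): a fixed class attribute, identical for every model:
    #   supported_kernel_block_size: ClassVar[list[int]] = [16, 32, 64]

    # After (the direction of this PR): a static method that can decide per call.
    @staticmethod
    def supported_kernel_block_sizes(is_hybrid_model: bool) -> list[int]:
        return [16, 32, 64, 128]


class FlashAttentionLikeBackend(AttentionBackendBase):
    @staticmethod
    def supported_kernel_block_sizes(is_hybrid_model: bool) -> list[int]:
        if is_hybrid_model:
            # Keep the #27753 restriction only where it is actually needed.
            return [16]
        # Non-hybrid models keep larger blocks, which helps KV-cache transfers
        # (e.g. PD disaggregation over NIXL) saturate the transfer medium.
        return [16, 32, 64, 128]


# Usage sketch:
print(FlashAttentionLikeBackend.supported_kernel_block_sizes(is_hybrid_model=True))
print(FlashAttentionLikeBackend.supported_kernel_block_sizes(is_hybrid_model=False))
```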
Test
Reproducing the same setup as for hybrid models.