Conversation

Collaborator

@NickLucche NickLucche commented Nov 20, 2025

This PR limits the block-size changes introduced in #27753 to hybrid models only.
Since non-hybrid models are not affected by the issues reported in that PR, disallowing an otherwise supported physical block_size can hinder the performance of KV cache transfers.

E.g. with P/D disaggregation over NIXL we are not bandwidth-limited, so shrinking the blocks makes it harder to saturate the transfer medium.

To do so I have soft-reverted some of the changes introduced in the #24794 refactor, turning supported_kernel_block_size from a class attribute back into a staticmethod. Happy to discuss other options, but I couldn't find a cleaner solution with classvars.
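To make the contrast concrete, here is a minimal sketch of the two approaches; the class names, method names, and block-size values below are illustrative stand-ins, not the actual vLLM identifiers.

```python
# Illustrative sketch only: names and values are hypothetical stand-ins
# for the vLLM attention-backend classes discussed in this PR.


class AttentionBackendBase:
    # As a class attribute, the supported sizes are a fixed value that
    # cannot depend on runtime state such as the model architecture:
    supported_kernel_block_size: list[int] = [16]

    # As a staticmethod, each backend can compute the list dynamically,
    # e.g. restricting sizes only when the model is hybrid.
    @staticmethod
    def get_supported_kernel_block_size(is_hybrid: bool = False) -> list[int]:
        return [16]


class FlashAttentionLikeBackend(AttentionBackendBase):
    @staticmethod
    def get_supported_kernel_block_size(is_hybrid: bool = False) -> list[int]:
        if is_hybrid:
            # Hybrid models keep the restricted sizes.
            return [16]
        # Non-hybrid models retain larger physical block sizes, which
        # helps saturate the transfer medium during KV cache transfers.
        return [16, 32, 64]


print(FlashAttentionLikeBackend.get_supported_kernel_block_size())
print(FlashAttentionLikeBackend.get_supported_kernel_block_size(is_hybrid=True))
```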

cc @tdoublep for hybrid models
cc @MatthewBonanni for attn metadata refactor

Test

Reproducing the same setup as for hybrid models.

# after PR (same as without) FA backend
Prompt: 'Solve the following math problem step by step. The last line of your response should be of the form Answer: $Answer (without quotes) where $Answer is the answer to the problem.\n\nConsider the paths of length $16$ that follow the lines from the lower left corner to the upper right corner on an $8\\times 8$ grid. Find the number of such paths that change direction exactly four times, as in the examples shown below.\n\nRemember to put your answer on its own line after "Answer:".'
Generated text: '</think>\nThe problem requires finding the number of paths of length 16'
Generated token IDs: [1885, 74045, 1561, 1784, 4127, 10867, 13170, 1278, 2782, 1307, 22344, 1307, 5592, 1032, 1049, 1054]
Generated text: ' \n\nExample paths:\n- Path 1: Right, Right, Down,'
Generated token IDs: [1032, 1267, 20396, 22344, 1877, 1045, 17669, 1032, 1049, 1058, 21285, 1044, 21285, 1044, 16999, 1044]
Generated text: '</think>\nThe number of paths of length 16 from the lower left'
Generated token IDs: [1885, 74045, 1561, 1784, 2782, 1307, 22344, 1307, 5592, 1032, 1049, 1054, 1562, 1278, 4953, 3979]
Generated text: '</think>\nThe problem involves finding the number of paths of length 16'
Generated token IDs: [1885, 74045, 1561, 1784, 4127, 19263, 13170, 1278, 2782, 1307, 22344, 1307, 5592, 1032, 1049, 1054]
Generated text: ' output: To solve the problem of finding the number of paths of length 1'
Generated token IDs: [4848, 1058, 3870, 15047, 1278, 4127, 1307, 13170, 1278, 2782, 1307, 22344, 1307, 5592, 1032, 1049]
Generated text: ' \n</think>\nThe problem requires finding the number of paths of length '
Generated token IDs: [1032, 1010, 1885, 74045, 1561, 1784, 4127, 10867, 13170, 1278, 2782, 1307, 22344, 1307, 5592, 1032]
Generated text: '</think>\nThe problem requires finding the number of paths of length 16'
Generated token IDs: [1885, 74045, 1561, 1784, 4127, 10867, 13170, 1278, 2782, 1307, 22344, 1307, 5592, 1032, 1049, 1054]
Generated text: ' \n</think>\nThe paths of length 16 from the lower left'
Generated token IDs: [1032, 1010, 1885, 74045, 1561, 1784, 22344, 1307, 5592, 1032, 1049, 1054, 1562, 1278, 4953, 3979]
Generated text: " \n\n<think>\nOkay, let's see. I need to find the"
Generated token IDs: [1032, 1267, 49250, 2077, 1561, 44053, 1044, 2878, 1681, 3219, 1046, 1362, 2534, 1317, 3081, 1278]
Generated text: '<think>\nOkay, so I need to find the number of paths on an'
Generated token IDs: [1060, 74045, 1561, 44053, 1044, 1878, 1362, 2534, 1317, 3081, 1278, 2782, 1307, 22344, 1408, 1420]


@classmethod
@override
Collaborator Author


pre-commit is rightfully complaining, as this is not an override

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors how supported block sizes for attention backends are determined, changing from a class variable to a static method. This allows for dynamic block size support, which is used here to apply specific block_size limitations only to hybrid models using the FlashAttention backend. The changes are consistent across all affected backend implementations and tests.

I've found one critical issue in vllm/v1/attention/backends/flashinfer.py where a method signature mismatch with its base class will cause a TypeError at runtime. Please see the detailed comment.
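For readers unfamiliar with this failure mode, here is a toy sketch of how such a signature mismatch surfaces as a runtime TypeError; these are hypothetical classes, not the actual flashinfer.py code.

```python
# Toy illustration of the failure mode flagged in the review above;
# the classes and method names here are hypothetical, not vLLM's.


class Base:
    @staticmethod
    def get_supported_kernel_block_size(is_hybrid: bool) -> list[int]:
        return [16]


class Mismatched(Base):
    # The override drops the parameter, so generic callers written
    # against the base-class signature fail with TypeError at runtime.
    @staticmethod
    def get_supported_kernel_block_size() -> list[int]:
        return [32]


def query(backend: type[Base]) -> list[int]:
    # Generic call site that follows the base-class signature.
    return backend.get_supported_kernel_block_size(False)


print(query(Base))  # [16]
try:
    query(Mismatched)
except TypeError:
    print("TypeError from mismatched override signature")
```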

@mergify

mergify bot commented Nov 20, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @NickLucche.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Member

@tdoublep tdoublep left a comment


Thanks for doing this!

Contributor

@MatthewBonanni MatthewBonanni left a comment


Thanks! I have no objections to making this a static method.

@NickLucche
Collaborator Author

Addressed your suggestions, thanks for the quick review @tdoublep @MatthewBonanni

@NickLucche NickLucche added the `ready` label (ONLY add when PR is ready to merge/full CI is needed) Nov 20, 2025
Member

@tdoublep tdoublep left a comment


LGTM

@github-project-automation github-project-automation bot moved this to In review in NVIDIA Nov 20, 2025
Signed-off-by: NickLucche <[email protected]>
@vllm-bot vllm-bot merged commit 066209a into vllm-project:main Nov 22, 2025
49 of 52 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in NVIDIA Nov 22, 2025

Labels

- nvidia
- ready (ONLY add when PR is ready to merge/full CI is needed)
- rocm (Related to AMD ROCm)
- v1

Projects

Status: Done

4 participants