[Misc] Log the reason for falling back to FlexAttention #20699
Conversation
Signed-off-by: DarkLight1337 <[email protected]>
Summary of Changes
Hello @DarkLight1337, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly improves the diagnostic capabilities of the V1 attention backend selection process. My changes ensure that when the system falls back to FlexAttention, the reason for this decision (e.g., unsupported head size or data type, or import issues) is clearly logged. This provides better transparency and helps users understand and debug performance characteristics related to attention backend choices.
Highlights
- Enhanced Attention Backend Selection and Logging: I've refactored the attention backend selection logic to provide more granular information on why a specific backend (like FlashAttention or FlashInfer) might not be used, leading to a fallback to FlexAttention. This includes checking for both head size and data type compatibility.
- Consolidated Backend Support Checks: The `supports_head_size` function has been renamed to `is_attn_backend_supported` and now returns a detailed `_IsSupported` object, indicating whether the backend can be imported and whether it supports the given head size and data type. This consolidates the logic for determining backend viability (see the sketch after this list).
- Explicit Data Type Support for V1 Backends: All V1 attention backends (FlashAttention, FlashInfer, FlexAttention, etc.) now explicitly declare their supported data types via a new `get_supported_dtypes` class method. This allows the selection logic to make informed decisions based on `dtype`.
- Fixed FlashInfer Import Warning Regression: I've addressed a regression introduced in a previous PR (#20467) that inadvertently disabled the warning message when FlashInfer failed to import for SM 10.0 (Blackwell) devices. The warning is now correctly re-enabled.
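To make the consolidation above concrete, here is a minimal, self-contained sketch of what such a check could look like. The fields on `_IsSupported`, the helper's exact signature, and the `_FakeFlashAttentionBackend` class (including its `get_supported_head_sizes` method and the listed sizes) are illustrative assumptions, not vLLM's actual implementation:

```python
from dataclasses import dataclass

import torch


@dataclass
class _IsSupported:
    # Illustrative fields; the real object may carry different information.
    can_import: bool
    head_size: bool
    dtype: bool

    def __bool__(self) -> bool:
        return self.can_import and self.head_size and self.dtype


class _FakeFlashAttentionBackend:
    """Hypothetical backend class, standing in for a real V1 backend."""

    @classmethod
    def get_supported_head_sizes(cls) -> list[int]:
        return [64, 96, 128, 192, 256]  # made-up values for this sketch

    @classmethod
    def get_supported_dtypes(cls) -> list[torch.dtype]:
        return [torch.float16, torch.bfloat16]


def is_attn_backend_supported(backend_cls, head_size: int,
                              dtype: torch.dtype) -> _IsSupported:
    """Consolidated viability check: import, head size, and dtype."""
    if backend_cls is None:  # import failed upstream
        return _IsSupported(can_import=False, head_size=False, dtype=False)
    return _IsSupported(
        can_import=True,
        head_size=head_size in backend_cls.get_supported_head_sizes(),
        dtype=dtype in backend_cls.get_supported_dtypes(),
    )


if __name__ == "__main__":
    support = is_attn_backend_supported(
        _FakeFlashAttentionBackend, head_size=80, dtype=torch.float32)
    if not support:
        print("Would fall back to FlexAttention:", support)
```

Returning an object rather than a bare boolean is what lets the caller log which of the three conditions (import, head size, dtype) caused the fallback.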
Signed-off-by: DarkLight1337 <[email protected]>
Code Review
This pull request effectively addresses the logging issue for the FlashInfer fallback on SM 10.0 devices and improves the attention backend selection logic. The refactoring of `supports_head_size` to `is_attn_backend_supported` and the addition of `get_supported_dtypes` make the code more robust and easier to understand. The new logging for FlexAttention fallback reasons is a great enhancement for users.
Diff excerpt quoted by the review comment below:

```python
logger.info_once("Using Flash Attention backend on V1 engine.")
return FLASH_ATTN_V1
if cls.has_device_capability(80):
    if is_default_backend_supported := is_attn_backend_supported(
```
The variable `is_default_backend_supported` is reused for both the FlashInfer (line 261) and FlashAttention (line 276) checks. This reassignment can reduce clarity regarding which backend's support is being evaluated. Consider using a more specific variable name, such as `is_flash_attn_supported`, for the FlashAttention check to improve readability and explicitly indicate the context of the support check.
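To illustrate the suggestion, here is a small runnable sketch with a distinct variable name per check; the `_check` helper, the backend names other than `is_flash_attn_supported`, and the log wording are placeholders, not the actual vLLM code:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def _check(backend_name: str) -> bool:
    """Placeholder for is_attn_backend_supported(...) in the real code."""
    return backend_name == "FLASH_ATTN"


def select_backend() -> str:
    # One clearly named variable per backend check, instead of reusing
    # is_default_backend_supported for both FlashInfer and FlashAttention.
    if is_flashinfer_supported := _check("FLASHINFER"):
        logger.info("Using FlashInfer backend on V1 engine.")
        return "FLASHINFER_V1"

    if is_flash_attn_supported := _check("FLASH_ATTN"):
        logger.info("Using Flash Attention backend on V1 engine.")
        return "FLASH_ATTN_V1"

    logger.info(
        "Falling back to FlexAttention (FlashInfer supported: %s, "
        "FlashAttention supported: %s).",
        is_flashinfer_supported, is_flash_attn_supported)
    return "FLEX_ATTENTION_V1"


if __name__ == "__main__":
    print(select_backend())
```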
Signed-off-by: DarkLight1337 <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: DarkLight1337 <[email protected]>
…#20699) Signed-off-by: DarkLight1337 <[email protected]> Signed-off-by: x22x22 <[email protected]>
…#20699) Signed-off-by: DarkLight1337 <[email protected]>
…#20699) Signed-off-by: DarkLight1337 <[email protected]> Signed-off-by: Jinzhen Lin <[email protected]>
…#20699) Signed-off-by: DarkLight1337 <[email protected]> Signed-off-by: Paul Pak <[email protected]>
…#20699) Signed-off-by: DarkLight1337 <[email protected]> Signed-off-by: Diego-Castan <[email protected]>
Essential Elements of an Effective PR Description Checklist
… `supported_models.md` and `examples` for a new model.

Purpose
#20467 accidentally disabled the warning message that is emitted when FlashInfer fails to import on SM 10.0 devices. This PR fixes the issue and also consolidates the logic for falling back to FlexAttention based on `head_size` and `dtype`.
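As a rough illustration of the re-enabled warning path (the function name, the capability flag, and the message text below are assumptions for this sketch, not the code in this PR):

```python
import logging

logger = logging.getLogger(__name__)


def try_import_flashinfer(is_sm100: bool):
    """Return the flashinfer module if available, else None.

    If the import fails on an SM 10.0 (Blackwell) device, emit the warning
    that #20467 had inadvertently silenced, so users can see why the engine
    falls back to another attention backend.
    """
    try:
        import flashinfer  # noqa: F401
    except ImportError:
        if is_sm100:
            logger.warning(
                "FlashInfer failed to import on an SM 10.0 (Blackwell) "
                "device; falling back to another attention backend.")
        return None
    return flashinfer
```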
Notable changes:
- Add `get_supported_dtypes` to V1 attention backends.
- Rename `supports_head_size` to a more general `is_attn_backend_supported`.

Test Plan
Test Result
(Optional) Documentation Update