[torch.compile] Use custom ops when use_inductor=False #19618
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 …
Summary of Changes
Hello @WoosukKwon, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a configuration issue where vLLM was prevented from using its custom CUDA kernels when `torch.compile` was used with `use_inductor=False`. The change modifies the configuration logic to correctly enable custom CUDA kernels in this scenario, ensuring that the intended performance optimizations are applied based on the compilation settings.
Highlights
- **Configuration Fix**: Corrected the logic in `vllm/config.py` to ensure that custom CUDA kernels are used for registered custom operations when `use_inductor` is set to `False`. Previously, `custom_ops` was incorrectly set to `["none"]` even when Inductor was disabled, forcing the use of Inductor-generated kernels.
- **Conditional Logic Update**: The code now conditionally sets `compilation_config.custom_ops = ["none"]` only when `custom_ops` is empty and `use_inductor` is `True`. This aligns the configuration with the intended behavior: letting Inductor handle custom ops when it's enabled and no specific custom CUDA ops are requested (a minimal sketch follows this list).
- **Improved Comments**: Added comments to clarify the expected behavior of custom operations depending on the `use_inductor` and `custom_ops` configuration settings.
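As a rough, self-contained illustration of the conditional just described, here is a minimal sketch. The stub config class and helper name below are hypothetical stand-ins that mimic the described logic; they are not vLLM's actual `CompilationConfig` code.

```python
from dataclasses import dataclass, field


@dataclass
class StubCompilationConfig:
    # Hypothetical stand-in for illustration only; not vLLM's real class.
    use_inductor: bool = True
    custom_ops: list = field(default_factory=list)


def apply_custom_ops_default(cfg: StubCompilationConfig) -> None:
    # Behavior described in this PR: default custom_ops to ["none"] only
    # when Inductor is enabled and no specific custom ops were requested.
    if cfg.use_inductor and not cfg.custom_ops:
        cfg.custom_ops = ["none"]


inductor_cfg = StubCompilationConfig(use_inductor=True)
apply_custom_ops_default(inductor_cfg)
print(inductor_cfg.custom_ops)  # ['none'] -> Inductor generates Triton kernels

eager_cfg = StubCompilationConfig(use_inductor=False)
apply_custom_ops_default(eager_cfg)
print(eager_cfg.custom_ops)  # [] -> vLLM's custom CUDA kernels are used
```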
Code Review
This pull request updates the logic for handling `custom_ops` in the `VllmConfig` when `torch.compile` is used with `VLLM_USE_V1`. The main change ensures that if `use_inductor` is `False` and `custom_ops` is initially empty, `custom_ops` will remain empty. This allows vLLM to use its custom CUDA kernels for registered custom operations by default in this scenario, which is the intended behavior described in the PR.

Previously, if `custom_ops` was empty, it was unconditionally set to `["none"]` within the V1-specific configuration block. The new logic makes this assignment conditional on `use_inductor` being `True`: if `use_inductor` is `True` and `custom_ops` is empty, it is set to `["none"]` so that Inductor generates Triton kernels for these operations.

The accompanying comments clearly explain the behavior when `use_inductor=True`.

The code change is well-targeted and appears to correctly implement the desired logic. No issues of medium or higher severity were identified.
# The behavior of custom ops with inductor depends on the config:
# - If use_inductor=True and custom_ops is empty:
#   Inductor generates Triton kernels for all registered custom ops
#   (default behavior)
# - If use_inductor=True and custom_ops is non-empty:
#   Custom CUDA kernels are used for specified ops while inductor
#   generates Triton kernels for remaining ops, including misc torch
# ops in the model. |
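As a compact, self-contained paraphrase of the matrix in the comment above (plus the `use_inductor=False` case this PR enables), the snippet below is illustrative only; the function name is hypothetical and this is not vLLM's implementation.

```python
def kernel_source(use_inductor: bool, custom_ops: list) -> str:
    """Paraphrase of the behavior described above; not vLLM's implementation."""
    if use_inductor:
        if not custom_ops:
            # Default: Inductor generates Triton kernels for all
            # registered custom ops.
            return "Inductor-generated Triton kernels for all custom ops"
        # Custom CUDA kernels for the specified ops; Inductor covers the
        # remaining ops, including misc torch ops in the model.
        return "custom CUDA kernels for specified ops, Inductor for the rest"
    # use_inductor=False (the case this PR fixes): vLLM's custom CUDA
    # kernels are used for the registered custom ops.
    return "vLLM's custom CUDA kernels for registered custom ops"


for flags in [(True, []), (True, ["all"]), (False, [])]:
    print(flags, "->", kernel_source(*flags))
```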
FYI @BoyuanFeng this might change how we do benchmarking
In the past I've recommended that people use `use_inductor=False` with `custom_ops="all"`, but I think having this happen automatically is reasonable.
Has anyone confirmed this actually works? If I understand correctly, custom ops are automatically disabled when `level>=3` (piecewise). So …
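For reference, here is a hedged sketch of the benchmarking settings discussed above, written as plain dicts; how such a compilation config is actually passed to vLLM (and which field names are accepted) depends on the vLLM version, so treat this as illustrative only.

```python
# Illustrative only; field names taken from the discussion above.

# Previously recommended benchmarking setup: disable Inductor and explicitly
# force vLLM's custom CUDA kernels for all custom ops.
explicit_benchmark_config = {
    "use_inductor": False,
    "custom_ops": ["all"],
}

# With this PR, leaving custom_ops unset when use_inductor=False is described
# as having a similar effect, since the empty list is no longer overridden
# to ["none"].
implicit_benchmark_config = {
    "use_inductor": False,
}
```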
Essential Elements of an Effective PR Description Checklist
- (Optional) Documentation update, such as updating supported_models.md and examples for a new model.

Purpose
When `use_inductor=False`, vLLM should use the custom CUDA kernels for the registered custom ops. Currently, we always set `custom_ops = ["none"]` even when `use_inductor=False`.
.Test Plan
Test Result
(Optional) Documentation Update