Description
tl;dr: Improve the user experience around compilation and cudagraph capture by consolidating/overhauling CompilationConfig and defining more meaningful optimization levels -O0, -O1, -O2, -O3 (and maybe more).
Motivation.
CompilationConfig was born around December 2024 to enable configuring torch.compile-based compilation and piecewise cudagraph capture. Since then, a bunch more flags were added to support new features, all good in isolation but without a cohesive plan. As vLLM aims to provide great performance out-of-the-box, having to manually configure a bunch of flags is bad UX.
CompilationConfig currently serves as both the user-facing and compiler-interfacing compilation configuration mechanism. What I mean by that is that it's used by CLI/Python API users to control compilation, as well as other parts of the codebase (model runner, vllm config, etc.). This has the benefit of good UX for developers to directly control compilation from the CLI and Python, but the downside of this weird second-state where defaults are inspected and adjusted. This was handled very poorly in V1 where a bunch of settings were hardcoded, making them impossible to change from the CLI.
Additionally, compilation levels 0-3 are currently not very intuitive, and levels 1 and 2 are only meant for internal use. Instead, the convenient -O<n> flag should be used for optimization levels, and compilation levels should be adjusted to reflect their actual uses.
Finally, there are concerns around vLLM startup time (#19824) and having different optimization levels -O progressively trade startup cost for performance seems like another improvement to startup UX.
Proposed Change.
I am proposing an overhaul of many CompilationConfig fields. I've put them all into one RFC as some are very related, but they can be done as separate PRs.
Repurpose -O for optimization level
I propose we start with 4 optimization levels, 0 through 3. Exact settings here should be determined later, but they could go something like this:
- -O0: No optimization. Pretty much equivalent to --enforce_eager: no compilation, no cudagraphs, no other optimization, just starting up immediately.
- -O1: Quick optimizations. Dynamo+Inductor compilation but no cudagraphs (or maybe lazy cudagraphs: [RFC]: Lazy CUDA Graph capture #20098).
- -O2: Full optimizations. -O1 as well as cudagraphs. This would be the default, and is most similar to the current default settings.
- -O3: Full (auto)tuning. -O2 as well as max-autotune, compiling for additional static sizes, etc. - any other time-consuming optimizations.
These levels trade startup time for performance, with -O0 having the best startup time and -O3 having the best performance. We can decide the exact settings for each level after more in-depth benchmarking as proposed in #19824.
While we should make sure each level is just a combination of fine-grained flags, I also believe we should not promise to keep the contents of each level fixed, so that we retain flexibility. If users rely on certain features, they can specify them manually. But I know that either way users might come to rely on features being present in each level, so that should be considered.
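To make "each level is just a combination of fine-grained flags" concrete, here is a minimal sketch. The `LevelPreset` class, its field names, and the `resolve_level` helper are hypothetical and not part of vLLM; the preset values simply mirror the tentative descriptions above and would change after benchmarking.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class LevelPreset:
    """Hypothetical bundle of fine-grained settings implied by one -O level."""
    compilation: bool    # run Dynamo+Inductor compilation
    cudagraphs: bool     # capture cudagraphs at startup
    max_autotune: bool   # enable time-consuming autotuning


# Illustrative values only; the RFC leaves exact settings to later benchmarking.
PRESETS = {
    0: LevelPreset(compilation=False, cudagraphs=False, max_autotune=False),
    1: LevelPreset(compilation=True, cudagraphs=False, max_autotune=False),
    2: LevelPreset(compilation=True, cudagraphs=True, max_autotune=False),
    3: LevelPreset(compilation=True, cudagraphs=True, max_autotune=True),
}


def resolve_level(level: int = 2, **overrides: bool) -> LevelPreset:
    """Start from the preset for `level`, then apply explicit user overrides."""
    if level not in PRESETS:
        raise ValueError(f"unknown optimization level: {level}")
    return replace(PRESETS[level], **overrides)
```

With this shape, a user who relies on one specific feature can pin it explicitly, for example `resolve_level(3, cudagraphs=False)`, regardless of how the presets evolve.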
I also propose --enforce-eager is deprecated, becoming equivalent to -O0. We can remove it later or keep it around.
Rename compilation level to mode
Because -O<n> now means optimization level and not compilation level, I propose renaming CompilationLevel to CompilationMode. This is mostly used by developers, specifically by Meta to debug vLLM's torch.compile integration, and the interface should better reflect the use. I propose the following "modes":
- CompilationMode.NONE (same as current NO_COMPILATION)
- CompilationMode.STOCK_TORCH_COMPILE (same as current DYNAMO_AS_IS except with Inductor by default). This can be useful to separate vLLM custom compilation issues from torch.compile issues. Looking for better name suggestions.
- CompilationMode.DYNAMO_TRACE_ONCE (same as current DYNAMO_ONCE)
- CompilationMode.VLLM_COMPILE (same as current PIECEWISE)
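The rename could be sketched as follows. The member names come from the list above; the integer values are an assumption that follows the order listed (the text only confirms that PIECEWISE is level 3), and `OLD_LEVEL_NAMES` and `mode_from_old_name` are hypothetical helpers for illustration.

```python
from enum import IntEnum


class CompilationMode(IntEnum):
    # Integer values are assumed to follow the existing 0-3 level order.
    NONE = 0
    STOCK_TORCH_COMPILE = 1
    DYNAMO_TRACE_ONCE = 2
    VLLM_COMPILE = 3


# Hypothetical lookup from the current level names to the proposed modes.
OLD_LEVEL_NAMES = {
    "NO_COMPILATION": CompilationMode.NONE,
    "DYNAMO_AS_IS": CompilationMode.STOCK_TORCH_COMPILE,
    "DYNAMO_ONCE": CompilationMode.DYNAMO_TRACE_ONCE,
    "PIECEWISE": CompilationMode.VLLM_COMPILE,
}


def mode_from_old_name(name: str) -> CompilationMode:
    """Translate an old CompilationLevel name into the proposed mode."""
    try:
        return OLD_LEVEL_NAMES[name]
    except KeyError:
        raise ValueError(f"unknown compilation level name: {name!r}") from None
```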
Other changes to compilation controls
❌ means removal, ✏ means change, 🌱 means addition
- ❌ use_inductor: this is fully redundant with backend.
- ✏ backend: this is currently not respected for compilation mode (level) 3 (PIECEWISE), and use_inductor is used in its place. We can instead just use this field and make mode 3 respect it. There are currently no uses for custom backends inside vLLM custom backend, so we can disallow custom backends (only allow "inductor" and "eager"/"") for mode 3. If a use case is needed in the future, this can be extended. "inductor" becomes the default for this field.
- 🌱 debug_mode: bool - add additional checks to validate compilation & cudagraphs are running correctly. This could be shape checks for VLLM_COMPILE, cudagraph address checks, and more. Currently cudagraph addresses are checked if VLLM_LOGGING_LEVEL=DEBUG, but I think this would be better done explicitly. Open to name suggestions, and thanks to @zou3519 for the proposal! More details in [RFC][UX]: debug mode for vLLM-compile #20394.
- ❌ use_cudagraph and full_cuda_graph. These are replaced with cudagraph_mode.
- 🌱 cudagraph_mode: enum of type CUDAGraphMode with options NONE, FULL, PIECEWISE, later adding FULL_AND_PIECEWISE and AUTO. PIECEWISE obviously requires compilation mode VLLM_COMPILE. FULL_AND_PIECEWISE is for attention backends that only support cudagraphs in attention for some requests. AUTO can be used to mean FULL if supported, otherwise FULL_AND_PIECEWISE, otherwise PIECEWISE. This is assuming we want full cudagraphs by default when enabled (not yet confirmed that's the case). [Core] Allow full cudagraph with separate attention routines and orthogonal to compilation, add support for FA2 and FlashInfer #20059 will add this enum for cudagraph execution, as well as the ability to run cudagraphs (only full) without any compilation. We can simply extend the enum and use it here.
- ✏ custom_ops: default behavior for custom ops currently depends on use_inductor (getting removed) and CompilationLevel (renamed). Instead, this field should be the single source of truth for custom ops and we can set it to "all" or "none" as part of config initialization (allowing user-specified values to override).
- ✏ cudagraph_capture_sizes: these are currently reversed, just for the model runner to unreverse them and then reverse them again. They can just be sorted ascending and model runner can iterate in reverse during capture.
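The AUTO fallback described for cudagraph_mode could look roughly like this. The enum members are taken from the list above; the string values, the `resolve_cudagraph_mode` function, and its two capability flags are hypothetical stand-ins for whatever the attention backend would actually report.

```python
from enum import Enum


class CUDAGraphMode(Enum):
    NONE = "none"
    PIECEWISE = "piecewise"
    FULL = "full"
    FULL_AND_PIECEWISE = "full_and_piecewise"
    AUTO = "auto"


def resolve_cudagraph_mode(
    mode: CUDAGraphMode,
    *,
    full_supported: bool,
    full_for_some_requests: bool,
) -> CUDAGraphMode:
    """Resolve AUTO to a concrete mode; explicit modes pass through unchanged.

    Both keyword arguments are assumed capability flags, not real vLLM fields.
    """
    if mode is not CUDAGraphMode.AUTO:
        return mode
    if full_supported:
        return CUDAGraphMode.FULL
    if full_for_some_requests:
        return CUDAGraphMode.FULL_AND_PIECEWISE
    return CUDAGraphMode.PIECEWISE
```

Keeping the resolution in one function means an explicitly requested mode is never silently rewritten; only AUTO is replaced.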
Unchanged fields:
For visibility, below is the list of other fields this RFC does not seek to address. Please let me know if you think any of these fields should be adjusted as part of this RFC:
- debug_dump_path: str = ""
- cache_dir: str = ""
- splitting_ops: list[str] = []
- compile_sizes: Optional[list[Union[int, str]]] = None
- inductor_compile_config: dict = {}
- inductor_passes: dict[str, str] = {}
- cudagraph_num_of_warmups: int = 0
- cudagraph_capture_sizes: Optional[list[int]] = None
- cudagraph_copy_inputs: bool = False
- pass_config: PassConfig = PassConfig()
- all fields that are excluded from __init__
Enabling logic
There are a lot of fields whose defaults depend on the values of other fields or the platform. Those fields should be uninitialized/None by default so that we can distinguish between a value set explicitly from the CLI/Python and the default value. For example, splitting_ops is an empty list by default, but in V1 piecewise compilation it's set to attention ops, and it's set to empty otherwise. After #20059, splitting ops will not be required if full cudagraphs is enabled, so the user must be able to override it.
This logic is currently scattered around config.py and some other places; we should make sure it's consolidated inside a single function, likely VllmConfig.__post_init__.
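A minimal sketch of the None-as-unset pattern, assuming a standalone dataclass rather than the real VllmConfig. `SketchConfig`, its `piecewise` field, and the `ATTENTION_OPS` placeholder are hypothetical; only the splitting_ops behavior follows the description above.

```python
from dataclasses import dataclass
from typing import Optional

# Placeholder op name; the real list of attention ops lives in vLLM.
ATTENTION_OPS = ["attention_op_placeholder"]


@dataclass
class SketchConfig:
    piecewise: bool = True
    # None means "not set by the user", which differs from an explicit [].
    splitting_ops: Optional[list[str]] = None

    def __post_init__(self) -> None:
        # All default resolution happens in one place.
        if self.splitting_ops is None:
            self.splitting_ops = list(ATTENTION_OPS) if self.piecewise else []
```

Here `SketchConfig(splitting_ops=[])` keeps the user's explicit empty list even under piecewise compilation, which is the override the text asks for.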
Sunsetting period
I believe that these are not user-facing enough to warrant standard deprecation procedures. Instead, I propose we perform the changes (including the swap from CompilationLevel to OptimizationLevel) in a single release. I believe that would be less painful than trying to support both at the same time. We would add explicit error messages about removal of level etc. instead of JSON parsing errors. I know that is a bold stance so please give feedback on it in the comments.
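The explicit error messages could be produced by a small check that runs before the config is parsed. This is only a sketch: `REMOVED_FIELDS`, the hint wording, and `check_removed_fields` are hypothetical, and the field names are the ones this RFC proposes to remove or rename.

```python
# Hypothetical table of removed fields and a migration hint for each.
REMOVED_FIELDS = {
    "level": "use 'mode' instead",
    "use_inductor": "use 'backend' instead",
    "use_cudagraph": "use 'cudagraph_mode' instead",
    "full_cuda_graph": "use 'cudagraph_mode' instead",
}


def check_removed_fields(user_config: dict) -> None:
    """Raise a clear error if the user passes a field that no longer exists."""
    for name, hint in REMOVED_FIELDS.items():
        if name in user_config:
            raise ValueError(
                f"CompilationConfig field '{name}' has been removed; {hint}."
            )
```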
Alternatively, we could deprecate level (and map it to mode) and create optimization_level and mode fields, and remove level in later releases. As a middle ground,
Out of scope for this RFC
- Moving cudagraph capture config out of CompilationConfig
- Configuration oracle to replace current platform-dependent configuration
Feedback Period.
10 days, 6/30-7/9. I want to try to address this before my summer vacation 7/18-8/8
CC List.
@youkaichao @simon-mo @mgoin @robertgshaw2-redhat @zou3519 @WoosukKwon
Any Other Things.
No response