
[RFC][UX][torch.compile][CUDAGraph]: Overhaul CompilationConfig and improve CLI -O<n> #20283

@ProExpertProg

Description


tl;dr: Improve the user experience around compilation and cudagraph capture by consolidating/overhauling CompilationConfig and defining more meaningful optimization levels -O0, -O1, -O2, -O3 (and maybe more).

Motivation.

CompilationConfig was born around December 2024 to enable configuring torch.compile-based compilation and piecewise cudagraph capture. Since then, many more flags were added to support new features, each reasonable in isolation but added without a cohesive plan. As vLLM aims to provide great performance out-of-the-box, having to manually configure a bunch of flags is bad UX.

CompilationConfig currently serves as both the user-facing and the compiler-facing compilation configuration mechanism: CLI/Python API users use it to control compilation, and other parts of the codebase (model runner, vllm config, etc.) read it directly. This has the benefit of good UX for developers who want to control compilation from the CLI and Python, but the downside of an awkward second state where defaults are inspected and adjusted after the user's input. V1 handled this very poorly: a bunch of settings were hardcoded, making them impossible to change from the CLI.

Additionally, compilation levels 0-3 are currently not very intuitive, and levels 1 and 2 are only meant for internal use. Instead, the convenient -O<n> flag should select an optimization level, and compilation levels should be adjusted to reflect actual uses.

Finally, there are concerns around vLLM startup time (#19824), and having the optimization levels -O<n> progressively trade startup cost for performance seems like another improvement to startup UX.

Proposed Change.

I am proposing an overhaul of many CompilationConfig fields. I've put them all into one RFC as some are very related, but they can be done as separate PRs.

Repurpose -O for optimization level

I propose we start with 4 optimization levels, 0 through 3. Exact settings here should be determined later, but they could go something like this:

  • -O0: No optimization. Pretty much equivalent to --enforce-eager: no compilation, no cudagraphs, no other optimization, just starting up immediately.
  • -O1: Quick optimizations. Dynamo+Inductor compilation but no cudagraphs (or maybe lazy cudagraphs: [RFC]: Lazy CUDA Graph capture #20098).
  • -O2: Full optimizations. -O1 plus cudagraphs. This would be the default, and is most similar to the current default settings.
  • -O3: Full (auto)tuning. -O2 plus max-autotune, compiling for additional static sizes, etc. - any other time-consuming optimizations.

These levels trade startup time cost for performance, with -O0 having the best startup time and -O3 having the best performance. We can decide exact settings for each level after more in-depth benchmarking as proposed in #19824.
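To make the trade-off concrete, the level-to-flags mapping could be sketched as a simple preset table. This is only an illustration: the field names (compile, cudagraphs, max_autotune) are hypothetical stand-ins for the fine-grained flags, and the exact per-level settings would be finalized after the benchmarking in #19824.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OptPreset:
    """Illustrative bundle of fine-grained flags behind one -O<n> level."""
    compile: bool       # Dynamo+Inductor compilation
    cudagraphs: bool    # capture cudagraphs
    max_autotune: bool  # expensive Inductor autotuning


# Hypothetical presets mirroring the proposed -O0..-O3 descriptions above.
PRESETS = {
    0: OptPreset(compile=False, cudagraphs=False, max_autotune=False),  # ~ --enforce-eager
    1: OptPreset(compile=True,  cudagraphs=False, max_autotune=False),  # quick optimizations
    2: OptPreset(compile=True,  cudagraphs=True,  max_autotune=False),  # default
    3: OptPreset(compile=True,  cudagraphs=True,  max_autotune=True),   # full (auto)tuning
}


def resolve(level: int) -> OptPreset:
    """Look up the flag bundle for an optimization level."""
    return PRESETS[level]
```

A user who relies on a specific behavior (say, cudagraphs at -O1) would then override the individual flag rather than depending on the level's contents staying fixed.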

While we should make sure each level is just a combination of fine-grained flags, I also believe we should not commit to keeping each level's contents fixed, for better flexibility. Users who rely on certain features can specify them manually. That said, users may come to rely on features being present at each level either way, so that should be considered.

I also propose --enforce-eager is deprecated, becoming equivalent to -O0. We can remove it later or keep it around.

Rename compilation level to mode

Because -O<n> now means optimization level and not compilation level, I propose renaming CompilationLevel to CompilationMode. This is mostly used by developers, specifically by Meta to debug vLLM's torch.compile integration, and the interface should better reflect the use. I propose the following "modes":

  • CompilationMode.NONE (same as current NO_COMPILATION)
  • CompilationMode.STOCK_TORCH_COMPILE (same as current DYNAMO_AS_IS except with Inductor by default). This can be useful to isolate vLLM custom compilation issues from torch.compile issues. Looking for better name suggestions.
  • CompilationMode.DYNAMO_TRACE_ONCE (same as current DYNAMO_ONCE)
  • CompilationMode.VLLM_COMPILE (same as current PIECEWISE)
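The renamed enum could look roughly like this. A minimal sketch: the numeric values are assumed to mirror the current CompilationLevel integers so that existing integer-valued configs still map cleanly.

```python
import enum


class CompilationMode(enum.IntEnum):
    """Sketch of the proposed rename; values assumed to match CompilationLevel."""
    NONE = 0                 # was NO_COMPILATION
    STOCK_TORCH_COMPILE = 1  # was DYNAMO_AS_IS, now with Inductor by default
    DYNAMO_TRACE_ONCE = 2    # was DYNAMO_ONCE
    VLLM_COMPILE = 3         # was PIECEWISE
```

Using IntEnum keeps comparisons like `mode == 3` working during any transition window.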

Other changes to compilation controls

❌ means removal, ✏ means change, 🌱 means addition

  • ❌ use_inductor: this is fully redundant with backend.
  • ✏ backend: this is currently not respected for compilation mode (level) 3 (PIECEWISE), where use_inductor is used in its place. We can instead use this field and make mode 3 respect it. There are currently no uses for custom backends with the vLLM backend, so we can disallow custom backends (only allow "inductor" and "eager"/"") for mode 3. If a use case is needed in the future, this can be extended. "inductor" becomes the default for this field.
  • 🌱 debug_mode: bool - add additional checks to validate compilation & cudagraphs are running correctly. This could be shape checks for VLLM_COMPILE, cudagraph address checks, and more. Currently cudagraph addresses are checked if VLLM_LOGGING_LEVEL=DEBUG, but I think this would be better done explicitly. Open to name suggestions, and thanks to @zou3519 for the proposal! More details in [RFC][UX]: debug mode for vLLM-compile #20394.
  • ❌ use_cudagraph and full_cuda_graph: these are replaced with cudagraph_mode.
  • 🌱 cudagraph_mode: enum of type CUDAGraphMode with options NONE, FULL, PIECEWISE, later adding FULL_AND_PIECEWISE and AUTO. PIECEWISE obviously requires compilation mode VLLM_COMPILE. FULL_AND_PIECEWISE is for attention backends that only support cudagraphs in attention for some requests. AUTO can be used to mean FULL if supported, otherwise FULL_AND_PIECEWISE, otherwise PIECEWISE. This is assuming we want full cudagraphs by default when enabled (not yet confirmed that's the case). [Core] Allow full cudagraph with separate attention routines and orthogonal to compilation, add support for FA2 and FlashInfer #20059 will add this enum for cudagraph execution, as well as the ability to run cudagraphs (only full) without any compilation. We can simply extend the enum and use it here.
  • ✏ custom_ops: the default behavior for custom ops currently depends on use_inductor (being removed) and CompilationLevel (being renamed). Instead, this field should be the single source of truth for custom ops, and we can set it to "all" or "none" as part of config initialization (allowing user-specified values to override).
  • ✏ cudagraph_capture_sizes: these are currently reversed, only for the model runner to unreverse them and then reverse them again. They can simply be sorted ascending, and the model runner can iterate in reverse during capture.
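The cudagraph_mode resolution and the capture-size ordering above could be sketched as follows. The AUTO fallback order follows the bullet points; the function names and the capability flags are assumptions for illustration.

```python
import enum


class CUDAGraphMode(enum.Enum):
    """Sketch of the proposed enum; FULL_AND_PIECEWISE and AUTO added later."""
    NONE = "none"
    FULL = "full"
    PIECEWISE = "piecewise"
    FULL_AND_PIECEWISE = "full_and_piecewise"
    AUTO = "auto"


def resolve_auto(full_supported: bool, mixed_supported: bool) -> CUDAGraphMode:
    # AUTO means: FULL if supported, otherwise FULL_AND_PIECEWISE,
    # otherwise PIECEWISE (per the proposal above).
    if full_supported:
        return CUDAGraphMode.FULL
    if mixed_supported:
        return CUDAGraphMode.FULL_AND_PIECEWISE
    return CUDAGraphMode.PIECEWISE


def normalize_capture_sizes(sizes: list[int]) -> list[int]:
    """Store capture sizes ascending once, deduplicated."""
    return sorted(set(sizes))


# The model runner captures largest-first by iterating in reverse,
# instead of the current reverse/unreverse/reverse dance.
capture_order = list(reversed(normalize_capture_sizes([1, 8, 4, 2, 8])))
```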

Unchanged fields:

For visibility, below is the list of other fields this RFC does not seek to address. Please let me know if you think any of these fields should be adjusted as part of this RFC:

  • debug_dump_path: str = ""
  • cache_dir: str = ""
  • splitting_ops: list[str] = []
  • compile_sizes: Optional[list[Union[int, str]]] = None
  • inductor_compile_config: dict = {}
  • inductor_passes: dict[str, str] = {}
  • cudagraph_num_of_warmups: int = 0
  • cudagraph_capture_sizes: Optional[list[int]] = None
  • cudagraph_copy_inputs: bool = False
  • pass_config: PassConfig = PassConfig()
  • all fields that are excluded from __init__

Enabling logic

There are a lot of fields whose defaults depend on the values of other fields or on the platform. Those fields should be uninitialized/None by default so that we can distinguish between a value set explicitly from the CLI/Python and the default. For example, splitting_ops is an empty list by default, but in V1 piecewise compilation it's set to the attention ops, and it's left empty otherwise. After #20059, splitting ops will not be required when full cudagraphs are enabled, so the user must be able to override it.

This logic is currently scattered around config.py and some other places; we should make sure it's consolidated inside a single function, likely VllmConfig.__post_init__.
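One way to make the "unset vs. explicitly set" distinction concrete, using splitting_ops as the example. This is a sketch under assumptions: the op name is a placeholder, and the resolution conditions are simplified stand-ins for the real platform/config logic that would live in VllmConfig.__post_init__.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CompilationConfigSketch:
    # None means "not set by the user"; an explicit [] is a valid override.
    splitting_ops: Optional[list[str]] = None


def post_init_resolve(cfg: CompilationConfigSketch,
                      piecewise: bool,
                      full_cudagraphs: bool) -> None:
    """Consolidated default resolution (sketch of VllmConfig.__post_init__)."""
    if cfg.splitting_ops is None:
        if piecewise and not full_cudagraphs:
            # Placeholder op name, not the real attention op list.
            cfg.splitting_ops = ["vllm.unified_attention"]
        else:
            cfg.splitting_ops = []
    # else: user set it explicitly (even to []) -- leave it alone.
```

Because None and [] are distinguishable, a user-supplied empty list survives resolution instead of being silently replaced by the defaults.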

Sunsetting period

I believe that these are not user-facing enough to warrant standard deprecation procedures. Instead, I propose we perform the changes (including the swap from CompilationLevel to CompilationMode) in a single release. I believe that would be less painful than trying to support both at the same time. We would add explicit error messages about the removal of level etc. instead of JSON parsing errors. I know that is a bold stance, so please give feedback on it in the comments.
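For the hard swap, config parsing could special-case the removed keys so users get an actionable error instead of an opaque JSON/validation failure. A sketch; the key set and error text are illustrative, not a final API.

```python
# Hypothetical map of removed CompilationConfig keys to migration hints.
REMOVED_KEYS = {
    "level": "use 'mode' (CompilationMode) or the -O<n> optimization level instead",
    "use_inductor": "use 'backend' instead",
    "use_cudagraph": "use 'cudagraph_mode' instead",
    "full_cuda_graph": "use 'cudagraph_mode' instead",
}


def check_removed_keys(raw: dict) -> None:
    """Raise a clear, actionable error when a removed field is supplied."""
    for key, hint in REMOVED_KEYS.items():
        if key in raw:
            raise ValueError(
                f"CompilationConfig field '{key}' was removed: {hint}")
```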

Alternatively, we could deprecate level (mapping it to mode), create optimization_level and mode fields, and remove level in a later release.

Out of scope for this RFC

  • Moving cudagraph capture config out of CompilationConfig
  • Configuration oracle to replace current platform-dependent configuration

Feedback Period.

10 days, 6/30-7/9. I want to try to address this before my summer vacation 7/18-8/8

CC List.

@youkaichao @simon-mo @mgoin @robertgshaw2-redhat @zou3519 @WoosukKwon

Any Other Things.

No response

