
[RFC][UX][torch.compile][CUDAGraph]: Overhaul CompilationConfig and improve CLI -O<n> #20283

@ProExpertProg

Description


tl;dr: Improve the user experience around compilation and cudagraph capture by consolidating/overhauling CompilationConfig and defining more meaningful optimization levels -O0, -O1, -O2, -O3 (and maybe more).

Motivation.

CompilationConfig was born around December 2024 to enable configuring torch.compile-based compilation and piecewise cudagraph capture. Since then, many more flags were added to support new features, each reasonable in isolation but added without a cohesive plan. As vLLM aims to provide great performance out-of-the-box, having to manually configure a bunch of flags is bad UX.

CompilationConfig currently serves as both the user-facing and the compiler-facing compilation configuration mechanism: CLI/Python API users use it to control compilation, and other parts of the codebase (model runner, vllm config, etc.) read it directly. This has the benefit of good UX for developers who want to control compilation from the CLI and Python, but the downside of an awkward second state where defaults are inspected and adjusted after the user's input. V1 handled this very poorly: a bunch of settings were hardcoded, making them impossible to change from the CLI.

Additionally, compilation levels 0-3 are currently not very intuitive, and levels 1 and 2 are only meant for internal use. Instead, the convenient -O<n> flag should select an optimization level, and compilation levels should be adjusted to reflect actual uses.

Finally, there are concerns around vLLM startup time (#19824), and having the optimization levels -O<n> progressively trade startup cost for performance seems like another improvement to startup UX.

Proposed Change.

I am proposing an overhaul of many CompilationConfig fields. I've put them all into one RFC as some are very related, but they can be done as separate PRs.

Repurpose -O for optimization level

I propose we start with 4 optimization levels, 0 through 3. Exact settings here should be determined later, but they could go something like this:

  • -O0: No optimization. Pretty much equivalent to --enforce-eager: no compilation, no cudagraphs, no other optimization, just starting up immediately.
  • -O1: Quick optimizations. Dynamo+Inductor compilation but no cudagraphs (or maybe lazy cudagraphs: [RFC]: Lazy CUDA Graph capture #20098).
  • -O2: Full optimizations. -O1 plus cudagraphs. This would be the default, and is most similar to the current default settings.
  • -O3: Full (auto)tuning. -O2 plus max-autotune, compiling for additional static sizes, etc. - any other time-consuming optimizations.

These levels trade startup time cost for performance, with -O0 having the best startup time and -O3 having the best performance. We can decide exact settings for each level after more in-depth benchmarking as proposed in #19824.
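To make the trade-off concrete, the level-to-flags mapping could be sketched as a simple preset table. This is only an illustration: the field names (compile, cudagraphs, max_autotune) are hypothetical stand-ins for the fine-grained flags, and the exact per-level settings would be finalized after the benchmarking in #19824.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OptPreset:
    """Illustrative bundle of fine-grained flags behind one -O<n> level."""
    compile: bool       # Dynamo+Inductor compilation
    cudagraphs: bool    # capture cudagraphs
    max_autotune: bool  # expensive Inductor autotuning


# Hypothetical presets mirroring the proposed -O0..-O3 descriptions above.
PRESETS = {
    0: OptPreset(compile=False, cudagraphs=False, max_autotune=False),  # ~ --enforce-eager
    1: OptPreset(compile=True,  cudagraphs=False, max_autotune=False),  # quick optimizations
    2: OptPreset(compile=True,  cudagraphs=True,  max_autotune=False),  # default
    3: OptPreset(compile=True,  cudagraphs=True,  max_autotune=True),   # full (auto)tuning
}


def resolve(level: int) -> OptPreset:
    """Look up the flag bundle for an optimization level."""
    return PRESETS[level]
```

A user who relies on a specific behavior (say, cudagraphs at -O1) would then override the individual flag rather than depending on the level's contents staying fixed.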

While we should make sure each level is just a combination of fine-grained flags, I also believe we should not commit to keeping each level's contents fixed, for better flexibility. Users who rely on certain features can specify them manually. That said, users may come to rely on features being present at each level either way, so that should be considered.

I also propose --enforce-eager is deprecated, becoming equivalent to -O0. We can remove it later or keep it around.

Rename compilation level to mode

Because -O<n> now means optimization level and not compilation level, I propose renaming CompilationLevel to CompilationMode. This is mostly used by developers, specifically by Meta to debug vLLM's torch.compile integration, and the interface should better reflect the use. I propose the following "modes":

  • CompilationMode.NONE (same as current NO_COMPILATION)
  • CompilationMode.STOCK_TORCH_COMPILE (same as current DYNAMO_AS_IS except with Inductor by default). This can be useful to isolate vLLM custom compilation issues from torch.compile issues. Looking for better name suggestions.
  • CompilationMode.DYNAMO_TRACE_ONCE (same as current DYNAMO_ONCE)
  • CompilationMode.VLLM_COMPILE (same as current PIECEWISE)
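The renamed enum could look roughly like this. A minimal sketch: the numeric values are assumed to mirror the current CompilationLevel integers so that existing integer-valued configs still map cleanly.

```python
import enum


class CompilationMode(enum.IntEnum):
    """Sketch of the proposed rename; values assumed to match CompilationLevel."""
    NONE = 0                 # was NO_COMPILATION
    STOCK_TORCH_COMPILE = 1  # was DYNAMO_AS_IS, now with Inductor by default
    DYNAMO_TRACE_ONCE = 2    # was DYNAMO_ONCE
    VLLM_COMPILE = 3         # was PIECEWISE
```

Using IntEnum keeps comparisons like `mode == 3` working during any transition window.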

Other changes to compilation controls

❌ means removal, ✏ means change, 🌱 means addition

  • ❌ use_inductor: this is fully redundant with backend.
  • ✏ backend: this is currently not respected for compilation mode (level) 3 (PIECEWISE), where use_inductor is used in its place. We can instead use this field and make mode 3 respect it. There are currently no uses for custom backends with the vLLM backend, so we can disallow custom backends (only allow "inductor" and "eager"/"") for mode 3. If a use case is needed in the future, this can be extended. "inductor" becomes the default for this field.
  • 🌱 debug_mode: bool - add additional checks to validate compilation & cudagraphs are running correctly. This could be shape checks for VLLM_COMPILE, cudagraph address checks, and more. Currently cudagraph addresses are checked if VLLM_LOGGING_LEVEL=DEBUG, but I think this would be better done explicitly. Open to name suggestions, and thanks to @zou3519 for the proposal! More details in [RFC][UX]: debug mode for vLLM-compile #20394.
  • ❌ use_cudagraph and full_cuda_graph: these are replaced with cudagraph_mode.
  • 🌱 cudagraph_mode: enum of type CUDAGraphMode with options NONE, FULL, PIECEWISE, later adding FULL_AND_PIECEWISE and AUTO. PIECEWISE obviously requires compilation mode VLLM_COMPILE. FULL_AND_PIECEWISE is for attention backends that only support cudagraphs in attention for some requests. AUTO can be used to mean FULL if supported, otherwise FULL_AND_PIECEWISE, otherwise PIECEWISE. This is assuming we want full cudagraphs by default when enabled (not yet confirmed that's the case). [Core] Allow full cudagraph with separate attention routines and orthogonal to compilation, add support for FA2 and FlashInfer #20059 will add this enum for cudagraph execution, as well as the ability to run cudagraphs (only full) without any compilation. We can simply extend the enum and use it here.
  • ✏ custom_ops: the default behavior for custom ops currently depends on use_inductor (being removed) and CompilationLevel (being renamed). Instead, this field should be the single source of truth for custom ops, and we can set it to "all" or "none" as part of config initialization (allowing user-specified values to override).
  • ✏ cudagraph_capture_sizes: these are currently reversed, only for the model runner to unreverse them and then reverse them again. They can simply be sorted ascending, and the model runner can iterate in reverse during capture.
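The cudagraph_mode resolution and the capture-size ordering above could be sketched as follows. The AUTO fallback order follows the bullet points; the function names and the capability flags are assumptions for illustration.

```python
import enum


class CUDAGraphMode(enum.Enum):
    """Sketch of the proposed enum; FULL_AND_PIECEWISE and AUTO added later."""
    NONE = "none"
    FULL = "full"
    PIECEWISE = "piecewise"
    FULL_AND_PIECEWISE = "full_and_piecewise"
    AUTO = "auto"


def resolve_auto(full_supported: bool, mixed_supported: bool) -> CUDAGraphMode:
    # AUTO means: FULL if supported, otherwise FULL_AND_PIECEWISE,
    # otherwise PIECEWISE (per the proposal above).
    if full_supported:
        return CUDAGraphMode.FULL
    if mixed_supported:
        return CUDAGraphMode.FULL_AND_PIECEWISE
    return CUDAGraphMode.PIECEWISE


def normalize_capture_sizes(sizes: list[int]) -> list[int]:
    """Store capture sizes ascending once, deduplicated."""
    return sorted(set(sizes))


# The model runner captures largest-first by iterating in reverse,
# instead of the current reverse/unreverse/reverse dance.
capture_order = list(reversed(normalize_capture_sizes([1, 8, 4, 2, 8])))
```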

Unchanged fields:

For visibility, below is the list of other fields this RFC does not seek to address. Please let me know if you think any of these fields should be adjusted as part of this RFC:

  • debug_dump_path: str = ""
  • cache_dir: str = ""
  • splitting_ops: list[str] = []
  • compile_sizes: Optional[list[Union[int, str]]] = None
  • inductor_compile_config: dict = {}
  • inductor_passes: dict[str, str] = {}
  • cudagraph_num_of_warmups: int = 0
  • cudagraph_capture_sizes: Optional[list[int]] = None
  • cudagraph_copy_inputs: bool = False
  • pass_config: PassConfig = PassConfig()
  • all fields that are excluded from __init__

Enabling logic

There are a lot of fields whose defaults depend on the values of other fields or on the platform. Those fields should be uninitialized/None by default so that we can distinguish between a value set explicitly from the CLI/Python and the default. For example, splitting_ops is an empty list by default, but in V1 piecewise compilation it's set to the attention ops, and it's left empty otherwise. After #20059, splitting ops will not be required when full cudagraphs are enabled, so the user must be able to override it.

This logic is currently scattered around config.py and some other places; we should make sure it's consolidated inside a single function, likely VllmConfig.__post_init__.
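One way to make the "unset vs. explicitly set" distinction concrete, using splitting_ops as the example. This is a sketch under assumptions: the op name is a placeholder, and the resolution conditions are simplified stand-ins for the real platform/config logic that would live in VllmConfig.__post_init__.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CompilationConfigSketch:
    # None means "not set by the user"; an explicit [] is a valid override.
    splitting_ops: Optional[list[str]] = None


def post_init_resolve(cfg: CompilationConfigSketch,
                      piecewise: bool,
                      full_cudagraphs: bool) -> None:
    """Consolidated default resolution (sketch of VllmConfig.__post_init__)."""
    if cfg.splitting_ops is None:
        if piecewise and not full_cudagraphs:
            # Placeholder op name, not the real attention op list.
            cfg.splitting_ops = ["vllm.unified_attention"]
        else:
            cfg.splitting_ops = []
    # else: user set it explicitly (even to []) -- leave it alone.
```

Because None and [] are distinguishable, a user-supplied empty list survives resolution instead of being silently replaced by the defaults.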

Sunsetting period

I believe that these are not user-facing enough to warrant standard deprecation procedures. Instead, I propose we perform the changes (including the swap from CompilationLevel to CompilationMode) in a single release. I believe that would be less painful than trying to support both at the same time. We would add explicit error messages about the removal of level etc. instead of JSON parsing errors. I know that is a bold stance, so please give feedback on it in the comments.
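For the hard swap, config parsing could special-case the removed keys so users get an actionable error instead of an opaque JSON/validation failure. A sketch; the key set and error text are illustrative, not a final API.

```python
# Hypothetical map of removed CompilationConfig keys to migration hints.
REMOVED_KEYS = {
    "level": "use 'mode' (CompilationMode) or the -O<n> optimization level instead",
    "use_inductor": "use 'backend' instead",
    "use_cudagraph": "use 'cudagraph_mode' instead",
    "full_cuda_graph": "use 'cudagraph_mode' instead",
}


def check_removed_keys(raw: dict) -> None:
    """Raise a clear, actionable error when a removed field is supplied."""
    for key, hint in REMOVED_KEYS.items():
        if key in raw:
            raise ValueError(
                f"CompilationConfig field '{key}' was removed: {hint}")
```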

Alternatively, we could deprecate level (mapping it to mode), create optimization_level and mode fields, and remove level in a later release.

Out of scope for this RFC

  • Moving cudagraph capture config out of CompilationConfig
  • Configuration oracle to replace current platform-dependent configuration

Feedback Period.

10 days, 6/30-7/9. I want to try to address this before my summer vacation 7/18-8/8

CC List.

@youkaichao @simon-mo @mgoin @robertgshaw2-redhat @zou3519 @WoosukKwon

Any Other Things.

No response

