[RFC]: Lazy CUDA Graph capture

### Motivation.

Currently, vLLM captures CUDA graphs during engine initialization, which significantly slows down startup. By default, vLLM captures 67 graphs, which, depending on model size and GPU type, can take more than 10s. This is not great UX (see #19824 for details).

In addition, it is unlikely that all 67 graphs are actually needed, so capturing them all up front wastes both time and space.

### Proposed Change.

We propose to capture CUDA graphs lazily. Instead of performing dummy runs for every capture size during the engine initialization phase, the idea is to do those runs somewhere in the CUDA piecewise backend, and only for the current runtime shape, if a graph for that shape is not already cached.

The exact implementation still needs to be worked out.
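
To make the idea concrete, here is a minimal sketch of capture-on-first-use, keyed by runtime shape. It is not vLLM code: `LazyGraphCache`, `capture_fn`, and `fake_capture` are hypothetical names made up for illustration, and the actual warm-up and CUDA graph capture are replaced by a stand-in so the snippet runs without a GPU.

```python
from typing import Callable, Dict


class LazyGraphCache:
    """Hypothetical helper: captures a graph for a shape on first use only."""

    def __init__(self, capture_fn: Callable[[int], Callable[[], object]]):
        # capture_fn stands in for the expensive warm-up run plus CUDA graph
        # capture for one batch size. It returns a callable that replays
        # the captured graph.
        self._capture_fn = capture_fn
        self._graphs: Dict[int, Callable[[], object]] = {}
        self.num_captures = 0  # counts how often we paid the capture cost

    def run(self, batch_size: int) -> object:
        replay = self._graphs.get(batch_size)
        if replay is None:
            # First time this shape is seen at runtime: capture and cache it.
            replay = self._capture_fn(batch_size)
            self._graphs[batch_size] = replay
            self.num_captures += 1
        # Cached (or freshly captured) graph is replayed.
        return replay()


def fake_capture(batch_size: int) -> Callable[[], object]:
    # Assumption: placeholder for the real capture, which would record the
    # piecewise graph for this batch size.
    return lambda: f"replayed graph for batch_size={batch_size}"


cache = LazyGraphCache(fake_capture)
cache.run(8)   # captures for batch size 8
cache.run(8)   # cache hit, no new capture
cache.run(16)  # captures for batch size 16
print(cache.num_captures)
```

With this shape of design, startup pays no capture cost at all. The first request for each shape pays it instead, so a real implementation would have to weigh that first-request latency against the faster startup.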

### Feedback Period.

One week.

### CC List.

@ProExpertProg @aarnphm @charlesfrye  

### Any Other Things.

_No response_

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.
