[RFC]: Lazy CUDA Graph capture #20098

@lionelvillard

Motivation.

Currently, vLLM captures CUDA graphs during engine initialization, which significantly slows down startup. By default, vLLM captures 67 graphs, which, depending on model size and GPU type, can take more than 10s. This is not great UX (see #19824 for details).

In addition, it is very unlikely that all 67 graphs are actually needed, so capturing them all up front wastes both time and space.

Proposed Change.

We propose capturing CUDA graphs lazily. Instead of performing dummy runs during engine initialization, the idea is to do those runs somewhere in the CUDA piecewise backend, and only for the current runtime shape when no graph for that shape is cached yet.

Exact implementation needs to be worked out.
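
To make the idea concrete, here is a minimal, framework-free sketch of capture-on-first-use keyed by runtime shape. Everything in it is hypothetical and for illustration only: `LazyGraphCache`, `capture_fn`, and `fake_capture` are not vLLM APIs, and `fake_capture` merely stands in for the expensive warm-up run plus graph capture that a real backend would perform.

```python
from typing import Any, Callable, Dict, Hashable


class LazyGraphCache:
    """Capture a graph for a runtime shape on first use, then replay it.

    `capture_fn` is a hypothetical stand-in for the expensive step
    (dummy run plus CUDA graph capture); it is not a vLLM API.
    """

    def __init__(self, capture_fn: Callable[[Hashable], Callable[..., Any]]):
        self._capture_fn = capture_fn
        self._graphs: Dict[Hashable, Callable[..., Any]] = {}
        self.capture_count = 0  # how many captures were actually paid for

    def run(self, shape: Hashable, *args: Any) -> Any:
        graph = self._graphs.get(shape)
        if graph is None:
            # Cache miss: pay the capture cost now, for this shape only.
            graph = self._capture_fn(shape)
            self._graphs[shape] = graph
            self.capture_count += 1
        # Cache hit (or freshly captured): replay the stored graph.
        return graph(*args)


def fake_capture(batch_size: int) -> Callable[[list], list]:
    """Pretend to capture a graph specialized for one batch size."""

    def replay(tokens: list) -> list:
        # A captured graph is only valid for the shape it was captured with.
        assert len(tokens) == batch_size
        return [t * 2 for t in tokens]

    return replay


cache = LazyGraphCache(fake_capture)
cache.run(2, [1, 2])        # first request with batch size 2: captures
cache.run(2, [3, 4])        # same shape: replays the cached graph
cache.run(4, [1, 2, 3, 4])  # new shape: captures a second graph
```

With this scheme, startup pays for no captures at all, and only shapes that actually occur at runtime are ever captured. The trade-off is that the first request for each new shape absorbs the capture latency.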

Feedback Period.

One week

CC List.

@ProExpertProg @aarnphm @charlesfrye

Any Other Things.

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
