Motivation.
Currently, vLLM captures CUDA graphs as part of engine initialization, which significantly slows down vLLM startup. By default, vLLM captures 67 graphs, which, depending on model size and GPU type, can take more than 10 seconds. This is not great UX (see #19824 for details).
In addition, it is unlikely that all 67 graphs are actually needed, so both time and memory are wasted.
Proposed Change.
We propose capturing CUDA graphs lazily. Instead of performing dummy runs during engine initialization, the idea is to perform them in the CUDA piecewise backend, and only for the current runtime shape if a graph for that shape has not been cached already (see the sketch below).
The exact implementation still needs to be worked out.
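A minimal sketch of the idea, assuming a per-shape cache keyed by batch size and using PyTorch's `torch.cuda.CUDAGraph` API directly; the `LazyCudaGraphRunner` class and its integration point are hypothetical and not the actual piecewise backend code:

```python
# Hypothetical sketch of lazy, per-shape CUDA graph capture; names are
# illustrative, not the actual vLLM piecewise backend API.
from typing import Callable, Dict, Tuple

import torch


class LazyCudaGraphRunner:
    """Captures a CUDA graph for a batch shape the first time it is seen."""

    def __init__(self, runnable: Callable[[torch.Tensor], torch.Tensor]):
        self.runnable = runnable
        # Maps a batch size (the runtime shape) to its captured graph plus
        # the static input/output buffers the graph reads from and writes to.
        self.graphs: Dict[int, Tuple[torch.cuda.CUDAGraph,
                                     torch.Tensor, torch.Tensor]] = {}

    def __call__(self, x: torch.Tensor) -> torch.Tensor:
        batch_size = x.shape[0]
        if batch_size not in self.graphs:
            # First time this shape is seen: do the dummy run and capture now,
            # instead of eagerly for all shapes at engine initialization.
            self._capture(batch_size, x)
        graph, static_in, static_out = self.graphs[batch_size]
        static_in.copy_(x)   # Copy live inputs into the captured buffer.
        graph.replay()       # Replay the pre-recorded kernel sequence.
        return static_out.clone()

    def _capture(self, batch_size: int, example: torch.Tensor) -> None:
        static_in = example.clone()
        # Warm-up (dummy) run on a side stream before capture, as required
        # by CUDA graph capture semantics.
        s = torch.cuda.Stream()
        s.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(s):
            self.runnable(static_in)
        torch.cuda.current_stream().wait_stream(s)

        graph = torch.cuda.CUDAGraph()
        with torch.cuda.graph(graph):
            static_out = self.runnable(static_in)
        self.graphs[batch_size] = (graph, static_in, static_out)
```

In this sketch the first request at a given batch size pays the capture cost, and subsequent requests at that size only replay the graph, so startup captures nothing and only shapes that actually occur at runtime are ever captured.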
Feedback Period.
one week
CC List.
@ProExpertProg @aarnphm @charlesfrye
Any Other Things.
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.