vllm/config.py (16 changes: 12 additions & 4 deletions)

@@ -2130,11 +2130,12 @@ class SchedulerConfig:
     NOTE: This will be replaced by speculative config in the future; it is
     present to enable correctness tests until then."""

-    cuda_graph_sizes: list[int] = field(default_factory=lambda: [512])
-    """Cuda graph capture sizes, default is 512.
-    1. if one value is provided, then the capture list would follow the
+    cuda_graph_sizes: list[int] = field(default_factory=list)
+    """Cuda graph capture sizes
+    1. if none provided, then default set to [min(max_num_seqs * 2, 512)]
+    2. if one value is provided, then the capture list would follow the
     pattern: [1, 2, 4] + [i for i in range(8, cuda_graph_sizes + 1, 8)]

Comment on lines +2136 to 2137
Contributor

Severity: medium

The docstring for cuda_graph_sizes appears to have a small error. When one value is provided, the example pattern uses cuda_graph_sizes directly in the range function, but since cuda_graph_sizes is a list, this would raise a TypeError. The implementation correctly uses cuda_graph_sizes[0], so the docstring should be updated to match for clarity.

Suggested change:
-    2. if one value is provided, then the capture list would follow the
-    pattern: [1, 2, 4] + [i for i in range(8, cuda_graph_sizes + 1, 8)]
+    2. if one value is provided, then the capture list would follow the
+    pattern: [1, 2, 4] + [i for i in range(8, cuda_graph_sizes[0] + 1, 8)]
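
For illustration, the failure the comment describes can be reproduced in isolation; the values below are made up for the demo:

    >>> cuda_graph_sizes = [16]
    >>> [1, 2, 4] + [i for i in range(8, cuda_graph_sizes + 1, 8)]
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: can only concatenate list (not "int") to list
    >>> [1, 2, 4] + [i for i in range(8, cuda_graph_sizes[0] + 1, 8)]
    [1, 2, 4, 8, 16]

The error is raised while evaluating the range arguments, before range() is even called, since cuda_graph_sizes + 1 attempts to add an int to a list.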

-    2. more than one value (e.g. 1 2 128) is provided, then the capture list
+    3. more than one value (e.g. 1 2 128) is provided, then the capture list
     will follow the provided list."""

     delay_factor: float = 0.0
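
Taken together, the new docstring describes three cases. A minimal sketch of the documented behavior follows; derive_capture_sizes is a hypothetical helper written for this note, not vLLM's actual implementation:

    def derive_capture_sizes(cuda_graph_sizes: list[int],
                             max_num_seqs: int) -> list[int]:
        # Case 1: nothing provided -- the __post_init__ default applies.
        if not cuda_graph_sizes:
            cuda_graph_sizes = [min(max_num_seqs * 2, 512)]
        # Case 2: one value -- expand into [1, 2, 4] plus multiples of 8.
        if len(cuda_graph_sizes) == 1:
            return [1, 2, 4] + [i for i in range(8, cuda_graph_sizes[0] + 1, 8)]
        # Case 3: several values -- the capture list follows the provided list.
        return cuda_graph_sizes

    print(derive_capture_sizes([], max_num_seqs=16))           # [1, 2, 4, 8, 16, 24, 32]
    print(derive_capture_sizes([1, 2, 128], max_num_seqs=16))  # [1, 2, 128]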

@@ -2299,6 +2300,13 @@ def __post_init__(self) -> None:
             self.max_num_partial_prefills, self.max_long_partial_prefills,
             self.long_prefill_token_threshold)

+        # NOTE: Default set cuda_graph_sizes to [min(max_num_seqs * 2, 512)].
+        # This avoids OOM in tight memory scenarios with small max_num_seqs,
+        # and prevents capture of many large graphs (>512) that would greatly
+        # increase startup time with limited performance benefit.
+        if not self.cuda_graph_sizes:
+            self.cuda_graph_sizes = [min(self.max_num_seqs * 2, 512)]

     @model_validator(mode='after')
     def _verify_args(self) -> Self:
         if (self.max_num_batched_tokens < self.max_model_len
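
As a quick sanity check of the new default, the fallback rule from __post_init__ can be exercised standalone; default_cuda_graph_sizes is an illustrative name, not a function in the PR:

    def default_cuda_graph_sizes(max_num_seqs: int) -> list[int]:
        # Scale with max_num_seqs so tight-memory deployments capture small
        # graphs, but cap at 512 to bound capture time at startup.
        return [min(max_num_seqs * 2, 512)]

    assert default_cuda_graph_sizes(8) == [16]      # small max_num_seqs, small graph
    assert default_cuda_graph_sizes(128) == [256]
    assert default_cuda_graph_sizes(1024) == [512]  # capped at 512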