[Config] Clean up SchedulerConfig initialization #28665
Conversation
Signed-off-by: DarkLight1337 <[email protected]>
Code Review
This pull request provides a solid cleanup of the SchedulerConfig initialization. Moving the default value logic from SchedulerConfig.__post_init__ to EngineArgs and refactoring it into new helper methods significantly improves modularity and readability. The use of class variables for default values in SchedulerConfig is also a good practice. I've identified one minor logging issue that should be addressed.
Signed-off-by: DarkLight1337 <[email protected]>
/gemini review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Code Review
This pull request is a good refactoring that cleans up SchedulerConfig initialization by centralizing default values and improving the modularity of EngineArgs. The use of a property for chunked_prefill_enabled is a clean way to handle aliasing and maintain backward compatibility.
However, I've identified a critical issue: the refactoring appears to have removed the specific default max_num_batched_tokens logic for pooling and multimodal models. This could lead to a performance regression for these model types. I've added a detailed comment regarding this. Once this is addressed, the PR will be a strong improvement.
The diff hunk under discussion, reproduced here as plain code (it interleaves the old default/capping logic of `SchedulerConfig.__post_init__` with its replacement):

```python
def __post_init__(self, is_encoder_decoder: bool) -> None:
    if self.max_model_len is None:
        self.max_model_len = 8192

    if self.max_num_seqs is None:
        self.max_num_seqs = 128

    if is_encoder_decoder:
        # Chunked prefill should be disabled for encoder-decoder models.
        self.disable_chunked_mm_input = True
        self.chunked_prefill_enabled = False
        self.enable_chunked_prefill = False
        self.long_prefill_token_threshold = 0
        logger.info(
            "Encoder-decoder models do not support chunked prefill nor"
            " prefix caching; disabling both."
        )

    if self.max_num_batched_tokens is None:
        if self.enable_chunked_prefill:
            self.max_num_batched_tokens = DEFAULT_MAX_NUM_BATCHED_TOKENS
        else:
            # If max_model_len is too short, use
            # DEFAULT_MAX_NUM_BATCHED_TOKENS as the default value
            # for higher throughput.
            self.max_num_batched_tokens = max(
                self.max_model_len, DEFAULT_MAX_NUM_BATCHED_TOKENS
            )

        if self.runner_type == "pooling":
            # Choose specific value for higher throughput
            self.max_num_batched_tokens = max(
                self.max_num_batched_tokens,
                POOLING_MODEL_MAX_NUM_BATCHED_TOKENS,
            )
        if self.is_multimodal_model:
            # The value needs to be at least the number of multimodal tokens
            self.max_num_batched_tokens = max(
                self.max_num_batched_tokens,
                MULTIMODAL_MODEL_MAX_NUM_BATCHED_TOKENS,
            )

        # When using default settings,
        # ensure max_num_batched_tokens does not exceed model limit.
        # Some models (e.g., Whisper) have embeddings tied to max length.
        self.max_num_batched_tokens = min(
            self.max_num_seqs * self.max_model_len, self.max_num_batched_tokens
        )

    if not self.enable_chunked_prefill:
        # If max_model_len is too short, use the default for higher throughput.
        self.max_num_batched_tokens = max(
            self.max_model_len,
            self.max_num_batched_tokens,
        )

    # Ensure max_num_batched_tokens does not exceed model limit.
    # Some models (e.g., Whisper) have embeddings tied to max length.
    self.max_num_batched_tokens = min(
        self.max_num_seqs * self.max_model_len,
        self.max_num_batched_tokens,
    )

    self.max_num_encoder_input_tokens = self.max_num_batched_tokens
    self.encoder_cache_size = self.max_num_batched_tokens
```
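To see why the final `min` cap matters, here is a rough arithmetic sketch; the short context length and token budget below are made-up illustration values, not numbers taken from the PR:

```python
# Hypothetical numbers, for illustration only.
max_num_seqs = 8
max_model_len = 448      # assumed short decoder limit, Whisper-style
token_budget = 8192      # assumed default max_num_batched_tokens

# The cap keeps the batch budget within what the model can actually attend to.
capped = min(max_num_seqs * max_model_len, token_budget)
print(capped)  # 3584: the budget is clamped to max_num_seqs * max_model_len
```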
This refactoring simplifies the initialization, but it seems to have removed the special default logic for max_num_batched_tokens for pooling and multimodal models.
Previously, if max_num_batched_tokens was not set by the user or a UsageContext-specific default, there was fallback logic to increase it for pooling models (to 32768) and multimodal models (to 5120) for better throughput. This logic was triggered if max_num_batched_tokens was None when __post_init__ was called.
This logic has now been removed. The justification in the PR description suggests this was dead code, but it appears it would have been triggered if no UsageContext default was found. The new implementation in EngineArgs ensures max_num_batched_tokens is always set, but the specific, higher defaults for pooling/multimodal models are no longer applied anywhere.
Removing this could lead to a significant performance regression for these model types. Could you please confirm if this change is intended? If it's a mistake, this logic should be restored, perhaps within EngineArgs.get_batch_defaults.
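If the fallback does need to be restored, a minimal sketch of the idea follows; the helper name `get_batch_defaults` is taken from the suggestion above, and the function shape is an assumption, not the actual vLLM code:

```python
# Sketch only: the constant values come from the old SchedulerConfig defaults
# mentioned in this thread; the helper signature is hypothetical.
POOLING_MODEL_MAX_NUM_BATCHED_TOKENS = 32768
MULTIMODAL_MODEL_MAX_NUM_BATCHED_TOKENS = 5120


def get_batch_defaults(
    max_num_batched_tokens: int,
    runner_type: str,
    is_multimodal_model: bool,
) -> int:
    """Raise the batched-token budget for model types that benefit from it."""
    if runner_type == "pooling":
        # Pooling models prefer a larger budget for throughput.
        max_num_batched_tokens = max(
            max_num_batched_tokens, POOLING_MODEL_MAX_NUM_BATCHED_TOKENS
        )
    if is_multimodal_model:
        # The budget needs to cover at least the multimodal tokens.
        max_num_batched_tokens = max(
            max_num_batched_tokens, MULTIMODAL_MODEL_MAX_NUM_BATCHED_TOKENS
        )
    return max_num_batched_tokens


# Example: get_batch_defaults(8192, "pooling", False) -> 32768
```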
The case of no UsageContext is not normal usage of vLLM
@njhill @WoosukKwon @robertgshaw2-redhat correct me if I'm wrong about this
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
hmellor left a comment
This is a really nice change; I've left a few comments for clarification.
```python
max_model_len: int = Field(default=8192, ge=1)
"""Maximum length of a sequence (including prompt and generated text).
The default value here is mainly for convenience when testing.
In real usage, this should duplicate `ModelConfig.max_model_len` via
`EngineArgs`."""
```
Could we remove this entirely?
It is used in some other places like vllm.v1.core.sched.Scheduler. We can try to refactor this in another PR.
Yeah this would need a small refactor. A follow up PR sounds good.
```python
@property
def chunked_prefill_enabled(self) -> bool:
    return self.enable_chunked_prefill

@chunked_prefill_enabled.setter
def chunked_prefill_enabled(self, value: bool):
    self.enable_chunked_prefill = value
```
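For context, a self-contained sketch of the aliasing pattern using a stand-in class; `SchedulerConfigLike` is a made-up name used only to keep the example runnable, not the real config:

```python
class SchedulerConfigLike:
    """Minimal stand-in to show the aliasing pattern; not the real SchedulerConfig."""

    def __init__(self) -> None:
        self.enable_chunked_prefill = True

    @property
    def chunked_prefill_enabled(self) -> bool:
        return self.enable_chunked_prefill

    @chunked_prefill_enabled.setter
    def chunked_prefill_enabled(self, value: bool) -> None:
        self.enable_chunked_prefill = value


cfg = SchedulerConfigLike()
cfg.chunked_prefill_enabled = False          # setter writes through to the real field
assert cfg.enable_chunked_prefill is False   # both names see the same underlying value
```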
Can we just remove this? It used to be init=False so it's not part of the normal API of SchedulerConfig
Same as above
```diff
 gpu_memory_utilization: float = CacheConfig.gpu_memory_utilization
 kv_cache_memory_bytes: int | None = CacheConfig.kv_cache_memory_bytes
-max_num_batched_tokens: int | None = SchedulerConfig.max_num_batched_tokens
+max_num_batched_tokens: int | None = None
```
Could you group these fields which don't copy the defaults from their respective config and add a comment saying why?
We can do that in the next cleanup
Signed-off-by: DarkLight1337 <[email protected]>
@DarkLight1337 I think this PR changes the default values for
IIUC, the two if statements here are not executed because Lines 1988 to 1998 in 3380ed5
If the behaviour has changed I don't think it's because of the changed defaults in Line 436 in d4acf51 and Line 440 in d4acf51
Yeah,
@hmellor @DarkLight1337 In B200, the correct default values (1024 seqs, 8K tokens) are not used.
Can you run
@DarkLight1337 It passes the test, but it's probably because the usage context is considered in the test (while it doesn't in
Hmm, shouldn't
@DarkLight1337 Here,
Ok I figured out the issue, the CLI defaults are still using the ones from
Signed-off-by: DarkLight1337 <[email protected]> Signed-off-by: Bram Wasti <[email protected]>
Purpose
- Move the default values related to `SchedulerConfig` from utils to `SchedulerConfig` itself.
- `max_num_batched_tokens` and `max_num_seqs` are actually set by `UsageContext` when constructing `EngineArgs`, so there is no point in making the default values of `SchedulerConfig` dynamic (the defaults for pooling and multimodal models are actually being overwritten). This simplifies the initialization code a lot.
- Refactor `EngineArgs._set_default_args` to be more modular.
- Make `SchedulerConfig.chunked_prefill_enabled` a property-based alias of `SchedulerConfig.enable_chunked_prefill` to avoid having to set both when overriding the config in each platform.

Test Plan
Test Result