[Bugfix] Improve GPU validation logging in Ray fallback scenarios #25775
Adds early GPU count validation and clearer Ray placement error messages when tensor_parallel_size exceeds available GPUs to address poor logging and help users diagnose K8s deployment failures.
Related Issues
Fixes #25263
Purpose
Fixes poor logging when tensor_parallel_size exceeds available GPUs in Ray fallback scenarios.
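The mismatch behind this fix comes down to a single comparison between the requested tensor parallel size and the number of GPUs visible on the node. The following is a minimal sketch of that kind of early check, not the code in this PR: `check_tensor_parallel_fits`, its parameters, the logger name, and the message wording are all hypothetical, and the GPU count is passed in as a plain integer rather than queried from CUDA or Ray.

```python
import logging
from typing import Optional

# Hypothetical logger name, used only for this illustration.
logger = logging.getLogger("tp_gpu_check")


def check_tensor_parallel_fits(tensor_parallel_size: int,
                               available_gpus: int) -> Optional[str]:
    """Warn when the requested tensor parallel size exceeds the GPU count.

    Returns the warning text when there is a mismatch, otherwise None.
    Both arguments are plain integers supplied by the caller; this sketch
    does not detect GPUs itself.
    """
    if tensor_parallel_size <= available_gpus:
        return None

    # Illustrative wording only; the real messages live in the PR diff.
    message = (
        f"tensor_parallel_size={tensor_parallel_size} exceeds the "
        f"{available_gpus} GPU(s) available on this node. A fallback to "
        "the Ray executor may then wait on a placement group that can "
        "never be scheduled."
    )
    logger.warning(message)
    return message


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    check_tensor_parallel_fits(4, 1)  # mismatch: logs a warning
    check_tensor_parallel_fits(1, 1)  # fits: logs nothing
```

Returning the message as well as logging it keeps the check easy to assert on in a unit test without capturing log output.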
When tensor_parallel_size is set higher than the available GPU count (e.g., tensor_parallel_size=4 with only 1 GPU), vLLM silently falls back to the Ray executor without adequate warning. This causes confusing error messages in K8s deployments, where users see Ray placement group timeout errors without understanding the root cause.
Changes Made
vllm/config/parallel.py: Added a warning when the tensor parallel size exceeds the available GPUs during backend selection
vllm/executor/ray_utils.py: Improved error messages in the _wait_until_pg_ready() and initialize_ray_cluster() functions to provide context about GPU resource mismatches
Files Modified
vllm/config/parallel.py - Added GPU count validation with clear warnings
vllm/executor/ray_utils.py - Enhanced Ray placement group error handling
Test Plan
Scenario Testing
--tensor-parallel-size 4 on a system with only 1 available GPU
Test Commands
Functional Testing
Test Result
Before Fix
After Fix
Validation Results