vllm/v1/worker/gpu_worker.py (1 addition, 1 deletion)
```diff
@@ -488,7 +488,7 @@ def profile(self, is_start: bool = True):
                 sort_by="self_cuda_time_total"))
 
     def execute_dummy_batch(self) -> None:
-        self.model_runner._dummy_run(1, uniform_decode=True)
+        self.model_runner._dummy_run(16, uniform_decode=True)
```
Contributor

Severity: high

While changing 1 to 16 likely fixes the hang on B200, the value 16 is a magic number. To improve code clarity and maintainability, it would be better to define this as a named constant with a comment explaining why this specific value is necessary for the fix.

For example, you could define a constant at the top of the file:

```python
# This value is chosen to ensure sufficient workload on all ranks to avoid
# hangs with Data Parallelism and Expert Parallelism on B200 hardware.
# See the associated pull request for more details.
_DUMMY_BATCH_TOKENS_FOR_B200_FIX = 16
```

And then use it here:

```python
self.model_runner._dummy_run(_DUMMY_BATCH_TOKENS_FOR_B200_FIX, uniform_decode=True)
```

This would make the code's intent much clearer to future maintainers. Also, please consider updating the PR description with the details as mentioned in the TODO.


```diff
     def add_lora(self, lora_request: LoRARequest) -> bool:
         return self.model_runner.add_lora(lora_request)
```
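For reference, here is a minimal sketch of what the suggested refactor could look like once applied. The constant name `_DUMMY_BATCH_TOKENS_FOR_B200_FIX` is the reviewer's proposal (it does not exist in vLLM today), and the stripped-down `Worker` class below is an illustrative stand-in for the real `vllm/v1/worker/gpu_worker.py`, not the actual implementation:

```python
# Illustrative sketch only: the constant name comes from the review comment
# above, and this simplified Worker is a stand-in for vLLM's real
# vllm/v1/worker/gpu_worker.py.

# Chosen to ensure sufficient workload on all ranks so that idle
# data-parallel ranks do not hang during Expert Parallelism collectives
# on B200 hardware. See the associated pull request for details.
_DUMMY_BATCH_TOKENS_FOR_B200_FIX = 16


class Worker:
    def __init__(self, model_runner) -> None:
        # model_runner is assumed to expose a _dummy_run(num_tokens, ...)
        # method, as in vLLM's GPU model runner.
        self.model_runner = model_runner

    def execute_dummy_batch(self) -> None:
        # Idle ranks still execute a dummy decode batch so that every rank
        # participates in the collective ops issued by busy ranks.
        self.model_runner._dummy_run(
            _DUMMY_BATCH_TOKENS_FOR_B200_FIX, uniform_decode=True
        )
```

The named constant keeps the B200-specific rationale next to the value itself, so a future change to the dummy-batch size has to confront the comment explaining why 16 was chosen.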