Conversation

alexm-redhat (Collaborator)

This PR fixes a hang with DP+EP (Data Parallelism + Expert Parallelism) on DSR1 (DeepSeek-R1) for B200. (TODO: add more details.)

Signed-off-by: Alexander Matveev <[email protected]>
@gemini-code-assist (Contributor, bot) left a comment

Code Review

This pull request addresses a hang issue on B200 GPUs when using Data Parallelism and Expert Parallelism by increasing the number of tokens in a dummy batch run from 1 to 16. While this change likely resolves the immediate problem, the use of the magic number 16 introduces a maintainability concern. I've suggested replacing it with a named constant and adding a comment to explain its purpose, which will make the code clearer for future developers.


def execute_dummy_batch(self) -> None:
-    self.model_runner._dummy_run(1, uniform_decode=True)
+    self.model_runner._dummy_run(16, uniform_decode=True)
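
For context on why a dummy run exists at all: with DP+EP, the expert layers perform collective communication, so every data-parallel rank must execute a forward pass each step even when it has no real requests to serve; otherwise the ranks that do have work block inside the collective. A minimal sketch of that pattern (the driver loop, engine_step, and execute_model_step are hypothetical names for illustration; only execute_dummy_batch comes from the diff above):

def engine_step(worker, has_real_work: bool) -> None:
    # Hypothetical per-step driver loop, for illustration only.
    if has_real_work:
        worker.execute_model_step()  # hypothetical; enters the EP collectives
    else:
        # Idle ranks must still join the collectives, hence the dummy batch.
        worker.execute_dummy_batch()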

Severity: high

While changing 1 to 16 likely fixes the hang on B200, the value 16 is a magic number. To improve code clarity and maintainability, it would be better to define this as a named constant with a comment explaining why this specific value is necessary for the fix.

For example, you could define a constant at the top of the file:

# This value is chosen to ensure sufficient workload on all ranks to avoid
# hangs with Data Parallelism and Expert Parallelism on B200 hardware.
# See the associated pull request for more details.
_DUMMY_BATCH_TOKENS_FOR_B200_FIX = 16

And then use it here:

self.model_runner._dummy_run(_DUMMY_BATCH_TOKENS_FOR_B200_FIX, uniform_decode=True)

This would make the code's intent much clearer to future maintainers. Also, please consider updating the PR description with the details as mentioned in the TODO.
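
Putting the two pieces of the suggestion together, the result would look roughly like this (a sketch, not the PR's actual code; the enclosing class name is assumed, while the constant and the call come from the comments above):

# This value is chosen to ensure sufficient workload on all ranks to avoid
# hangs with Data Parallelism and Expert Parallelism on B200 hardware.
# See the associated pull request for more details.
_DUMMY_BATCH_TOKENS_FOR_B200_FIX = 16

class Worker:  # assumed enclosing class; only the method is from the diff
    def execute_dummy_batch(self) -> None:
        self.model_runner._dummy_run(
            _DUMMY_BATCH_TOKENS_FOR_B200_FIX, uniform_decode=True
        )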
