- 
          
- 
                Notifications
    You must be signed in to change notification settings 
- Fork 10.8k
[EPLB] Support EPLB for Mixtral Model #22842
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: rouchenzi <[email protected]>
| 👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run  Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add  🚀 | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request adds support for Expert Parallelism Load Balancing (EPLB) for the Mixtral model. The changes involve updating the Mixtral model implementation to conform to the MixtureOfExperts interface, which is necessary for EPLB integration. This includes modifications to weight loading logic to handle redundant experts and methods to manage EPLB state. The implementation is largely correct, but I've identified a couple of areas for improvement to enhance robustness and future extensibility.
| def set_eplb_state( | ||
| self, | ||
| expert_load_view: torch.Tensor, | ||
| logical_to_physical_map: torch.Tensor, | ||
| logical_replica_count: torch.Tensor, | ||
| ) -> None: | ||
| for layer_idx, layer in enumerate(self.moe_layers): | ||
| # Register the expert weights. | ||
| self.expert_weights.append(layer.get_expert_weights()) | ||
| layer.set_eplb_state( | ||
| moe_layer_idx=layer_idx, | ||
| expert_load_view=expert_load_view, | ||
| logical_to_physical_map=logical_to_physical_map, | ||
| logical_replica_count=logical_replica_count, | ||
| ) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The self.expert_weights list is appended to in a loop. Since it's only initialized once in __init__, multiple calls to set_eplb_state would cause the list to grow indefinitely, leading to incorrect behavior and a memory leak. While the current implementation of the EPLB scheduler calls this method only once, it's safer to make this method idempotent by clearing the list before populating it.
| def set_eplb_state( | |
| self, | |
| expert_load_view: torch.Tensor, | |
| logical_to_physical_map: torch.Tensor, | |
| logical_replica_count: torch.Tensor, | |
| ) -> None: | |
| for layer_idx, layer in enumerate(self.moe_layers): | |
| # Register the expert weights. | |
| self.expert_weights.append(layer.get_expert_weights()) | |
| layer.set_eplb_state( | |
| moe_layer_idx=layer_idx, | |
| expert_load_view=expert_load_view, | |
| logical_to_physical_map=logical_to_physical_map, | |
| logical_replica_count=logical_replica_count, | |
| ) | |
| def set_eplb_state( | |
| self, | |
| expert_load_view: torch.Tensor, | |
| logical_to_physical_map: torch.Tensor, | |
| logical_replica_count: torch.Tensor, | |
| ) -> None: | |
| self.expert_weights.clear() | |
| for layer_idx, layer in enumerate(self.moe_layers): | |
| # Register the expert weights. | |
| self.expert_weights.append(layer.get_expert_weights()) | |
| layer.set_eplb_state( | |
| moe_layer_idx=layer_idx, | |
| expert_load_view=expert_load_view, | |
| logical_to_physical_map=logical_to_physical_map, | |
| logical_replica_count=logical_replica_count, | |
| ) | 
| num_physical_experts: int, | ||
| num_local_physical_experts: int, | ||
| ) -> None: | ||
| assert self.num_local_physical_experts == num_local_physical_experts | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This assertion enforces that num_local_physical_experts cannot change. However, the method signature and the broader context of EPLB (which can involve dynamic scaling) suggest that this value might need to be updated. If the number of GPUs in the expert parallel group changes, this assertion will fail, preventing dynamic scaling. If this value is not expected to change, the parameter is redundant. This assertion seems overly restrictive and may prevent future features like dynamic scaling.
| Hey @abmfy this PR supports EPLB for Mixtral, tested E2E workflow with Mixtral-8x7B, please let me know if more tests suggested, thanks! | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just very tiny changes to reflect the new config interface. I'll do an accuracy test and if that passes we're good to go.
Thanks for the contribution!
| Accuracy tests passed: w/o EPLB: 
 w/ EPLB: 
 | 
| This pull request has merge conflicts that must be resolved before it can be | 
Signed-off-by: rouchenzi <[email protected]>
Co-authored-by: Bowen Wang <[email protected]> Signed-off-by: rouchenzi <[email protected]>
Co-authored-by: Bowen Wang <[email protected]> Signed-off-by: rouchenzi <[email protected]>
| Thanks @abmfy for reviewing and helping with accuracy test! Resolved conflicts and updated change with suggestion, please take a look whenever free. | 
Signed-off-by: rouchenzi <[email protected]>
| LGTM! | 
Signed-off-by: rouchenzi <[email protected]> Signed-off-by: rouchenzi <[email protected]> Co-authored-by: Bowen Wang <[email protected]>
Signed-off-by: rouchenzi <[email protected]> Signed-off-by: rouchenzi <[email protected]> Co-authored-by: Bowen Wang <[email protected]> Signed-off-by: charlifu <[email protected]>
Signed-off-by: rouchenzi <[email protected]> Signed-off-by: rouchenzi <[email protected]> Co-authored-by: Bowen Wang <[email protected]> Signed-off-by: xuebwang-amd <[email protected]>
Signed-off-by: rouchenzi <[email protected]> Signed-off-by: rouchenzi <[email protected]> Co-authored-by: Bowen Wang <[email protected]>
Signed-off-by: rouchenzi <[email protected]> Signed-off-by: rouchenzi <[email protected]> Co-authored-by: Bowen Wang <[email protected]> Signed-off-by: xuebwang-amd <[email protected]>
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.Purpose
#20468 Support EPLB for Mixtral Model
Test Plan
Running the following test script
python test_eplb.py --mode eplb
python test_eplb.py --mode normal
Test Result
Result w/ verses w/o EPLB looks consistent given the example prompt tested
Example eplb_log_balancedness result
(Optional) Documentation Update