- 
          
- 
                Notifications
    You must be signed in to change notification settings 
- Fork 10.8k
(deprecated) [V1] Support DP with Ray #18233
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| 👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run  Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add  🚀 | 
f0963fb    to
    a7b8b9a      
    Compare
  
    There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left some high level critical questions:
        
          
                vllm/v1/utils.py
              
                Outdated
          
        
      There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also need to use placement group to control the correct placement of local vs. remote right ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's right, this is the first follow up mentioned in the description
        
          
                vllm/v1/engine/core.py
              
                Outdated
          
        
      There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason is that gpu is not set on .remote() decorator call on the actor, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's right
        
          
                vllm/v1/engine/core.py
              
                Outdated
          
        
      There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok so intuitively when we use ray we should not need to specify --data-parallel-address 10.99.48.128 --data-parallel-rpc-port 13345 in the input command. This should be automatically picked with ray utils and knowlege about the cluster?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to basically set the following env vars somewhere, don't we? so that the stateless_dp_init_group function can create the process groups among the ray workers.
            self.data_parallel_size = envs.VLLM_DP_SIZE
            self.data_parallel_rank = envs.VLLM_DP_RANK
            self.data_parallel_rank_local = envs.VLLM_DP_RANK_LOCAL
            self.data_parallel_master_ip = envs.VLLM_DP_MASTER_IP
            self.data_parallel_master_port = envs.VLLM_DP_MASTER_PORT
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah this PR supports single node and the framework is extensible to support multi nodes. At this stage, I'm just following the MP CLI to have a minimal change. The user interface will evolve when we support multi nodes, likely adjusting input args and environment variables, like you mentioned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to update the git description to indicate the local mode only use case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SG, updated
a7b8b9a    to
    19a46e3      
    Compare
  
    Signed-off-by: Rui Qiao <[email protected]>
19a46e3    to
    6ccf8f3      
    Compare
  
    There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ruisearch42 it looks quite clean. Just a few general thoughts:
- We are hoping to get #17546 in soon which refactors some things that I think may conflict with the changes here. I wonder if you could look at targeting this PR on top of that one?
- It looks much of the logic in the Ray subclasses is duplicated from the superclass. Perhaps we could abstract things a little to avoid that, an minimze the logic in the subclasses? Probably it would only make sense to do this in the context of (1) though.
- We could consider grouping the Ray classes in a v1/raysubmodule/package?
        
          
                vllm/v1/utils.py
              
                Outdated
          
        
      | } | ||
|  | ||
|  | ||
| class CoreEngineActorManager: | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems a bit circular from a dependency pov, since core depends on utils.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved to ray_dp.py and removed the circular dependency.
Signed-off-by: Rui Qiao <[email protected]>
568bbc4    to
    9ca65b3      
    Compare
  
    | 
 Thanks for the review @njhill . Your thoughts make sense. For 3, I moved  For 1 and 2, I think while the high level interface is relatively stable, both MP and Ray based implementations are still evolving (as can be seen from PR 17546). So instead of waiting for MP to finally stablize, or refactoring now but rework Ray later, I was thinking deferring the actual refactoring/cleanup although keeping in mind the implementations should align. This allows us to iterate faster on both ends. In case there is a conflict with 17546, e.g., when there is interface changes, you could simply add placeholder methods in Ray to make type checking happy and I will follow up immediately to address that. The main advantage is to break the dependency on 17546, a large PR which might got delayed. Let me know what you think. | 
| This pull request has merge conflicts that must be resolved before it can be | 
This PR adds support for DP with Ray. In this change it supports single node (i.e., head node), but is extensible to support multi-nodes: to be done in follow-up PRs.
We use the same ZMQ mechanism for communication between frontend and engine cores, as in #15977 .
Main differences from that PR:
Examples
Currently supports:
This will run DP=4 on the head node.
After adding node placement strategy in future PR, it can support:
This will run DP=4 with DP ranks 0 and 1 on the head node and ranks 2 and 3 on other nodes.
This will run DP=4 with only the API server on the head node and all engines other nodes:
Design
See the following illustration:
TODO
Follow ups after this PR: