Conversation

@ruisearch42 ruisearch42 commented May 15, 2025

This PR adds support for DP (data parallelism) with Ray. This change supports a single node (i.e., the head node), but the design is extensible to multiple nodes, to be done in follow-up PRs.

We use the same ZMQ mechanism for communication between the frontend and engine cores as in #15977.
Main differences from that PR:

  • The handshake between the frontend and engine cores is greatly simplified, thanks to the Ray API
  • We can launch all DP ranks on just the head node

Examples

Currently supports:

This will run DP=4 on the head node.

# Head node  (with ip address 10.99.48.128)
vllm serve $MODEL --data-parallel-size 4 --data-parallel-size-local 4 \
                  --data-parallel-address 10.99.48.128 --data-parallel-rpc-port 13345 \
                  --data-parallel-backend ray

After adding a node placement strategy in a future PR, it will support:

This will run DP=4 with DP ranks 0 and 1 on the head node and ranks 2 and 3 on other nodes.

# Head node  (with ip address 10.99.48.128)
vllm serve $MODEL --data-parallel-size 4 --data-parallel-size-local 2 \
                  --data-parallel-address 10.99.48.128 --data-parallel-rpc-port 13345 \
                  --data-parallel-backend ray

This will run DP=4 with only the API server on the head node and all engines on other nodes:

# Head node  (with ip address 10.99.48.128)
vllm serve $MODEL --data-parallel-size 4 --data-parallel-size-local 0 \
                  --data-parallel-address 10.99.48.128 --data-parallel-rpc-port 13345 \
                  --data-parallel-backend ray

Design

See the following illustration:

(design illustration image not reproduced here)
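The rank placement described in the examples above can be sketched as follows. This is a minimal illustration only; the helper name `assign_dp_ranks` is hypothetical, not this PR's actual API:

```python
def assign_dp_ranks(dp_size: int, dp_size_local: int) -> dict:
    """Map each DP rank to where its engine actor should run.

    Ranks [0, dp_size_local) are placed on the head node; the rest
    are left for remote nodes (placement strategy is a follow-up).
    """
    if not 0 <= dp_size_local <= dp_size:
        raise ValueError("data-parallel-size-local must be in [0, dp_size]")
    return {
        rank: "head" if rank < dp_size_local else "remote"
        for rank in range(dp_size)
    }

# DP=4 with 2 local ranks: ranks 0-1 on the head node, 2-3 remote.
placement = assign_dp_ranks(4, 2)
```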

TODO

Follow ups after this PR:

  • Allow specifying placement strategy for non-local DP ranks
  • Scale out API server

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, a small and essential subset of CI tests that quickly catches errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 label May 15, 2025
@ruisearch42 ruisearch42 force-pushed the ray_dp_refactor branch 3 times, most recently from f0963fb to a7b8b9a on May 16, 2025
@kouroshHakha kouroshHakha left a comment

Left some high-level critical questions:

vllm/v1/utils.py Outdated
Comment on lines 200 to 209
Collaborator

We also need to use a placement group to control the correct placement of local vs. remote, right?

Collaborator Author

That's right; this is the first follow-up mentioned in the description.
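For reference, one way a follow-up could express local vs. remote placement is with Ray placement groups. The sketch below only builds the per-rank resource bundles so it stays self-contained (no Ray calls); `dp_placement_bundles` and `gpus_per_rank` are hypothetical names, and with Ray the bundles would be passed to `ray.util.placement_group(bundles, strategy=...)`:

```python
def dp_placement_bundles(dp_size: int, dp_size_local: int,
                         gpus_per_rank: int = 1):
    """Split per-rank resource bundles into head-node vs. remote groups.

    Each bundle reserves the GPUs one DP engine actor needs. The
    `local` bundles could go into a placement group pinned to the head
    node (e.g. with strategy="STRICT_PACK"), while the `remote`
    bundles would be scheduled onto other nodes.
    """
    bundle = {"GPU": float(gpus_per_rank)}
    local = [dict(bundle) for _ in range(dp_size_local)]
    remote = [dict(bundle) for _ in range(dp_size - dp_size_local)]
    return local, remote

# DP=4 with 2 local ranks: two bundles for the head node, two elsewhere.
local, remote = dp_placement_bundles(4, 2)
```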

Collaborator

The reason is that the GPU is not set on the .remote() decorator call on the actor, right?

Collaborator Author

That's right.

Collaborator

OK, so intuitively, when we use Ray we should not need to specify --data-parallel-address 10.99.48.128 --data-parallel-rpc-port 13345 in the input command. These should be picked up automatically with Ray utils and knowledge about the cluster?
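For reference, the address and port could indeed be derived rather than passed explicitly. Here is a stdlib-only sketch; on a Ray cluster, `ray.util.get_node_ip_address()` would be the natural way to get the node IP, and `pick_dp_address_and_port` is a hypothetical helper, not part of this PR:

```python
import socket

def pick_dp_address_and_port():
    """Derive a DP master address and a free RPC port on this host."""
    # IP the host would use to reach an external address (UDP connect
    # sends no traffic; it only selects the outbound interface).
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect(("8.8.8.8", 80))
        ip = s.getsockname()[0]
    except OSError:
        ip = "127.0.0.1"  # no route; fall back to loopback
    finally:
        s.close()
    # Ask the OS for an unused TCP port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind(("", 0))
        port = srv.getsockname()[1]
    return ip, port
```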

Collaborator

We basically need to set the following env vars somewhere, don't we? So that the stateless_dp_init_group function can create the process groups among the Ray workers:

            self.data_parallel_size = envs.VLLM_DP_SIZE
            self.data_parallel_rank = envs.VLLM_DP_RANK
            self.data_parallel_rank_local = envs.VLLM_DP_RANK_LOCAL
            self.data_parallel_master_ip = envs.VLLM_DP_MASTER_IP
            self.data_parallel_master_port = envs.VLLM_DP_MASTER_PORT
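For reference, those variables could be assembled per rank and injected into each engine actor via Ray's `runtime_env={"env_vars": ...}` rather than inherited from the launching process. A sketch, where `dp_env_vars` is a hypothetical helper and the derivation of VLLM_DP_RANK_LOCAL for remote ranks is an assumption, not this PR's behavior:

```python
def dp_env_vars(rank: int, dp_size: int, dp_size_local: int,
                master_ip: str, master_port: int) -> dict:
    """Environment variables for one DP rank's engine actor."""
    # Assumption: local rank restarts from 0 on the remote nodes.
    local_rank = rank if rank < dp_size_local else rank - dp_size_local
    return {
        "VLLM_DP_SIZE": str(dp_size),
        "VLLM_DP_RANK": str(rank),
        "VLLM_DP_RANK_LOCAL": str(local_rank),
        "VLLM_DP_MASTER_IP": master_ip,
        "VLLM_DP_MASTER_PORT": str(master_port),
    }

env = dp_env_vars(rank=1, dp_size=4, dp_size_local=4,
                  master_ip="10.99.48.128", master_port=13345)
```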

Collaborator Author

Yeah, this PR supports a single node, and the framework is extensible to multiple nodes. At this stage I'm just following the MP CLI to keep the change minimal. The user interface will evolve when we support multiple nodes, likely adjusting input args and environment variables, as you mentioned.

Collaborator

You need to update the PR description to indicate the local-mode-only use case.

Collaborator Author

Sounds good, updated.

@ruisearch42 ruisearch42 added the ready ONLY add when PR is ready to merge/full CI is needed label May 16, 2025
@njhill njhill left a comment

Thanks @ruisearch42, it looks quite clean. Just a few general thoughts:

  1. We are hoping to get #17546 in soon, which refactors some things that I think may conflict with the changes here. I wonder if you could look at targeting this PR on top of that one?
  2. It looks like much of the logic in the Ray subclasses is duplicated from the superclass. Perhaps we could abstract things a little to avoid that, and minimize the logic in the subclasses? Probably it would only make sense to do this in the context of (1), though.
  3. We could consider grouping the Ray classes in a v1/ray submodule/package?

vllm/v1/utils.py Outdated
}


class CoreEngineActorManager:
Member

This seems a bit circular from a dependency point of view, since core depends on utils.

Collaborator Author

Moved to ray_dp.py and removed the circular dependency.

Signed-off-by: Rui Qiao <[email protected]>
@ruisearch42 (Collaborator Author)

  1. We are hoping to get [Perf] API-server scaleout with many-to-many server-engine comms #17546 in soon, which refactors some things that I think may conflict with the changes here. I wonder if you could look at targeting this PR on top of that one?
  2. It looks like much of the logic in the Ray subclasses is duplicated from the superclass. Perhaps we could abstract things a little to avoid that, and minimize the logic in the subclasses? Probably it would only make sense to do this in the context of (1), though.
  3. We could consider grouping the Ray classes in a v1/ray submodule/package?

Thanks for the review @njhill . Your thoughts make sense.

For 3, I moved DPEngineCoreActor and CoreEngineActorManager to ray_dp.py. I left RayDPClient in core_client.py for two reasons: 1) it is logically a core client, so it makes more sense in that file; 2) there would be a circular dependency if it were moved to ray_dp.py.

For 1 and 2: while the high-level interface is relatively stable, both the MP- and Ray-based implementations are still evolving (as can be seen from PR 17546). So instead of waiting for MP to finally stabilize, or refactoring now and reworking Ray later, I was thinking of deferring the actual refactoring/cleanup while keeping in mind that the implementations should align. This allows us to iterate faster on both ends. In case there is a conflict with 17546, e.g., when there are interface changes, you could simply add placeholder methods in Ray to make type checking happy, and I will follow up immediately to address that.

The main advantage is breaking the dependency on 17546, a large PR which might get delayed. Let me know what you think.

@ruisearch42 ruisearch42 changed the title [V1] Support DP with Ray (deprecated) [V1] Support DP with Ray May 27, 2025

mergify bot commented May 30, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ruisearch42.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


Labels

needs-rebase, ready, v1
