Conversation

@ruisearch42 ruisearch42 commented May 15, 2025

This PR adds support for DP (data parallelism) with Ray. This change supports a single node (i.e., the head node), but the design is extensible to multiple nodes, to be done in follow-up PRs.

We use the same ZMQ mechanism for communication between the frontend and engine cores as in #15977.
Main differences from that PR:

  • The handshake between the frontend and engine cores is greatly simplified, thanks to the Ray API
  • We can launch all DP ranks on just the head node

Examples

Currently supports:

This will run DP=4 on the head node.

# Head node  (with ip address 10.99.48.128)
vllm serve $MODEL --data-parallel-size 4 --data-parallel-size-local 4 \
                  --data-parallel-address 10.99.48.128 --data-parallel-rpc-port 13345 \
                  --data-parallel-backend ray

After adding a node placement strategy in a future PR, it will support:

This will run DP=4 with DP ranks 0 and 1 on the head node and ranks 2 and 3 on other nodes.

# Head node  (with ip address 10.99.48.128)
vllm serve $MODEL --data-parallel-size 4 --data-parallel-size-local 2 \
                  --data-parallel-address 10.99.48.128 --data-parallel-rpc-port 13345 \
                  --data-parallel-backend ray

This will run DP=4 with only the API server on the head node and all engines on other nodes:

# Head node  (with ip address 10.99.48.128)
vllm serve $MODEL --data-parallel-size 4 --data-parallel-size-local 0 \
                  --data-parallel-address 10.99.48.128 --data-parallel-rpc-port 13345 \
                  --data-parallel-backend ray

Design

See the following illustration:

(design illustration image not reproduced here)
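The rank placement described in the examples above can be sketched as follows. This is a minimal illustration only; the helper name `assign_dp_ranks` is hypothetical, not this PR's actual API:

```python
def assign_dp_ranks(dp_size: int, dp_size_local: int) -> dict:
    """Map each DP rank to where its engine actor should run.

    Ranks [0, dp_size_local) are placed on the head node; the rest
    are left for remote nodes (placement strategy is a follow-up).
    """
    if not 0 <= dp_size_local <= dp_size:
        raise ValueError("data-parallel-size-local must be in [0, dp_size]")
    return {
        rank: "head" if rank < dp_size_local else "remote"
        for rank in range(dp_size)
    }

# DP=4 with 2 local ranks: ranks 0-1 on the head node, 2-3 remote.
placement = assign_dp_ranks(4, 2)
```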

TODO

Follow ups after this PR:

  • Allow specifying placement strategy for non-local DP ranks
  • Scale out API server

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, a small and essential subset of CI tests that quickly catches errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 label May 15, 2025
@ruisearch42 ruisearch42 force-pushed the ray_dp_refactor branch 3 times, most recently from f0963fb to a7b8b9a on May 16, 2025
@kouroshHakha kouroshHakha left a comment

Left some high-level critical questions:

vllm/v1/utils.py Outdated
Comment on lines 200 to 209
Collaborator

We also need to use a placement group to control the correct placement of local vs. remote, right?

Collaborator Author

That's right; this is the first follow-up mentioned in the description.
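For reference, one way a follow-up could express local vs. remote placement is with Ray placement groups. The sketch below only builds the per-rank resource bundles so it stays self-contained (no Ray calls); `dp_placement_bundles` and `gpus_per_rank` are hypothetical names, and with Ray the bundles would be passed to `ray.util.placement_group(bundles, strategy=...)`:

```python
def dp_placement_bundles(dp_size: int, dp_size_local: int,
                         gpus_per_rank: int = 1):
    """Split per-rank resource bundles into head-node vs. remote groups.

    Each bundle reserves the GPUs one DP engine actor needs. The
    `local` bundles could go into a placement group pinned to the head
    node (e.g. with strategy="STRICT_PACK"), while the `remote`
    bundles would be scheduled onto other nodes.
    """
    bundle = {"GPU": float(gpus_per_rank)}
    local = [dict(bundle) for _ in range(dp_size_local)]
    remote = [dict(bundle) for _ in range(dp_size - dp_size_local)]
    return local, remote

# DP=4 with 2 local ranks: two bundles for the head node, two elsewhere.
local, remote = dp_placement_bundles(4, 2)
```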

Collaborator

The reason is that the GPU is not set on the .remote() decorator call on the actor, right?

Collaborator Author

That's right.

Collaborator

OK, so intuitively, when we use Ray we should not need to specify --data-parallel-address 10.99.48.128 --data-parallel-rpc-port 13345 in the input command. These should be picked up automatically with Ray utils and knowledge about the cluster?
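For reference, the address and port could indeed be derived rather than passed explicitly. Here is a stdlib-only sketch; on a Ray cluster, `ray.util.get_node_ip_address()` would be the natural way to get the node IP, and `pick_dp_address_and_port` is a hypothetical helper, not part of this PR:

```python
import socket

def pick_dp_address_and_port():
    """Derive a DP master address and a free RPC port on this host."""
    # IP the host would use to reach an external address (UDP connect
    # sends no traffic; it only selects the outbound interface).
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect(("8.8.8.8", 80))
        ip = s.getsockname()[0]
    except OSError:
        ip = "127.0.0.1"  # no route; fall back to loopback
    finally:
        s.close()
    # Ask the OS for an unused TCP port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind(("", 0))
        port = srv.getsockname()[1]
    return ip, port
```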

Collaborator

We basically need to set the following env vars somewhere, don't we? So that the stateless_dp_init_group function can create the process groups among the Ray workers:

            self.data_parallel_size = envs.VLLM_DP_SIZE
            self.data_parallel_rank = envs.VLLM_DP_RANK
            self.data_parallel_rank_local = envs.VLLM_DP_RANK_LOCAL
            self.data_parallel_master_ip = envs.VLLM_DP_MASTER_IP
            self.data_parallel_master_port = envs.VLLM_DP_MASTER_PORT
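For reference, those variables could be assembled per rank and injected into each engine actor via Ray's `runtime_env={"env_vars": ...}` rather than inherited from the launching process. A sketch, where `dp_env_vars` is a hypothetical helper and the derivation of VLLM_DP_RANK_LOCAL for remote ranks is an assumption, not this PR's behavior:

```python
def dp_env_vars(rank: int, dp_size: int, dp_size_local: int,
                master_ip: str, master_port: int) -> dict:
    """Environment variables for one DP rank's engine actor."""
    # Assumption: local rank restarts from 0 on the remote nodes.
    local_rank = rank if rank < dp_size_local else rank - dp_size_local
    return {
        "VLLM_DP_SIZE": str(dp_size),
        "VLLM_DP_RANK": str(rank),
        "VLLM_DP_RANK_LOCAL": str(local_rank),
        "VLLM_DP_MASTER_IP": master_ip,
        "VLLM_DP_MASTER_PORT": str(master_port),
    }

env = dp_env_vars(rank=1, dp_size=4, dp_size_local=4,
                  master_ip="10.99.48.128", master_port=13345)
```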

Collaborator Author

Yeah, this PR supports a single node, and the framework is extensible to multiple nodes. At this stage I'm just following the MP CLI to keep the change minimal. The user interface will evolve when we support multiple nodes, likely adjusting input args and environment variables, as you mentioned.

Collaborator

You need to update the PR description to indicate the local-mode-only use case.

Collaborator Author

Sounds good, updated.

@ruisearch42 ruisearch42 added the ready ONLY add when PR is ready to merge/full CI is needed label May 16, 2025
@njhill njhill left a comment

Thanks @ruisearch42, it looks quite clean. Just a few general thoughts:

  1. We are hoping to get #17546 in soon, which refactors some things that I think may conflict with the changes here. I wonder if you could look at targeting this PR on top of that one?
  2. It looks like much of the logic in the Ray subclasses is duplicated from the superclass. Perhaps we could abstract things a little to avoid that, and minimize the logic in the subclasses? Probably it would only make sense to do this in the context of (1), though.
  3. We could consider grouping the Ray classes in a v1/ray submodule/package?

vllm/v1/utils.py Outdated
}


class CoreEngineActorManager:
Member

This seems a bit circular from a dependency point of view, since core depends on utils.

Collaborator Author

Moved to ray_dp.py and removed the circular dependency.

Signed-off-by: Rui Qiao <[email protected]>
@ruisearch42 (Collaborator Author)

  1. We are hoping to get [Perf] API-server scaleout with many-to-many server-engine comms #17546 in soon, which refactors some things that I think may conflict with the changes here. I wonder if you could look at targeting this PR on top of that one?
  2. It looks like much of the logic in the Ray subclasses is duplicated from the superclass. Perhaps we could abstract things a little to avoid that, and minimize the logic in the subclasses? Probably it would only make sense to do this in the context of (1), though.
  3. We could consider grouping the Ray classes in a v1/ray submodule/package?

Thanks for the review @njhill . Your thoughts make sense.

For 3, I moved DPEngineCoreActor and CoreEngineActorManager to ray_dp.py. I left RayDPClient in core_client.py for two reasons: 1) it is logically a core client, so it makes more sense in that file; 2) there would be a circular dependency if it were moved to ray_dp.py.

For 1 and 2: while the high-level interface is relatively stable, both the MP- and Ray-based implementations are still evolving (as can be seen from PR 17546). So instead of waiting for MP to finally stabilize, or refactoring now and reworking Ray later, I was thinking of deferring the actual refactoring/cleanup while keeping in mind that the implementations should align. This allows us to iterate faster on both ends. In case there is a conflict with 17546, e.g., when there are interface changes, you could simply add placeholder methods in Ray to make type checking happy, and I will follow up immediately to address that.

The main advantage is breaking the dependency on 17546, a large PR which might get delayed. Let me know what you think.

@ruisearch42 ruisearch42 changed the title [V1] Support DP with Ray (deprecated) [V1] Support DP with Ray May 27, 2025

mergify bot commented May 30, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ruisearch42.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


Labels

needs-rebase, ready, v1
