Data Parallelism support #865
Merged
Conversation
yaochengji reviewed Oct 15, 2025
kyuyeunk reviewed Oct 15, 2025
vanbasten23 reviewed Oct 29, 2025
vanbasten23 reviewed Nov 3, 2025
gpolovets1 reviewed Nov 4, 2025
gpolovets1 approved these changes Nov 4, 2025
Collaborator
Are you planning to create a separate PR for the sharding-annotation-related changes? It seems like that change makes this PR significantly larger than just the DP-related changes and makes it hard to review.
wenxindongwork added a commit that referenced this pull request Nov 6, 2025: This reverts commit a27922a.
sixiang-google pushed a commit that referenced this pull request Nov 6, 2025
sierraisland pushed a commit that referenced this pull request Nov 7, 2025
sierraisland pushed a commit that referenced this pull request Nov 8, 2025
This PR introduces support for Data Parallelism in the vLLM TPU backend.
Data Parallelism is a sharding strategy intended for the following scenarios: model-wise DP and attention-only DP (both described under Usage below).
Note that the vLLM TPU DP design (SPMD) is very different from the vLLM GPU DP design (MPMD). vLLM GPU DP launches multiple vLLM EngineCore instances (one per DP rank) and communicates between processes, whereas vLLM TPU DP launches a single vLLM EngineCore and shards the data within that one instance. SPMD is a more TPU- and JAX-native approach to DP.
DP scheduler
We introduce a new DP scheduler class that extends the base vLLM Scheduler to support data-parallel (DP) execution. It manages request distribution and KV cache allocation across multiple DP ranks, where each rank has its own logical KV cache shard and processes a subset of the total requests.
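As a rough illustration of the bookkeeping involved (the names, the capacity check, and the least-loaded policy below are all assumptions for illustration, not the PR's actual DPScheduler API):

```python
# Hypothetical sketch of the per-rank bookkeeping a DP scheduler needs.
# Class and method names are illustrative, not the PR's actual API.
from dataclasses import dataclass, field


@dataclass
class DPRankState:
    rank: int
    free_kv_blocks: int  # capacity of this rank's logical KV cache shard
    request_ids: set = field(default_factory=set)


class SimpleDPAssigner:
    """Assigns each incoming request to the least-loaded DP rank."""

    def __init__(self, dp_size: int, blocks_per_rank: int):
        self.ranks = [DPRankState(r, blocks_per_rank) for r in range(dp_size)]

    def assign(self, request_id: str, num_blocks_needed: int) -> int:
        # Consider only ranks whose KV cache shard can fit the request.
        candidates = [s for s in self.ranks if s.free_kv_blocks >= num_blocks_needed]
        if not candidates:
            raise RuntimeError("no DP rank has enough free KV cache blocks")
        best = max(candidates, key=lambda s: s.free_kv_blocks)
        best.free_kv_blocks -= num_blocks_needed
        best.request_ids.add(request_id)
        return best.rank
```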
Input preparation
See changes in tpu_jax_runner._prepare_inputs(). When DP is enabled, we assign each request to a DP rank. Input tokens must be grouped by DP rank (e.g. input tokens from DP rank 0 come before input tokens from DP rank 1, and so on); we do this sorting and padding in the _prepare_inputs_dp function, sketched below. The input tokens are then sharded along the DP axis, so that each attention layer replica processes one shard of the global input.
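For intuition, a minimal standalone version of that grouping-and-padding step might look like the following (the helper name and padding scheme are assumptions; the real logic lives in _prepare_inputs_dp):

```python
import numpy as np


def group_and_pad_by_dp_rank(token_ids, dp_ranks, dp_size, pad_id=0):
    """Reorder a flat token batch so rank-0 tokens come first, then rank-1,
    etc., padding every rank's slice to the same length so the result can be
    sharded evenly along the DP axis. Illustrative only."""
    token_ids = np.asarray(token_ids)
    dp_ranks = np.asarray(dp_ranks)
    per_rank = [token_ids[dp_ranks == r] for r in range(dp_size)]
    max_len = max(len(t) for t in per_rank)
    padded = [np.pad(t, (0, max_len - len(t)), constant_values=pad_id)
              for t in per_rank]
    # Shape: [dp_size * max_len]; sharding this along the DP axis gives each
    # attention replica exactly one rank's (padded) slice.
    return np.concatenate(padded)


# Example: 5 tokens assigned to 2 DP ranks.
tokens = [11, 12, 13, 21, 22]
ranks = [0, 1, 0, 1, 0]
print(group_and_pad_by_dp_rank(tokens, ranks, dp_size=2))
# -> [11 13 22 12 21  0]
```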
Sharding and Attention Axis Updates
We introduced a new mesh axis, attn_dp, to enable attention-only DP. Additionally, we modified the sharding annotations for Llama3 to enable both model-wise and attention-only DP. All models should follow the same rough idea: attention weights (e.g. the qkv and output projections) should be replicated across the data and attn_dp dimensions, and data should be sharded across the data and attn_dp dimensions. MLP and MoE layers should be replicated across the data dimension but not the attn_dp dimension.
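A minimal JAX sketch of that layout (the 8-device mesh shape and these exact PartitionSpecs are assumptions for illustration; the PR's actual annotations live in the model sharding configs):

```python
import numpy as np
import jax
from jax.sharding import Mesh, PartitionSpec as P

# Assumed 8-device mesh: 2-way data, 2-way attn_dp, 2-way model (TP).
devices = np.asarray(jax.devices()).reshape(2, 2, 2)
mesh = Mesh(devices, axis_names=("data", "attn_dp", "model"))

# Activations: the token/batch dimension is sharded across both data and
# attn_dp, so each attention replica sees one shard of the global input.
activation_spec = P(("data", "attn_dp"), None)  # [tokens, hidden]

# Attention weights: TP-sharded on the model axis, replicated across data
# and attn_dp (each attention replica holds a full copy of its TP shard).
qkv_spec = P(None, "model")  # [hidden, num_heads * head_dim]

# MLP / MoE weights: replicated across data only; attn_dp is folded into
# the tensor-parallel axis so these layers stay fully TP-sharded.
mlp_spec = P(None, ("attn_dp", "model"))  # [hidden, ffn]
```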
Usage
It is recommended to use DP with async scheduling.
Model wise DP:
python examples/offline_inference.py --tensor_parallel_size=4 --data_parallel_size=2 --async-scheduling

Attention DP:
python examples/offline_inference.py --tensor_parallel_size=8 --kv-cache-dtype=fp8 --additional_config='{"sharding":{"sharding_strategy": {"enable_dp_attention":1}}}' --async-scheduling

Attention DP will be automatically triggered when the enable_dp_attention flag is passed, and the exact dp_size will be automatically determined based on the number of KV heads and the TP size.
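One plausible rule consistent with that description (a guess for illustration only, not the PR's exact code):

```python
def infer_attn_dp_size(tp_size: int, num_kv_heads: int) -> int:
    """When tp_size exceeds the number of KV heads, KV heads would have to be
    replicated across TP shards; attention DP can instead use those extra
    shards as data-parallel attention replicas. Hypothetical sketch."""
    if tp_size <= num_kv_heads:
        return 1
    assert tp_size % num_kv_heads == 0, "tp_size must be a multiple of kv heads"
    return tp_size // num_kv_heads


# e.g. a model with 8 KV heads: tp_size=8 -> dp=1, tp_size=16 -> dp=2.
print(infer_attn_dp_size(8, 8), infer_attn_dp_size(16, 8))
```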
Complementary vLLM PR
We have to make some minor changes to vLLM upstream:
https://github.com/vllm-project/vllm/pull/27365/files
Tests
- Added an end-to-end test (test_data_parallel.py) to verify 1. model parallelism, 2. attention data parallelism, and 3. output correctness. Performance tests are not added in this PR.
- Added unit tests for the DPScheduler class and the _prepare_input_dp function.

Buildkite: https://buildkite.com/tpu-commons/tpu-inference-ci/builds/4993
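In the same spirit, a self-contained unit-test sketch (it re-declares a toy round-robin assigner rather than importing the PR's DPScheduler, whose exact API isn't shown here):

```python
import pytest


def round_robin_assign(num_requests: int, dp_size: int) -> list[int]:
    """Toy stand-in for a DP rank-assignment policy."""
    return [i % dp_size for i in range(num_requests)]


@pytest.mark.parametrize("num_requests,dp_size", [(8, 2), (7, 4), (1, 1)])
def test_assignment_is_balanced(num_requests, dp_size):
    ranks = round_robin_assign(num_requests, dp_size)
    counts = [ranks.count(r) for r in range(dp_size)]
    # No rank should receive two or more requests than any other.
    assert max(counts) - min(counts) <= 1
```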
Performance
This PR introduces the functional DP implementation.
Model wise DP
Attention DP
Future PRs
Checklist
Before submitting this PR, please make sure: