@wenxindongwork wenxindongwork commented Oct 14, 2025

This PR introduces support for Data Parallelism in the vLLM TPU backend.

Data Parallelism is a sharding strategy intended for the following scenarios:

  1. Model replication. Replicating the model across a large slice increases the overall throughput of the system.
  2. KV cache de-duplication for large models with a small number of KV heads (e.g. DeepSeekV3, Qwen3 235B) and models that use fp8 KV cache quantization. By default we shard the KV heads by TP; if there are not enough heads to shard, we end up replicating the heads and the KV cache, which wastes memory. Attention DP eliminates this waste by replicating the attention layer and sending different data to each attention replica (see the sketch after this list).
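
As a rough illustration of the memory argument (the head count and TP degree below are hypothetical, not taken from this PR):

```python
# Illustrative arithmetic only; numbers are hypothetical.
num_kv_heads = 4          # a model with few KV heads
tensor_parallel_size = 8  # TP degree on the slice

# With pure TP, each KV head (and its KV cache) ends up replicated on
# tensor_parallel_size // num_kv_heads chips:
kv_replication = max(1, tensor_parallel_size // num_kv_heads)  # -> 2x memory waste

# With attention DP, that surplus factor is spent on data parallelism for the
# attention layers instead, so each chip holds a distinct shard of the KV cache:
attn_dp_size = kv_replication  # -> 2 attention replicas, no duplicated KV cache
```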

Note that the vLLM TPU DP design (SPMD) is very different from the vLLM GPU DP design (MPMD). vLLM GPU DP launches multiple vLLM EngineCore instances (one per DP rank) that communicate between processes, whereas vLLM TPU DP launches a single vLLM EngineCore and does the data sharding within that one instance. SPMD is a more TPU- and JAX-native approach to DP.

DP scheduler
We introduce a new DP scheduler class that extends the base vLLM Scheduler to support data-parallel (DP) execution. It manages request distribution and KV cache allocation across multiple DP ranks, where each rank has its own logical KV cache shard and processes a subset of the total requests.
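
The sketch below illustrates the request-distribution idea; it is a simplification, not the actual DPScheduler code, and the helper names are hypothetical:

```python
# Minimal sketch of distributing requests across DP ranks. The real DPScheduler
# extends vLLM's Scheduler and also tracks per-rank KV cache allocation.
class DPSchedulerSketch:
    def __init__(self, dp_size: int):
        self.dp_size = dp_size
        # One request queue (and, logically, one KV cache shard) per DP rank.
        self.requests_per_rank = {rank: [] for rank in range(dp_size)}

    def assign(self, request_id: str) -> int:
        # Place the request on the least-loaded rank to keep KV cache usage balanced.
        rank = min(self.requests_per_rank, key=lambda r: len(self.requests_per_rank[r]))
        self.requests_per_rank[rank].append(request_id)
        return rank
```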

Input preparation
See changes in tpu_jax_runner._prepare_inputs(). When DP is enabled, we assign each request to a DP rank. Input tokens must be grouped by DP rank (e.g. input tokens from DP rank 0 come before input tokens from DP rank 1, and so on). This sorting and padding happens in the _prepare_inputs_dp function. The input tokens are then sharded along the DP axis, so that each attention-layer replica processes one shard of the global input.
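
A minimal sketch of that sort-and-pad step, assuming NumPy inputs; the function name and signature below are illustrative, and the real logic lives in _prepare_inputs_dp:

```python
import numpy as np

def prepare_inputs_dp_sketch(token_ids: np.ndarray, dp_ranks: np.ndarray,
                             dp_size: int, pad_id: int = 0) -> np.ndarray:
    """Group tokens by DP rank and pad every group to the same length."""
    groups = [token_ids[dp_ranks == rank] for rank in range(dp_size)]
    max_len = max(len(g) for g in groups)
    padded = np.stack([
        np.pad(g, (0, max_len - len(g)), constant_values=pad_id) for g in groups
    ])
    # `padded` has shape [dp_size, max_len]; its leading axis is then sharded
    # across the DP mesh axis so each attention replica sees exactly one shard.
    return padded
```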

Sharding and Attention Axis Updates:

We introduced a new mesh axis, attn_dp, to enable attention-only DP. Additionally, we modified the sharding annotations for Llama3 to enable both model-wise and attention-only DP. All models should follow the same general pattern: attention weights (e.g. the qkv projections and out_proj) are replicated across the data and attn_dp dimensions, the input data is sharded across the data and attn_dp dimensions, and MLP and MoE layers are replicated across the data dimension but not the attn_dp dimension.
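
One plausible way to express this pattern with a JAX mesh (a sketch under assumed axis sizes; the actual mesh shape and annotations in the repo may differ):

```python
import numpy as np
import jax
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Hypothetical layout for 8 chips: data=1, attn_dp=2, model=4.
devices = np.array(jax.devices()).reshape(1, 2, 4)
mesh = Mesh(devices, axis_names=("data", "attn_dp", "model"))

# Attention weights (qkv projs, out_proj): replicated over data and attn_dp,
# sharded over model (TP).
attn_weight_sharding = NamedSharding(mesh, P(None, "model"))

# Activations entering attention: batch dimension sharded over data and attn_dp.
attn_input_sharding = NamedSharding(mesh, P(("data", "attn_dp"), None))

# MLP / MoE weights: replicated over data but not over attn_dp; outside of
# attention the attn_dp axis is used together with model for tensor parallelism.
mlp_weight_sharding = NamedSharding(mesh, P(None, ("attn_dp", "model")))
```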

Usage:
It is recommended to use DP with async scheduling.

Model-wise DP: python examples/offline_inference.py --tensor_parallel_size=4 --data_parallel_size=2 --async-scheduling
Attention DP: python examples/offline_inference.py --tensor_parallel_size=8 --kv-cache-dtype=fp8 --additional_config='{"sharding":{"sharding_strategy": {"enable_dp_attention":1}}}' --async-scheduling

Attention DP is triggered automatically when the enable_dp_attention flag is passed, and the exact dp_size is determined automatically from the number of KV heads and the TP size.
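
For illustration, a hedged sketch of how that dp_size could be derived; the real derivation lives in this PR's sharding configuration code and may differ:

```python
def infer_attn_dp_size(num_kv_heads: int, tensor_parallel_size: int) -> int:
    # Hypothetical helper, not the actual implementation. When TP exceeds the
    # number of KV heads, the surplus factor becomes the attention-DP degree
    # instead of replicating KV heads (and their KV cache).
    if tensor_parallel_size <= num_kv_heads:
        return 1
    assert tensor_parallel_size % num_kv_heads == 0, "TP must be a multiple of KV heads"
    return tensor_parallel_size // num_kv_heads
```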

Complementary vLLM PR

We had to make some minor changes to upstream vLLM:
https://github.com/vllm-project/vllm/pull/27365/files

Tests

  • New e2e Buildkite test on v6e-8 (test_data_parallel.py) verifying (1) model-wise data parallelism, (2) attention data parallelism, and (3) an output correctness check. Performance tests are not added in this PR.
  • Unit tests for the DPScheduler class and the _prepare_inputs_dp function.

Buildkite: https://buildkite.com/tpu-commons/tpu-inference-ci/builds/4993

Performance

This PR introduces a functional DP implementation.

Model-wise DP

  • Achieves ~72-80% of expected throughput.

Attention DP

  • 1.5x gain in effective KV cache size.

Future PRs

  • Support DP for speculative decoding, sequence parallelism, LoRA, and structured decoding.
  • Support DP for Torchax backend.

Checklist

Before submitting this PR, please make sure:

  • I have performed a self-review of my code.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have made or will make corresponding changes to any relevant documentation.

@wenxindongwork wenxindongwork force-pushed the dp_attention branch 4 times, most recently from 6d38e1d to da4519e on October 22, 2025 19:04
@wenxindongwork wenxindongwork self-assigned this Oct 27, 2025
@wenxindongwork wenxindongwork marked this pull request as ready for review October 27, 2025 18:17
@wenxindongwork wenxindongwork force-pushed the dp_attention branch 3 times, most recently from a1ea387 to f257867 on October 27, 2025 22:13
kyuyeunk commented Nov 5, 2025

Are you planning to create a separate PR for the sharding-annotation-related changes? It seems like that change makes the PR significantly larger than just the DP-related changes and makes it hard to review.

@wenxindongwork wenxindongwork merged commit a27922a into main Nov 6, 2025
3 checks passed
wenxindongwork added a commit that referenced this pull request Nov 6, 2025
sixiang-google pushed a commit that referenced this pull request Nov 6, 2025
sierraisland pushed a commit that referenced this pull request Nov 7, 2025
sierraisland pushed a commit that referenced this pull request Nov 8, 2025