[perf]: support dual-batch overlap(dbo) for deepseek #941
Conversation
This is a great change. Thanks for the contribution. Please rebase to main as well.
vllm_ascend/models/deepseek_v2.py (Outdated)

    real_top_k = self.mlp.experts.top_k

    if VLLM_ENABLE_MC2 and not is_prefill:
        ...
Please add more notes or remove this kind of code.
> Please add more notes or remove this kind of code.

Thanks for your review; we have removed it in our revised version.
vllm_ascend/models/deepseek_v2.py (Outdated)

    if VLLM_ENABLE_MC2 and not is_prefill:
        ...
    ''' the following kernels will be submitted to the comm stream to overlap the computation of the
Use # for comments.
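For example, the string-literal comment in the snippet above could become ordinary # comments (illustrative rewrite, not the exact final code):

```python
# The following kernels will be submitted to the comm stream to overlap
# the computation of ...
```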
vllm_ascend/models/deepseek_v2.py (Outdated)

    enable_force_load_balance)

    if VLLM_ENABLE_MC2 and not is_prefill:
        ...
ditto
vllm_ascend/models/deepseek_v2.py (Outdated)

    residual = intermediate_tensors["residual"]

    for i in range(self.start_layer, self.end_layer):
        num_normal_layers = (self.first_k_dense_replace
I prefer to structure the code like this to make it clearer:

    if self.can_run_ms:
        xxx
    else:
        xxx

I know your code here is simpler, but it's not easy to maintain. Another way is to move `if moe_start_layer == self.end_layer:` from `_forward_ms_layers` to L1014:

    moe_start_layer = self.start_layer + num_normal_layers
    if moe_start_layer != self.end_layer:
        hidden_states, residual = self._forward_ms_layers()
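A minimal sketch of what the suggested structure could look like; `can_run_ms`, `first_k_dense_replace`, and `_forward_ms_layers` come from the snippets above, everything else (arguments, layer call signature) is illustrative only:

```python
def forward(self, positions, hidden_states, residual):
    # When multistream (dbo) cannot run, treat every layer as a normal layer.
    num_normal_layers = (self.first_k_dense_replace
                         if self.can_run_ms
                         else self.end_layer - self.start_layer)

    # Normal (non-overlapped) layers run as usual.
    for i in range(self.start_layer, self.start_layer + num_normal_layers):
        layer = self.layers[i]
        hidden_states, residual = layer(positions, hidden_states, residual)

    # Only enter the multistream path when MoE layers remain, so the
    # early-return check inside _forward_ms_layers is no longer needed there.
    moe_start_layer = self.start_layer + num_normal_layers
    if moe_start_layer != self.end_layer:
        hidden_states, residual = self._forward_ms_layers(
            positions, hidden_states, residual, moe_start_layer)

    return hidden_states, residual
```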
vllm_ascend/multistream/context.py (Outdated)

    from contextlib import contextmanager
    from typing import Any

    # TODO: move this part to vllm
Can you explain more here? Do you mean multistream should be in vllm?
vllm_ascend/models/deepseek_v2.py (Outdated)

    self.multistream_config = MultiStreamConfig()

    self.use_mla = model_config.use_mla
    self.multistream_metadata = make_multistream_metadata_ds(
multistream_metadata is only used at L973 and L975; assigning it to self here is unnecessary.
vllm_ascend/models/deepseek_v2.py (Outdated)

    causal_lm=getattr(config, "causal_lm", True),
    multistream_config=self.multistream_config,
    )
    self.ms_pre_layer = MultiStreamPreTransformerLayer(
These parameters are only used in the ms case; can we just init them when ms is enabled?
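A possible sketch of that conditional initialization, which also keeps multistream_metadata local instead of on self (the env accessor and the constructor arguments beyond those visible in the diff are assumptions):

```python
# Sketch only: build the multistream objects solely when DBO is enabled, and
# keep the metadata as a local variable since it is only consumed right here.
self.multistream_config = None
if envs_ascend.VLLM_ASCEND_ENABLE_DBO:  # assumed to be read via vllm_ascend envs
    self.multistream_config = MultiStreamConfig()
    multistream_metadata = make_multistream_metadata_ds(
        causal_lm=getattr(config, "causal_lm", True),
        multistream_config=self.multistream_config,
    )
    # Constructor argument assumed; mirrors the call visible in the diff.
    self.ms_pre_layer = MultiStreamPreTransformerLayer(multistream_metadata)
```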
vllm_ascend/multistream/metadata.py (Outdated)

    def split_micro_batch(
        self,
        attn_metadata: "AttentionMetadata",
the type should be AscendMLAMetadata
We have changed it to AscendMLAMetadata.
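For reference, a sketch of the updated annotation (other parameters elided, not the exact final signature):

```python
def split_micro_batch(
    self,
    attn_metadata: "AscendMLAMetadata",
    # ... remaining parameters unchanged ...
):
    ...
```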
vllm_ascend/envs.py (Outdated)

    "VLLM_ASCEND_TRACE_RECOMPILES":
    lambda: bool(int(os.getenv("VLLM_ASCEND_TRACE_RECOMPILES", '0'))),
    "VLLM_ENABLE_DBO":
    lambda: bool(int(os.getenv("VLLM_ENABLE_DBO", '0'))),
Suggested change:

    -    lambda: bool(int(os.getenv("VLLM_ENABLE_DBO", '0'))),
    +    lambda: bool(int(os.getenv("VLLM_ASCEND_ENABLE_DBO", '0'))),
Thanks for your review. We have updated the env variable to VLLM_ASCEND_ENABLE_DBO for enabling dbo.
    vllm_model.generate_greedy(example_prompts, max_tokens)


    def test_deepseek_model_with_dbo():
Suggested change:

    -def test_deepseek_model_with_dbo():
    +@patch.dict(os.environ, {"VLLM_ASCEND_ENABLE_DBO": "1"})
    +def test_deepseek_model_with_dbo():

We need to restore the env after the test is executed.
[1] https://docs.python.org/3/library/unittest.mock.html#unittest.mock.patch.dict
> We need to restore the env after the test is executed.
> [1] https://docs.python.org/3/library/unittest.mock.html#unittest.mock.patch.dict

We have fixed the use of env variables in the e2e tests by using the suggested patch.dict approach.
    def test_deepseek_model_with_dbo():
        os.environ["VLLM_ENABLE_DBO"] = "1"
Suggested change (remove this line):

    -        os.environ["VLLM_ENABLE_DBO"] = "1"
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",

Keeping only one is enough.
> Keeping only one is enough.

Now we use only one prompt but repeat it 40 times in the dbo tests, since we set a threshold on the number of tokens required to activate dual-batch overlap.
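A hedged sketch of how such a test could look; the model name, runner arguments, and the exact threshold are placeholders, and `VllmRunner`/`generate_greedy` follow the surrounding e2e tests:

```python
import os
from unittest.mock import patch

from tests.conftest import VllmRunner  # import path assumed


@patch.dict(os.environ, {"VLLM_ASCEND_ENABLE_DBO": "1"})
def test_deepseek_model_with_dbo():
    # patch.dict restores os.environ after the test returns, so the flag
    # never leaks into other tests.
    example_prompts = ["The future of AI is"] * 40  # enough tokens to cross the DBO threshold
    max_tokens = 5
    with VllmRunner("deepseek-ai/DeepSeek-V2-Lite",
                    enforce_eager=True) as vllm_model:
        vllm_model.generate_greedy(example_prompts, max_tokens)
```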
vllm_ascend/attention/mla_v1.py (Outdated)

    from vllm_ascend.multistream.context import \
        get_multistream_comm_context
Why import here?
> Why import here?

We have fixed this in our revised version by moving the import to the beginning of the file.
Separate DeepSeek as DeepSeekDBO first

This pull request has conflicts, please resolve those before we can evaluate the pull request.
To decouple dbo from the deepseek_v2 model, we adopt the suggestion and add a new model named deepseek_dbo. Users should override the model arch and set the env variable 'VLLM_ASCEND_ENABLE_DBO' simultaneously to run the dbo logic. We plan to sync the code in
Thanks for the contribution! I'm OK with this.
vllm_ascend/multistream/context.py

    from contextlib import contextmanager
    from typing import Any

    _ms_comm_context: Any = None
Can we wrap those variables into a single class as a global variable?
> Can we wrap those variables into a single class as a global variable?

I will give it a try and wrap them in a single class.
In my opinion, most of the global variables here can be moved into a ctx class as class members. We can pass them as function params in deepseek_dbo to get the necessary info for switching streams (in attn/routed expert/shared expert...) or for splitting/merging the inputs.
Currently, as we do not modify the impl of the attention backend, the stream-switch logic in mla still relies on the global variable _ms_comm_context. But I think it is feasible to wrap this global variable in a ctx class and pass it through our rewritten forward function of mla to obtain the comm stream info.
let's do it in the next PR.
> let's do it in the next PR.

OK, maybe like this demo? https://github.com/zxdukki/vllm-ascend/commit/8a7d13d05b7ccec3e696ffe31221d7a927626be1 We will update it together with the modifications to the mla impl according to ganyi's advice in the next PR. cc @ganyi1996ppo
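Something along these lines, perhaps (a rough sketch only, with illustrative names; `get_multistream_comm_context` mirrors the accessor already imported in mla_v1.py, everything else is assumed):

```python
from contextlib import contextmanager
from typing import Any, Optional


class MultiStreamContext:
    """Bundles the per-forward multistream state that currently lives in
    separate module-level globals (comm context, step metadata, ...)."""

    def __init__(self) -> None:
        self.comm_context: Any = None
        self.metadata: Any = None


_ms_ctx: Optional[MultiStreamContext] = None


@contextmanager
def set_multistream_context(ctx: MultiStreamContext):
    # Install the context for one (micro-)batch forward and restore the
    # previous one afterwards, so nested or sequential use stays safe.
    global _ms_ctx
    prev, _ms_ctx = _ms_ctx, ctx
    try:
        yield ctx
    finally:
        _ms_ctx = prev


def get_multistream_comm_context() -> Any:
    return None if _ms_ctx is None else _ms_ctx.comm_context
```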
    # k_c.size(1) + k_pe.size(1) == kv_cache.size(2)
    # i.e.
    # kv_lora_rank + qk_rope_head_dim == head_size
    self.mla_attn = Attention(
Is there any further plan to rewrite the attention impl in this dbo modeling?
> Is there any further plan to rewrite the attention impl in this dbo modeling?

Thanks for your review. I think it would be better to rewrite the impl of the mla backend for dbo and move the logic of switching npu streams into it. Maybe we can decouple them from mla_v1 and use env variables to control the return of get_impl_cls in the next version? Besides, do you have any other advice on the modifications to the attention impl for the dbo model?
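For instance, a hypothetical way to select a DBO-aware implementation via get_impl_cls (AscendMLADBOImpl does not exist in this PR; the import paths and env accessor are assumptions):

```python
from vllm.attention.backends.abstract import AttentionBackend

import vllm_ascend.envs as envs_ascend  # env accessor assumed


class AscendMLABackend(AttentionBackend):

    @staticmethod
    def get_impl_cls():
        # Return a DBO-aware MLA implementation only when the flag is set,
        # leaving the default attention path untouched.
        if envs_ascend.VLLM_ASCEND_ENABLE_DBO:
            return AscendMLADBOImpl  # hypothetical DBO-aware impl
        return AscendMLAImpl
```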
### What this PR does / why we need it?
Based on the design of dual-batch overlap proposed by the DeepSeek team and the implementation of fused MoE in the vLLM project, we implement multi-stream (also known as dual-batch) overlap for deepseek+mla on Ascend NPU. We split the input batch of the model into two microbatches and then overlap the comp/comm ops in the attention and moe layers using two streams to improve performance. Our approach can be easily extended when adding dispatch/combine communications for the moe layer.
Compared with the previously proposed draft, we use one stream for computation ops and the other for communication ops, separately. In our opinion, this is beneficial for arranging the order in which different ops execute and thus avoiding contention for computation/communication resources.
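To make the idea concrete, here is a heavily simplified sketch (not the PR's actual code): `layer.compute` and `layer.communicate` are placeholders, and we assume torch_npu mirrors the torch.cuda stream API (torch.npu.Stream, torch.npu.stream, wait_stream).

```python
import torch
import torch_npu  # noqa: F401  # registers the torch.npu device backend

comm_stream = torch.npu.Stream()


def forward_dbo(layer, hidden_states: torch.Tensor) -> torch.Tensor:
    # Split the batch into two microbatches along the token dimension.
    micro_batches = hidden_states.chunk(2, dim=0)
    outputs = []
    for mb in micro_batches:
        out = layer.compute(mb)  # computation stays on the default stream
        # Hand the result to the comm stream; its communication then overlaps
        # with the next microbatch's computation on the default stream.
        comm_stream.wait_stream(torch.npu.current_stream())
        with torch.npu.stream(comm_stream):
            out = layer.communicate(out)
        outputs.append(out)
    # Make sure all communication has finished before the results are consumed.
    torch.npu.current_stream().wait_stream(comm_stream)
    return torch.cat(outputs, dim=0)
```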
ref: [overlap for llama](https://github.com/vllm-project/vllm/pull/15787/files)
ref: [dbo in sglang](https://github.com/sgl-project/sglang/pull/4068/files#diff-b4937569fc71f6ad215181b633b2f89c7183a2b4ac39e41fc22635599a9be7de)
### Does this PR introduce any user-facing change?
Adds an env variable "VLLM_ASCEND_ENABLE_DBO". Users can enable dbo by setting VLLM_ASCEND_ENABLE_DBO=1.
See /examples/offline_dualbatch_overlap_npu.py for more info.
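A minimal offline usage sketch (the model path, parallel size, prompt count, and sampling params are placeholders; see the example script above for the real version):

```python
import os

# Set the flag before creating the engine so vllm-ascend picks it up.
os.environ["VLLM_ASCEND_ENABLE_DBO"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(model="deepseek-ai/DeepSeek-V2-Lite",
          tensor_parallel_size=2,
          trust_remote_code=True)
outputs = llm.generate(["The future of AI is"] * 40,
                       SamplingParams(max_tokens=32))
```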
### How was this patch tested?
This patch can be tested with vllm-0.9.0 using its online service with benchmark tests. We have decoupled the dbo functionality from vllm, and it should run without any modification to the vllm code (though some modifications would be better implemented in vllm itself).
Any advice/discussion is welcome.
### Performance Benchmark
We ran the benchmark_serving script of vllm to test the performance after enabling dual-batch overlap.
Server command:

    python -m vllm.entrypoints.openai.api_server \
        --model=DeepSeek-R1-W8A8 \
        --trust-remote-code \
        --distributed-executor-backend=mp \
        -tp=16 \
        --port 8006 \
        --max-num-seqs 390 \
        --max-model-len 32768 \
        --max-num-batched-tokens 65536 \
        --block-size 128 \
        --compilation_config 0 \
        --gpu-memory-utilization 0.90 \
        --disable-log-requests \
        --additional-config '{"expert_tensor_parallel_size":1,"enable_inter_dp_scheduling":true,"init_torchair_graph_batch_sizes":true,"trace_recompiles":true,"ascend_scheduler_config":{},"enable_graph_mode":false}'

Benchmark parameters:

    --dataset-name random --random-input-len 4096 --random-output-len 1 --num-prompts 200 --max-concurrency 8 --request-rate 5 --metric-percentiles 90

Results:

1. With allgather+allreduce on Ascend 910B (tp16 ep16, deepseek r1 w8a8): prefill qps: 2.17 -> 2.60
2. With alltoall: prefill qps: 0.90 -> 1.01, Mean TTFT: 8226 ms -> 7432 ms

The overlap approach when using alltoall communication can be further optimized by overlapping micro-batch1's moe computation with micro-batch2's dispatch a2a communication.