
Conversation

@zxdukki
Contributor

@zxdukki zxdukki commented May 23, 2025

What this PR does / why we need it?

Based on the dual-batch overlap design proposed by the DeepSeek team and the fused MoE implementation in the vLLM project, we implement multi-stream (also known as dual-batch) overlap for DeepSeek + MLA on Ascend NPU. We split the model's input batch into two micro-batches and then overlap the computation/communication ops in the attention and MoE layers across two streams to improve performance. Our approach can be easily extended when adding dispatch/combine communications for the MoE layer.
Compared with the previously proposed draft (vllm-project#842), we use one stream for computation ops and the other for communication ops. In our opinion, this makes it easier to arrange the execution order of different ops and thus avoid contention between computation and communication resources.

ref: overlap for llama (https://github.com/vllm-project/vllm/pull/15787/files)
ref: dbo in sglang (https://github.com/sgl-project/sglang/pull/4068/files#diff-b4937569fc71f6ad215181b633b2f89c7183a2b4ac39e41fc22635599a9be7de)
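
To make the scheduling idea concrete, here is a minimal, hypothetical sketch of overlapping two micro-batches on two streams. It is not the code from this PR: `layer.attention` and `layer.all_reduce` are placeholder names, and torch.cuda streams stand in for the torch.npu streams actually used on Ascend.

```python
import torch

def forward_layer_dbo(layer, micro_batches):
    # one stream for computation kernels, one for communication kernels
    comp_stream = torch.cuda.Stream()
    comm_stream = torch.cuda.Stream()
    outputs = []
    for mb in micro_batches:                  # micro-batch 0, then micro-batch 1
        with torch.cuda.stream(comp_stream):
            hidden = layer.attention(mb)      # compute for this micro-batch
        # the comm stream only waits for *this* micro-batch's compute ...
        comm_stream.wait_stream(comp_stream)
        with torch.cuda.stream(comm_stream):
            # ... so its communication overlaps with the next micro-batch's compute
            hidden = layer.all_reduce(hidden)
        outputs.append(hidden)
    # rejoin both streams before returning to single-stream execution
    torch.cuda.current_stream().wait_stream(comm_stream)
    return outputs
```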

Does this PR introduce any user-facing change?

Adds an env variable "VLLM_ASCEND_ENABLE_DBO". Users can enable DBO by setting "VLLM_ASCEND_ENABLE_DBO=1".

See /examples/offline_dualbatch_overlap_npu.py for more info.

How was this patch tested?

This patch can be tested with vllm-0.9.0 by running its online service with benchmark tests. We have decoupled the DBO functionality from vLLM, so it should run without any modification to the vLLM code (although some of the modifications would be better implemented in vLLM itself).

Any advice/discussion is welcome.

Performance Benchmark

We ran vLLM's benchmark_serving script to test the performance after enabling dual-batch overlap.

```
python -m vllm.entrypoints.openai.api_server \
  --model=DeepSeek-R1-W8A8 \
  --trust-remote-code \
  --distributed-executor-backend=mp \
  -tp=16 \
  --port 8006 \
  --max-num-seqs 390 \
  --max-model-len 32768 \
  --max-num-batched-tokens 65536 \
  --block-size 128 \
  --compilation_config 0 \
  --gpu-memory-utilization 0.90 \
  --disable-log-requests \
  --additional-config '{"expert_tensor_parallel_size":1,"enable_inter_dp_scheduling":true,"init_torchair_graph_batch_sizes":true,"trace_recompiles":true,"ascend_scheduler_config":{},"enable_graph_mode":false}'
```

and ran the benchmark with the following parameters:

```
--dataset-name random --random-input-len 4096 --random-output-len 1 --num-prompts 200 --max-concurrency 8 --request-rate 5 --metric-percentiles 90
```

1. Test with the version using allgather+allreduce on Ascend 910B (tp16 ep16 + DeepSeek R1 W8A8):

prefill QPS: 2.17 -> 2.60

2. Test with the version using alltoall:


prefill QPS: 0.90 -> 1.01
Mean TTFT: 8226 ms -> 7432 ms

The overlap approach with alltoall communication can be further optimized by overlapping micro-batch 1's MoE computation with micro-batch 2's dispatch all-to-all communication.

@zxdukki zxdukki force-pushed the dev_multistream_overlap branch from 943d296 to 68070f1 Compare May 23, 2025 16:05
@zxdukki zxdukki force-pushed the dev_multistream_overlap branch from b0eed8a to 9053dd1 Compare May 27, 2025 14:21
@zxdukki zxdukki marked this pull request as ready for review May 28, 2025 05:06
@zxdukki zxdukki changed the title [feat][WIP]: support multistream overlap(dbo) for deepseek [perf]: support multistream overlap(dbo) for deepseek May 28, 2025
@zxdukki zxdukki force-pushed the dev_multistream_overlap branch 6 times, most recently from cfe6a5a to 8c24a6f Compare May 29, 2025 03:58
@zxdukki zxdukki changed the title [perf]: support multistream overlap(dbo) for deepseek [perf]: support dual-batch overlap(dbo) for deepseek May 29, 2025
@zxdukki zxdukki force-pushed the dev_multistream_overlap branch from f44b88d to b8c75ef Compare May 29, 2025 08:37
@jianzs jianzs added the ready (read for review) label May 30, 2025
@zxdukki zxdukki force-pushed the dev_multistream_overlap branch 2 times, most recently from 5308bfc to 3bd58a5 Compare May 31, 2025 11:20
Collaborator

@wangxiyuan wangxiyuan left a comment

This is a great change. Thanks for the contribution. Please rebase to main as well.

real_top_k = self.mlp.experts.top_k

if VLLM_ENABLE_MC2 and not is_prefill:
...
Collaborator

please add more notes or remove this kind of code

Contributor Author

please add more notes or remove this kind of code

Thanks for your review, we have removed it in our revised version.


if VLLM_ENABLE_MC2 and not is_prefill:
...
''' the following kernels will be submitted to the comm stream to overlap the computation of the
Collaborator

use # for comments.

enable_force_load_balance)

if VLLM_ENABLE_MC2 and not is_prefill:
...
Collaborator

ditto

residual = intermediate_tensors["residual"]

for i in range(self.start_layer, self.end_layer):
num_normal_layers = (self.first_k_dense_replace
Collaborator

I'd prefer to make the code clearer, like:

if self.can_run_ms:
  xxx
else:
  xxx

I know your code here is simpler, but it's not easy to maintain. Another way would be to move `if moe_start_layer == self.end_layer:` from _forward_ms_layers to L1014.

moe_start_layer = self.start_layer + num_normal_layers
if moe_start_layer != self.end_layer:
  hidden_states, residual = self._forward_ms_layers()

from contextlib import contextmanager
from typing import Any

# TODO: move this part to vllm
Collaborator

Can you explain more here? Do you mean multistream should be in vllm?

self.multistream_config = MultiStreamConfig()

self.use_mla = model_config.use_mla
self.multistream_metadata = make_multistream_metadata_ds(
Collaborator

multistream_metadata is only used at L973 and L975; using self here is unnecessary.

causal_lm=getattr(config, "causal_lm", True),
multistream_config=self.multistream_config,
)
self.ms_pre_layer = MultiStreamPreTransformerLayer(
Collaborator

These parameters are only used for the ms case; can we just init them when ms is enabled?


def split_micro_batch(
self,
attn_metadata: "AttentionMetadata",
Collaborator

the type should be AscendMLAMetadata

Contributor Author

We have modified it to the AscendMLAMetadata.

@zxdukki zxdukki force-pushed the dev_multistream_overlap branch 7 times, most recently from 519d279 to 3db6d6f Compare June 3, 2025 13:04
"VLLM_ASCEND_TRACE_RECOMPILES":
lambda: bool(int(os.getenv("VLLM_ASCEND_TRACE_RECOMPILES", '0'))),
"VLLM_ENABLE_DBO":
lambda: bool(int(os.getenv("VLLM_ENABLE_DBO", '0'))),
Collaborator

Suggested change
lambda: bool(int(os.getenv("VLLM_ENABLE_DBO", '0'))),
lambda: bool(int(os.getenv("VLLM_ASCEND_ENABLE_DBO", '0'))),

Contributor Author

Thanks for your review. We have renamed the env variable to VLLM_ASCEND_ENABLE_DBO for enabling DBO.
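
As context for the suggestion above, here is a small sketch of the lazy env-registry pattern this file uses; it is illustrative only and not the exact vllm_ascend.envs module. Each entry is a lambda, so the environment is read at access time rather than at import time.

```python
import os

# name -> lazy getter, mirroring the dictionary shown in the diff above
environment_variables = {
    "VLLM_ASCEND_ENABLE_DBO":
    lambda: bool(int(os.getenv("VLLM_ASCEND_ENABLE_DBO", '0'))),
}

def __getattr__(name: str):
    # module-level __getattr__ (PEP 562): when this file is imported as a
    # module, envs.VLLM_ASCEND_ENABLE_DBO evaluates the lambda on each access
    if name in environment_variables:
        return environment_variables[name]()
    raise AttributeError(f"module has no attribute {name!r}")
```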

vllm_model.generate_greedy(example_prompts, max_tokens)


def test_deepseek_model_with_dbo():
Collaborator

Suggested change
def test_deepseek_model_with_dbo():
@patch.dict(os.environ, {"VLLM_ASCEND_ENABLE_DBO": "1"})
def test_deepseek_model_with_dbo():

We need to restore the env after the test is executed.

[1] https://docs.python.org/3/library/unittest.mock.html#unittest.mock.patch.dict

Contributor Author

We need to restore the env after the test is executed.

[1] https://docs.python.org/3/library/unittest.mock.html#unittest.mock.patch.dict

We have fixed the use of env variables in the e2e tests using the suggested patch.dict approach.
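
For reference, a minimal, self-contained sketch of the suggested pattern (the test body is assumed, not the actual e2e test): patch.dict sets the variable for the duration of the decorated function and restores the previous environment afterwards.

```python
import os
from unittest.mock import patch

_before = os.environ.get("VLLM_ASCEND_ENABLE_DBO")

@patch.dict(os.environ, {"VLLM_ASCEND_ENABLE_DBO": "1"})
def test_deepseek_model_with_dbo():
    # the flag is visible only while the decorated test runs
    assert os.environ["VLLM_ASCEND_ENABLE_DBO"] == "1"
    # ... run the DBO model here, e.g. vllm_model.generate_greedy(...)

test_deepseek_model_with_dbo()
# patch.dict has restored the environment to its previous state
assert os.environ.get("VLLM_ASCEND_ENABLE_DBO") == _before
```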



def test_deepseek_model_with_dbo():
os.environ["VLLM_ENABLE_DBO"] = "1"
Collaborator

Suggested change
os.environ["VLLM_ENABLE_DBO"] = "1"

Comment on lines 70 to 72
"The president of the United States is",
"The capital of France is",
"The future of AI is",
Collaborator

Suggested change
"The president of the United States is",
"The capital of France is",
"The future of AI is",

Keeping only one is enough.

Contributor Author

Keeping only one is enough.

Now we use only one prompt, but repeat it 40 times in the DBO tests, since we set a threshold on the number of tokens required to activate dual-batch overlap.
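
In other words, the test input is built roughly like the sketch below (the repetition count comes from the test; the exact token threshold is an implementation detail not shown here).

```python
# repeat a single prompt so the batch is large enough to cross the
# token-count threshold that activates dual-batch overlap
prompt = "The future of AI is"
prompts = [prompt] * 40
```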

Comment on lines 624 to 625
from vllm_ascend.multistream.context import \
get_multistream_comm_context
Collaborator

Why import here?

Contributor Author

Why import here?

We have fixed it in our revised version by moving the imports to the top of the file.

@Yikun
Collaborator

Yikun commented Jun 3, 2025

Separate DeepSeek as DeepSeekDBO first

@wangxiyuan wangxiyuan mentioned this pull request Jun 4, 2025
@github-actions

github-actions bot commented Jun 4, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@zxdukki zxdukki force-pushed the dev_multistream_overlap branch from b5926a1 to 590551e Compare June 6, 2025 07:45
@zxdukki zxdukki force-pushed the dev_multistream_overlap branch from e96445f to 9b50520 Compare June 6, 2025 08:07
@github-actions

github-actions bot commented Jun 6, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@zxdukki zxdukki force-pushed the dev_multistream_overlap branch from 20d1afc to f9230b3 Compare June 6, 2025 10:27
@zxdukki
Contributor Author

zxdukki commented Jun 6, 2025

Separate DeepSeek as DeepSeekDBO first

To decouple DBO from the deepseek_v2 model, we adopted the suggestion and added a new model named deepseek_dbo. Users need to override the model architecture and set the env variable 'VLLM_ASCEND_ENABLE_DBO' at the same time to take the DBO code path.
An example has been added in /examples, named offline_dualbatch_overlap_npu.py.

We plan to sync the code in deepseek_dbo.py with deepseek_v2.py regularly so that we can merge DBO back into deepseek_v2.py once it is stable.
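
For convenience, here is an assumed-shape sketch of such an offline run; the authoritative version is examples/offline_dualbatch_overlap_npu.py, and the model path, parallel settings, and the architecture name passed via hf_overrides below are placeholders that may differ from the real example.

```python
import os
os.environ["VLLM_ASCEND_ENABLE_DBO"] = "1"   # enable the DBO code path

from vllm import LLM, SamplingParams

prompts = ["The future of AI is"] * 40        # enough tokens to trigger DBO

llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite",     # placeholder model
    trust_remote_code=True,
    tensor_parallel_size=2,                   # adjust to your NPU setup
    # override the architecture so the deepseek_dbo model is used
    # (the architecture name here is an assumption)
    hf_overrides={"architectures": ["DeepseekDBOForCausalLM"]},
)

outputs = llm.generate(prompts, SamplingParams(max_tokens=32))
for out in outputs:
    print(out.outputs[0].text)
```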

@wangxiyuan wangxiyuan added the ready (read for review) label Jun 6, 2025
@zxdukki
Contributor Author

zxdukki commented Jun 7, 2025

It seems that the qwen tests trigger an NPU out-of-memory issue. Maybe they need to be run again.

Separate DeepSeek as DeepSeekDBO first

@zxdukki zxdukki force-pushed the dev_multistream_overlap branch from 0c9d709 to 22cd249 Compare June 7, 2025 03:43
Collaborator

@Yikun Yikun left a comment

Thanks for the contribution! I'm OK with this.

@Yikun
Collaborator

Yikun commented Jun 7, 2025

@ganyi1996ppo @wangxiyuan

from contextlib import contextmanager
from typing import Any

_ms_comm_context: Any = None
Collaborator

Can we wrap those variables into a single class as a global variable?

Contributor Author

Can we wrap those variables into a single class as a global variable?

I will give it a try and wrap them in a single class.
In my opinion, most of the global variables here can be moved into a ctx class as class members. We can pass them as function params in deepseek_dbo to get the necessary info for switching streams (in attn/routed expert/shared expert...) or for splitting/merging the inputs.

Currently, since we do not modify the attention backend implementation, the stream-switch logic in MLA still relies on the global variable _ms_comm_context. But I think it is feasible to wrap this global variable in a ctx class and pass it through our rewritten MLA forward function to obtain the comm stream info.
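
A rough sketch of what the discussed wrapping could look like (only get_multistream_comm_context is taken from the code above; the class and field names are assumptions, and the linked demo commit later in the thread shows the authors' actual approach):

```python
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class MultiStreamStepContext:
    comm_context: Any = None    # e.g. comm stream / per-step multistream metadata
    layer_index: int = 0        # other per-step multistream state

_ms_step_context: Optional[MultiStreamStepContext] = None

def get_multistream_comm_context() -> Any:
    return None if _ms_step_context is None else _ms_step_context.comm_context

@contextmanager
def set_multistream_context(ctx: MultiStreamStepContext):
    # install the context object for the duration of one micro-batch step
    global _ms_step_context
    prev, _ms_step_context = _ms_step_context, ctx
    try:
        yield ctx
    finally:
        _ms_step_context = prev
```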

Collaborator

let's do it in the next PR.

Contributor Author

let's do it in the next PR.

OK, maybe like this demo? https://github.com/zxdukki/vllm-ascend/commit/8a7d13d05b7ccec3e696ffe31221d7a927626be1 We will update it together with the modifications to the MLA impl, following ganyi's advice, in the next PR. cc @ganyi1996ppo

# k_c.size(1) + k_pe.size(1) == kv_cache.size(2)
# i.e.
# kv_lora_rank + qk_rope_head_dim == head_size
self.mla_attn = Attention(
Collaborator

Is there any further plan to rewrite the attention impl in this dbo modeling?

Contributor Author

@zxdukki zxdukki Jun 7, 2025

Is there any further plan to rewrite the attention impl in this dbo modeling?

Thanks for your review. I think it would be better to rewrite the MLA backend impl for DBO and move the stream-switching logic into it. Maybe we can decouple it from mla_v1 and use an env variable to control the return value of get_impl_cls in the next version? Besides, do you have any other advice on modifying the attention impl for the DBO model?
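
An illustrative sketch of the env-controlled selection mentioned here (the AscendMLAImpl / AscendMLADBOImpl class names are placeholders, not the actual vllm-ascend classes):

```python
import os

class AscendMLAImpl:                      # regular MLA implementation (placeholder)
    ...

class AscendMLADBOImpl(AscendMLAImpl):    # DBO-aware variant (placeholder)
    ...

class AscendMLABackend:
    @staticmethod
    def get_impl_cls():
        # route to the DBO implementation only when the flag is set
        if bool(int(os.getenv("VLLM_ASCEND_ENABLE_DBO", "0"))):
            return AscendMLADBOImpl
        return AscendMLAImpl
```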

@wangxiyuan wangxiyuan merged commit 87ebaef into vllm-project:main Jun 7, 2025
23 checks passed
Yuxiao-Xu pushed a commit to Yuxiao-Xu/vllm-ascend that referenced this pull request Jun 7, 2025
chopper0126 pushed a commit to chopper0126/vllm-ascend that referenced this pull request Oct 16, 2025
Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Oct 21, 2025