[perf]: support dual-batch overlap(dbo) for deepseek #941
Conversation
This is a great change. Thanks for the contribution. Please rebase to main as well.
vllm_ascend/models/deepseek_v2.py (Outdated)

    real_top_k = self.mlp.experts.top_k

    if VLLM_ENABLE_MC2 and not is_prefill:
        ...
Please add more notes or remove this kind of code.
> Please add more notes or remove this kind of code.

Thanks for your review; we have removed it in our revised version.
vllm_ascend/models/deepseek_v2.py (Outdated)

    if VLLM_ENABLE_MC2 and not is_prefill:
        ...
    ''' the following kernels will be submitted to the comm stream to overlap the computation of the
Use # for comments.
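For example, the string-literal comment in the snippet above could become ordinary # comments (illustrative rewrite, not the exact final code):

```python
# The following kernels will be submitted to the comm stream to overlap
# the computation of ...
```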
vllm_ascend/models/deepseek_v2.py (Outdated)

    enable_force_load_balance)

    if VLLM_ENABLE_MC2 and not is_prefill:
        ...
ditto
vllm_ascend/models/deepseek_v2.py (Outdated)

    residual = intermediate_tensors["residual"]

    for i in range(self.start_layer, self.end_layer):
        num_normal_layers = (self.first_k_dense_replace
I prefer to structure the code like this to make it clearer:

    if self.can_run_ms:
        xxx
    else:
        xxx

I know your code here is simpler, but it's not easy to maintain. Another way is to move `if moe_start_layer == self.end_layer:` from `_forward_ms_layers` to L1014:

    moe_start_layer = self.start_layer + num_normal_layers
    if moe_start_layer != self.end_layer:
        hidden_states, residual = self._forward_ms_layers()
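A minimal sketch of what the suggested structure could look like; `can_run_ms`, `first_k_dense_replace`, and `_forward_ms_layers` come from the snippets above, everything else (arguments, layer call signature) is illustrative only:

```python
def forward(self, positions, hidden_states, residual):
    # When multistream (dbo) cannot run, treat every layer as a normal layer.
    num_normal_layers = (self.first_k_dense_replace
                         if self.can_run_ms
                         else self.end_layer - self.start_layer)

    # Normal (non-overlapped) layers run as usual.
    for i in range(self.start_layer, self.start_layer + num_normal_layers):
        layer = self.layers[i]
        hidden_states, residual = layer(positions, hidden_states, residual)

    # Only enter the multistream path when MoE layers remain, so the
    # early-return check inside _forward_ms_layers is no longer needed there.
    moe_start_layer = self.start_layer + num_normal_layers
    if moe_start_layer != self.end_layer:
        hidden_states, residual = self._forward_ms_layers(
            positions, hidden_states, residual, moe_start_layer)

    return hidden_states, residual
```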
vllm_ascend/multistream/context.py (Outdated)

    from contextlib import contextmanager
    from typing import Any

    # TODO: move this part to vllm
Can you explain more here? Do you mean multistream should be in vllm?
vllm_ascend/models/deepseek_v2.py (Outdated)

    self.multistream_config = MultiStreamConfig()

    self.use_mla = model_config.use_mla
    self.multistream_metadata = make_multistream_metadata_ds(
multistream_metadata is only used at L973 and L975; assigning it to self here is unnecessary.
vllm_ascend/models/deepseek_v2.py (Outdated)

    causal_lm=getattr(config, "causal_lm", True),
    multistream_config=self.multistream_config,
    )
    self.ms_pre_layer = MultiStreamPreTransformerLayer(
These parameters are only used in the ms case; can we just init them when ms is enabled?
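A possible sketch of that conditional initialization, which also keeps multistream_metadata local instead of on self (the env accessor and the constructor arguments beyond those visible in the diff are assumptions):

```python
# Sketch only: build the multistream objects solely when DBO is enabled, and
# keep the metadata as a local variable since it is only consumed right here.
self.multistream_config = None
if envs_ascend.VLLM_ASCEND_ENABLE_DBO:  # assumed to be read via vllm_ascend envs
    self.multistream_config = MultiStreamConfig()
    multistream_metadata = make_multistream_metadata_ds(
        causal_lm=getattr(config, "causal_lm", True),
        multistream_config=self.multistream_config,
    )
    # Constructor argument assumed; mirrors the call visible in the diff.
    self.ms_pre_layer = MultiStreamPreTransformerLayer(multistream_metadata)
```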
vllm_ascend/multistream/metadata.py (Outdated)

    def split_micro_batch(
        self,
        attn_metadata: "AttentionMetadata",
the type should be AscendMLAMetadata
We have changed it to AscendMLAMetadata.
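For reference, a sketch of the updated annotation (other parameters elided, not the exact final signature):

```python
def split_micro_batch(
    self,
    attn_metadata: "AscendMLAMetadata",
    # ... remaining parameters unchanged ...
):
    ...
```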
vllm_ascend/envs.py (Outdated)

    "VLLM_ASCEND_TRACE_RECOMPILES":
    lambda: bool(int(os.getenv("VLLM_ASCEND_TRACE_RECOMPILES", '0'))),
    "VLLM_ENABLE_DBO":
    lambda: bool(int(os.getenv("VLLM_ENABLE_DBO", '0'))),
Suggested change:

    -    lambda: bool(int(os.getenv("VLLM_ENABLE_DBO", '0'))),
    +    lambda: bool(int(os.getenv("VLLM_ASCEND_ENABLE_DBO", '0'))),
Thanks for your review. We have updated the env variable to VLLM_ASCEND_ENABLE_DBO for enabling dbo.
    vllm_model.generate_greedy(example_prompts, max_tokens)


    def test_deepseek_model_with_dbo():
Suggested change:

    -def test_deepseek_model_with_dbo():
    +@patch.dict(os.environ, {"VLLM_ASCEND_ENABLE_DBO": "1"})
    +def test_deepseek_model_with_dbo():

We need to restore the env after the test is executed.
[1] https://docs.python.org/3/library/unittest.mock.html#unittest.mock.patch.dict
> We need to restore the env after the test is executed.
> [1] https://docs.python.org/3/library/unittest.mock.html#unittest.mock.patch.dict

We have fixed the use of env variables in the e2e tests by using the suggested patch.dict approach.
    def test_deepseek_model_with_dbo():
        os.environ["VLLM_ENABLE_DBO"] = "1"
Suggested change (remove this line):

    -        os.environ["VLLM_ENABLE_DBO"] = "1"
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",

Keeping only one is enough.
> Keeping only one is enough.

Now we use only one prompt but repeat it 40 times in the dbo tests, since we set a threshold on the number of tokens required to activate dual-batch overlap.
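A hedged sketch of how such a test could look; the model name, runner arguments, and the exact threshold are placeholders, and `VllmRunner`/`generate_greedy` follow the surrounding e2e tests:

```python
import os
from unittest.mock import patch

from tests.conftest import VllmRunner  # import path assumed


@patch.dict(os.environ, {"VLLM_ASCEND_ENABLE_DBO": "1"})
def test_deepseek_model_with_dbo():
    # patch.dict restores os.environ after the test returns, so the flag
    # never leaks into other tests.
    example_prompts = ["The future of AI is"] * 40  # enough tokens to cross the DBO threshold
    max_tokens = 5
    with VllmRunner("deepseek-ai/DeepSeek-V2-Lite",
                    enforce_eager=True) as vllm_model:
        vllm_model.generate_greedy(example_prompts, max_tokens)
```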
vllm_ascend/attention/mla_v1.py (Outdated)

    from vllm_ascend.multistream.context import \
        get_multistream_comm_context
Why import here?
> Why import here?

We have fixed this in our revised version by moving the import to the beginning of the file.
Separate DeepSeek as DeepSeekDBO first

This pull request has conflicts, please resolve those before we can evaluate the pull request.
To decouple dbo from the deepseek_v2 model, we adopt the suggestion and add a new model named deepseek_dbo. Users should override the model arch and set the env variable 'VLLM_ASCEND_ENABLE_DBO' simultaneously to run the dbo logic. We plan to sync the code in
Thanks for the contribution! I'm OK with this.
vllm_ascend/multistream/context.py

    from contextlib import contextmanager
    from typing import Any

    _ms_comm_context: Any = None
Can we wrap those variables into a single class as a global variable?
> Can we wrap those variables into a single class as a global variable?

I will give it a try and wrap them in a single class.
In my opinion, most of the global variables here can be moved into a ctx class as class members. We can pass them as function params in deepseek_dbo to get the necessary info for switching streams (in attn/routed expert/shared expert...) or for splitting/merging the inputs.
Currently, as we do not modify the impl of the attention backend, the stream-switch logic in mla still relies on the global variable _ms_comm_context. But I think it is feasible to wrap this global variable in a ctx class and pass it through our rewritten forward function of mla to obtain the comm stream info.
let's do it in the next PR.
> let's do it in the next PR.

OK, maybe like this demo? https://github.com/zxdukki/vllm-ascend/commit/8a7d13d05b7ccec3e696ffe31221d7a927626be1 We will update it together with the modifications to the mla impl according to ganyi's advice in the next PR. cc @ganyi1996ppo
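Something along these lines, perhaps (a rough sketch only, with illustrative names; `get_multistream_comm_context` mirrors the accessor already imported in mla_v1.py, everything else is assumed):

```python
from contextlib import contextmanager
from typing import Any, Optional


class MultiStreamContext:
    """Bundles the per-forward multistream state that currently lives in
    separate module-level globals (comm context, step metadata, ...)."""

    def __init__(self) -> None:
        self.comm_context: Any = None
        self.metadata: Any = None


_ms_ctx: Optional[MultiStreamContext] = None


@contextmanager
def set_multistream_context(ctx: MultiStreamContext):
    # Install the context for one (micro-)batch forward and restore the
    # previous one afterwards, so nested or sequential use stays safe.
    global _ms_ctx
    prev, _ms_ctx = _ms_ctx, ctx
    try:
        yield ctx
    finally:
        _ms_ctx = prev


def get_multistream_comm_context() -> Any:
    return None if _ms_ctx is None else _ms_ctx.comm_context
```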
    # k_c.size(1) + k_pe.size(1) == kv_cache.size(2)
    # i.e.
    # kv_lora_rank + qk_rope_head_dim == head_size
    self.mla_attn = Attention(
Is there any further plan to rewrite the attention impl in this dbo modeling?
> Is there any further plan to rewrite the attention impl in this dbo modeling?

Thanks for your review. I think it would be better to rewrite the impl of the mla backend for dbo and move the logic of switching npu streams into it. Maybe we can decouple them from mla_v1 and use env variables to control the return of get_impl_cls in the next version? Besides, do you have any other advice on the modifications to the attention impl for the dbo model?
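For instance, a hypothetical way to select a DBO-aware implementation via get_impl_cls (AscendMLADBOImpl does not exist in this PR; the import paths and env accessor are assumptions):

```python
from vllm.attention.backends.abstract import AttentionBackend

import vllm_ascend.envs as envs_ascend  # env accessor assumed


class AscendMLABackend(AttentionBackend):

    @staticmethod
    def get_impl_cls():
        # Return a DBO-aware MLA implementation only when the flag is set,
        # leaving the default attention path untouched.
        if envs_ascend.VLLM_ASCEND_ENABLE_DBO:
            return AscendMLADBOImpl  # hypothetical DBO-aware impl
        return AscendMLAImpl
```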
### What this PR does / why we need it?
Based on the design of dual-batch overlap proposed by the DeepSeek team and the implementation of fused MoE in the vLLM project, we implement multi-stream (also known as dual-batch) overlap for deepseek+mla on Ascend NPU. We split the input batch of the model into two microbatches and then overlap the comp/comm ops in the attention and moe layers using two streams to improve performance. Our approach can be easily extended when adding dispatch/combine communications for the moe layer.
Compared with the previously proposed draft, we use one stream for computation ops and the other for communication ops, separately. In our opinion, this is beneficial for arranging the order in which different ops execute and thus avoiding contention for computation/communication resources.
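To make the idea concrete, here is a heavily simplified sketch (not the PR's actual code): `layer.compute` and `layer.communicate` are placeholders, and we assume torch_npu mirrors the torch.cuda stream API (torch.npu.Stream, torch.npu.stream, wait_stream).

```python
import torch
import torch_npu  # noqa: F401  # registers the torch.npu device backend

comm_stream = torch.npu.Stream()


def forward_dbo(layer, hidden_states: torch.Tensor) -> torch.Tensor:
    # Split the batch into two microbatches along the token dimension.
    micro_batches = hidden_states.chunk(2, dim=0)
    outputs = []
    for mb in micro_batches:
        out = layer.compute(mb)  # computation stays on the default stream
        # Hand the result to the comm stream; its communication then overlaps
        # with the next microbatch's computation on the default stream.
        comm_stream.wait_stream(torch.npu.current_stream())
        with torch.npu.stream(comm_stream):
            out = layer.communicate(out)
        outputs.append(out)
    # Make sure all communication has finished before the results are consumed.
    torch.npu.current_stream().wait_stream(comm_stream)
    return torch.cat(outputs, dim=0)
```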
ref: [overlap for llama](https://github.com/vllm-project/vllm/pull/15787/files)
ref: [dbo in sglang](https://github.com/sgl-project/sglang/pull/4068/files#diff-b4937569fc71f6ad215181b633b2f89c7183a2b4ac39e41fc22635599a9be7de)
### Does this PR introduce any user-facing change?
Adds an env variable "VLLM_ASCEND_ENABLE_DBO". Users can enable dbo by setting VLLM_ASCEND_ENABLE_DBO=1.
See /examples/offline_dualbatch_overlap_npu.py for more info.
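A minimal offline usage sketch (the model path, parallel size, prompt count, and sampling params are placeholders; see the example script above for the real version):

```python
import os

# Set the flag before creating the engine so vllm-ascend picks it up.
os.environ["VLLM_ASCEND_ENABLE_DBO"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(model="deepseek-ai/DeepSeek-V2-Lite",
          tensor_parallel_size=2,
          trust_remote_code=True)
outputs = llm.generate(["The future of AI is"] * 40,
                       SamplingParams(max_tokens=32))
```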
### How was this patch tested?
This patch can be tested with vllm-0.9.0 using its online service with benchmark tests. We have decoupled the dbo functionality from vllm, and it should run without any modification to the vllm code (though some modifications would be better implemented in vllm itself).
Any advice/discussion is welcome.
### Performance Benchmark
We ran the benchmark_serving script of vllm to test the performance after enabling dual-batch overlap.
Server command:

    python -m vllm.entrypoints.openai.api_server \
        --model=DeepSeek-R1-W8A8 \
        --trust-remote-code \
        --distributed-executor-backend=mp \
        -tp=16 \
        --port 8006 \
        --max-num-seqs 390 \
        --max-model-len 32768 \
        --max-num-batched-tokens 65536 \
        --block-size 128 \
        --compilation_config 0 \
        --gpu-memory-utilization 0.90 \
        --disable-log-requests \
        --additional-config '{"expert_tensor_parallel_size":1,"enable_inter_dp_scheduling":true,"init_torchair_graph_batch_sizes":true,"trace_recompiles":true,"ascend_scheduler_config":{},"enable_graph_mode":false}'

Benchmark parameters:

    --dataset-name random --random-input-len 4096 --random-output-len 1 --num-prompts 200 --max-concurrency 8 --request-rate 5 --metric-percentiles 90

Results:

1. With allgather+allreduce on Ascend 910B (tp16 ep16, deepseek r1 w8a8): prefill qps: 2.17 -> 2.60
2. With alltoall: prefill qps: 0.90 -> 1.01, Mean TTFT: 8226 ms -> 7432 ms

The overlap approach when using alltoall communication can be further optimized by overlapping micro-batch1's moe computation with micro-batch2's dispatch a2a communication.