Fujh/opt send and recv overlap #8

alpha-baby · 2025-11-09T07:13:13Z

This PR makes two optimizations:

The last PP stage can call send as early as possible.
When --pp-async-batch-depth == 0, all sends over NCCL are issued together, so that send and recv can overlap.
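As a rough illustration of why grouping the sends helps, here is a toy model in plain Python. It is not the code in this PR and makes no NCCL calls: `fake_transfer`, `serial_step`, `overlapped_step`, and the 50 ms costs are all hypothetical stand-ins, and real NCCL stream semantics differ. It only shows that when the sends are launched together and the recv runs while they are in flight, the step takes about as long as the slowest transfer rather than the sum of all of them.

```python
import asyncio
import time


async def fake_transfer(seconds: float) -> None:
    """Hypothetical stand-in for one point-to-point transfer (not an NCCL call)."""
    await asyncio.sleep(seconds)


async def serial_step(send_costs, recv_cost) -> float:
    """Finish every send one after another, then start the recv."""
    start = time.perf_counter()
    for cost in send_costs:
        await fake_transfer(cost)
    await fake_transfer(recv_cost)
    return time.perf_counter() - start


async def overlapped_step(send_costs, recv_cost) -> float:
    """Launch all sends together, then run the recv while they are in flight."""
    start = time.perf_counter()
    sends = [asyncio.create_task(fake_transfer(c)) for c in send_costs]
    await fake_transfer(recv_cost)   # the sends progress during this await
    await asyncio.gather(*sends)     # make sure every send has completed
    return time.perf_counter() - start


def compare(send_costs=(0.05, 0.05), recv_cost=0.05):
    """Return (serial_seconds, overlapped_seconds) for the toy costs."""
    serial = asyncio.run(serial_step(send_costs, recv_cost))
    overlapped = asyncio.run(overlapped_step(send_costs, recv_cost))
    return serial, overlapped
```

Calling `compare()` returns the two wall-clock times; with the default toy costs the overlapped step should finish noticeably sooner than the serial one. How much a real pipeline gains depends on how much of the transfer time was exposed in the first place.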

Benchmark comparison:

4 nodes, H200, TP8, PP4

Benchmark parameters

python3 -m sglang.bench_serving --port 61001 --dataset-name random-ids --num-prompts 1024 --random-input-len 4096 --random-output-len 1536 --random-range-ratio 0.9 --disable-stream --warmup-requests 100

Before optimization

pp-async-batch-depth = 0

Benchmark results

#Input tokens: 3985822
#Output tokens: 1492964
Starting warmup with 100 sequences...
Warmup completed with 100 sequences. Starting main benchmark run...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1024/1024 [03:42<00:00,  4.61it/s]

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 not set   
Successful requests:                     1024      
Benchmark duration (s):                  222.01    
Total input tokens:                      3985822   
Total input text tokens:                 3985822   
Total input vision tokens:               0         
Total generated tokens:                  1492964   
Total generated tokens (retokenized):    1479769   
Request throughput (req/s):              4.61      
Input token throughput (tok/s):          17953.13  
Output token throughput (tok/s):         6724.68   
Total token throughput (tok/s):          24677.81  
Concurrency:                             568.16    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   123182.14 
Median E2E Latency (ms):                 122858.82 
---------------Time to First Token----------------
Mean TTFT (ms):                          122469.65 
Median TTFT (ms):                        114442.09 
P99 TTFT (ms):                           220298.19 
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================

pp-async-batch-depth = 1

Benchmark results

#Input tokens: 3985822
#Output tokens: 1492964
Starting warmup with 100 sequences...
Warmup completed with 100 sequences. Starting main benchmark run...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1024/1024 [03:41<00:00,  4.62it/s]

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 not set   
Successful requests:                     1024      
Benchmark duration (s):                  221.60    
Total input tokens:                      3985822   
Total input text tokens:                 3985822   
Total input vision tokens:               0         
Total generated tokens:                  1492964   
Total generated tokens (retokenized):    1480004   
Request throughput (req/s):              4.62      
Input token throughput (tok/s):          17986.45  
Output token throughput (tok/s):         6737.16   
Total token throughput (tok/s):          24723.61  
Concurrency:                             567.46    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   122802.80 
Median E2E Latency (ms):                 122592.23 
---------------Time to First Token----------------
Mean TTFT (ms):                          122092.78 
Median TTFT (ms):                        114055.86 
P99 TTFT (ms):                           219646.49 
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================

Accuracy test

python3 /home/shared/fujianhao.fjh/sglang/benchmark/gsm8k/bench_sglang.py --port 61001 --num-questions 1000
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [01:14<00:00, 13.46it/s]
Accuracy: 0.902
Invalid: 0.000
Latency: 74.487 s
Output throughput: 1674.920 token/s

After optimization

pp-async-batch-depth = 0

#Input tokens: 3985822
#Output tokens: 1492964
Starting warmup with 100 sequences...
Warmup completed with 100 sequences. Starting main benchmark run...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1024/1024 [03:40<00:00,  4.65it/s]

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 not set   
Successful requests:                     1024      
Benchmark duration (s):                  220.20    
Total input tokens:                      3985822   
Total input text tokens:                 3985822   
Total input vision tokens:               0         
Total generated tokens:                  1492964   
Total generated tokens (retokenized):    1479149   
Request throughput (req/s):              4.65      
Input token throughput (tok/s):          18101.14  
Output token throughput (tok/s):         6780.12   
Total token throughput (tok/s):          24881.26  
Concurrency:                             568.56    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   122261.76 
Median E2E Latency (ms):                 121822.08 
---------------Time to First Token----------------
Mean TTFT (ms):                          121555.04 
Median TTFT (ms):                        113411.08 
P99 TTFT (ms):                           218630.09 
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================

Accuracy test:

python3 /home/shared/fujianhao.fjh/sglang/benchmark/gsm8k/bench_sglang.py --port 61001 --num-questions 1000
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [01:10<00:00, 14.12it/s]
Accuracy: 0.902
Invalid: 0.000
Latency: 72.320 s
Output throughput: 1725.063 token/s
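
To put the two pp-async-batch-depth = 0 runs side by side, the relative change can be computed directly from the figures reported above. The snippet below is a small sketch; the dictionary keys and the `pct_change` helper are names chosen here for illustration, and the numbers are copied from the benchmark output in this description.

```python
# Figures copied from the pp-async-batch-depth = 0 runs above.
before = {
    "duration_s": 222.01,
    "total_tok_s": 24677.81,
    "mean_ttft_ms": 122469.65,
}
after = {
    "duration_s": 220.20,
    "total_tok_s": 24881.26,
    "mean_ttft_ms": 121555.04,
}


def pct_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


changes = {key: round(pct_change(before[key], after[key]), 2) for key in before}

for key, value in changes.items():
    print(f"{key}: {value:+.2f}%")
```

On these single runs, each metric moves by under 1% in the favorable direction (shorter duration, higher total token throughput, lower mean TTFT), and GSM8K accuracy is unchanged at 0.902. Run-to-run variance was not measured here, so the size of the gain should be read with that in mind.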

opt send and recv overlap

1fb62d2

alpha-baby changed the base branch from main to Xuchun/pp-dev November 9, 2025 07:15
