@alpha-baby
Two optimizations

  1. The last PP stage can call send as early as possible.
  2. When --pp-async-batch-depth == 0, group all NCCL sends together so that send and recv can overlap.
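The effect of the second point can be illustrated with a CPU-only toy model. This is a sketch under stated assumptions, not the code in this PR: `fake_send`, `fake_recv`, the tensor names, and the two durations are hypothetical stand-ins for NCCL transfers, and `asyncio.sleep` stands in for time spent on the wire. It only shows why posting all sends before waiting on the recv can shorten a step.

```python
import asyncio
import time

# Hypothetical durations, chosen only to make the effect visible.
SEND_SECONDS = 0.05  # time to push one tensor to the next PP stage
RECV_SECONDS = 0.10  # time to receive the next micro-batch


async def fake_send(name: str) -> str:
    """Stand-in for a point-to-point send (not a real NCCL call)."""
    await asyncio.sleep(SEND_SECONDS)
    return name


async def fake_recv() -> str:
    """Stand-in for a point-to-point recv (not a real NCCL call)."""
    await asyncio.sleep(RECV_SECONDS)
    return "next_batch"


async def interleaved(tensors):
    # Each send is awaited before the next transfer starts,
    # so the recv cannot begin until every send has finished.
    for name in tensors:
        await fake_send(name)
    return await fake_recv()


async def grouped(tensors):
    # Post every send first, then receive while the sends are in flight.
    pending = [asyncio.create_task(fake_send(name)) for name in tensors]
    received = await fake_recv()
    await asyncio.gather(*pending)  # make sure all sends completed
    return received


def timed(coro_fn, tensors):
    """Run one variant and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn(tensors))
    return result, time.perf_counter() - start
```

Calling `timed(interleaved, ["hidden_states", "residual"])` and `timed(grouped, ["hidden_states", "residual"])` should show the grouped variant finishing sooner, because the recv no longer waits behind the sends. Real gains depend on the interconnect and on how the communication library schedules the transfers.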

Benchmark comparison:

4 nodes, H200, TP8, PP4

Benchmark command

python3 -m sglang.bench_serving --port 61001 --dataset-name random-ids --num-prompts 1024 --random-input-len 4096 --random-output-len 1536 --random-range-ratio 0.9 --disable-stream --warmup-requests 100

Before optimization

pp-async-batch-depth = 0

Benchmark results

#Input tokens: 3985822
#Output tokens: 1492964
Starting warmup with 100 sequences...
Warmup completed with 100 sequences. Starting main benchmark run...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1024/1024 [03:42<00:00,  4.61it/s]

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 not set   
Successful requests:                     1024      
Benchmark duration (s):                  222.01    
Total input tokens:                      3985822   
Total input text tokens:                 3985822   
Total input vision tokens:               0         
Total generated tokens:                  1492964   
Total generated tokens (retokenized):    1479769   
Request throughput (req/s):              4.61      
Input token throughput (tok/s):          17953.13  
Output token throughput (tok/s):         6724.68   
Total token throughput (tok/s):          24677.81  
Concurrency:                             568.16    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   123182.14 
Median E2E Latency (ms):                 122858.82 
---------------Time to First Token----------------
Mean TTFT (ms):                          122469.65 
Median TTFT (ms):                        114442.09 
P99 TTFT (ms):                           220298.19 
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================

pp-async-batch-depth = 1

Benchmark results

#Input tokens: 3985822
#Output tokens: 1492964
Starting warmup with 100 sequences...
Warmup completed with 100 sequences. Starting main benchmark run...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1024/1024 [03:41<00:00,  4.62it/s]

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 not set   
Successful requests:                     1024      
Benchmark duration (s):                  221.60    
Total input tokens:                      3985822   
Total input text tokens:                 3985822   
Total input vision tokens:               0         
Total generated tokens:                  1492964   
Total generated tokens (retokenized):    1480004   
Request throughput (req/s):              4.62      
Input token throughput (tok/s):          17986.45  
Output token throughput (tok/s):         6737.16   
Total token throughput (tok/s):          24723.61  
Concurrency:                             567.46    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   122802.80 
Median E2E Latency (ms):                 122592.23 
---------------Time to First Token----------------
Mean TTFT (ms):                          122092.78 
Median TTFT (ms):                        114055.86 
P99 TTFT (ms):                           219646.49 
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================

Accuracy test

python3 /home/shared/fujianhao.fjh/sglang/benchmark/gsm8k/bench_sglang.py --port 61001 --num-questions 1000
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [01:14<00:00, 13.46it/s]
Accuracy: 0.902
Invalid: 0.000
Latency: 74.487 s
Output throughput: 1674.920 token/s

After optimization

pp-async-batch-depth = 0

#Input tokens: 3985822
#Output tokens: 1492964
Starting warmup with 100 sequences...
Warmup completed with 100 sequences. Starting main benchmark run...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1024/1024 [03:40<00:00,  4.65it/s]

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 not set   
Successful requests:                     1024      
Benchmark duration (s):                  220.20    
Total input tokens:                      3985822   
Total input text tokens:                 3985822   
Total input vision tokens:               0         
Total generated tokens:                  1492964   
Total generated tokens (retokenized):    1479149   
Request throughput (req/s):              4.65      
Input token throughput (tok/s):          18101.14  
Output token throughput (tok/s):         6780.12   
Total token throughput (tok/s):          24881.26  
Concurrency:                             568.56    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   122261.76 
Median E2E Latency (ms):                 121822.08 
---------------Time to First Token----------------
Mean TTFT (ms):                          121555.04 
Median TTFT (ms):                        113411.08 
P99 TTFT (ms):                           218630.09 
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================

Accuracy test:

#python3 /home/shared/fujianhao.fjh/sglang/benchmark/gsm8k/bench_sglang.py --port 61001 --num-questions 1000
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [01:10<00:00, 14.12it/s]
Accuracy: 0.902
Invalid: 0.000
Latency: 72.320 s
Output throughput: 1725.063 token/s
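To put the two pp-async-batch-depth = 0 runs side by side, the relative change of a few headline metrics can be computed directly from the logs above. The snippet below is only a convenience sketch: the numbers are copied from the reported results, while the helper `relative_change` and the dictionary names are illustrative. A single run per configuration says nothing about run-to-run variance.

```python
def relative_change(before: float, after: float) -> float:
    """Percentage change from `before` to `after`."""
    return (after - before) / before * 100.0


# Values copied from the pp-async-batch-depth = 0 logs above.
BEFORE = {
    "Benchmark duration (s)": 222.01,
    "Total token throughput (tok/s)": 24677.81,
    "Mean E2E Latency (ms)": 123182.14,
    "Mean TTFT (ms)": 122469.65,
}
AFTER = {
    "Benchmark duration (s)": 220.20,
    "Total token throughput (tok/s)": 24881.26,
    "Mean E2E Latency (ms)": 122261.76,
    "Mean TTFT (ms)": 121555.04,
}


def summarize() -> dict:
    """Return the percentage change for every metric in BEFORE."""
    return {name: relative_change(BEFORE[name], AFTER[name]) for name in BEFORE}


if __name__ == "__main__":
    for name, pct in summarize().items():
        print(f"{name}: {pct:+.2f}%")
```

All four metrics move by less than 1% in the favorable direction, and the GSM8K accuracy is 0.902 both before and after.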

@alpha-baby alpha-baby changed the base branch from main to Xuchun/pp-dev November 9, 2025 07:15