@@ -98,7 +98,7 @@ Then run the benchmarking script
 ``` bash
 # download dataset
 # wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
-python3 vllm/benchmarks/benchmark_serving.py \
+vllm bench serve \
   --backend vllm \
   --model NousResearch/Hermes-3-Llama-3.1-8B \
   --endpoint /v1/completions \
@@ -111,25 +111,25 @@ If successful, you will see the following output
 
 ```
 ============ Serving Benchmark Result ============
-Successful requests: 10
-Benchmark duration (s): 5.78
-Total input tokens: 1369
-Total generated tokens: 2212
-Request throughput (req/s): 1.73
-Output token throughput (tok/s): 382.89
-Total Token throughput (tok/s): 619.85
+Successful requests: 10
+Benchmark duration (s): 5.78
+Total input tokens: 1369
+Total generated tokens: 2212
+Request throughput (req/s): 1.73
+Output token throughput (tok/s): 382.89
+Total Token throughput (tok/s): 619.85
 ---------------Time to First Token----------------
-Mean TTFT (ms): 71.54
-Median TTFT (ms): 73.88
-P99 TTFT (ms): 79.49
+Mean TTFT (ms): 71.54
+Median TTFT (ms): 73.88
+P99 TTFT (ms): 79.49
 -----Time per Output Token (excl. 1st token)------
-Mean TPOT (ms): 7.91
-Median TPOT (ms): 7.96
-P99 TPOT (ms): 8.03
+Mean TPOT (ms): 7.91
+Median TPOT (ms): 7.96
+P99 TPOT (ms): 8.03
 ---------------Inter-token Latency----------------
-Mean ITL (ms): 7.74
-Median ITL (ms): 7.70
-P99 ITL (ms): 8.39
+Mean ITL (ms): 7.74
+Median ITL (ms): 7.70
+P99 ITL (ms): 8.39
 ==================================================
 ```
 
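As a rough sanity check on the result block above, the total token throughput is simply (input tokens + generated tokens) divided by the benchmark duration; a minimal sketch, assuming Python 3 is on the path:

``` bash
# (1369 input + 2212 generated) tokens over ~5.78 s ≈ 620 tok/s,
# which lines up with the reported "Total Token throughput" (619.85).
python3 -c 'print((1369 + 2212) / 5.78)'
```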
@@ -141,7 +141,7 @@ If the dataset you want to benchmark is not supported yet in vLLM, even then you
 {"prompt": "What is the capital of India?"}
 {"prompt": "What is the capital of Iran?"}
 {"prompt": "What is the capital of China?"}
-```
+```
 
 ``` bash
 # start server
@@ -150,7 +150,7 @@ VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-8B-Instruct --disable-log-requests
 
 ``` bash
 # run benchmarking script
-python3 benchmarks/benchmark_serving.py --port 9001 --save-result --save-detailed \
+vllm bench serve --port 9001 --save-result --save-detailed \
   --backend vllm \
   --model meta-llama/Llama-3.1-8B-Instruct \
   --endpoint /v1/completions \
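For the custom-dataset flow above, each line of the file is a standalone JSON object with a "prompt" field. A minimal sketch of creating such a file, assuming the hypothetical name `custom_prompts.jsonl` (any path works as long as it is what you pass as the dataset path):

``` bash
# write a tiny line-delimited JSON dataset: one {"prompt": ...} object per line
cat > custom_prompts.jsonl << 'EOF'
{"prompt": "What is the capital of India?"}
{"prompt": "What is the capital of Iran?"}
{"prompt": "What is the capital of China?"}
EOF
```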
@@ -174,7 +174,7 @@ vllm serve Qwen/Qwen2-VL-7B-Instruct --disable-log-requests
 ```
 
 ``` bash
-python3 vllm/benchmarks/benchmark_serving.py \
+vllm bench serve \
   --backend openai-chat \
   --model Qwen/Qwen2-VL-7B-Instruct \
   --endpoint /v1/chat/completions \
@@ -194,7 +194,7 @@ VLLM_USE_V1=1 vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
 ```
 
 ``` bash
-python3 benchmarks/benchmark_serving.py \
+vllm bench serve \
   --model meta-llama/Meta-Llama-3-8B-Instruct \
   --dataset-name hf \
   --dataset-path likaixin/InstructCoder \
@@ -210,7 +210,7 @@ vllm serve Qwen/Qwen2-VL-7B-Instruct --disable-log-requests
 **`lmms-lab/LLaVA-OneVision-Data`**
 
 ``` bash
-python3 vllm/benchmarks/benchmark_serving.py \
+vllm bench serve \
   --backend openai-chat \
   --model Qwen/Qwen2-VL-7B-Instruct \
   --endpoint /v1/chat/completions \
@@ -224,7 +224,7 @@ python3 vllm/benchmarks/benchmark_serving.py \
 **`Aeala/ShareGPT_Vicuna_unfiltered`**
 
 ``` bash
-python3 vllm/benchmarks/benchmark_serving.py \
+vllm bench serve \
   --backend openai-chat \
   --model Qwen/Qwen2-VL-7B-Instruct \
   --endpoint /v1/chat/completions \
@@ -237,7 +237,7 @@ python3 vllm/benchmarks/benchmark_serving.py \
 **`AI-MO/aimo-validation-aime`**
 
 ``` bash
-python3 vllm/benchmarks/benchmark_serving.py \
+vllm bench serve \
   --model Qwen/QwQ-32B \
   --dataset-name hf \
   --dataset-path AI-MO/aimo-validation-aime \
@@ -248,7 +248,7 @@ python3 vllm/benchmarks/benchmark_serving.py \
 **`philschmid/mt-bench`**
 
 ``` bash
-python3 vllm/benchmarks/benchmark_serving.py \
+vllm bench serve \
   --model Qwen/QwQ-32B \
   --dataset-name hf \
   --dataset-path philschmid/mt-bench \
@@ -261,7 +261,7 @@ When using OpenAI-compatible backends such as `vllm`, optional sampling
 parameters can be specified. Example client command:
 
 ``` bash
-python3 vllm/benchmarks/benchmark_serving.py \
+vllm bench serve \
   --backend vllm \
   --model NousResearch/Hermes-3-Llama-3.1-8B \
   --endpoint /v1/completions \
@@ -296,7 +296,7 @@ The following arguments can be used to control the ramp-up:
 <br />
 
 ``` bash
-python3 vllm/benchmarks/benchmark_throughput.py \
+vllm bench throughput \
   --model NousResearch/Hermes-3-Llama-3.1-8B \
   --dataset-name sonnet \
   --dataset-path vllm/benchmarks/sonnet.txt \
@@ -314,7 +314,7 @@ Total num output tokens: 1500
 **VisionArena Benchmark for Vision Language Models**
 
 ``` bash
-python3 vllm/benchmarks/benchmark_throughput.py \
+vllm bench throughput \
   --model Qwen/Qwen2-VL-7B-Instruct \
   --backend vllm-chat \
   --dataset-name hf \
@@ -336,7 +336,7 @@ Total num output tokens: 1280
 ``` bash
 VLLM_WORKER_MULTIPROC_METHOD=spawn \
 VLLM_USE_V1=1 \
-python3 vllm/benchmarks/benchmark_throughput.py \
+vllm bench throughput \
   --dataset-name=hf \
   --dataset-path=likaixin/InstructCoder \
   --model=meta-llama/Meta-Llama-3-8B-Instruct \
@@ -360,7 +360,7 @@ Total num output tokens: 204800
 **`lmms-lab/LLaVA-OneVision-Data`**
 
 ``` bash
-python3 vllm/benchmarks/benchmark_throughput.py \
+vllm bench throughput \
   --model Qwen/Qwen2-VL-7B-Instruct \
   --backend vllm-chat \
   --dataset-name hf \
@@ -373,7 +373,7 @@ python3 vllm/benchmarks/benchmark_throughput.py \
 **`Aeala/ShareGPT_Vicuna_unfiltered`**
 
 ``` bash
-python3 vllm/benchmarks/benchmark_throughput.py \
+vllm bench throughput \
   --model Qwen/Qwen2-VL-7B-Instruct \
   --backend vllm-chat \
   --dataset-name hf \
@@ -385,7 +385,7 @@ python3 vllm/benchmarks/benchmark_throughput.py \
 **`AI-MO/aimo-validation-aime`**
 
 ``` bash
-python3 benchmarks/benchmark_throughput.py \
+vllm bench throughput \
   --model Qwen/QwQ-32B \
   --backend vllm \
   --dataset-name hf \
@@ -399,7 +399,7 @@ python3 benchmarks/benchmark_throughput.py \
 ``` bash
 # download dataset
 # wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
-python3 vllm/benchmarks/benchmark_throughput.py \
+vllm bench throughput \
   --model meta-llama/Llama-2-7b-hf \
   --backend vllm \
   --dataset_path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json \
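Before running the ShareGPT throughput benchmark, it can be worth confirming that the downloaded file parses and contains entries; a minimal sketch, assuming the file is a single JSON array sitting in the current directory:

``` bash
# parse the downloaded ShareGPT JSON and print how many entries it holds
python3 -c "import json; print(len(json.load(open('ShareGPT_V3_unfiltered_cleaned_split.json'))))"
```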