Merged
46 commits
2f8d51c
Remove unused variables from CPA
hyoon1 Jun 20, 2025
c3649e4
[Docs] Fix syntax highlighting of shell commands (#19870)
lgeiger Jun 23, 2025
68aaeb3
[EP+DP] Optimize the little operations in the DeepGEMM + DeepEP low l…
tlrmchlsmth Jun 23, 2025
61f4fc5
[Bugfix][v1] Fix step pooler implementation and step pooling usage in…
Isotr0py Jun 23, 2025
d0132f0
[Misc] Add type alias `ReqId` and `EngineId` for better readability (…
lk-chen Jun 23, 2025
e6327c9
[Feature] Support sequence parallelism for static fp8 quantization (#…
cascade812 Jun 23, 2025
a3bc76e
[CI/Build] Push latest tag for cpu and neuron docker image (#19897)
22quinn Jun 23, 2025
dd2ccf8
Feat Dynamic Quantization for MoE Layers in GPTQ Marlin Backend (#19395)
Jun-Howie Jun 23, 2025
4671ac6
[Bugfix][Benchmark] Fix Marlin benchmark (#19929)
22quinn Jun 23, 2025
33d5e29
[TPU] Fix tpu model runner test (#19995)
Chenyaaang Jun 23, 2025
a738dbb
Update test case parameter to have the throughput above 8.0 (#19994)
QiliangCui Jun 24, 2025
ee5ad8d
[Misc][Tools][Benchmark] Add profile to autotune script (#19711)
Chenyaaang Jun 24, 2025
0eed516
[doc] Fix broken link in the installation for CPU (#19980)
yankay Jun 24, 2025
3014c92
add some examples for other benchmark scripts (#19893)
reidliu41 Jun 24, 2025
9a3b883
[PERF] Speedup of MRoPE prepare inputs (#19939)
vadiklyutiy Jun 24, 2025
53da4cd
[Bugfix][CPU] Fix InputBatch for pooling models in the CPU v1 (#20014)
bigPYJ1151 Jun 24, 2025
26d34eb
refactor example - qwen3_reranker (#19847)
reidliu41 Jun 24, 2025
981eeca
[Fix][V1] Remove --scheduling-policy oracle (#20010)
amitm02 Jun 24, 2025
a045b7e
[Perf] Improve/Fix-regression for FA3 in High QPS regimes (#19463)
LucasWilkinson Jun 24, 2025
c635c5f
[Misc][Benchmarking] Add variable request-rate ("ramp-up") to the ben…
dtransposed Jun 24, 2025
8619e71
[BugFix] Fix multi-node offline data parallel (#19937)
njhill Jun 24, 2025
91f7d9d
[P/D] Asynchronously do _nixl_handshake (#19836)
lk-chen Jun 24, 2025
c6e3bba
[Feature] Integrate new deepgemm (#19820)
yewentao256 Jun 24, 2025
ead3698
[Easy] Remove submodule added in #19463 (#20039)
b8zhong Jun 24, 2025
c01d1c5
use .dev for version comparison with pytorch nightly release (#20031)
BoyuanFeng Jun 24, 2025
0d06b53
cmake: Update vllm_flash_attn for vllm_kernels (#20032)
seemethere Jun 24, 2025
1afa994
[Llama4] Update `attn_temperature_tuning` (#19997)
b8zhong Jun 25, 2025
a6c4b87
Revert "[Feature] Integrate new deepgemm (#19820)" (#20049)
yewentao256 Jun 25, 2025
2273ec3
Revert "Fix(models/siglip): Add compatibility for Gemma models quanti…
Isotr0py Jun 25, 2025
3443aaf
Move to a faster base64 implementation (#19984)
h-avsha Jun 25, 2025
7108934
[Frontend] speed up import time of vllm.config (#18036)
davidxia Jun 25, 2025
879f69b
[Refactor] Remove duplicate `ceil_div` (#20023)
yewentao256 Jun 25, 2025
f59fc60
[Feat][CLI] enforce-include-usage (#19695)
max-wittig Jun 25, 2025
015fab8
[Kernels][Bugfix] Use torch op for all kernels in FusedMoE forward. …
bnellnm Jun 25, 2025
ba7ba35
[Chore] debloat some initial logs (#19438)
aarnphm Jun 25, 2025
0f9e735
[BugFix] Fix full-cuda-graph illegal memory access in FA3 (#20057)
LucasWilkinson Jun 25, 2025
c53fec1
[doc] add reference link for Intel XPU (#20064)
reidliu41 Jun 25, 2025
bf51815
[Doc] Guide for Incremental Compilation Workflow (#19109)
mgoin Jun 25, 2025
8359f4c
[V1][Speculative Decoding] Fix DeepSeek MTP (#20022)
cjackal Jun 25, 2025
e795d72
[Frontend] Add `/v1/audio/translations` OpenAI API endpoint (#19615)
NickLucche Jun 25, 2025
02c97d9
[Quantization] Add compressed-tensors emulations support for NVFP4 (#…
dsikka Jun 25, 2025
23a04e0
[Fix] Support cls pooling in ModernBertPooler (#20067)
lsz05 Jun 25, 2025
8b8c209
static_scaled_fp8_quant should not run when scale.numel is not 1 (#20…
eldarkurtic Jun 25, 2025
4734704
[PD] let toy proxy handle /chat/completions (#19730)
lk-chen Jun 25, 2025
52741bd
Merge remote-tracking branch 'upstream/main'
gshtras Jun 25, 2025
4ed2d76
Merge remote-tracking branch 'hyoon1/remove_unused_var'
gshtras Jun 25, 2025
2 changes: 1 addition & 1 deletion .buildkite/nightly-benchmarks/nightly-annotation.md
@@ -16,7 +16,7 @@ Please download the visualization scripts in the post
- Download `nightly-benchmarks.zip`.
- In the same folder, run the following code:

```console
```bash
export HF_TOKEN=<your HF token>
apt update
apt install -y git
2 changes: 2 additions & 0 deletions .buildkite/release-pipeline.yaml
@@ -102,6 +102,7 @@ steps:
commands:
- "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg GIT_REPO_CHECK=1 --tag public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:$(buildkite-agent meta-data get release-version) --tag public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:latest --progress plain --target vllm-openai -f docker/Dockerfile.cpu ."
- "docker push public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:latest"
- "docker push public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:$(buildkite-agent meta-data get release-version)"
env:
DOCKER_BUILDKIT: "1"
@@ -117,6 +118,7 @@ steps:
commands:
- "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg GIT_REPO_CHECK=1 --tag public.ecr.aws/q9t5s3a7/vllm-neuron-release-repo:$(buildkite-agent meta-data get release-version) --tag public.ecr.aws/q9t5s3a7/vllm-neuron-release-repo:latest --progress plain -f docker/Dockerfile.neuron ."
- "docker push public.ecr.aws/q9t5s3a7/vllm-neuron-release-repo:latest"
- "docker push public.ecr.aws/q9t5s3a7/vllm-neuron-release-repo:$(buildkite-agent meta-data get release-version)"
env:
DOCKER_BUILDKIT: "1"
4 changes: 2 additions & 2 deletions .buildkite/scripts/tpu/config_v6e_1.env
@@ -4,8 +4,8 @@ CONTAINER_NAME=vllm-tpu

# vllm config
MODEL=meta-llama/Llama-3.1-8B-Instruct
MAX_NUM_SEQS=512
MAX_NUM_BATCHED_TOKENS=512
MAX_NUM_SEQS=256
MAX_NUM_BATCHED_TOKENS=1024
TENSOR_PARALLEL_SIZE=1
MAX_MODEL_LEN=2048
DOWNLOAD_DIR=/mnt/disks/persist
3 changes: 3 additions & 0 deletions .buildkite/test-pipeline.yaml
@@ -615,13 +615,16 @@ steps:
- vllm/executor/
- vllm/model_executor/models/
- tests/distributed/
- tests/examples/offline_inference/data_parallel.py
commands:
- # the following commands are for the first node, with ip 192.168.10.10 (ray environment already set up)
- VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep 'Same node test passed'
- python3 ../examples/offline_inference/data_parallel.py --dp-size=2 --tp-size=1 --node-size=2 --node-rank=0 --master-addr=192.168.10.10 --master-port=12345 --enforce-eager --trust-remote-code
- VLLM_MULTI_NODE=1 pytest -v -s distributed/test_multi_node_assignment.py
- VLLM_MULTI_NODE=1 pytest -v -s distributed/test_pipeline_parallel.py
- # the following commands are for the second node, with ip 192.168.10.11 (ray environment already set up)
- VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep 'Same node test passed'
- python3 ../examples/offline_inference/data_parallel.py --dp-size=2 --tp-size=1 --node-size=2 --node-rank=1 --master-addr=192.168.10.10 --master-port=12345 --enforce-eager --trust-remote-code

- label: Distributed Tests (2 GPUs) # 40min
mirror_hardwares: [amdexperimental]
190 changes: 190 additions & 0 deletions benchmarks/README.md
@@ -269,6 +269,21 @@ python3 vllm/benchmarks/benchmark_serving.py \
--num-prompts 10
```

### Running With Ramp-Up Request Rate

The benchmark tool also supports ramping up the request rate over the
duration of the benchmark run. This can be useful for stress testing the
server or finding the maximum throughput that it can handle, given some latency budget.

Two ramp-up strategies are supported:
- `linear`: Increases the request rate linearly from a start value to an end value.
- `exponential`: Increases the request rate exponentially.

The following arguments can be used to control the ramp-up (an example invocation is sketched after this list):
- `--ramp-up-strategy`: The ramp-up strategy to use (`linear` or `exponential`).
- `--ramp-up-start-rps`: The request rate at the beginning of the benchmark.
- `--ramp-up-end-rps`: The request rate at the end of the benchmark.
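
A minimal sketch of a linear ramp-up run follows; reuse whatever backend, model, and dataset flags you normally pass to `benchmark_serving.py`, and treat the values below as illustrative only:

```bash
python3 vllm/benchmarks/benchmark_serving.py \
    --backend vllm \
    --model NousResearch/Hermes-3-Llama-3.1-8B \
    --dataset-name random \
    --ramp-up-strategy linear \
    --ramp-up-start-rps 1 \
    --ramp-up-end-rps 10 \
    --num-prompts 100
```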

---
## Example - Offline Throughput Benchmark

@@ -387,3 +402,178 @@ python3 vllm/benchmarks/benchmark_throughput.py \
--enable-lora \
--lora-path yard1/llama-2-7b-sql-lora-test
```

---
## Example - Structured Output Benchmark

Benchmark the performance of structured output generation (JSON, grammar, regex).

### Server Setup

```bash
vllm serve NousResearch/Hermes-3-Llama-3.1-8B --disable-log-requests
```

### JSON Schema Benchmark

```bash
python3 benchmarks/benchmark_serving_structured_output.py \
--backend vllm \
--model NousResearch/Hermes-3-Llama-3.1-8B \
--dataset json \
--structured-output-ratio 1.0 \
--request-rate 10 \
--num-prompts 1000
```

### Grammar-based Generation Benchmark

```bash
python3 benchmarks/benchmark_serving_structured_output.py \
--backend vllm \
--model NousResearch/Hermes-3-Llama-3.1-8B \
--dataset grammar \
--structure-type grammar \
--request-rate 10 \
--num-prompts 1000
```

### Regex-based Generation Benchmark

```bash
python3 benchmarks/benchmark_serving_structured_output.py \
--backend vllm \
--model NousResearch/Hermes-3-Llama-3.1-8B \
--dataset regex \
--request-rate 10 \
--num-prompts 1000
```

### Choice-based Generation Benchmark

```bash
python3 benchmarks/benchmark_serving_structured_output.py \
--backend vllm \
--model NousResearch/Hermes-3-Llama-3.1-8B \
--dataset choice \
--request-rate 10 \
--num-prompts 1000
```

### XGrammar Benchmark Dataset

```bash
python3 benchmarks/benchmark_serving_structured_output.py \
--backend vllm \
--model NousResearch/Hermes-3-Llama-3.1-8B \
--dataset xgrammar_bench \
--request-rate 10 \
--num-prompts 1000
```

---
## Example - Long Document QA Throughput Benchmark

Benchmark the performance of long document question-answering with prefix caching.

### Basic Long Document QA Test

```bash
python3 benchmarks/benchmark_long_document_qa_throughput.py \
--model meta-llama/Llama-2-7b-chat-hf \
--enable-prefix-caching \
--num-documents 16 \
--document-length 2000 \
--output-len 50 \
--repeat-count 5
```

### Different Repeat Modes

```bash
# Random mode (default) - shuffle prompts randomly
python3 benchmarks/benchmark_long_document_qa_throughput.py \
--model meta-llama/Llama-2-7b-chat-hf \
--enable-prefix-caching \
--num-documents 8 \
--document-length 3000 \
--repeat-count 3 \
--repeat-mode random

# Tile mode - repeat entire prompt list in sequence
python3 benchmarks/benchmark_long_document_qa_throughput.py \
--model meta-llama/Llama-2-7b-chat-hf \
--enable-prefix-caching \
--num-documents 8 \
--document-length 3000 \
--repeat-count 3 \
--repeat-mode tile

# Interleave mode - repeat each prompt consecutively
python3 benchmarks/benchmark_long_document_qa_throughput.py \
--model meta-llama/Llama-2-7b-chat-hf \
--enable-prefix-caching \
--num-documents 8 \
--document-length 3000 \
--repeat-count 3 \
--repeat-mode interleave
```

---
## Example - Prefix Caching Benchmark

Benchmark the efficiency of automatic prefix caching.

### Fixed Prompt with Prefix Caching

```bash
python3 benchmarks/benchmark_prefix_caching.py \
--model meta-llama/Llama-2-7b-chat-hf \
--enable-prefix-caching \
--num-prompts 1 \
--repeat-count 100 \
--input-length-range 128:256
```

### ShareGPT Dataset with Prefix Caching

```bash
# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

python3 benchmarks/benchmark_prefix_caching.py \
--model meta-llama/Llama-2-7b-chat-hf \
--dataset-path /path/ShareGPT_V3_unfiltered_cleaned_split.json \
--enable-prefix-caching \
--num-prompts 20 \
--repeat-count 5 \
--input-length-range 128:256
```

---
## Example - Request Prioritization Benchmark

Benchmark the performance of request prioritization in vLLM.

### Basic Prioritization Test

```bash
python3 benchmarks/benchmark_prioritization.py \
--model meta-llama/Llama-2-7b-chat-hf \
--input-len 128 \
--output-len 64 \
--num-prompts 100 \
--scheduling-policy priority
```

### Multiple Sequences per Prompt

```bash
python3 benchmarks/benchmark_prioritization.py \
--model meta-llama/Llama-2-7b-chat-hf \
--input-len 128 \
--output-len 64 \
--num-prompts 100 \
--scheduling-policy priority \
--n 2
```
42 changes: 37 additions & 5 deletions benchmarks/auto_tune.sh
@@ -10,6 +10,7 @@
# 3. Set variables (ALL REQUIRED)
# BASE: your directory for vllm repo
# MODEL: the model served by vllm
# SYSTEM: the hardware, either TPU or GPU; for other systems, "get best profile" might not be supported.
# TP: ways of tensor parallelism
# DOWNLOAD_DIR: directory to download and load model weights.
# INPUT_LEN: request input len
@@ -34,6 +35,7 @@
TAG=$(date +"%Y_%m_%d_%H_%M")
BASE=""
MODEL="meta-llama/Llama-3.1-8B-Instruct"
SYSTEM="TPU"
TP=1
DOWNLOAD_DIR=""
INPUT_LEN=4000
@@ -45,12 +47,15 @@ NUM_BATCHED_TOKENS_LIST="512 1024 2048 4096"

LOG_FOLDER="$BASE/auto-benchmark/$TAG"
RESULT="$LOG_FOLDER/result.txt"
PROFILE_PATH="$LOG_FOLDER/profile"

echo "result file: $RESULT"
echo "model: $MODEL"

rm -rf $LOG_FOLDER
rm -rf $PROFILE_PATH
mkdir -p $LOG_FOLDER
mkdir -p $PROFILE_PATH

cd "$BASE/vllm"

@@ -70,10 +75,11 @@ start_server() {
local max_num_seqs=$2
local max_num_batched_tokens=$3
local vllm_log=$4
local profile_dir=$5

pkill -f vllm

VLLM_USE_V1=1 VLLM_SERVER_DEV_MODE=1 vllm serve $MODEL \
VLLM_USE_V1=1 VLLM_SERVER_DEV_MODE=1 VLLM_TORCH_PROFILER_DIR=$profile_dir vllm serve $MODEL \
--disable-log-requests \
--port 8004 \
--gpu-memory-utilization $gpu_memory_utilization \
@@ -105,19 +111,37 @@ start_server() {
fi
}

# Copy the profiler output of the selected run (the $profile_index-th sorted
# entry under $profile_dir) into $PROFILE_PATH, replacing any previous best profile.
update_best_profile() {
local profile_dir=$1
local profile_index=$2
sorted_paths=($(find "$profile_dir" -maxdepth 1 -not -path "$profile_dir" | sort))
selected_profile_file=
if [[ "$SYSTEM" == "TPU" ]]; then
selected_profile_file="${sorted_paths[$profile_index]}/*.xplane.pb"
fi
if [[ "$SYSTEM" == "GPU" ]]; then
selected_profile_file="${sorted_paths[$profile_index]}"
fi
rm -f $PROFILE_PATH/*
cp $selected_profile_file $PROFILE_PATH
}

run_benchmark() {
local max_num_seqs=$1
local max_num_batched_tokens=$2
local gpu_memory_utilization=$3
echo "max_num_seq: $max_num_seqs, max_num_batched_tokens: $max_num_batched_tokens"
local vllm_log="$LOG_FOLDER/vllm_log_${max_num_seqs}_${max_num_batched_tokens}.txt"
local profile_dir="$LOG_FOLDER/profile_${max_num_seqs}_${max_num_batched_tokens}"
echo "vllm_log: $vllm_log"
echo
rm -f $vllm_log
mkdir -p $profile_dir
pkill -f vllm
local profile_index=0

echo "starting server..."
start_server $gpu_memory_utilization $max_num_seqs $max_num_batched_tokens $vllm_log
start_server $gpu_memory_utilization $max_num_seqs $max_num_batched_tokens $vllm_log $profile_dir
result=$?
if [[ "$result" -eq 1 ]]; then
echo "server failed to start. gpu_memory_utilization:$gpu_memory_utilization, max_num_seqs:$max_num_seqs, max_num_batched_tokens: $max_num_batched_tokens"
@@ -144,7 +168,8 @@ run_benchmark() {
--goodput e2el:$MAX_LATENCY_ALLOWED_MS \
--num-prompts 1000 \
--random-prefix-len $prefix_len \
--port 8004 &> "$bm_log"
--port 8004 \
--profile &> "$bm_log"
throughput=$(grep "Request throughput (req/s):" "$bm_log" | sed 's/[^0-9.]//g')
e2el=$(grep "P99 E2EL (ms):" "$bm_log" | awk '{print $NF}')
goodput=$(grep "Request goodput (req/s):" "$bm_log" | sed 's/[^0-9.]//g')
@@ -158,6 +183,7 @@ run_benchmark() {
# start from request-rate as int(throughput) + 1
request_rate=$((${throughput%.*} + 1))
while ((request_rate > 0)); do
profile_index=$((profile_index+1))
# clear prefix cache
curl -X POST http://0.0.0.0:8004/reset_prefix_cache
sleep 5
@@ -195,6 +221,12 @@ run_benchmark() {
best_max_num_seqs=$max_num_seqs
best_num_batched_tokens=$max_num_batched_tokens
best_goodput=$goodput
if [[ "$SYSTEM" == "TPU" ]]; then
update_best_profile "$profile_dir/plugins/profile" $profile_index
fi
if [[ "$SYSTEM" == "GPU" ]]; then
update_best_profile "$profile_dir" $profile_index
fi
fi
else
echo "max_num_seqs: $max_num_seqs, max_num_batched_tokens: $max_num_batched_tokens does not meet latency requirement ${MAX_LATENCY_ALLOWED_MS}"
@@ -239,6 +271,6 @@ for num_seqs in "${num_seqs_list[@]}"; do
done
done
echo "finish permutations"
echo "best_max_num_seqs: $best_max_num_seqs, best_num_batched_tokens: $best_num_batched_tokens, best_throughput: $best_throughput"
echo "best_max_num_seqs: $best_max_num_seqs, best_num_batched_tokens: $best_num_batched_tokens, best_throughput: $best_throughput" >> "$RESULT"
echo "best_max_num_seqs: $best_max_num_seqs, best_num_batched_tokens: $best_num_batched_tokens, best_throughput: $best_throughput, profile saved in: $PROFILE_PATH"
echo "best_max_num_seqs: $best_max_num_seqs, best_num_batched_tokens: $best_num_batched_tokens, best_throughput: $best_throughput, profile saved in: $PROFILE_PATH" >> "$RESULT"
