Commit 1761b6b

add README_release_test.md for perf test
Signed-off-by: ruodil <[email protected]>
1 parent 1f39a11 commit 1761b6b

3 files changed: +208 -23 lines
README_release_test.md
Lines changed: 186 additions & 0 deletions

# TensorRT-LLM Performance Test Flow (Default PyTorch Flow)

## Overview
This document describes the complete TensorRT-LLM performance testing workflow, with a focus on the default PyTorch backend.

## 1. Test Scripts

### Main Test Script
The main script for TensorRT-LLM performance testing is `test_perf.py`, which is responsible for executing all performance test cases.

### Performance Metrics
For trtllm-bench, the test extracts the following key performance metrics from the benchmark logs:

- **BUILD_TIME**: Model build time
- **INFERENCE_TIME**: Inference time
- **TOKEN_THROUGHPUT**: Token throughput (tokens per second)
- **SEQ_THROUGHPUT**: Sequence throughput (sequences per second)
- **FIRST_TOKEN_TIME**: Time to generate the first token
- **OUTPUT_TOKEN_TIME**: Time per output token

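For reference, the sketch below is purely illustrative: it groups the six metrics above into one record, as they might be collected before being written to the perf logs. The class and field names are placeholders, not the structures actually used by `test_perf.py`.

```python
from dataclasses import dataclass


@dataclass
class PerfResult:
    """Illustrative container for one test case's extracted metrics."""
    build_time: float         # BUILD_TIME
    inference_time: float     # INFERENCE_TIME
    token_throughput: float   # TOKEN_THROUGHPUT
    seq_throughput: float     # SEQ_THROUGHPUT
    first_token_time: float   # FIRST_TOKEN_TIME
    output_token_time: float  # OUTPUT_TOKEN_TIME
```
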
## 2. Detailed Test Flow

### 2.1 Dataset Preparation

#### Without LoRA
```python
prepare_data_script = os.path.join(self._llm_root, "benchmarks", "cpp", "prepare_dataset.py")
data_cmd += [
    "python3", prepare_data_script, "--stdout",
    f"--tokenizer={tokenizer_dir}", f"token-norm-dist",
    f"--num-requests={self._config.num_reqs}",
    f"--input-mean={input_len}", f"--output-mean={output_len}",
    f"--input-stdev={istdev}", f"--output-stdev={ostdev}",
    f" > {dataset_path}"
]
```

#### With LoRA
```python
data_cmd += [
    "python3", prepare_data_script, f"--stdout",
    f"--rand-task-id 0 {nloras-1}",
    f"--tokenizer={tokenizer_dir}", f"--lora-dir={lora_dir}",
    f"token-norm-dist",
    f"--num-requests={self._config.num_reqs}",
    f"--input-mean={input_len}", f"--output-mean={output_len}",
    f"--input-stdev={istdev}", f"--output-stdev={ostdev}",
    f" > {dataset_path}"
]
```

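Note that both command lists end with a shell redirection (`" > {dataset_path}"`), so the tokens are presumably joined into a single string and executed through a shell rather than passed directly as an argument vector. A minimal, self-contained sketch of that pattern (the paths and values below are hypothetical examples, not the ones used in CI):

```python
import subprocess

# Hypothetical values; in test_perf.py the list is built from the
# test configuration as shown above.
data_cmd = [
    "python3", "benchmarks/cpp/prepare_dataset.py", "--stdout",
    "--tokenizer=/models/llama-3.3-70b-instruct", "token-norm-dist",
    "--num-requests=512",
    "--input-mean=128", "--output-mean=128",
    "--input-stdev=0", "--output-stdev=0",
    " > /tmp/dataset.json",
]

# The trailing "> ..." redirection only takes effect when the command
# runs in a shell, hence the join + shell=True.
subprocess.run(" ".join(data_cmd), shell=True, check=True)
```
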
### 2.2 PyTorch Configuration Generation
In `pytorch_model_config.py`, we override the default PyTorch backend configuration for certain specific cases and write the result out as a YAML configuration file (`extra-llm-api-config.yml`), which is later passed to trtllm-bench (see the PyTorch Default Configuration step in section 2.3).

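As a rough illustration of that override mechanism, the sketch below merges per-test overrides, keyed by substrings of the full test name, into a base configuration and writes the result to `extra-llm-api-config.yml`. The helper name, option keys, and override table are placeholders and do not reflect the actual contents of `pytorch_model_config.py`.

```python
import yaml


def build_llm_api_config(test_name: str, base_config: dict,
                         overrides_by_pattern: dict) -> dict:
    """Sketch: start from the defaults and apply every override whose
    pattern appears in the test name."""
    config = dict(base_config)
    for pattern, override in overrides_by_pattern.items():
        if pattern in test_name:
            config.update(override)
    return config


config = build_llm_api_config(
    "llama_v3.3_70b_instruct_fp8-bench-pytorch-float8-input_output_len:128,128",
    base_config={"example_option": True},  # placeholder key, not a real option
    overrides_by_pattern={"llama_v3.3_70b": {"example_option": False}},
)
with open("extra-llm-api-config.yml", "w") as f:
    yaml.dump(config, f, default_flow_style=False)
```
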
### 2.3 Calling trtllm-bench for Throughput Testing

#### Basic Command
```python
benchmark_cmd = [
    self._benchmark_script,
    f"--model={model_name}",
    f"--model_path={model_dir}",
    "throughput",
    f"--dataset={dataset_path}",
    f"--max_batch_size={self._config.max_batch_size}",
    f"--max_num_tokens={self._config.max_num_tokens}",
    f"--report_json={report_path}",
]
```

#### Backend Selection
```python
if self._config.backend != "pytorch":
    benchmark_cmd += [
        f"--backend=tensorrt", f"--engine_dir={engine_dir}"
    ]
else:
    benchmark_cmd += ["--backend=pytorch"]
```

#### Optional Parameter Configuration
```python
if self._config.num_reqs > 0:
    benchmark_cmd += [f"--num_requests={self._config.num_reqs}"]
if self._config.concurrency != -1:
    benchmark_cmd += [f"--concurrency={self._config.concurrency}"]
if self._config.ep_size != None:
    benchmark_cmd += [f"--ep={self._config.ep_size}"]
if self._config.tp_size > 1:
    benchmark_cmd += [f"--tp={self._config.tp_size}"]
if self._config.pp_size > 1:
    benchmark_cmd += [f"--pp={self._config.pp_size}"]
if self._config.streaming == "streaming":
    benchmark_cmd += [f"--streaming"]
```

#### PyTorch Default Configuration
```python
# Use default YAML configuration
if self._config.backend == "pytorch":
    import yaml
    config = get_model_yaml_config(self._config.to_string(),
                                   lora_dirs=self.lora_dirs)
    print_info(f"pytorch model config: {config}")
    with open('extra-llm-api-config.yml', 'w') as f:
        yaml.dump(config, f, default_flow_style=False)
    benchmark_cmd += [
        f"--extra_llm_api_options=extra-llm-api-config.yml"
    ]
```

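Putting the pieces together, for a typical PyTorch-backend case the fully assembled command looks roughly like the following (the model, paths, and sizes are illustrative examples, not the values used in CI):

```python
# Roughly what benchmark_cmd contains after the steps above for a
# pytorch-backend case; every value here is an example.
benchmark_cmd = [
    "trtllm-bench",
    "--model=meta-llama/Llama-3.3-70B-Instruct",
    "--model_path=/models/llama-3.3-70b-instruct-fp8",
    "throughput",
    "--dataset=/tmp/dataset.json",
    "--max_batch_size=2048",
    "--max_num_tokens=8192",
    "--report_json=report.json",
    "--backend=pytorch",
    "--extra_llm_api_options=extra-llm-api-config.yml",
]
```
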
## 3. Test Scheduling

### 3.1 Full Test Cycles
We run two full test cycles:

1. **trt_llm_release_perf_test.yml** - Release performance test
2. **trt_llm_perf_cluster_test.yml** - Cluster performance test on B200/GB200

### 3.2 Sanity Test Cycles
We run one to two sanity test cycles:

- **trt_llm_release_perf_sanity.yml** - Release performance sanity test

## 4. Test Configuration Description

### 4.1 Test Case Configuration
- Test cases are defined in YAML test-list files (e.g. `trt_llm_release_perf_test.yml`); each entry is a `test_perf.py` parameterization like the one shown in section 6.3
- Supports different models, precisions, batch sizes, etc.
- Supports both LoRA and standard model testing

### 4.2 Performance Baseline
- Regressions are compared manually for each release on http://dlswqa-nas.nvidia.com:18688/trtperf

### 4.3 Result Analysis
- Generates detailed performance reports
- Supports performance trend analysis
- Performance data can be viewed and compared across runs on http://dlswqa-nas.nvidia.com:18688/trtperf

## 5. Runtime Environment Requirements

### 5.1 Dependency Installation
```bash
pip install -r ./TensorRT-LLM/requirements.txt
pip install -r ./TensorRT-LLM/requirements-dev.txt
```

### 5.2 Hardware Requirements
- CUDA-capable GPU
- Sufficient GPU memory for model loading
- B200/GB200 or comparable high-end GPUs are recommended for cluster testing

## 6. Reproduce Steps

To reproduce the performance tests locally, follow these steps:

### 6.1 Install Dependencies
```bash
pip install -r requirements-dev.txt
pip install -r requirements.txt
```

### 6.2 Navigate to Test Directory
```bash
cd tests/integration/defs
```

### 6.3 Add Test Case to Test List
```bash
echo "perf/test_perf.py::test_perf[llama_v3.3_70b_instruct_fp8-bench-pytorch-float8-input_output_len:128,128]" >> perf_test.txt
```

### 6.4 Run Performance Test
```bash
pytest -v -s --test-prefix=H100_80GB_HBM3 --test-list=perf_test.txt -R=llama_v3.3_70b_instruct_fp8-bench-pytorch-float8-input_output_len:128,128 --output-dir=./output --perf --perf-log-formats=csv -o junit_logging=out-err
```

### 6.5 Command Parameters Explanation
- `--test-prefix=H100_80GB_HBM3`: Specifies the test environment prefix (here an H100 80GB HBM3 machine)
- `--test-list`: Points to the test list file containing the test cases to run
- `-R`: Filters for tests matching a specific pattern
- `--output-dir=./output`: Specifies the output directory for test results
- `--perf`: Enables performance testing mode
- `--perf-log-formats=csv`: Outputs performance logs in CSV format
- `-o junit_logging=out-err`: Includes captured stdout/stderr in the JUnit report

## 7. Related Documentation
- [Sanity Perf Check Introduction](README.md)

tests/integration/defs/perf/test_perf.py

Lines changed: 11 additions & 12 deletions

@@ -370,12 +370,12 @@ def __init__(
         num_reqs: int = 512,
         concurrency: int = -1,
         quantization: str = "",
+        kv_cache_free_gpu_mem_fraction: float = 0.9,
         kv_cache_dtype: str = "auto",
         ep_size: int = None,
         tp_size: int = 1,
         pp_size: int = 1,
         num_gpus: int = 1,
-        kv_cache_free_gpu_mem_fraction: float = 0.9,
     ):
         # The model name.
         self.model_name = model_name

@@ -415,6 +415,8 @@ def __init__(
         self.concurrency = concurrency
         # Quantization type.
         self.quantization = quantization
+        # KV cache free gpu mem fraction
+        self.kv_cache_free_gpu_mem_fraction = kv_cache_free_gpu_mem_fraction
         # KV Cache dtype
         self.kv_cache_dtype = kv_cache_dtype
         # Multiple Profiles

@@ -429,8 +431,6 @@ def __init__(
         self.num_gpus = num_gpus
         # Just build engines
         self.build_only = False
-        # kv cache free gpu mem fraction
-        self.kv_cache_free_gpu_mem_fraction = kv_cache_free_gpu_mem_fraction

     def to_string(self,
                   custom_bs: int = None,

@@ -474,6 +474,10 @@ def to_string(self,
         # Add Max number of tokens.
         entries.append(f"maxnt:{self.max_num_tokens}")

+        # Add kv cache free gpu mem fraction.
+        if self.kv_cache_free_gpu_mem_fraction != 0.9:
+            entries.append(f"kv_frac:{self.kv_cache_free_gpu_mem_fraction}")
+
         if self.build_only:
             entries.append(f"build_only")

@@ -544,10 +548,6 @@ def to_string(self,
         if self.num_gpus > 1:
             entries.append(f"gpus:{self.num_gpus}")

-        # Add kv cache free gpu mem fraction.
-        if self.kv_cache_free_gpu_mem_fraction != 0.9:
-            entries.append(f"kv_frac:{self.kv_cache_free_gpu_mem_fraction}")
-
         # Concatenate labels with "-".
         return "-".join(entries)

@@ -587,6 +587,10 @@ def load_from_str(self, test_param_labels) -> None:
         if labels[0].startswith("maxnt"):
             self.max_num_tokens = int(labels.pop(0).replace("maxnt:", ""))

+        if labels[0].startswith("kv_frac"):
+            self.kv_cache_free_gpu_mem_fraction = float(
+                labels.pop(0).replace("kv_frac:", ""))
+
         if labels[0] == "build_only":
             self.build_only = True
             labels.pop(0)

@@ -655,11 +659,6 @@ def load_from_str(self, test_param_labels) -> None:
         self.num_gpus = 1 if not labels[0].startswith("gpus:") else int(
             labels.pop(0).replace("gpus:", ""))

-        if len(labels) > 0:
-            self.kv_cache_free_gpu_mem_fraction = 0.9 if not labels[
-                0].startswith("kv_frac:") else float(
-                    labels.pop(0).replace("kv_frac:", ""))
-
         assert len(
             labels
         ) == 0, f"Invalid test name! Some labels cannot be parsed: {labels}"

tests/integration/test_lists/qa/trt_llm_release_perf_test.yml

Lines changed: 11 additions & 11 deletions

@@ -473,21 +473,21 @@ trt_llm_release_perf_test:

 #llama_v4_maverick_17b_128e_instruct_fp8
 #pytorch backend
-  - perf/test_perf.py::test_perf[llama_v4_maverick_17b_128e_instruct_fp8-bench-pytorch-float8-maxbs:1024-maxnt:4096-input_output_len:2000,500-reqs:3000-ep:8-tp:8-gpus:8-kv_frac:0.6]
-  - perf/test_perf.py::test_perf[llama_v4_maverick_17b_128e_instruct_fp8-bench-pytorch-float8-maxbs:1024-maxnt:4096-input_output_len:500,2000-reqs:3000-ep:8-tp:8-gpus:8-kv_frac:0.6]
-  - perf/test_perf.py::test_perf[llama_v4_maverick_17b_128e_instruct_fp8-bench-pytorch-float8-maxbs:1024-maxnt:4096-input_output_len:1000,1000-reqs:3000-ep:8-tp:8-gpus:8-kv_frac:0.6]
-  - perf/test_perf.py::test_perf[llama_v4_maverick_17b_128e_instruct_fp8-bench-pytorch-float8-input_output_len:128,128-ep:8-tp:8-gpus:8-kv_frac:0.6]
-  - perf/test_perf.py::test_perf[llama_v4_maverick_17b_128e_instruct_fp8-bench-pytorch-float8-input_output_len:512,32-ep:8-tp:8-gpus:8-kv_frac:0.6]
+  - perf/test_perf.py::test_perf[llama_v4_maverick_17b_128e_instruct_fp8-bench-pytorch-float8-maxbs:1024-maxnt:4096-kv_frac:0.6-input_output_len:2000,500-reqs:3000-ep:8-tp:8-gpus:8]
+  - perf/test_perf.py::test_perf[llama_v4_maverick_17b_128e_instruct_fp8-bench-pytorch-float8-maxbs:1024-maxnt:4096-kv_frac:0.6-input_output_len:500,2000-reqs:3000-ep:8-tp:8-gpus:8]
+  - perf/test_perf.py::test_perf[llama_v4_maverick_17b_128e_instruct_fp8-bench-pytorch-float8-maxbs:1024-maxnt:4096-kv_frac:0.6-input_output_len:1000,1000-reqs:3000-ep:8-tp:8-gpus:8]
+  - perf/test_perf.py::test_perf[llama_v4_maverick_17b_128e_instruct_fp8-bench-pytorch-float8-kv_frac:0.6-input_output_len:128,128-ep:8-tp:8-gpus:8]
+  - perf/test_perf.py::test_perf[llama_v4_maverick_17b_128e_instruct_fp8-bench-pytorch-float8-kv_frac:0.6-input_output_len:512,32-ep:8-tp:8-gpus:8]
 #rcca case
-  - perf/test_perf.py::test_perf[llama_v4_maverick_17b_128e_instruct_fp8-bench-pytorch-float8-input_output_len:20000,2000-reqs:1000-ep:8-tp:8-gpus:8-kv_frac:0.6]
+  - perf/test_perf.py::test_perf[llama_v4_maverick_17b_128e_instruct_fp8-bench-pytorch-float8-kv_frac:0.6-input_output_len:20000,2000-reqs:1000-ep:8-tp:8-gpus:8]

 #llama_v4_scout_17b_16e_instruct_fp8
 #pytorch backend
-  - perf/test_perf.py::test_perf[llama_v4_scout_17b_16e_instruct_fp8-bench-pytorch-float8-maxbs:1024-maxnt:4096-input_output_len:2000,500-reqs:3000-ep:8-tp:8-gpus:8-kv_frac:0.6]
-  - perf/test_perf.py::test_perf[llama_v4_scout_17b_16e_instruct_fp8-bench-pytorch-float8-maxbs:1024-maxnt:4096-input_output_len:500,2000-reqs:3000-ep:8-tp:8-gpus:8-kv_frac:0.6]
-  - perf/test_perf.py::test_perf[llama_v4_scout_17b_16e_instruct_fp8-bench-pytorch-float8-maxbs:1024-maxnt:4096-input_output_len:1000,1000-reqs:3000-ep:8-tp:8-gpus:8-kv_frac:0.6]
-  - perf/test_perf.py::test_perf[llama_v4_scout_17b_16e_instruct_fp8-bench-pytorch-float8-input_output_len:128,128-ep:8-tp:8-gpus:8-kv_frac:0.6]
-  - perf/test_perf.py::test_perf[llama_v4_scout_17b_16e_instruct_fp8-bench-pytorch-float8-input_output_len:512,32-ep:8-tp:8-gpus:8-kv_frac:0.6]
+  - perf/test_perf.py::test_perf[llama_v4_scout_17b_16e_instruct_fp8-bench-pytorch-float8-maxbs:1024-maxnt:4096-kv_frac:0.6-input_output_len:2000,500-reqs:3000-ep:8-tp:8-gpus:8]
+  - perf/test_perf.py::test_perf[llama_v4_scout_17b_16e_instruct_fp8-bench-pytorch-float8-maxbs:1024-maxnt:4096-kv_frac:0.6-input_output_len:500,2000-reqs:3000-ep:8-tp:8-gpus:8]
+  - perf/test_perf.py::test_perf[llama_v4_scout_17b_16e_instruct_fp8-bench-pytorch-float8-maxbs:1024-maxnt:4096-kv_frac:0.6-input_output_len:1000,1000-reqs:3000-ep:8-tp:8-gpus:8]
+  - perf/test_perf.py::test_perf[llama_v4_scout_17b_16e_instruct_fp8-bench-pytorch-float8-kv_frac:0.6-input_output_len:128,128-ep:8-tp:8-gpus:8]
+  - perf/test_perf.py::test_perf[llama_v4_scout_17b_16e_instruct_fp8-bench-pytorch-float8-kv_frac:0.6-input_output_len:512,32-ep:8-tp:8-gpus:8]

 #deepseek_r1_fp8
 #pytorch backend