NVIDIA · ruodil · Aug 5, 2025 · Jul 29, 2025 · Aug 5, 2025
diff --git a/tests/integration/defs/perf/README_release_test.md b/tests/integration/defs/perf/README_release_test.md
@@ -0,0 +1,184 @@
+# TensorRT-LLM Performance Test Flow (Default PyTorch Flow)
+
+## Overview
+This document describes the complete TensorRT-LLM performance testing workflow, particularly for the default PyTorch backend testing process for release testing.
+
+## 1. Test Scripts
+
+### Main Test Script
+The main script for TensorRT-LLM performance testing is `test_perf.py`, which is responsible for executing all performance test cases.
+
+### Performance Metrics
+For trtllm-bench, the test extracts the following key performance metrics from logs:
+
+- **BUILD_TIME**: Model build time
+- **INFERENCE_TIME**: Inference time
+- **TOKEN_THROUGHPUT**: Token throughput
+- **SEQ_THROUGHPUT**: Sequence throughput
+- **FIRST_TOKEN_TIME**: First token generation time
+- **OUTPUT_TOKEN_TIME**: Output token time
+
+## 2. Detailed Test Flow
+
+### 2.1 Dataset Preparation
+
+#### Without LoRA
+```python
+prepare_data_script = os.path.join(self._llm_root, "benchmarks", "cpp", "prepare_dataset.py")
+data_cmd += [
+    "python3", prepare_data_script, "--stdout",
+    f"--tokenizer={tokenizer_dir}", f"token-norm-dist",
+    f"--num-requests={self._config.num_reqs}",
+    f"--input-mean={input_len}", f"--output-mean={output_len}",
+    f"--input-stdev={istdev}", f"--output-stdev={ostdev}",
+    f" > {dataset_path}"
+]
+```
+
+#### With LoRA
+```python
+"python3", prepare_data_script, f"--stdout",
+    f"--rand-task-id 0 {nloras-1}",
+    f"--tokenizer={tokenizer_dir}", f"--lora-dir={lora_dir}",
+    f"token-norm-dist",
+    f"--num-requests={self._config.num_reqs}",
+    f"--input-mean={input_len}", f"--output-mean={output_len}",
+    f"--input-stdev={istdev}", f"--output-stdev={ostdev}",
+    f" > {dataset_path}"
+```
+
+### 2.2 PyTorch Configuration Generation
+In `pytorch_model_config.py`, we override PyTorch configurations for certain specific cases and generate YAML configuration files.
+
+### 2.3 Calling trtllm-bench for Throughput Testing
+
+#### Basic Command
+```python
+benchmark_cmd = [
+    self._benchmark_script,
+    f"--model={model_name}",
+    f"--model_path={model_dir}",
+    "throughput",
+    f"--dataset={dataset_path}",
+    f"--max_batch_size={self._config.max_batch_size}",
+    f"--max_num_tokens={self._config.max_num_tokens}",
+    f"--report_json={report_path}",
+]
+```
+
+#### Backend Selection
+```python
+if self._config.backend != "pytorch":
+    benchmark_cmd += [
+        f"--backend=tensorrt", f"--engine_dir={engine_dir}"
+    ]
+else:
+    benchmark_cmd += ["--backend=pytorch"]
+```
+
+#### Optional Parameter Configuration
+```python
+if self._config.num_reqs > 0:
+    benchmark_cmd += [f"--num_requests={self._config.num_reqs}"]
+if self._config.concurrency != -1:
+    benchmark_cmd += [f"--concurrency={self._config.concurrency}"]
+if self._config.ep_size != None:
+    benchmark_cmd += [f"--ep={self._config.ep_size}"]
+if self._config.tp_size > 1:
+    benchmark_cmd += [f"--tp={self._config.tp_size}"]
+if self._config.pp_size > 1:
+    benchmark_cmd += [f"--pp={self._config.pp_size}"]
+if self._config.streaming == "streaming":
+    benchmark_cmd += [f"--streaming"]
+```
+
+#### PyTorch Default Configuration
+```python
+# Use default YAML configuration
+if self._config.backend == "pytorch":
+    import yaml
+    config = get_model_yaml_config(self._config.to_string(),
+                                   lora_dirs=self.lora_dirs)
+    print_info(f"pytorch model config: {config}")
+    with open('extra-llm-api-config.yml', 'w') as f:
+        yaml.dump(config, f, default_flow_style=False)
+    benchmark_cmd += [
+        f"--extra_llm_api_options=extra-llm-api-config.yml"
+    ]
+```
+
+## 3. Test Scheduling
+
+### 3.1 Full Test Cycles
+
+1. **trt_llm_release_perf_test.yml** - Release performance test
+2. **trt_llm_perf_cluster_test.yml** - Cluster performance test
+
+### 3.2 Sanity Test Cycles
+
+- **trt_llm_release_perf_sanity.yml** - Release performance sanity test
+
+## 4. Test Configuration Description
+
+### 4.1 Test Case Configuration
+- Test cases are defined in YAML configuration files
+- Support for different models, precisions, batch sizes, etc.
+- Support for LoRA and standard model testing
+
+### 4.2 Performance Baseline
+- Compare regression of each release on internal TRT-Perf dashboard
+
+### 4.3 Result Analysis
+- Generates detailed performance reports
+- Supports performance trend analysis
+- View performance data and compare between different runs on internal TRT-Perf dashboard
+
+## 5. Runtime Environment Requirements
+
+### 5.1 Dependency Installation
+```bash
+pip install -r ./TensorRT-LLM/requirements.txt
+pip install -r ./TensorRT-LLM/requirements-dev.txt
+```
+
+### 5.2 Hardware Requirements
+- CUDA-capable GPU
+- Sufficient GPU memory for model loading
+- Recommended to use B200/GB200 or higher performance GPU for cluster testing
+
+## 6. Reproduce Steps
+
+To reproduce the performance tests locally, follow these steps:
+
+### 6.1 Install Dependencies
+```bash
+pip install -r requirements-dev.txt
+pip install -r requirements.txt
+```
+
+### 6.2 Navigate to Test Directory
+```bash
+cd tests/integration/defs
+```
+
+### 6.3 Add Test Case to Test List
+```bash
+echo "perf/test_perf.py::test_perf[llama_v3.3_70b_instruct_fp8-bench-pytorch-float8-input_output_len:128,128]" >> perf_test.txt
+```
+
+### 6.4 Run Performance Test
+```bash
+pytest -v -s --test-prefix=H100_80GB_HBM3 --test-list=perf_test.txt -R=llama_v3.3_70b_instruct_fp8-bench-pytorch-float8-input_output_len:128,128 --output-dir=./output --perf --perf-log-formats=csv -o junit_logging=out-err
+```
+
+### 6.5 Command Parameters Explanation
+- `--test-prefix=H100_80GB_HBM3`: Specifies the test environment prefix
+- `--test-list`: Points to the test list file containing test cases
+- `-R`: Filter for specific test patterns
+- `--output-dir=./output`: Specifies the output directory for test results
+- `--perf`: Enables performance testing mode
+- `--perf-log-formats=csv`: Outputs performance logs in CSV format
+- `-o junit_logging=out-err`: Configures JUnit logging output
+
+## 7. Related Documentation
+- [Sanity Perf Check Introduction](README.md)
diff --git a/tests/integration/defs/perf/test_perf.py b/tests/integration/defs/perf/test_perf.py
@@ -374,12 +374,12 @@ def __init__(
         num_reqs: int = 512,
         concurrency: int = -1,
         quantization: str = "",
+        kv_cache_free_gpu_mem_fraction: float = 0.9,
         kv_cache_dtype: str = "auto",
         ep_size: int = None,
         tp_size: int = 1,
         pp_size: int = 1,
         num_gpus: int = 1,
-        kv_cache_free_gpu_mem_fraction: float = 0.9,
     ):
         # The model name.
         self.model_name = model_name
@@ -419,6 +419,8 @@ def __init__(
         self.concurrency = concurrency
         # Quantization type.
         self.quantization = quantization
+        # KV cache free gpu mem fraction
+        self.kv_cache_free_gpu_mem_fraction = kv_cache_free_gpu_mem_fraction
         # KV Cache dtype
         self.kv_cache_dtype = kv_cache_dtype
         # Multiple Profiles
@@ -433,8 +435,6 @@ def __init__(
         self.num_gpus = num_gpus
         # Just build engines
         self.build_only = False
-        # kv cache free gpu mem fraction
-        self.kv_cache_free_gpu_mem_fraction = kv_cache_free_gpu_mem_fraction
 
     def to_string(self,
                   custom_bs: int = None,
@@ -478,6 +478,10 @@ def to_string(self,
         # Add Max number of tokens.
         entries.append(f"maxnt:{self.max_num_tokens}")
 
+        # Add kv cache free gpu mem fraction.
+        if self.kv_cache_free_gpu_mem_fraction != 0.9:
+            entries.append(f"kv_frac:{self.kv_cache_free_gpu_mem_fraction}")
+
         if self.build_only:
             entries.append(f"build_only")
 
@@ -548,10 +552,6 @@ def to_string(self,
         if self.num_gpus > 1:
             entries.append(f"gpus:{self.num_gpus}")
 
-        # Add kv cache free gpu mem fraction.
-        if self.kv_cache_free_gpu_mem_fraction != 0.9:
-            entries.append(f"kv_frac:{self.kv_cache_free_gpu_mem_fraction}")
-
         # Concatenate labels with "-".
         return "-".join(entries)
 
@@ -591,6 +591,10 @@ def load_from_str(self, test_param_labels) -> None:
         if labels[0].startswith("maxnt"):
             self.max_num_tokens = int(labels.pop(0).replace("maxnt:", ""))
 
+        if labels[0].startswith("kv_frac"):
+            self.kv_cache_free_gpu_mem_fraction = float(
+                labels.pop(0).replace("kv_frac:", ""))
+
         if labels[0] == "build_only":
             self.build_only = True
             labels.pop(0)
@@ -659,11 +663,6 @@ def load_from_str(self, test_param_labels) -> None:
             self.num_gpus = 1 if not labels[0].startswith("gpus:") else int(
                 labels.pop(0).replace("gpus:", ""))
 
-        if len(labels) > 0:
-            self.kv_cache_free_gpu_mem_fraction = 0.9 if not labels[
-                0].startswith("kv_frac:") else float(
-                    labels.pop(0).replace("kv_frac:", ""))
-
         assert len(
             labels
         ) == 0, f"Invalid test name! Some labels cannot be parsed: {labels}"

diff --git a/tests/integration/test_lists/qa/trt_llm_release_perf_test.yml b/tests/integration/test_lists/qa/trt_llm_release_perf_test.yml
@@ -473,21 +473,21 @@ trt_llm_release_perf_test:
 
   #llama_v4_maverick_17b_128e_instruct_fp8
   #pytorch backend
-  - perf/test_perf.py::test_perf[llama_v4_maverick_17b_128e_instruct_fp8-bench-pytorch-float8-maxbs:1024-maxnt:4096-input_output_len:2000,500-reqs:3000-ep:8-tp:8-gpus:8-kv_frac:0.6]
-  - perf/test_perf.py::test_perf[llama_v4_maverick_17b_128e_instruct_fp8-bench-pytorch-float8-maxbs:1024-maxnt:4096-input_output_len:500,2000-reqs:3000-ep:8-tp:8-gpus:8-kv_frac:0.6]
-  - perf/test_perf.py::test_perf[llama_v4_maverick_17b_128e_instruct_fp8-bench-pytorch-float8-maxbs:1024-maxnt:4096-input_output_len:1000,1000-reqs:3000-ep:8-tp:8-gpus:8-kv_frac:0.6]
-  - perf/test_perf.py::test_perf[llama_v4_maverick_17b_128e_instruct_fp8-bench-pytorch-float8-input_output_len:128,128-ep:8-tp:8-gpus:8-kv_frac:0.6]
-  - perf/test_perf.py::test_perf[llama_v4_maverick_17b_128e_instruct_fp8-bench-pytorch-float8-input_output_len:512,32-ep:8-tp:8-gpus:8-kv_frac:0.6]
+  - perf/test_perf.py::test_perf[llama_v4_maverick_17b_128e_instruct_fp8-bench-pytorch-float8-maxbs:1024-maxnt:4096-kv_frac:0.6-input_output_len:2000,500-reqs:3000-ep:8-tp:8-gpus:8]
+  - perf/test_perf.py::test_perf[llama_v4_maverick_17b_128e_instruct_fp8-bench-pytorch-float8-maxbs:1024-maxnt:4096-kv_frac:0.6-input_output_len:500,2000-reqs:3000-ep:8-tp:8-gpus:8]
+  - perf/test_perf.py::test_perf[llama_v4_maverick_17b_128e_instruct_fp8-bench-pytorch-float8-maxbs:1024-maxnt:4096-kv_frac:0.6-input_output_len:1000,1000-reqs:3000-ep:8-tp:8-gpus:8]
+  - perf/test_perf.py::test_perf[llama_v4_maverick_17b_128e_instruct_fp8-bench-pytorch-float8-kv_frac:0.6-input_output_len:128,128-ep:8-tp:8-gpus:8]
+  - perf/test_perf.py::test_perf[llama_v4_maverick_17b_128e_instruct_fp8-bench-pytorch-float8-kv_frac:0.6-input_output_len:512,32-ep:8-tp:8-gpus:8]
   #rcca case
-  - perf/test_perf.py::test_perf[llama_v4_maverick_17b_128e_instruct_fp8-bench-pytorch-float8-input_output_len:20000,2000-reqs:1000-ep:8-tp:8-gpus:8-kv_frac:0.6]
+  - perf/test_perf.py::test_perf[llama_v4_maverick_17b_128e_instruct_fp8-bench-pytorch-float8-kv_frac:0.6-input_output_len:20000,2000-reqs:1000-ep:8-tp:8-gpus:8]
 
   #llama_v4_scout_17b_16e_instruct_fp8
   #pytorch backend
-  - perf/test_perf.py::test_perf[llama_v4_scout_17b_16e_instruct_fp8-bench-pytorch-float8-maxbs:1024-maxnt:4096-input_output_len:2000,500-reqs:3000-ep:8-tp:8-gpus:8-kv_frac:0.6]
-  - perf/test_perf.py::test_perf[llama_v4_scout_17b_16e_instruct_fp8-bench-pytorch-float8-maxbs:1024-maxnt:4096-input_output_len:500,2000-reqs:3000-ep:8-tp:8-gpus:8-kv_frac:0.6]
-  - perf/test_perf.py::test_perf[llama_v4_scout_17b_16e_instruct_fp8-bench-pytorch-float8-maxbs:1024-maxnt:4096-input_output_len:1000,1000-reqs:3000-ep:8-tp:8-gpus:8-kv_frac:0.6]
-  - perf/test_perf.py::test_perf[llama_v4_scout_17b_16e_instruct_fp8-bench-pytorch-float8-input_output_len:128,128-ep:8-tp:8-gpus:8-kv_frac:0.6]
-  - perf/test_perf.py::test_perf[llama_v4_scout_17b_16e_instruct_fp8-bench-pytorch-float8-input_output_len:512,32-ep:8-tp:8-gpus:8-kv_frac:0.6]
+  - perf/test_perf.py::test_perf[llama_v4_scout_17b_16e_instruct_fp8-bench-pytorch-float8-maxbs:1024-maxnt:4096-kv_frac:0.6-input_output_len:2000,500-reqs:3000-ep:8-tp:8-gpus:8]
+  - perf/test_perf.py::test_perf[llama_v4_scout_17b_16e_instruct_fp8-bench-pytorch-float8-maxbs:1024-maxnt:4096-kv_frac:0.6-input_output_len:500,2000-reqs:3000-ep:8-tp:8-gpus:8]
+  - perf/test_perf.py::test_perf[llama_v4_scout_17b_16e_instruct_fp8-bench-pytorch-float8-maxbs:1024-maxnt:4096-kv_frac:0.6-input_output_len:1000,1000-reqs:3000-ep:8-tp:8-gpus:8]
+  - perf/test_perf.py::test_perf[llama_v4_scout_17b_16e_instruct_fp8-bench-pytorch-float8-kv_frac:0.6-input_output_len:128,128-ep:8-tp:8-gpus:8]
+  - perf/test_perf.py::test_perf[llama_v4_scout_17b_16e_instruct_fp8-bench-pytorch-float8-kv_frac:0.6-input_output_len:512,32-ep:8-tp:8-gpus:8]
 
   #deepseek_r1_fp8
   #pytorch backend