diff --git a/tests/integration/defs/perf/README_release_test.md b/tests/integration/defs/perf/README_release_test.md new file mode 100644 index 00000000000..7bff0ed37d5 --- /dev/null +++ b/tests/integration/defs/perf/README_release_test.md @@ -0,0 +1,184 @@ +# TensorRT-LLM Performance Test Flow (Default PyTorch Flow) + +## Overview +This document describes the complete TensorRT-LLM performance testing workflow, with a focus on the default PyTorch backend flow used for release testing. + +## 1. Test Scripts + +### Main Test Script +The main script for TensorRT-LLM performance testing is `test_perf.py`, which is responsible for executing all performance test cases. + +### Performance Metrics +For trtllm-bench, the test extracts the following key performance metrics from the logs: + +- **BUILD_TIME**: Model build time +- **INFERENCE_TIME**: Total inference time +- **TOKEN_THROUGHPUT**: Token throughput (tokens per second) +- **SEQ_THROUGHPUT**: Sequence throughput (sequences per second) +- **FIRST_TOKEN_TIME**: Time to generate the first token +- **OUTPUT_TOKEN_TIME**: Per output token latency + +## 2. Detailed Test Flow + +### 2.1 Dataset Preparation + +#### Without LoRA +```python +prepare_data_script = os.path.join(self._llm_root, "benchmarks", "cpp", "prepare_dataset.py") +data_cmd += [ + "python3", prepare_data_script, "--stdout", + f"--tokenizer={tokenizer_dir}", f"token-norm-dist", + f"--num-requests={self._config.num_reqs}", + f"--input-mean={input_len}", f"--output-mean={output_len}", + f"--input-stdev={istdev}", f"--output-stdev={ostdev}", + f" > {dataset_path}" +] +``` + +#### With LoRA +```python +"python3", prepare_data_script, f"--stdout", + f"--rand-task-id 0 {nloras-1}", + f"--tokenizer={tokenizer_dir}", f"--lora-dir={lora_dir}", + f"token-norm-dist", + f"--num-requests={self._config.num_reqs}", + f"--input-mean={input_len}", f"--output-mean={output_len}", + f"--input-stdev={istdev}", f"--output-stdev={ostdev}", + f" > {dataset_path}" +``` + +### 2.2 PyTorch Configuration Generation +In `pytorch_model_config.py`, the base PyTorch backend configuration is overridden for specific models and test cases, and the resulting settings are written to a YAML file (`extra-llm-api-config.yml`) that is passed to trtllm-bench via `--extra_llm_api_options` (see Section 2.3).
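+For illustration only, a generated config might look like the sketch below. The exact keys come from `get_model_yaml_config` and vary by model and test label; the option names shown here are assumptions chosen as representative LLM API settings, not the literal output of any specific test case:
+```yaml
+# Hypothetical extra-llm-api-config.yml contents (illustrative only).
+# Real keys/values are produced by get_model_yaml_config in
+# pytorch_model_config.py and depend on the model and test case.
+enable_attention_dp: true
+kv_cache_config:
+  enable_block_reuse: false
+```
+The file is then consumed by trtllm-bench through the `--extra_llm_api_options` flag, as shown in the "PyTorch Default Configuration" snippet in Section 2.3.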
+ +### 2.3 Calling trtllm-bench for Throughput Testing + +#### Basic Command +```python +benchmark_cmd = [ + self._benchmark_script, + f"--model={model_name}", + f"--model_path={model_dir}", + "throughput", + f"--dataset={dataset_path}", + f"--max_batch_size={self._config.max_batch_size}", + f"--max_num_tokens={self._config.max_num_tokens}", + f"--report_json={report_path}", +] +``` + +#### Backend Selection +```python +if self._config.backend != "pytorch": + benchmark_cmd += [ + f"--backend=tensorrt", f"--engine_dir={engine_dir}" + ] +else: + benchmark_cmd += ["--backend=pytorch"] +``` + +#### Optional Parameter Configuration +```python +if self._config.num_reqs > 0: + benchmark_cmd += [f"--num_requests={self._config.num_reqs}"] +if self._config.concurrency != -1: + benchmark_cmd += [f"--concurrency={self._config.concurrency}"] +if self._config.ep_size != None: + benchmark_cmd += [f"--ep={self._config.ep_size}"] +if self._config.tp_size > 1: + benchmark_cmd += [f"--tp={self._config.tp_size}"] +if self._config.pp_size > 1: + benchmark_cmd += [f"--pp={self._config.pp_size}"] +if self._config.streaming == "streaming": + benchmark_cmd += [f"--streaming"] +``` + +#### PyTorch Default Configuration +```python +# Use default YAML configuration +if self._config.backend == "pytorch": + import yaml + config = get_model_yaml_config(self._config.to_string(), + lora_dirs=self.lora_dirs) + print_info(f"pytorch model config: {config}") + with open('extra-llm-api-config.yml', 'w') as f: + yaml.dump(config, f, default_flow_style=False) + benchmark_cmd += [ + f"--extra_llm_api_options=extra-llm-api-config.yml" + ] +``` + +## 3. Test Scheduling + +### 3.1 Full Test Cycles + +1. **trt_llm_release_perf_test.yml** - Release performance test +2. **trt_llm_perf_cluster_test.yml** - Cluster performance test + +### 3.2 Sanity Test Cycles + +- **trt_llm_release_perf_sanity.yml** - Release performance sanity test + +## 4. Test Configuration Description + +### 4.1 Test Case Configuration +- Test cases are defined in YAML configuration files +- Supports different models, precisions, batch sizes, etc. +- Supports both LoRA and standard (non-LoRA) model testing + +### 4.2 Performance Baseline +- Performance regressions for each release are tracked and compared on the internal TRT-Perf dashboard + +### 4.3 Result Analysis +- Generates detailed performance reports +- Supports performance trend analysis +- Performance data can be viewed and compared across runs on the internal TRT-Perf dashboard + +## 5. Runtime Environment Requirements + +### 5.1 Dependency Installation +```bash +pip install -r ./TensorRT-LLM/requirements.txt +pip install -r ./TensorRT-LLM/requirements-dev.txt +``` + +### 5.2 Hardware Requirements +- CUDA-capable GPU +- Sufficient GPU memory for model loading +- B200/GB200 or higher-performance GPUs are recommended for cluster testing + +## 6. 
Reproduce Steps + +To reproduce the performance tests locally, follow these steps: + +### 6.1 Install Dependencies +```bash +pip install -r requirements-dev.txt +pip install -r requirements.txt +``` + +### 6.2 Navigate to Test Directory +```bash +cd tests/integration/defs +``` + +### 6.3 Add Test Case to Test List +```bash +echo "perf/test_perf.py::test_perf[llama_v3.3_70b_instruct_fp8-bench-pytorch-float8-input_output_len:128,128]" >> perf_test.txt +``` + +### 6.4 Run Performance Test +```bash +pytest -v -s --test-prefix=H100_80GB_HBM3 --test-list=perf_test.txt -R=llama_v3.3_70b_instruct_fp8-bench-pytorch-float8-input_output_len:128,128 --output-dir=./output --perf --perf-log-formats=csv -o junit_logging=out-err +``` + +### 6.5 Command Parameters Explanation +- `--test-prefix=H100_80GB_HBM3`: Specifies the test environment prefix +- `--test-list`: Points to the test list file containing test cases +- `-R`: Filter for specific test patterns +- `--output-dir=./output`: Specifies the output directory for test results +- `--perf`: Enables performance testing mode +- `--perf-log-formats=csv`: Outputs performance logs in CSV format +- `-o junit_logging=out-err`: Configures JUnit logging output + +## 7. Related Documentation +- [Sanity Perf Check Introduction](README.md) diff --git a/tests/integration/defs/perf/test_perf.py b/tests/integration/defs/perf/test_perf.py index 1657fa2ce80..566cbbef28f 100644 --- a/tests/integration/defs/perf/test_perf.py +++ b/tests/integration/defs/perf/test_perf.py @@ -374,12 +374,12 @@ def __init__( num_reqs: int = 512, concurrency: int = -1, quantization: str = "", + kv_cache_free_gpu_mem_fraction: float = 0.9, kv_cache_dtype: str = "auto", ep_size: int = None, tp_size: int = 1, pp_size: int = 1, num_gpus: int = 1, - kv_cache_free_gpu_mem_fraction: float = 0.9, ): # The model name. self.model_name = model_name @@ -419,6 +419,8 @@ def __init__( self.concurrency = concurrency # Quantization type. self.quantization = quantization + # KV cache free gpu mem fraction + self.kv_cache_free_gpu_mem_fraction = kv_cache_free_gpu_mem_fraction # KV Cache dtype self.kv_cache_dtype = kv_cache_dtype # Multiple Profiles @@ -433,8 +435,6 @@ def __init__( self.num_gpus = num_gpus # Just build engines self.build_only = False - # kv cache free gpu mem fraction - self.kv_cache_free_gpu_mem_fraction = kv_cache_free_gpu_mem_fraction def to_string(self, custom_bs: int = None, @@ -478,6 +478,10 @@ def to_string(self, # Add Max number of tokens. entries.append(f"maxnt:{self.max_num_tokens}") + # Add kv cache free gpu mem fraction. + if self.kv_cache_free_gpu_mem_fraction != 0.9: + entries.append(f"kv_frac:{self.kv_cache_free_gpu_mem_fraction}") + if self.build_only: entries.append(f"build_only") @@ -548,10 +552,6 @@ def to_string(self, if self.num_gpus > 1: entries.append(f"gpus:{self.num_gpus}") - # Add kv cache free gpu mem fraction. - if self.kv_cache_free_gpu_mem_fraction != 0.9: - entries.append(f"kv_frac:{self.kv_cache_free_gpu_mem_fraction}") - # Concatenate labels with "-". 
return "-".join(entries) @@ -591,6 +591,10 @@ def load_from_str(self, test_param_labels) -> None: if labels[0].startswith("maxnt"): self.max_num_tokens = int(labels.pop(0).replace("maxnt:", "")) + if labels[0].startswith("kv_frac"): + self.kv_cache_free_gpu_mem_fraction = float( + labels.pop(0).replace("kv_frac:", "")) + if labels[0] == "build_only": self.build_only = True labels.pop(0) @@ -659,11 +663,6 @@ def load_from_str(self, test_param_labels) -> None: self.num_gpus = 1 if not labels[0].startswith("gpus:") else int( labels.pop(0).replace("gpus:", "")) - if len(labels) > 0: - self.kv_cache_free_gpu_mem_fraction = 0.9 if not labels[ - 0].startswith("kv_frac:") else float( - labels.pop(0).replace("kv_frac:", "")) - assert len( labels ) == 0, f"Invalid test name! Some labels cannot be parsed: {labels}" diff --git a/tests/integration/test_lists/qa/trt_llm_release_perf_test.yml b/tests/integration/test_lists/qa/trt_llm_release_perf_test.yml index fcd9c4ff4f4..67965277019 100644 --- a/tests/integration/test_lists/qa/trt_llm_release_perf_test.yml +++ b/tests/integration/test_lists/qa/trt_llm_release_perf_test.yml @@ -473,21 +473,21 @@ trt_llm_release_perf_test: #llama_v4_maverick_17b_128e_instruct_fp8 #pytorch backend - - perf/test_perf.py::test_perf[llama_v4_maverick_17b_128e_instruct_fp8-bench-pytorch-float8-maxbs:1024-maxnt:4096-input_output_len:2000,500-reqs:3000-ep:8-tp:8-gpus:8-kv_frac:0.6] - - perf/test_perf.py::test_perf[llama_v4_maverick_17b_128e_instruct_fp8-bench-pytorch-float8-maxbs:1024-maxnt:4096-input_output_len:500,2000-reqs:3000-ep:8-tp:8-gpus:8-kv_frac:0.6] - - perf/test_perf.py::test_perf[llama_v4_maverick_17b_128e_instruct_fp8-bench-pytorch-float8-maxbs:1024-maxnt:4096-input_output_len:1000,1000-reqs:3000-ep:8-tp:8-gpus:8-kv_frac:0.6] - - perf/test_perf.py::test_perf[llama_v4_maverick_17b_128e_instruct_fp8-bench-pytorch-float8-input_output_len:128,128-ep:8-tp:8-gpus:8-kv_frac:0.6] - - perf/test_perf.py::test_perf[llama_v4_maverick_17b_128e_instruct_fp8-bench-pytorch-float8-input_output_len:512,32-ep:8-tp:8-gpus:8-kv_frac:0.6] + - perf/test_perf.py::test_perf[llama_v4_maverick_17b_128e_instruct_fp8-bench-pytorch-float8-maxbs:1024-maxnt:4096-kv_frac:0.6-input_output_len:2000,500-reqs:3000-ep:8-tp:8-gpus:8] + - perf/test_perf.py::test_perf[llama_v4_maverick_17b_128e_instruct_fp8-bench-pytorch-float8-maxbs:1024-maxnt:4096-kv_frac:0.6-input_output_len:500,2000-reqs:3000-ep:8-tp:8-gpus:8] + - perf/test_perf.py::test_perf[llama_v4_maverick_17b_128e_instruct_fp8-bench-pytorch-float8-maxbs:1024-maxnt:4096-kv_frac:0.6-input_output_len:1000,1000-reqs:3000-ep:8-tp:8-gpus:8] + - perf/test_perf.py::test_perf[llama_v4_maverick_17b_128e_instruct_fp8-bench-pytorch-float8-kv_frac:0.6-input_output_len:128,128-ep:8-tp:8-gpus:8] + - perf/test_perf.py::test_perf[llama_v4_maverick_17b_128e_instruct_fp8-bench-pytorch-float8-kv_frac:0.6-input_output_len:512,32-ep:8-tp:8-gpus:8] #rcca case - - perf/test_perf.py::test_perf[llama_v4_maverick_17b_128e_instruct_fp8-bench-pytorch-float8-input_output_len:20000,2000-reqs:1000-ep:8-tp:8-gpus:8-kv_frac:0.6] + - perf/test_perf.py::test_perf[llama_v4_maverick_17b_128e_instruct_fp8-bench-pytorch-float8-kv_frac:0.6-input_output_len:20000,2000-reqs:1000-ep:8-tp:8-gpus:8] #llama_v4_scout_17b_16e_instruct_fp8 #pytorch backend - - perf/test_perf.py::test_perf[llama_v4_scout_17b_16e_instruct_fp8-bench-pytorch-float8-maxbs:1024-maxnt:4096-input_output_len:2000,500-reqs:3000-ep:8-tp:8-gpus:8-kv_frac:0.6] - - 
perf/test_perf.py::test_perf[llama_v4_scout_17b_16e_instruct_fp8-bench-pytorch-float8-maxbs:1024-maxnt:4096-input_output_len:500,2000-reqs:3000-ep:8-tp:8-gpus:8-kv_frac:0.6] - - perf/test_perf.py::test_perf[llama_v4_scout_17b_16e_instruct_fp8-bench-pytorch-float8-maxbs:1024-maxnt:4096-input_output_len:1000,1000-reqs:3000-ep:8-tp:8-gpus:8-kv_frac:0.6] - - perf/test_perf.py::test_perf[llama_v4_scout_17b_16e_instruct_fp8-bench-pytorch-float8-input_output_len:128,128-ep:8-tp:8-gpus:8-kv_frac:0.6] - - perf/test_perf.py::test_perf[llama_v4_scout_17b_16e_instruct_fp8-bench-pytorch-float8-input_output_len:512,32-ep:8-tp:8-gpus:8-kv_frac:0.6] + - perf/test_perf.py::test_perf[llama_v4_scout_17b_16e_instruct_fp8-bench-pytorch-float8-maxbs:1024-maxnt:4096-kv_frac:0.6-input_output_len:2000,500-reqs:3000-ep:8-tp:8-gpus:8] + - perf/test_perf.py::test_perf[llama_v4_scout_17b_16e_instruct_fp8-bench-pytorch-float8-maxbs:1024-maxnt:4096-kv_frac:0.6-input_output_len:500,2000-reqs:3000-ep:8-tp:8-gpus:8] + - perf/test_perf.py::test_perf[llama_v4_scout_17b_16e_instruct_fp8-bench-pytorch-float8-maxbs:1024-maxnt:4096-kv_frac:0.6-input_output_len:1000,1000-reqs:3000-ep:8-tp:8-gpus:8] + - perf/test_perf.py::test_perf[llama_v4_scout_17b_16e_instruct_fp8-bench-pytorch-float8-kv_frac:0.6-input_output_len:128,128-ep:8-tp:8-gpus:8] + - perf/test_perf.py::test_perf[llama_v4_scout_17b_16e_instruct_fp8-bench-pytorch-float8-kv_frac:0.6-input_output_len:512,32-ep:8-tp:8-gpus:8] #deepseek_r1_fp8 #pytorch backend