From 4c9e40ae73322f4d5a96e4ad10399107d0c08761 Mon Sep 17 00:00:00 2001
From: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Date: Thu, 3 Jul 2025 21:07:10 +0000
Subject: [PATCH 01/10] Add initial documentation for trtllm-bench CLI.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
---
 docs/source/commands/trtllm-bench.rst | 245 ++++++++++++++++++++++++++
 docs/source/index.rst                 |   1 +
 2 files changed, 246 insertions(+)
 create mode 100644 docs/source/commands/trtllm-bench.rst

diff --git a/docs/source/commands/trtllm-bench.rst b/docs/source/commands/trtllm-bench.rst
new file mode 100644
index 00000000000..5b96f226774
--- /dev/null
+++ b/docs/source/commands/trtllm-bench.rst
@@ -0,0 +1,245 @@
+trtllm-bench
+===========================
+
+trtllm-bench is a comprehensive benchmarking tool for TensorRT-LLM engines. It provides three main subcommands for different benchmarking scenarios:
+
+**Common Options for All Commands:**
+
+**Usage:**
+.. code-block:: bash
+
+   trtllm-bench [OPTIONS] <subcommand> [OPTIONS]
+
+.. list-table::
+   :widths: 20 80
+   :header-rows: 1
+
+   * - Option
+     - Description
+   * - ``--model``, ``-m``
+     - HuggingFace model name (required)
+   * - ``--model_path``
+     - Path to local HuggingFace checkpoint
+   * - ``--workspace``, ``-w``
+     - Directory for intermediate files (default: /tmp)
+   * - ``--log_level``
+     - Logging level (default: info)
+
+
+build
+-----
+Build TensorRT-LLM engines optimized for benchmarking.
+
+**Usage:**
+.. code-block:: bash
+
+   trtllm-bench -m <model_name> build [OPTIONS]
+
+**Key Options:**
+
+.. list-table::
+   :widths: 20 80
+   :header-rows: 1
+
+   * - Option
+     - Description
+   * - ``--tp_size``, ``-tp``
+     - Number of tensor parallel shards (default: 1)
+   * - ``--pp_size``, ``-pp``
+     - Number of pipeline parallel shards (default: 1)
+   * - ``--quantization``, ``-q``
+     - Quantization algorithm (e.g., fp8, int8_sq, nvfp4)
+   * - ``--max_seq_len``
+     - Maximum total sequence length for requests
+   * - ``--dataset``
+     - Dataset file to extract sequence statistics for engine optimization
+   * - ``--max_batch_size``
+     - Maximum number of requests the engine can schedule
+   * - ``--max_num_tokens``
+     - Maximum number of batched tokens the engine can schedule
+   * - ``--target_input_len``
+     - Target average input length for tuning heuristics
+   * - ``--target_output_len``
+     - Target average output length for tuning heuristics
+
+**Engine Build Modes:**
+The build command supports three mutually exclusive optimization modes:
+
+1. **Dataset-based**: Use ``--dataset`` to analyze sequence statistics and optimize engine parameters
+2. **IFB Scheduler**: Use ``--max_batch_size`` and ``--max_num_tokens`` for manual tweaking of inflight batching
+3. **Tuning Heuristics**: Use ``--target_input_len`` and ``--target_output_len`` for heuristic-based optimization
+
+throughput
+----------
+Run throughput benchmarks to measure the engine's processing capacity under load.
+
+**Usage:**
+.. code-block:: bash
+
+   trtllm-bench -m <model_name> throughput [OPTIONS]
+
+**Key Options:**
+
+.. list-table::
+   :widths: 20 80
+   :header-rows: 1
+
+   * - Option
+     - Description
+   * - ``--engine_dir``
+     - Path to the serialized TRT-LLM engine
+   * - ``--backend``
+     - Backend choice (pytorch, _autodeploy)
+   * - ``--max_batch_size``
+     - Maximum runtime batch size
+   * - ``--max_num_tokens``
+     - Maximum runtime tokens the engine can accept
+   * - ``--concurrency``
+     - Number of concurrent requests to process
+   * - ``--dataset``
+     - Dataset file for benchmark input
+   * - ``--num_requests``
+     - Number of requests to process (0 for all)
+   * - ``--warmup``
+     - Number of warmup requests before benchmarking
+   * - ``--streaming``
+     - Enable streaming output mode
+   * - ``--report_json``
+     - Path to save benchmark report
+   * - ``--output_json``
+     - Path to save output tokens
+
+**Performance Features:**
+- Supports both streaming and non-streaming modes
+- Configurable concurrency for load testing
+- Comprehensive reporting with detailed statistics
+
+latency
+-------
+Run low-latency benchmarks optimized for minimal response time.
+
+**Usage:**
+.. code-block:: bash
+
+   trtllm-bench -m <model_name> latency [OPTIONS]
+
+**Key Options:**
+
+.. list-table::
+   :widths: 20 80
+   :header-rows: 1
+
+   * - Option
+     - Description
+   * - ``--engine_dir``
+     - Path to the serialized TRT-LLM engine (required)
+   * - ``--kv_cache_free_gpu_mem_fraction``
+     - GPU memory fraction for KV cache (default: 0.90)
+   * - ``--dataset``
+     - Dataset file for benchmark input
+   * - ``--num_requests``
+     - Number of requests to process
+   * - ``--warmup``
+     - Number of warmup requests (default: 2)
+   * - ``--concurrency``
+     - Number of concurrent requests (default: 1)
+   * - ``--beam_width``
+     - Number of search beams for beam search
+   * - ``--medusa_choices``
+     - Path to YAML file defining Medusa tree for speculative decoding
+   * - ``--report_json``
+     - Path to save benchmark report
+   * - ``--iteration_log``
+     - Path to save iteration logging
+
+
+Examples
+--------
+
+Build an engine optimized for a specific dataset (TensorRT backend only):
+.. code-block:: bash
+
+   trtllm-bench -m <model_name> build --dataset <dataset_path> --tp_size <tp_size> --pp_size <pp_size> --quantization <quant_algo>
+
+Run throughput benchmark (PyTorch):
+.. code-block:: bash
+
+   trtllm-bench -m <model_name> throughput --backend pytorch --dataset <dataset_path> --tp_size <tp_size> --pp_size <pp_size>
+
+Run throughput benchmark (TensorRT):
+.. code-block:: bash
+
+   trtllm-bench -m <model_name> throughput --engine_dir <engine_dir> --dataset <dataset_path>
+
+Run latency benchmark:
+.. code-block:: bash
+
+   trtllm-bench -m <model_name> --engine_dir <engine_dir> --kv_cache_free_gpu_mem_fraction <fraction> --dataset <dataset_path> --num_requests <num_requests> --warmup <warmup> --concurrency <concurrency> --beam_width <beam_width> --medusa_choices <medusa_yaml> --report_json <report_path> --iteration_log <log_path>
+
+Dataset Preparation
+-------------------
+trtllm-bench is designed to work with the ``prepare_dataset.py`` script, which generates benchmark datasets in the required format. The prepare_dataset script supports:
+
+**Dataset Types:**
+- Real datasets from various sources
+- Synthetic datasets with normal or uniform token distributions
+- LoRA task-specific datasets
+
+**Key Features:**
+- Tokenizer integration for proper text preprocessing
+- Configurable random seeds for reproducible results
+- Support for LoRA adapters and task IDs
+- Output in JSON format compatible with trtllm-bench
+
+.. important::
+   The ``--stdout`` flag is **required** when using prepare_dataset.py with trtllm-bench to ensure proper data streaming format.
+
+**prepare_dataset.py CLI Options:**
+
+.. list-table::
+   :widths: 20 80
+   :header-rows: 1
+
+   * - Option
+     - Description
+   * - ``--tokenizer``
+     - Tokenizer directory or HuggingFace model name (required)
+   * - ``--output``
+     - Output JSON filename (default: preprocessed_dataset.json)
+   * - ``--stdout``
+     - Print output to stdout with JSON dataset entry on each line (**required for trtllm-bench**)
+   * - ``--random-seed``
+     - Random seed for token generation (default: 420)
+   * - ``--task-id``
+     - LoRA task ID (default: -1)
+   * - ``--rand-task-id``
+     - Random LoRA task range (two integers)
+   * - ``--lora-dir``
+     - Directory containing LoRA adapters
+   * - ``--log-level``
+     - Logging level: info or debug (default: info)
+
+**prepare_dataset.py Subcommands:**
+
+.. list-table::
+   :widths: 20 80
+   :header-rows: 1
+
+   * - Subcommand
+     - Description
+   * - ``dataset``
+     - Process real datasets from various sources
+   * - ``token_norm_dist``
+     - Generate synthetic datasets with normal token distribution
+   * - ``token_unif_dist``
+     - Generate synthetic datasets with uniform token distribution
+
+**Usage Example:**
+.. code-block:: bash
+
+   python prepare_dataset.py --tokenizer meta-llama/Meta-Llama-3-8B --stdout dataset --output benchmark_data.jsonl
+
+This workflow allows you to:
+1. Prepare datasets using ``prepare_dataset.py`` with the required ``--stdout`` flag
+2. Build optimized engines with ``trtllm-bench build`` using the prepared dataset
+3. Run comprehensive benchmarks with ``trtllm-bench throughput`` or ``trtllm-bench latency``
diff --git a/docs/source/index.rst b/docs/source/index.rst
index b63ec95a676..50b9c122678 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -77,6 +77,7 @@ Welcome to TensorRT-LLM's Documentation!
    :caption: Command-Line Reference
    :hidden:

+   commands/trtllm-bench
    commands/trtllm-build
    commands/trtllm-serve


From d70449c740e810f6adf2eafe7819ddf811c585a3 Mon Sep 17 00:00:00 2001
From: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Date: Thu, 3 Jul 2025 21:15:40 +0000
Subject: [PATCH 02/10] Updates to CLI options.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
---
 docs/source/commands/trtllm-bench.rst | 40 ++++++++++++++++++++++++---
 1 file changed, 36 insertions(+), 4 deletions(-)

diff --git a/docs/source/commands/trtllm-bench.rst b/docs/source/commands/trtllm-bench.rst
index 5b96f226774..d8dc161f309 100644
--- a/docs/source/commands/trtllm-bench.rst
+++ b/docs/source/commands/trtllm-bench.rst
@@ -89,25 +89,57 @@ Run throughput benchmarks to measure the engine's processing capacity under load
    * - ``--engine_dir``
      - Path to the serialized TRT-LLM engine
    * - ``--backend``
-     - Backend choice (pytorch, _autodeploy)
+     - Backend choice (pytorch, _autodeploy); leave unspecified for TensorRT.
+   * - ``--extra_llm_api_options``
+     - Path to a YAML file that overrides trtllm-bench parameters
    * - ``--max_batch_size``
      - Maximum runtime batch size
    * - ``--max_num_tokens``
      - Maximum runtime tokens the engine can accept
-   * - ``--concurrency``
-     - Number of concurrent requests to process
+   * - ``--max_seq_len``
+     - Maximum sequence length
+   * - ``--beam_width``
+     - Number of search beams (default: 1)
+   * - ``--kv_cache_free_gpu_mem_fraction``
+     - GPU memory fraction for KV cache (default: 0.90)
    * - ``--dataset``
      - Dataset file for benchmark input
+   * - ``--eos_id``
+     - End-of-sequence token (-1 to disable)
+   * - ``--modality``
+     - Modality of multimodal requests (image, video)
+   * - ``--max_input_len``
+     - Maximum input sequence length for multimodal models (default: 4096)
    * - ``--num_requests``
      - Number of requests to process (0 for all)
    * - ``--warmup``
      - Number of warmup requests before benchmarking
+   * - ``--target_input_len``
+     - Target average input length for tuning heuristics
+   * - ``--target_output_len``
+     - Target average output length for tuning heuristics
+   * - ``--tp``
+     - Tensor parallelism size (default: 1)
+   * - ``--pp``
+     - Pipeline parallelism size (default: 1)
+   * - ``--ep``
+     - Expert parallelism size
+   * - ``--cluster_size``
+     - Expert cluster parallelism size
+   * - ``--concurrency``
+     - Number of concurrent requests to process
    * - ``--streaming``
      - Enable streaming output mode
    * - ``--report_json``
      - Path to save benchmark report
+   * - ``--iteration_log``
+     - Path to save iteration logging
    * - ``--output_json``
      - Path to save output tokens
+   * - ``--enable_chunked_context``
+     - Enable chunking in prefill stage for enhanced throughput
+   * - ``--scheduler_policy``
+     - KV cache scheduler policy (guaranteed_no_evict, max_utilization)
@@ -138,7 +170,7 @@ Run low-latency benchmarks optimized for minimal response time.
    * - ``--dataset``
      - Dataset file for benchmark input
    * - ``--num_requests``
-     - Number of requests to process
+     - Number of requests to process (0 for all)
    * - ``--warmup``
      - Number of warmup requests (default: 2)
    * - ``--concurrency``

From 64f5ac328e2bb259fa2c4405269a3f338222437b Mon Sep 17 00:00:00 2001
From: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Date: Thu, 3 Jul 2025 21:20:56 +0000
Subject: [PATCH 03/10] Update to fix improper formatting.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
---
 docs/source/commands/trtllm-bench.rst | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/docs/source/commands/trtllm-bench.rst b/docs/source/commands/trtllm-bench.rst
index d8dc161f309..b8b86f6ec0d 100644
--- a/docs/source/commands/trtllm-bench.rst
+++ b/docs/source/commands/trtllm-bench.rst
@@ -74,6 +74,7 @@ throughput
 Run throughput benchmarks to measure the engine's processing capacity under load.

 **Usage:**
+
 .. code-block:: bash

    trtllm-bench -m <model_name> throughput [OPTIONS]
@@ -142,6 +143,7 @@ Run throughput benchmarks to measure the engine's processing capacity under load
      - KV cache scheduler policy (guaranteed_no_evict, max_utilization)

 **Performance Features:**
+
 - Supports both streaming and non-streaming modes
 - Configurable concurrency for load testing
 - Comprehensive reporting with detailed statistics
@@ -151,6 +153,7 @@ latency
 Run low-latency benchmarks optimized for minimal response time.

 **Usage:**
+
 .. code-block:: bash

    trtllm-bench -m <model_name> latency [OPTIONS]
@@ -189,21 +192,25 @@ Examples
 --------

 Build an engine optimized for a specific dataset (TensorRT backend only):
+
 .. code-block:: bash

    trtllm-bench -m <model_name> build --dataset <dataset_path> --tp_size <tp_size> --pp_size <pp_size> --quantization <quant_algo>

 Run throughput benchmark (PyTorch):
+
 .. code-block:: bash

    trtllm-bench -m <model_name> throughput --backend pytorch --dataset <dataset_path> --tp_size <tp_size> --pp_size <pp_size>

 Run throughput benchmark (TensorRT):
+
 .. code-block:: bash

    trtllm-bench -m <model_name> throughput --engine_dir <engine_dir> --dataset <dataset_path>

 Run latency benchmark:
+
 .. code-block:: bash

    trtllm-bench -m <model_name> --engine_dir <engine_dir> --kv_cache_free_gpu_mem_fraction <fraction> --dataset <dataset_path> --num_requests <num_requests> --warmup <warmup> --concurrency <concurrency> --beam_width <beam_width> --medusa_choices <medusa_yaml> --report_json <report_path> --iteration_log <log_path>
@@ -272,6 +279,7 @@ trtllm-bench is designed to work with the ``prepare_dataset.py`` script, which g
    python prepare_dataset.py --tokenizer meta-llama/Meta-Llama-3-8B --stdout dataset --output benchmark_data.jsonl

 This workflow allows you to:
+
 1. Prepare datasets using ``prepare_dataset.py`` with the required ``--stdout`` flag
 2. Build optimized engines with ``trtllm-bench build`` using the prepared dataset
 3. Run comprehensive benchmarks with ``trtllm-bench throughput`` or ``trtllm-bench latency``

From 12d6d4358f190ef568cbea96f75eb00c0dea90b2 Mon Sep 17 00:00:00 2001
From: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Date: Thu, 3 Jul 2025 21:34:29 +0000
Subject: [PATCH 04/10] Further updates.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
---
 docs/source/commands/trtllm-bench.rst | 58 +++++++++++++++++++++++++++
 1 file changed, 58 insertions(+)

diff --git a/docs/source/commands/trtllm-bench.rst b/docs/source/commands/trtllm-bench.rst
index b8b86f6ec0d..cceacf1fb65 100644
--- a/docs/source/commands/trtllm-bench.rst
+++ b/docs/source/commands/trtllm-bench.rst
@@ -273,7 +273,65 @@ trtllm-bench is designed to work with the ``prepare_dataset.py`` script, which g
    * - ``token_unif_dist``
      - Generate synthetic datasets with uniform token distribution

+**Dataset Subcommand Options:**
+
+.. list-table::
+   :widths: 20 80
+   :header-rows: 1
+
+   * - Option
+     - Description
+   * - ``--input``
+     - Input dataset file or directory (required)
+   * - ``--max-input-length``
+     - Maximum input sequence length (default: 2048)
+   * - ``--max-output-length``
+     - Maximum output sequence length (default: 512)
+   * - ``--num-samples``
+     - Number of samples to process (default: all)
+   * - ``--format``
+     - Input format: json, jsonl, csv, or txt (default: auto-detect)
+
+**Token Normal Distribution Subcommand Options:**
+
+.. list-table::
+   :widths: 20 80
+   :header-rows: 1
+
+   * - Option
+     - Description
+   * - ``--num-requests``
+     - Number of requests to be generated (required)
+   * - ``--input-mean``
+     - Normal distribution mean for input tokens (required)
+   * - ``--input-stdev``
+     - Normal distribution standard deviation for input tokens (required)
+   * - ``--output-mean``
+     - Normal distribution mean for output tokens (required)
+   * - ``--output-stdev``
+     - Normal distribution standard deviation for output tokens (required)
+
+**Token Uniform Distribution Subcommand Options:**
+
+.. list-table::
+   :widths: 20 80
+   :header-rows: 1
+
+   * - Option
+     - Description
+   * - ``--num-requests``
+     - Number of requests to be generated (required)
+   * - ``--input-min``
+     - Uniform distribution minimum for input tokens (required)
+   * - ``--input-max``
+     - Uniform distribution maximum for input tokens (required)
+   * - ``--output-min``
+     - Uniform distribution minimum for output tokens (required)
+   * - ``--output-max``
+     - Uniform distribution maximum for output tokens (required)
+
 **Usage Example:**
+
 .. code-block:: bash

    python prepare_dataset.py --tokenizer meta-llama/Meta-Llama-3-8B --stdout dataset --output benchmark_data.jsonl

From 6cf69c6aa13e3299cf90dc26d7bf76cdc38982a7 Mon Sep 17 00:00:00 2001
From: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Date: Thu, 3 Jul 2025 21:41:02 +0000
Subject: [PATCH 05/10] Further updates.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
---
 docs/source/commands/trtllm-bench.rst | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/docs/source/commands/trtllm-bench.rst b/docs/source/commands/trtllm-bench.rst
index cceacf1fb65..2ba189b39d6 100644
--- a/docs/source/commands/trtllm-bench.rst
+++ b/docs/source/commands/trtllm-bench.rst
@@ -6,6 +6,7 @@ trtllm-bench is a comprehensive benchmarking tool for TensorRT-LLM engines. It p
 **Common Options for All Commands:**

 **Usage:**
+
 .. code-block:: bash

    trtllm-bench [OPTIONS] <subcommand> [OPTIONS]
@@ -31,6 +32,7 @@ build
 Build TensorRT-LLM engines optimized for benchmarking.

 **Usage:**
+
 .. code-block:: bash

    trtllm-bench -m <model_name> build [OPTIONS]

From 2fea47054e8faa32b9f20b7314b6eb3e434c0e67 Mon Sep 17 00:00:00 2001
From: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Date: Thu, 3 Jul 2025 21:49:57 +0000
Subject: [PATCH 06/10] More updates

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
---
 docs/source/commands/trtllm-bench.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/source/commands/trtllm-bench.rst b/docs/source/commands/trtllm-bench.rst
index 2ba189b39d6..7dcac1a2b06 100644
--- a/docs/source/commands/trtllm-bench.rst
+++ b/docs/source/commands/trtllm-bench.rst
@@ -215,7 +215,7 @@ Run latency benchmark:

 .. code-block:: bash

-   trtllm-bench -m <model_name> --engine_dir <engine_dir> --kv_cache_free_gpu_mem_fraction <fraction> --dataset <dataset_path> --num_requests <num_requests> --warmup <warmup> --concurrency <concurrency> --beam_width <beam_width> --medusa_choices <medusa_yaml> --report_json <report_path> --iteration_log <log_path>
+   trtllm-bench -m <model_name> latency --engine_dir <engine_dir> --dataset <dataset_path>

 Dataset Preparation
 -------------------
@@ -336,7 +336,7 @@ trtllm-bench is designed to work with the ``prepare_dataset.py`` script, which g

 .. code-block:: bash

-   python prepare_dataset.py --tokenizer meta-llama/Meta-Llama-3-8B --stdout dataset --output benchmark_data.jsonl
+   python prepare_dataset.py --tokenizer <tokenizer_name_or_path> --stdout dataset --output benchmark_data.jsonl

 This workflow allows you to:

From ce1d3f8fa18a8fef6e2d24d8bbd7e7b92cf12dce Mon Sep 17 00:00:00 2001
From: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Date: Wed, 9 Jul 2025 04:53:22 +0000
Subject: [PATCH 07/10] Update to use click parsing.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
---
 docs/source/commands/trtllm-bench.rst | 212 +------------------------
 1 file changed, 4 insertions(+), 208 deletions(-)

diff --git a/docs/source/commands/trtllm-bench.rst b/docs/source/commands/trtllm-bench.rst
index 7dcac1a2b06..bdecd7cd86d 100644
--- a/docs/source/commands/trtllm-bench.rst
+++ b/docs/source/commands/trtllm-bench.rst
@@ -7,215 +7,11 @@ trtllm-bench is a comprehensive benchmarking tool for TensorRT-LLM engines. It p

 **Usage:**

-.. code-block:: bash
-
-   trtllm-bench [OPTIONS] <subcommand> [OPTIONS]
-
-.. list-table::
-   :widths: 20 80
-   :header-rows: 1
-
-   * - Option
-     - Description
-   * - ``--model``, ``-m``
-     - HuggingFace model name (required)
-   * - ``--model_path``
-     - Path to local HuggingFace checkpoint
-   * - ``--workspace``, ``-w``
-     - Directory for intermediate files (default: /tmp)
-   * - ``--log_level``
-     - Logging level (default: info)
-
-
-build
------
-Build TensorRT-LLM engines optimized for benchmarking.
-
-**Usage:**
-
-.. code-block:: bash
-
-   trtllm-bench -m <model_name> build [OPTIONS]
-
-**Key Options:**
-
-.. list-table::
-   :widths: 20 80
-   :header-rows: 1
-
-   * - Option
-     - Description
-   * - ``--tp_size``, ``-tp``
-     - Number of tensor parallel shards (default: 1)
-   * - ``--pp_size``, ``-pp``
-     - Number of pipeline parallel shards (default: 1)
-   * - ``--quantization``, ``-q``
-     - Quantization algorithm (e.g., fp8, int8_sq, nvfp4)
-   * - ``--max_seq_len``
-     - Maximum total sequence length for requests
-   * - ``--dataset``
-     - Dataset file to extract sequence statistics for engine optimization
-   * - ``--max_batch_size``
-     - Maximum number of requests the engine can schedule
-   * - ``--max_num_tokens``
-     - Maximum number of batched tokens the engine can schedule
-   * - ``--target_input_len``
-     - Target average input length for tuning heuristics
-   * - ``--target_output_len``
-     - Target average output length for tuning heuristics
-
-**Engine Build Modes:**
-The build command supports three mutually exclusive optimization modes:
-
-1. **Dataset-based**: Use ``--dataset`` to analyze sequence statistics and optimize engine parameters
-2. **IFB Scheduler**: Use ``--max_batch_size`` and ``--max_num_tokens`` for manual tweaking of inflight batching
-3. **Tuning Heuristics**: Use ``--target_input_len`` and ``--target_output_len`` for heuristic-based optimization
-
-throughput
-----------
-Run throughput benchmarks to measure the engine's processing capacity under load.
-
-**Usage:**
-
-.. code-block:: bash
-
-   trtllm-bench -m <model_name> throughput [OPTIONS]
-
-**Key Options:**
-
-.. list-table::
-   :widths: 20 80
-   :header-rows: 1
-
-   * - Option
-     - Description
-   * - ``--engine_dir``
-     - Path to the serialized TRT-LLM engine
-   * - ``--backend``
-     - Backend choice (pytorch, _autodeploy); leave unspecified for TensorRT.
-   * - ``--extra_llm_api_options``
-     - Path to a YAML file that overrides trtllm-bench parameters
-   * - ``--max_batch_size``
-     - Maximum runtime batch size
-   * - ``--max_num_tokens``
-     - Maximum runtime tokens the engine can accept
-   * - ``--max_seq_len``
-     - Maximum sequence length
-   * - ``--beam_width``
-     - Number of search beams (default: 1)
-   * - ``--kv_cache_free_gpu_mem_fraction``
-     - GPU memory fraction for KV cache (default: 0.90)
-   * - ``--dataset``
-     - Dataset file for benchmark input
-   * - ``--eos_id``
-     - End-of-sequence token (-1 to disable)
-   * - ``--modality``
-     - Modality of multimodal requests (image, video)
-   * - ``--max_input_len``
-     - Maximum input sequence length for multimodal models (default: 4096)
-   * - ``--num_requests``
-     - Number of requests to process (0 for all)
-   * - ``--warmup``
-     - Number of warmup requests before benchmarking
-   * - ``--target_input_len``
-     - Target average input length for tuning heuristics
-   * - ``--target_output_len``
-     - Target average output length for tuning heuristics
-   * - ``--tp``
-     - Tensor parallelism size (default: 1)
-   * - ``--pp``
-     - Pipeline parallelism size (default: 1)
-   * - ``--ep``
-     - Expert parallelism size
-   * - ``--cluster_size``
-     - Expert cluster parallelism size
-   * - ``--concurrency``
-     - Number of concurrent requests to process
-   * - ``--streaming``
-     - Enable streaming output mode
-   * - ``--report_json``
-     - Path to save benchmark report
-   * - ``--iteration_log``
-     - Path to save iteration logging
-   * - ``--output_json``
-     - Path to save output tokens
-   * - ``--enable_chunked_context``
-     - Enable chunking in prefill stage for enhanced throughput
-   * - ``--scheduler_policy``
-     - KV cache scheduler policy (guaranteed_no_evict, max_utilization)
-
-**Performance Features:**
-
-- Supports both streaming and non-streaming modes
-- Configurable concurrency for load testing
-- Comprehensive reporting with detailed statistics
-
-latency
--------
-Run low-latency benchmarks optimized for minimal response time.
-
-**Usage:**
-
-.. code-block:: bash
-
-   trtllm-bench -m <model_name> latency [OPTIONS]
-
-**Key Options:**
-
-.. list-table::
-   :widths: 20 80
-   :header-rows: 1
-
-   * - Option
-     - Description
-   * - ``--engine_dir``
-     - Path to the serialized TRT-LLM engine (required)
-   * - ``--kv_cache_free_gpu_mem_fraction``
-     - GPU memory fraction for KV cache (default: 0.90)
-   * - ``--dataset``
-     - Dataset file for benchmark input
-   * - ``--num_requests``
-     - Number of requests to process (0 for all)
-   * - ``--warmup``
-     - Number of warmup requests (default: 2)
-   * - ``--concurrency``
-     - Number of concurrent requests (default: 1)
-   * - ``--beam_width``
-     - Number of search beams for beam search
-   * - ``--medusa_choices``
-     - Path to YAML file defining Medusa tree for speculative decoding
-   * - ``--report_json``
-     - Path to save benchmark report
-   * - ``--iteration_log``
-     - Path to save iteration logging
-
-
-Examples
---------
-
-Build an engine optimized for a specific dataset (TensorRT backend only):
-
-.. code-block:: bash
-
-   trtllm-bench -m <model_name> build --dataset <dataset_path> --tp_size <tp_size> --pp_size <pp_size> --quantization <quant_algo>
-
-Run throughput benchmark (PyTorch):
-
-.. code-block:: bash
-
-   trtllm-bench -m <model_name> throughput --backend pytorch --dataset <dataset_path> --tp_size <tp_size> --pp_size <pp_size>
-
-Run throughput benchmark (TensorRT):
-
-.. code-block:: bash
-
-   trtllm-bench -m <model_name> throughput --engine_dir <engine_dir> --dataset <dataset_path>
-
-Run latency benchmark:
-
-.. code-block:: bash
+.. click:: tensorrt_llm.commands.bench:main
+   :prog: trtllm-bench
+   :nested: full
+   :commands: throughput, latency, build

-   trtllm-bench -m <model_name> latency --engine_dir <engine_dir> --dataset <dataset_path>

 Dataset Preparation
 -------------------

From da65de4fe1f6df43f431078887dbe628869e0da0 Mon Sep 17 00:00:00 2001
From: Frank <3429989+FrankD412@users.noreply.github.com>
Date: Wed, 9 Jul 2025 20:50:09 -0700
Subject: [PATCH 08/10] Update trtllm-bench.rst

Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: Frank <3429989+FrankD412@users.noreply.github.com>
---
 docs/source/commands/trtllm-bench.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/commands/trtllm-bench.rst b/docs/source/commands/trtllm-bench.rst
index bdecd7cd86d..9ab150a6c01 100644
--- a/docs/source/commands/trtllm-bench.rst
+++ b/docs/source/commands/trtllm-bench.rst
@@ -15,7 +15,7 @@ trtllm-bench is a comprehensive benchmarking tool for TensorRT-LLM engines. It p

 Dataset Preparation
 -------------------
-trtllm-bench is designed to work with the ``prepare_dataset.py`` script, which generates benchmark datasets in the required format. The prepare_dataset script supports:
+trtllm-bench is designed to work with the [`prepare_dataset.py`](https://github.com/NVIDIA/TensorRT-LLM/blob/main/benchmarks/cpp/prepare_dataset.py) script, which generates benchmark datasets in the required format. The prepare_dataset script supports:

 **Dataset Types:**
 - Real datasets from various sources

From 2401485bee5e61c86da7b064ce5740816f2f8b15 Mon Sep 17 00:00:00 2001
From: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Date: Thu, 10 Jul 2025 22:29:43 +0000
Subject: [PATCH 09/10] Update prepare_dataset section.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
---
 docs/source/commands/trtllm-bench.rst | 77 ++++++++++++++++-----------
 1 file changed, 46 insertions(+), 31 deletions(-)

diff --git a/docs/source/commands/trtllm-bench.rst b/docs/source/commands/trtllm-bench.rst
index 9ab150a6c01..06b02a46acf 100644
--- a/docs/source/commands/trtllm-bench.rst
+++ b/docs/source/commands/trtllm-bench.rst
@@ -13,16 +13,20 @@ trtllm-bench is a comprehensive benchmarking tool for TensorRT-LLM engines. It p
    :commands: throughput, latency, build


-Dataset Preparation
--------------------
-trtllm-bench is designed to work with the [`prepare_dataset.py`](https://github.com/NVIDIA/TensorRT-LLM/blob/main/benchmarks/cpp/prepare_dataset.py) script, which generates benchmark datasets in the required format. The prepare_dataset script supports:
+
+prepare_dataset.py
+===========================
+
+trtllm-bench is designed to work with the `prepare_dataset.py <https://github.com/NVIDIA/TensorRT-LLM/blob/main/benchmarks/cpp/prepare_dataset.py>`_ script, which generates benchmark datasets in the required format. The prepare_dataset script supports:

 **Dataset Types:**
+
 - Real datasets from various sources
 - Synthetic datasets with normal or uniform token distributions
 - LoRA task-specific datasets

 **Key Features:**
+
 - Tokenizer integration for proper text preprocessing
 - Configurable random seeds for reproducible results
 - Support for LoRA adapters and task IDs
@@ -31,8 +35,17 @@ trtllm-bench is designed to work with the [`prepare_dataset.py`](https://github.
 .. important::
    The ``--stdout`` flag is **required** when using prepare_dataset.py with trtllm-bench to ensure proper data streaming format.

-**prepare_dataset.py CLI Options:**
+**Usage:**
+
+prepare_dataset
+-------------------
+
+.. code-block:: bash
+
+   python prepare_dataset.py [OPTIONS]

+**Options**
+----
 .. list-table::
    :widths: 20 80
    :header-rows: 1
@@ -56,23 +69,17 @@ trtllm-bench is designed to work with the [`prepare_dataset.py`](https://github.
    * - ``--log-level``
      - Logging level: info or debug (default: info)

-**prepare_dataset.py Subcommands:**
+dataset
+-------------------

-.. list-table::
-   :widths: 20 80
-   :header-rows: 1
+Process real datasets from various sources.

-   * - Subcommand
-     - Description
-   * - ``dataset``
-     - Process real datasets from various sources
-   * - ``token_norm_dist``
-     - Generate synthetic datasets with normal token distribution
-   * - ``token_unif_dist``
-     - Generate synthetic datasets with uniform token distribution
+.. code-block:: bash

-**Dataset Subcommand Options:**
+   python prepare_dataset.py dataset [OPTIONS]

+**Options**
+----
 .. list-table::
    :widths: 20 80
    :header-rows: 1
@@ -90,8 +97,18 @@ trtllm-bench is designed to work with the [`prepare_dataset.py`](https://github.
    * - ``--format``
      - Input format: json, jsonl, csv, or txt (default: auto-detect)

-**Token Normal Distribution Subcommand Options:**
+token_norm_dist
+-------------------
+
+Generate synthetic datasets with normal token distribution.
+
+.. code-block:: bash
+
+   python prepare_dataset.py token_norm_dist [OPTIONS]

+**Options**
+----
 .. list-table::
    :widths: 20 80
    :header-rows: 1
@@ -109,8 +126,18 @@ trtllm-bench is designed to work with the [`prepare_dataset.py`](https://github.
    * - ``--output-stdev``
      - Normal distribution standard deviation for output tokens (required)

-**Token Uniform Distribution Subcommand Options:**
+token_unif_dist
+-------------------
+
+Generate synthetic datasets with uniform token distribution.
+
+.. code-block:: bash
+
+   python prepare_dataset.py token_unif_dist [OPTIONS]

+**Options**
+----
 .. list-table::
    :widths: 20 80
    :header-rows: 1
@@ -127,15 +154,3 @@ trtllm-bench is designed to work with the [`prepare_dataset.py`](https://github.
      - Uniform distribution minimum for output tokens (required)
    * - ``--output-max``
      - Uniform distribution maximum for output tokens (required)
-
-**Usage Example:**
-
-.. code-block:: bash
-
-   python prepare_dataset.py --tokenizer <tokenizer_name_or_path> --stdout dataset --output benchmark_data.jsonl
-
-This workflow allows you to:
-
-1. Prepare datasets using ``prepare_dataset.py`` with the required ``--stdout`` flag
-2. Build optimized engines with ``trtllm-bench build`` using the prepared dataset
-3. Run comprehensive benchmarks with ``trtllm-bench throughput`` or ``trtllm-bench latency``

From 03e55a14528dabbd77f43b83208c8ac25202d896 Mon Sep 17 00:00:00 2001
From: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Date: Thu, 10 Jul 2025 23:23:59 +0000
Subject: [PATCH 10/10] Update formatting.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
---
 docs/source/commands/trtllm-bench.rst | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/docs/source/commands/trtllm-bench.rst b/docs/source/commands/trtllm-bench.rst
index 06b02a46acf..7f03c8dfc66 100644
--- a/docs/source/commands/trtllm-bench.rst
+++ b/docs/source/commands/trtllm-bench.rst
@@ -45,7 +45,9 @@ prepare_dataset
    python prepare_dataset.py [OPTIONS]

 **Options**
+
 ----
+
 .. list-table::
    :widths: 20 80
    :header-rows: 1
@@ -79,7 +81,9 @@ Process real datasets from various sources.
    python prepare_dataset.py dataset [OPTIONS]

 **Options**
+
 ----
+
 .. list-table::
    :widths: 20 80
    :header-rows: 1
@@ -108,7 +112,9 @@ Generate synthetic datasets with normal token distribution.
    python prepare_dataset.py token_norm_dist [OPTIONS]

 **Options**
+
 ----
+
 .. list-table::
    :widths: 20 80
    :header-rows: 1
@@ -137,7 +143,9 @@ Generate synthetic datasets with uniform token distribution.
    python prepare_dataset.py token_unif_dist [OPTIONS]

 **Options**
+
 ----
+
 .. list-table::
    :widths: 20 80
    :header-rows: 1