From 4c9e40ae73322f4d5a96e4ad10399107d0c08761 Mon Sep 17 00:00:00 2001
From: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Date: Thu, 3 Jul 2025 21:07:10 +0000
Subject: [PATCH 01/10] Add initial documentation for trtllm-bench CLI.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
---
 docs/source/commands/trtllm-bench.rst | 245 ++++++++++++++++++++++++++
 docs/source/index.rst                 |   1 +
 2 files changed, 246 insertions(+)
 create mode 100644 docs/source/commands/trtllm-bench.rst

diff --git a/docs/source/commands/trtllm-bench.rst b/docs/source/commands/trtllm-bench.rst
new file mode 100644
index 00000000000..5b96f226774
--- /dev/null
+++ b/docs/source/commands/trtllm-bench.rst
@@ -0,0 +1,245 @@
+trtllm-bench
+===========================
+
+trtllm-bench is a comprehensive benchmarking tool for TensorRT-LLM engines. It provides three main subcommands for different benchmarking scenarios:
+
+**Common Options for All Commands:**
+
+**Usage:**
+.. code-block:: bash
+
+   trtllm-bench [OPTIONS] <subcommand> [OPTIONS]
+
+.. list-table::
+   :widths: 20 80
+   :header-rows: 1
+
+   * - Option
+     - Description
+   * - ``--model``, ``-m``
+     - HuggingFace model name (required)
+   * - ``--model_path``
+     - Path to local HuggingFace checkpoint
+   * - ``--workspace``, ``-w``
+     - Directory for intermediate files (default: /tmp)
+   * - ``--log_level``
+     - Logging level (default: info)
+
+
+build
+-----
+Build TensorRT-LLM engines optimized for benchmarking.
+
+**Usage:**
+.. code-block:: bash
+
+   trtllm-bench -m <model_name> build [OPTIONS]
+
+**Key Options:**
+
+.. list-table::
+   :widths: 20 80
+   :header-rows: 1
+
+   * - Option
+     - Description
+   * - ``--tp_size``, ``-tp``
+     - Number of tensor parallel shards (default: 1)
+   * - ``--pp_size``, ``-pp``
+     - Number of pipeline parallel shards (default: 1)
+   * - ``--quantization``, ``-q``
+     - Quantization algorithm (e.g., fp8, int8_sq, nvfp4)
+   * - ``--max_seq_len``
+     - Maximum total sequence length for requests
+   * - ``--dataset``
+     - Dataset file to extract sequence statistics for engine optimization
+   * - ``--max_batch_size``
+     - Maximum number of requests the engine can schedule
+   * - ``--max_num_tokens``
+     - Maximum number of batched tokens the engine can schedule
+   * - ``--target_input_len``
+     - Target average input length for tuning heuristics
+   * - ``--target_output_len``
+     - Target average output length for tuning heuristics
+
+**Engine Build Modes:**
+The build command supports three mutually exclusive optimization modes:
+
+1. **Dataset-based**: Use ``--dataset`` to analyze sequence statistics and optimize engine parameters
+2. **IFB Scheduler**: Use ``--max_batch_size`` and ``--max_num_tokens`` for manual tweaking of inflight batching
+3. **Tuning Heuristics**: Use ``--target_input_len`` and ``--target_output_len`` for heuristic-based optimization
+
+throughput
+----------
+Run throughput benchmarks to measure the engine's processing capacity under load.
+
+**Usage:**
+.. code-block:: bash
+
+   trtllm-bench -m <model_name> throughput [OPTIONS]
+
+**Key Options:**
+
+.. list-table::
+   :widths: 20 80
+   :header-rows: 1
+
+   * - Option
+     - Description
+   * - ``--engine_dir``
+     - Path to the serialized TRT-LLM engine
+   * - ``--backend``
+     - Backend choice (pytorch, _autodeploy)
+   * - ``--max_batch_size``
+     - Maximum runtime batch size
+   * - ``--max_num_tokens``
+     - Maximum runtime tokens the engine can accept
+   * - ``--concurrency``
+     - Number of concurrent requests to process
+   * - ``--dataset``
+     - Dataset file for benchmark input
+   * - ``--num_requests``
+     - Number of requests to process (0 for all)
+   * - ``--warmup``
+     - Number of warmup requests before benchmarking
+   * - ``--streaming``
+     - Enable streaming output mode
+   * - ``--report_json``
+     - Path to save benchmark report
+   * - ``--output_json``
+     - Path to save output tokens
+
+**Performance Features:**
+- Supports both streaming and non-streaming modes
+- Configurable concurrency for load testing
+- Comprehensive reporting with detailed statistics
+
+latency
+-------
+Run low-latency benchmarks optimized for minimal response time.
+
+**Usage:**
+.. code-block:: bash
+
+   trtllm-bench -m <model_name> latency [OPTIONS]
+
+**Key Options:**
+
+.. list-table::
+   :widths: 20 80
+   :header-rows: 1
+
+   * - Option
+     - Description
+   * - ``--engine_dir``
+     - Path to the serialized TRT-LLM engine (required)
+   * - ``--kv_cache_free_gpu_mem_fraction``
+     - GPU memory fraction for KV cache (default: 0.90)
+   * - ``--dataset``
+     - Dataset file for benchmark input
+   * - ``--num_requests``
+     - Number of requests to process
+   * - ``--warmup``
+     - Number of warmup requests (default: 2)
+   * - ``--concurrency``
+     - Number of concurrent requests (default: 1)
+   * - ``--beam_width``
+     - Number of search beams for beam search
+   * - ``--medusa_choices``
+     - Path to YAML file defining Medusa tree for speculative decoding
+   * - ``--report_json``
+     - Path to save benchmark report
+   * - ``--iteration_log``
+     - Path to save iteration logging
+
+
+Examples
+--------
+
+Build an engine optimized for a specific dataset (TensorRT backend only):
+.. code-block:: bash
+
+   trtllm-bench -m <model_name> build --dataset <dataset_path> --tp_size <tp_size> --pp_size <pp_size> --quantization <quant_algo>
+
+Run throughput benchmark (PyTorch):
+.. code-block:: bash
+
+   trtllm-bench -m <model_name> throughput --backend pytorch --dataset <dataset_path> --tp_size <tp_size> --pp_size <pp_size>
+
+Run throughput benchmark (TensorRT):
+.. code-block:: bash
+
+   trtllm-bench -m <model_name> throughput --engine_dir <engine_dir> --dataset <dataset_path>
+
+Run latency benchmark:
+.. code-block:: bash
+
+   trtllm-bench -m <model_name> --engine_dir <engine_dir> --kv_cache_free_gpu_mem_fraction <fraction> --dataset <dataset_path> --num_requests <num_requests> --warmup <warmup> --concurrency <concurrency> --beam_width <beam_width> --medusa_choices <medusa_yaml> --report_json <report_path> --iteration_log <log_path>
+
+Dataset Preparation
+-------------------
+trtllm-bench is designed to work with the ``prepare_dataset.py`` script, which generates benchmark datasets in the required format. The prepare_dataset script supports:
+
+**Dataset Types:**
+- Real datasets from various sources
+- Synthetic datasets with normal or uniform token distributions
+- LoRA task-specific datasets
+
+**Key Features:**
+- Tokenizer integration for proper text preprocessing
+- Configurable random seeds for reproducible results
+- Support for LoRA adapters and task IDs
+- Output in JSON format compatible with trtllm-bench
+
+.. important::
+   The ``--stdout`` flag is **required** when using prepare_dataset.py with trtllm-bench to ensure proper data streaming format.
+
+**prepare_dataset.py CLI Options:**
+
+.. list-table::
+   :widths: 20 80
+   :header-rows: 1
+
+   * - Option
+     - Description
+   * - ``--tokenizer``
+     - Tokenizer directory or HuggingFace model name (required)
+   * - ``--output``
+     - Output JSON filename (default: preprocessed_dataset.json)
+   * - ``--stdout``
+     - Print output to stdout with JSON dataset entry on each line (**required for trtllm-bench**)
+   * - ``--random-seed``
+     - Random seed for token generation (default: 420)
+   * - ``--task-id``
+     - LoRA task ID (default: -1)
+   * - ``--rand-task-id``
+     - Random LoRA task range (two integers)
+   * - ``--lora-dir``
+     - Directory containing LoRA adapters
+   * - ``--log-level``
+     - Logging level: info or debug (default: info)
+
+**prepare_dataset.py Subcommands:**
+
+.. list-table::
+   :widths: 20 80
+   :header-rows: 1
+
+   * - Subcommand
+     - Description
+   * - ``dataset``
+     - Process real datasets from various sources
+   * - ``token_norm_dist``
+     - Generate synthetic datasets with normal token distribution
+   * - ``token_unif_dist``
+     - Generate synthetic datasets with uniform token distribution
+
+**Usage Example:**
+.. code-block:: bash
+
+   python prepare_dataset.py --tokenizer meta-llama/Meta-Llama-3-8B --stdout dataset --output benchmark_data.jsonl
+
+This workflow allows you to:
+1. Prepare datasets using ``prepare_dataset.py`` with the required ``--stdout`` flag
+2. Build optimized engines with ``trtllm-bench build`` using the prepared dataset
+3. Run comprehensive benchmarks with ``trtllm-bench throughput`` or ``trtllm-bench latency``
diff --git a/docs/source/index.rst b/docs/source/index.rst
index b63ec95a676..50b9c122678 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -77,6 +77,7 @@ Welcome to TensorRT-LLM's Documentation!
    :caption: Command-Line Reference
    :hidden:

+   commands/trtllm-bench
    commands/trtllm-build
    commands/trtllm-serve


From d70449c740e810f6adf2eafe7819ddf811c585a3 Mon Sep 17 00:00:00 2001
From: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Date: Thu, 3 Jul 2025 21:15:40 +0000
Subject: [PATCH 02/10] Updates to CLI options.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
---
 docs/source/commands/trtllm-bench.rst | 40 ++++++++++++++++++++++++---
 1 file changed, 36 insertions(+), 4 deletions(-)

diff --git a/docs/source/commands/trtllm-bench.rst b/docs/source/commands/trtllm-bench.rst
index 5b96f226774..d8dc161f309 100644
--- a/docs/source/commands/trtllm-bench.rst
+++ b/docs/source/commands/trtllm-bench.rst
@@ -89,25 +89,57 @@ Run throughput benchmarks to measure the engine's processing capacity under load
    * - ``--engine_dir``
      - Path to the serialized TRT-LLM engine
    * - ``--backend``
-     - Backend choice (pytorch, _autodeploy)
+     - Backend choice (pytorch, _autodeploy); leave unspecified for TensorRT.
+   * - ``--extra_llm_api_options``
+     - Path to a YAML file that overrides trtllm-bench parameters
    * - ``--max_batch_size``
      - Maximum runtime batch size
    * - ``--max_num_tokens``
      - Maximum runtime tokens the engine can accept
-   * - ``--concurrency``
-     - Number of concurrent requests to process
+   * - ``--max_seq_len``
+     - Maximum sequence length
+   * - ``--beam_width``
+     - Number of search beams (default: 1)
+   * - ``--kv_cache_free_gpu_mem_fraction``
+     - GPU memory fraction for KV cache (default: 0.90)
    * - ``--dataset``
      - Dataset file for benchmark input
+   * - ``--eos_id``
+     - End-of-sequence token (-1 to disable)
+   * - ``--modality``
+     - Modality of multimodal requests (image, video)
+   * - ``--max_input_len``
+     - Maximum input sequence length for multimodal models (default: 4096)
    * - ``--num_requests``
      - Number of requests to process (0 for all)
    * - ``--warmup``
      - Number of warmup requests before benchmarking
+   * - ``--target_input_len``
+     - Target average input length for tuning heuristics
+   * - ``--target_output_len``
+     - Target average output length for tuning heuristics
+   * - ``--tp``
+     - Tensor parallelism size (default: 1)
+   * - ``--pp``
+     - Pipeline parallelism size (default: 1)
+   * - ``--ep``
+     - Expert parallelism size
+   * - ``--cluster_size``
+     - Expert cluster parallelism size
+   * - ``--concurrency``
+     - Number of concurrent requests to process
    * - ``--streaming``
      - Enable streaming output mode
    * - ``--report_json``
      - Path to save benchmark report
+   * - ``--iteration_log``
+     - Path to save iteration logging
    * - ``--output_json``
      - Path to save output tokens
+   * - ``--enable_chunked_context``
+     - Enable chunking in prefill stage for enhanced throughput
+   * - ``--scheduler_policy``
+     - KV cache scheduler policy (guaranteed_no_evict, max_utilization)
@@ -138,7 +170,7 @@ Run low-latency benchmarks optimized for minimal response time.
    * - ``--dataset``
      - Dataset file for benchmark input
    * - ``--num_requests``
-     - Number of requests to process
+     - Number of requests to process (0 for all)
    * - ``--warmup``
      - Number of warmup requests (default: 2)
    * - ``--concurrency``

From 64f5ac328e2bb259fa2c4405269a3f338222437b Mon Sep 17 00:00:00 2001
From: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Date: Thu, 3 Jul 2025 21:20:56 +0000
Subject: [PATCH 03/10] Update to fix improper formatting.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
---
 docs/source/commands/trtllm-bench.rst | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/docs/source/commands/trtllm-bench.rst b/docs/source/commands/trtllm-bench.rst
index d8dc161f309..b8b86f6ec0d 100644
--- a/docs/source/commands/trtllm-bench.rst
+++ b/docs/source/commands/trtllm-bench.rst
@@ -74,6 +74,7 @@ throughput
 Run throughput benchmarks to measure the engine's processing capacity under load.

 **Usage:**
+
 .. code-block:: bash

    trtllm-bench -m <model_name> throughput [OPTIONS]
@@ -142,6 +143,7 @@ Run throughput benchmarks to measure the engine's processing capacity under load
      - KV cache scheduler policy (guaranteed_no_evict, max_utilization)

 **Performance Features:**
+
 - Supports both streaming and non-streaming modes
 - Configurable concurrency for load testing
 - Comprehensive reporting with detailed statistics
@@ -151,6 +153,7 @@ latency
 Run low-latency benchmarks optimized for minimal response time.

 **Usage:**
+
 .. code-block:: bash

    trtllm-bench -m <model_name> latency [OPTIONS]
@@ -189,21 +192,25 @@ Examples
 --------

 Build an engine optimized for a specific dataset (TensorRT backend only):
+
 .. code-block:: bash

    trtllm-bench -m <model_name> build --dataset <dataset_path> --tp_size <tp_size> --pp_size <pp_size> --quantization <quant_algo>

 Run throughput benchmark (PyTorch):
+
 .. code-block:: bash

    trtllm-bench -m <model_name> throughput --backend pytorch --dataset <dataset_path> --tp_size <tp_size> --pp_size <pp_size>

 Run throughput benchmark (TensorRT):
+
 .. code-block:: bash

    trtllm-bench -m <model_name> throughput --engine_dir <engine_dir> --dataset <dataset_path>

 Run latency benchmark:
+
 .. code-block:: bash

    trtllm-bench -m <model_name> --engine_dir <engine_dir> --kv_cache_free_gpu_mem_fraction <fraction> --dataset <dataset_path> --num_requests <num_requests> --warmup <warmup> --concurrency <concurrency> --beam_width <beam_width> --medusa_choices <medusa_yaml> --report_json <report_path> --iteration_log <log_path>
@@ -272,6 +279,7 @@ trtllm-bench is designed to work with the ``prepare_dataset.py`` script, which g
    python prepare_dataset.py --tokenizer meta-llama/Meta-Llama-3-8B --stdout dataset --output benchmark_data.jsonl

 This workflow allows you to:
+
 1. Prepare datasets using ``prepare_dataset.py`` with the required ``--stdout`` flag
 2. Build optimized engines with ``trtllm-bench build`` using the prepared dataset
 3. Run comprehensive benchmarks with ``trtllm-bench throughput`` or ``trtllm-bench latency``

From 12d6d4358f190ef568cbea96f75eb00c0dea90b2 Mon Sep 17 00:00:00 2001
From: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Date: Thu, 3 Jul 2025 21:34:29 +0000
Subject: [PATCH 04/10] Further updates.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
---
 docs/source/commands/trtllm-bench.rst | 58 +++++++++++++++++++++++++++
 1 file changed, 58 insertions(+)

diff --git a/docs/source/commands/trtllm-bench.rst b/docs/source/commands/trtllm-bench.rst
index b8b86f6ec0d..cceacf1fb65 100644
--- a/docs/source/commands/trtllm-bench.rst
+++ b/docs/source/commands/trtllm-bench.rst
@@ -273,7 +273,65 @@ trtllm-bench is designed to work with the ``prepare_dataset.py`` script, which g
    * - ``token_unif_dist``
      - Generate synthetic datasets with uniform token distribution

+**Dataset Subcommand Options:**
+
+.. list-table::
+   :widths: 20 80
+   :header-rows: 1
+
+   * - Option
+     - Description
+   * - ``--input``
+     - Input dataset file or directory (required)
+   * - ``--max-input-length``
+     - Maximum input sequence length (default: 2048)
+   * - ``--max-output-length``
+     - Maximum output sequence length (default: 512)
+   * - ``--num-samples``
+     - Number of samples to process (default: all)
+   * - ``--format``
+     - Input format: json, jsonl, csv, or txt (default: auto-detect)
+
+**Token Normal Distribution Subcommand Options:**
+
+.. list-table::
+   :widths: 20 80
+   :header-rows: 1
+
+   * - Option
+     - Description
+   * - ``--num-requests``
+     - Number of requests to be generated (required)
+   * - ``--input-mean``
+     - Normal distribution mean for input tokens (required)
+   * - ``--input-stdev``
+     - Normal distribution standard deviation for input tokens (required)
+   * - ``--output-mean``
+     - Normal distribution mean for output tokens (required)
+   * - ``--output-stdev``
+     - Normal distribution standard deviation for output tokens (required)
+
+**Token Uniform Distribution Subcommand Options:**
+
+.. list-table::
+   :widths: 20 80
+   :header-rows: 1
+
+   * - Option
+     - Description
+   * - ``--num-requests``
+     - Number of requests to be generated (required)
+   * - ``--input-min``
+     - Uniform distribution minimum for input tokens (required)
+   * - ``--input-max``
+     - Uniform distribution maximum for input tokens (required)
+   * - ``--output-min``
+     - Uniform distribution minimum for output tokens (required)
+   * - ``--output-max``
+     - Uniform distribution maximum for output tokens (required)
+
 **Usage Example:**
+
 .. code-block:: bash

    python prepare_dataset.py --tokenizer meta-llama/Meta-Llama-3-8B --stdout dataset --output benchmark_data.jsonl

From 6cf69c6aa13e3299cf90dc26d7bf76cdc38982a7 Mon Sep 17 00:00:00 2001
From: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Date: Thu, 3 Jul 2025 21:41:02 +0000
Subject: [PATCH 05/10] Further updates.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
---
 docs/source/commands/trtllm-bench.rst | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/docs/source/commands/trtllm-bench.rst b/docs/source/commands/trtllm-bench.rst
index cceacf1fb65..2ba189b39d6 100644
--- a/docs/source/commands/trtllm-bench.rst
+++ b/docs/source/commands/trtllm-bench.rst
@@ -6,6 +6,7 @@ trtllm-bench is a comprehensive benchmarking tool for TensorRT-LLM engines. It p
 **Common Options for All Commands:**

 **Usage:**
+
 .. code-block:: bash

    trtllm-bench [OPTIONS] <subcommand> [OPTIONS]
@@ -31,6 +32,7 @@ build
 Build TensorRT-LLM engines optimized for benchmarking.

 **Usage:**
+
 .. code-block:: bash

    trtllm-bench -m <model_name> build [OPTIONS]

From 2fea47054e8faa32b9f20b7314b6eb3e434c0e67 Mon Sep 17 00:00:00 2001
From: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Date: Thu, 3 Jul 2025 21:49:57 +0000
Subject: [PATCH 06/10] More updates

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
---
 docs/source/commands/trtllm-bench.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/source/commands/trtllm-bench.rst b/docs/source/commands/trtllm-bench.rst
index 2ba189b39d6..7dcac1a2b06 100644
--- a/docs/source/commands/trtllm-bench.rst
+++ b/docs/source/commands/trtllm-bench.rst
@@ -215,7 +215,7 @@ Run latency benchmark:

 .. code-block:: bash

-   trtllm-bench -m <model_name> --engine_dir <engine_dir> --kv_cache_free_gpu_mem_fraction <fraction> --dataset <dataset_path> --num_requests <num_requests> --warmup <warmup> --concurrency <concurrency> --beam_width <beam_width> --medusa_choices <medusa_yaml> --report_json <report_path> --iteration_log <log_path>
+   trtllm-bench -m <model_name> latency --engine_dir <engine_dir> --dataset <dataset_path>

 Dataset Preparation
 -------------------
@@ -336,7 +336,7 @@ trtllm-bench is designed to work with the ``prepare_dataset.py`` script, which g

 .. code-block:: bash

-   python prepare_dataset.py --tokenizer meta-llama/Meta-Llama-3-8B --stdout dataset --output benchmark_data.jsonl
+   python prepare_dataset.py --tokenizer <tokenizer_name_or_path> --stdout dataset --output benchmark_data.jsonl

 This workflow allows you to:

From ce1d3f8fa18a8fef6e2d24d8bbd7e7b92cf12dce Mon Sep 17 00:00:00 2001
From: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Date: Wed, 9 Jul 2025 04:53:22 +0000
Subject: [PATCH 07/10] Update to use click parsing.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
---
 docs/source/commands/trtllm-bench.rst | 212 +------------------------
 1 file changed, 4 insertions(+), 208 deletions(-)

diff --git a/docs/source/commands/trtllm-bench.rst b/docs/source/commands/trtllm-bench.rst
index 7dcac1a2b06..bdecd7cd86d 100644
--- a/docs/source/commands/trtllm-bench.rst
+++ b/docs/source/commands/trtllm-bench.rst
@@ -7,215 +7,11 @@ trtllm-bench is a comprehensive benchmarking tool for TensorRT-LLM engines. It p

 **Usage:**

-.. code-block:: bash
-
-   trtllm-bench [OPTIONS] <subcommand> [OPTIONS]
-
-.. list-table::
-   :widths: 20 80
-   :header-rows: 1
-
-   * - Option
-     - Description
-   * - ``--model``, ``-m``
-     - HuggingFace model name (required)
-   * - ``--model_path``
-     - Path to local HuggingFace checkpoint
-   * - ``--workspace``, ``-w``
-     - Directory for intermediate files (default: /tmp)
-   * - ``--log_level``
-     - Logging level (default: info)
-
-
-build
------
-Build TensorRT-LLM engines optimized for benchmarking.
-
-**Usage:**
-
-.. code-block:: bash
-
-   trtllm-bench -m <model_name> build [OPTIONS]
-
-**Key Options:**
-
-.. list-table::
-   :widths: 20 80
-   :header-rows: 1
-
-   * - Option
-     - Description
-   * - ``--tp_size``, ``-tp``
-     - Number of tensor parallel shards (default: 1)
-   * - ``--pp_size``, ``-pp``
-     - Number of pipeline parallel shards (default: 1)
-   * - ``--quantization``, ``-q``
-     - Quantization algorithm (e.g., fp8, int8_sq, nvfp4)
-   * - ``--max_seq_len``
-     - Maximum total sequence length for requests
-   * - ``--dataset``
-     - Dataset file to extract sequence statistics for engine optimization
-   * - ``--max_batch_size``
-     - Maximum number of requests the engine can schedule
-   * - ``--max_num_tokens``
-     - Maximum number of batched tokens the engine can schedule
-   * - ``--target_input_len``
-     - Target average input length for tuning heuristics
-   * - ``--target_output_len``
-     - Target average output length for tuning heuristics
-
-**Engine Build Modes:**
-The build command supports three mutually exclusive optimization modes:
-
-1. **Dataset-based**: Use ``--dataset`` to analyze sequence statistics and optimize engine parameters
-2. **IFB Scheduler**: Use ``--max_batch_size`` and ``--max_num_tokens`` for manual tweaking of inflight batching
-3. **Tuning Heuristics**: Use ``--target_input_len`` and ``--target_output_len`` for heuristic-based optimization
-
-throughput
-----------
-Run throughput benchmarks to measure the engine's processing capacity under load.
-
-**Usage:**
-
-.. code-block:: bash
-
-   trtllm-bench -m <model_name> throughput [OPTIONS]
-
-**Key Options:**
-
-.. list-table::
-   :widths: 20 80
-   :header-rows: 1
-
-   * - Option
-     - Description
-   * - ``--engine_dir``
-     - Path to the serialized TRT-LLM engine
-   * - ``--backend``
-     - Backend choice (pytorch, _autodeploy); leave unspecified for TensorRT.
-   * - ``--extra_llm_api_options``
-     - Path to a YAML file that overrides trtllm-bench parameters
-   * - ``--max_batch_size``
-     - Maximum runtime batch size
-   * - ``--max_num_tokens``
-     - Maximum runtime tokens the engine can accept
-   * - ``--max_seq_len``
-     - Maximum sequence length
-   * - ``--beam_width``
-     - Number of search beams (default: 1)
-   * - ``--kv_cache_free_gpu_mem_fraction``
-     - GPU memory fraction for KV cache (default: 0.90)
-   * - ``--dataset``
-     - Dataset file for benchmark input
-   * - ``--eos_id``
-     - End-of-sequence token (-1 to disable)
-   * - ``--modality``
-     - Modality of multimodal requests (image, video)
-   * - ``--max_input_len``
-     - Maximum input sequence length for multimodal models (default: 4096)
-   * - ``--num_requests``
-     - Number of requests to process (0 for all)
-   * - ``--warmup``
-     - Number of warmup requests before benchmarking
-   * - ``--target_input_len``
-     - Target average input length for tuning heuristics
-   * - ``--target_output_len``
-     - Target average output length for tuning heuristics
-   * - ``--tp``
-     - Tensor parallelism size (default: 1)
-   * - ``--pp``
-     - Pipeline parallelism size (default: 1)
-   * - ``--ep``
-     - Expert parallelism size
-   * - ``--cluster_size``
-     - Expert cluster parallelism size
-   * - ``--concurrency``
-     - Number of concurrent requests to process
-   * - ``--streaming``
-     - Enable streaming output mode
-   * - ``--report_json``
-     - Path to save benchmark report
-   * - ``--iteration_log``
-     - Path to save iteration logging
-   * - ``--output_json``
-     - Path to save output tokens
-   * - ``--enable_chunked_context``
-     - Enable chunking in prefill stage for enhanced throughput
-   * - ``--scheduler_policy``
-     - KV cache scheduler policy (guaranteed_no_evict, max_utilization)
-
-**Performance Features:**
-
-- Supports both streaming and non-streaming modes
-- Configurable concurrency for load testing
-- Comprehensive reporting with detailed statistics
-
-latency
--------
-Run low-latency benchmarks optimized for minimal response time.
-
-**Usage:**
-
-.. code-block:: bash
-
-   trtllm-bench -m <model_name> latency [OPTIONS]
-
-**Key Options:**
-
-.. list-table::
-   :widths: 20 80
-   :header-rows: 1
-
-   * - Option
-     - Description
-   * - ``--engine_dir``
-     - Path to the serialized TRT-LLM engine (required)
-   * - ``--kv_cache_free_gpu_mem_fraction``
-     - GPU memory fraction for KV cache (default: 0.90)
-   * - ``--dataset``
-     - Dataset file for benchmark input
-   * - ``--num_requests``
-     - Number of requests to process (0 for all)
-   * - ``--warmup``
-     - Number of warmup requests (default: 2)
-   * - ``--concurrency``
-     - Number of concurrent requests (default: 1)
-   * - ``--beam_width``
-     - Number of search beams for beam search
-   * - ``--medusa_choices``
-     - Path to YAML file defining Medusa tree for speculative decoding
-   * - ``--report_json``
-     - Path to save benchmark report
-   * - ``--iteration_log``
-     - Path to save iteration logging
-
-
-Examples
---------
-
-Build an engine optimized for a specific dataset (TensorRT backend only):
-
-.. code-block:: bash
-
-   trtllm-bench -m <model_name> build --dataset <dataset_path> --tp_size <tp_size> --pp_size <pp_size> --quantization <quant_algo>
-
-Run throughput benchmark (PyTorch):
-
-.. code-block:: bash
-
-   trtllm-bench -m <model_name> throughput --backend pytorch --dataset <dataset_path> --tp_size <tp_size> --pp_size <pp_size>
-
-Run throughput benchmark (TensorRT):
-
-.. code-block:: bash
-
-   trtllm-bench -m <model_name> throughput --engine_dir <engine_dir> --dataset <dataset_path>
-
-Run latency benchmark:
-
-.. code-block:: bash
+.. click:: tensorrt_llm.commands.bench:main
+   :prog: trtllm-bench
+   :nested: full
+   :commands: throughput, latency, build

-   trtllm-bench -m <model_name> latency --engine_dir <engine_dir> --dataset <dataset_path>

 Dataset Preparation
 -------------------

From da65de4fe1f6df43f431078887dbe628869e0da0 Mon Sep 17 00:00:00 2001
From: Frank <3429989+FrankD412@users.noreply.github.com>
Date: Wed, 9 Jul 2025 20:50:09 -0700
Subject: [PATCH 08/10] Update trtllm-bench.rst

Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: Frank <3429989+FrankD412@users.noreply.github.com>
---
 docs/source/commands/trtllm-bench.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/commands/trtllm-bench.rst b/docs/source/commands/trtllm-bench.rst
index bdecd7cd86d..9ab150a6c01 100644
--- a/docs/source/commands/trtllm-bench.rst
+++ b/docs/source/commands/trtllm-bench.rst
@@ -15,7 +15,7 @@ trtllm-bench is a comprehensive benchmarking tool for TensorRT-LLM engines. It p

 Dataset Preparation
 -------------------
-trtllm-bench is designed to work with the ``prepare_dataset.py`` script, which generates benchmark datasets in the required format. The prepare_dataset script supports:
+trtllm-bench is designed to work with the [`prepare_dataset.py`](https://github.com/NVIDIA/TensorRT-LLM/blob/main/benchmarks/cpp/prepare_dataset.py) script, which generates benchmark datasets in the required format. The prepare_dataset script supports:

 **Dataset Types:**
 - Real datasets from various sources

From 2401485bee5e61c86da7b064ce5740816f2f8b15 Mon Sep 17 00:00:00 2001
From: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Date: Thu, 10 Jul 2025 22:29:43 +0000
Subject: [PATCH 09/10] Update prepare_dataset section.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
---
 docs/source/commands/trtllm-bench.rst | 77 ++++++++++++++++-----------
 1 file changed, 46 insertions(+), 31 deletions(-)

diff --git a/docs/source/commands/trtllm-bench.rst b/docs/source/commands/trtllm-bench.rst
index 9ab150a6c01..06b02a46acf 100644
--- a/docs/source/commands/trtllm-bench.rst
+++ b/docs/source/commands/trtllm-bench.rst
@@ -13,16 +13,20 @@ trtllm-bench is a comprehensive benchmarking tool for TensorRT-LLM engines. It p
    :commands: throughput, latency, build


-Dataset Preparation
--------------------
-trtllm-bench is designed to work with the [`prepare_dataset.py`](https://github.com/NVIDIA/TensorRT-LLM/blob/main/benchmarks/cpp/prepare_dataset.py) script, which generates benchmark datasets in the required format. The prepare_dataset script supports:
+
+prepare_dataset.py
+===========================
+
+trtllm-bench is designed to work with the `prepare_dataset.py <https://github.com/NVIDIA/TensorRT-LLM/blob/main/benchmarks/cpp/prepare_dataset.py>`_ script, which generates benchmark datasets in the required format. The prepare_dataset script supports:

 **Dataset Types:**
+
 - Real datasets from various sources
 - Synthetic datasets with normal or uniform token distributions
 - LoRA task-specific datasets

 **Key Features:**
+
 - Tokenizer integration for proper text preprocessing
 - Configurable random seeds for reproducible results
 - Support for LoRA adapters and task IDs
@@ -31,8 +35,17 @@ trtllm-bench is designed to work with the [`prepare_dataset.py`](https://github.
 .. important::
    The ``--stdout`` flag is **required** when using prepare_dataset.py with trtllm-bench to ensure proper data streaming format.

-**prepare_dataset.py CLI Options:**
+**Usage:**
+
+prepare_dataset
+-------------------
+
+.. code-block:: bash
+
+   python prepare_dataset.py [OPTIONS]

+**Options**
+----
 .. list-table::
    :widths: 20 80
    :header-rows: 1
@@ -56,23 +69,17 @@ trtllm-bench is designed to work with the [`prepare_dataset.py`](https://github.
    * - ``--log-level``
      - Logging level: info or debug (default: info)

-**prepare_dataset.py Subcommands:**
+dataset
+-------------------

-.. list-table::
-   :widths: 20 80
-   :header-rows: 1
+Process real datasets from various sources.

-   * - Subcommand
-     - Description
-   * - ``dataset``
-     - Process real datasets from various sources
-   * - ``token_norm_dist``
-     - Generate synthetic datasets with normal token distribution
-   * - ``token_unif_dist``
-     - Generate synthetic datasets with uniform token distribution
+.. code-block:: bash

-**Dataset Subcommand Options:**
+   python prepare_dataset.py dataset [OPTIONS]

+**Options**
+----
 .. list-table::
    :widths: 20 80
    :header-rows: 1
@@ -90,8 +97,18 @@ trtllm-bench is designed to work with the [`prepare_dataset.py`](https://github.
    * - ``--format``
      - Input format: json, jsonl, csv, or txt (default: auto-detect)

-**Token Normal Distribution Subcommand Options:**
+token_norm_dist
+-------------------
+
+Generate synthetic datasets with normal token distribution.
+
+.. code-block:: bash
+
+   python prepare_dataset.py token_norm_dist [OPTIONS]

+**Options**
+----
 .. list-table::
    :widths: 20 80
    :header-rows: 1
@@ -109,8 +126,18 @@ trtllm-bench is designed to work with the [`prepare_dataset.py`](https://github.
    * - ``--output-stdev``
      - Normal distribution standard deviation for output tokens (required)

-**Token Uniform Distribution Subcommand Options:**
+token_unif_dist
+-------------------
+
+Generate synthetic datasets with uniform token distribution.
+
+.. code-block:: bash
+
+   python prepare_dataset.py token_unif_dist [OPTIONS]

+**Options**
+----
 .. list-table::
    :widths: 20 80
    :header-rows: 1
@@ -127,15 +154,3 @@ trtllm-bench is designed to work with the [`prepare_dataset.py`](https://github.
      - Uniform distribution minimum for output tokens (required)
    * - ``--output-max``
      - Uniform distribution maximum for output tokens (required)
-
-**Usage Example:**
-
-.. code-block:: bash
-
-   python prepare_dataset.py --tokenizer <tokenizer_name_or_path> --stdout dataset --output benchmark_data.jsonl
-
-This workflow allows you to:
-
-1. Prepare datasets using ``prepare_dataset.py`` with the required ``--stdout`` flag
-2. Build optimized engines with ``trtllm-bench build`` using the prepared dataset
-3. Run comprehensive benchmarks with ``trtllm-bench throughput`` or ``trtllm-bench latency``

From 03e55a14528dabbd77f43b83208c8ac25202d896 Mon Sep 17 00:00:00 2001
From: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Date: Thu, 10 Jul 2025 23:23:59 +0000
Subject: [PATCH 10/10] Update formatting.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
---
 docs/source/commands/trtllm-bench.rst | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/docs/source/commands/trtllm-bench.rst b/docs/source/commands/trtllm-bench.rst
index 06b02a46acf..7f03c8dfc66 100644
--- a/docs/source/commands/trtllm-bench.rst
+++ b/docs/source/commands/trtllm-bench.rst
@@ -45,7 +45,9 @@ prepare_dataset
    python prepare_dataset.py [OPTIONS]

 **Options**
+
 ----
+
 .. list-table::
    :widths: 20 80
    :header-rows: 1
@@ -79,7 +81,9 @@ Process real datasets from various sources.
    python prepare_dataset.py dataset [OPTIONS]

 **Options**
+
 ----
+
 .. list-table::
    :widths: 20 80
    :header-rows: 1
@@ -108,7 +112,9 @@ Generate synthetic datasets with normal token distribution.
    python prepare_dataset.py token_norm_dist [OPTIONS]

 **Options**
+
 ----
+
 .. list-table::
    :widths: 20 80
    :header-rows: 1
@@ -137,7 +143,9 @@ Generate synthetic datasets with uniform token distribution.
    python prepare_dataset.py token_unif_dist [OPTIONS]

 **Options**
+
 ----
+
 .. list-table::
    :widths: 20 80
    :header-rows: 1