4 changes: 4 additions & 0 deletions examples/disaggregated/README.md
@@ -143,3 +143,7 @@ Once the context and generation servers are launched, you can again launch the d
```
trtllm-serve disaggregated -c disagg_config.yaml
```

## Launching disaggregated serving using SLURM

See the [SLURM example](./slurm/README.md) for instructions on launching disaggregated serving with SLURM.
61 changes: 61 additions & 0 deletions examples/disaggregated/slurm/README.md
@@ -0,0 +1,61 @@
# Disaggregated Serving Launcher for SLURM

## Overview

`launcher.py` runs disaggregated serving benchmarks with TRT-LLM on SLURM-managed clusters.

## Usage

```bash
python3 launcher.py \
    --account <account> \
    --partition <partition> \
    --time <time> \
    --job-name <job-name> \
    --container-image <container-image> \
    --config-file <config-file> \
    --experiment-path <experiment-path> \
    --request-allocation \
    --num-gpus <num-gpus>
```
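The flags above can be mirrored with a small argument parser. This is a hypothetical sketch for illustration only, built from the usage line; `launcher.py`'s actual parser, types, and defaults may differ.

```python
import argparse

# Hypothetical parser matching the usage line above; launcher.py's real
# parser may differ (required/optional status and types are assumptions).
def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Disaggregated serving launcher for SLURM")
    parser.add_argument("--account", required=True, help="SLURM account to charge")
    parser.add_argument("--partition", required=True, help="SLURM partition")
    parser.add_argument("--time", required=True, help="Walltime limit, e.g. 01:00:00")
    parser.add_argument("--job-name", required=True, help="SLURM job name")
    parser.add_argument("--container-image", required=True,
                        help="Container image with TRT-LLM installed")
    parser.add_argument("--config-file", required=True, help="YAML config path")
    parser.add_argument("--experiment-path", required=True,
                        help="Directory where results are written")
    parser.add_argument("--request-allocation", action="store_true",
                        help="Request a fresh SLURM allocation")
    parser.add_argument("--num-gpus", type=int, required=True,
                        help="Total GPUs to request")
    return parser

args = build_parser().parse_args([
    "--account", "dev", "--partition", "batch", "--time", "01:00:00",
    "--job-name", "disagg", "--container-image", "trtllm.sqsh",
    "--config-file", "config.yaml", "--experiment-path", "./results",
    "--request-allocation", "--num-gpus", "8",
])
print(args.num_gpus)  # 8
```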

### Configuration

The configuration file should be in the following format:

```yaml
exec:
  model_path: <model-path>
  # Determines the disaggregated serving configuration
  config:
    context:
      tp: <tp>
      ep: <ep>
      pp: <pp>
      max_batch_size: <max_batch_size>
      max_num_tokens: <max_num_tokens>
      max_seq_len: <max_seq_len>
      config:
        # Determines the context server PyTorch configuration
        print_iter_log: true
        disable_overlap_scheduler: true
        kv_cache_config:
          free_gpu_memory_fraction: 0.75
          enable_block_reuse: false
    generation:
      tp: <tp>
      ep: <ep>
      pp: <pp>
      max_batch_size: <max_batch_size>
      max_num_tokens: <max_num_tokens>
      max_seq_len: <max_seq_len>
      config:
        # Determines the generation server PyTorch configuration
        print_iter_log: true
        kv_cache_config:
          free_gpu_memory_fraction: 0.75
          enable_block_reuse: false

# Determines the profiling configuration
profile:
  isl: <isl>
  osl: <osl>
  use_benchmark_serving: true
  concurrency:
    - <concurrency>
```

Please refer to the [config.yaml](config.yaml) file for an example configuration.

Benchmark results are written to the directory given by `--experiment-path`.
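A quick sanity check on a parsed config can catch missing keys before a job is submitted. The sketch below uses a plain dict standing in for the result of `yaml.safe_load`; the required-key lists come from the structure above, but the validation helper itself is an assumption, not part of `launcher.py`.

```python
# Required per-server keys, taken from the config structure documented above.
REQUIRED_SERVER_KEYS = {"tp", "ep", "pp", "max_batch_size",
                        "max_num_tokens", "max_seq_len"}

def validate_config(cfg: dict) -> list[str]:
    """Return human-readable problems; an empty list means the config looks sane.

    Hypothetical helper for illustration; launcher.py may validate differently.
    """
    problems = []
    servers = cfg.get("exec", {}).get("config", {})
    for role in ("context", "generation"):
        missing = REQUIRED_SERVER_KEYS - set(servers.get(role, {}))
        if missing:
            problems.append(f"{role}: missing keys {sorted(missing)}")
    if not cfg.get("profile", {}).get("concurrency"):
        problems.append("profile.concurrency must list at least one value")
    return problems

# Dict mirroring the example config.yaml (stands in for yaml.safe_load output).
cfg = {
    "exec": {
        "model_path": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "config": {
            "context": {"tp": 4, "ep": 4, "pp": 1, "max_batch_size": 4,
                        "max_num_tokens": 1024, "max_seq_len": 1024},
            "generation": {"tp": 4, "ep": 4, "pp": 1, "max_batch_size": 1,
                           "max_num_tokens": 4096, "max_seq_len": 2048},
        },
    },
    "profile": {"isl": 1024, "osl": 1024, "concurrency": [128, 256]},
}
print(validate_config(cfg))  # []
```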
36 changes: 36 additions & 0 deletions examples/disaggregated/slurm/config.yaml
@@ -0,0 +1,36 @@
exec:
  config:
    context:
      tp: 4
      ep: 4
      pp: 1
      max_batch_size: 4
      max_num_tokens: 1024
      max_seq_len: 1024
      config:
        kv_cache_config:
          free_gpu_memory_fraction: 0.75
          enable_block_reuse: false
        print_iter_log: true
      dp: 1
    generation:
      tp: 4
      ep: 4
      pp: 1
      max_batch_size: 1
      max_num_tokens: 4096
      max_seq_len: 2048
      config:
        print_iter_log: true
        kv_cache_config:
          free_gpu_memory_fraction: 0.75
          enable_block_reuse: false
      dp: 1
  model_path: TinyLlama/TinyLlama-1.1B-Chat-v1.0
profile:
  isl: 1024
  osl: 1024
  use_benchmark_serving: true
  concurrency:
    - 128
    - 256
16 changes: 16 additions & 0 deletions examples/disaggregated/slurm/disagg_profiler/__init__.py
@@ -0,0 +1,16 @@
"""
Disaggregated serving profiler package.

This package contains the job management and parameter sweeping functionality
for the TRT-LLM disaggregated serving launcher.
"""

from .job_manager import JobManager, calculate_nodes_needed, wait_for_server
from .sweeper import (AutoSweeper, ParameterSweeper, get_slurm_allocation,
                      run_sweep_configuration)

__all__ = [
    'JobManager', 'calculate_nodes_needed', 'wait_for_server',
    'ParameterSweeper', 'AutoSweeper', 'get_slurm_allocation',
    'run_sweep_configuration'
]
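The package exports a `calculate_nodes_needed` helper. A plausible sketch of what such a helper computes is below; the real implementation in `job_manager.py` may differ, and `gpus_per_node=8` is an assumption (typical for DGX-class nodes), not something the PR states.

```python
import math

# Plausible sketch of a node-count helper; the actual calculate_nodes_needed
# in job_manager.py may differ. gpus_per_node=8 is an assumed default.
def calculate_nodes_needed(num_gpus: int, gpus_per_node: int = 8) -> int:
    """Smallest node count whose GPUs cover the requested total."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return math.ceil(num_gpus / gpus_per_node)

# With tp=4 context plus tp=4 generation servers, 8 GPUs fit on one 8-GPU node.
print(calculate_nodes_needed(8))   # 1
print(calculate_nodes_needed(12))  # 2
```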