4 changes: 4 additions & 0 deletions examples/disaggregated/README.md
@@ -143,3 +143,7 @@ Once the context and generation servers are launched, you can again launch the d
```
trtllm-serve disaggregated -c disagg_config.yaml
```

## Launching disaggregated serving using SLURM

See the [SLURM example](./slurm/README.md) for instructions on launching disaggregated serving with SLURM.
61 changes: 61 additions & 0 deletions examples/disaggregated/slurm/README.md
@@ -0,0 +1,61 @@
# Disaggregated Serving Launcher for SLURM

## Overview

`launcher.py` runs disaggregated serving benchmarks with TRT-LLM on SLURM-managed clusters.

## Usage

```bash
python3 launcher.py \
    --account <account> \
    --partition <partition> \
    --time <time> \
    --job-name <job-name> \
    --container-image <container-image> \
    --config-file <config-file> \
    --experiment-path <experiment-path> \
    --request-allocation \
    --num-gpus <num-gpus>
```
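The flags above can be mirrored with a small argument parser. This is a hypothetical sketch for illustration only, built from the usage line; `launcher.py`'s actual parser, types, and defaults may differ.

```python
import argparse

# Hypothetical parser matching the usage line above; launcher.py's real
# parser may differ (required/optional status and types are assumptions).
def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Disaggregated serving launcher for SLURM")
    parser.add_argument("--account", required=True, help="SLURM account to charge")
    parser.add_argument("--partition", required=True, help="SLURM partition")
    parser.add_argument("--time", required=True, help="Walltime limit, e.g. 01:00:00")
    parser.add_argument("--job-name", required=True, help="SLURM job name")
    parser.add_argument("--container-image", required=True,
                        help="Container image with TRT-LLM installed")
    parser.add_argument("--config-file", required=True, help="YAML config path")
    parser.add_argument("--experiment-path", required=True,
                        help="Directory where results are written")
    parser.add_argument("--request-allocation", action="store_true",
                        help="Request a fresh SLURM allocation")
    parser.add_argument("--num-gpus", type=int, required=True,
                        help="Total GPUs to request")
    return parser

args = build_parser().parse_args([
    "--account", "dev", "--partition", "batch", "--time", "01:00:00",
    "--job-name", "disagg", "--container-image", "trtllm.sqsh",
    "--config-file", "config.yaml", "--experiment-path", "./results",
    "--request-allocation", "--num-gpus", "8",
])
print(args.num_gpus)  # 8
```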

### Configuration

The configuration file should be in the following format:

```yaml
exec:
  model_path: <model-path>
  # Determines the disaggregated serving configuration
  config:
    context:
      tp: <tp>
      ep: <ep>
      pp: <pp>
      max_batch_size: <max_batch_size>
      max_num_tokens: <max_num_tokens>
      max_seq_len: <max_seq_len>
      config:
        # Determines the context server PyTorch configuration
        print_iter_log: true
        disable_overlap_scheduler: true
        kv_cache_config:
          free_gpu_memory_fraction: 0.75
          enable_block_reuse: false
    generation:
      tp: <tp>
      ep: <ep>
      pp: <pp>
      max_batch_size: <max_batch_size>
      max_num_tokens: <max_num_tokens>
      max_seq_len: <max_seq_len>
      config:
        # Determines the generation server PyTorch configuration
        print_iter_log: true
        kv_cache_config:
          free_gpu_memory_fraction: 0.75
          enable_block_reuse: false

# Determines the profiling configuration
profile:
  isl: <isl>
  osl: <osl>
  use_benchmark_serving: true
  concurrency:
    - <concurrency>
```

Please refer to the [config.yaml](config.yaml) file for an example configuration.

Benchmark results are written to the directory given by `--experiment-path`.
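A quick sanity check on a parsed config can catch missing keys before a job is submitted. The sketch below uses a plain dict standing in for the result of `yaml.safe_load`; the required-key lists come from the structure above, but the validation helper itself is an assumption, not part of `launcher.py`.

```python
# Required per-server keys, taken from the config structure documented above.
REQUIRED_SERVER_KEYS = {"tp", "ep", "pp", "max_batch_size",
                        "max_num_tokens", "max_seq_len"}

def validate_config(cfg: dict) -> list[str]:
    """Return human-readable problems; an empty list means the config looks sane.

    Hypothetical helper for illustration; launcher.py may validate differently.
    """
    problems = []
    servers = cfg.get("exec", {}).get("config", {})
    for role in ("context", "generation"):
        missing = REQUIRED_SERVER_KEYS - set(servers.get(role, {}))
        if missing:
            problems.append(f"{role}: missing keys {sorted(missing)}")
    if not cfg.get("profile", {}).get("concurrency"):
        problems.append("profile.concurrency must list at least one value")
    return problems

# Dict mirroring the example config.yaml (stands in for yaml.safe_load output).
cfg = {
    "exec": {
        "model_path": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "config": {
            "context": {"tp": 4, "ep": 4, "pp": 1, "max_batch_size": 4,
                        "max_num_tokens": 1024, "max_seq_len": 1024},
            "generation": {"tp": 4, "ep": 4, "pp": 1, "max_batch_size": 1,
                           "max_num_tokens": 4096, "max_seq_len": 2048},
        },
    },
    "profile": {"isl": 1024, "osl": 1024, "concurrency": [128, 256]},
}
print(validate_config(cfg))  # []
```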
36 changes: 36 additions & 0 deletions examples/disaggregated/slurm/config.yaml
@@ -0,0 +1,36 @@
exec:
  config:
    context:
      tp: 4
      ep: 4
      pp: 1
      max_batch_size: 4
      max_num_tokens: 1024
      max_seq_len: 1024
      config:
        kv_cache_config:
          free_gpu_memory_fraction: 0.75
          enable_block_reuse: false
        print_iter_log: true
      dp: 1
    generation:
      tp: 4
      ep: 4
      pp: 1
      max_batch_size: 1
      max_num_tokens: 4096
      max_seq_len: 2048
      config:
        print_iter_log: true
        kv_cache_config:
          free_gpu_memory_fraction: 0.75
          enable_block_reuse: false
      dp: 1
  model_path: TinyLlama/TinyLlama-1.1B-Chat-v1.0
profile:
  isl: 1024
  osl: 1024
  use_benchmark_serving: true
  concurrency:
    - 128
    - 256
16 changes: 16 additions & 0 deletions examples/disaggregated/slurm/disagg_profiler/__init__.py
@@ -0,0 +1,16 @@
"""
Disaggregated serving profiler package.

This package contains the job management and parameter sweeping functionality
for the TRT-LLM disaggregated serving launcher.
"""

from .job_manager import JobManager, calculate_nodes_needed, wait_for_server
from .sweeper import (AutoSweeper, ParameterSweeper, get_slurm_allocation,
                      run_sweep_configuration)

__all__ = [
    'JobManager', 'calculate_nodes_needed', 'wait_for_server',
    'ParameterSweeper', 'AutoSweeper', 'get_slurm_allocation',
    'run_sweep_configuration'
]
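The package exports a `calculate_nodes_needed` helper. A plausible sketch of what such a helper computes is below; the real implementation in `job_manager.py` may differ, and `gpus_per_node=8` is an assumption (typical for DGX-class nodes), not something the PR states.

```python
import math

# Plausible sketch of a node-count helper; the actual calculate_nodes_needed
# in job_manager.py may differ. gpus_per_node=8 is an assumed default.
def calculate_nodes_needed(num_gpus: int, gpus_per_node: int = 8) -> int:
    """Smallest node count whose GPUs cover the requested total."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return math.ceil(num_gpus / gpus_per_node)

# With tp=4 context plus tp=4 generation servers, 8 GPUs fit on one 8-GPU node.
print(calculate_nodes_needed(8))   # 1
print(calculate_nodes_needed(12))  # 2
```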