Commit 2c0f894

fix: README modifications.
Signed-off-by: Fadi Saady <[email protected]>
1 parent 6ecf4f4 commit 2c0f894

2 files changed

+26
-5
lines changed

examples/sglang/slurm_jobs/README.md

Lines changed: 24 additions & 5 deletions
@@ -1,16 +1,17 @@
1-
# SLURM Jobs for Dynamo Serve Benchmarking
1+
# Example: Deploy Multi-node SGLang with Dynamo on SLURM
22

3-
This folder contains SLURM job scripts designed to launch Dynamo Serve service on SLURM cluster nodes and monitor GPU activity. The primary purpose is to automate the process of starting prefill and decode nodes to enable running benchmarks.
3+
This folder implements the [SGLang DeepSeek-R1 Disaggregated with WideEP](https://github.com/ai-dynamo/dynamo/blob/main/examples/sglang/dsr1-wideep.md) example on a SLURM cluster.
44

55
## Overview
66

7-
The scripts in this folder orchestrate the deployment of Dynamo Serve across multiple cluster nodes, with separate nodes handling prefill and decode operations. The system uses a Python-based job submission system with Jinja2 templates for flexible configuration.
7+
The scripts in this folder set up multiple cluster nodes to run the [SGLang DeepSeek-R1 Disaggregated with WideEP](https://github.com/ai-dynamo/dynamo/blob/main/examples/sglang/dsr1-wideep.md) example, with separate nodes handling prefill and decode.
8+
The node setup is done using Python job submission scripts with Jinja2 templates for flexible configuration. The setup also includes GPU utilization monitoring capabilities to track performance during benchmarks.
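
As a rough illustration of the template-driven approach described above, the sketch below renders a tiny job script from parameters. It is not the actual `job_script_template.j2`: the template text, the `render_job_script` helper, and its parameter names are hypothetical, and the standard library's `string.Template` stands in for Jinja2 so the snippet runs without extra dependencies.

```python
from string import Template

# Hypothetical, trimmed-down job script. The real job_script_template.j2
# is a Jinja2 template and is not reproduced here.
JOB_TEMPLATE = Template("""#!/bin/bash
#SBATCH --job-name=$job_name
#SBATCH --account=$account
#SBATCH --nodes=$total_nodes

echo "prefill nodes: $prefill_nodes, decode nodes: $decode_nodes"
""")


def render_job_script(job_name, account, prefill_nodes, decode_nodes):
    """Fill the template; the node total is derived from the two node groups."""
    return JOB_TEMPLATE.substitute(
        job_name=job_name,
        account=account,
        total_nodes=prefill_nodes + decode_nodes,
        prefill_nodes=prefill_nodes,
        decode_nodes=decode_nodes,
    )


# Example values are placeholders, not recommended settings.
script = render_job_script("dsr1-wideep", "your-slurm-account", 2, 4)
print(script)
```

Keeping the job script in a template means cluster-specific values such as the account or node counts can change without editing the script body.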
89

910
## Scripts
1011

1112
- **`submit_job_script.py`**: Main script for generating and submitting SLURM job scripts from templates
1213
- **`job_script_template.j2`**: Jinja2 template for generating SLURM job scripts
13-
- **`scripts/worker_setup.py`**: Worker script that handles the actual Dynamo Serve setup on each node
14+
- **`scripts/worker_setup.py`**: Worker script that handles the setup on each node
1415
- **`scripts/monitor_gpu_utilization.sh`**: Script for monitoring GPU utilization during benchmarks
1516

1617
## Logs Folder Structure
@@ -41,6 +42,23 @@ logs/
4142
└── ...
4243
```
4344

45+
## Setup
46+
47+
To keep the example simple, we make the following assumptions about your SLURM cluster:
48+
1. We assume you have access to a SLURM cluster with multiple GPU nodes
49+
available. For functional testing, most setups should be fine. For performance
50+
testing, you should aim to allocate groups of nodes that share a high-bandwidth
51+
interconnect, such as those in an NVL72 setup.
52+
2. We assume this SLURM cluster has the [Pyxis](https://github.com/NVIDIA/pyxis)
53+
SPANK plugin set up. In particular, the `job_script_template.j2` template in this
54+
example will use `srun` arguments like `--container-image`,
55+
`--container-mounts`, and `--container-env` that are added to `srun` by Pyxis.
56+
If your cluster supports similar container-based plugins, you may be able to
57+
modify the template to use them instead.
58+
3. We assume you have already built a recent Dynamo+SGLang container image as
59+
described [here](https://github.com/ai-dynamo/dynamo/blob/main/examples/sglang/dsr1-wideep.md#instructions).
60+
This is the image that can be passed to the `--container-image` argument in later steps.
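
To make the Pyxis assumption concrete, here is a minimal sketch of how an `srun` command line using those flags could be assembled. The `build_srun_command` helper, the `/path/to/model:/model` mount mapping, the `HF_HOME` variable, and the `nvidia-smi` payload are illustrative assumptions, not taken from `job_script_template.j2`. Only the three `--container-*` flag names come from Pyxis.

```python
import shlex


def build_srun_command(container_image, mounts, env_names, command):
    """Assemble an srun argument list that uses the Pyxis container flags.

    mounts:    list of (host_path, container_path) pairs
    env_names: names of environment variables to forward via --container-env
    command:   the program (and its arguments) to run inside the container
    """
    args = ["srun", f"--container-image={container_image}"]
    if mounts:
        # Pyxis expects comma-separated SRC:DST pairs.
        pairs = ",".join(f"{src}:{dst}" for src, dst in mounts)
        args.append(f"--container-mounts={pairs}")
    if env_names:
        # Pyxis expects a comma-separated list of variable names.
        args.append("--container-env=" + ",".join(env_names))
    return args + list(command)


# Placeholder values only; substitute your own image, paths, and payload.
cmd = build_srun_command(
    "container-image-uri",
    [("/path/to/model", "/model")],
    ["HF_HOME"],
    ["nvidia-smi"],
)
print(shlex.join(cmd))
```

Building the command as a list and quoting it only for display avoids shell-escaping mistakes when paths contain spaces.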
61+
4462
## Usage
4563

4664
1. **Submit a benchmark job**:
@@ -49,7 +67,8 @@ logs/
4967
--template job_script_template.j2 \
5068
--model-dir /path/to/model \
5169
--config-dir /path/to/configs \
52-
--container-image container-image-uri
70+
--container-image container-image-uri \
71+
--account your-slurm-account
5372
```
5473

5574
**Required arguments**:

examples/sglang/slurm_jobs/scripts/worker_setup.py

Lines changed: 2 additions & 0 deletions
@@ -212,6 +212,8 @@ def setup_prefill_node(
212212
if not wait_for_etcd(f"http://{prefill_host_ip}:{ETCD_CLIENT_PORT}"):
213213
raise RuntimeError("Failed to connect to etcd")
214214

215+
# NOTE: This implements the example in examples/sglang/dsr1-wideep.md
216+
# For other examples, the command might have to be modified.
215217
dynamo_cmd = (
216218
f"python3 components/worker.py "
217219
"--model-path /model/ "

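
The hunk above calls `wait_for_etcd` before building the worker command, but its body is not part of this diff. A minimal sketch of what such a readiness check might look like follows, assuming it polls etcd's HTTP `/health` endpoint. The function name `wait_for_etcd_sketch`, the retry parameters, and their defaults are hypothetical and may differ from the real helper in `worker_setup.py`.

```python
import time
import urllib.error
import urllib.request


def wait_for_etcd_sketch(etcd_url, max_retries=30, delay_s=2.0, timeout_s=5.0):
    """Poll <etcd_url>/health until it answers with HTTP 200.

    Returns True once etcd responds, or False after max_retries failed attempts.
    """
    health_url = f"{etcd_url}/health"
    for _ in range(max_retries):
        try:
            with urllib.request.urlopen(health_url, timeout=timeout_s) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # etcd is not reachable yet; fall through and retry after a pause.
            pass
        time.sleep(delay_s)
    return False
```

A caller would use it the same way the hunk does, raising an error when the check returns `False` so that the node fails fast instead of launching a worker that cannot register.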