# Example: Deploy Multi-node SGLang with Dynamo on SLURM

This folder implements the [SGLang DeepSeek-R1 Disaggregated with WideEP](https://github.com/ai-dynamo/dynamo/blob/main/examples/sglang/dsr1-wideep.md) example on a SLURM cluster.

## Overview

The scripts in this folder set up multiple cluster nodes to run the [SGLang DeepSeek-R1 Disaggregated with WideEP](https://github.com/ai-dynamo/dynamo/blob/main/examples/sglang/dsr1-wideep.md) example, with separate nodes handling prefill and decode.
Node setup is handled by Python job submission scripts with Jinja2 templates for flexible configuration. The setup also includes GPU utilization monitoring to track performance during benchmarks.
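
Conceptually, `submit_job_script.py` fills in a Jinja2 template to produce a SLURM batch script and submits it. A heavily simplified, hypothetical fragment of such a template might look like the following; the variable names are invented for illustration, and `job_script_template.j2` is the real template used by this example:

```bash
#!/bin/bash
# Hypothetical fragment; the template variables below are invented for
# illustration and do not necessarily match job_script_template.j2.
#SBATCH --job-name={{ job_name }}
#SBATCH --nodes={{ total_nodes }}
#SBATCH --account={{ account }}

# The rendered script then uses srun to start scripts/worker_setup.py on each
# allocated node in either a prefill or a decode role.
```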

## Scripts

- **`submit_job_script.py`**: Main script for generating and submitting SLURM job scripts from templates
- **`job_script_template.j2`**: Jinja2 template for generating SLURM job scripts
- **`scripts/worker_setup.py`**: Worker script that handles the setup on each node
- **`scripts/monitor_gpu_utilization.sh`**: Script for monitoring GPU utilization during benchmarks (a rough sketch of this kind of monitoring loop follows this list)
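
The exact contents of `scripts/monitor_gpu_utilization.sh` are not reproduced here; as a rough, assumed sketch, this kind of monitoring typically just polls `nvidia-smi` in a loop. The query fields and interval below are illustrative:

```bash
# Illustrative only; the real script may use different fields, intervals,
# or output handling.
while true; do
    nvidia-smi --query-gpu=timestamp,index,utilization.gpu,memory.used \
               --format=csv,noheader
    sleep 5
done
```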

## Logs Folder Structure

```
└── ...
```

## Setup

For simplicity, this example makes a few assumptions about your SLURM cluster:
1. You have access to a SLURM cluster with multiple GPU nodes available. For
   functional testing, most setups should be fine. For performance testing, you
   should aim to allocate groups of nodes with a fast interconnect between them,
   such as those in an NVL72 setup.
2. The cluster has the [Pyxis](https://github.com/NVIDIA/pyxis) SPANK plugin
   set up. In particular, the `job_script_template.j2` template in this example
   uses `srun` arguments such as `--container-image`, `--container-mounts`, and
   `--container-env`, which are added to `srun` by Pyxis (see the sketch after
   this list). If your cluster supports a similar container-based plugin, you
   may be able to modify the template to use that instead.
3. You have already built a recent Dynamo+SGLang container image as described
   [here](https://github.com/ai-dynamo/dynamo/blob/main/examples/sglang/dsr1-wideep.md#instructions).
   This is the image that can be passed to the `--container-image` argument in
   later steps.
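
To illustrate the Pyxis flags from assumption 2, a standalone `srun` invocation using them might look like this. The image URI, mount path, and environment variable are placeholders; in this example's scripts, the real values are filled in via `job_script_template.j2` and the arguments to `submit_job_script.py`:

```bash
# All values below are placeholders; substitute your own image, mounts,
# and environment variables.
srun --container-image=registry.example.com/dynamo-sglang:latest \
     --container-mounts=/lustre/models/deepseek-r1:/model \
     --container-env=HF_TOKEN \
     nvidia-smi -L   # any command works; this just lists GPUs visible in the container
```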

## Usage

1. **Submit a benchmark job**:
```bash
python submit_job_script.py \
   --template job_script_template.j2 \
   --model-dir /path/to/model \
   --config-dir /path/to/configs \
   --container-image container-image-uri \
   --account your-slurm-account
```

**Required arguments**: