
Commit 51baf61

feat: automate slurm handling in sglang example.
Signed-off-by: Fadi Saady <[email protected]>
1 parent ee86bad commit 51baf61

File tree

6 files changed: +691 -0 lines changed

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
logs/*
outputs/*
Lines changed: 89 additions & 0 deletions
@@ -0,0 +1,89 @@
# SLURM Jobs for Dynamo Serve Benchmarking

This folder contains SLURM job scripts that launch the Dynamo Serve service on SLURM cluster nodes and monitor GPU activity. Their primary purpose is to automate starting the prefill and decode nodes so that benchmarks can be run against the deployment.

## Overview

The scripts in this folder orchestrate the deployment of Dynamo Serve across multiple cluster nodes, with separate nodes handling prefill and decode operations. Jobs are submitted by a Python script that renders a Jinja2 template, which keeps the configuration flexible.
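The submission logic itself lives in `submit_job_script.py`, which is part of this commit but not reproduced in the excerpt above. As a minimal sketch only, assuming a conventional render-then-`sbatch` flow, the core step could look roughly like the following; the function name `render_and_submit` and the temp-file handling are illustrative, while the keyword arguments mirror the `{{ ... }}` placeholders in `job_script_template.j2` and the CLI flags documented under Usage:

```python
# Hypothetical sketch of the render-and-submit step, not the committed code.
import argparse
import subprocess
import tempfile

from jinja2 import Template


def render_and_submit(args: argparse.Namespace) -> None:
    """Render the SLURM job template and hand the result to sbatch."""
    with open(args.template) as f:
        template = Template(f.read())

    # Total node count is derived from the prefill/decode split.
    total_nodes = args.prefill_nodes + args.decode_nodes

    rendered = template.render(
        job_name=args.job_name,
        account=args.account,
        time_limit=args.time_limit,
        total_nodes=total_nodes,
        prefill_nodes=args.prefill_nodes,
        decode_nodes=args.decode_nodes,
        gpus_per_node=args.gpus_per_node,
        model_dir=args.model_dir,
        config_dir=args.config_dir,
        container_image=args.container_image,
        network_interface=args.network_interface,
    )

    # Write the rendered job script to a temporary file and submit it.
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
        f.write(rendered)
        job_script = f.name

    subprocess.run(["sbatch", job_script], check=True)
```

Whatever the real script does internally, the important contract is the set of template variables it must supply; these correspond to the required and optional arguments listed under Usage below.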
## Scripts

- **`submit_job_script.py`**: Main script for generating and submitting SLURM job scripts from templates
- **`job_script_template.j2`**: Jinja2 template for generating SLURM job scripts
- **`scripts/worker_setup.py`**: Worker script that handles the actual Dynamo Serve setup on each node
- **`scripts/monitor_gpu_utilization.sh`**: Script for monitoring GPU utilization during benchmarks
## Logs Folder Structure

Each SLURM job creates a unique log directory under `logs/` using the job ID. For example, job ID `3062824` creates the directory `logs/3062824/`.

### Log File Structure

```
logs/
├── 3062824/                                 # Job ID directory
│   ├── log.out                              # Main job output (node allocation, IP addresses, launch commands)
│   ├── log.err                              # Main job errors
│   ├── eos0197_prefill.out                  # Prefill node stdout (eos0197)
│   ├── eos0197_prefill.err                  # Prefill node stderr (eos0197)
│   ├── eos0200_prefill.out                  # Prefill node stdout (eos0200)
│   ├── eos0200_prefill.err                  # Prefill node stderr (eos0200)
│   ├── eos0201_decode.out                   # Decode node stdout (eos0201)
│   ├── eos0201_decode.err                   # Decode node stderr (eos0201)
│   ├── eos0204_decode.out                   # Decode node stdout (eos0204)
│   ├── eos0204_decode.err                   # Decode node stderr (eos0204)
│   ├── eos0197_prefill_gpu_utilization.log  # GPU utilization monitoring (eos0197)
│   ├── eos0200_prefill_gpu_utilization.log  # GPU utilization monitoring (eos0200)
│   ├── eos0201_decode_gpu_utilization.log   # GPU utilization monitoring (eos0201)
│   └── eos0204_decode_gpu_utilization.log   # GPU utilization monitoring (eos0204)
├── 3063137/                                 # Another job ID directory
├── 3062689/                                 # Another job ID directory
└── ...
```
## Usage

1. **Submit a benchmark job**:

   ```bash
   python submit_job_script.py \
       --template job_script_template.j2 \
       --model-dir /path/to/model \
       --config-dir /path/to/configs \
       --container-image container-image-uri \
       --account your-slurm-account
   ```

   **Required arguments**:
   - `--template`: Path to Jinja2 template file
   - `--model-dir`: Model directory path
   - `--config-dir`: Config directory path
   - `--container-image`: Container image URI (e.g., `registry/repository:tag`)
   - `--account`: SLURM account

   **Optional arguments**:
   - `--prefill-nodes`: Number of prefill nodes (default: `2`)
   - `--decode-nodes`: Number of decode nodes (default: `2`)
   - `--gpus-per-node`: Number of GPUs per node (default: `8`)
   - `--network-interface`: Network interface to use (default: `eth3`)
   - `--job-name`: SLURM job name (default: `dynamo_setup`)
   - `--time-limit`: Time limit in HH:MM:SS format (default: `01:00:00`)

   **Note**: The script automatically calculates the total number of nodes needed from `--prefill-nodes` and `--decode-nodes`; with the defaults (2 prefill + 2 decode), the job requests 4 nodes.
2. **Monitor job progress**:

   ```bash
   squeue -u $USER
   ```

3. **Check logs in real-time**:

   ```bash
   tail -f logs/{JOB_ID}/log.out
   ```

4. **Monitor GPU utilization**:

   ```bash
   tail -f logs/{JOB_ID}/{node}_prefill_gpu_utilization.log
   ```

## Outputs

Benchmark results and outputs are stored in the `outputs/` directory, which the job script mounts into the container at `/outputs/`.
Lines changed: 98 additions & 0 deletions
@@ -0,0 +1,98 @@
#!/bin/bash
#SBATCH --job-name={{ job_name }}
#SBATCH --nodes={{ total_nodes }}
#SBATCH --ntasks={{ total_nodes }}
#SBATCH --ntasks-per-node=1
#SBATCH --account={{ account }}
#SBATCH --time={{ time_limit }}
#SBATCH --output=logs/%j/log.out
#SBATCH --error=logs/%j/log.err

# Constants
PREFILL_NODES={{ prefill_nodes }}
DECODE_NODES={{ decode_nodes }}
TOTAL_NODES=$((PREFILL_NODES + DECODE_NODES))
GPUS_PER_NODE={{ gpus_per_node }}
LOG_DIR="${SLURM_SUBMIT_DIR}/logs/${SLURM_JOB_ID}/"
SCRIPT_DIR="${SLURM_SUBMIT_DIR}/scripts"
OUTPUT_DIR="${SLURM_SUBMIT_DIR}/outputs"
MODEL_DIR="{{ model_dir }}"
CONFIG_DIR="{{ config_dir }}"
CONTAINER_IMAGE="{{ container_image }}"
NETWORK_INTERFACE="{{ network_interface }}"

{% raw %}
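# NOTE: the raw block above keeps Jinja2 from parsing the rest of this script;
# bash syntax such as "${#nodes[@]}" contains "{#", which Jinja2 would otherwise
# treat as the start of a template comment.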

mkdir -p "${OUTPUT_DIR}" "${LOG_DIR}"

nodes=($(scontrol show hostnames $SLURM_NODELIST))
if [ ${#nodes[@]} -ne $TOTAL_NODES ]; then
    echo "Error: Expected $TOTAL_NODES nodes but got ${#nodes[@]} nodes"
    exit 1
fi

# Print node information
for i in "${!nodes[@]}"; do
    echo "Node $i: ${nodes[$i]}"
done

PREFILL_HOST_IP=$(srun --nodes=1 --ntasks=1 --nodelist=${nodes[0]} ifconfig $NETWORK_INTERFACE | grep -oP 'inet \K[0-9.]+')
if [ -z "$PREFILL_HOST_IP" ]; then
    echo "Error: Could not retrieve IP address for prefill host ${nodes[0]} on interface $NETWORK_INTERFACE"
    exit 1
fi
echo "Prefill host IP address: $PREFILL_HOST_IP"

DECODE_HOST_IP=$(srun --nodes=1 --ntasks=1 --nodelist=${nodes[$PREFILL_NODES]} ifconfig $NETWORK_INTERFACE | grep -oP 'inet \K[0-9.]+')
if [ -z "$DECODE_HOST_IP" ]; then
    echo "Error: Could not retrieve IP address for decode host ${nodes[$PREFILL_NODES]} on interface $NETWORK_INTERFACE"
    exit 1
fi
echo "Decode host IP address: $DECODE_HOST_IP"

# Prepare enroot arguments to pass to srun commands
ENROOT_ARGS="\
    --container-image=${CONTAINER_IMAGE} \
    --no-container-entrypoint \
    --container-mount-home \
    --no-container-remap-root \
    --container-mounts=${MODEL_DIR}:/model/,${CONFIG_DIR}:/configs/,${SCRIPT_DIR}:/scripts/,${OUTPUT_DIR}:/outputs/,${LOG_DIR}:/logs/ \
"

# Launch prefill tasks on the first PREFILL_NODES nodes
for i in $(seq 0 $((PREFILL_NODES - 1))); do
    node=${nodes[$i]}
    rank=$i
    echo "Launching prefill task on node ${i} (rank ${rank}): $node"
    echo "Srun args: $ENROOT_ARGS --nodes=1 --ntasks=1 --nodelist=$node --output=${LOG_DIR}/${node}_prefill.out --error=${LOG_DIR}/${node}_prefill.err"
    echo "Command: python /scripts/worker_setup.py --prefill_host_ip ${PREFILL_HOST_IP} --decode_host_ip ${DECODE_HOST_IP} --rank ${rank} --total_nodes ${PREFILL_NODES} --worker_type prefill --gpus_per_node ${GPUS_PER_NODE} --gpu_utilization_log /logs/${node}_prefill_gpu_utilization.log &"
    srun $ENROOT_ARGS --nodes=1 --ntasks=1 --nodelist=$node \
        --output=${LOG_DIR}/${node}_prefill.out --error=${LOG_DIR}/${node}_prefill.err \
        python /scripts/worker_setup.py --prefill_host_ip ${PREFILL_HOST_IP} --decode_host_ip ${DECODE_HOST_IP} --rank ${rank} --total_nodes ${PREFILL_NODES} --worker_type prefill --gpus_per_node ${GPUS_PER_NODE} --gpu_utilization_log /logs/${node}_prefill_gpu_utilization.log &
done

# Launch decode tasks on the next DECODE_NODES nodes
for i in $(seq $PREFILL_NODES $((PREFILL_NODES + DECODE_NODES - 1))); do
    node=${nodes[$i]}
    rank=$((i - PREFILL_NODES))
    echo "Launching decode task on node ${i} (rank ${rank}): $node"
    echo "Srun args: $ENROOT_ARGS --nodes=1 --ntasks=1 --nodelist=$node --output=${LOG_DIR}/${node}_decode.out --error=${LOG_DIR}/${node}_decode.err"
    echo "Command: python /scripts/worker_setup.py --decode_host_ip ${DECODE_HOST_IP} --prefill_host_ip ${PREFILL_HOST_IP} --rank ${rank} --total_nodes ${DECODE_NODES} --worker_type decode --gpus_per_node ${GPUS_PER_NODE} --gpu_utilization_log /logs/${node}_decode_gpu_utilization.log &"
    srun $ENROOT_ARGS --nodes=1 --ntasks=1 --nodelist=$node \
        --output=${LOG_DIR}/${node}_decode.out --error=${LOG_DIR}/${node}_decode.err \
        python /scripts/worker_setup.py --decode_host_ip ${DECODE_HOST_IP} --prefill_host_ip ${PREFILL_HOST_IP} --rank ${rank} --total_nodes ${DECODE_NODES} --worker_type decode --gpus_per_node ${GPUS_PER_NODE} --gpu_utilization_log /logs/${node}_decode_gpu_utilization.log &
done

echo ""
echo "To connect to the host prefill node:"
echo "srun $ENROOT_ARGS --jobid $SLURM_JOB_ID -w ${nodes[0]} --overlap --pty bash"

echo ""
echo "Make sure to cancel the job at the end:"
echo "scancel $SLURM_JOB_ID"

# Wait for all tasks to complete
wait
echo "Script finished at $(date)"

{% endraw %}
Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
#!/bin/bash

# Usage: ./monitor_gpu_utilization.sh [interval_seconds]

# Default interval is 2 seconds
INTERVAL=${1:-2}

# Check if nvidia-smi is available
if ! command -v nvidia-smi &> /dev/null; then
    echo "$(date '+%Y-%m-%d %H:%M:%S') Error: nvidia-smi not found"
    exit 1
fi

echo "Starting GPU utilization monitoring (checking every ${INTERVAL}s, printing only on changes)..."

PREV_UTILIZATION=""
while true; do
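    # Query per-GPU utilization and join nvidia-smi's output lines into one space-separated line.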
    CURRENT_UTILIZATION=$(nvidia-smi --query-gpu=utilization.gpu --format=csv,nounits | paste -sd ' ' -)
    if [ $? -ne 0 ]; then
        echo "$(date '+%Y-%m-%d %H:%M:%S') Error: nvidia-smi command failed"
    else
        if [ "$CURRENT_UTILIZATION" != "$PREV_UTILIZATION" ]; then
            echo "$(date '+%Y-%m-%d %H:%M:%S') GPU Utilization: $CURRENT_UTILIZATION"
            PREV_UTILIZATION="$CURRENT_UTILIZATION"
        fi
    fi

    sleep $INTERVAL
done
