Slurm for GPU clusters
Scope: Slurm as the HPC batch workload manager for GPU clusters, covering partitions, gang scheduling, GRES, topology-aware placement, and multi-node training launch.
The examples here are reference templates built on real APIs; pin versions and validate before production use.
What it is
Slurm (slurm.schedmd.com) is the dominant open-source HPC workload manager: a batch scheduler that allocates nodes from partitions (named node pools) to jobs, gang-schedules tightly-coupled multi-node work, tracks GPUs as GRES (Generic RESources, type gpu), and places jobs with awareness of the network fabric topology. Jobs are submitted as batch scripts (sbatch) or run as interactive steps (srun), with native MPI/PMIx launch. It is the default for bare-metal pretraining (provisioning and scheduling, distributed training). As of mid-2026, deployments commonly run the 25.x line (six-month release cadence); verify the target version on slurm.schedmd.com.
Why use it
- Tightly-coupled training: gang scheduling places all ranks of a multi-node job at once or not at all, which is what synchronous data/model-parallel training needs.
- Bare metal: no container/pod overhead by default; processes run directly on the node (containers via Pyxis/Enroot when wanted).
- Topology-aware: topology.conf maps switches/blocks so the scheduler packs a job onto the fewest, closest switches (rail-aware), minimising cross-spine NCCL traffic (networking fabric, performance tuning).
- MPI-native: first-class PMIx/MPI launch for HPC codes and torchrun.
When to use it (and when not)
Compared with Kubernetes (see the orchestration overview):
- Use Slurm for tightly-coupled, multi-node, topology-sensitive training on bare metal, and for batch/offline GPU jobs (sweeps, batch inference).
- Prefer Kubernetes for multi-tenant platforms, long-running online services, declarative GitOps, and container-first operations.
- Many sites run both, partitioning nodes between a Slurm pool and a K8s pool; Ray can run on either.
Architecture
flowchart TB
USER["User: sbatch / srun"] --> CTLD["slurmctld (controller)"]
CTLD --> DBD["slurmdbd (accounting)"]
CTLD --> D1["slurmd (compute node, GRES gpu)"]
CTLD --> D2["slurmd (compute node, GRES gpu)"]
D1 --- D2
TOPO["topology.conf"] -.->|"switch / block map"| CTLD
How to use it
A batch job script declares resources with #SBATCH and launches ranks with srun. A multi-node, 8-GPU-per-node torchrun job:
#!/bin/bash
#SBATCH --job-name=train
#SBATCH --partition=blackwell
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8
#SBATCH --cpus-per-task=96
#SBATCH --exclusive
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)
export MASTER_PORT=23456
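# Slurm 22.05+ no longer passes --cpus-per-task from sbatch to srun; request it again
srun --cpus-per-task="$SLURM_CPUS_PER_TASK" torchrun \
  --nnodes="$SLURM_NNODES" \
  --nproc_per_node=8 \
  --rdzv_backend=c10d \
  --rdzv_endpoint="$MASTER_ADDR:$MASTER_PORT" \
  train.py --fsdp --bf16
How to develop with it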
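The fixed MASTER_PORT in the job above is fine when a job owns its nodes. On shared nodes, two jobs can collide on the same port. A minimal sketch of one common convention follows, deriving the port from the job ID; the helper name job_port and the 20000-29999 range are assumptions for illustration, chosen to stay below the default Linux ephemeral port range.

```shell
#!/bin/bash
# Hypothetical helper: map a Slurm job ID to a port in 20000-29999.
job_port() {
  local job_id="${1:-0}"
  echo $(( 20000 + job_id % 10000 ))
}

# SLURM_JOB_ID is set by Slurm inside a job; fall back to 0 when run outside one.
export MASTER_PORT="$(job_port "${SLURM_JOB_ID:-0}")"
echo "MASTER_PORT=$MASTER_PORT"
```

This replaces only the `export MASTER_PORT=23456` line; the rendezvous endpoint still reads `$MASTER_ADDR:$MASTER_PORT`.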
Iterate with srun for interactive steps; structure sweeps as job arrays and stage pipelines with dependencies. Request GPUs explicitly with --gres=gpu:N (or --gpus).
# interactive single-GPU shell on the gpu partition
srun --partition=gpu --gres=gpu:1 --pty bash
# job array: 16 independent runs, indices 0-15, each on 1 GPU
sbatch --array=0-15 --gres=gpu:1 sweep.sh # use $SLURM_ARRAY_TASK_ID inside
# dependency: run eval only after the training job succeeds
jid=$(sbatch --parsable train.sh)
sbatch --dependency=afterok:"$jid" eval.sh
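The array example above leaves the body of sweep.sh open. One way it might map `$SLURM_ARRAY_TASK_ID` to a hyperparameter combination is sketched below; the 4x4 grid values, the function name sweep_params, and the train.py flags are all hypothetical.

```shell
#!/bin/bash
# sweep.sh sketch: grid values and train.py flags are hypothetical.
lrs=(1e-5 3e-5 1e-4 3e-4)
batch_sizes=(16 32 64 128)

# Print "<lr> <batch_size>" for an array index 0-15 (row-major over the 4x4 grid).
sweep_params() {
  local id="$1"
  local n="${#batch_sizes[@]}"
  echo "${lrs[$(( id / n ))]} ${batch_sizes[$(( id % n ))]}"
}

# Default to index 0 so the script can be dry-run outside a job array.
task_id="${SLURM_ARRAY_TASK_ID:-0}"
read -r lr bs <<< "$(sweep_params "$task_id")"
echo "array task $task_id: lr=$lr batch_size=$bs"
# Inside a real job, launch the run here, for example:
# python train.py --lr "$lr" --batch-size "$bs"
```

Keeping the mapping in a pure function makes it easy to check the grid locally before submitting 16 jobs.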
How to scale it
Multi-node scaling relies on MPI/PMIx launch plus rail-aware placement driven by topology.conf.
- MPI/PMIx: srun --mpi=pmix launches an MPI or torchrun program across all allocated nodes; the scheduler co-schedules the gang.
- Topology: choose the plugin in slurm.conf (TopologyPlugin=topology/tree for switch trees, or topology/block for block topology). Then topology.conf declares the fabric so jobs are packed onto the fewest switches:
# slurm.conf: TopologyPlugin=topology/tree
# topology.conf — leaf switches feed a spine (rail-aware packing)
SwitchName=leaf0 Nodes=gpu[001-016]
SwitchName=leaf1 Nodes=gpu[017-032]
SwitchName=spine Switches=leaf[0-1]
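To check where a job actually landed, the node list can be mapped back onto the leaf switches. The sketch below hard-codes the example layout above (gpu001-016 on leaf0, gpu017-032 on leaf1); the function names are hypothetical, and a real site would derive the mapping from its own topology.conf.

```shell
#!/bin/bash
# Assumes the example layout: 16 nodes per leaf, hostnames gpuNNN starting at gpu001.
leaf_for_node() {
  local n="${1#gpu}"                      # "gpu017" -> "017"
  echo "leaf$(( (10#$n - 1) / 16 ))"      # 10# forces base 10 despite leading zeros
}

# Count the distinct leaf switches spanned by the given hostnames.
count_leaves() {
  local h
  for h in "$@"; do leaf_for_node "$h"; done | sort -u | wc -l | tr -d ' '
}

# Inside a job, feed it the allocation:
#   count_leaves $(scontrol show hostnames "$SLURM_JOB_NODELIST")
count_leaves gpu001 gpu002 gpu017   # spans leaf0 and leaf1
```

To bound the switch count at submission time rather than checking afterwards, sbatch also accepts `--switches=<count>[@<max-time>]` when a topology plugin is configured.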
Inference
Slurm fits batch / offline inference well: a sweep of prompts or a dataset scored across GPUs as an array or a single multi-GPU job. Online, low-latency serving is less common under Slurm: it is a batch scheduler, not a service mesh with autoscaling/ingress. For online serving prefer Ray Serve / KServe / Dynamo on Kubernetes (inference serving, disaggregated inference); use Slurm for the offline scoring and eval jobs that feed them.
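For array-based offline scoring, each array task needs its own slice of the input. A minimal sketch of that split follows, assuming a hypothetical prompts.txt with one prompt per line; shard_range is an illustrative helper, not part of Slurm.

```shell
#!/bin/bash
# Print "<start> <end>" (1-based, inclusive line numbers) for one shard.
shard_range() {
  local total="$1" shards="$2" idx="$3"
  local per=$(( (total + shards - 1) / shards ))   # ceil(total / shards)
  local start=$(( idx * per + 1 ))
  local end=$(( start + per - 1 ))
  if (( end > total )); then end="$total"; fi
  echo "$start $end"
}

# Inside an array job (SLURM_ARRAY_TASK_COUNT and SLURM_ARRAY_TASK_ID are set by Slurm):
#   total=$(wc -l < prompts.txt)
#   read -r start end <<< "$(shard_range "$total" "$SLURM_ARRAY_TASK_COUNT" "$SLURM_ARRAY_TASK_ID")"
#   sed -n "${start},${end}p" prompts.txt > "shard_${SLURM_ARRAY_TASK_ID}.txt"
shard_range 1000 16 0
```

The last shard is simply shorter when the line count does not divide evenly, so no prompt is scored twice.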
Fine-tuning
Run distributed SFT/LoRA (SFT and LoRA) and RL post-training (fine-tuning and post-training, GRPO) as srun torchrun multi-node jobs, the same launch pattern as pretraining (distributed training, distributed-training recipes). RL libraries that use Ray as a controller (RL libraries) run on Slurm via the Ray-on-Slurm pattern (cookbook recipe 3): a job brings up a Ray head + workers, then runs the RL program over the allocation.
Optimised hardware
- GRES gpu: declare GPUs in gres.conf (AutoDetect=nvml for NVML detection); request with --gres=gpu:N or --gpus-per-node=N.
- GPU binding: --gpu-bind=closest binds each task to its NUMA/PCIe-closest GPU; --gpu-bind=map_gpu:0,1,... sets an explicit map. --gpus-per-task implicitly sets --gpu-bind=per_task (performance tuning).
- NCCL over IB: set NCCL_IB_HCA to the IB HCAs and NCCL_NET_GDR_LEVEL=SYS for GPUDirect RDMA; for multi-node IB ensure NCCL_IB_DISABLE=0. Confirm [GDRDMA] in the NCCL_DEBUG=INFO output. Turn PCIe ACS off for P2P/GDR (networking fabric).
- topology.conf: rail-aware placement keeps a job on the fewest leaf switches, cutting cross-spine hops for the all-reduce (provisioning and scheduling).
Cookbook (common use cases)
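The NCCL settings above might be collected in the batch script before the srun line, roughly as follows. The HCA prefix mlx5 is an assumption about the site's adapters; list the real devices on a compute node before pinning it.

```shell
#!/bin/bash
# Values are site-specific assumptions; adjust to the HCAs on your nodes.
export NCCL_IB_DISABLE=0                     # keep the InfiniBand transport enabled
export NCCL_IB_HCA="${NCCL_IB_HCA:-mlx5}"    # prefix match; assumes mlx5 HCAs
export NCCL_NET_GDR_LEVEL=SYS                # permit GPUDirect RDMA at any topological distance
export NCCL_DEBUG=INFO                       # log transport selection at startup

# Show what the launched tasks will inherit (srun exports the environment by default).
env | grep '^NCCL_' | sort
```

NCCL_DEBUG=INFO is verbose, so it is usually dropped once the transport has been confirmed.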
1. Multi-node torchrun sbatch (the canonical FSDP launch; see the job above for the full header):
srun torchrun --nnodes="$SLURM_NNODES" --nproc_per_node=8 \
--rdzv_backend=c10d --rdzv_endpoint="$MASTER_ADDR:23456" train.py --fsdp
2. GRES GPU request (interactive 2-GPU step, closest binding):
srun --partition=gpu --gres=gpu:2 --gpu-bind=closest --pty bash
nvidia-smi -L # confirm two visible devices
3. Ray-on-Slurm launch (bring up Ray over the allocation, then run the program; see Ray):
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8
#SBATCH --exclusive
head=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)
ip_head="$head:6379"
# Ray 2.49+: starts Ray on all nodes, runs the script on the head only
srun --nodes="$SLURM_NNODES" --ntasks="$SLURM_NNODES" \
ray symmetric-run --address "$ip_head" --min-nodes "$SLURM_NNODES" \
--num-gpus="$SLURM_GPUS_PER_NODE" -- python train_rl.py
Gotchas & failure modes
- Missing --exclusive: sharing a node with another job competes for NVLink/PCIe and skews timings; pretraining jobs should own the node.
- No topology.conf: the scheduler scatters a job across distant switches, adding cross-spine hops to every all-reduce (performance tuning).
- GRES not auto-detected: without AutoDetect=nvml in gres.conf, GPU counts can mismatch the hardware; jobs see the wrong device set.
- Wrong MASTER_ADDR: deriving it from $SLURM_JOB_NODELIST incorrectly hangs the c10d rendezvous; always resolve via scontrol show hostnames.
- NCCL on TCP: IB env unset or ACS on → fabric falls back to TCP; check NCCL_DEBUG=INFO for [GDRDMA] and IB transport (networking fabric).
- Online serving forced onto Slurm: no autoscaling/ingress; push latency-SLO serving to K8s (inference serving).
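The NCCL-on-TCP check can be scripted against a job's output file. The sketch below greps for transport markers that NCCL_DEBUG=INFO logs commonly contain (GDRDMA, NET/IB, NET/Socket); the exact log wording varies across NCCL versions, so treat the patterns and the nccl_transport helper as assumptions to verify against your own logs.

```shell
#!/bin/bash
# Classify the transport seen in an NCCL_DEBUG=INFO log file.
# Patterns are assumptions about NCCL log wording and may need adjusting.
nccl_transport() {
  local log="$1"
  if grep -q 'GDRDMA' "$log"; then
    echo "ib-gdr"        # InfiniBand with GPUDirect RDMA
  elif grep -q 'NET/IB' "$log"; then
    echo "ib"            # InfiniBand without GPUDirect RDMA
  elif grep -q 'NET/Socket' "$log"; then
    echo "socket"        # TCP fallback
  else
    echo "unknown"
  fi
}

# Usage: pass a job log, e.g. the default sbatch output file slurm-<jobid>.out
if [ -n "${1:-}" ]; then
  nccl_transport "$1"
fi
```

A result of socket on an IB cluster points back at the unset IB environment or ACS setting described in the list above.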
References
- Slurm documentation: https://slurm.schedmd.com/documentation.html
- GRES scheduling: https://slurm.schedmd.com/gres.html · srun: https://slurm.schedmd.com/srun.html
- Topology guide: https://slurm.schedmd.com/topology.html · topology.conf: https://slurm.schedmd.com/topology.conf.html
- Pyxis (containers): https://github.com/NVIDIA/pyxis · Enroot: https://github.com/NVIDIA/enroot
- Ray on Slurm: https://docs.ray.io/en/latest/cluster/vms/user-guides/community/slurm.html
- PyTorch FSDP: https://docs.pytorch.org/docs/stable/fsdp.html
Related: Provisioning · Distributed Training · Orchestration · Kubernetes · Ray · Training Recipes · Glossary