Slurm for GPU clusters

Scope: Slurm as the HPC batch workload manager for GPU clusters, covering partitions, gang scheduling, GRES, topology-aware placement, and multi-node training launch.

The examples are reference templates built on real APIs; pin versions and validate them before production use.

What it is

Slurm (slurm.schedmd.com) is the dominant open-source HPC workload manager: a batch scheduler that allocates nodes from partitions (named node pools) to jobs, gang-schedules tightly-coupled multi-node work, tracks GPUs as GRES (generic resources, type gpu), and places jobs topology-aware against the network fabric. Jobs are submitted as batch scripts (sbatch) or run as interactive steps (srun), with native MPI/PMIx launch. It is the default for bare-metal pretraining (provisioning and scheduling, distributed training). As of mid-2026, deployments commonly run the 25.x line (six-month release cadence); verify the target version on slurm.schedmd.com.

Why use it

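Tightly-coupled training: gang scheduling, in the all-or-nothing sense, places all ranks of a multi-node job at once or not at all, which is what synchronous data/model-parallel training needs. Slurm's own documentation uses "gang scheduling" for time-slicing jobs on shared resources (PreemptMode=GANG), which is a different feature.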
Bare metal: no container/pod overhead by default; processes run directly on the node (containers via Pyxis/Enroot when wanted).
Topology-aware: topology.conf maps switches/blocks so the scheduler packs a job onto the fewest, closest switches (rail-aware), minimising cross-spine NCCL traffic (networking fabric, performance tuning).
MPI-native: first-class PMIx/MPI launch for HPC codes; torchrun jobs are started with plain srun and use their own rendezvous.

When to use it (and when not)

Compared with Kubernetes (see the orchestration overview):

Use Slurm for tightly-coupled, multi-node, topology-sensitive training on bare metal, and for batch/offline GPU jobs (sweeps, batch inference).
Prefer Kubernetes for multi-tenant platforms, long-running online services, declarative GitOps, and container-first operations.
Many sites run both, partitioning nodes between a Slurm pool and a K8s pool; Ray can run on either.

Architecture

flowchart TB
  USER["User: sbatch / srun"] --> CTLD["slurmctld (controller)"]
  CTLD --> DBD["slurmdbd (accounting)"]
  CTLD --> D1["slurmd (compute node, GRES gpu)"]
  CTLD --> D2["slurmd (compute node, GRES gpu)"]
  D1 --- D2
  TOPO["topology.conf"] -.->|"switch / block map"| CTLD

How to use it

A batch job script declares resources with #SBATCH and launches ranks with srun. A multi-node, 8-GPU-per-node torchrun job:

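#!/bin/bash
#SBATCH --job-name=train
#SBATCH --partition=blackwell
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=1      # one torchrun launcher per node
#SBATCH --gpus-per-node=8
#SBATCH --cpus-per-task=96       # size to the node's core count
#SBATCH --exclusive

# The first hostname in the allocation hosts the c10d rendezvous.
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)
export MASTER_PORT=23456

# Since Slurm 22.05 srun does not inherit --cpus-per-task from sbatch, so pass it again.
srun --cpus-per-task="$SLURM_CPUS_PER_TASK" torchrun \
  --nnodes="$SLURM_NNODES" \
  --nproc_per_node=8 \
  --rdzv_id="$SLURM_JOB_ID" \
  --rdzv_backend=c10d \
  --rdzv_endpoint="$MASTER_ADDR:$MASTER_PORT" \
  train.py --fsdp --bf16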
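
A fixed MASTER_PORT is safe only while the job owns its nodes. The following is a minimal sketch, assuming ports 20000-29999 are free on the cluster, that derives the port from the job ID and computes the global world size. rdzv_port and world_size are hypothetical helper names, not Slurm or PyTorch commands.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Slurm or PyTorch).

# Map a job ID to a port in 20000-29999 (range assumed free on the cluster).
rdzv_port() {
  echo $(( 20000 + ${1:-0} % 10000 ))
}

# Global number of ranks: nodes x GPUs per node.
world_size() {
  echo $(( $1 * $2 ))
}

# Inside a job SLURM_JOB_ID and SLURM_NNODES are set; the fallbacks
# only keep the snippet runnable outside Slurm.
MASTER_PORT=$(rdzv_port "${SLURM_JOB_ID:-0}")
export MASTER_PORT
echo "port=$MASTER_PORT world_size=$(world_size "${SLURM_NNODES:-1}" 8)"
```

Two jobs whose IDs differ by a multiple of 10000 would still map to the same port, so this reduces collisions on shared nodes rather than ruling them out.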

How to develop with it

Iterate with srun for interactive steps, structure sweeps as job arrays, and stage pipelines with dependencies. Request GPUs explicitly with --gres=gpu:N (per node) or --gpus=N (total for the job).

# interactive single-GPU shell on the gpu partition
srun --partition=gpu --gres=gpu:1 --pty bash

# job array: 16 independent runs, indices 0-15, each on 1 GPU
sbatch --array=0-15 --gres=gpu:1 sweep.sh   # use $SLURM_ARRAY_TASK_ID inside

# dependency: run eval only after the training job succeeds
jid=$(sbatch --parsable train.sh)
sbatch --dependency=afterok:"$jid" eval.sh
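
Inside sweep.sh, the array index has to be turned into concrete hyperparameters. One possible sketch, assuming a 4 x 4 grid of learning rates and seeds; the grid values, the sweep_config helper, and the train.py flags are illustrative assumptions, not something Slurm provides.

```shell
#!/bin/sh
# sweep.sh (sketch): map an array index 0-15 onto a 4 x 4 grid.
# Learning rates, seeds, and the train.py flags are illustrative assumptions.

sweep_config() {
  id=$1
  lr_index=$(( id / 4 ))   # 0-3: which learning rate
  seed=$(( id % 4 ))       # 0-3: which seed
  set -- 1e-4 3e-4 1e-3 3e-3
  shift "$lr_index"        # after the shift, $1 is the chosen learning rate
  echo "--lr $1 --seed $seed"
}

# SLURM_ARRAY_TASK_ID is set by sbatch --array; default to 0 for a dry run.
args=$(sweep_config "${SLURM_ARRAY_TASK_ID:-0}")
echo "would run: python train.py $args"
```

The helper only covers indices 0-15, matching --array=0-15; a larger array needs a larger grid.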

How to scale it

Multi-node scaling relies on MPI/PMIx launch and on topology.conf for rail-aware placement.

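MPI/PMIx: srun --mpi=pmix launches an MPI program across all allocated nodes, one task per rank, and the scheduler co-schedules the gang. torchrun jobs do not need --mpi=pmix; they rendezvous through c10d as in the job script above.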
Topology: choose the plugin in slurm.conf (TopologyPlugin=topology/tree for switch trees, or topology/block for block topology). Then topology.conf declares the fabric so jobs are packed onto the fewest switches:

# slurm.conf:  TopologyPlugin=topology/tree
# topology.conf — leaf switches feed a spine (rail-aware packing)
SwitchName=leaf0 Nodes=gpu[001-016]
SwitchName=leaf1 Nodes=gpu[017-032]
SwitchName=spine Switches=leaf[0-1]

# shell: one MPI rank per GPU across both leaf switches (32 nodes x 8 tasks)
srun --mpi=pmix --nodes=32 --ntasks-per-node=8 --gres=gpu:8 ./train_mpi
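
Hand-written topology.conf files drift as racks are added. The sketch below generates the same leaf/spine layout as the example, assuming nodes are named with a prefix plus a three-digit index and are cabled to leaf switches in consecutive, equal-sized groups. gen_topology is a hypothetical helper, and its output still has to be checked against the real cabling.

```shell
#!/bin/sh
# gen_topology PREFIX TOTAL_NODES NODES_PER_LEAF
# Hypothetical helper: prints topology.conf lines for a two-level tree.
# Assumes node names like gpu001 and consecutive nodes sharing a leaf switch.
gen_topology() {
  prefix=$1; total=$2; per_leaf=$3
  leaf=0
  start=1
  while [ "$start" -le "$total" ]; do
    end=$(( start + per_leaf - 1 ))
    if [ "$end" -gt "$total" ]; then end=$total; fi
    printf 'SwitchName=leaf%d Nodes=%s[%03d-%03d]\n' "$leaf" "$prefix" "$start" "$end"
    leaf=$(( leaf + 1 ))
    start=$(( end + 1 ))
  done
  # A single spine joins every leaf generated above.
  printf 'SwitchName=spine Switches=leaf[0-%d]\n' "$(( leaf - 1 ))"
}

gen_topology gpu 32 16
```

With the arguments shown, the output reproduces the three lines of the topology.conf example above.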

Inference

Slurm fits batch / offline inference well: a sweep of prompts or a dataset scored across GPUs as an array or a single multi-GPU job. Online, low-latency serving is less common under Slurm: it is a batch scheduler, not a service mesh with autoscaling/ingress. For online serving prefer Ray Serve / KServe / Dynamo on Kubernetes (inference serving, disaggregated inference); use Slurm for the offline scoring and eval jobs that feed them.
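
For array-based offline scoring, each array task needs its own slice of the input. A minimal sketch of one way to split a line-oriented dataset by array index; shard_range is a hypothetical helper, and prompts.jsonl and score.py are placeholder names.

```shell
#!/bin/sh
# shard_range TOTAL_LINES NUM_SHARDS INDEX -> "first last" (1-based, inclusive)
# Hypothetical helper; prompts.jsonl and score.py below are placeholder names.
shard_range() {
  total=$1; shards=$2; index=$3
  per=$(( (total + shards - 1) / shards ))   # ceiling division
  first=$(( index * per + 1 ))
  last=$(( first + per - 1 ))
  if [ "$last" -gt "$total" ]; then last=$total; fi
  echo "$first $last"
}

# Dry run: the last of 16 array tasks over a 1000-line file.
range=$(shard_range 1000 16 "${SLURM_ARRAY_TASK_ID:-15}")
first=${range% *}
last=${range#* }
echo "task scores lines $first-$last"
# In the real job: sed -n "${first},${last}p" prompts.jsonl | python score.py
```

Submitted with sbatch --array=0-15, every task reads a disjoint range and the ranges together cover the whole file.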

Fine-tuning

Run distributed SFT/LoRA (SFT and LoRA) and RL post-training (fine-tuning and post-training, GRPO) as srun torchrun multi-node jobs, the same launch pattern as pretraining (distributed training, distributed-training recipes). RL libraries that use Ray as a controller (RL libraries) run on Slurm via the Ray-on-Slurm pattern (cookbook recipe 3): a job brings up a Ray head + workers, then runs the RL program over the allocation.

Optimised hardware

GRES gpu: declare GPUs in gres.conf (AutoDetect=nvml for NVML detection); request with --gres=gpu:N or --gpus-per-node=N.
GPU binding: --gpu-bind=closest binds each task to its NUMA/PCIe-closest GPU; --gpu-bind=map_gpu:0,1,... for explicit maps. --gpus-per-task implicitly sets --gpu-bind=per_task (performance tuning).
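NCCL over IB: set NCCL_IB_HCA to the IB HCAs to use; NCCL_NET_GDR_LEVEL=SYS allows GPUDirect RDMA at any GPU-to-NIC distance (the most permissive level); NCCL_IB_DISABLE must be 0 (the default) for multi-node IB. Confirm that the NCCL_DEBUG=INFO log reports GDRDMA on the IB transport. Disable PCIe ACS on the GPU/NIC path, because ACS redirects peer-to-peer traffic through the root complex and slows P2P and GPUDirect RDMA (networking fabric).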
topology.conf: rail-aware placement keeps a job on the fewest leaf switches, cutting cross-spine hops for the all-reduce (provisioning and scheduling).
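
The NCCL settings above can be collected in the job script, and the resulting log checked automatically. The following is a sketch under stated assumptions: the HCA prefix mlx5 depends on the site's hardware, nccl_transport is a hypothetical helper, and the grep patterns assume channel lines of the form "via NET/IB/0/GDRDMA" or "via NET/Socket/0", which may differ between NCCL versions.

```shell
#!/bin/sh
# Environment for NCCL over InfiniBand. The HCA prefix mlx5 is an
# assumption; list the real devices with ibv_devinfo on a compute node.
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=0
export NCCL_IB_HCA=mlx5
export NCCL_NET_GDR_LEVEL=SYS

# Hypothetical helper: classify the transport seen in an NCCL_DEBUG=INFO log.
# Assumes channel lines such as "via NET/IB/0/GDRDMA" or "via NET/Socket/0".
nccl_transport() {
  if grep -q 'NET/IB.*GDRDMA' "$1"; then
    echo ib-gdrdma
  elif grep -q 'NET/IB' "$1"; then
    echo ib
  elif grep -q 'NET/Socket' "$1"; then
    echo socket
  else
    echo unknown
  fi
}
```

After a short test job, running nccl_transport on the job's output file (slurm-JOBID.out by default) should print ib-gdrdma; socket indicates the TCP fallback.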

Cookbook (common use cases)

1. Multi-node torchrun sbatch (the canonical FSDP launch; see the job above for the full header):

srun torchrun --nnodes="$SLURM_NNODES" --nproc_per_node=8 \
  --rdzv_backend=c10d --rdzv_endpoint="$MASTER_ADDR:23456" train.py --fsdp

2. GRES GPU request (interactive 2-GPU step, closest binding):

srun --partition=gpu --gres=gpu:2 --gpu-bind=closest --pty bash
nvidia-smi -L   # confirm two visible devices

3. Ray-on-Slurm launch (bring up Ray over the allocation, then run the program; see Ray):

#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8
#SBATCH --exclusive

head=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)
ip_head="$head:6379"
# Ray 2.49+: starts Ray on all nodes, runs the script on the head only
srun --nodes="$SLURM_NNODES" --ntasks="$SLURM_NNODES" \
  ray symmetric-run --address "$ip_head" --min-nodes "$SLURM_NNODES" \
  --num-gpus="$SLURM_GPUS_PER_NODE" -- python train_rl.py

Gotchas & failure modes

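Missing --exclusive: a job that shares a node competes with its neighbours for NVLink/PCIe bandwidth, which skews timings; pretraining jobs should own the node.
No topology.conf: the scheduler has no fabric map and can scatter a job across distant switches, adding cross-spine hops to every all-reduce (performance tuning).
GRES not auto-detected: without AutoDetect=nvml in gres.conf, GPUs must be listed by hand, and a stale entry makes the configured count mismatch the hardware, so jobs see the wrong device set.
Wrong MASTER_ADDR: $SLURM_JOB_NODELIST is a compressed host list (for example gpu[001-008]), not a hostname, so using it directly hangs the c10d rendezvous; always resolve the first host via scontrol show hostnames.
NCCL on TCP: if NCCL cannot use the IB HCAs (for example a wrong NCCL_IB_HCA or NCCL_IB_DISABLE=1), it falls back to TCP sockets; ACS left on does not trigger the fallback but degrades P2P and GPUDirect RDMA. Check the NCCL_DEBUG=INFO log for IB transport with GDRDMA (networking fabric).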
Online serving forced onto Slurm: no autoscaling/ingress; push latency-SLO serving to K8s (inference serving).
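
The MASTER_ADDR gotcha is cheap to catch before launch. A minimal sketch of a pre-flight check that rejects anything that looks like a compressed host list; is_single_host is a hypothetical helper, not a Slurm command.

```shell
#!/bin/sh
# Hypothetical pre-flight check: succeed only if the argument looks like a
# single hostname rather than a compressed Slurm host list.
is_single_host() {
  case $1 in
    ''|*'['*|*,*|*' '*) return 1 ;;   # empty, bracket range, comma list, or spaces
    *) return 0 ;;
  esac
}

# Demonstration on a resolved hostname and on a raw host list.
for candidate in gpu001 'gpu[001-008]'; do
  if is_single_host "$candidate"; then
    echo "$candidate: ok"
  else
    echo "$candidate: not a single hostname"
  fi
done
```

In a job script, the same check can guard the launch: is_single_host "$MASTER_ADDR" || exit 1, placed just before the srun line.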

References

Slurm documentation: https://slurm.schedmd.com/documentation.html
GRES scheduling: https://slurm.schedmd.com/gres.html · srun: https://slurm.schedmd.com/srun.html
Topology guide: https://slurm.schedmd.com/topology.html · topology.conf: https://slurm.schedmd.com/topology.conf.html
Pyxis (containers): https://github.com/NVIDIA/pyxis · Enroot: https://github.com/NVIDIA/enroot
Ray on Slurm: https://docs.ray.io/en/latest/cluster/vms/user-guides/community/slurm.html
PyTorch FSDP: https://docs.pytorch.org/docs/stable/fsdp.html