Ray on Slurm¶
Scope: Composing Ray with a Slurm allocation: starting a Ray head and workers inside one sbatch job, address/port wiring, GPU binding, clean teardown, and when this beats KubeRay or bare Slurm.
Reference templates on real APIs; pin versions and validate before production use. Not hardware-tested here.
What it is¶
A pattern that runs the Ray runtime (Ray) inside a single Slurm allocation (Slurm). One sbatch job reserves N nodes; the job script starts a Ray head on the first node and a Ray worker on each remaining node, all pointed at the head's address. Your Python entrypoint then runs against that ephemeral Ray cluster and Ray tears down when the job ends. Slurm owns gang scheduling, GRES (gpu) binding and topology placement (topology-aware placement); Ray owns task/actor scheduling, placement groups, and the Train/Serve/Data/RLlib libraries on top.
Two supported launch styles:
ray symmetric-run(Ray 2.49+) runs onesrunacross all tasks; Ray starts on every node and runs your entrypoint only on the head. Less boilerplate.- Manual
ray startuses explicit head/workersruninvocations with pinned ports. More verbose but works on older Ray and shared/multi-tenant partitions where you must avoid port collisions.
Why use it¶
- No second control plane. On an HPC cluster that already runs Slurm, you get a Ray cluster without standing up Kubernetes + KubeRay. Slurm remains the single point of quota, accounting and topology.
- Tightly-coupled allocation.
--exclusiveplus gang scheduling gives Ray a contiguous, topology-packed node set, so NCCL collectives in Ray Train ride the fabric instead of fighting for it (distributed training recipes). - Python-native on bare metal. RL post-training libraries (verl, slime, SkyRL, NeMo-RL) drive long-lived rollout and trainer actor groups via Ray; this pattern hosts that controller on an HPC partition without containers.
- Ephemeral and cheap to reason about. The cluster lives and dies with the job, with no leaked head node and no GCS to babysit between runs.
When to use it (and when not)¶
flowchart LR
Q["Need a Ray cluster?"] --> A{"Cluster manager already present?"}
A -->|"Kubernetes"| K["Use KubeRay (RayCluster / RayJob)"]
A -->|"Slurm only"| S{"Workload shape?"}
A -->|"None / raw VMs"| V["Ray cluster launcher on VMs"]
S -->|"Python actors, RL, Ray Train/Serve/Data"| R["Ray on Slurm (this page)"]
S -->|"Pure torchrun / MPI pretraining"| B["Bare Slurm + srun torchrun"]
- Use Ray on Slurm when the workload needs Ray's actor model, placement groups, or RL controller pattern, and the only scheduler on the cluster is Slurm.
- Prefer KubeRay when the cluster is Kubernetes-first (Kubernetes): you get declarative
RayCluster/RayJob, the in-tree autoscaler, andRayServiceHA, none of which the Slurm pattern provides. See orchestration overview and Slurm vs Kubernetes. - Prefer bare Slurm for tightly-coupled
torchrun/MPI pretraining with no distributed-Python coordination; Ray adds a runtime you do not need (Slurm for GPU Clusters). - Not for long-running services. A Slurm job is wall-clock bounded; an online inference SLO (SLO/SLI catalog) wants
RayServiceon Kubernetes, not ansbatchjob that the scheduler can preempt.
Architecture¶
One sbatch allocation holds the whole cluster. Node 0 runs ray start --head, which brings up the GCS (the cluster's metadata store) plus a raylet, object store, client server and dashboard on pinned ports. Every other node runs ray start --address and registers its raylet with the head's GCS. The entrypoint runs on the head, calls ray.init(address="auto"), and submits tasks, actors and placement groups that Ray schedules across the raylets. Slurm has already bound each task's GPUs via GRES and set CUDA_VISIBLE_DEVICES, so Ray's logical --num-gpus is just a bookkeeping mirror of what Slurm physically granted.
flowchart TB
SB["sbatch (one --exclusive allocation)"] --> ALLOC["Slurm allocation: N nodes, GRES gpu bound"]
ALLOC --> HEAD["node 0: ray start --head :6379 (GCS + raylet + dashboard)"]
ALLOC --> W1["node 1: ray start --address ip_head (raylet)"]
ALLOC --> WN["node N-1: ray start --address ip_head (raylet)"]
W1 -->|"register with GCS"| HEAD
WN -->|"register with GCS"| HEAD
ENTRY["train.py on head: ray.init(address='auto')"] --> HEAD
ENTRY --> PG["tasks / actors / placement groups"]
PG -->|"num_gpus=1 per task"| GPUS["GPUs (CUDA_VISIBLE_DEVICES set by Slurm)"]
SLURM["Slurm owns: gang scheduling, GRES binding, topology"] -.-> ALLOC
RAY["Ray owns: task/actor scheduling, placement groups, Train/Serve/Data/RLlib"] -.-> PG
Ports the head opens (pinned so concurrent jobs on a shared partition do not collide): GCS 6379, node manager 6700, object manager 6701, Ray client server 10001, and the worker-port range 10002-19999.
How to use it¶
Manual head and worker (portable)¶
The robust pattern: derive the head IP from the allocation, start the head on node 0, sleep, fan workers out one-per-node with srun -w, then run the entrypoint on the head only. Ports are pinned so concurrent jobs on a shared partition do not collide.
#!/bin/bash
#SBATCH --job-name=ray-job
#SBATCH --partition=blackwell
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8
#SBATCH --cpus-per-task=96
#SBATCH --exclusive
#SBATCH --time=02:00:00
set -euo pipefail
# --- discover nodes; node 0 is the head ---
nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
nodes_array=($nodes)
head_node=${nodes_array[0]}
# resolve the head's routable IP (use the fabric interface if multi-homed)
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
port=6379
ip_head="$head_node_ip:$port"
export ip_head
echo "Head at $ip_head"
# --- start the head on node 0 (pinned ports for multi-tenant safety) ---
srun --nodes=1 --ntasks=1 -w "$head_node" \
ray start --head --node-ip-address="$head_node_ip" \
--port=6379 \
--node-manager-port=6700 --object-manager-port=6701 \
--ray-client-server-port=10001 \
--min-worker-port=10002 --max-worker-port=19999 \
--num-cpus="${SLURM_CPUS_PER_TASK}" \
--num-gpus="${SLURM_GPUS_PER_NODE}" --block &
sleep 15 # let GCS come up before workers attach
# --- start one worker per remaining node ---
worker_num=$((SLURM_JOB_NUM_NODES - 1))
for ((i = 1; i <= worker_num; i++)); do
node_i=${nodes_array[$i]}
srun --nodes=1 --ntasks=1 -w "$node_i" \
ray start --address="$ip_head" \
--num-cpus="${SLURM_CPUS_PER_TASK}" \
--num-gpus="${SLURM_GPUS_PER_NODE}" --block &
sleep 5
done
# --- run the entrypoint on the head only; it connects to the running cluster ---
python -u train.py
# --- teardown: stop Ray on every node, then the job exits ---
srun --nodes="$SLURM_JOB_NUM_NODES" --ntasks="$SLURM_JOB_NUM_NODES" ray stop
The entrypoint attaches to the cluster Slurm just built and sees every GPU. This block is a reference template (needs ray + torch, not installed here, so not executed); the numpy checks below validate the core resource math it relies on.
# train.py
import ray
ray.init(address="auto") # connect to the head started by the sbatch script
print(ray.cluster_resources()) # {'GPU': 32.0, 'CPU': 384.0, ...} for 4x8
@ray.remote(num_gpus=1)
def probe() -> str:
import torch
return torch.cuda.get_device_name(0)
print(ray.get([probe.remote() for _ in range(int(ray.cluster_resources()["GPU"]))]))
The aggregate ray.cluster_resources() the head reports is exactly the sum over the allocation, and the head/worker split is deterministic from the nodelist. This numpy-only check reproduces that wiring and the {'GPU': 32.0, 'CPU': 384.0} figure, with edge and adversarial cases:
# Validate the head/worker wiring the sbatch script builds and the aggregate
# cluster resources train.py reads back via ray.cluster_resources().
from __future__ import annotations
import numpy as np
def wire_cluster(nodelist: list[str], gpus_per_node: int, cpus_per_task: int):
if not nodelist:
raise ValueError("empty allocation: Slurm granted no nodes")
head, workers = nodelist[0], nodelist[1:]
n = len(nodelist)
resources = {
"GPU": float(np.sum(np.full(n, gpus_per_node))),
"CPU": float(np.sum(np.full(n, cpus_per_task))),
}
return head, workers, f"{head}:6379", resources
# happy path: 4 nodes x 8 GPUs x 96 CPUs -> the {'GPU':32,'CPU':384} the page prints
nodes = [f"gpu-node-{i}" for i in range(4)]
head, workers, ip_head, res = wire_cluster(nodes, gpus_per_node=8, cpus_per_task=96)
assert head == "gpu-node-0"
assert workers == ["gpu-node-1", "gpu-node-2", "gpu-node-3"]
assert len(workers) == len(nodes) - 1 == 3 # worker_num = N - 1
assert res == {"GPU": 32.0, "CPU": 384.0} # matches the train.py comment
assert ip_head == "gpu-node-0:6379"
# edge: single-node allocation -> head only, zero workers, one node's GPUs
h1, w1, _, r1 = wire_cluster(["solo"], gpus_per_node=8, cpus_per_task=96)
assert w1 == [] and r1["GPU"] == 8.0
# adversarial: an empty nodelist must fail fast, not build a headless cluster
try:
wire_cluster([], 8, 96)
raise AssertionError("empty allocation should have raised")
except ValueError:
pass
print("A ok:", head, ip_head, res, "| workers:", len(workers))
The probe.remote() fan-out asks for one task per GPU (num_gpus=1), which fills the cluster in a single scheduling wave only when Ray's logical GPU count equals what Slurm granted. This check validates that invariant, including the --num-gpus=8 on a 4-GPU task failure, cross-checked against a slow device-id reference:
# Validate Ray's logical --num-gpus against Slurm's physically-granted GPUs, plus
# the one-task-per-GPU fan-out in train.py. Oversubscription = Ray believing in
# GPUs Slurm never granted (the --num-gpus=8 on a 4-GPU task failure).
from __future__ import annotations
import numpy as np
def concurrent_tasks(logical_gpus: int, gpus_per_task: int) -> int:
return logical_gpus // gpus_per_task # tasks Ray runs at once
def oversubscribed_devices(logical_gpus: int, physical_gpus: int, gpus_per_task: int = 1) -> int:
used = concurrent_tasks(logical_gpus, gpus_per_task) * gpus_per_task
return max(0, used - physical_gpus) # tasks landing on absent devices
# happy path: logical == physical, 1 GPU/task -> 32 probe.remote() fill 4x8 exactly
assert concurrent_tasks(32, 1) == 32
assert oversubscribed_devices(32, 32, 1) == 0
assert oversubscribed_devices(8, 8, 1) == 0 # per-node view, 8 == 8
# adversarial: --num-gpus=8 declared but Slurm granted only 4 -> 4 tasks hit
# devices that do not exist (CUDA invalid device / OOM)
assert oversubscribed_devices(8, 4, 1) == 4
# boundary: 2-GPU actors on an 8-GPU node -> 4 concurrent, none oversubscribed
assert concurrent_tasks(8, 2) == 4
assert oversubscribed_devices(8, 8, 2) == 0
# equivalence to a slow, independent reference: hand out logical device ids and
# count those outside the physical set {0..physical-1}
def oversub_slow(logical_gpus: int, physical_gpus: int, gpus_per_task: int = 1) -> int:
running = logical_gpus // gpus_per_task
handed_out = np.arange(running * gpus_per_task) # logical device ids Ray assigns
physical = set(range(physical_gpus))
return int(sum(int(d) not in physical for d in handed_out))
for lg, pg, g in [(32, 32, 1), (8, 4, 1), (8, 8, 2), (16, 8, 1), (9, 9, 3), (7, 4, 1)]:
assert oversubscribed_devices(lg, pg, g) == oversub_slow(lg, pg, g), (lg, pg, g)
print("B ok: oversub(logical=8, physical=4) =", oversubscribed_devices(8, 4, 1))
Symmetric run (Ray 2.49+, less boilerplate)¶
One srun over all tasks; Ray boots on every node and runs the entrypoint only on the head. The cluster stops automatically when the script finishes.
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8
#SBATCH --cpus-per-task=96
#SBATCH --exclusive
set -euo pipefail
nodes_array=($(scontrol show hostnames "$SLURM_JOB_NODELIST"))
ip_head="${nodes_array[0]}:6379"
srun --nodes="$SLURM_JOB_NUM_NODES" --ntasks="$SLURM_JOB_NUM_NODES" \
ray symmetric-run \
--address "$ip_head" \
--min-nodes "$SLURM_JOB_NUM_NODES" \
--num-cpus="${SLURM_CPUS_PER_TASK}" \
--num-gpus="${SLURM_GPUS_PER_NODE}" \
-- python -u train.py
How to integrate it¶
- GPU binding. Slurm's
--gpus-per-nodesetsCUDA_VISIBLE_DEVICESper task; pass the same count to--num-gpusso Ray's logical resource matches what Slurm granted. Mismatch (e.g.--num-gpus=8on a 4-GPU task) lets Ray over-schedule and CUDA OOM/invalid devicefollows (reproduced in the check above). - Fabric / NCCL. Ray Train collectives run over NCCL; on IB/RoCE set
NCCL_IB_HCA,NCCL_NET_GDR_LEVEL=SYSand confirm[GDRDMA]inNCCL_DEBUG=INFO. Without RDMA, NCCL falls back to TCP and throughput craters; validate withnccl-testsfirst (fabric bring-up and benchmarking, workload bring-up recipes). - Topology. Let Slurm pack the allocation (
topology.conf,--exclusive) so co-scheduled ranks are switch-local; then use Ray placement groups (STRICT_PACKfor an in-node NVLink collective,STRICT_SPREADfor one-rank-per-node) on top (topology placement). - Health gating. Wrap submission so unhealthy nodes never enter the allocation (GPU health gating, smoke tests).
How to run it in production¶
- Pin ports on shared partitions. Two Ray jobs on overlapping nodes with default ports collide; give each job non-overlapping
--min-worker-port/--max-worker-portand dashboard/object ports. The overlap test below is exactly the check a submit wrapper should run before co-scheduling two jobs:
# Validate multi-tenant port pinning: two Ray jobs sharing nodes need disjoint
# worker-port ranges or their raylets collide. Inclusive integer ranges.
from __future__ import annotations
import numpy as np
def overlaps(a: tuple[int, int], b: tuple[int, int]) -> bool:
(lo_a, hi_a), (lo_b, hi_b) = a, b
assert lo_a <= hi_a and lo_b <= hi_b, "each range must be low <= high"
return max(lo_a, lo_b) <= min(hi_a, hi_b) # inclusive interval overlap
job1 = (10002, 19999) # the page's pinned worker range
# disjoint neighbour -> safe to co-schedule on shared nodes
assert overlaps(job1, (20000, 29999)) is False
# adversarial: both jobs left ports at the default range -> guaranteed collision
assert overlaps(job1, (10002, 19999)) is True
# boundary: ranges that merely touch at one endpoint still share that port
assert overlaps(job1, (19999, 29999)) is True
assert overlaps((10002, 19998), (19999, 29999)) is False # a 1-port gap is safe
# equivalence to a slow set-intersection reference over random ranges
def overlaps_slow(a: tuple[int, int], b: tuple[int, int]) -> bool:
return len(set(range(a[0], a[1] + 1)) & set(range(b[0], b[1] + 1))) > 0
rng = np.random.default_rng(0)
for _ in range(2000):
lo_a, lo_b = (int(x) for x in rng.integers(6000, 20000, size=2))
a = (lo_a, lo_a + int(rng.integers(0, 500)))
b = (lo_b, lo_b + int(rng.integers(0, 500)))
assert overlaps(a, b) == overlaps_slow(a, b), (a, b)
print("C ok: default-vs-default collide =", overlaps(job1, job1))
- Observe. Scrape the Ray dashboard/Prometheus metrics endpoint and the node exporters for the allocation duration (telemetry and alerting).
- Plan for head loss. The head is a single point of failure for the job's lifetime, and the allocation is wall-clock bounded and preemptible. That is acceptable for batch training; for HA serving move to Kubernetes
RayService, not this pattern.
How to maintain it¶
- The
sleepis load-bearing. Workers thatray start --addressbefore GCS is up fail to register. Keep the head sleep (10-20 s) and a short per-worker sleep; on large jobs prefer pollingray statusuntil--min-nodesare present over a fixed sleep. - Always
ray stopon teardown. The trailingsrun ... ray stop(orsymmetric-run's auto-stop) prevents orphaned Ray processes holding GPUs into the next allocation on shared nodes.
How to scale it¶
- Add nodes, not control planes. Bump
--nodes; the worker fan-out loop andray.cluster_resources()grow linearly (GPU total = nodes x--gpus-per-node, as the wiring check asserts). There is still no second control plane to stand up. - Retire the fixed sleep at size. On large N, GCS registration is staggered and a constant
sleepeither wastes time or races; pollray status/ray.nodes()until--min-nodesraylets are live, which is exactly whatsymmetric-run --min-nodesdoes for you. - Scale scheduling with placement groups. Size bundles to the packed topology:
STRICT_PACKto keep a collective inside one node's NVLink domain,STRICT_SPREADto place one rank per node (topology placement). - RL controllers at scale. verl, slime, SkyRL and NeMo-RL run long-lived rollout and trainer actor groups; scaling them means adding worker nodes to the same allocation, keeping the
--exclusive, topology-packed node set that lets NCCL ride the fabric (distributed training recipes).
Failure modes¶
| Symptom | Cause | Fix |
|---|---|---|
CUDA invalid device / OOM on some ranks |
Ray --num-gpus exceeds Slurm's granted GPUs, so Ray over-schedules onto devices that do not exist (see Block B) |
Set --num-gpus == --gpus-per-node; never inflate the logical count |
Workers never join; ray status shows the head only |
A worker ran ray start --address before the head's GCS was up |
Keep the head sleep (10-20 s); on large jobs poll ray status until --min-nodes raylets register |
| Raylet bind / port errors when two jobs share nodes | Both jobs used default or overlapping worker-port ranges (see Block C) | Give each job disjoint --min-worker-port/--max-worker-port and distinct GCS/object/dashboard ports |
| GPUs still busy at the start of the next allocation | Orphaned Ray processes from a job that exited without ray stop |
Always run the trailing srun ... ray stop; symmetric-run auto-stops |
| NCCL throughput craters in Ray Train | RDMA not engaged; NCCL fell back to TCP | Set NCCL_IB_HCA, NCCL_NET_GDR_LEVEL=SYS; confirm [GDRDMA] under NCCL_DEBUG=INFO; validate with nccl-tests |
| The whole job dies when the head node fails | The head is a single point of failure for the job's lifetime, and Slurm can preempt the wall-clock-bounded allocation | Accept it for batch; for HA online serving move to Kubernetes RayService, not this pattern |
References¶
- Ray on Slurm (deploying Ray with Slurm): https://docs.ray.io/en/latest/cluster/vms/user-guides/community/slurm.html
ray startCLI reference: https://docs.ray.io/en/latest/cluster/cli.html#ray-start- Ray cluster launcher / community integrations: https://docs.ray.io/en/latest/cluster/vms/user-guides/community/index.html
- KubeRay (RayCluster / RayJob / RayService): https://docs.ray.io/en/latest/cluster/kubernetes/index.html · repo https://github.com/ray-project/kuberay
- Placement groups: https://docs.ray.io/en/latest/ray-core/scheduling/placement-group.html
- Slurm
sbatch/srun/ GRES: https://slurm.schedmd.com/sbatch.html · https://slurm.schedmd.com/srun.html · https://slurm.schedmd.com/gres.html - NCCL environment variables: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
Related: Ray · Slurm · Slurm vs Kubernetes · Kubernetes · Orchestration · Topology placement · Distributed training recipes · Glossary