Skip to content
Markdown

Ray on Slurm

Scope: Composing Ray with a Slurm allocation: starting a Ray head and workers inside one sbatch job, address/port wiring, GPU binding, clean teardown, and when this beats KubeRay or bare Slurm.

Reference templates on real APIs; pin versions and validate before production use. Not hardware-tested here.

What it is

A pattern that runs the Ray runtime (Ray) inside a single Slurm allocation (Slurm). One sbatch job reserves N nodes; the job script starts a Ray head on the first node and a Ray worker on each remaining node, all pointed at the head's address. Your Python entrypoint then runs against that ephemeral Ray cluster and Ray tears down when the job ends. Slurm owns gang scheduling, GRES (gpu) binding and topology placement (topology-aware placement); Ray owns task/actor scheduling, placement groups, and the Train/Serve/Data/RLlib libraries on top.

Two supported launch styles:

  • ray symmetric-run (Ray 2.49+) runs one srun across all tasks; Ray starts on every node and runs your entrypoint only on the head. Less boilerplate.
  • Manual ray start uses explicit head/worker srun invocations with pinned ports. More verbose but works on older Ray and shared/multi-tenant partitions where you must avoid port collisions.

Why use it

  • No second control plane. On an HPC cluster that already runs Slurm, you get a Ray cluster without standing up Kubernetes + KubeRay. Slurm remains the single point of quota, accounting and topology.
  • Tightly-coupled allocation. --exclusive plus gang scheduling gives Ray a contiguous, topology-packed node set, so NCCL collectives in Ray Train ride the fabric instead of fighting for it (distributed training recipes).
  • Python-native on bare metal. RL post-training libraries (verl, slime, SkyRL, NeMo-RL) drive long-lived rollout and trainer actor groups via Ray; this pattern hosts that controller on an HPC partition without containers.
  • Ephemeral and cheap to reason about. The cluster lives and dies with the job, with no leaked head node and no GCS to babysit between runs.

When to use it (and when not)

flowchart LR
  Q["Need a Ray cluster?"] --> A{"Cluster manager already present?"}
  A -->|"Kubernetes"| K["Use KubeRay (RayCluster / RayJob)"]
  A -->|"Slurm only"| S{"Workload shape?"}
  A -->|"None / raw VMs"| V["Ray cluster launcher on VMs"]
  S -->|"Python actors, RL, Ray Train/Serve/Data"| R["Ray on Slurm (this page)"]
  S -->|"Pure torchrun / MPI pretraining"| B["Bare Slurm + srun torchrun"]
  • Use Ray on Slurm when the workload needs Ray's actor model, placement groups, or RL controller pattern, and the only scheduler on the cluster is Slurm.
  • Prefer KubeRay when the cluster is Kubernetes-first (Kubernetes): you get declarative RayCluster/RayJob, the in-tree autoscaler, and RayService HA, none of which the Slurm pattern provides. See orchestration overview and Slurm vs Kubernetes.
  • Prefer bare Slurm for tightly-coupled torchrun/MPI pretraining with no distributed-Python coordination; Ray adds a runtime you do not need (Slurm for GPU Clusters).
  • Not for long-running services. A Slurm job is wall-clock bounded; an online inference SLO (SLO/SLI catalog) wants RayService on Kubernetes, not an sbatch job that the scheduler can preempt.

Architecture

One sbatch allocation holds the whole cluster. Node 0 runs ray start --head, which brings up the GCS (the cluster's metadata store) plus a raylet, object store, client server and dashboard on pinned ports. Every other node runs ray start --address and registers its raylet with the head's GCS. The entrypoint runs on the head, calls ray.init(address="auto"), and submits tasks, actors and placement groups that Ray schedules across the raylets. Slurm has already bound each task's GPUs via GRES and set CUDA_VISIBLE_DEVICES, so Ray's logical --num-gpus is just a bookkeeping mirror of what Slurm physically granted.

flowchart TB
  SB["sbatch (one --exclusive allocation)"] --> ALLOC["Slurm allocation: N nodes, GRES gpu bound"]
  ALLOC --> HEAD["node 0: ray start --head :6379 (GCS + raylet + dashboard)"]
  ALLOC --> W1["node 1: ray start --address ip_head (raylet)"]
  ALLOC --> WN["node N-1: ray start --address ip_head (raylet)"]
  W1 -->|"register with GCS"| HEAD
  WN -->|"register with GCS"| HEAD
  ENTRY["train.py on head: ray.init(address='auto')"] --> HEAD
  ENTRY --> PG["tasks / actors / placement groups"]
  PG -->|"num_gpus=1 per task"| GPUS["GPUs (CUDA_VISIBLE_DEVICES set by Slurm)"]
  SLURM["Slurm owns: gang scheduling, GRES binding, topology"] -.-> ALLOC
  RAY["Ray owns: task/actor scheduling, placement groups, Train/Serve/Data/RLlib"] -.-> PG

Ports the head opens (pinned so concurrent jobs on a shared partition do not collide): GCS 6379, node manager 6700, object manager 6701, Ray client server 10001, and the worker-port range 10002-19999.

How to use it

Manual head and worker (portable)

The robust pattern: derive the head IP from the allocation, start the head on node 0, sleep, fan workers out one-per-node with srun -w, then run the entrypoint on the head only. Ports are pinned so concurrent jobs on a shared partition do not collide.

#!/bin/bash
#SBATCH --job-name=ray-job
#SBATCH --partition=blackwell
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8
#SBATCH --cpus-per-task=96
#SBATCH --exclusive
#SBATCH --time=02:00:00

set -euo pipefail

# --- discover nodes; node 0 is the head ---
nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
nodes_array=($nodes)
head_node=${nodes_array[0]}
# resolve the head's routable IP (use the fabric interface if multi-homed)
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
port=6379
ip_head="$head_node_ip:$port"
export ip_head
echo "Head at $ip_head"

# --- start the head on node 0 (pinned ports for multi-tenant safety) ---
srun --nodes=1 --ntasks=1 -w "$head_node" \
  ray start --head --node-ip-address="$head_node_ip" \
    --port=6379 \
    --node-manager-port=6700 --object-manager-port=6701 \
    --ray-client-server-port=10001 \
    --min-worker-port=10002 --max-worker-port=19999 \
    --num-cpus="${SLURM_CPUS_PER_TASK}" \
    --num-gpus="${SLURM_GPUS_PER_NODE}" --block &
sleep 15  # let GCS come up before workers attach

# --- start one worker per remaining node ---
worker_num=$((SLURM_JOB_NUM_NODES - 1))
for ((i = 1; i <= worker_num; i++)); do
  node_i=${nodes_array[$i]}
  srun --nodes=1 --ntasks=1 -w "$node_i" \
    ray start --address="$ip_head" \
      --num-cpus="${SLURM_CPUS_PER_TASK}" \
      --num-gpus="${SLURM_GPUS_PER_NODE}" --block &
  sleep 5
done

# --- run the entrypoint on the head only; it connects to the running cluster ---
python -u train.py

# --- teardown: stop Ray on every node, then the job exits ---
srun --nodes="$SLURM_JOB_NUM_NODES" --ntasks="$SLURM_JOB_NUM_NODES" ray stop

The entrypoint attaches to the cluster Slurm just built and sees every GPU. This block is a reference template (needs ray + torch, not installed here, so not executed); the numpy checks below validate the core resource math it relies on.

# train.py
import ray

ray.init(address="auto")  # connect to the head started by the sbatch script
print(ray.cluster_resources())  # {'GPU': 32.0, 'CPU': 384.0, ...} for 4x8

@ray.remote(num_gpus=1)
def probe() -> str:
    import torch
    return torch.cuda.get_device_name(0)

print(ray.get([probe.remote() for _ in range(int(ray.cluster_resources()["GPU"]))]))

The aggregate ray.cluster_resources() the head reports is exactly the sum over the allocation, and the head/worker split is deterministic from the nodelist. This numpy-only check reproduces that wiring and the {'GPU': 32.0, 'CPU': 384.0} figure, with edge and adversarial cases:

# Validate the head/worker wiring the sbatch script builds and the aggregate
# cluster resources train.py reads back via ray.cluster_resources().
from __future__ import annotations
import numpy as np


def wire_cluster(nodelist: list[str], gpus_per_node: int, cpus_per_task: int):
    if not nodelist:
        raise ValueError("empty allocation: Slurm granted no nodes")
    head, workers = nodelist[0], nodelist[1:]
    n = len(nodelist)
    resources = {
        "GPU": float(np.sum(np.full(n, gpus_per_node))),
        "CPU": float(np.sum(np.full(n, cpus_per_task))),
    }
    return head, workers, f"{head}:6379", resources


# happy path: 4 nodes x 8 GPUs x 96 CPUs -> the {'GPU':32,'CPU':384} the page prints
nodes = [f"gpu-node-{i}" for i in range(4)]
head, workers, ip_head, res = wire_cluster(nodes, gpus_per_node=8, cpus_per_task=96)
assert head == "gpu-node-0"
assert workers == ["gpu-node-1", "gpu-node-2", "gpu-node-3"]
assert len(workers) == len(nodes) - 1 == 3          # worker_num = N - 1
assert res == {"GPU": 32.0, "CPU": 384.0}           # matches the train.py comment
assert ip_head == "gpu-node-0:6379"

# edge: single-node allocation -> head only, zero workers, one node's GPUs
h1, w1, _, r1 = wire_cluster(["solo"], gpus_per_node=8, cpus_per_task=96)
assert w1 == [] and r1["GPU"] == 8.0

# adversarial: an empty nodelist must fail fast, not build a headless cluster
try:
    wire_cluster([], 8, 96)
    raise AssertionError("empty allocation should have raised")
except ValueError:
    pass

print("A ok:", head, ip_head, res, "| workers:", len(workers))

The probe.remote() fan-out asks for one task per GPU (num_gpus=1), which fills the cluster in a single scheduling wave only when Ray's logical GPU count equals what Slurm granted. This check validates that invariant, including the --num-gpus=8 on a 4-GPU task failure, cross-checked against a slow device-id reference:

# Validate Ray's logical --num-gpus against Slurm's physically-granted GPUs, plus
# the one-task-per-GPU fan-out in train.py. Oversubscription = Ray believing in
# GPUs Slurm never granted (the --num-gpus=8 on a 4-GPU task failure).
from __future__ import annotations
import numpy as np


def concurrent_tasks(logical_gpus: int, gpus_per_task: int) -> int:
    return logical_gpus // gpus_per_task            # tasks Ray runs at once


def oversubscribed_devices(logical_gpus: int, physical_gpus: int, gpus_per_task: int = 1) -> int:
    used = concurrent_tasks(logical_gpus, gpus_per_task) * gpus_per_task
    return max(0, used - physical_gpus)             # tasks landing on absent devices


# happy path: logical == physical, 1 GPU/task -> 32 probe.remote() fill 4x8 exactly
assert concurrent_tasks(32, 1) == 32
assert oversubscribed_devices(32, 32, 1) == 0
assert oversubscribed_devices(8, 8, 1) == 0         # per-node view, 8 == 8

# adversarial: --num-gpus=8 declared but Slurm granted only 4 -> 4 tasks hit
# devices that do not exist (CUDA invalid device / OOM)
assert oversubscribed_devices(8, 4, 1) == 4

# boundary: 2-GPU actors on an 8-GPU node -> 4 concurrent, none oversubscribed
assert concurrent_tasks(8, 2) == 4
assert oversubscribed_devices(8, 8, 2) == 0

# equivalence to a slow, independent reference: hand out logical device ids and
# count those outside the physical set {0..physical-1}
def oversub_slow(logical_gpus: int, physical_gpus: int, gpus_per_task: int = 1) -> int:
    running = logical_gpus // gpus_per_task
    handed_out = np.arange(running * gpus_per_task)  # logical device ids Ray assigns
    physical = set(range(physical_gpus))
    return int(sum(int(d) not in physical for d in handed_out))

for lg, pg, g in [(32, 32, 1), (8, 4, 1), (8, 8, 2), (16, 8, 1), (9, 9, 3), (7, 4, 1)]:
    assert oversubscribed_devices(lg, pg, g) == oversub_slow(lg, pg, g), (lg, pg, g)

print("B ok: oversub(logical=8, physical=4) =", oversubscribed_devices(8, 4, 1))

Symmetric run (Ray 2.49+, less boilerplate)

One srun over all tasks; Ray boots on every node and runs the entrypoint only on the head. The cluster stops automatically when the script finishes.

#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8
#SBATCH --cpus-per-task=96
#SBATCH --exclusive
set -euo pipefail

nodes_array=($(scontrol show hostnames "$SLURM_JOB_NODELIST"))
ip_head="${nodes_array[0]}:6379"

srun --nodes="$SLURM_JOB_NUM_NODES" --ntasks="$SLURM_JOB_NUM_NODES" \
  ray symmetric-run \
    --address "$ip_head" \
    --min-nodes "$SLURM_JOB_NUM_NODES" \
    --num-cpus="${SLURM_CPUS_PER_TASK}" \
    --num-gpus="${SLURM_GPUS_PER_NODE}" \
    -- python -u train.py

How to integrate it

  • GPU binding. Slurm's --gpus-per-node sets CUDA_VISIBLE_DEVICES per task; pass the same count to --num-gpus so Ray's logical resource matches what Slurm granted. Mismatch (e.g. --num-gpus=8 on a 4-GPU task) lets Ray over-schedule and CUDA OOM/invalid device follows (reproduced in the check above).
  • Fabric / NCCL. Ray Train collectives run over NCCL; on IB/RoCE set NCCL_IB_HCA, NCCL_NET_GDR_LEVEL=SYS and confirm [GDRDMA] in NCCL_DEBUG=INFO. Without RDMA, NCCL falls back to TCP and throughput craters; validate with nccl-tests first (fabric bring-up and benchmarking, workload bring-up recipes).
  • Topology. Let Slurm pack the allocation (topology.conf, --exclusive) so co-scheduled ranks are switch-local; then use Ray placement groups (STRICT_PACK for an in-node NVLink collective, STRICT_SPREAD for one-rank-per-node) on top (topology placement).
  • Health gating. Wrap submission so unhealthy nodes never enter the allocation (GPU health gating, smoke tests).

How to run it in production

  • Pin ports on shared partitions. Two Ray jobs on overlapping nodes with default ports collide; give each job non-overlapping --min-worker-port/--max-worker-port and dashboard/object ports. The overlap test below is exactly the check a submit wrapper should run before co-scheduling two jobs:
# Validate multi-tenant port pinning: two Ray jobs sharing nodes need disjoint
# worker-port ranges or their raylets collide. Inclusive integer ranges.
from __future__ import annotations
import numpy as np


def overlaps(a: tuple[int, int], b: tuple[int, int]) -> bool:
    (lo_a, hi_a), (lo_b, hi_b) = a, b
    assert lo_a <= hi_a and lo_b <= hi_b, "each range must be low <= high"
    return max(lo_a, lo_b) <= min(hi_a, hi_b)       # inclusive interval overlap


job1 = (10002, 19999)                               # the page's pinned worker range

# disjoint neighbour -> safe to co-schedule on shared nodes
assert overlaps(job1, (20000, 29999)) is False
# adversarial: both jobs left ports at the default range -> guaranteed collision
assert overlaps(job1, (10002, 19999)) is True
# boundary: ranges that merely touch at one endpoint still share that port
assert overlaps(job1, (19999, 29999)) is True
assert overlaps((10002, 19998), (19999, 29999)) is False   # a 1-port gap is safe

# equivalence to a slow set-intersection reference over random ranges
def overlaps_slow(a: tuple[int, int], b: tuple[int, int]) -> bool:
    return len(set(range(a[0], a[1] + 1)) & set(range(b[0], b[1] + 1))) > 0

rng = np.random.default_rng(0)
for _ in range(2000):
    lo_a, lo_b = (int(x) for x in rng.integers(6000, 20000, size=2))
    a = (lo_a, lo_a + int(rng.integers(0, 500)))
    b = (lo_b, lo_b + int(rng.integers(0, 500)))
    assert overlaps(a, b) == overlaps_slow(a, b), (a, b)

print("C ok: default-vs-default collide =", overlaps(job1, job1))
  • Observe. Scrape the Ray dashboard/Prometheus metrics endpoint and the node exporters for the allocation duration (telemetry and alerting).
  • Plan for head loss. The head is a single point of failure for the job's lifetime, and the allocation is wall-clock bounded and preemptible. That is acceptable for batch training; for HA serving move to Kubernetes RayService, not this pattern.

How to maintain it

  • The sleep is load-bearing. Workers that ray start --address before GCS is up fail to register. Keep the head sleep (10-20 s) and a short per-worker sleep; on large jobs prefer polling ray status until --min-nodes are present over a fixed sleep.
  • Always ray stop on teardown. The trailing srun ... ray stop (or symmetric-run's auto-stop) prevents orphaned Ray processes holding GPUs into the next allocation on shared nodes.

How to scale it

  • Add nodes, not control planes. Bump --nodes; the worker fan-out loop and ray.cluster_resources() grow linearly (GPU total = nodes x --gpus-per-node, as the wiring check asserts). There is still no second control plane to stand up.
  • Retire the fixed sleep at size. On large N, GCS registration is staggered and a constant sleep either wastes time or races; poll ray status/ray.nodes() until --min-nodes raylets are live, which is exactly what symmetric-run --min-nodes does for you.
  • Scale scheduling with placement groups. Size bundles to the packed topology: STRICT_PACK to keep a collective inside one node's NVLink domain, STRICT_SPREAD to place one rank per node (topology placement).
  • RL controllers at scale. verl, slime, SkyRL and NeMo-RL run long-lived rollout and trainer actor groups; scaling them means adding worker nodes to the same allocation, keeping the --exclusive, topology-packed node set that lets NCCL ride the fabric (distributed training recipes).

Failure modes

Symptom Cause Fix
CUDA invalid device / OOM on some ranks Ray --num-gpus exceeds Slurm's granted GPUs, so Ray over-schedules onto devices that do not exist (see Block B) Set --num-gpus == --gpus-per-node; never inflate the logical count
Workers never join; ray status shows the head only A worker ran ray start --address before the head's GCS was up Keep the head sleep (10-20 s); on large jobs poll ray status until --min-nodes raylets register
Raylet bind / port errors when two jobs share nodes Both jobs used default or overlapping worker-port ranges (see Block C) Give each job disjoint --min-worker-port/--max-worker-port and distinct GCS/object/dashboard ports
GPUs still busy at the start of the next allocation Orphaned Ray processes from a job that exited without ray stop Always run the trailing srun ... ray stop; symmetric-run auto-stops
NCCL throughput craters in Ray Train RDMA not engaged; NCCL fell back to TCP Set NCCL_IB_HCA, NCCL_NET_GDR_LEVEL=SYS; confirm [GDRDMA] under NCCL_DEBUG=INFO; validate with nccl-tests
The whole job dies when the head node fails The head is a single point of failure for the job's lifetime, and Slurm can preempt the wall-clock-bounded allocation Accept it for batch; for HA online serving move to Kubernetes RayService, not this pattern

References

  • Ray on Slurm (deploying Ray with Slurm): https://docs.ray.io/en/latest/cluster/vms/user-guides/community/slurm.html
  • ray start CLI reference: https://docs.ray.io/en/latest/cluster/cli.html#ray-start
  • Ray cluster launcher / community integrations: https://docs.ray.io/en/latest/cluster/vms/user-guides/community/index.html
  • KubeRay (RayCluster / RayJob / RayService): https://docs.ray.io/en/latest/cluster/kubernetes/index.html · repo https://github.com/ray-project/kuberay
  • Placement groups: https://docs.ray.io/en/latest/ray-core/scheduling/placement-group.html
  • Slurm sbatch / srun / GRES: https://slurm.schedmd.com/sbatch.html · https://slurm.schedmd.com/srun.html · https://slurm.schedmd.com/gres.html
  • NCCL environment variables: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html

Related: Ray · Slurm · Slurm vs Kubernetes · Kubernetes · Orchestration · Topology placement · Distributed training recipes · Glossary