
GPU orchestration decision guide

Scope: choosing the orchestrator for GPU work, Slurm vs Kubernetes vs Ray (and hybrids), by workload shape (batch HPC vs services vs Python-native), team, and multi-tenancy; a decision matrix and flow that delegate implementation to the per-technology pages.

Decision aid, not an install guide. The runnable artifact below is a labeller you point at your own workload facts; per-technology setup lives on the Slurm, Kubernetes, k3s, and Ray pages. Pin versions and validate before production use.

What it is

A way to pick the orchestration layer (the system that decides what runs where on a GPU fleet) before committing to a platform. Three families dominate, with different native assumptions (see orchestration overview for the architecture detail):

Slurm. HPC batch workload manager: bare-metal, gang-scheduled, topology-aware MPI/torchrun jobs. Deterministic node-level allocation. The training default in HPC (Slurm).
Kubernetes / k3s. Cloud-native: containers, services, multi-tenant, declarative/GitOps. Needs an add-on gang scheduler (Volcano/Kueue/KAI) for distributed training (Kubernetes, k3s).
Ray. Python-native distributed runtime: tasks/actors with Train/Serve/Data/RLlib on top. The substrate most LLM-RL stacks build on (Ray).

These are not mutually exclusive. Ray runs on Kubernetes (KubeRay) or on Slurm; many sites run Slurm and Kubernetes side by side over partitioned node pools. The decision is therefore "primary control plane, plus what rides on it" rather than "pick exactly one".

SchedMD frames the split directly: Kubernetes "excels at scheduling workloads that ... run for an indefinite amount of time with potentially vague resource requirements on a single node with loose policy, but can scale its resource pool infinitely"; Slurm "excels at quickly scheduling workloads that run for a finite amount of time, with well defined resource requirements and topology, on multiple nodes, with strict policy, but its resource pool is known".⁴

Why use it

The wrong default costs goodput and operator time:

The stock Kubernetes scheduler has no gang scheduling: a multi-pod training job can partially place, hold some GPUs, and deadlock against other partial jobs, never reaching the full worker set.¹ Distributed training on K8s requires a gang scheduler (Volcano, Kueue, or KAI) before it is safe.
Slurm gives deterministic, topology-aware node allocation that suits tightly-coupled multi-node pretraining, but it does not natively run long-lived containerised services or multi-tenant GitOps the way K8s does.²
Ray run as a second, ungoverned stack beside K8s duplicates scheduling and bypasses quota enforcement; run it through KubeRay so it reuses the gang scheduler and GPU Operator (orchestration overview).

Choosing by workload shape, not dogma, is what keeps utilization high and the SLO/SLI catalog green.

When to use it (and when not)

Use this guide when you are standing up a new GPU platform, onboarding a new workload class (e.g. adding online inference to a training-only cluster), or reconciling two teams that each brought their own orchestrator.

| Workload / constraint | Primary choice | Notes |
|---|---|---|
| Tightly-coupled multi-node pretraining, bare metal | Slurm | Gang + topology native; `srun torchrun` (FSDP, Distributed Training Recipes) |
| Multi-tenant platform, online services, GitOps | Kubernetes | Add Volcano/Kueue/KAI for batch gang (Kubernetes for GPU Clusters, Kubernetes and Helm: GPU Platform) |
| Edge / small / CI / dev cluster | k3s | Same API, smaller footprint (k3s) |
| RL, data + train + rollout pipelines, Python-native | Ray | On K8s via KubeRay, or on Slurm (Ray for GPU Clusters, DiLoCo) |
| Inference at scale | K8s + Ray Serve / KServe | (Inference Serving and Optimization, Serving Open-Weight Models) |
| Both batch training and services on one fleet | Hybrid | Partition nodes; Slurm-on-K8s or Slurm+K8s side by side² |

When not to switch: do not migrate a working Slurm pretraining cluster to K8s merely for fashion; the gang/topology story is harder there. Do not adopt Ray as a parallel control plane if you only need batch jobs; a Volcano job on existing K8s is simpler. Do not run k3s as the control plane for a large datacentre; it was never sized for that.

Architecture

Two structures matter: the decision flow (how you arrive at a primary control plane) and the composition stack (what rides on what once you have chosen). They are separate because the answer is rarely one orchestrator: it is a primary plus the layers it hosts.

Decision flow. Reason in this order: workload coupling, then tenancy/services, then Python-native runtime, then the hybrid boundary.

flowchart LR
  WL["GPU workload"] --> C{"Tightly coupled<br/>multi-node training?"}
  C -->|"yes, bare metal"| SLURM["Slurm<br/>(gang + topology)"]
  C -->|"no"| SVC{"Multi-tenant services<br/>or GitOps?"}
  SVC -->|"yes"| K8S["Kubernetes / k3s"]
  SVC -->|"no"| PY{"Python-native<br/>RL / pipelines?"}
  PY -->|"yes"| RAY["Ray"]
  PY -->|"no, simple batch"| K8S
  K8S --> GANG{"Distributed<br/>training on K8s?"}
  GANG -->|"yes"| VOL["Add Volcano /<br/>Kueue / KAI gang"]
  RAY -.->|"KubeRay"| K8S
  RAY -.->|"Ray-on-Slurm"| SLURM
  SLURM --> HYB{"Also need<br/>online services?"}
  HYB -->|"yes"| BOTH["Hybrid: partition nodes,<br/>Slurm + K8s"]

Composition stack. The families layer rather than compete. Ray is a runtime that needs a resource manager underneath: KubeRay makes Kubernetes that manager, and Ray-on-Slurm makes Slurm that manager. Slurm and Kubernetes can also co-exist on one fleet, either side by side over partitioned node pools or with Slurm itself running as Kubernetes workloads via Slinky (slurm-operator) or Soperator, so a single Kubernetes substrate hosts Slurm's batch scheduling.⁴ Every path lands on the same NVIDIA node stack (driver, NCCL, RDMA fabric), with the GPU Operator managing that stack on the Kubernetes paths.

flowchart TB
  subgraph RUNTIME["Runtime / app layer"]
    RAYRT["Ray: Train / Serve / Data / RLlib"]
    TORCH["torchrun / MPI (FSDP, TP, PP)"]
    SVC2["Long-lived services / inference"]
  end
  subgraph CTRL["Control plane (primary)"]
    S2["Slurm<br/>(gang + topology native)"]
    K2["Kubernetes / k3s<br/>+ Volcano / Kueue / KAI gang"]
  end
  NODE["NVIDIA node stack:<br/>driver, NCCL, RDMA fabric<br/>(GPU Operator + DRA on K8s paths)"]
  RAYRT -->|"Ray-on-Slurm"| S2
  RAYRT -->|"KubeRay + Volcano gang"| K2
  TORCH -->|"srun torchrun"| S2
  TORCH -->|"gang PodGroup"| K2
  SVC2 --> K2
  S2 -. "Slinky / Soperator:<br/>Slurm as K8s workloads" .-> K2
  S2 --> NODE
  K2 --> NODE

The load-bearing invariant across both diagrams: any tightly-coupled job that spans multiple nodes and ends up on the Kubernetes scheduler must go through a gang scheduler, whether it is a native K8s training job or a KubeRay cluster. A Volcano minAvailable PodGroup is the usual mechanism, and KubeRay integrates Volcano for exactly this.¹ ³ Slurm gets this for free because a job's allocation is granted whole.

How to use it: encode the decision as a checkable rule

Point this at your workload facts; it returns the recommendation, so the choice is reproducible and reviewable in CI rather than tribal. No hardware required. The rules encode the matrix above: tightly-coupled multi-node on bare metal goes to Slurm; a Python-native pipeline goes to Ray hosted on a platform; small/edge goes to k3s; everything else (services, multi-tenant, containerised batch) goes to Kubernetes. Crucially, any coupled multi-node job that lands on a Kubernetes scheduler (native, or Ray-on-K8s) has the gang scheduler forced on as REQUIRED.

#!/usr/bin/env python3
"""Orchestration decision labeller. Inputs are workload facts; output is the
recommended primary control plane plus any required add-on. Illustrative rules
encoding this page's matrix: tune thresholds to your site, do not treat as law."""
from __future__ import annotations

from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Workload:
    nodes: int                    # nodes a single job spans
    tightly_coupled: bool         # NCCL/MPI all-reduce across nodes (training)
    bare_metal: bool              # can run without a container platform
    long_lived_services: bool     # serving / always-on endpoints
    multi_tenant: bool            # many teams share the fleet
    python_native_pipeline: bool  # RL / rollout+train actors / Ray Data
    edge_or_dev: bool             # small / CI / single node


def recommend(w: Workload) -> dict[str, str]:
    # 1. tightly-coupled multi-node training on bare metal -> Slurm
    if w.tightly_coupled and w.nodes > 1 and w.bare_metal:
        rec = {"primary": "slurm", "addon": "topology.conf for rail-aware placement"}
    # 2. Python-native RL/pipeline -> Ray (hosted on a platform)
    elif w.python_native_pipeline:
        host = "k3s" if w.edge_or_dev else "kubernetes"
        rec = {"primary": "ray", "addon": f"KubeRay on {host}"}
    # 3. small / dev / edge -> k3s
    elif w.edge_or_dev:
        rec = {"primary": "k3s", "addon": "GPU Operator"}
    # 4. services / multi-tenant / containerised batch -> Kubernetes
    else:
        rec = {"primary": "kubernetes", "addon": "GPU Operator"}

    # Anything that ends up on a Kubernetes scheduler and is tightly-coupled multi-node
    # needs a gang scheduler; the stock kube-scheduler has none, so it partial-places and
    # deadlocks. This covers native K8s/k3s AND Ray-on-K8s (KubeRay integrates Volcano for
    # exactly this). A Ray pipeline routed to a k3s/kubernetes host is on kube-scheduler too.
    # In this labeller Ray is always hosted through KubeRay (rule 2), so a Ray primary is on
    # kube-scheduler whether the host is k3s or kubernetes; only Slurm is exempt.
    on_kube = rec["primary"] in {"kubernetes", "k3s", "ray"}
    if on_kube and w.tightly_coupled and w.nodes > 1:
        rec["addon"] = "Volcano/Kueue/KAI gang scheduler (REQUIRED) + " + rec["addon"]

    # both batch training AND services on one fleet -> hybrid, partition nodes
    if rec["primary"] == "slurm" and w.long_lived_services:
        rec["hybrid"] = "partition nodes: Slurm pool + K8s pool, clear boundary"
    return rec


def _wl(**over: object) -> Workload:
    base = dict(nodes=1, tightly_coupled=False, bare_metal=False,
                long_lived_services=False, multi_tenant=False,
                python_native_pipeline=False, edge_or_dev=False)
    base.update(over)
    return Workload(**base)  # type: ignore[arg-type]


CASES = {
    "8-node FSDP pretraining (bare metal)": _wl(nodes=8, tightly_coupled=True, bare_metal=True),
    "multi-tenant inference platform": _wl(long_lived_services=True, multi_tenant=True),
    "GRPO RL (rollout + trainer actors)": _wl(nodes=4, tightly_coupled=True, multi_tenant=True,
                                              python_native_pipeline=True),
    "single-node dev / CI": _wl(edge_or_dev=True),
}

# --- Adversarial / edge assertions: the rules must hold, not just the happy path ---

# Happy path: the four documented cases land where the matrix says.
assert recommend(CASES["8-node FSDP pretraining (bare metal)"])["primary"] == "slurm"
assert recommend(CASES["multi-tenant inference platform"])["primary"] == "kubernetes"
assert recommend(CASES["GRPO RL (rollout + trainer actors)"])["primary"] == "ray"
assert recommend(CASES["single-node dev / CI"])["primary"] == "k3s"

# Edge 1 (the load-bearing safety rule): ANY multi-node tightly-coupled job that lands
# on a Kubernetes scheduler MUST carry the gang scheduler as REQUIRED, or it partial-places
# and deadlocks. Containerised (not bare_metal) multi-node training falls through to K8s:
k8s_train = recommend(_wl(nodes=16, tightly_coupled=True, bare_metal=False, multi_tenant=True))
assert k8s_train["primary"] == "kubernetes"
assert "REQUIRED" in k8s_train["addon"] and "gang" in k8s_train["addon"].lower()
# A Ray RL job that is itself tightly-coupled multi-node on K8s inherits the gang rule too:
ray_train = recommend(_wl(nodes=4, tightly_coupled=True, python_native_pipeline=True))
assert ray_train["primary"] == "ray" and "REQUIRED" in ray_train["addon"]
# The same holds when Ray is hosted on k3s: KubeRay on k3s still goes through kube-scheduler.
ray_k3s = recommend(_wl(nodes=2, tightly_coupled=True, python_native_pipeline=True,
                        edge_or_dev=True))
assert ray_k3s["primary"] == "ray" and "REQUIRED" in ray_k3s["addon"]

# Edge 2 (boundary value at nodes == 1): a single-node tightly-coupled job needs NO gang
# scheduler (there is nothing to co-allocate across nodes). Crossing 1->2 flips it on.
one = recommend(_wl(nodes=1, tightly_coupled=True, multi_tenant=True))
two = recommend(_wl(nodes=2, tightly_coupled=True, multi_tenant=True))
assert "REQUIRED" not in one["addon"], "no gang needed at nodes==1"
assert "REQUIRED" in two["addon"], "gang required the instant nodes>1"

# Edge 3 (host routing): a Python-native pipeline on edge/dev hosts Ray on k3s, not K8s.
assert recommend(_wl(python_native_pipeline=True, edge_or_dev=True))["addon"] == "KubeRay on k3s"
assert recommend(_wl(python_native_pipeline=True))["addon"] == "KubeRay on kubernetes"

# Edge 4 (hybrid only where it belongs): the hybrid split appears iff Slurm is primary AND
# long-lived services coexist. It must never appear for a pure-services K8s fleet.
hy = recommend(_wl(nodes=8, tightly_coupled=True, bare_metal=True, long_lived_services=True))
assert "hybrid" in hy and hy["primary"] == "slurm"
assert "hybrid" not in recommend(CASES["multi-tenant inference platform"])
assert "hybrid" not in recommend(CASES["8-node FSDP pretraining (bare metal)"])  # no services

# Edge 5 (determinism / no hidden state): same facts -> byte-identical recommendation, always.
w = CASES["GRPO RL (rollout + trainer actors)"]
assert recommend(w) == recommend(w)

# Edge 6 (adversarial contradiction): bare_metal + edge_or_dev single node is NOT coupled
# multi-node, so rule 1 must not fire; it must fall through to k3s, not misclassify as Slurm.
assert recommend(_wl(nodes=1, tightly_coupled=True, bare_metal=True, edge_or_dev=True))["primary"] == "k3s"

# Edge 7 (equivalence to an INDEPENDENT slow reference over the WHOLE input space): a second,
# deliberately plain oracle re-derives primary/addon/hybrid with different structure; the two must
# agree on every combination of the six booleans x representative node counts (256 cases), or a
# refactor has silently changed a routing decision. This is the strongest guard on the matrix.
def _oracle(w: Workload) -> dict[str, str]:
    coupled_multinode = w.tightly_coupled and w.nodes > 1
    if coupled_multinode and w.bare_metal:
        primary, addon = "slurm", "topology.conf for rail-aware placement"
    elif w.python_native_pipeline:
        primary, addon = "ray", ("KubeRay on k3s" if w.edge_or_dev else "KubeRay on kubernetes")
    elif w.edge_or_dev:
        primary, addon = "k3s", "GPU Operator"
    else:
        primary, addon = "kubernetes", "GPU Operator"
    # Everything except Slurm is scheduled by kube-scheduler here (Ray always rides KubeRay).
    lands_on_kube = primary != "slurm"
    if lands_on_kube and coupled_multinode:
        addon = "Volcano/Kueue/KAI gang scheduler (REQUIRED) + " + addon
    out = {"primary": primary, "addon": addon}
    if primary == "slurm" and w.long_lived_services:
        out["hybrid"] = "partition nodes: Slurm pool + K8s pool, clear boundary"
    return out


for _combo in product((False, True), repeat=6):
    _tc, _bm, _ls, _mt, _pn, _ed = _combo
    for _n in (1, 2, 8, 16):
        _w = _wl(nodes=_n, tightly_coupled=_tc, bare_metal=_bm, long_lived_services=_ls,
                 multi_tenant=_mt, python_native_pipeline=_pn, edge_or_dev=_ed)
        assert recommend(_w) == _oracle(_w), f"reference mismatch at {_w}"

if __name__ == "__main__":
    for name, wl in CASES.items():
        print(f"{name:42s} -> {recommend(wl)}")
    print("all assertions passed")

Run it:

python3 orchestration_decision.py
# 8-node FSDP pretraining (bare metal)      -> {'primary': 'slurm', 'addon': 'topology.conf ...'}
# multi-tenant inference platform           -> {'primary': 'kubernetes', 'addon': 'GPU Operator'}
# GRPO RL (rollout + trainer actors)        -> {'primary': 'ray', 'addon': 'Volcano/Kueue/KAI gang scheduler (REQUIRED) + KubeRay on kubernetes'}
# single-node dev / CI                      -> {'primary': 'k3s', 'addon': 'GPU Operator'}
# all assertions passed

Note the GRPO line: because that job is tightly-coupled across 4 nodes and lands on a Kubernetes scheduler via KubeRay, the labeller correctly forces the gang scheduler on, rather than silently omitting the one add-on that keeps it from deadlocking.

Why gang admission is all-or-nothing (the core math, numpy-only)

The whole guide rests on one claim: the stock per-pod scheduler can partial-place two contending jobs into a mutual deadlock (each holds GPUs, neither reaches its full worker set), whereas gang admission (Volcano minAvailable/minMember) is all-or-nothing, so a job's GPUs are never stranded.¹ ³ This block proves exactly that on a fixed GPU pool, with an adversarial pod-arrival interleaving, a boundary case, and a property test.

#!/usr/bin/env python3
"""Numpy-only proof: the stock per-pod scheduler can PARTIAL-PLACE two gang jobs into a
mutual deadlock, whereas gang admission (Volcano `minAvailable`) is all-or-nothing."""
from __future__ import annotations

import numpy as np


def greedy_per_pod(total: int, need: int, order: np.ndarray) -> tuple[np.ndarray, bool]:
    """Default kube-scheduler: bind one pod at a time in arrival order, first-fit, no
    preemption. Returns (held[job], deadlocked). One pod == one GPU here for simplicity."""
    held = np.zeros(2, dtype=np.int64)
    free = total
    for job in order:                      # interleaved pod arrivals across the two jobs
        if held[job] < need and free > 0:  # greedily grab a GPU if the job still wants one
            held[job] += 1
            free -= 1
    running = held >= need                 # a job runs only when it reached its full set
    # deadlock: nobody is running, yet all GPUs are held by partially-placed jobs
    deadlocked = (not running.any()) and (held.sum() == total) and (held > 0).all()
    return held, bool(deadlocked)


def gang_admission(total: int, need: int) -> np.ndarray:
    """Volcano `minAvailable == need`: admit a job's whole PodGroup or none of it. A job is
    admitted only if `need` GPUs are free at once (all-or-nothing), never partial-placed."""
    held = np.zeros(2, dtype=np.int64)
    free = total
    for job in (0, 1):
        if free >= need:        # atomic gang check: the FULL worker set must fit now
            held[job] = need    # place the whole group
            free -= need
    return held


total, need = 8, 5              # need*2 = 10 > 8: only one job can fully fit
assert need * 2 > total, "precondition: the two jobs contend for the same GPUs"

# Adversarial: interleave the two jobs' pod arrivals to force partial placement.
order = np.array([0, 1, 0, 1, 0, 1, 0, 1], dtype=np.int64)  # A,B,A,B,... 8 arrivals
held, deadlocked = greedy_per_pod(total, need, order)
assert deadlocked, f"expected per-pod deadlock, got held={held}"
assert (held == np.array([4, 4])).all()        # 4+4 GPUs held, neither reaches 5
assert held.sum() == total                      # every GPU stranded
assert (held < need).all()                      # NEITHER job runnable -> deadlock

# Gang admission on the identical contention: exactly one job runs whole, none stranded.
g = gang_admission(total, need)
assert (g == need).sum() == 1, f"gang must admit exactly one whole job, got {g}"
assert g.sum() == need                          # only `need` GPUs used, rest left free
assert (total - g.sum()) == total - need        # 3 GPUs remain FREE, not deadlock-held
# Equivalence to a slow reference: gang admission is the first-feasible whole packing.
ref = np.zeros(2, dtype=np.int64)
for j in range(2):
    if ref.sum() + need <= total:
        ref[j] = need
assert (g == ref).all()

# Boundary: when the pool DOES fit both (need*2 <= total) gang admits both, no waste.
assert (gang_admission(total=10, need=5) == np.array([5, 5])).all()

# Property: gang admission NEVER partial-places (each job holds 0 or exactly `need`).
rng = np.random.default_rng(0)
for _ in range(1000):
    t = int(rng.integers(1, 33))   # pool size in [1, 32]
    n = int(rng.integers(1, 17))   # per-job gang size in [1, 16]
    gg = gang_admission(t, n)
    assert np.isin(gg, (0, n)).all(), f"partial placement leaked: t={t} n={n} -> {gg}"
    assert gg.sum() <= t                          # never oversubscribe the pool

print("gang all-or-nothing vs per-pod deadlock: all assertions passed")

This is why the matrix marks the gang scheduler REQUIRED for K8s distributed training: without minAvailable, the deadlock above is a live failure mode, not a hypothetical.
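The interleaved order is the worst case, but partial placement wastes GPUs even when one job's pods all arrive first. A minimal pure-Python sketch under the same simplifying assumptions as the block above (two jobs, one pod per GPU, first-fit, no preemption); `per_pod_place` and `stranded` are hypothetical helpers for illustration, not scheduler code:

```python
from __future__ import annotations


def per_pod_place(total: int, need: int, order: list[int]) -> list[int]:
    """Bind one pod at a time in arrival order (simplified per-pod model)."""
    held = [0, 0]
    free = total
    for job in order:
        if held[job] < need and free > 0:
            held[job] += 1
            free -= 1
    return held


def stranded(held: list[int], need: int) -> int:
    """GPUs held by jobs that never reached their full worker set."""
    return sum(h for h in held if h < need)


total, need = 8, 5
sequential = [0] * 5 + [1] * 5   # job 0's pods all arrive before job 1's
interleaved = [0, 1] * 5         # alternating arrivals

seq_held = per_pod_place(total, need, sequential)
mix_held = per_pod_place(total, need, interleaved)

print(seq_held, stranded(seq_held, need))   # [5, 3] 3
print(mix_held, stranded(mix_held, need))   # [4, 4] 8
```

In the sequential case job 0 runs, but job 1 still pins 3 GPUs it cannot use; gang admission would leave those 3 free for a job that fits.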

How to integrate

Once chosen, hand off to the per-technology page for setup: Slurm for GPU Clusters, Kubernetes for GPU Clusters + Volcano Gang Scheduler/Kueue + Volcano Job, k3s, Ray for GPU Clusters. Lay the GPU platform with Kubernetes and Helm: GPU Platform; place jobs topology-aware via Topology-Aware GPU Scheduling in Kubernetes.

On K8s, run distributed training only behind a gang scheduler so the full worker set places atomically or stays queued. Volcano's minAvailable/minMember PodGroup is the standard control.³ The Kubernetes equivalents of Slurm's cons_tres/fair-share are split across components: DRA for flexible device requests, KAI (NVIDIA) or Volcano for gang + topology-aware placement with the PodGroup as the atomic gang unit, and Kueue for quota and all-or-nothing admission. Each is a separate install on top of the GPU Operator.² For Ray, host it through KubeRay so it reuses that gang scheduler and the GPU Operator rather than standing up a second, ungoverned control plane; run Ray on Slurm where Slurm is the primary.
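To make the "full worker set places atomically" rule concrete, here is a hedged sketch that builds a Volcano Job manifest as a Python dict and checks that its gang size equals its replica count. The field names follow the Volcano Job CRD as I understand it (`batch.volcano.sh/v1alpha1`, `schedulerName`, `minAvailable`, `tasks[].replicas`); the job name, image, and helper functions are hypothetical, so validate against the Volcano version you pin before applying anything.

```python
from __future__ import annotations

import json


def volcano_job(name: str, workers: int, gpus_per_worker: int, image: str) -> dict:
    """Build a Volcano Job manifest whose gang size equals the full worker set."""
    return {
        "apiVersion": "batch.volcano.sh/v1alpha1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "schedulerName": "volcano",   # route pods away from the default scheduler
            "minAvailable": workers,      # all-or-nothing: every worker or none
            "tasks": [{
                "name": "worker",
                "replicas": workers,
                "template": {"spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "trainer",
                        "image": image,
                        "resources": {"limits": {"nvidia.com/gpu": gpus_per_worker}},
                    }],
                }},
            }],
        },
    }


def gang_is_whole(manifest: dict) -> bool:
    """True if the job uses the Volcano scheduler and gangs its entire worker set."""
    spec = manifest["spec"]
    replicas = sum(task["replicas"] for task in spec["tasks"])
    return spec.get("schedulerName") == "volcano" and spec.get("minAvailable") == replicas


# Hypothetical job name and image, for illustration only.
job = volcano_job("fsdp-pretrain", workers=4, gpus_per_worker=8,
                  image="registry.example.com/trainer:pinned")
print(json.dumps(job["spec"]["minAvailable"]), gang_is_whole(job))
```

A check like `gang_is_whole` can run in CI against rendered manifests, so a job with `minAvailable` below its replica count is caught before it reaches the cluster.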

How to run it in production

Confirm the gang was admitted as a unit, not partially placed. Under a gang scheduler with minAvailable set to the full worker count, either all pods of the PodGroup are Running or none are scheduled; there should never be a state with some workers Running (holding GPUs) and the rest Pending. That mixed state is the canonical Kubernetes-for-GPUs failure: the gang scheduler is not actually in the path and jobs went through the default scheduler.¹ On Slurm, a correctly placed job shows all requested nodes RUNNING under one job ID with each rank seeing its devices, because the allocation is granted whole.
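The mixed Running/Pending check can be expressed as a small classifier over pod phases. This is a sketch, assuming you have already collected the phase strings for one PodGroup's pods (for example from `kubectl get pods`); `gang_state` is a hypothetical helper, and pod phase is only a proxy for placement, since a pod can be scheduled and still Pending while its image pulls.

```python
from __future__ import annotations

from collections import Counter


def gang_state(phases: list[str], min_member: int) -> str:
    """Classify one PodGroup from its pods' phases.

    Returns "admitted" when at least `min_member` pods are Running, "queued" when
    none are, and "partial" for the mixed state that signals per-pod scheduling.
    """
    running = Counter(phases)["Running"]
    if running >= min_member:
        return "admitted"
    if running == 0:
        return "queued"
    return "partial"


print(gang_state(["Running"] * 4, min_member=4))                               # admitted
print(gang_state(["Pending"] * 4, min_member=4))                               # queued
print(gang_state(["Running", "Running", "Pending", "Pending"], min_member=4))  # partial
```

Alerting on `"partial"` persisting beyond a short grace period is one way to surface a job that bypassed the gang scheduler.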

Validate the fast interconnect from inside the workload regardless of orchestrator: NCCL_DEBUG=INFO must report the GPUDirect RDMA path, not a TCP fallback, and topology packing must land the job on the fewest leaf switches your topology.conf (or topology-aware scheduler) defines, since cross-spine hops inflate all-reduce time (networking fabric, diagnostics, Fabric Bring-Up, Validation and Benchmarking). Gate the platform with Smoke Tests: GPU Platform and GPU Health Gating before admitting production traffic.
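A hedged sketch of the NCCL transport check: it scans an `NCCL_DEBUG=INFO` log for per-channel transport suffixes and flags a TCP fallback. The sample lines are hand-written to resemble the `via NET/IB/<n>/GDRDMA` and `via NET/Socket/<n>` suffixes NCCL prints; the exact wording varies by NCCL version, so treat the regex and the helper names as assumptions to adapt against your own logs.

```python
from __future__ import annotations

import re

# Transport suffix on inter-node channel lines, e.g. "via NET/IB/0/GDRDMA".
_NET_RE = re.compile(r"via NET/(IB|Socket)/\d+(/GDRDMA)?")


def nccl_transports(log_text: str) -> dict[str, int]:
    """Count inter-node channels by transport in an NCCL_DEBUG=INFO log."""
    counts = {"ib_gdrdma": 0, "ib_no_gdr": 0, "socket": 0}
    for match in _NET_RE.finditer(log_text):
        transport, gdr = match.group(1), match.group(2)
        if transport == "Socket":
            counts["socket"] += 1
        elif gdr:
            counts["ib_gdrdma"] += 1
        else:
            counts["ib_no_gdr"] += 1
    return counts


def rdma_path_ok(counts: dict[str, int]) -> bool:
    """True only if some channel uses GPUDirect RDMA and none fell back to TCP."""
    return counts["ib_gdrdma"] > 0 and counts["socket"] == 0


# Illustrative log lines (not captured from a real run).
SAMPLE_GOOD = """\
node01:101:140 [0] NCCL INFO Channel 00/0 : 0[0] -> 8[0] [send] via NET/IB/0/GDRDMA
node01:101:140 [0] NCCL INFO Channel 01/0 : 0[0] -> 8[0] [send] via NET/IB/1/GDRDMA
"""
SAMPLE_FALLBACK = """\
node01:101:140 [0] NCCL INFO Channel 00/0 : 0[0] -> 8[0] [send] via NET/Socket/0
"""

good = nccl_transports(SAMPLE_GOOD)
bad = nccl_transports(SAMPLE_FALLBACK)
print(good, rdma_path_ok(good))
print(bad, rdma_path_ok(bad))
```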

How to maintain it

Re-run the decision when the workload mix shifts (e.g. inference added to a training cluster, or two teams' orchestrators merged). Validate the NCCL/RDMA path under your chosen orchestrator with nccl-tests (Fabric Bring-Up, Validation and Benchmarking, Workload and Bring-Up Recipes). Watch utilization and SLOs in Telemetry, Monitoring and Alerting/Observability and Monitoring against the SLO/SLI Catalog and Error-Budget Alerts; a partial-placement deadlock or NCCL-on-TCP fallback surfaces as an MFU regression or an inference SLO breach, and a job stuck queued shows up via the scheduler pending-GPU-job runbook. Cost trade-offs (own vs rent, neocloud) feed back via Cloud, Neoclouds and Cost/Capacity and Vendor Sourcing and Procurement Logistics.

How to scale it (multi-tenancy and the hybrid boundary)

Scaling here means adding tenants and workload classes without one starving another, not adding GPUs to a single job. On Slurm, multi-tenancy is the accounting DB (slurmdbd) plus the multifactor priority plugin with non-zero fair-share/QOS weights. On Kubernetes, it is quota via Kueue cohorts plus gang placement via Volcano/KAI, each a separate control layered on top of the GPU Operator.² When one fleet must carry both tightly-coupled batch training and long-lived services, do not force one orchestrator to do both jobs badly: partition the nodes into a Slurm pool and a K8s pool with a clear boundary, or run Slurm-on-Kubernetes via Slinky/Soperator so a single substrate hosts both.⁴ The labeller emits exactly this hybrid recommendation when Slurm is primary and services coexist. Keep Ray inside KubeRay under that same K8s pool so it inherits quota and gang scheduling rather than becoming a third, ungoverned plane.
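Why the fair-share weight must be non-zero: Slurm's multifactor plugin computes job priority as a weighted sum of factors, each normalised to the range 0 to 1. The sketch below reduces that sum to three terms (age, fair-share, QOS); the weight and factor values are hypothetical, and the real plugin includes further terms (job size, partition, TRES, nice) that are omitted here.

```python
from __future__ import annotations


def job_priority(weights: dict[str, int], factors: dict[str, float]) -> int:
    """Simplified multifactor priority: sum of weight * factor over the named terms."""
    return round(sum(weights[name] * factors.get(name, 0.0) for name in weights))


# Hypothetical weights, standing in for PriorityWeightAge / Fairshare / QOS in slurm.conf.
WEIGHTS_FAIRSHARE_OFF = {"age": 1000, "fairshare": 0, "qos": 0}
WEIGHTS_FAIRSHARE_ON = {"age": 1000, "fairshare": 10000, "qos": 0}

# Two tenants whose jobs have waited equally long; one has consumed far more of its share.
heavy_user = {"age": 0.5, "fairshare": 0.25}
light_user = {"age": 0.5, "fairshare": 0.75}

off = (job_priority(WEIGHTS_FAIRSHARE_OFF, heavy_user),
       job_priority(WEIGHTS_FAIRSHARE_OFF, light_user))
on = (job_priority(WEIGHTS_FAIRSHARE_ON, heavy_user),
      job_priority(WEIGHTS_FAIRSHARE_ON, light_user))

print(off)  # (500, 500): usage history has no effect on ordering
print(on)   # (3000, 8000): the light user's job is ranked ahead
```

With the fair-share weight at zero the accounting data is still collected but never changes queue order, which is why the weights, not just slurmdbd, carry the multi-tenancy guarantee.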

Failure modes

Distributed job on the Kubernetes default scheduler. Pods bind one at a time, the job partial-places, GPUs sit idle held by pods that can never all start, and it deadlocks. Always run a gang scheduler for multi-pod jobs; the numpy proof above is this exact failure.¹ The tell is workers split Running/Pending on the same PodGroup (scheduler pending-GPU-job runbook).
"Gang scheduling" terminology mismatch. In Slurm, gang scheduling means time-sliced suspend/resume of multiple jobs sharing nodes, not single-job co-allocation; the Kubernetes "gang" (all-or-nothing PodGroup) maps to Slurm's default single-job co-allocation. Do not conflate them when comparing the two, or you will mis-scope the add-on you need.²
Ray as a second, ungoverned control plane. Standing up Ray beside K8s duplicates scheduling and bypasses quota enforcement; run it through KubeRay so it reuses the gang scheduler and GPU Operator.²
Migrating a working Slurm pretraining cluster to K8s for fashion. The gang/topology story is harder on K8s; you inherit the deadlock risk above and gain nothing for a pure coupled-batch workload. Switch only when you actually need services, multi-tenancy, or GitOps.
NCCL falling back to TCP. Topology-blind placement or a broken RDMA path drops the collective onto TCP and inflates all-reduce time; it surfaces as an MFU regression. Verify NCCL_DEBUG=INFO shows the GPUDirect RDMA path and the fewest-leaf-switch packing landed (networking fabric).
k3s past its size envelope. k3s run as the control plane for a large datacentre is outside what it was sized for; use it for edge/small/CI/dev, and full Kubernetes for the datacentre control plane.²
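For the topology-blind placement failure mode in the list above, the "fewest leaf switches" check can be sketched as a comparison between the leaves a placement actually spans and the minimum possible. The node names, the node-to-leaf map, and the 4-nodes-per-leaf figure are hypothetical; in practice the map would be derived from your topology.conf or the scheduler's topology labels.

```python
from __future__ import annotations

import math


def leaves_spanned(job_nodes: list[str], node_to_leaf: dict[str, str]) -> int:
    """Number of distinct leaf switches the job's nodes hang off."""
    return len({node_to_leaf[node] for node in job_nodes})


def min_leaves(n_nodes: int, nodes_per_leaf: int) -> int:
    """Fewest leaves that could host n_nodes if packing were perfect."""
    return math.ceil(n_nodes / nodes_per_leaf)


NODES_PER_LEAF = 4  # hypothetical fabric: 4 GPU nodes per leaf switch
NODE_TO_LEAF = {f"gpu{i:02d}": f"leaf{i // NODES_PER_LEAF}" for i in range(16)}

packed = [f"gpu{i:02d}" for i in range(8)]            # gpu00..gpu07
scattered = [f"gpu{i:02d}" for i in range(0, 16, 2)]  # every other node

best = min_leaves(len(packed), NODES_PER_LEAF)
print(leaves_spanned(packed, NODE_TO_LEAF), best)     # 2 2
print(leaves_spanned(scattered, NODE_TO_LEAF), best)  # 4 2
```

A placement spanning more leaves than `min_leaves` is not necessarily wrong on a busy cluster, but it is the first thing to check when all-reduce time regresses.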

References

Slurm documentation: https://slurm.schedmd.com/documentation.html
Kubernetes scheduling (default scheduler, no native gang): https://kubernetes.io/docs/concepts/scheduling-eviction/kube-scheduler/
Volcano (gang scheduling, PodGroup minAvailable): https://volcano.sh/en/docs/podgroup/ · https://github.com/volcano-sh/volcano
Kueue (job queueing / quota): https://kueue.sigs.k8s.io/docs/
Ray cluster docs · KubeRay on Kubernetes: https://docs.ray.io/en/latest/cluster/getting-started.html · https://docs.ray.io/en/latest/cluster/kubernetes/index.html
KubeRay + Volcano integration: https://docs.ray.io/en/latest/cluster/kubernetes/k8s-ecosystem/volcano.html
Ray on Slurm: https://docs.ray.io/en/latest/cluster/vms/user-guides/community/slurm.html
k3s documentation: https://docs.k3s.io/

¹ The default Kubernetes scheduler has no concept of gang scheduling, which can lead to deadlocks where multiple jobs each hold some GPUs while waiting for others: distributed training on K8s requires an add-on gang scheduler. Kubernetes scheduling overview: https://kubernetes.io/docs/concepts/scheduling-eviction/kube-scheduler/ ; Volcano PodGroup gang model: https://volcano.sh/en/docs/podgroup/
² Slurm assumes a dedicated, pre-provisioned cluster and gives deterministic node-level allocation suited to large multi-node training; Kubernetes excels at elastic scaling, services, and multi-tenancy. Common multi-tenant practice is K8s for inference/serving and Slurm (often Slurm-on-K8s) for distributed training, with node pools partitioned. Slurm: https://slurm.schedmd.com/documentation.html ; Kubernetes scheduling: https://kubernetes.io/docs/concepts/scheduling-eviction/kube-scheduler/
³ Volcano's PodGroup minAvailable/minMember is the minimum number of pods that must be schedulable together; if cluster resources cannot satisfy it, the scheduler keeps the whole group queued rather than partially placing it. https://volcano.sh/en/docs/podgroup/
⁴ SchedMD frames the Slurm/Kubernetes split and ships Slinky to run Slurm inside Kubernetes: the slurm-operator runs an actual Slurm cluster (slurmctld, slurmd, slurmdbd, slurmrestd) as Kubernetes Pods/CRDs, so Slurm's batch scheduling runs on a Kubernetes-managed control plane. Soperator (Nebius) is an alternative operator turning a SlurmCluster custom resource into a working Slurm cluster with the GPU/driver/NCCL stack. Slurm on Kubernetes (Slinky): https://slurm.schedmd.com/slinky.html ; Slurm documentation: https://slurm.schedmd.com/documentation.html