
Slurm vs Kubernetes for GPUs

Scope: a decision guide for picking the workload manager on a GPU cluster: batch HPC (Slurm) versus service orchestration (Kubernetes), compared across gang scheduling, multi-tenancy, topology placement, the operational model, and the hybrid (Slurm-on-Kubernetes) path. This is the comparison page; the per-technology deep dives live in Slurm and Kubernetes, with the family-level view in the orchestration overview. Read those for the actual job scripts and manifests; this page only decides between them.

Every command and manifest below is a reference template, not hardware-tested. Scheduler behaviour, plugin names, API versions, and feature gates vary by Slurm and Kubernetes release. Verify against the cited docs for your installed versions before scripting against them. Treat all printed output as illustrative, never as a target.

What it is

Two different answers to "given this fleet of GPU nodes, what runs where, and when does it start".

Slurm (slurm.schedmd.com) is an HPC batch workload manager. You submit a finite job with declared resources (sbatch/srun), it queues, and the scheduler grants the whole allocation at once and launches all tasks together over MPI/PMIx. Resources are tracked as TRES (trackable resources); GPUs are GRES. The resource pool is fixed and known; the scheduler's job is to pack finite jobs onto it well. Slurm is the default for bare-metal pretraining (Slurm, distributed training).
Kubernetes is a declarative container orchestrator. You POST desired state (objects) to the api-server; controllers reconcile actual toward desired in a continuous loop; the scheduler binds Pods to nodes one Pod at a time. It targets long-running services with loose, often elastic resource requirements, and grows the node pool on demand. GPUs are not understood by core Kubernetes; the NVIDIA GPU Operator and the device plugin / DRA make a node GPU-aware (Kubernetes, Kubernetes for GPUs).

SchedMD frames the split directly: Kubernetes "excels at scheduling workloads that ... run for an indefinite amount of time with potentially vague resource requirements on a single node with loose policy, but can scale its resource pool infinitely"; Slurm "excels at quickly scheduling workloads that run for a finite amount of time, with well defined resource requirements and topology, on multiple nodes, with strict policy, but its resource pool is known."¹

flowchart LR
  WL["GPU workload"] --> Q{"Shape?"}
  Q -->|"finite, tightly-coupled, bare metal"| S["Slurm: batch, gang, topology"]
  Q -->|"long-running services, multi-tenant, containers"| K["Kubernetes: declarative, GitOps"]
  Q -->|"both: real Slurm on K8s"| H["Slinky / Soperator"]

Why it's needed (and when)

The choice is load-bearing because the two schedulers make opposite default assumptions, and forcing the wrong one is expensive in idle GPU-hours.

Tightly-coupled synchronous training (data/tensor/pipeline parallel, one all-reduce per step) needs all ranks placed at once or not at all, and placed close on the fabric. Slurm does this natively; on Kubernetes it is an add-on (see gang scheduling below).
Long-running online inference needs stable Service VIPs, DNS, ingress, autoscaling, and rolling updates, the native Kubernetes substrate (inference serving). Slurm is a batch scheduler, not a service mesh, and fits offline/batch scoring far better than low-latency serving.
Multi-tenant platforms mixing many teams, services, and batch benefit from Kubernetes namespaces/RBAC/quota and GitOps. A single-purpose training cluster is simpler under Slurm.
Operational model: Slurm is imperative (submit, query, cancel) and bare-metal-first; Kubernetes is declarative (apply YAML, reconcile) and container-first. Teams already fluent in one pay a real tax to operate the other.

Rule of thumb: Slurm for the finite, coupled, topology-sensitive batch job on known hardware; Kubernetes for the indefinite, loosely-coupled, multi-tenant service on elastic hardware. Most large sites end up running both (orchestration overview).

Comparison

| Dimension | Slurm | Kubernetes |
| --- | --- | --- |
| Primary workload | Finite batch / HPC jobs | Long-running services + batch |
| Unit of work | Job (tasks launched together) | Pod (scheduled one at a time) |
| Operating model | Imperative (`sbatch`/`srun`), bare-metal-first | Declarative reconcile, container-first |
| Gang / co-allocation | Native: a job's allocation is granted whole² | Add-on: KAI / Volcano / Coscheduling PodGroup⁶⁷ |
| GPU model | GRES / TRES; `--gres=gpu:N`, `cons_tres` for sharing² | Device plugin (`nvidia.com/gpu`) or DRA (GA in 1.34)⁹ |
| Topology placement | `topology.conf` (`topology/tree`, `topology/block`)⁵ | Topology Manager + topology-aware scheduling (KAI/Volcano) |
| Multi-tenancy | Associations, QOS, fair-share via accounting DB⁴ | Namespaces, RBAC, ResourceQuota; Kueue for fair-share⁸ |
| Quota / queue fairness | Multifactor priority + fair-share⁴ | Kueue (quota, cohorts, all-or-nothing admission)⁸ |
| Online serving | Weak (no native ingress/autoscale) | Strong (KServe, Service, HPA/KEDA) (inference serving) |
| Node elasticity | Fixed pool (cloud bursting via plugins) | Cluster Autoscaler / Karpenter grow node groups |
| State / SPOF | `slurmctld` (+ optional backup), `slurmdbd` | etcd quorum; api-server |
| GitOps | Not native | Native (Argo CD / Flux) |

How it's set up & managed

The two pages Slurm and Kubernetes carry the full configs; below is only the comparison-relevant surface: how each expresses the same intent ("run an 8-GPU-per-node, multi-node, all-or-nothing training job").

Slurm: co-allocation is the default. The scheduler grants the whole node set and srun launches every rank together; nothing extra is needed for all-or-nothing:

#!/bin/bash
#SBATCH --job-name=train
#SBATCH --partition=blackwell
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8
#SBATCH --exclusive
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)
srun torchrun --nnodes="$SLURM_NNODES" --nproc_per_node=8 \
  --rdzv_backend=c10d --rdzv_endpoint="$MASTER_ADDR:23456" train.py --fsdp --bf16

Topology-aware packing is declarative in slurm.conf + topology.conf (rail-aware, fewest leaf switches)⁵; GPU sharing within a node requires SelectType=select/cons_tres². Multi-tenancy is the accounting DB (slurmdbd) plus the multifactor priority plugin with non-zero fair-share/QOS weights: fair-share needs recorded usage from the DB, and QOS limits override association limits.⁴

# slurm.conf — multi-tenant fairness + fine-grained GPU sharing
SelectType=select/cons_tres
PriorityType=priority/multifactor
PriorityWeightFairshare=100000
PriorityWeightQOS=10000
AccountingStorageEnforce=associations,limits,qos
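
The priority weights above only act on associations and QOS records that exist in the accounting DB. As a minimal sketch of creating them with `sacctmgr`, assuming `gres/gpu` is tracked as a TRES (for example through `AccountingStorageTRES`) and using hypothetical account, user, and QOS names, a tenant could be registered like this. The `SACCTMGR` override and the `<account>_qos` naming are illustration choices, not Slurm conventions:

```shell
#!/bin/sh
# Sketch: register one tenant so fair-share and QOS limits have records to act on.
# Override SACCTMGR (e.g. SACCTMGR="echo sacctmgr") to print the commands instead
# of running them.
SACCTMGR="${SACCTMGR:-sacctmgr}"

# setup_tenant ACCOUNT USER GPU_CAP SHARES
#   All four arguments are hypothetical example inputs.
setup_tenant() {
  account=$1 user=$2 gpu_cap=$3 shares=$4
  qos="${account}_qos"   # hypothetical naming convention

  # Account with a fair-share weight relative to its sibling accounts.
  $SACCTMGR -i add account "$account" Fairshare="$shares"
  # QOS carrying a group-wide GPU cap (assumes gres/gpu is a tracked TRES).
  $SACCTMGR -i add qos "$qos"
  $SACCTMGR -i modify qos "$qos" set GrpTRES="gres/gpu=$gpu_cap"
  # User association that may use, and defaults to, that QOS.
  $SACCTMGR -i add user "$user" Account="$account" QOS="$qos" DefaultQOS="$qos"
}

# Example (dry run):
#   SACCTMGR="echo sacctmgr"; setup_tenant ml_team alice 16 50
```

Because `AccountingStorageEnforce` includes `limits` and `qos` in the fragment above, a cap set this way is enforced at submit and schedule time rather than merely recorded. Verify the option names against the `sacctmgr` man page for your release.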

Kubernetes: the default scheduler binds Pods one at a time, so a multi-Pod distributed job must run under a gang scheduler. Without one, the job can be partially placed and deadlock: the Pods that did start hold GPUs idle while waiting for peers that can never be scheduled.⁶ Volcano expresses all-or-nothing with minAvailable:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata: { name: ddp-train, namespace: ml }
spec:
  minAvailable: 2             # all-or-nothing: 2 nodes x 8 GPUs (one Pod per node)
  schedulerName: volcano
  tasks:
    - name: worker
      replicas: 2             # must equal minAvailable for a strict gang
      template:
        spec:
          containers:
            - name: trainer
              image: nvcr.io/nvidia/pytorch:25.05-py3   # pin to a real NGC tag
              # multi-node rendezvous flags (--nnodes, --rdzv_endpoint) are omitted here
              command: ["torchrun", "--nproc_per_node=8", "train.py"]
              resources: { limits: { nvidia.com/gpu: 8 } }   # 8 GPUs to match --nproc_per_node=8

The equivalent of Slurm's cons_tres/fair-share on Kubernetes is split across components: DRA for flexible device requests (GA in 1.34, default API resource.k8s.io/v1)⁹; KAI Scheduler (NVIDIA, Apache-2.0, CNCF Sandbox) or Volcano for gang + topology-aware placement, the PodGroup being the atomic gang unit⁷; Kueue for quota, cohorts, and all-or-nothing admission, which suspends a Job until its quota and resources are free⁸. Each is a separate install on top of the GPU Operator (Kubernetes for GPUs, the Kubernetes platform).

Hybrid: real Slurm on Kubernetes

The two need not be rival clusters. Slinky is SchedMD's project set "to enable interoperability between Slurm and Kubernetes": the slurm-operator runs an actual Slurm cluster (slurmctld, slurmd, slurmdbd, slurmrestd) as Kubernetes Pods/CRDs, so Slurm's batch scheduling runs inside a Kubernetes-managed control plane.¹ NVIDIA documents Slinky integrating the GPU Operator and DRA/ComputeDomains for topology-aware multi-node GPU scheduling.¹⁰ (SchedMD is now part of NVIDIA.¹⁰) Soperator (Nebius, open source) is an alternative operator that turns a SlurmCluster custom resource into a working Slurm cluster with the GPU/driver/NCCL stack and health checks.¹¹ This is the "both" answer: run Kubernetes as the substrate, let Slurm schedule the coupled batch jobs on it.

flowchart LR
  K8S["Kubernetes substrate"] --> OP["Slinky slurm-operator"]
  OP --> CTLD["slurmctld + slurmdbd pods"]
  OP --> D["slurmd pods, GPU"]
  GPUOP["GPU Operator + DRA"] -.->|"GPUs, topology"| D

A simpler split, when an operator is more than you want, is to partition the fleet statically: some nodes in a Slurm pool, some in a Kubernetes pool, with a hard boundary so the two schedulers never contend for the same node (orchestration overview).
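
A hard boundary is easy to state and easy to erode as nodes are moved between pools. The sketch below checks it, assuming each pool has been exported to a text file with one hostname per line and that both schedulers use the same hostname form (short name versus FQDN); the function name and file names are hypothetical:

```shell
#!/bin/sh
# pool_overlap SLURM_NODES_FILE K8S_NODES_FILE
# Prints every hostname present in both files and returns non-zero if any exist.
pool_overlap() {
  overlap=$(awk '
    NF == 0               { next }               # skip blank lines
    FILENAME == ARGV[1]   { seen[$1] = 1; next } # first file: remember hosts
    ($1 in seen)          { print $1 }           # second file: report repeats
  ' "$1" "$2" | sort -u)
  if [ -n "$overlap" ]; then
    printf '%s\n' "$overlap"
    return 1
  fi
  return 0
}

# Example, with both CLIs available on one admin host:
#   sinfo -h -N -o '%N' | sort -u > slurm_nodes.txt
#   kubectl get nodes -o name | sed 's|^node/||' > k8s_nodes.txt
#   pool_overlap slurm_nodes.txt k8s_nodes.txt || echo "node(s) in both pools"
```

Any hostname it prints is a node that both schedulers believe they own, which is the double-scheduling risk listed under failure modes.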

Validated usage & tests

Reference templates: each check describes what the output should show; do not assume specific numbers.

Slurm: confirm the job is co-allocated and GPUs are visible. A correctly placed job shows all requested nodes RUNNING under one job ID, and each node reports its 8 devices:

squeue -j "$SLURM_JOB_ID" -o "%i %T %N"   # one job ID, state RUNNING, all nodes listed
srun --jobid="$SLURM_JOB_ID" --overlap bash -lc 'hostname; nvidia-smi -L | wc -l'   # --overlap: share resources with the running step
# expect: each node prints its hostname and the per-node GPU count you requested
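
The same check can be made mechanical. This is a sketch only: it assumes the probe is changed to print one `hostname count` line per node (so interleaved output from different nodes stays parseable), and `check_gpu_counts` is a hypothetical helper, not a Slurm tool:

```shell
#!/bin/sh
# check_gpu_counts EXPECTED_NODES EXPECTED_GPUS_PER_NODE
# Reads "hostname gpu_count" lines on stdin. Prints OK and returns 0 only if the
# number of distinct hosts and every per-host count match the request.
check_gpu_counts() {
  awk -v nodes="$1" -v gpus="$2" '
    NF >= 2 { count[$1] = $2 }
    END {
      bad = 0; n = 0
      for (h in count) {
        n++
        if (count[h] + 0 != gpus + 0) {
          printf "MISMATCH %s: %s GPUs\n", h, count[h]; bad = 1
        }
      }
      if (n != nodes + 0) {
        printf "MISMATCH nodes: saw %d, expected %d\n", n, nodes; bad = 1
      }
      if (!bad) print "OK"
      exit bad
    }'
}

# Example for the 8-node x 8-GPU job above:
#   srun --jobid="$SLURM_JOB_ID" --overlap \
#     bash -lc 'echo "$(hostname) $(nvidia-smi -L | wc -l)"' | check_gpu_counts 8 8
```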

Verify that topology packing landed the job on the fewest leaf switches your topology.conf defines (cross-spine hops inflate all-reduce time). Confirm the NCCL fast path with NCCL_DEBUG=INFO: the inter-node channel lines should name a GPUDirect RDMA transport (the GDRDMA marker) rather than a socket (TCP) fallback (networking fabric, diagnostics).
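
A log from a large job is too long to scan by eye, so the transport check is worth scripting. The sketch below classifies a captured `NCCL_DEBUG=INFO` log. The two substrings it matches (`GDRDMA` and `NET/Socket`) are assumptions about the log text and should be confirmed against a log from your NCCL version; `nccl_transport` is a hypothetical helper name:

```shell
#!/bin/sh
# nccl_transport: reads an NCCL_DEBUG=INFO log on stdin and prints one of
#   gdrdma  - a GPUDirect RDMA marker was seen and no socket transport was
#   socket  - at least one line names the socket transport (TCP fallback)
#   unknown - neither marker found (truncated log, or a single-node job)
nccl_transport() {
  awk '
    /NET\/Socket/ { sock = 1 }
    /GDRDMA/      { gdr = 1 }
    END {
      if (sock)     print "socket"
      else if (gdr) print "gdrdma"
      else          print "unknown"
    }'
}

# Example:
#   NCCL_DEBUG=INFO srun ... 2>&1 | tee nccl.log
#   nccl_transport < nccl.log
```

A `socket` result is reported even when some channels use RDMA, since a single slow path can set the pace of every all-reduce.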

Kubernetes: confirm gang admission, not partial placement. Under a gang scheduler, either all Pods of the PodGroup are Running or none are scheduled; there should be no state with some workers Running and the rest Pending waiting on GPUs:

kubectl get pods -n ml -l volcano.sh/job-name=ddp-train -o wide   # all Running together, or all Pending (label assumed to be set by Volcano)
kubectl get podgroups -n ml                       # gang admitted as a unit
kubectl exec -n ml ddp-train-worker-0 -- nvidia-smi -L   # devices visible in the Pod (name assumes <job>-<task>-<index>)

If you instead see some workers Running and holding GPUs while their peers stay Pending, the gang scheduler is not actually in the path: the Pods went through the default scheduler. This is the canonical Kubernetes-for-GPUs failure.⁶ Validate the RDMA/NCCL path from inside a Pod exactly as on Slurm: NCCL_DEBUG=INFO should show the GDRDMA marker on inter-node channels rather than a socket fallback (Kubernetes for GPUs).
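
Partial placement can be detected without reading the Pod table by eye. The sketch below classifies a job by whether each Pod has been bound to a node, which separates "scheduled but still pulling its image" from "never scheduled" better than the phase column does. It assumes tab-separated `name<TAB>nodeName` lines on stdin; `gang_state` is a hypothetical helper and the `volcano.sh/job-name` label in the example is an assumption to verify on your cluster:

```shell
#!/bin/sh
# gang_state: reads "pod_name<TAB>node_name" lines on stdin and prints
#   placed  - every Pod is bound to a node
#   waiting - no Pod is bound (gang not yet admitted)
#   partial - some bound, some not: the partial-placement failure
#   empty   - no Pods matched
gang_state() {
  awk -F '\t' '
    BEGIN    { bound = 0; total = 0 }
    $1 == "" { next }                       # skip blank lines
             { total++; if ($2 != "") bound++ }
    END {
      if (total == 0)          print "empty"
      else if (bound == total) print "placed"
      else if (bound == 0)     print "waiting"
      else                     print "partial"
    }'
}

# Example:
#   kubectl get pods -n ml -l volcano.sh/job-name=ddp-train \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.nodeName}{"\n"}{end}' \
#     | gang_state
```

A `partial` result that persists beyond a scheduling cycle means the Pods are not going through the gang scheduler.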

Failure modes

Brief; each links its deeper treatment.

Online serving forced onto Slurm: no native ingress/autoscale; push latency-SLO inference to Kubernetes (inference serving).
Distributed job on the Kubernetes default scheduler: partial placement, GPUs idle, deadlock. Always run a gang scheduler for multi-Pod jobs⁶ (Kubernetes for GPUs).
"Gang scheduling" terminology mismatch: in Slurm, gang scheduling means time-sliced suspend/resume of multiple jobs sharing nodes, not single-job co-allocation³; the Kubernetes "gang" (all-or-nothing PodGroup) maps to Slurm's default single-job co-allocation. Do not conflate them when comparing.
Slurm and Kubernetes contending for the same nodes: no clear partition boundary leads to double-scheduling. Partition the fleet or use Slinky/Soperator (orchestration overview).
Topology ignored on either side: no topology.conf (Slurm) or no topology-aware scheduler/Topology Manager (Kubernetes) scatters a job across distant switches, inflating every all-reduce (performance tuning, networking fabric).
Time-slicing mistaken for isolation on Kubernetes: no per-tenant memory cap; use MIG for hard isolation (security and multi-tenancy).

References

Slurm documentation: https://slurm.schedmd.com/documentation.html
Slurm gang scheduling (time-sliced suspend/resume semantics): https://slurm.schedmd.com/gang_scheduling.html
Slurm cons_tres (fine-grained GPU/CPU allocation, co-scheduling): https://slurm.schedmd.com/cons_tres.html
Slurm multifactor priority / fair-share: https://slurm.schedmd.com/priority_multifactor.html · QOS: https://slurm.schedmd.com/qos.html
Slurm topology (topology/tree, topology/block): https://slurm.schedmd.com/topology.html
Slinky (Slurm on Kubernetes, SchedMD): https://slurm.schedmd.com/slinky.html · slurm-operator: https://github.com/SlinkyProject/slurm-operator
NVIDIA: running large-scale GPU workloads on Kubernetes with Slurm (Slinky + GPU Operator + DRA): https://developer.nvidia.com/blog/running-large-scale-gpu-workloads-on-kubernetes-with-slurm/
Soperator (Nebius, Slurm-on-Kubernetes operator): https://github.com/nebius/soperator
Kubernetes concepts: https://kubernetes.io/docs/concepts/ · DRA GA in v1.34: https://kubernetes.io/blog/2025/09/01/kubernetes-v1-34-dra-updates/
KAI Scheduler (NVIDIA, CNCF Sandbox): https://github.com/NVIDIA/KAI-Scheduler · Volcano: https://volcano.sh/en/docs/ · Kueue: https://kueue.sigs.k8s.io/

Related: Slurm · Kubernetes · Orchestration · Glossary

¹ Slinky overview: "SchedMD's set of projects to enable interoperability between Slurm and Kubernetes"; slurm-operator: "Run Slurm on Kubernetes. Manage and scale Slurm clusters on Kubernetes as pods." The Slurm-vs-Kubernetes framing ("indefinite ... vague resource requirements ... loose policy ... scale its resource pool infinitely" vs "finite ... well defined resource requirements and topology ... strict policy ... resource pool is known") is from the slurm-operator project docs. https://slurm.schedmd.com/slinky.html · https://github.com/SlinkyProject/slurm-operator
² Slurm SelectType=select/cons_tres allocates individual cores/GPUs/memory rather than whole nodes ("jobs can be co-scheduled on nodes when resources permit it"), with --gpus=, --gpus-per-node=, --mem-per-gpu=. https://slurm.schedmd.com/cons_tres.html
³ Slurm gang scheduling is "timesliced gang scheduling in which two or more jobs are allocated to the same resources in the same partition and these jobs are alternately suspended": time-slicing of multiple jobs, distinct from co-allocating one job's tasks. https://slurm.schedmd.com/gang_scheduling.html
⁴ Slurm multifactor priority plugin (PriorityType=priority/multifactor) sums weighted factors including Fairshare and QOS; fair-share requires the Slurm accounting DB for assigned shares and consumed usage; QOS limits take precedence over association limits. https://slurm.schedmd.com/priority_multifactor.html · https://slurm.schedmd.com/qos.html
⁵ Slurm topology plugins: topology/tree (switch hierarchy) and topology/block, declared in slurm.conf with the fabric in topology.conf, pack jobs onto the fewest switches. https://slurm.schedmd.com/topology.html
⁶ The Kubernetes default scheduler schedules pod-by-pod and does not provide job-level (gang) scheduling; all-or-nothing requires the Coscheduling plugin, Volcano, KAI, or similar via a PodGroup. https://kubedl.io/docs/training/gangscheduling/ · https://www.alibabacloud.com/blog/the-burgeoning-kubernetes-scheduling-system-part-2-coscheduling-and-gang-scheduling-that-support-batch-jobs_597319
⁷ KAI Scheduler: NVIDIA, Apache-2.0, accepted as a CNCF Sandbox project; PodGroups are the atomic gang unit; supports topology-aware (and hierarchical) scheduling. Originated in Run:ai. https://github.com/NVIDIA/KAI-Scheduler
⁸ Kueue: "a kubernetes-native system that manages quotas and how jobs consume them"; provides quota, cohort fair-sharing, and all-or-nothing admission, suspending Jobs until quota/resources are available. https://kueue.sigs.k8s.io/docs/overview/
⁹ Dynamic Resource Allocation (DRA) core graduated to GA in Kubernetes v1.34 with the stable resource.k8s.io/v1 API enabled by default; workloads declare device properties and the scheduler allocates the actual devices. https://kubernetes.io/blog/2025/09/01/kubernetes-v1-34-dra-updates/
¹⁰ NVIDIA documents the Slinky slurm-operator running Slurm components as CRDs on Kubernetes with NVIDIA GPU Operator and DRA/ComputeDomains integration for topology-aware multinode GPU scheduling, and notes SchedMD is part of NVIDIA. https://developer.nvidia.com/blog/running-large-scale-gpu-workloads-on-kubernetes-with-slurm/
¹¹ Soperator (Nebius, open source) is a Kubernetes operator that turns a SlurmCluster custom resource into a working Slurm cluster including the GPU driver / CUDA / NCCL stack, shared storage, health checks, and accounting. https://github.com/nebius/soperator