Slurm vs Kubernetes for GPUs

Scope: a decision guide for picking the workload manager on a GPU cluster: batch HPC (Slurm) versus service orchestration (Kubernetes). It compares the two on gang scheduling, multi-tenancy, topology placement, and the operational model, and covers the hybrid (Slurm-on-Kubernetes) path. This is the comparison page; the per-technology deep dives live in Slurm and Kubernetes, and the family overview in orchestration overview. Read those for the actual job scripts and manifests; this page only decides between them.

Every command and manifest below is a reference template, not hardware-tested. Scheduler behaviour, plugin names, API versions, and feature gates vary by Slurm and Kubernetes release. Verify against the cited docs for your installed versions before scripting against them. Treat all printed output as illustrative, never as a target.

What it is

Two different answers to "given this fleet of GPU nodes, what runs where, and when does it start".

  • Slurm (slurm.schedmd.com) is an HPC batch workload manager. You submit a finite job with declared resources (sbatch/srun), it queues, and the scheduler grants the whole allocation at once and launches all tasks together over MPI/PMIx. Resources are tracked as TRES (trackable resources); GPUs are GRES. The resource pool is fixed and known; the scheduler's job is to pack finite jobs onto it well. Slurm is the default for bare-metal pretraining (Slurm, distributed training).
  • Kubernetes is a declarative container orchestrator. You POST desired state (objects) to the api-server; controllers reconcile actual toward desired in a continuous loop; the scheduler binds Pods to nodes one Pod at a time. It targets long-running services with loose, often elastic resource requirements, and can grow the node pool on demand through an autoscaler. GPUs are not understood by core Kubernetes; the NVIDIA GPU Operator and the device plugin / DRA make a node GPU-aware (Kubernetes, Kubernetes for GPUs).

SchedMD frames the split directly: Kubernetes "excels at scheduling workloads that ... run for an indefinite amount of time with potentially vague resource requirements on a single node with loose policy, but can scale its resource pool infinitely"; Slurm "excels at quickly scheduling workloads that run for a finite amount of time, with well defined resource requirements and topology, on multiple nodes, with strict policy, but its resource pool is known." [1]

flowchart LR
  WL["GPU workload"] --> Q{"Shape?"}
  Q -->|"finite, tightly-coupled, bare metal"| S["Slurm: batch, gang, topology"]
  Q -->|"long-running services, multi-tenant, containers"| K["Kubernetes: declarative, GitOps"]
  Q -->|"both: real Slurm on K8s"| H["Slinky / Soperator"]

Why it's needed (and when)

The choice is load-bearing because the two schedulers make opposite default assumptions, and forcing the wrong one is expensive in idle GPU-hours.

  • Tightly-coupled synchronous training (data/tensor/pipeline parallel, one all-reduce per step) needs all ranks placed at once or not at all, and placed close on the fabric. Slurm does this natively; on Kubernetes it is an add-on (see gang scheduling below).
  • Long-running online inference needs stable Service VIPs, DNS, ingress, autoscaling, and rolling updates, the native Kubernetes substrate (inference serving). Slurm is a batch scheduler, not a service mesh, and fits offline/batch scoring far better than low-latency serving.
  • Multi-tenant platforms mixing many teams, services, and batch benefit from Kubernetes namespaces/RBAC/quota and GitOps. A single-purpose training cluster is simpler under Slurm.
  • Operational model: Slurm is imperative (submit, query, cancel) and bare-metal-first; Kubernetes is declarative (apply YAML, reconcile) and container-first. Teams already fluent in one pay a real tax to operate the other.

Rule of thumb: Slurm for the finite, coupled, topology-sensitive batch job on known hardware; Kubernetes for the indefinite, loosely-coupled, multi-tenant service on elastic hardware. Most large sites end up running both (orchestration overview).

Comparison

| Dimension | Slurm | Kubernetes |
|---|---|---|
| Primary workload | Finite batch / HPC jobs | Long-running services + batch |
| Unit of work | Job (tasks launched together) | Pod (scheduled one at a time) |
| Operating model | Imperative (sbatch/srun), bare-metal-first | Declarative reconcile, container-first |
| Gang / co-allocation | Native: a job's allocation is granted whole [2] | Add-on: KAI / Volcano / Coscheduling PodGroup [6][7] |
| GPU model | GRES / TRES; --gres=gpu:N, cons_tres for sharing [2] | Device plugin (nvidia.com/gpu) or DRA (GA in 1.34) [9] |
| Topology placement | topology.conf (topology/tree, topology/block) [5] | Topology Manager + topology-aware scheduling (KAI/Volcano) |
| Multi-tenancy | Associations, QOS, fair-share via accounting DB [4] | Namespaces, RBAC, ResourceQuota; Kueue for fair-share [8] |
| Quota / queue fairness | Multifactor priority + fair-share [4] | Kueue (quota, cohorts, all-or-nothing admission) [8] |
| Online serving | Weak (no native ingress/autoscale) | Strong (KServe, Service, HPA/KEDA) (inference serving) |
| Node elasticity | Fixed pool (cloud bursting via plugins) | Cluster Autoscaler / Karpenter grow node groups |
| State / SPOF | slurmctld (+ optional backup), slurmdbd | etcd quorum; api-server |
| GitOps | Not native | Native (Argo CD / Flux) |

How it's set up & managed

The two pages Slurm and Kubernetes carry the full configs; below is only the comparison-relevant surface: how each expresses the same intent ("run an 8-GPU-per-node, multi-node, all-or-nothing training job").

Slurm: co-allocation is the default. The scheduler grants the whole node set and srun launches every rank together; nothing extra is needed for all-or-nothing:

#!/bin/bash
#SBATCH --job-name=train
#SBATCH --partition=blackwell
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8
#SBATCH --exclusive
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)
srun torchrun --nnodes="$SLURM_NNODES" --nproc_per_node=8 \
  --rdzv_backend=c10d --rdzv_endpoint="$MASTER_ADDR:23456" train.py --fsdp --bf16

Topology-aware packing is declarative in slurm.conf + topology.conf (rail-aware, fewest leaf switches) [5]; GPU sharing within a node requires SelectType=select/cons_tres [2]. Multi-tenancy rests on the accounting DB (slurmdbd) plus the multifactor priority plugin with non-zero fair-share/QOS weights: fair-share needs recorded usage from the DB, and QOS limits override association limits [4].

# slurm.conf — multi-tenant fairness + fine-grained GPU sharing
SelectType=select/cons_tres
PriorityType=priority/multifactor
PriorityWeightFairshare=100000
PriorityWeightQOS=10000
AccountingStorageEnforce=associations,limits,qos
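
The slurm.conf settings above only take effect once accounts, users, and QOS records exist in the accounting DB. The following is a minimal sketch of creating them with sacctmgr, assuming the cluster is already registered in slurmdbd. The account, user, and QOS names, the share value, and the GPU cap are hypothetical. The script is dry-run by default and only prints the commands; set DRY_RUN=0 to execute them.

```shell
# Dry-run wrapper: prints the command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# setup_tenant ACCOUNT USER SHARES QOS GPU_CAP   (all values are hypothetical)
setup_tenant() (
  account=$1; user=$2; shares=$3; qos=$4; gpu_cap=$5
  run sacctmgr -i add account "$account" Fairshare="$shares"        # fair-share weight for the team
  run sacctmgr -i add user "$user" Account="$account"               # association: user under account
  run sacctmgr -i add qos "$qos"
  run sacctmgr -i modify qos "$qos" set GrpTRES=gres/gpu="$gpu_cap" # cap on GPUs in use under this QOS
  run sacctmgr -i modify account "$account" set QOS+="$qos"         # allow the account to use the QOS
)

setup_tenant team-a alice 50 team-a-train 64
```

After running it for real, `sshare -l` and `sacctmgr show assoc` are the usual places to confirm that shares and limits landed where intended.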

Kubernetes: the default scheduler binds Pods one at a time, so a multi-Pod distributed job must run under a gang scheduler; otherwise it can place only some Pods and deadlock, with idle GPUs held by Pods that can never all start [6]. Volcano expresses all-or-nothing with minAvailable:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: ddp-train
  namespace: ml
spec:
  schedulerName: volcano
  minAvailable: 2              # all-or-nothing: both Pods start or neither does
  tasks:
    - name: worker
      replicas: 2              # one Pod per node: 2 nodes x 8 GPUs = 16 GPUs
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: trainer
              image: nvcr.io/nvidia/pytorch:25.05-py3   # pin to a real NGC tag
              # rendezvous flags omitted here; wire them as in the Slurm example
              command: ["torchrun", "--nnodes=2", "--nproc_per_node=8", "train.py"]
              resources:
                limits:
                  nvidia.com/gpu: 8   # matches --nproc_per_node

The equivalent of Slurm's cons_tres/fair-share on Kubernetes is split across components: DRA for flexible device requests (GA in 1.34, default API resource.k8s.io/v1) [9]; KAI Scheduler (NVIDIA, Apache-2.0, CNCF Sandbox) or Volcano for gang + topology-aware placement, the PodGroup being the atomic gang unit [7]; Kueue for quota, cohorts, and all-or-nothing admission, which suspends a Job until its quota and resources are free [8]. Each is a separate install on top of the GPU Operator (Kubernetes for GPUs, the Kubernetes platform).
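
To make the Kueue piece concrete, here is a hedged sketch of a per-team GPU quota, written as a shell function that only renders the manifests so they can be reviewed or piped to `kubectl apply --dry-run=client -f -`. The `kueue.x-k8s.io/v1beta1` API version, the object names, and the cpu/memory placeholder quotas are assumptions; check them against the Kueue release you run.

```shell
# render_kueue_quota TEAM NAMESPACE GPUS
# Prints a ResourceFlavor, a ClusterQueue with a GPU quota, and a LocalQueue.
# Names and the cpu/memory figures are placeholders, not recommendations.
render_kueue_quota() (
  team=$1; ns=$2; gpus=$3
  cat <<EOF
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: ${team}-cq
spec:
  namespaceSelector: {}
  resourceGroups:
    - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
      flavors:
        - name: default-flavor
          resources:
            - name: "cpu"
              nominalQuota: 512
            - name: "memory"
              nominalQuota: 4Ti
            - name: "nvidia.com/gpu"
              nominalQuota: ${gpus}
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: ${team}-queue
  namespace: ${ns}
spec:
  clusterQueue: ${team}-cq
EOF
)

render_kueue_quota team-a ml 16
```

A workload opts in by carrying the label `kueue.x-k8s.io/queue-name` set to the LocalQueue name; Kueue then holds it suspended until the quota fits.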

Hybrid: real Slurm on Kubernetes

The two need not be rival clusters. Slinky is SchedMD's project set "to enable interoperability between Slurm and Kubernetes": the slurm-operator runs an actual Slurm cluster (slurmctld, slurmd, slurmdbd, slurmrestd) as Kubernetes Pods/CRDs, so Slurm's batch scheduling runs inside a Kubernetes-managed control plane [1]. NVIDIA documents Slinky integrating the GPU Operator and DRA/ComputeDomains for topology-aware multi-node GPU scheduling [10]. (SchedMD is now part of NVIDIA [10].) Soperator (Nebius, open source) is an alternative operator that turns a SlurmCluster custom resource into a working Slurm cluster with the GPU/driver/NCCL stack and health checks [11]. This is the "both" answer: run Kubernetes as the substrate, let Slurm schedule the coupled batch jobs on it.

flowchart LR
  K8S["Kubernetes substrate"] --> OP["Slinky slurm-operator"]
  OP --> CTLD["slurmctld + slurmdbd pods"]
  OP --> D["slurmd pods, GPU"]
  GPUOP["GPU Operator + DRA"] -.->|"GPUs, topology"| D

A simpler split, when an operator is more than you want, is to statically partition the fleet: some nodes in a Slurm pool, some in a Kubernetes pool, with a hard boundary so the two schedulers never contend for the same node (orchestration overview).
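
A static partition is only as good as its boundary, so it is worth checking that no node is registered with both schedulers. The sketch below compares two plain node lists; the file names and node names are made up. It assumes both schedulers report the same form of hostname (short versus fully qualified); normalise the lists first if they do not.

```shell
# node_overlap SLURM_LIST K8S_LIST
# Prints node names that appear in both files (one name per line in each file).
node_overlap() (
  a=$(mktemp); b=$(mktemp)
  sort -u "$1" > "$a"
  sort -u "$2" > "$b"
  comm -12 "$a" "$b"      # lines common to both sorted lists
  rm -f "$a" "$b"
)

# Producing the two lists on a live cluster (not run here):
#   sinfo -h -N -o '%N' > slurm_nodes.txt
#   kubectl get nodes -o name | sed 's|^node/||' > k8s_nodes.txt

# Self-contained demo with made-up node names:
demo=$(mktemp -d)
printf '%s\n' gpu01 gpu02 gpu03 > "$demo/slurm_nodes.txt"
printf '%s\n' gpu03 gpu04 > "$demo/k8s_nodes.txt"
node_overlap "$demo/slurm_nodes.txt" "$demo/k8s_nodes.txt"   # prints gpu03
rm -rf "$demo"
```

Empty output means the pools are disjoint; any printed name is a node both schedulers believe they own.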

Validated usage & tests

Reference templates: the notes describe what the output should show; do not assume specific numbers.

Slurm: confirm the gang is co-allocated and GPUs are visible. A correctly placed job shows all requested nodes RUNNING under one job ID, and the task on each node sees its 8 devices:

squeue -j "$SLURM_JOB_ID" -o "%i %T %N"   # one job ID, state RUNNING, all nodes listed
# --overlap (Slurm 20.11+) lets this step share resources already held by the training step
srun --jobid="$SLURM_JOB_ID" --overlap bash -lc 'hostname; nvidia-smi -L | wc -l'
# expect: each node prints its hostname and the per-node GPU count you requested

Verify topology packing landed the job on the fewest leaf switches your topology.conf defines (cross-spine hops inflate all-reduce time); confirm the NCCL fast path with NCCL_DEBUG=INFO showing [GDRDMA] rather than a TCP fallback (networking fabric, diagnostics).
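
One way to check the leaf-switch count is to map each allocated node to its leaf switch and count the distinct switches. The sketch below assumes a hypothetical two-column mapping file ("node switch" per line), which you would derive from your own topology.conf, for example by expanding each switch's node range with `scontrol show hostnames`. The node and switch names are made up.

```shell
# count_leaf_switches MAP_FILE NODE...
# MAP_FILE holds one "node switch" pair per line (hypothetical format).
# Prints the number of distinct leaf switches the given nodes sit under;
# returns status 2 if any node is missing from the map.
count_leaf_switches() (
  map_file=$1; shift
  printf '%s\n' "$@" | awk -v map="$map_file" '
    BEGIN { while ((getline line < map) > 0) { split(line, f, " "); sw[f[1]] = f[2] } }
    { if ($1 in sw) seen[sw[$1]] = 1; else unknown++ }
    END { n = 0; for (s in seen) n++; print n; if (unknown) exit 2 }'
)

# Demo with made-up names: two leaves, three nodes.
map=$(mktemp)
printf '%s\n' 'node01 leaf1' 'node02 leaf1' 'node03 leaf2' > "$map"
count_leaf_switches "$map" node01 node02     # prints 1
count_leaf_switches "$map" node01 node03     # prints 2
rm -f "$map"
```

On a live job, the node arguments would come from `scontrol show hostnames "$SLURM_JOB_NODELIST"`. A count above the minimum your job size needs suggests the topology plugin is not steering placement.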

Kubernetes: confirm gang admission, not partial placement. Under a gang scheduler, either all Pods of the PodGroup are Running or none are scheduled; there should be no state with some workers Running and the rest Pending waiting on GPUs:

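# Volcano labels Job Pods volcano.sh/job-name=<job> and names them <job>-<task>-<index>; adjust if your version differs
kubectl get pods -n ml -l volcano.sh/job-name=ddp-train -o wide   # all Running together, or all Pending
kubectl get podgroups -n ml                                       # gang admitted as a unit
kubectl exec -n ml ddp-train-worker-0 -- nvidia-smi -L            # devices visible in the Pod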

If you instead see some workers Running holding GPUs while peers stay Pending, the gang scheduler is not actually in the path (jobs went through the default scheduler). This is the canonical Kubernetes-for-GPUs failure [6]. Validate the RDMA/NCCL path from inside a Pod exactly as on Slurm: NCCL_DEBUG=INFO must report [GDRDMA] (Kubernetes for GPUs).
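
The partial-placement check can be scripted. Pod phase alone is ambiguous, because a Pod that is bound to a node but still pulling its image also reports Pending. This sketch instead classifies Pods by whether they have a node assigned. It reads "pod<TAB>node" lines on stdin; the selector variable and Pod names are placeholders. A healthy gang may show as partial for a moment while binds are in flight, so treat only a persistent partial result as a failure.

```shell
# gang_state: reads "pod<TAB>node" lines on stdin; node is empty for unbound Pods.
gang_state() {
  awk -F '\t' '
    NF == 0  { next }
    $2 != "" { bound++ }
    $2 == "" { unbound++ }
    END {
      if (bound + unbound == 0) print "no-pods"
      else if (unbound == 0)    print "all-placed"
      else if (bound == 0)      print "none-placed"
      else                      print "partial: " bound " bound, " unbound " unbound"
    }'
}

# On a live cluster (not run here), with SELECTOR matching the job's Pods:
#   kubectl get pods -n ml -l "$SELECTOR" \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.nodeName}{"\n"}{end}' | gang_state

# Demo with made-up Pods: one bound, one still unbound.
printf 'worker-0\tnode-a\nworker-1\t\n' | gang_state   # prints: partial: 1 bound, 1 unbound
```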

Failure modes

Brief; each links its deeper treatment.

  • Online serving forced onto Slurm: no native ingress/autoscale; push latency-SLO inference to Kubernetes (inference serving).
  • Distributed job on the Kubernetes default scheduler: partial placement, GPUs idle, deadlock. Always run a gang scheduler for multi-Pod jobs [6] (Kubernetes for GPUs).
  • "Gang scheduling" terminology mismatch: in Slurm, gang scheduling means time-sliced suspend/resume of multiple jobs sharing nodes, not single-job co-allocation [3]; the Kubernetes "gang" (all-or-nothing PodGroup) maps to Slurm's default single-job co-allocation. Do not conflate them when comparing.
  • Slurm and Kubernetes contending for the same nodes: no clear partition boundary leads to double-scheduling. Partition the fleet or use Slinky/Soperator (orchestration overview).
  • Topology ignored on either side: no topology.conf (Slurm) or no topology-aware scheduler/Topology Manager (Kubernetes) scatters a job across distant switches, inflating every all-reduce (performance tuning, networking fabric).
  • Time-slicing mistaken for isolation on Kubernetes: GPU time-slicing puts no per-tenant cap on GPU memory and gives no fault isolation; use MIG for hard isolation (security and multi-tenancy).

References

  • Slurm documentation: https://slurm.schedmd.com/documentation.html
  • Slurm gang scheduling (time-sliced suspend/resume semantics): https://slurm.schedmd.com/gang_scheduling.html
  • Slurm cons_tres (fine-grained GPU/CPU allocation, co-scheduling): https://slurm.schedmd.com/cons_tres.html
  • Slurm multifactor priority / fair-share: https://slurm.schedmd.com/priority_multifactor.html · QOS: https://slurm.schedmd.com/qos.html
  • Slurm topology (topology/tree, topology/block): https://slurm.schedmd.com/topology.html
  • Slinky (Slurm on Kubernetes, SchedMD): https://slurm.schedmd.com/slinky.html · slurm-operator: https://github.com/SlinkyProject/slurm-operator
  • NVIDIA: running large-scale GPU workloads on Kubernetes with Slurm (Slinky + GPU Operator + DRA): https://developer.nvidia.com/blog/running-large-scale-gpu-workloads-on-kubernetes-with-slurm/
  • Soperator (Nebius, Slurm-on-Kubernetes operator): https://github.com/nebius/soperator
  • Kubernetes concepts: https://kubernetes.io/docs/concepts/ · DRA GA in v1.34: https://kubernetes.io/blog/2025/09/01/kubernetes-v1-34-dra-updates/
  • KAI Scheduler (NVIDIA, CNCF Sandbox): https://github.com/NVIDIA/KAI-Scheduler · Volcano: https://volcano.sh/en/docs/ · Kueue: https://kueue.sigs.k8s.io/

Related: Slurm · Kubernetes · Orchestration · Glossary


  1. Slinky overview — "SchedMD's set of projects to enable interoperability between Slurm and Kubernetes"; slurm-operator: "Run Slurm on Kubernetes. Manage and scale Slurm clusters on Kubernetes as pods." The Slurm-vs-Kubernetes framing ("indefinite ... vague resource requirements ... loose policy ... scale its resource pool infinitely" vs "finite ... well defined resource requirements and topology ... strict policy ... resource pool is known") is from the slurm-operator project docs. https://slurm.schedmd.com/slinky.html · https://github.com/SlinkyProject/slurm-operator 

  2. Slurm SelectType=select/cons_tres allocates individual cores/GPUs/memory rather than whole nodes ("jobs can be co-scheduled on nodes when resources permit it"), with --gpus=, --gpus-per-node=, --mem-per-gpu=. https://slurm.schedmd.com/cons_tres.html 

  3. Slurm gang scheduling is "timesliced gang scheduling in which two or more jobs are allocated to the same resources in the same partition and these jobs are alternately suspended" — time-slicing of multiple jobs, distinct from co-allocating one job's tasks. https://slurm.schedmd.com/gang_scheduling.html 

  4. Slurm multifactor priority plugin (PriorityType=priority/multifactor) sums weighted factors including Fairshare and QOS; fair-share requires the Slurm accounting DB for assigned shares and consumed usage; QOS limits take precedence over association limits. https://slurm.schedmd.com/priority_multifactor.html · https://slurm.schedmd.com/qos.html 

  5. Slurm topology plugins — topology/tree (switch hierarchy) and topology/block, declared in slurm.conf with the fabric in topology.conf, pack jobs onto the fewest switches. https://slurm.schedmd.com/topology.html 

  6. The Kubernetes default scheduler schedules pod-by-pod and does not provide job-level (gang) scheduling; all-or-nothing requires the Coscheduling plugin, Volcano, KAI, or similar via a PodGroup. https://kubedl.io/docs/training/gangscheduling/ · https://www.alibabacloud.com/blog/the-burgeoning-kubernetes-scheduling-system-part-2-coscheduling-and-gang-scheduling-that-support-batch-jobs_597319 

  7. KAI Scheduler — NVIDIA, Apache-2.0, accepted as a CNCF Sandbox project; PodGroups are the atomic gang unit; supports topology-aware (and hierarchical) scheduling. Originated in Run:ai. https://github.com/NVIDIA/KAI-Scheduler 

  8. Kueue — "a kubernetes-native system that manages quotas and how jobs consume them"; provides quota, cohort fair-sharing, and all-or-nothing admission, suspending Jobs until quota/resources are available. https://kueue.sigs.k8s.io/docs/overview/ 

  9. Dynamic Resource Allocation (DRA) core graduated to GA in Kubernetes v1.34 with the stable resource.k8s.io/v1 API enabled by default; workloads declare device properties and the scheduler allocates the actual devices. https://kubernetes.io/blog/2025/09/01/kubernetes-v1-34-dra-updates/ 

  10. NVIDIA documents the Slinky slurm-operator running Slurm components as CRDs on Kubernetes with NVIDIA GPU Operator and DRA/ComputeDomains integration for topology-aware multinode GPU scheduling, and notes SchedMD is part of NVIDIA. https://developer.nvidia.com/blog/running-large-scale-gpu-workloads-on-kubernetes-with-slurm/ 

  11. Soperator (Nebius, open source) is a Kubernetes operator that turns a SlurmCluster custom resource into a working Slurm cluster including the GPU driver / CUDA / NCCL stack, shared storage, health checks, and accounting. https://github.com/nebius/soperator