Cluster orchestration: Kubernetes, k3s, Ray, Slurm

Scope: overview, decision, and index page for the orchestration layer, which decides what runs where. It covers the HPC batch world (Slurm), the cloud-native world (Kubernetes/k3s), and the Python-native distributed runtime (Ray), and how they compose. This page frames the families and the trade-offs; the implementation detail lives in the focused pages below. It builds on the scheduling overview in provisioning and scheduling and on the K8s GPU platform described in Kubernetes for GPUs and the Kubernetes platform.

The focused pages provide reference templates written against real APIs (Slurm, KubeRay). Pin versions. The same workload can often run under more than one of these orchestrators, so choose by workload shape, not dogma.

Focused pages

  • Orchestration decision guide: use this when you need the full decision matrix to pick Slurm vs Kubernetes vs k3s vs Ray for a given workload.
  • KubeRay integration: use this when running Ray on Kubernetes via the operator (RayCluster/RayJob/RayService CRDs).
  • Ray on Slurm: use this when launching a Ray cluster inside a Slurm allocation without Kubernetes.
  • Slurm: use this when implementing HPC batch: partitions, gang scheduling, topology.conf, sbatch/srun, MPI.
  • Kubernetes: use this when standing up the cloud-native GPU platform control plane.
  • k3s: use this when you want lightweight single-binary Kubernetes for edge, small, CI, or dev clusters.
  • Ray: use this when building on the Python-native distributed runtime (Train/Serve/Data/RLlib).

Paradigm: three orchestration families, composable

An orchestrator answers the question "given this fleet, what runs where, and when?". Three families dominate, each with different native assumptions:

  • Slurm, HPC batch: bare-metal, gang-scheduled, topology-aware MPI jobs. The training default in HPC.
  • Kubernetes / k3s, cloud-native: containers, services, multi-tenant, declarative. The platform default (Kubernetes for GPUs).
  • Ray, a Python-native distributed runtime: tasks/actors, with libraries for training, serving, data, and RL. The substrate most RL stacks build on (RL libraries).

They are not mutually exclusive: Ray runs on Kubernetes (KubeRay) or Slurm, and many sites run Slurm and Kubernetes side by side.

Architecture: choosing and composing orchestrators

flowchart TB
  WL["GPU workload"] --> Q{"Workload shape?"}
  Q -->|"tightly-coupled training, bare metal"| SLURM["Slurm"]
  Q -->|"containers, multi-tenant, services"| K8S["Kubernetes / k3s"]
  Q -->|"Python distributed, RL, pipelines"| RAY["Ray"]
  RAY -.->|"KubeRay"| K8S
  RAY -.->|"Ray-on-Slurm"| SLURM

Slurm (HPC batch)

Slurm is the dominant HPC workload manager (provisioning and scheduling). Its building blocks are partitions, gang scheduling, topology.conf for rail-aware placement, sbatch/srun, and native MPI launch. It is the best fit for tightly-coupled multi-node training on bare metal. For sbatch/srun/torchrun templates and partition config, see Slurm.
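As a rough illustration of how a Slurm-launched job maps onto a training launch, the sketch below translates the per-task variables that srun sets (SLURM_PROCID, SLURM_NTASKS, SLURM_LOCALID) into the RANK/WORLD_SIZE/LOCAL_RANK variables that PyTorch's environment-variable rendezvous reads. It assumes one Slurm task per GPU. The helper name, the explicit master_addr argument, and the port value are illustrative assumptions, not the template from the Slurm page.

```python
from typing import Mapping


def torch_env_from_slurm(
    slurm_env: Mapping[str, str],
    master_addr: str,
    master_port: int = 29500,  # assumed port; pick one that is free on the head node
) -> dict[str, str]:
    """Map per-task Slurm variables to the ones torch.distributed reads.

    Assumes one Slurm task per GPU, so the Slurm task rank is the global rank.
    """
    return {
        "RANK": slurm_env["SLURM_PROCID"],        # global task index
        "WORLD_SIZE": slurm_env["SLURM_NTASKS"],  # total tasks in the job step
        "LOCAL_RANK": slurm_env["SLURM_LOCALID"], # task index within the node
        "MASTER_ADDR": master_addr,               # hostname of the first node
        "MASTER_PORT": str(master_port),
    }


if __name__ == "__main__":
    # Hypothetical values for task 5 of a 2-node, 4-GPU-per-node job.
    example = {"SLURM_PROCID": "5", "SLURM_NTASKS": "8", "SLURM_LOCALID": "1"}
    print(torch_env_from_slurm(example, master_addr="node001"))
```

Inside a real job the function would be called with os.environ, and the first hostname of the allocation would be passed as master_addr.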

Kubernetes & k3s

The full GPU platform is covered in Kubernetes for GPUs and the Kubernetes platform: GPU Operator, gang scheduling (KAI or Volcano), DRA, and RDMA. k3s is a lightweight, single-binary Kubernetes distribution (a CNCF project) that exposes the same API with a smaller footprint, suited to edge nodes, small clusters, CI, and dev. Use full K8s for the datacentre control plane, and k3s where the overhead of full K8s is not worth it.

Ray (Python-native distributed runtime)

Ray runs Python tasks (stateless) and actors (stateful) across a cluster made up of a head node (global control store + scheduler) and worker nodes. On top sit Ray Train (distributed training; wraps PyTorch, including FSDP), Ray Serve (inference serving), Ray Data (distributed data processing), and RLlib (classic RL). Most LLM-RL libraries (RL libraries) use Ray as their coordination layer, because RL has to run a rollout engine and a trainer as separate actor groups.

Architecture: Ray cluster

flowchart TB
  subgraph Head["Ray head node"]
    GCS["GCS: global control store"]
    SCHED["Scheduler"]
  end
  Head --> W1["Worker: actors + tasks (GPU)"]
  Head --> W2["Worker: actors + tasks (GPU)"]
  W1 --- W2

Composition: KubeRay and Ray-on-Slurm

KubeRay is the operator that runs Ray on Kubernetes via three CRDs: RayCluster (a head+worker cluster), RayJob (run-to-completion), RayService (long-running serving). This is how Ray-based training/RL/serving lands on a GPU K8s platform (the Kubernetes platform). For the CRD manifests, GPU/RDMA resource wiring, and operator topology, see KubeRay integration.
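To make the shape of a RayCluster concrete, here is a minimal sketch that builds the object as a plain Python dict and prints it as JSON, which kubectl accepts in place of YAML. The top-level field names (rayVersion, headGroupSpec, workerGroupSpecs, rayStartParams, template) follow the ray.io/v1 CRD. The function name, the group and container names, the head resource limits, and the image and version placeholders are assumptions for illustration. Treat the manifests on the KubeRay integration page as the reference, and confirm the CRD version against the installed operator.

```python
import json


def ray_cluster_manifest(name: str, image: str, ray_version: str,
                         workers: int, gpus_per_worker: int) -> dict:
    """Build a minimal RayCluster object (ray.io/v1) as a plain dict."""

    def pod_template(container_name: str, limits: dict) -> dict:
        # A bare pod template: one container with resource limits.
        return {"spec": {"containers": [{
            "name": container_name,
            "image": image,
            "resources": {"limits": limits},
        }]}}

    return {
        "apiVersion": "ray.io/v1",
        "kind": "RayCluster",
        "metadata": {"name": name},
        "spec": {
            "rayVersion": ray_version,
            "headGroupSpec": {
                "rayStartParams": {},
                # Assumed sizing for the head; adjust for the site.
                "template": pod_template("ray-head", {"cpu": "2", "memory": "8Gi"}),
            },
            "workerGroupSpecs": [{
                "groupName": "gpu-workers",  # hypothetical group name
                "replicas": workers,
                "minReplicas": workers,
                "maxReplicas": workers,
                "rayStartParams": {},
                "template": pod_template(
                    "ray-worker", {"nvidia.com/gpu": str(gpus_per_worker)}
                ),
            }],
        },
    }


if __name__ == "__main__":
    manifest = ray_cluster_manifest(
        name="demo",
        image="registry.example.com/ray:PINNED_TAG",  # placeholder, pin a real tag
        ray_version="PINNED_VERSION",                 # placeholder
        workers=2,
        gpus_per_worker=8,
    )
    print(json.dumps(manifest, indent=2))
```

Generating the object in code is only one option; its main benefit here is that the pinned image tag and Ray version live in one place. RDMA resources and gang-scheduling annotations are omitted on purpose.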

Ray-on-Slurm: a Slurm sbatch script starts the Ray head on the first node of the allocation and Ray workers on the remaining nodes, then runs the Ray program against that cluster. This brings Ray's actor model to an HPC cluster without Kubernetes. For the launcher script and head/worker discovery, see Ray on Slurm.
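The head-then-workers sequence can be sketched as a small Python helper that renders an sbatch script. The shell it emits follows the general pattern of the Ray documentation's Slurm guide (scontrol show hostnames, one srun per node, ray start --head, then ray start --address). The function name, the fixed sleep-based waits, and the entrypoint are illustrative assumptions; a production launcher needs proper readiness checks and site-specific options.

```python
def render_ray_sbatch(nodes: int, gpus_per_node: int,
                      entrypoint: str, port: int = 6379) -> str:
    """Render an sbatch script that starts a Ray head, then workers, then the job."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --nodes={nodes}",
        "#SBATCH --ntasks-per-node=1",
        f"#SBATCH --gpus-per-node={gpus_per_node}",
        "",
        'nodes=($(scontrol show hostnames "$SLURM_JOB_NODELIST"))',
        'head_node="${nodes[0]}"',
        "# May need trimming to a single address if the node reports several IPs.",
        'head_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)',
        "",
        "# Head on the first node of the allocation.",
        f'srun --nodes=1 --ntasks=1 -w "$head_node" ray start --head --port={port} --block &',
        "sleep 10  # crude wait for the head to come up",
        "",
        "# Workers on every remaining node.",
        'for node in "${nodes[@]:1}"; do',
        f'  srun --nodes=1 --ntasks=1 -w "$node" ray start --address "$head_ip:{port}" --block &',
        "done",
        "sleep 10  # crude wait for workers to register",
        "",
        "# Point the driver at the cluster that was just started.",
        f'export RAY_ADDRESS="$head_ip:{port}"',
        entrypoint,
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # train.py is a hypothetical Ray program.
    print(render_ray_sbatch(nodes=4, gpus_per_node=8, entrypoint="python -u train.py"))
```

Writing the output to a file and submitting it with sbatch gives a cluster whose lifetime is the Slurm allocation: when the job ends, the Ray head and workers go with it.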

Decision matrix

Pick by workload shape: Slurm for tightly-coupled bare-metal training, Kubernetes for multi-tenant platforms/services, k3s for edge/small/dev, Ray for RL/pipelines/Python-native distributed (on K8s via KubeRay or on Slurm), and Ray Serve / KServe / Dynamo for inference at scale on K8s (inference serving, disaggregated inference). For the full matrix with caveats, see orchestration decision guide.
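The summary above can be written as a small rule function. This is a deliberately simplified sketch: the Workload fields, the order in which rules are checked, and the returned labels are assumptions made for illustration, and the caveats in the orchestration decision guide take precedence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    tightly_coupled: bool = False        # multi-node MPI/NCCL training
    bare_metal: bool = False
    python_distributed: bool = False     # RL, pipelines, actor-style programs
    multi_tenant_services: bool = False  # platform or long-running services
    small_footprint: bool = False        # edge, small, CI, or dev cluster


def pick_orchestrator(w: Workload, platform: str = "kubernetes") -> str:
    """Return a first-guess orchestrator for a workload shape.

    `platform` is the scheduler the site already runs ("kubernetes" or "slurm");
    Ray is layered on top of it rather than run as a separate stack.
    """
    if w.python_distributed:
        return "Ray on Slurm" if platform == "slurm" else "Ray on Kubernetes (KubeRay)"
    if w.tightly_coupled and w.bare_metal:
        return "Slurm"
    if w.small_footprint:
        return "k3s"
    return "Kubernetes"


if __name__ == "__main__":
    print(pick_orchestrator(Workload(tightly_coupled=True, bare_metal=True)))
    print(pick_orchestrator(Workload(python_distributed=True), platform="slurm"))
    print(pick_orchestrator(Workload(multi_tenant_services=True)))
```

Real decisions rarely reduce to a single branch. Many sites end up with more than one answer, which is why the composition options (KubeRay, Ray-on-Slurm) matter.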

Don't-miss checklist

  • Choose by workload shape: Slurm for tightly-coupled HPC, K8s for platform/services, Ray for RL/pipelines.
  • Run Ray on the existing platform via KubeRay rather than a parallel stack; reuse the gang scheduler and GPU Operator.
  • Keep topology-awareness wherever the scheduler supports it (topology.conf, K8s topology manager, performance tuning).
  • Pin Ray/k3s/Slurm versions; expose RDMA devices to Ray worker pods the same way as to any other pod.

Failure modes

  • Running Ray as a second, ungoverned stack beside K8s: duplicate scheduling, no quota.
  • Slurm and K8s fighting over the same nodes without a clear partition boundary.
  • Ray workers without RDMA: NCCL falls back to TCP for Ray Train (performance tuning).
  • k3s used for a large datacentre control plane it was not sized for.
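One cheap guard against the Slurm-versus-K8s contention listed above is to compare the two node inventories and flag any overlap. The sketch below assumes the node lists have already been collected, for example from `sinfo -N -h -o "%N"` and `kubectl get nodes -o name`, and that both schedulers know each host by the same name. The function name and the sample hostnames are hypothetical.

```python
from typing import Iterable


def contested_nodes(slurm_nodes: Iterable[str], k8s_nodes: Iterable[str]) -> list[str]:
    """Return hostnames that appear in both the Slurm and Kubernetes inventories."""
    # `kubectl get nodes -o name` prefixes each entry with "node/".
    k8s_names = {name.removeprefix("node/") for name in k8s_nodes}
    return sorted(set(slurm_nodes) & k8s_names)


if __name__ == "__main__":
    slurm = ["gpu001", "gpu002", "gpu003"]        # hypothetical Slurm nodes
    k8s = ["node/gpu003", "node/gpu004"]          # hypothetical Kubernetes nodes
    overlap = contested_nodes(slurm, k8s)
    if overlap:
        print("Nodes claimed by both schedulers:", overlap)
```

An empty result does not prove the boundary is sound, since hostnames may differ between short and fully qualified forms, but a non-empty one is worth investigating before both schedulers place work.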

Open questions & validation

  • Confirm KubeRay CRD version (ray.io/v1) and the Ray image tag for the target release.
  • Validate Ray Train NCCL/RDMA path inside KubeRay pods with nccl-tests (workload recipes).
  • Decide the Slurm-vs-K8s (or both) boundary and node partitioning for the site.

References

  • Slurm: https://slurm.schedmd.com/documentation.html
  • Ray: https://docs.ray.io/en/latest/ · KubeRay: https://docs.ray.io/en/latest/cluster/kubernetes/index.html
  • k3s: https://docs.k3s.io/
  • Ray Train / Serve: https://docs.ray.io/en/latest/train/train.html · https://docs.ray.io/en/latest/serve/index.html
  • Ray on Slurm: https://docs.ray.io/en/latest/cluster/vms/user-guides/community/slurm.html

Related: Provisioning · Kubernetes · Inference · K8s Platform · Fine-tuning · RL Libraries · Glossary