Provisioning & scheduling¶
Scope: overview and decision index for turning bare-metal nodes into a managed, schedulable cluster. This page frames the provisioning lifecycle and routes each stage to its focused HOW page: out-of-band management, imaging, and the HPC scheduler world, where cloud-native and traditional HPC conventions meet.
flowchart LR
OOB["OOB management"] --> IMAGE["PXE or image"]
IMAGE --> BASELINE["Firmware and driver baseline"]
BASELINE --> SCHED["Scheduler advertises GPUs"]
SCHED --> DRAIN["Health gates and drain"]
Overview¶
Once hardware is racked and healthy, it must be provisioned at fleet scale and made schedulable. This is where Kubernetes-native experience meets the traditional HPC stack, which has its own conventions (Slurm, bare-metal imaging, out-of-band management) distinct from cloud-native patterns.
Focused pages¶
Each lifecycle stage below links to the focused page that implements it. Use this index to jump straight to the HOW.
- OOB management. OOB management & BMC: use when you need lights-out access to nodes independent of the OS.
- OOB protocol (legacy). IPMI: use when working with legacy BMCs or ipmitool power/sensor/SOL flows.
- OOB protocol (modern). Redfish: use when scripting modern REST/JSON management, firmware updates, or inventory.
- OOB network. OOB network infrastructure: use when designing or troubleshooting the separate management network (SN2201, VXLAN).
- OS imaging. PXE / network boot: use when standing up DHCP/TFTP/PXE to boot and install nodes at scale.
- Image baseline. image & config management: use when enforcing a uniform driver/CUDA/firmware image across the fleet.
- Tooling choice. provisioning tooling: use when choosing between Base Command, Warewulf, MAAS, xCAT, Ironic, etc.
- Health gating. GPU health gating: use when wiring health checks so unhealthy nodes are drained, not scheduled.
- Topology placement. Slurm topology placement: use when configuring topology.conf for rail-local, topology-aware scheduling.
- Scheduler decision. Slurm vs Kubernetes: use when deciding which scheduler (or both) fits a given workload.
Core knowledge¶
Out-of-band management¶
- The BMC gives lights-out access independent of the OS. IPMI is the legacy protocol; Redfish is its modern REST/JSON successor and the one to reach for in new work. The OOB network is physically separate (in SuperPOD it runs on SN2201 switches and is carried as a VXLAN). See networking fabric.
- HOW: OOB management & BMC, IPMI, Redfish, OOB network infrastructure.
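As a minimal sketch of the Redfish side, assuming a BMC that exposes the standard `/redfish/v1/Systems` collection, the function below walks the collection and reports each system's `PowerState`. The `fetch` callable, `FAKE_BMC`, and `fake_fetch` are hypothetical stand-ins so the example runs offline; a real script would replace `fetch` with an authenticated HTTPS GET against the BMC over the OOB network.

```python
from typing import Callable, Dict

# A fetcher takes a Redfish resource path and returns the decoded JSON body.
Fetch = Callable[[str], dict]


def power_states(fetch: Fetch) -> Dict[str, str]:
    """Map each system's @odata.id to its reported PowerState."""
    collection = fetch("/redfish/v1/Systems")
    states: Dict[str, str] = {}
    for member in collection.get("Members", []):
        path = member["@odata.id"]
        system = fetch(path)
        # PowerState is a standard ComputerSystem property; report a
        # missing value as "Unknown" rather than guessing.
        states[path] = system.get("PowerState", "Unknown")
    return states


# Hypothetical canned responses standing in for a live BMC.
FAKE_BMC = {
    "/redfish/v1/Systems": {
        "Members": [{"@odata.id": "/redfish/v1/Systems/1"}]
    },
    "/redfish/v1/Systems/1": {"PowerState": "On"},
}


def fake_fetch(path: str) -> dict:
    """Offline fetcher used only for illustration."""
    return FAKE_BMC[path]
```

Calling `power_states(fake_fetch)` exercises the same traversal a live check would use. Injecting the fetcher keeps transport concerns (TLS, credentials, timeouts) out of the parsing logic, which is one reasonable way to make an OOB reachability check testable.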
Bare-metal provisioning¶
- PXE / network boot with DHCP/TFTP handles OS imaging at scale; image and config management enforces uniform driver/CUDA/firmware baselines (see commissioning); the tooling choice spans NVIDIA Base Command / Mission Control and open tooling (Warewulf, MAAS, xCAT, Ironic).
- HOW: PXE / network boot, image & config management, provisioning tooling.
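The uniform-baseline idea can be sketched as a simple comparison of each node's reported versions against one expected set. Everything named here is an assumption for illustration: the `BASELINE` keys and version strings are placeholders, and the inventory dictionary stands in for whatever the site's config management or inventory tooling actually collects.

```python
from typing import Dict, List

# Hypothetical baseline; real values come from the site's golden image.
BASELINE = {"driver": "570.00", "cuda": "12.8", "firmware": "1.2.3"}


def find_drift(
    baseline: Dict[str, str],
    inventory: Dict[str, Dict[str, str]],
) -> Dict[str, List[str]]:
    """Return, per node, the components that differ from the baseline."""
    drift: Dict[str, List[str]] = {}
    for node, versions in sorted(inventory.items()):
        mismatched = [
            f"{key}: expected {want}, found {versions.get(key, 'missing')}"
            for key, want in baseline.items()
            if versions.get(key) != want
        ]
        if mismatched:
            drift[node] = mismatched
    return drift


# Hypothetical inventory: node02 has drifted on CUDA.
INVENTORY = {
    "node01": dict(BASELINE),
    "node02": {"driver": "570.00", "cuda": "12.6", "firmware": "1.2.3"},
}
```

Here `find_drift(BASELINE, INVENTORY)` flags only `node02`. Nodes that match the baseline are omitted from the result, so an empty dictionary means the fleet is uniform for the components checked.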
Scheduling¶
- Slurm is the dominant HPC workload manager (partitions, gang scheduling, topology-aware placement). It is the classic HPC-ops skill and the most likely gap for someone coming from a Kubernetes background. Kubernetes (and k3s) with the NVIDIA GPU Operator is the cloud-native path. In one line: Slurm for tightly-coupled training and topology-aware placement; Kubernetes for heterogeneous, multi-tenant, service-style workloads (see Kubernetes for GPUs). Many sites run both.
- HOW: Slurm vs Kubernetes, Slurm topology placement, GPU health gating.
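To make topology-aware placement concrete, here is a small sketch that renders a two-level tree in the `SwitchName=... Nodes=...` / `SwitchName=... Switches=...` line format that Slurm's tree topology plugin reads from topology.conf. The switch and node names are hypothetical, and a real rail-optimized fabric may need a different model than one spine over a set of leaves.

```python
from typing import Dict, List


def render_topology(leaves: Dict[str, List[str]], spine: str = "spine1") -> str:
    """Render one line per leaf switch, then one line for the spine."""
    lines = [
        f"SwitchName={leaf} Nodes={','.join(nodes)}"
        for leaf, nodes in leaves.items()
    ]
    # The spine lists the leaf switches beneath it, not the nodes.
    lines.append(f"SwitchName={spine} Switches={','.join(leaves)}")
    return "\n".join(lines)


# Hypothetical layout: two leaf switches with two nodes each.
EXAMPLE_LEAVES = {
    "leaf1": ["node01", "node02"],
    "leaf2": ["node03", "node04"],
}
```

Generating the file from the same source of truth as the cabling plan is one way to keep the scheduler's view of the fabric from drifting away from the physical layout.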
Don't-miss checklist¶
- Confirm OOB reachability to every node before anything else.
- Drive a uniform image and firmware baseline; drift is the root of intermittent faults.
- Make scheduling topology-aware so collectives stay rail-local where possible.
- Integrate GPU health gating so unhealthy nodes are drained, not scheduled (see GPU performance and health).
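The health-gating item can be sketched as a function that turns per-node check results into a Slurm drain request. It assumes a Slurm cluster where `scontrol update NodeName=... State=DRAIN Reason=...` is the drain mechanism; the check names (`ecc`, `nvlink`, `xid`) are hypothetical labels, and the function only builds the argument list rather than executing it.

```python
from typing import Dict, List, Optional


def drain_command(node: str, checks: Dict[str, bool]) -> Optional[List[str]]:
    """Return an scontrol argv that drains the node, or None if all checks pass."""
    failed = sorted(name for name, ok in checks.items() if not ok)
    if not failed:
        return None
    # Record which checks failed so operators can see why the node left service.
    reason = "health gate failed: " + ",".join(failed)
    return [
        "scontrol",
        "update",
        f"NodeName={node}",
        "State=DRAIN",
        f"Reason={reason}",
    ]
```

For example, `drain_command("node02", {"ecc": False, "nvlink": True, "xid": False})` yields a command whose reason names `ecc` and `xid`, while a node with all checks passing yields `None`. Returning the argv as a list lets the caller pass it to `subprocess.run` without shell quoting concerns around the spaces in the reason string.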
Failure modes¶
- OOB network treated as an afterthought, leaving no lights-out path when a node hangs.
- Image drift across the fleet causing non-reproducible failures.
- Topology-unaware scheduling spreading a tightly-coupled job across the spine and starving it.
Open questions & validation¶
- Slurm: partitions, gang scheduling, and topology-aware (topology.conf) placement, the HPC counterpart to the Kubernetes gang schedulers in Kubernetes for GPUs.
- Redfish and PXE/imaging at bare-metal scale: the metal-level workflow cloud abstractions hide (playbooks in Ansible bring-up).
Deep-dive pages¶
- Out-of-band: OOB management & BMC · IPMI · Redfish · OOB network infrastructure
- Bare-metal provisioning: PXE / network boot · image & config management · provisioning tooling
- Scheduling & health: GPU health gating · Slurm topology placement · Slurm vs Kubernetes. The schedulers themselves: Slurm · Kubernetes · k3s.
- Runbooks: OOB / BMC unreachable · image drift · topology-unaware scheduling
References¶
- DGX SuperPOD components and management networks: https://docs.nvidia.com/dgx-superpod/reference-architecture/scalable-infrastructure-b300-xdr/latest/dgx-superpod-components.html
Related: Fabric · Commissioning · Performance · Glossary