Provisioning & scheduling

Scope: overview and decision index for turning bare-metal nodes into a managed, schedulable cluster. This page frames the provisioning lifecycle and routes each stage to its focused HOW page, covering out-of-band management, imaging, and the HPC scheduler world, where cloud-native and traditional HPC conventions meet.

flowchart LR
  OOB["OOB management"] --> IMAGE["PXE or image"]
  IMAGE --> BASELINE["Firmware and driver baseline"]
  BASELINE --> SCHED["Scheduler advertises GPUs"]
  SCHED --> DRAIN["Health gates and drain"]

Overview

Once hardware is racked and healthy, it must be provisioned at fleet scale and made schedulable. This is where Kubernetes-native experience meets the traditional HPC stack, which has its own conventions (Slurm, bare-metal imaging, out-of-band management) distinct from cloud-native patterns.

Focused pages

Each lifecycle stage below links to the focused page that implements it. Use this index to jump straight to the HOW.

OOB management. OOB management & BMC: use when you need lights-out access to nodes independent of the OS.
OOB protocol (legacy). IPMI: use when working with legacy BMCs or ipmitool power/sensor/SOL flows.
OOB protocol (modern). Redfish: use when scripting modern REST/JSON management, firmware updates, or inventory.
OOB network. OOB network infrastructure: use when designing or troubleshooting the separate management network (SN2201, VXLAN).
OS imaging. PXE / network boot: use when standing up DHCP/TFTP/PXE to boot and install nodes at scale.
Image baseline. image & config management: use when enforcing a uniform driver/CUDA/firmware image across the fleet.
Tooling choice. provisioning tooling: use when choosing between Base Command, Warewulf, MAAS, xCAT, Ironic, etc.
Health gating. GPU health gating: use when wiring health checks so unhealthy nodes are drained, not scheduled.
Topology placement. Slurm topology placement: use when configuring topology.conf for rail-local, topology-aware scheduling.
Scheduler decision. Slurm vs Kubernetes: use when deciding which scheduler (or both) fits a given workload.

Core knowledge

Out-of-band management

The baseboard management controller (BMC) gives lights-out access independent of the OS. IPMI is the legacy protocol; Redfish is its modern REST/JSON successor and the one to reach for now. The OOB network is physically separate (in SuperPOD it runs on SN2201 switches and is carried as a VXLAN). See networking fabric.
HOW: OOB management & BMC, IPMI, Redfish, OOB network infrastructure.
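
To make the Redfish REST/JSON model concrete, here is a minimal sketch that builds, but does not send, a `ComputerSystem.Reset` request and reads `PowerState` from a system resource. The BMC hostname and the system ID `1` are hypothetical; real system IDs vary by vendor and are normally discovered from the `/redfish/v1/Systems` collection. Authentication and TLS verification are left out on purpose.

```python
import json
import urllib.request


def reset_request(bmc_host: str, system_id: str,
                  reset_type: str = "ForceRestart") -> urllib.request.Request:
    """Build (but do not send) a Redfish ComputerSystem.Reset request."""
    url = (
        f"https://{bmc_host}/redfish/v1/Systems/{system_id}"
        "/Actions/ComputerSystem.Reset"
    )
    body = json.dumps({"ResetType": reset_type}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def power_state(system_resource: dict) -> str:
    """Read the PowerState property from a ComputerSystem resource payload."""
    return system_resource.get("PowerState", "Unknown")


# Hypothetical BMC address and system ID, for illustration only.
req = reset_request("bmc-node001.oob.example", "1")
print(req.get_method(), req.full_url)

# A trimmed, made-up ComputerSystem payload.
print(power_state({"Id": "1", "PowerState": "On"}))
```

Keeping request construction separate from sending makes the logic easy to test without a BMC; a real client would add session or basic authentication and pass the request to `urllib.request.urlopen` or an equivalent HTTP library.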

Bare-metal provisioning

Bare-metal provisioning combines three things: PXE / network boot (with DHCP and TFTP) for OS imaging at scale; image and config management for uniform driver/CUDA/firmware baselines (see commissioning); and a tooling choice spanning NVIDIA Base Command / Mission Control and open tooling (Warewulf, MAAS, xCAT, Ironic).
HOW: PXE / network boot, image & config management, provisioning tooling.
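
A uniform baseline is only useful if drift from it is detectable. The sketch below compares each node's reported versions against a single baseline and lists the mismatches. The component names, version strings, and node names are placeholders, and how the inventory is collected (Redfish, an agent, Ansible facts) is outside the sketch.

```python
from typing import Dict, Optional, Tuple

Versions = Dict[str, str]
Diff = Dict[str, Tuple[str, Optional[str]]]


def find_drift(baseline: Versions, inventory: Dict[str, Versions]) -> Dict[str, Diff]:
    """Return, per node, each component whose version differs from the baseline.

    Each difference is recorded as (expected, actual); actual is None when the
    node did not report that component at all.
    """
    drift: Dict[str, Diff] = {}
    for node, versions in sorted(inventory.items()):
        diffs = {
            component: (expected, versions.get(component))
            for component, expected in baseline.items()
            if versions.get(component) != expected
        }
        if diffs:
            drift[node] = diffs
    return drift


# Placeholder baseline and inventory, for illustration only.
baseline = {"gpu_driver": "A.1", "cuda": "B.2", "gpu_firmware": "C.3"}
inventory = {
    "node001": {"gpu_driver": "A.1", "cuda": "B.2", "gpu_firmware": "C.3"},
    "node002": {"gpu_driver": "A.0", "cuda": "B.2", "gpu_firmware": "C.3"},
}

for node, diffs in find_drift(baseline, inventory).items():
    for component, (expected, actual) in diffs.items():
        print(f"{node}: {component} expected {expected}, found {actual}")
```

Only components named in the baseline are checked, so extra packages on a node are ignored; a stricter check could flag those as well.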

Scheduling

Slurm is the dominant HPC workload manager (partitions, gang scheduling, topology-aware placement). It is the classic HPC-ops skill and the most likely gap for someone coming from a Kubernetes background. Kubernetes (and k3s) with the NVIDIA GPU Operator is the cloud-native path. In one line: Slurm for tightly-coupled training and topology-aware placement; Kubernetes for heterogeneous, multi-tenant, service-style workloads (see Kubernetes for GPUs). Many sites run both.
HOW: Slurm vs Kubernetes, Slurm topology placement, GPU health gating.
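
As an illustration of topology-aware placement, the sketch below renders a two-level `topology.conf` in the `SwitchName=... Nodes=...` / `SwitchName=... Switches=...` form used by Slurm's tree topology plugin. The switch and node names are hypothetical, and a single spine over a flat set of leaves is an assumption; a rail-optimized fabric may need a different model than this simple tree.

```python
from typing import Dict, List


def render_topology(leaves: Dict[str, List[str]], spine: str = "spine1") -> str:
    """Render a two-level tree: one line per leaf switch, then one spine line."""
    lines = [
        f"SwitchName={leaf} Nodes={','.join(nodes)}"
        for leaf, nodes in leaves.items()
    ]
    # The spine connects every leaf switch listed above.
    lines.append(f"SwitchName={spine} Switches={','.join(leaves)}")
    return "\n".join(lines)


# Hypothetical leaf-to-node mapping, for illustration only.
leaves = {
    "leaf1": ["node001", "node002"],
    "leaf2": ["node003", "node004"],
}
print(render_topology(leaves))
```

Generating the file from the same inventory that drives cabling keeps the scheduler's view of the fabric in step with the physical layout.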

Don't-miss checklist

Confirm OOB reachability to every node before anything else.
Drive a uniform image and firmware baseline; drift is the root of intermittent faults.
Make scheduling topology-aware so collectives stay rail-local where possible.
Integrate GPU health gating so unhealthy nodes are drained, not scheduled (see GPU performance and health).
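
The health-gating item can be sketched as a small decision step: given per-node health check results, build a `scontrol update ... State=DRAIN` command for every node that failed at least one check. The check names and node names are hypothetical, and the sketch only constructs the argument lists; it does not run them.

```python
from typing import Dict, List


def drain_commands(health: Dict[str, List[str]]) -> List[List[str]]:
    """Build one `scontrol update` argv per node that has failed checks.

    `health` maps a node name to the list of checks that failed on it;
    an empty list means the node is healthy and is left alone.
    """
    commands = []
    for node, failures in sorted(health.items()):
        if failures:
            reason = "health gate: " + ", ".join(failures)
            commands.append(
                ["scontrol", "update", f"NodeName={node}",
                 "State=DRAIN", f"Reason={reason}"]
            )
    return commands


# Hypothetical health results, for illustration only.
health = {
    "node001": [],
    "node002": ["ecc_errors", "link_down"],
}
for argv in drain_commands(health):
    print(" ".join(argv))
```

Returning argument lists rather than shell strings means each one can be passed to `subprocess.run` without shell quoting, and the gate can be dry-run before it is allowed to change node state.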

Failure modes

OOB network treated as an afterthought, leaving no lights-out path when a node hangs.
Image drift across the fleet causing non-reproducible failures.
Topology-unaware scheduling spreading a tightly-coupled job across the spine and starving it of bandwidth.
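
The last failure mode is easy to check for after the fact. This minimal sketch, assuming a two-level leaf/spine fabric and a hypothetical node-to-leaf mapping, reports which leaf switches a job's allocation touches and whether its traffic must therefore cross the spine.

```python
from typing import Dict, Iterable, Set


def leaves_spanned(allocation: Iterable[str], node_to_leaf: Dict[str, str]) -> Set[str]:
    """Return the set of leaf switches that the allocated nodes hang off."""
    return {node_to_leaf[node] for node in allocation}


def crosses_spine(allocation: Iterable[str], node_to_leaf: Dict[str, str]) -> bool:
    """True when the allocation touches more than one leaf switch."""
    return len(leaves_spanned(allocation, node_to_leaf)) > 1


# Hypothetical fabric mapping and job allocations, for illustration only.
node_to_leaf = {
    "node001": "leaf1",
    "node002": "leaf1",
    "node003": "leaf2",
}
print(crosses_spine(["node001", "node002"], node_to_leaf))
print(crosses_spine(["node001", "node003"], node_to_leaf))
```

Crossing the spine is not always avoidable for large jobs, so a check like this is better used to flag small jobs that could have fit under a single leaf.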

Open questions & validation

Slurm: partitions, gang scheduling, and topology-aware (topology.conf) placement, the HPC counterpart to the Kubernetes gang schedulers in Kubernetes for GPUs.
Redfish and PXE/imaging at bare-metal scale: the metal-level workflow cloud abstractions hide (playbooks in Ansible bring-up).

Deep-dive pages

Out-of-band: OOB management & BMC · IPMI · Redfish · OOB network infrastructure
Bare-metal provisioning: PXE / network boot · image & config management · provisioning tooling
Scheduling & health: GPU health gating · Slurm topology placement · Slurm vs Kubernetes. The schedulers themselves: Slurm · Kubernetes · k3s.
Runbooks: OOB / BMC unreachable · image drift · topology-unaware scheduling

References

DGX SuperPOD components and management networks: https://docs.nvidia.com/dgx-superpod/reference-architecture/scalable-infrastructure-b300-xdr/latest/dgx-superpod-components.html

Related: Fabric · Commissioning · Performance · Glossary