Provisioning & scheduling

Scope: overview and decision index for turning bare-metal nodes into a managed, schedulable cluster. This page frames the provisioning lifecycle and routes each stage to its focused HOW page. It covers out-of-band management, imaging, and scheduling, the point where cloud-native and traditional HPC conventions meet.

flowchart LR
  OOB["OOB management"] --> IMAGE["PXE or image"]
  IMAGE --> BASELINE["Firmware and driver baseline"]
  BASELINE --> SCHED["Scheduler advertises GPUs"]
  SCHED --> DRAIN["Health gates and drain"]

Overview

Once hardware is racked and healthy, it must be provisioned at fleet scale and made schedulable. This is where Kubernetes-native experience meets the traditional HPC stack, which has its own conventions (Slurm, bare-metal imaging, out-of-band management) distinct from cloud-native patterns.

Focused pages

Each lifecycle stage below links to the focused page that covers it in depth. Use this index to jump straight to the HOW.

  • OOB management. OOB management & BMC: use when you need lights-out access to nodes independent of the OS.
  • OOB protocol (legacy). IPMI: use when working with legacy BMCs or ipmitool power/sensor/SOL flows.
  • OOB protocol (modern). Redfish: use when scripting modern REST/JSON management, firmware updates, or inventory.
  • OOB network. OOB network infrastructure: use when designing or troubleshooting the separate management network (SN2201, VXLAN).
  • OS imaging. PXE / network boot: use when standing up DHCP/TFTP/PXE to boot and install nodes at scale.
  • Image baseline. Image & config management: use when enforcing a uniform driver/CUDA/firmware image across the fleet.
  • Tooling choice. Provisioning tooling: use when choosing between Base Command, Warewulf, MAAS, xCAT, Ironic, etc.
  • Health gating. GPU health gating: use when wiring health checks so unhealthy nodes are drained, not scheduled.
  • Topology placement. Slurm topology placement: use when configuring topology.conf for rail-local, topology-aware scheduling.
  • Scheduler decision. Slurm vs Kubernetes: use when deciding which scheduler (or both) fits a given workload.

Core knowledge

Out-of-band management

Bare-metal provisioning

Scheduling

  • Slurm is the dominant HPC workload manager (partitions, gang scheduling, topology-aware placement). It is the classic HPC-ops skill and the most likely gap for someone coming from a Kubernetes background. Kubernetes (or k3s) with the NVIDIA GPU Operator is the cloud-native path (see Kubernetes for GPUs). In one line: Slurm for tightly-coupled training and topology-aware placement; Kubernetes for heterogeneous, multi-tenant, service-style workloads. Many sites run both.
  • HOW: Slurm vs Kubernetes, Slurm topology placement, GPU health gating.
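
To make the topology-aware placement idea concrete, here is a minimal sketch that emits topology.conf lines for Slurm's tree topology plugin (TopologyPlugin=topology/tree in slurm.conf). It assumes a simple two-level tree in which each node hangs off exactly one leaf switch; the switch names, node names, and the helper build_topology_conf are hypothetical, and a real rail-optimized fabric will need a site-specific mapping.

```python
def build_topology_conf(leaf_to_nodes, spine_name="spine"):
    """Return topology.conf lines for a two-level tree: leaves under one spine.

    leaf_to_nodes maps a leaf switch name to a list of Slurm hostlist
    expressions (hypothetical names used for illustration only).
    """
    if not leaf_to_nodes:
        return []
    lines = []
    for leaf, nodes in sorted(leaf_to_nodes.items()):
        # Each leaf switch line lists the compute nodes attached to it.
        lines.append(f"SwitchName={leaf} Nodes={','.join(nodes)}")
    # The spine line lists the leaf switches it interconnects.
    leaves = ",".join(sorted(leaf_to_nodes))
    lines.append(f"SwitchName={spine_name} Switches={leaves}")
    return lines


# Hypothetical layout: two leaf switches with four nodes each.
example = {
    "leaf1": ["node[001-004]"],
    "leaf2": ["node[005-008]"],
}
conf_lines = build_topology_conf(example)
print("\n".join(conf_lines))
```

With a layout like this in place, the scheduler can prefer allocations that fit under a single leaf before spilling across the spine.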

Don't-miss checklist

  • Confirm OOB reachability to every node before anything else.
  • Drive a uniform image and firmware baseline; drift is a common root of intermittent, hard-to-reproduce faults.
  • Make scheduling topology-aware so collectives stay rail-local where possible.
  • Integrate GPU health gating so unhealthy nodes are drained, not scheduled (see GPU performance and health).
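
The health-gating item can be sketched as a small function that turns health-check results into Slurm drain commands. This is an illustration under stated assumptions: the input format, the failure labels, and the helper drain_commands are hypothetical, and the sketch only builds the scontrol update command lines rather than running them.

```python
import shlex


def drain_commands(health_by_node, reason_prefix="health-gate"):
    """Build one `scontrol update` argv per node that reported failures.

    health_by_node maps a node name to a list of failed-check labels
    (hypothetical labels; an empty list means the node passed).
    """
    commands = []
    for node, failures in sorted(health_by_node.items()):
        if not failures:
            continue  # healthy node: leave it schedulable
        reason = f"{reason_prefix}: {', '.join(failures)}"
        commands.append([
            "scontrol", "update",
            f"NodeName={node}", "State=DRAIN", f"Reason={reason}",
        ])
    return commands


# Hypothetical results from a health-check run.
results = {
    "node001": [],
    "node002": ["xid-error", "ecc-uncorrectable"],
}
for argv in drain_commands(results):
    # Print a shell-quoted form for review before anything is executed.
    print(shlex.join(argv))
```

Keeping command construction separate from execution makes the gate easy to dry-run and to review before it touches the scheduler.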

Failure modes

  • OOB network treated as an afterthought, leaving no lights-out path when a node hangs.
  • Image drift across the fleet causing non-reproducible failures.
  • Topology-unaware scheduling spreading a tightly-coupled job across the spine, so its collectives are starved of bandwidth.
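
Image drift in particular lends itself to a simple automated check. The sketch below assumes an inventory has already been collected as a mapping from node name to a (driver, firmware) tuple; the version labels and the helper find_drift are hypothetical. It treats the most common tuple as the baseline, which is a convenience only: comparing against a declared baseline is the safer choice when one exists.

```python
from collections import Counter


def find_drift(inventory):
    """Return (baseline, drifted) where drifted holds nodes off the baseline.

    The baseline is taken to be the most common version tuple in the
    inventory, which is an assumption rather than a declared target.
    """
    if not inventory:
        return None, {}
    counts = Counter(inventory.values())
    baseline, _ = counts.most_common(1)[0]
    drifted = {node: v for node, v in inventory.items() if v != baseline}
    return baseline, drifted


# Hypothetical inventory with placeholder version labels.
inventory = {
    "node001": ("driver-A", "fw-1"),
    "node002": ("driver-A", "fw-1"),
    "node003": ("driver-A", "fw-2"),
}
baseline, drifted = find_drift(inventory)
print("baseline:", baseline)
print("drifted:", drifted)
```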

Open questions & validation

  • Slurm: partitions, gang scheduling, and topology-aware (topology.conf) placement, the HPC counterpart to the Kubernetes gang schedulers in Kubernetes for GPUs.
  • Redfish and PXE/imaging at bare-metal scale: the metal-level workflow cloud abstractions hide (playbooks in Ansible bring-up).
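
As a small taste of the Redfish side, the sketch below extracts power state and health from a Redfish ComputerSystem resource. PowerState and Status.Health are standard Redfish properties; the sample payload is hand-written rather than captured from hardware, the helper summarize_system is hypothetical, and fetching the resource from a BMC (typically an HTTPS GET on a member of /redfish/v1/Systems) is deliberately left out.

```python
import json


def summarize_system(payload):
    """Pull id, power state, and health out of a ComputerSystem resource."""
    status = payload.get("Status", {})
    return {
        "id": payload.get("Id"),
        "power": payload.get("PowerState"),
        "health": status.get("Health"),
    }


# Hand-written sample payload, not captured from real hardware.
sample = json.loads("""
{"Id": "1", "PowerState": "On", "Status": {"State": "Enabled", "Health": "OK"}}
""")
summary = summarize_system(sample)
print(summary)
```

Missing fields come back as None, so a fleet-wide sweep can flag BMCs that return incomplete data instead of failing on them.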

Deep-dive pages

References

  • DGX SuperPOD components and management networks: https://docs.nvidia.com/dgx-superpod/reference-architecture/scalable-infrastructure-b300-xdr/latest/dgx-superpod-components.html

Related: Fabric · Commissioning · Performance · Glossary