Playbook: end-to-end workload bring-up
Scope: the ordered, step-by-step playbook that takes a freshly built cluster to its first running workload (facility -> fabric proof -> nodes -> K8s GPU stack -> telemetry -> fabric validation -> training/inference acceptance) with the command and the pass criterion for each step. The connective tissue between the per-layer pages; the runnable manifests for the individual workloads live in workload bring-up recipes.
Reference templates, drawn from the NVIDIA GPU Operator / Network Operator docs, the NVIDIA/nccl-tests repo, Volcano, vLLM, and the Prometheus/DCGM docs. Nothing here was executed on hardware. Pin every chart, image, and CRD version; substitute real node names, RDMA resource keys (rdma/rdma_shared_device_a vs rdma/ib), HCA filters, and model/secret names before applying.
Drive the sequence in order. Each step is the acceptance test for the one before it; skipping forward only moves the failure later, where it is harder to localize.
flowchart LR
S1["1 Facility ready"] --> S2["2 Fabric up (SM + links)"]
S2 --> S3["3 Nodes provisioned"]
S3 --> S4["4 K8s GPU stack"]
S4 --> S5["5 Telemetry"]
S5 --> S6["6 Fabric validation (NCCL)"]
S6 --> S7["7 Training acceptance (MFU)"]
S6 --> S8["8 Inference acceptance (TTFT/TPOT)"]
S7 --> DONE["First workload accepted"]
S8 --> DONE
What it is
A linear bring-up runbook. The cluster is built (racked, cabled, powered) but unproven: nothing has run a real collective, no GPU has been scheduled through Kubernetes, no telemetry is flowing. This playbook walks the stack bottom-up, gating at each layer on a single observable pass criterion before the next layer is trusted.
It differs from its neighbours by scope:
- workload bring-up recipes holds the full manifests (MPIJob, Volcano Job, vLLM Deployment); this page sequences and gates them.
- Ansible site playbook is the node-convergence layer (steps 2-3 here): driver/OFED/FM/MIG roles in rolling waves.
- fabric bring-up and benchmarking is the host-level fabric proof (step 6's deeper form); smoke tests is the Kubernetes acceptance suite (step 4's deeper form).
The contract: a layer is "up" only when its pass criterion is observed, not when its install command exited 0.
Why it matters
Bring-up failures are cheap to fix at the layer they originate and expensive everywhere above it. A degraded InfiniBand link looks identical to a slow model if you discover it during a training benchmark (step 7) instead of an NCCL run (step 6). You will profile the trainer, suspect the GPUs, and burn days before checking the fabric (runbook: MFU regression, runbook: NCCL hang). The ordering exists so each failure surfaces against the smallest possible suspect set.
The other failure class is partial placement: a distributed job with no gang scheduler half-places, holds GPUs, and deadlocks (cluster orchestration). Gating step 4 (gang scheduler installed and proven) before step 7 (gang-scheduled training) removes that class entirely.
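The partial-placement deadlock can be illustrated with a toy model. This is a sketch only: `place_greedy` and `place_gang` are hypothetical helpers, not how kube-scheduler or Volcano are implemented, and the job names and GPU counts are made up. It shows two 8-GPU jobs competing for 8 free GPUs, with and without an all-or-nothing rule.

```python
def place_greedy(free_gpus, jobs):
    """Place pods one at a time, round-robin across jobs, with no all-or-nothing rule."""
    placed = {name: 0 for name, _ in jobs}
    progress = True
    while free_gpus > 0 and progress:
        progress = False
        for name, need in jobs:
            if free_gpus > 0 and placed[name] < need:
                placed[name] += 1
                free_gpus -= 1
                progress = True
    return placed


def place_gang(free_gpus, jobs):
    """Admit a job only if all of its pods fit; otherwise place none of it."""
    placed = {}
    for name, need in jobs:
        if need <= free_gpus:
            placed[name] = need
            free_gpus -= need
        else:
            placed[name] = 0
    return placed


def runnable(placed, jobs):
    """A distributed job can start only when every rank has been placed."""
    return [name for name, need in jobs if placed[name] == need]


jobs = [("job-a", 8), ("job-b", 8)]  # hypothetical: two 8-GPU jobs
greedy = place_greedy(8, jobs)       # hypothetical: 8 free GPUs
gang = place_gang(8, jobs)
print("greedy:", greedy, "runnable:", runnable(greedy, jobs))
print("gang:  ", gang, "runnable:", runnable(gang, jobs))
```

In the greedy case each job holds half of its GPUs and neither can start; with the gang rule one job runs and the other waits without holding anything.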
When it is needed (and when not)
Needed:
- First bring-up of a new cluster, or a new island/superpod added to an existing one (runbook: capacity add).
- After a fleet-wide driver or firmware roll: re-run from step 6 (fabric validation) forward; the host stack changed underneath.
- Commissioning/acceptance sign-off against a vendor or neocloud, where recorded thresholds are contractual (cloud & neoclouds cost, vendor sourcing & procurement).
Not needed:
- Routine workload deploys on an already-accepted cluster: go straight to the recipe (serving OSS models, distributed training recipes).
- Single-node or workstation setups with no fabric and no gang scheduling: steps 2 and 6 collapse to a local nvidia-smi check.
- Slurm-first clusters: steps 4-5 differ (Slurm + Pyxis instead of the K8s GPU stack); see cluster: Slurm. The facility/fabric/node/validation spine is identical.
How: implement, integrate, maintain
Each step below gives the command and the pass criterion. Do not advance on a non-pass; localize at the current layer.
| # | Step | Command (representative) | Pass criterion |
|---|---|---|---|
| 1 | Facility ready | facility sign-off (datacentre physical) | power/cooling/weight budget signed; PDUs energized |
| 2 | Fabric up | ibdiagnet; sminfo | ibdiagnet no errors, SM converged, all links at rated speed (fabric bring-up & benchmarking) |
| 3 | Nodes provisioned | ansible-playbook site.yml (Ansible site playbook) | nvidia-smi lists all GPUs; ibstat State: Active; dcgmi diag -r 3 no Fail (gpu health gating) |
| 4 | K8s GPU stack | helm install nvidia/gpu-operator (below) | kubectl get clusterpolicy Ready; GPUs allocatable; gang scheduler present (K8s GPU platform) |
| 5 | Telemetry | dcgm-exporter + Prometheus (telemetry & monitoring) | DCGM_FI_DEV_GPU_UTIL scraping in Prometheus; an alert fires on a test condition |
| 6 | Fabric validation | all_reduce_perf (NCCL, below) | busbw near line rate for large sizes; NCCL_DEBUG=INFO shows GDR/IB path, not TCP (smoke tests) |
| 7 | Training acceptance | gang-scheduled torchrun (manifest: Volcano Job) | all ranks start together; MFU within target; step time stable (train: FSDP, runbook: MFU regression) |
| 8 | Inference acceptance | vLLM Deployment + smoke curl (inference serving) | /health ready; TTFT/TPOT within SLO (SLO/SLI catalog, runbook: inference SLO breach) |
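The "do not advance on a non-pass" rule can be sketched as a small gate runner. This is an illustrative sketch, not part of any referenced tool: `run_gates` is a hypothetical helper, and the lambda checks are stand-ins for the real per-step commands. It treats the sequence as strictly linear, which simplifies the flowchart's parallel steps 7 and 8.

```python
from typing import Callable, List, Optional, Tuple

Gate = Tuple[str, Callable[[], bool]]


def run_gates(gates: List[Gate]) -> Optional[str]:
    """Run gates in order; return the first gate that does not pass, or None if all pass.

    A check that raises is treated as a non-pass, so a broken probe
    never lets the sequence advance.
    """
    for name, check in gates:
        try:
            passed = bool(check())
        except Exception:
            passed = False
        if not passed:
            return name  # stop here: localize at this layer
    return None


# Stand-in checks; real ones would wrap the commands in the table.
example = [
    ("2 fabric up", lambda: True),
    ("3 nodes provisioned", lambda: True),
    ("4 k8s gpu stack", lambda: False),  # simulated failure
    ("5 telemetry", lambda: True),       # never reached
]
print("first failing gate:", run_gates(example))
```

Returning the failing gate's name, rather than a list of all failures, mirrors the playbook's intent: later results are not trustworthy until the current layer passes.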
Step 4: K8s GPU stack (install + gate)
Install the GPU Operator (driver, container toolkit, device plugin, DCGM, validator) and the gang scheduler, then gate on observable readiness, not on the helm exit code.[1][2]
# GPU Operator: NVIDIA-managed driver/toolkit/device-plugin/DCGM stack.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator --version=v26.3.3 # pin to your validated release
# Gang scheduler so distributed jobs place all-or-nothing (step 7 depends on it).
helm repo add volcano-sh https://volcano-sh.github.io/helm-charts && helm repo update
helm install volcano volcano-sh/volcano -n volcano-system --create-namespace --wait
# --- Gate: do not advance until all three pass ---
kubectl get clusterpolicy # STATE must be: ready
kubectl get nodes -o json | jq '.items[]
| select(.status.allocatable["nvidia.com/gpu"] != null)
| {name: .metadata.name, gpus: .status.allocatable["nvidia.com/gpu"]}' # GPUs allocatable
kubectl get pods -n volcano-system # volcano-scheduler Running
Pass: clusterpolicy is ready, every GPU node reports allocatable nvidia.com/gpu, and the Volcano scheduler pod is Running. The deeper per-layer acceptance suite is smoke tests.
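One caveat with the allocatable check: a filter that selects only nodes carrying the nvidia.com/gpu key silently omits a GPU node whose device plugin never registered. A minimal sketch of comparing against an expected inventory follows. It assumes the output of kubectl get nodes -o json has been captured as a string; `gpu_allocatable` and `missing_gpu_nodes` are hypothetical helpers, and the node names and counts are made up.

```python
import json


def gpu_allocatable(nodes_json: str) -> dict:
    """Map node name -> allocatable nvidia.com/gpu count (0 when the key is absent)."""
    out = {}
    for item in json.loads(nodes_json)["items"]:
        alloc = item.get("status", {}).get("allocatable", {})
        # Kubernetes serializes resource quantities as strings, e.g. "8".
        out[item["metadata"]["name"]] = int(alloc.get("nvidia.com/gpu", 0))
    return out


def missing_gpu_nodes(nodes_json: str, expected: dict) -> dict:
    """Return expected nodes that are absent or report fewer GPUs than expected."""
    seen = gpu_allocatable(nodes_json)
    return {n: seen.get(n, 0) for n, want in expected.items() if seen.get(n, 0) < want}


# Made-up sample: node-b has no GPU resource, node-c is not in the cluster at all.
SAMPLE_NODES = json.dumps({"items": [
    {"metadata": {"name": "node-a"}, "status": {"allocatable": {"nvidia.com/gpu": "8"}}},
    {"metadata": {"name": "node-b"}, "status": {"allocatable": {"cpu": "64"}}},
]})
EXPECTED = {"node-a": 8, "node-b": 8, "node-c": 8}  # hypothetical inventory
print(missing_gpu_nodes(SAMPLE_NODES, EXPECTED))
```

An empty result is the pass condition; any entry names a node to localize before moving on.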
Step 6: Fabric validation (NCCL collective)
Run an all-reduce across two nodes (16 GPUs) and read bus bandwidth against the topology ceiling. Requires the MPI Operator; full MPIJob manifest in workload bring-up recipes. The command the launcher runs:[3][4]
# Sweep 8 B -> 8 GiB, doubling each step, 1 GPU per thread, IB HCA filter + verbose path.
mpirun --allow-run-as-root -np 16 -bind-to none \
-x NCCL_IB_HCA=mlx5 -x NCCL_DEBUG=INFO \
/workspace/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1
Pass: the busbw column approaches the fabric's line rate at large message sizes, and NCCL_DEBUG=INFO logs an IB/GDR transport (look for NET/IB and GDR engagement), not a TCP/socket fallback. A TCP fallback means the RDMA resource was not requested in the pod or the HCA filter is wrong (RDMA/RoCE tuning, runbook: NCCL hang). Achieved busbw is always below the nominal line rate; protocol overhead is expected.
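Reading the busbw column can be automated with a short parser. The sketch below makes several assumptions: that data rows of all_reduce_perf output are whitespace-separated with out-of-place busbw in the eighth field (this may differ between nccl-tests versions), that the sample rows are synthetic rather than measured, and that the 400 GB/s ceiling and 0.8 fraction are placeholders for your own fabric and threshold. The all-reduce relation busbw = algbw x 2(n-1)/n comes from the nccl-tests performance notes.

```python
def allreduce_busbw_factor(n_ranks: int) -> float:
    """Factor relating algbw to busbw for all-reduce: 2*(n-1)/n."""
    return 2.0 * (n_ranks - 1) / n_ranks


def peak_busbw(output: str, busbw_col: int = 7) -> float:
    """Largest out-of-place busbw (GB/s) across data rows; '#' lines are skipped."""
    best = 0.0
    for line in output.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue
        if len(fields) <= busbw_col or not fields[0].isdigit():
            continue
        try:
            best = max(best, float(fields[busbw_col]))
        except ValueError:
            continue
    return best


def meets_target(peak_gbs: float, ceiling_gbs: float, min_fraction: float) -> bool:
    """Pass when peak busbw reaches the chosen fraction of the ceiling."""
    return peak_gbs >= min_fraction * ceiling_gbs


# Synthetic rows for illustration only; not a measurement.
SAMPLE = """\
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
           8             2     float     sum      -1    30.00    0.00    0.00      0    30.00    0.00    0.00      0
  8589934592    2147483648     float     sum      -1    50000  171.80  322.12      0    50000  171.80  322.12      0
# Avg bus bandwidth    : 161.06
"""
peak = peak_busbw(SAMPLE)
print("factor for 16 ranks:", allreduce_busbw_factor(16))
print("peak busbw:", peak, "pass:", meets_target(peak, 400.0, 0.8))
```

Taking the peak rather than the reported average keeps small-message rows, which are latency-bound, from dragging the figure down.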
Step 5: Telemetry gate (PromQL)
Before any benchmark, confirm metrics flow so steps 6-8 are observable. With dcgm-exporter scraped by Prometheus, this returns a non-empty vector:[5]
# Per-GPU utilization is being scraped from every GPU node.
count(DCGM_FI_DEV_GPU_UTIL) by (Hostname)
Pass: one series per expected GPU host. Wire the SLO/burn-rate alerts from SLO/SLI catalog and confirm one fires on a synthetic condition (observability & monitoring).
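A non-empty vector is necessary but not sufficient: a host can be missing, or can report fewer GPU series than it has GPUs. The sketch below checks a parsed instant-query response (the JSON shape returned by the Prometheus HTTP API) against an expected inventory. `hosts_reporting` and `telemetry_gaps` are hypothetical helpers, the host names and counts are made up, and it assumes one DCGM_FI_DEV_GPU_UTIL series per GPU, which may not hold with MIG or custom relabeling.

```python
def hosts_reporting(response: dict) -> dict:
    """Map Hostname -> series count from count(DCGM_FI_DEV_GPU_UTIL) by (Hostname)."""
    if response.get("status") != "success":
        raise ValueError("query did not succeed")
    data = response["data"]
    if data["resultType"] != "vector":
        raise ValueError("expected an instant vector")
    # Each sample is {"metric": {...labels...}, "value": [timestamp, "string value"]}.
    return {s["metric"].get("Hostname", ""): int(float(s["value"][1]))
            for s in data["result"]}


def telemetry_gaps(response: dict, expected: dict) -> dict:
    """Return expected hosts whose reported GPU series count differs from the inventory."""
    seen = hosts_reporting(response)
    return {h: seen.get(h, 0) for h, want in expected.items() if seen.get(h, 0) != want}


# Made-up response: node-b reports 7 of 8 GPUs, node-c is not scraped at all.
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"Hostname": "node-a"}, "value": [1700000000.0, "8"]},
        {"metric": {"Hostname": "node-b"}, "value": [1700000000.0, "7"]},
    ]},
}
EXPECTED_HOSTS = {"node-a": 8, "node-b": 8, "node-c": 8}  # hypothetical inventory
print(telemetry_gaps(SAMPLE_RESPONSE, EXPECTED_HOSTS))
```

An empty result means every expected host is scraped with the expected number of GPU series.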
Integrate
- The training and inference manifests for steps 7-8 are owned by workload bring-up recipes; this playbook only sequences and gates them. Gang scheduling for step 7 uses manifest: Volcano Job / helm: Volcano scheduler; FSDP specifics in train: FSDP, low-bandwidth/multi-region in train: DiLoCo.
- Ray-based stacks substitute step 4's scheduler with a RayCluster (cluster: Ray); k3s edge clusters in cluster: k3s; the orchestration trade-offs in cluster orchestration and cluster: Kubernetes.
- Record the step-6 busbw and step-7/8 MFU and TTFT/TPOT figures as the acceptance baseline; later regressions are measured against them (runbook: MFU regression, runbook: inference SLO breach).
Maintain
- Re-run from step 6 after any driver, firmware, or fabric change; the host stack moved underneath the recorded baseline.
- Keep step 3's health gate continuous, not one-shot: a node that passed at commissioning can degrade in service (gpu health gating).
- Treat the recorded baselines as living SLIs: drift in busbw or MFU is a fabric/scheduling regression, not a model problem (SLO/SLI catalog).
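Measuring later runs against the recorded baseline can be sketched as a relative-drift check. Everything here is illustrative: the metric keys, the 5% tolerance, and the sample figures are assumptions, not recommended values. The one structural point is direction: busbw and MFU regress when they fall, TTFT and TPOT regress when they rise.

```python
# True when a larger value is better; keys are hypothetical metric names.
HIGHER_IS_BETTER = {"busbw_gbps": True, "mfu": True, "ttft_ms": False, "tpot_ms": False}


def regressions(baseline: dict, current: dict, tolerance: float = 0.05) -> dict:
    """Return metrics whose relative change is worse than the tolerance.

    The value is the signed relative change against the baseline,
    e.g. -0.125 means 12.5% lower than recorded.
    """
    out = {}
    for metric, base in baseline.items():
        change = (current[metric] - base) / base
        worse = -change if HIGHER_IS_BETTER[metric] else change
        if worse > tolerance:
            out[metric] = round(change, 4)
    return out


# Made-up figures for illustration only.
BASELINE = {"busbw_gbps": 320.0, "mfu": 0.40, "ttft_ms": 200.0, "tpot_ms": 20.0}
CURRENT = {"busbw_gbps": 280.0, "mfu": 0.39, "ttft_ms": 230.0, "tpot_ms": 20.5}
print(regressions(BASELINE, CURRENT))
```

A flagged busbw with an unflagged MFU, as in this made-up case, points at the fabric first, which is the localization order the playbook argues for.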
Failure modes
- Step skipped, failure deferred. Training launched before NCCL validation; a degraded link is misread as a model/GPU fault and profiled for days (runbook: MFU regression).
- No gang scheduler at step 7. Distributed job half-places, pins GPUs, deadlocks. Step 4's gate (Volcano Running) prevents it.
- TCP fallback at step 6. RDMA resource not requested in the pod spec, or wrong HCA filter; busbw collapses (RDMA and RoCE Performance Tuning).
- Telemetry blind (step 5 skipped). Steps 6-8 run but nothing is recorded; "pass" is unfalsifiable. Gate on a non-empty PromQL vector first.
- vLLM OOM at step 8 from --max-model-len / batch exceeding KV-cache memory (inference serving).
References
- NVIDIA GPU Operator install & verify (helm command, clusterpolicy, allocatable GPUs): https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html
- NVIDIA Network Operator (RDMA device plugin, rdma/... resources): https://docs.nvidia.com/networking/display/cokan/network+operator
- NVIDIA nccl-tests (all_reduce_perf, busbw, -b -e -f -g flags): https://github.com/NVIDIA/nccl-tests
- NCCL environment variables (NCCL_IB_HCA, NCCL_DEBUG, GDR): https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
- Volcano gang scheduler: https://volcano.sh/en/docs/
- Kubeflow MPI Operator (MPIJob for NCCL): https://github.com/kubeflow/mpi-operator
- vLLM serving (OpenAI-compatible endpoint, health): https://docs.vllm.ai/en/latest/
- dcgm-exporter / DCGM field IDs (DCGM_FI_DEV_GPU_UTIL): https://docs.nvidia.com/datacenter/dcgm/latest/
- Prometheus querying basics (PromQL): https://prometheus.io/docs/prometheus/latest/querying/basics/
Related: Workload Bring-Up Recipes · Ansible Site Playbook · Fabric Bring-Up & Benchmarking · Smoke Tests · K8s GPU Platform · Telemetry & Monitoring · SLO/SLI Catalog · Glossary
1. NVIDIA GPU Operator getting-started: add the nvidia helm repo, then helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator --version=<release>; verify with kubectl get clusterpolicy (state ready) and by confirming nodes report allocatable nvidia.com/gpu. Version pin shown is illustrative; pin to your validated release. https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html
2. Volcano provides gang scheduling so all pods of a job are scheduled together or none are. https://volcano.sh/en/docs/
3. nccl-tests all_reduce_perf reports a busbw (bus bandwidth) column documented on the project's performance page; total ranks = processes x threads x GPUs-per-thread. https://github.com/NVIDIA/nccl-tests
4. nccl-tests flags: -b/--minbytes start size, -e/--maxbytes end size, -f/--stepfactor multiplier between sizes, -g/--ngpus GPUs per thread (default 1). https://github.com/NVIDIA/nccl-tests
5. DCGM_FI_DEV_GPU_UTIL is the DCGM field for per-GPU utilization, exported by dcgm-exporter for Prometheus scraping. https://docs.nvidia.com/datacenter/dcgm/latest/