
Playbook: end-to-end workload bring-up

Scope: the ordered, step-by-step playbook that takes a freshly built cluster to its first running workload (facility -> fabric proof -> nodes -> K8s GPU stack -> telemetry -> fabric validation -> training/inference acceptance), with the command and the pass criterion for each step. This page is the connective tissue between the per-layer pages; the runnable manifests for the individual workloads live in workload bring-up recipes.

Reference templates, drawn from the NVIDIA GPU Operator / Network Operator docs, the NVIDIA/nccl-tests repo, Volcano, vLLM, and the Prometheus/DCGM docs. Nothing here was executed on hardware. Pin every chart, image, and CRD version; substitute real node names, RDMA resource keys (rdma/rdma_shared_device_a vs rdma/ib), HCA filters, and model/secret names before applying. Drive the sequence in order. Each step is the acceptance test for the one before it; skipping forward only moves the failure later, where it is harder to localize.

flowchart LR
  S1["1 Facility ready"] --> S2["2 Fabric up (SM + links)"]
  S2 --> S3["3 Nodes provisioned"]
  S3 --> S4["4 K8s GPU stack"]
  S4 --> S5["5 Telemetry"]
  S5 --> S6["6 Fabric validation (NCCL)"]
  S6 --> S7["7 Training acceptance (MFU)"]
  S6 --> S8["8 Inference acceptance (TTFT/TPOT)"]
  S7 --> DONE["First workload accepted"]
  S8 --> DONE

What it is

A linear bring-up runbook. The cluster is built (racked, cabled, powered) but unproven: nothing has run a real collective, no GPU has been scheduled through Kubernetes, no telemetry is flowing. This playbook walks the stack bottom-up, gating at each layer on a single observable pass criterion before the next layer is trusted.

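It differs from its neighbours by scope: the per-layer pages cover each layer in depth, workload bring-up recipes holds the runnable manifests, and this page only sequences and gates them.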

The contract: a layer is "up" only when its pass criterion is observed, not when its install command exited 0.
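The contract can be enforced mechanically. The following is a minimal sketch, assuming a POSIX shell; `gate` is a hypothetical helper, not part of any tool named on this page. It retries a check command until the check succeeds or the attempt budget runs out, so a step is recorded as passed only when its criterion is actually observed.

```shell
# gate: retry a check until it passes or the attempts run out.
# Usage: gate <attempts> <sleep_seconds> <command> [args...]
gate() {
  attempts=$1
  pause=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    # The check's own output is discarded; only its exit status matters.
    if "$@" >/dev/null 2>&1; then
      echo "PASS after $i attempt(s): $*"
      return 0
    fi
    i=$((i + 1))
    sleep "$pause"
  done
  echo "FAIL after $attempts attempt(s): $*" >&2
  return 1
}
```

For example, `gate 60 10 kubectl rollout status -n volcano-system deploy/volcano-scheduler --timeout=5s` would poll for up to roughly ten minutes; the deployment name is an assumption about the chart's defaults and should be checked against the installed release.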

Why it matters

Bring-up failures are cheap to fix at the layer they originate and expensive everywhere above it. A degraded InfiniBand link looks identical to a slow model if you discover it during a training benchmark (step 7) instead of an NCCL run (step 6). You will profile the trainer, suspect the GPUs, and burn days before checking the fabric (runbook: MFU regression, runbook: NCCL hang). The ordering exists so each failure surfaces against the smallest possible suspect set.

The other failure class is partial placement: a distributed job with no gang scheduler half-places, holds GPUs, and deadlocks (cluster orchestration). Gating step 4 (gang scheduler installed and proven) before step 7 (gang-scheduled training) removes that class entirely.

When it is needed (and when not)

Needed:

  • First bring-up of a new cluster, or a new island/superpod added to an existing one (runbook: capacity add).
  • After a fleet-wide driver or firmware roll: re-run from step 6 (fabric validation) forward, because the host stack changed underneath the recorded baseline.
  • Commissioning/acceptance sign-off against a vendor or neocloud, where recorded thresholds are contractual (cloud & neoclouds cost, vendor sourcing & procurement).

Not needed:

  • Routine workload deploys on an already-accepted cluster: go straight to the recipe (serving OSS models, distributed training recipes).
  • Single-node or workstation setups with no fabric and no gang scheduling: steps 2 and 6 collapse to a local nvidia-smi check.
  • Slurm-first clusters: steps 4-5 differ (Slurm + Pyxis instead of the K8s GPU stack); see cluster: Slurm. The facility/fabric/node/validation spine is identical.

How: implement, integrate, maintain

Each step below gives the command and the pass criterion. Do not advance on a non-pass; localize at the current layer.

| # | Step | Command (representative) | Pass criterion |
|---|------|--------------------------|----------------|
| 1 | Facility ready | facility sign-off (datacentre physical) | power/cooling/weight budget signed; PDUs energized |
| 2 | Fabric up | ibdiagnet; sminfo | ibdiagnet no errors, SM converged, all links at rated speed (fabric bring-up & benchmarking) |
| 3 | Nodes provisioned | ansible-playbook site.yml (Ansible site playbook) | nvidia-smi lists all GPUs; ibstat State: Active; dcgmi diag -r 3 no Fail (gpu health gating) |
| 4 | K8s GPU stack | helm install nvidia/gpu-operator (below) | kubectl get clusterpolicy reports ready; GPUs allocatable; gang scheduler present (K8s GPU platform) |
| 5 | Telemetry | dcgm-exporter + Prometheus (telemetry & monitoring) | DCGM_FI_DEV_GPU_UTIL scraping in Prometheus; an alert fires on a test condition |
| 6 | Fabric validation | all_reduce_perf (NCCL, below) | busbw near line rate for large sizes; NCCL_DEBUG=INFO shows GDR/IB path, not TCP (smoke tests) |
| 7 | Training acceptance | gang-scheduled torchrun (manifest: Volcano Job) | all ranks start together; MFU within target; step time stable (train: FSDP, runbook: MFU regression) |
| 8 | Inference acceptance | vLLM Deployment + smoke curl (inference serving) | /health ready; TTFT/TPOT within SLO (SLO/SLI catalog, runbook: inference SLO breach) |
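Step 3's pass criterion can be turned into a per-node check. This is a minimal sketch, assuming the usual text layout of `nvidia-smi -L` (one `GPU <n>: ...` line per device) and `ibstat` (one `State:` line per port). The function names and the expected GPU count are hypothetical.

```shell
# count_gpus: count GPUs in `nvidia-smi -L` output read from stdin.
# Only top-level "GPU <n>:" lines are counted; indented MIG lines are ignored.
count_gpus() {
  awk '/^GPU [0-9]+:/ { n++ } END { print n + 0 }'
}

# inactive_ib_ports: count ports in `ibstat` output (stdin) whose logical
# State is anything other than Active. "Physical state:" lines do not match
# because their first field is "Physical".
inactive_ib_ports() {
  awk '$1 == "State:" && $2 != "Active" { bad++ } END { print bad + 0 }'
}

# Example use on a node (EXPECTED_GPUS is a hypothetical per-node count):
#   EXPECTED_GPUS=8
#   [ "$(nvidia-smi -L | count_gpus)" -eq "$EXPECTED_GPUS" ] \
#     && [ "$(ibstat | inactive_ib_ports)" -eq 0 ] \
#     && echo "node PASS"
```

Parsing text output is brittle across tool versions; treat this as a quick gate and keep `dcgmi diag -r 3` as the authoritative health check.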

Step 4: K8s GPU stack (install + gate)

Install the GPU Operator (driver, container toolkit, device plugin, DCGM, validator) and the gang scheduler, then gate on observable readiness, not on the helm exit code. [1][2]

# GPU Operator: NVIDIA-managed driver/toolkit/device-plugin/DCGM stack.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator --version=v26.3.3      # pin to your validated release

# Gang scheduler so distributed jobs place all-or-nothing (step 7 depends on it).
# Add --version to pin this chart too, as for the GPU Operator above.
helm repo add volcano-sh https://volcano-sh.github.io/helm-charts && helm repo update
helm install volcano volcano-sh/volcano -n volcano-system --create-namespace --wait

# --- Gate: do not advance until all three pass ---
kubectl get clusterpolicy                                  # STATUS column must read: ready
kubectl get nodes -o json | jq '.items[]
  | select(.status.allocatable["nvidia.com/gpu"] != null)
  | {name: .metadata.name, gpus: .status.allocatable["nvidia.com/gpu"]}'   # GPUs allocatable
kubectl get pods -n volcano-system                         # volcano-scheduler Running

Pass: clusterpolicy is ready, every GPU node reports allocatable nvidia.com/gpu, and the Volcano scheduler pod is Running. The deeper per-layer acceptance suite lives in smoke tests.
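A listing that only shows nodes which already advertise `nvidia.com/gpu` cannot reveal a GPU node that advertises none. The sketch below checks "every GPU node" explicitly. It assumes the GPU nodes carry the `nvidia.com/gpu.present=true` label and that each node should expose 8 GPUs; both are assumptions to replace with your own values, and `gpu_nodes_short` is a hypothetical helper name.

```shell
# gpu_nodes_short: read "<node> <allocatable-gpus>" lines on stdin and print
# every node that reports fewer GPUs than expected. kubectl prints <none>
# when the resource is absent, which is counted as 0.
# Usage: gpu_nodes_short <expected_gpus_per_node>
# Exit status: 0 if all nodes meet the count, 1 otherwise.
gpu_nodes_short() {
  awk -v want="$1" '
    NF >= 2 {
      got = ($2 ~ /^[0-9]+$/) ? $2 + 0 : 0
      if (got < want + 0) { print $1 " " got "/" want; bad++ }
    }
    END { exit (bad > 0) }'
}

# Example use (label selector and count are assumptions):
#   kubectl get nodes -l nvidia.com/gpu.present=true --no-headers \
#     -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu' \
#     | gpu_nodes_short 8
```

An empty output with exit status 0 means every selected node advertises the expected count; any printed line names a node to localize before advancing.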

Step 6: Fabric validation (NCCL collective)

Run an all-reduce across two nodes (16 GPUs) and read bus bandwidth against the topology ceiling. This requires the MPI Operator; the full MPIJob manifest is in workload bring-up recipes. The command the launcher runs: [3][4]

# Sweep 8 B -> 8 GiB, doubling each step, 1 GPU per thread, IB HCA filter + verbose path.
mpirun --allow-run-as-root -np 16 -bind-to none \
  -x NCCL_IB_HCA=mlx5 -x NCCL_DEBUG=INFO \
  /workspace/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1

Pass: the busbw column approaches the fabric's line rate at large message sizes, and NCCL_DEBUG=INFO logs an IB/GDR transport (look for NET/IB and GDR engagement), not a TCP/socket fallback. A TCP fallback means the RDMA resource was not requested in the pod or the HCA filter is wrong (RDMA/RoCE tuning, runbook: NCCL hang). Achieved busbw is always below the nominal line rate; protocol overhead is expected.
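Both halves of this pass criterion can be read from a saved run log. The sketch below assumes the standard `all_reduce_perf` row layout (out-of-place busbw in the 8th column of rows that begin with a numeric size) and that NCCL's channel lines contain `via NET/IB` or `via NET/Socket`. Log wording varies between NCCL versions, so treat the patterns as assumptions. The helper names and the bandwidth floor are hypothetical.

```shell
# peak_busbw: print the highest out-of-place busbw (GB/s) found in
# nccl-tests output on stdin. Header and NCCL INFO lines are skipped
# because their first field is not a bare integer.
peak_busbw() {
  awk '$1 ~ /^[0-9]+$/ && NF >= 13 && $8 + 0 > max { max = $8 + 0 }
       END { print max + 0 }'
}

# busbw_at_least: succeed if the peak busbw on stdin is >= the given floor.
# Usage: busbw_at_least <floor_GBps> < run.log
busbw_at_least() {
  awk -v p="$(peak_busbw)" -v f="$1" 'BEGIN { exit !(p + 0 >= f + 0) }'
}

# transport_ok: succeed only if at least one channel runs over NET/IB and
# none falls back to NET/Socket.
transport_ok() {
  awk '/via NET\/IB/ { ib++ } /via NET\/Socket/ { sock++ }
       END { exit !(ib > 0 && sock == 0) }'
}

# Example use (nccl.log and the 40 GB/s floor are hypothetical):
#   mpirun ... all_reduce_perf -b 8 -e 8G -f 2 -g 1 2>&1 | tee nccl.log
#   busbw_at_least 40 < nccl.log && transport_ok < nccl.log && echo "fabric PASS"
```

The floor should come from the fabric's expected busbw for this topology, not from the nominal link rate, since achieved busbw sits below line rate.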

Step 5: Telemetry gate (PromQL)

Before any benchmark, confirm metrics flow so steps 6-8 are observable. With dcgm-exporter scraped by Prometheus, this query returns a non-empty vector: [5]

# Per-GPU utilization is being scraped from every GPU node.
count(DCGM_FI_DEV_GPU_UTIL) by (Hostname)

Pass: one series per expected GPU host. Wire the SLO/burn-rate alerts from SLO/SLI catalog and confirm one fires on a synthetic condition (observability & monitoring).
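"One series per expected GPU host" is easiest to verify by diffing the observed hostnames against an inventory. This is a minimal sketch: `missing_hosts`, `PROM_URL`, and `expected_hosts.txt` are hypothetical names, and the example pipeline assumes `curl` and `jq` are available and that dcgm-exporter attaches a `Hostname` label.

```shell
# missing_hosts: print every host listed in the expected-hosts file (one per
# line) that is absent from the observed hostnames read on stdin.
# Usage: <observed hostnames> | missing_hosts <expected_hosts_file>
# Exit status: 0 if nothing is missing, 1 otherwise.
missing_hosts() {
  awk -v expected="$1" '
    { seen[$1] = 1 }
    END {
      while ((getline host < expected) > 0)
        if (host != "" && !(host in seen)) { print host; miss++ }
      exit (miss > 0)
    }'
}

# Example use via the Prometheus HTTP query API (URL and file are assumptions):
#   curl -sG --data-urlencode 'query=count(DCGM_FI_DEV_GPU_UTIL) by (Hostname)' \
#     "$PROM_URL/api/v1/query" \
#     | jq -r '.data.result[].metric.Hostname' \
#     | missing_hosts expected_hosts.txt
```

An empty query result lists every expected host as missing, which is the desired behaviour for a gate: no data is a failure, not a pass.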

Integrate

Maintain

  • Re-run from step 6 after any driver, firmware, or fabric change; the host stack moved underneath the recorded baseline.
  • Keep step 3's health gate continuous, not one-shot: a node that passed at commissioning can degrade in service (gpu health gating).
  • Treat the recorded baselines as living SLIs: drift in busbw or MFU is a fabric/scheduling regression, not a model problem (SLO/SLI catalog).
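Drift against a recorded baseline can be checked with one comparison per metric. This is a minimal sketch for higher-is-better metrics such as busbw or MFU; `within_baseline` and the 5% tolerance in the example are hypothetical, and the right tolerance depends on the run-to-run noise you measured at acceptance.

```shell
# within_baseline: succeed if <current> has not dropped more than
# <tolerance_pct> percent below <baseline>. Higher-is-better metrics only.
# Usage: within_baseline <baseline> <current> <tolerance_pct>
within_baseline() {
  awk -v b="$1" -v c="$2" -v t="$3" 'BEGIN {
    floor = b * (1 - t / 100)
    printf "baseline=%s current=%s floor=%.2f\n", b, c, floor
    exit !(c + 0 >= floor)
  }'
}

# Example use (values are illustrative, not measured):
#   within_baseline 85 83 5 || echo "busbw drifted: re-run from step 6"
```

A failure here points back to step 6 rather than to the model, in line with the maintenance rule above.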

Failure modes

  • Step skipped, failure deferred. Training launched before NCCL validation; a degraded link is misread as a model/GPU fault and profiled for days (runbook: MFU regression).
  • No gang scheduler at step 7. Distributed job half-places, pins GPUs, deadlocks. Step 4's gate (Volcano Running) prevents it.
  • TCP fallback at step 6. RDMA resource not requested in the pod spec, or wrong HCA filter; busbw collapses (RDMA and RoCE Performance Tuning).
  • Telemetry blind (step 5 skipped). Steps 6-8 run but nothing is recorded; "pass" is unfalsifiable. Gate on a non-empty PromQL vector first.
  • vLLM OOM at step 8. --max-model-len or the batch size exceeds the memory available for the KV cache (inference serving).

References

  • NVIDIA GPU Operator — install & verify (helm command, clusterpolicy, allocatable GPUs): https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html
  • NVIDIA Network Operator (RDMA device plugin, rdma/... resources): https://docs.nvidia.com/networking/display/cokan/network+operator
  • NVIDIA nccl-tests (all_reduce_perf, busbw, -b -e -f -g flags): https://github.com/NVIDIA/nccl-tests
  • NCCL environment variables (NCCL_IB_HCA, NCCL_DEBUG, GDR): https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
  • Volcano gang scheduler: https://volcano.sh/en/docs/
  • Kubeflow MPI Operator (MPIJob for NCCL): https://github.com/kubeflow/mpi-operator
  • vLLM serving (OpenAI-compatible endpoint, health): https://docs.vllm.ai/en/latest/
  • dcgm-exporter / DCGM field IDs (DCGM_FI_DEV_GPU_UTIL): https://docs.nvidia.com/datacenter/dcgm/latest/
  • Prometheus querying basics (PromQL): https://prometheus.io/docs/prometheus/latest/querying/basics/

Related: Workload Bring-Up Recipes · Ansible Site Playbook · Fabric Bring-Up & Benchmarking · Smoke Tests · K8s GPU Platform · Telemetry & Monitoring · SLO/SLI Catalog · Glossary


  1. NVIDIA GPU Operator getting-started: add the nvidia helm repo, then helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator --version=<release>; verify with kubectl get clusterpolicy (state ready) and by confirming nodes report allocatable nvidia.com/gpu. Version pin shown is illustrative — pin to your validated release. https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html 

  2. Volcano provides gang scheduling so all pods of a job are scheduled together or none are. https://volcano.sh/en/docs/ 

  3. nccl-tests all_reduce_perf reports a busbw (bus bandwidth) column documented on the project's performance page; total ranks = processes x threads x GPUs-per-thread. https://github.com/NVIDIA/nccl-tests 

  4. nccl-tests flags: -b/--minbytes start size, -e/--maxbytes end size, -f/--stepfactor multiplier between sizes, -g/--ngpus GPUs per thread (default 1). https://github.com/NVIDIA/nccl-tests 

  5. DCGM_FI_DEV_GPU_UTIL is the DCGM field for per-GPU utilization, exported by dcgm-exporter for Prometheus scraping. https://docs.nvidia.com/datacenter/dcgm/latest/