
Playbook: end-to-end workload bring-up

Scope: the ordered, step-by-step playbook that takes a freshly built cluster to its first running workload (facility -> fabric proof -> nodes -> K8s GPU stack -> telemetry -> fabric validation -> training/inference acceptance), with the command and the pass criterion for each step. This page is the connective tissue between the per-layer pages; the runnable manifests for the individual workloads live in workload bring-up recipes.

The commands below are reference templates drawn from the NVIDIA GPU Operator / Network Operator docs, the NVIDIA/nccl-tests repo, Volcano, vLLM, and the Prometheus/DCGM docs. Nothing here was executed on hardware. Before applying, pin every chart, image, and CRD version, and substitute real node names, RDMA resource keys (rdma/rdma_shared_device_a vs rdma/ib), HCA filters, and model/secret names. Run the steps in order: each step is the acceptance test for the one before it, and skipping forward only moves the failure later, where it is harder to localize.

flowchart LR
  S1["1 Facility ready"] --> S2["2 Fabric up (SM + links)"]
  S2 --> S3["3 Nodes provisioned"]
  S3 --> S4["4 K8s GPU stack"]
  S4 --> S5["5 Telemetry"]
  S5 --> S6["6 Fabric validation (NCCL)"]
  S6 --> S7["7 Training acceptance (MFU)"]
  S6 --> S8["8 Inference acceptance (TTFT/TPOT)"]
  S7 --> DONE["First workload accepted"]
  S8 --> DONE

What it is

A linear bring-up runbook. The cluster is built (racked, cabled, powered) but unproven: nothing has run a real collective, no GPU has been scheduled through Kubernetes, no telemetry is flowing. This playbook walks the stack bottom-up, gating at each layer on a single observable pass criterion before the next layer is trusted.

It differs from its neighbours by scope:

workload bring-up recipes holds the full manifests (MPIJob, Volcano Job, vLLM Deployment); this page sequences and gates them.
Ansible site playbook is the node-convergence layer (steps 2-3 here): driver/OFED/FM/MIG roles in rolling waves.
fabric bring-up and benchmarking is the host-level fabric proof (step 6's deeper form); smoke tests is the Kubernetes acceptance suite (step 4's deeper form).

The contract: a layer is "up" only when its pass criterion is observed, not when its install command exited 0.

Why it matters

Bring-up failures are cheap to fix at the layer where they originate and expensive at every layer above it. A degraded InfiniBand link is indistinguishable from a slow model if you first see it during a training benchmark (step 7) instead of an NCCL run (step 6): you will profile the trainer, suspect the GPUs, and burn days before checking the fabric (runbook: MFU regression, runbook: NCCL hang). The ordering exists so each failure surfaces against the smallest possible suspect set.

The other failure class is partial placement: a distributed job with no gang scheduler half-places, holds GPUs, and deadlocks (cluster orchestration). Gating step 4 (gang scheduler installed and proven) before step 7 (gang-scheduled training) removes that class entirely.

When it is needed (and when not)

Needed:

First bring-up of a new cluster, or a new island/superpod added to an existing one (runbook: capacity add).
After a fleet-wide driver or firmware roll, re-run from step 6 (fabric proof) forward; the host stack changed underneath.
Commissioning/acceptance sign-off against a vendor or neocloud, where recorded thresholds are contractual (cloud & neoclouds cost, vendor sourcing & procurement).

Not needed:

Routine workload deploys on an already-accepted cluster: go straight to the recipe (serving OSS models, distributed training recipes).
Single-node or workstation setups with no fabric and no gang scheduling: steps 2 and 6 collapse to a local nvidia-smi check.
Slurm-first clusters: steps 4-5 differ (Slurm + Pyxis instead of the K8s GPU stack); see cluster: Slurm. The facility/fabric/node/validation spine is identical.

How: implement, integrate, maintain

Each step below gives the command and the pass criterion. Do not advance on a non-pass; localize at the current layer.

| # | Step | Command (representative) | Pass criterion |
|---|------|--------------------------|----------------|
| 1 | Facility ready | facility sign-off (datacentre physical) | power/cooling/weight budget signed; PDUs energized |
| 2 | Fabric up | `ibdiagnet`; `sminfo` | `ibdiagnet` no errors, SM converged, all links at rated speed (fabric bring-up & benchmarking) |
| 3 | Nodes provisioned | `ansible-playbook site.yml` (Ansible site playbook) | `nvidia-smi` lists all GPUs; `ibstat` `State: Active`; `dcgmi diag -r 3` no `Fail` (gpu health gating) |
| 4 | K8s GPU stack | `helm install nvidia/gpu-operator` (below) | `kubectl get clusterpolicy` Ready; GPUs allocatable; gang scheduler present (K8s GPU platform) |
| 5 | Telemetry | dcgm-exporter + Prometheus (telemetry & monitoring) | `DCGM_FI_DEV_GPU_UTIL` scraping in Prometheus; an alert fires on a test condition |
| 6 | Fabric validation | `all_reduce_perf` (NCCL, below) | busbw near line rate for large sizes; `NCCL_DEBUG=INFO` shows GDR/IB path, not TCP (smoke tests) |
| 7 | Training acceptance | gang-scheduled `torchrun` (manifest: Volcano Job) | all ranks start together; MFU within target; step time stable (train: FSDP, runbook: MFU regression) |
| 8 | Inference acceptance | vLLM Deployment + smoke `curl` (inference serving) | `/health` ready; TTFT/TPOT within SLO (SLO/SLI catalog, runbook: inference SLO breach) |
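The "do not advance on a non-pass" rule can be sketched as a small gate runner. This is a minimal illustration, not part of any tool named on this page: `run_gates` and the `label::command` convention are hypothetical, and each check command is assumed to express its pass criterion through its exit status.

```shell
# Hypothetical helper: run gate checks in order and stop at the first non-pass.
# Each argument is "<label>::<command>"; exit status 0 of the command means pass.
run_gates() {
    for gate in "$@"; do
        label=${gate%%::*}      # text before the first "::"
        check=${gate#*::}       # text after the first "::"
        if sh -c "$check" >/dev/null 2>&1; then
            echo "PASS  $label"
        else
            echo "FAIL  $label (localize here; not advancing)"
            return 1
        fi
    done
    echo "ALL GATES PASSED"
}
```

A call might look like `run_gates "4 clusterpolicy::kubectl get clusterpolicy -o jsonpath='{.items[*].status.state}' | grep -qx ready" "4 volcano::kubectl get pods -n volcano-system"`, assuming the operator exposes its state at `.status.state`. Later gates never run once an earlier one fails, which keeps the suspect set at the failing layer.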

Step 4: K8s GPU stack (install + gate)

Install the GPU Operator (driver, container toolkit, device plugin, DCGM, validator) and the gang scheduler, then gate on observable readiness, not on helm exit code.¹²

# GPU Operator: NVIDIA-managed driver/toolkit/device-plugin/DCGM stack.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator --version=v26.3.3      # pin to your validated release

# Gang scheduler so distributed jobs place all-or-nothing (step 7 depends on it).
# Pin the chart with --version once you have a validated release.
helm repo add volcano-sh https://volcano-sh.github.io/helm-charts && helm repo update
helm install volcano volcano-sh/volcano -n volcano-system --create-namespace --wait

# --- Gate: do not advance until all three pass ---
kubectl get clusterpolicy                                  # STATUS column (.status.state) must read: ready
# Lists only nodes that advertise GPUs; compare the names against your expected GPU node list.
kubectl get nodes -o json | jq '.items[]
  | select(.status.allocatable["nvidia.com/gpu"] != null)
  | {name: .metadata.name, gpus: .status.allocatable["nvidia.com/gpu"]}'   # GPUs allocatable
kubectl get pods -n volcano-system                         # volcano-scheduler Running

Pass: clusterpolicy is ready, every GPU node reports allocatable nvidia.com/gpu, and the Volcano scheduler pod is Running. The deeper per-layer acceptance suite is smoke tests.
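The "every GPU node reports allocatable nvidia.com/gpu" criterion can be turned into an exit code. The sketch below is an assumption-laden illustration: `gpu_alloc_gate` is a hypothetical helper, the expected per-node GPU count defaults to 8 only because step 6 uses two nodes with 16 GPUs, and the input is assumed to be one `NODE GPUS` pair per line.

```shell
# Hypothetical helper: read "NODE GPUS" lines on stdin and fail if any node
# reports a GPU count other than the expected one (default 8, an assumption).
gpu_alloc_gate() {
    expected=${1:-8}
    awk -v want="$expected" '
        # A node with no allocatable GPUs shows up as "<none>" and fails here.
        $2 != want { printf "FAIL %s allocatable=%s want=%s\n", $1, $2, want; bad++ }
        END {
            if (NR == 0) { print "FAIL no GPU nodes listed"; exit 1 }
            if (bad)     { exit 1 }
            printf "PASS %d nodes x %s GPUs\n", NR, want
        }'
}
```

One way to feed it, assuming GPU nodes carry the `nvidia.com/gpu.present=true` label: `kubectl get nodes -l nvidia.com/gpu.present=true --no-headers -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' | gpu_alloc_gate 8`. Unlike a filter that only lists nodes with GPUs, this fails loudly on a GPU node that advertises none.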

Step 6: Fabric validation (NCCL collective)

Run an all-reduce across two nodes (16 GPUs) and read bus bandwidth against the topology ceiling. Requires the MPI Operator; full MPIJob manifest in workload bring-up recipes. The command the launcher runs:³⁴

# Sweep 8 B -> 8 GiB, doubling each step, 1 GPU per thread, IB HCA filter + verbose path.
mpirun --allow-run-as-root -np 16 -bind-to none \
  -x NCCL_IB_HCA=mlx5 -x NCCL_DEBUG=INFO \
  /workspace/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1

Pass: the busbw column approaches the fabric's line rate at large message sizes, and the NCCL_DEBUG=INFO log shows an IB transport with GPUDirect RDMA (look for NET/IB and GDRDMA), not a socket fallback (NET/Socket). nccl-tests reports busbw in GB/s while link rates are quoted in Gb/s, so divide the link rate by 8 before comparing (a 400 Gb/s link tops out at 50 GB/s). A socket fallback means the RDMA resource was not requested in the pod or the HCA filter is wrong (RDMA/RoCE tuning, runbook: NCCL hang). Expect achieved busbw to land below the nominal line rate; protocol overhead is normal.
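"Near line rate" becomes checkable once a floor is written down. The sketch below is one way to do that, under stated assumptions: `busbw_gate` is a hypothetical helper, the out-of-place busbw is assumed to be the 8th column of an `all_reduce_perf` result row (check the header printed by your build), and both the per-GPU link rate in Gb/s and the accepted fraction are values you choose, not figures from this page.

```shell
# Hypothetical helper: busbw_gate LINK_GBPS MIN_FRACTION < all_reduce_perf output
# Compares busbw at the largest message size with a floor derived from the link rate.
busbw_gate() {
    awk -v gbps="$1" -v frac="$2" '
        # Result rows start with a numeric byte count; "#" headers and NCCL_DEBUG lines do not.
        $1 ~ /^[0-9]+$/ && NF >= 13 && $1 + 0 >= size {
            size = $1 + 0; bytes = $1; busbw = $8 + 0; rows++
        }
        END {
            if (!rows) { print "FAIL no result rows found"; exit 1 }
            minbw = gbps / 8 * frac        # Gb/s -> GB/s, then the accepted fraction
            verdict = (busbw >= minbw) ? "PASS" : "FAIL"
            printf "%s busbw=%.1f GB/s at %s bytes (floor %.1f GB/s)\n", verdict, busbw, bytes, minbw
            if (verdict == "FAIL") exit 1
        }'
}
```

Piping the launcher log through `busbw_gate 400 0.8`, for example, would require at least 40 GB/s at the largest size on a 400 Gb/s link. Record the measured value as well as the verdict, since the figure is what later regressions are compared against.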

Step 5: Telemetry gate (PromQL)

Before any benchmark, confirm metrics flow so steps 6-8 are observable. With dcgm-exporter scraped by Prometheus, this returns a non-empty vector:⁵

# Per-GPU utilization is being scraped from every GPU node.
count(DCGM_FI_DEV_GPU_UTIL) by (Hostname)

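Pass: one series per expected GPU host, each with a value equal to the number of GPU series scraped from that host (normally its GPU count). Wire the SLO/burn-rate alerts from SLO/SLI catalog and confirm one fires on a synthetic condition (observability & monitoring).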
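A non-empty vector does not prove that every host is covered, so the host list is worth diffing against inventory. The sketch below assumes two plain-text files with one hostname per line; `telemetry_gate` and both file names are hypothetical.

```shell
# Hypothetical helper: telemetry_gate EXPECTED_FILE OBSERVED_FILE
# Prints each expected GPU host with no DCGM series and fails if any is missing.
telemetry_gate() {
    awk '
        NF == 0 { next }                                  # ignore blank lines
        FILENAME == ARGV[1] { expected[$1] = 1; next }    # first file: inventory
        { seen[$1] = 1 }                                  # second file: hosts seen in Prometheus
        END {
            for (h in expected) if (!(h in seen)) { print "MISSING " h; missing++ }
            if (missing) exit 1
            print "PASS all expected hosts report DCGM_FI_DEV_GPU_UTIL"
        }' "$1" "$2"
}
```

The observed list can come from the Prometheus HTTP API, for instance `curl -s "$PROM_URL/api/v1/query" --data-urlencode 'query=count by (Hostname) (DCGM_FI_DEV_GPU_UTIL)' | jq -r '.data.result[].metric.Hostname' > observed.txt`, where `PROM_URL` is a placeholder for your Prometheus endpoint.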

Integrate

The training and inference manifests for steps 7-8 are owned by workload bring-up recipes; this playbook only sequences and gates them. Gang scheduling for step 7 uses manifest: Volcano Job / helm: Volcano scheduler; FSDP specifics in train: FSDP, low-bandwidth/multi-region in train: DiLoCo.
Ray-based stacks substitute step 4's scheduler with a RayCluster (cluster: Ray); k3s edge clusters in cluster: k3s; the orchestration trade-offs in cluster orchestration and cluster: Kubernetes.
Record the step-6 busbw and step-7/8 MFU and TTFT/TPOT figures as the acceptance baseline; later regressions are measured against them (runbook: MFU regression, runbook: inference SLO breach).
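Recording an MFU baseline presupposes a way to compute it. The sketch below uses a common approximation that is not taken from this page: about 6 FLOPs per parameter per token for forward plus backward on a dense transformer, ignoring attention FLOPs. `mfu_estimate` is a hypothetical helper, and the peak FLOP/s per GPU must come from your accelerator's datasheet for the precision in use.

```shell
# Hypothetical helper: mfu_estimate PARAMS TOKENS_PER_SEC NUM_GPUS PEAK_FLOPS_PER_GPU
# Assumes ~6 * PARAMS training FLOPs per token (dense model, attention ignored).
mfu_estimate() {
    awk -v n="$1" -v tps="$2" -v gpus="$3" -v peak="$4" 'BEGIN {
        achieved = 6 * n * tps                            # approx. achieved FLOP/s
        printf "MFU %.1f%%\n", 100 * achieved / (gpus * peak)
    }'
}
```

Whatever formula you settle on, keep it fixed between the acceptance run and later comparisons; the baseline is only meaningful if the arithmetic behind it does not change.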

Maintain

Re-run from step 6 after any driver, firmware, or fabric change; the host stack moved underneath the recorded baseline.
Keep step 3's health gate continuous, not one-shot: a node that passed at commissioning can degrade in service (gpu health gating).
Treat the recorded baselines as living SLIs: when busbw or MFU drifts with the workload unchanged, suspect the fabric or scheduling before the model (SLO/SLI catalog).
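Measuring drift against the recorded baseline can be as simple as a percentage-drop check. A minimal sketch, assuming a higher-is-better metric such as busbw or MFU; `drift_gate` and the tolerance are hypothetical, and latency-style metrics such as TTFT/TPOT would need the comparison inverted.

```shell
# Hypothetical helper: drift_gate NAME BASELINE CURRENT MAX_DROP_PCT
# For higher-is-better metrics only; fails when the drop exceeds the tolerance.
drift_gate() {
    awk -v name="$1" -v base="$2" -v cur="$3" -v tol="$4" 'BEGIN {
        drop = (base - cur) / base * 100                  # percent below baseline
        verdict = (drop > tol) ? "REGRESSION" : "OK"
        printf "%s %s baseline=%s current=%s drop=%.1f%%\n", verdict, name, base, cur, drop
        exit (verdict == "OK" ? 0 : 1)
    }'
}
```

For example, `drift_gate busbw 46 41.4 5` flags a 10% drop against a 5% tolerance.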

Failure modes

Step skipped, failure deferred. Training launched before NCCL validation; a degraded link is misread as a model/GPU fault and profiled for days (runbook: MFU regression).
No gang scheduler at step 7. Distributed job half-places, pins GPUs, deadlocks. Step 4's gate (Volcano Running) prevents it.
TCP fallback at step 6. RDMA resource not requested in the pod spec, or wrong HCA filter; busbw collapses (RDMA and RoCE Performance Tuning).
Telemetry blind (step 5 skipped). Steps 6-8 run but nothing is recorded; "pass" is unfalsifiable. Gate on a non-empty PromQL vector first.
vLLM OOM at step 8. --max-model-len or the batch size exceeds the memory available for the KV cache (inference serving).

References

NVIDIA GPU Operator — install & verify (helm command, clusterpolicy, allocatable GPUs): https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html
NVIDIA Network Operator (RDMA device plugin, rdma/... resources): https://docs.nvidia.com/networking/display/cokan/network+operator
NVIDIA nccl-tests (all_reduce_perf, busbw, -b -e -f -g flags): https://github.com/NVIDIA/nccl-tests
NCCL environment variables (NCCL_IB_HCA, NCCL_DEBUG, GDR): https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
Volcano gang scheduler: https://volcano.sh/en/docs/
Kubeflow MPI Operator (MPIJob for NCCL): https://github.com/kubeflow/mpi-operator
vLLM serving (OpenAI-compatible endpoint, health): https://docs.vllm.ai/en/latest/
dcgm-exporter / DCGM field IDs (DCGM_FI_DEV_GPU_UTIL): https://docs.nvidia.com/datacenter/dcgm/latest/
Prometheus querying basics (PromQL): https://prometheus.io/docs/prometheus/latest/querying/basics/

¹ NVIDIA GPU Operator getting-started: add the nvidia helm repo, then helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator --version=<release>; verify with kubectl get clusterpolicy (state ready) and by confirming nodes report allocatable nvidia.com/gpu. The version pin shown is illustrative; pin to your validated release. https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html
² Volcano provides gang scheduling so all pods of a job are scheduled together or none are. https://volcano.sh/en/docs/
³ nccl-tests all_reduce_perf reports a busbw (bus bandwidth) column documented on the project's performance page; total ranks = processes x threads x GPUs-per-thread. https://github.com/NVIDIA/nccl-tests
⁴ nccl-tests flags: -b/--minbytes start size, -e/--maxbytes end size, -f/--stepfactor multiplier between sizes, -g/--ngpus GPUs per thread (default 1). https://github.com/NVIDIA/nccl-tests
⁵ DCGM_FI_DEV_GPU_UTIL is the DCGM field for per-GPU utilization, exported by dcgm-exporter for Prometheus scraping. https://docs.nvidia.com/datacenter/dcgm/latest/