Runbook: add GPU capacity¶
Scope: safely add GPU capacity (new nodes or scale-up) to a running cluster: burn-in, fabric and health validation, then admit to scheduling, with a rollback path.
Run this to commission new GPU nodes into an existing cluster, whether new racks/nodes or a cloud node-pool scale-up. Severity: planned, additive change (no impact to running capacity if executed behind a cordon). New nodes are admitted only after they pass acceptance; nothing is scheduled onto unproven hardware.
The commands in this runbook are reference templates built on real APIs; pin versions and validate them before production use.
This is the longform procedure for RB-2 in operational runbooks. It is the same Ansible and acceptance machinery used at first bring-up, applied incrementally to a live cluster: facility/fabric (datacentre readiness, networking fabric), provisioning (Ansible bring-up, provisioning and scheduling), and acceptance (commissioning).
Trigger¶
- New racks / nodes delivered and physically installed, cabled, and powered.
- Cloud node-pool scale-up: a managed GPU node pool grows and new nodes need to join the scheduler and pass acceptance before taking work.
Pre-checks¶
- Facility signed off (datacentre readiness): power budget, cooling (CDU flow / rack inlet) confirmed for the added draw; rack and PDU capacity headroom.
- Fabric signed off (networking fabric): new node ports cabled to the correct rails/leaf switches, subnet manager sees them, ibdiagnet clean (no link errors, correct topology, no rate/width degradation).
- Hostnames, BMC/IPMI, and DNS in place; nodes reachable over SSH for Ansible.
- Inventory change prepared in git (the new hosts), not applied to the fleet yet (SRE and MLOps practices).
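The reachability pre-check can be scripted before touching the inventory. The following is a minimal sketch, not part of the original runbook: the helper names `precheck_host` and `precheck_all` are hypothetical, and it assumes key-based SSH from the Ansible control host. It covers DNS and SSH only; BMC/IPMI reachability is left out because it depends on site tooling.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not from the runbook.

# precheck_host HOST: DNS must resolve and key-based SSH must work,
# since Ansible needs both. Prints one status line for the host.
precheck_host() {
  local host="$1"
  if ! getent hosts "$host" >/dev/null; then
    echo "$host: DNS lookup failed"
    return 1
  fi
  # BatchMode avoids hanging on a password prompt; -n keeps ssh off stdin.
  if ! ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$host" true; then
    echo "$host: SSH unreachable"
    return 1
  fi
  echo "$host: ok"
}

# precheck_all HOST...: check every host, return non-zero if any failed.
precheck_all() {
  local rc=0 host
  for host in "$@"; do
    precheck_host "$host" || rc=1
  done
  return "$rc"
}
```

For example, `precheck_all gpu-17 gpu-18` prints a line per host and returns non-zero if either one fails, so it can gate the inventory merge.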
Flow¶
flowchart LR
A["Facility / fabric check"] --> B["Ansible provision (site.yml)"]
B --> C["Join scheduler (k8s / Slurm)"]
C --> D["Acceptance suite"]
D --> E["2-node NCCL test vs existing"]
E -->|"busbw at line rate"| F["Admit to pool"]
E -.->|"degraded / fail"| G["Cordon, hold out of inventory"]
D -.->|"diag fail"| G
Procedure¶
NEW is the new host (or an inventory group of new hosts).
- Fabric clean: from a host on the fabric, confirm the new ports are healthy and the topology is as designed (networking fabric):
- Provision with Ansible: add the host(s) to the inventory and run the same site.yml used at bring-up (driver, branch-matched Fabric Manager, DOCA-OFED, container toolkit, DCGM, host tuning, ACS-disable). It is idempotent and ends in the validate role (Ansible bring-up):
- Join the scheduler (provisioning and scheduling). On Kubernetes, join, then keep the node cordoned until acceptance passes; the GPU Operator labels and validates it (the Kubernetes platform):
- Run the acceptance suite (commissioning): per-GPU burn-in and the long HW diagnostic:
- 2-node NCCL test against existing nodes: the decisive check that the new node talks to the existing fabric at line rate (workload recipes). Build/launch nccl-tests all_reduce_perf across one new + one established node:
  # 2 nodes x 8 GPUs, scan 8B..8GiB doubling; pin HCA + GDR
  NCCL_IB_HCA=mlx5 NCCL_NET_GDR_LEVEL=SYS NCCL_DEBUG=INFO \
  mpirun -np 16 -H gpu-01:8,gpu-17:8 \
  ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1
  Read the reported busbw and confirm NCCL_DEBUG shows the IB/GDRDMA transport (NET/IB/.../GDRDMA), not a NET/Socket TCP fallback.
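The first four steps above (fabric check, provisioning, scheduler join, acceptance) can be chained so that any failure stops the sequence with the node still cordoned. This is a minimal sketch under stated assumptions, not the runbook's own script: `run` and `commission_node` are hypothetical helpers, `inventory/hosts.ini` is a placeholder path, the scheduler join itself is omitted because tokens and endpoints are site-specific, and `dcgmi diag -r 3` stands in for the long hardware diagnostic. The burn-in tool is not named in the runbook, so it is left out. The exit status of `ibdiagnet` may not reflect every link warning; read its summary as well.

```bash
#!/usr/bin/env bash
# Sketch only: helper names and the inventory path are assumptions.

# run CMD...: echo the command, then execute it unless DRY_RUN=1.
run() {
  printf '+ %s\n' "$*"
  [ "${DRY_RUN:-0}" = "1" ] || "$@"
}

# commission_node NEW [INVENTORY]: stop at the first failing step.
# Return code identifies the step: 1 fabric, 2 provision, 3 cordon, 4 diag.
commission_node() {
  local new="$1" inventory="${2:-inventory/hosts.ini}"

  # Fabric clean, run from a host on the fabric.
  run ibdiagnet || return 1

  # Same site.yml as bring-up, restricted to the new host or group.
  run ansible-playbook -i "$inventory" site.yml --limit "$new" || return 2

  # The node is assumed to have joined the cluster by this point;
  # keep it cordoned until acceptance passes.
  run kubectl cordon "$new" || return 3

  # Long DCGM hardware diagnostic, executed on the new node itself.
  run ssh "$new" dcgmi diag -r 3 || return 4
}
```

Running `DRY_RUN=1 commission_node "$NEW"` prints the four commands without executing them, which is a cheap way to review the plan before a change window.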
Verification¶
- busbw at line rate to the new nodes on the 2-node test, within tolerance of the established fleet baseline (performance tuning, workload recipes); GDR path confirmed in the NCCL log.
- Allocatable nvidia.com/gpu increased by the expected per-node GPU count once admitted:
- Telemetry flowing (telemetry and monitoring): DCGM exporter scraping the new node, dashboards and alerts populated.
- Only then admit: kubectl uncordon "$NEW" (Slurm: scontrol update nodename=<n> state=resume).
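The verification gates above can be checked mechanically before the uncordon. The sketch below is illustrative rather than the runbook's own tooling. It assumes the log file holds the combined output of the 2-node all_reduce_perf run with NCCL_DEBUG=INFO, that result rows follow the column layout of current nccl-tests releases (out-of-place busbw in column 8), and that NCCL channel lines read `via NET/IB/.../GDRDMA` or `via NET/Socket`. All function names are hypothetical, and the minimum busbw must come from your own fleet baseline. Telemetry is not covered here.

```bash
#!/usr/bin/env bash
# Sketch only: function names, log layout, and thresholds are assumptions.

# peak_busbw LOG: largest out-of-place busbw (GB/s) among result rows.
# Result rows start with a numeric message size; NCCL INFO lines do not.
peak_busbw() {
  awk '$1 ~ /^[0-9]+$/ && NF >= 9 { if ($8 + 0 > max) max = $8 + 0 }
       END { print max + 0 }' "$1"
}

# uses_gdrdma LOG: succeed only if channels report NET/IB with GDRDMA
# and none report the NET/Socket TCP fallback.
uses_gdrdma() {
  grep -Eq 'via NET/IB/.*GDRDMA' "$1" && ! grep -q 'via NET/Socket' "$1"
}

# allocatable_gpus NODE: allocatable nvidia.com/gpu reported by the node.
allocatable_gpus() {
  kubectl get node "$1" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
}

# admission_ok NODE LOG MIN_BUSBW EXPECTED_GPUS: every gate must pass.
# Return code identifies the failed gate: 1 busbw, 2 transport, 3 GPU count.
admission_ok() {
  local node="$1" log="$2" min_bw="$3" want_gpus="$4" bw
  bw="$(peak_busbw "$log")"
  awk -v b="$bw" -v m="$min_bw" 'BEGIN { exit !(b >= m) }' || return 1
  uses_gdrdma "$log" || return 2
  [ "$(allocatable_gpus "$node")" = "$want_gpus" ] || return 3
}
```

A possible use is `admission_ok "$NEW" nccl.log "$MIN_BUSBW" 8 && kubectl uncordon "$NEW"`, where `nccl.log` and `MIN_BUSBW` are placeholders for the captured test output and the baseline threshold.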
Rollback¶
Additive change: back out by withholding, not by destroying running capacity. Keep the new node cordoned and remove it from the active inventory/partition until acceptance passes; existing capacity is untouched.
kubectl cordon "$NEW" # or never uncordoned if acceptance failed
# Slurm: scontrol update nodename=<n> state=drain reason="failed acceptance"
If ibdiagnet or the 2-node busbw shows degradation, treat it as a fabric or node fault and resolve it before any further attempt (networking fabric, the NCCL-hang runbook); persistent GPU diag failure diverts to RMA (the GPU-fault runbook).
Related runbooks¶
- the driver-upgrade runbook: Rolling driver / CUDA upgrade (shares the Ansible + cordon/drain primitives).
- the GPU-fault runbook: GPU fault, drain, reset, RMA (when a new node fails diag).
- the NCCL-hang runbook: NCCL hang / collective stall (when the 2-node busbw is wrong).
- operational runbooks: Operational runbooks index (RB-2).
References¶
- NVIDIA commissioning / acceptance practices: https://docs.nvidia.com/dgx-superpod/index.html
- InfiniBand diagnostics (ibdiagnet, in the NVIDIA networking / OFED docs): https://docs.nvidia.com/networking/
- nccl-tests (all_reduce_perf, -b/-e/-f/-g flags): https://github.com/NVIDIA/nccl-tests
- DCGM diagnostics (run levels): https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html
- kubeadm join (add a node): https://kubernetes.io/docs/reference/setup-tools/kubeadm/kubeadm-join/
- NVIDIA GPU Operator (node validation, allocatable GPUs): https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Related: Networking Fabric · Datacentre Physical · Commissioning · Provisioning · Ansible · Workload Bring-Up · Operational Runbooks · Driver Upgrade · Glossary