
Runbook: add GPU capacity

Scope: safely add GPU capacity (new nodes or a scale-up) to a running cluster. The procedure covers burn-in, fabric and health validation, and admission to scheduling, with a rollback path.

Run this to commission new GPU nodes into an existing cluster, whether new racks/nodes or a cloud node-pool scale-up. Severity: planned, additive change with no impact on running capacity, provided it is executed behind a cordon. New nodes are admitted only after they pass acceptance; nothing is scheduled onto unproven hardware.

The commands below are reference templates built on real APIs; pin versions and validate them before production use.

This is the longform procedure for RB-2 in operational runbooks. It is the same Ansible and acceptance machinery used at first bring-up, applied incrementally to a live cluster: facility/fabric (datacentre readiness, networking fabric), provisioning (Ansible bring-up, provisioning and scheduling), and acceptance (commissioning).

Trigger

New racks / nodes delivered and physically installed, cabled, and powered.
Cloud node-pool scale-up: a managed GPU node pool grows and new nodes need to join the scheduler and pass acceptance before taking work.

Pre-checks

Facility signed off (datacentre readiness): power budget, cooling (CDU flow / rack inlet) confirmed for the added draw; rack and PDU capacity headroom.
Fabric signed off (networking fabric): new node ports cabled to the correct rails/leaf switches, subnet manager sees them, ibdiagnet clean (no link errors, correct topology, no rate/width degradation).
Hostnames, BMC/IPMI, and DNS in place; nodes reachable over SSH for Ansible.
Inventory change prepared in git (the new hosts), not applied to the fleet yet (SRE and MLOps practices).
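
The DNS and SSH pre-check above can be scripted so that it runs before the inventory change is applied. This is a minimal sketch, assuming key-based SSH is already in place; the helper name `check_reachable` and the 5-second timeout are illustrative choices, not part of the runbook's tooling.

```shell
#!/usr/bin/env bash
# Sketch: report which new hosts resolve in DNS and answer SSH.
# check_reachable is a hypothetical helper; the 5 s timeout is an assumption.
check_reachable() {
  local host failed=0
  for host in "$@"; do
    if ! getent hosts "$host" >/dev/null; then
      echo "FAIL dns $host"
      failed=1
      continue
    fi
    # BatchMode makes ssh fail instead of prompting for a password.
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true; then
      echo "OK $host"
    else
      echo "FAIL ssh $host"
      failed=1
    fi
  done
  return "$failed"
}
```

A call such as `check_reachable gpu-17.dc1.internal` exits non-zero if any host fails either check, so it can gate the Ansible run.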

Flow

```mermaid
flowchart LR
    A["Facility / fabric check"] --> B["Ansible provision (site.yml)"]
    B --> C["Join scheduler (k8s / Slurm)"]
    C --> D["Acceptance suite"]
    D --> E["2-node NCCL test vs existing"]
    E -->|"busbw at line rate"| F["Admit to pool"]
    E -.->|"degraded / fail"| G["Cordon, hold out of inventory"]
    D -.->|"diag fail"| G
```

Procedure

NEW is the new host (or an inventory group of new hosts).

```
NEW=gpu-17.dc1.internal
```

Fabric clean: from a host on the fabric, confirm the new ports are healthy and the topology is as designed (networking fabric):

```
ibdiagnet --pc            # report, clear/check perf counters; expect no errors
ibstat | grep -c "State: Active"   # counts Active ports on the local host only
```
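
The port count from `ibstat` is only useful when compared against the number of ports the node should have. The sketch below wraps that comparison. The helper names are hypothetical, and the expected count (for example 8) is an assumption that depends on the node's HCA layout.

```shell
#!/usr/bin/env bash
# Sketch: count Active ports in `ibstat` output and compare with the expected number.
# active_ports and check_active_ports are hypothetical helper names.
active_ports() {
  # Reads ibstat text on stdin. Matches "State: Active" but not "Physical state: ...".
  # grep -c exits 1 on zero matches, so mask that to keep the count usable.
  grep -c 'State: Active' || true
}

check_active_ports() {
  local expected="$1" actual
  actual="$(active_ports)"
  if [ "$actual" -eq "$expected" ]; then
    echo "OK: $actual/$expected ports Active"
  else
    echo "FAIL: $actual/$expected ports Active" >&2
    return 1
  fi
}
```

Usage on a host with the InfiniBand tools installed: `ibstat | check_active_ports 8`.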

Provision with Ansible: add the host(s) to the inventory and run the same site.yml used at bring-up (driver, branch-matched Fabric Manager, DOCA-OFED, container toolkit, DCGM, host tuning, ACS-disable). It is idempotent and ends in the validate role (Ansible bring-up):
```
ansible-playbook -i inventory/hosts.ini site.yml --limit "$NEW"
```

Join the scheduler (provisioning and scheduling). On Kubernetes, join, then keep the node cordoned until acceptance passes; the GPU Operator labels and validates it (the Kubernetes platform):

```
# on the new node: kubeadm join <control-plane>:6443 --token ... --discovery-token-ca-cert-hash sha256:...
kubectl cordon "$NEW"
# Slurm equivalent: add to slurm.conf / partition, then
#   scontrol update nodename=<n> state=drain reason="acceptance"
```
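
Because everything after this step relies on the node staying out of scheduling, it can help to confirm the cordon took effect before starting acceptance. This is a minimal sketch that reads `spec.unschedulable` from the node object; the helper name `is_cordoned` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: succeed only if the node is cordoned (spec.unschedulable is "true").
# is_cordoned is a hypothetical helper name.
is_cordoned() {
  local node="$1" flag
  # The field is absent on schedulable nodes, so jsonpath prints an empty string.
  flag="$(kubectl get node "$node" -o jsonpath='{.spec.unschedulable}')" || return 2
  [ "$flag" = "true" ]
}
```

For example, `is_cordoned "$NEW" || echo "node is schedulable, cordon it first" >&2`. Exit status 2 means the `kubectl` query itself failed.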

Run the acceptance suite (commissioning): per-GPU burn-in and the long HW diagnostic:

```
ssh "$NEW" 'dcgmi diag -r 3'     # Long: PCIe/NVLink, GPU memory, mem bandwidth, NCCL, stress
ssh "$NEW" nvidia-smi --query-gpu=name,driver_version,ecc.errors.uncorrected.aggregate.total --format=csv,noheader
```
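
The `nvidia-smi` CSV from the step above can be checked mechanically instead of by eye. The sketch below assumes the three-column output of that exact query. The helper name `check_gpu_csv` and the expected GPU count are assumptions, and treating `[N/A]` (ECC unavailable) as a failure is a conservative choice rather than a documented policy.

```shell
#!/usr/bin/env bash
# Sketch: gate on the CSV columns name, driver_version, uncorrected ECC total.
# check_gpu_csv is a hypothetical helper; pass the expected GPU count as $1.
check_gpu_csv() {
  local expected_gpus="$1"
  awk -F', *' -v want="$expected_gpus" '
    NF >= 3 {
      gpus++
      drivers[$2] = 1
      # Anything other than a literal 0 (including [N/A]) is flagged for review.
      if ($3 != "0") { bad++; printf "FAIL ecc row %d (%s): %s\n", gpus, $1, $3 }
    }
    END {
      n = 0
      for (d in drivers) n++
      if (gpus + 0 != want + 0) { printf "FAIL count: %d GPUs, expected %d\n", gpus, want; bad++ }
      if (n > 1) { print "FAIL mixed driver versions"; bad++ }
      exit (bad > 0)
    }'
}
```

Usage: pipe the query output into it, for example `ssh "$NEW" nvidia-smi --query-gpu=name,driver_version,ecc.errors.uncorrected.aggregate.total --format=csv,noheader | check_gpu_csv 8`.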

2-node NCCL test against existing nodes: the decisive check that the new node talks to the existing fabric at line rate (workload recipes). Build/launch nccl-tests all_reduce_perf across one new + one established node:
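```
# 2 nodes x 8 GPUs, scan 8B..8GiB doubling; pin HCA + GDR.
# Open MPI syntax assumed: -x forwards each variable to the remote ranks.
NCCL_IB_HCA=mlx5 NCCL_NET_GDR_LEVEL=SYS NCCL_DEBUG=INFO \
mpirun -np 16 -H gpu-01:8,gpu-17:8 \
  -x NCCL_IB_HCA -x NCCL_NET_GDR_LEVEL -x NCCL_DEBUG \
  ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1
```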
Read the reported busbw and confirm that the NCCL_DEBUG=INFO log shows the IB/GDRDMA transport (NET/IB/.../GDRDMA), not a NET/Socket TCP fallback.
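
Reading busbw and the transport lines can be automated once the run's combined output is saved to a file. The sketch below assumes the standard nccl-tests result table, where rows start with the message size in bytes and out-of-place busbw is column 8. The helper names, the baseline, and the tolerance are hypothetical; the real baseline should come from the established fleet.

```shell
#!/usr/bin/env bash
# Sketch: gate on an all_reduce_perf log. Helper names are hypothetical.

# Print the highest out-of-place busbw (GB/s) found in the result rows on stdin.
peak_busbw() {
  awk '$1 ~ /^[0-9]+$/ && NF >= 13 && $8 + 0 > max { max = $8 + 0 }
       END { printf "%.2f\n", max }'
}

# check_busbw LOG BASELINE_GBPS TOLERANCE_PCT
check_busbw() {
  local log="$1" baseline="$2" tol_pct="$3" peak
  peak="$(peak_busbw < "$log")"
  # Shell arithmetic is integer-only, so compare the floats in awk.
  if ! awk -v p="$peak" -v b="$baseline" -v t="$tol_pct" \
       'BEGIN { exit !(p >= b * (1 - t / 100)) }'; then
    echo "FAIL busbw: peak $peak GB/s, baseline $baseline GB/s, tolerance $tol_pct%" >&2
    return 1
  fi
  # Require at least one IB channel with GPUDirect RDMA and no socket-transport channels.
  if grep -q 'via NET/Socket' "$log" || ! grep -q 'NET/IB.*GDRDMA' "$log"; then
    echo "FAIL transport: expected NET/IB/.../GDRDMA and no NET/Socket channels" >&2
    return 1
  fi
  echo "OK busbw: peak $peak GB/s"
}
```

With the run captured as `mpirun ... 2>&1 | tee nccl.log`, a call like `check_busbw nccl.log 350 5` passes only if the peak is within 5% of a 350 GB/s baseline (both numbers are placeholders) and the log shows the GDRDMA path.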

Verification

busbw at line rate to the new nodes on the 2-node test, within tolerance of the established fleet baseline (performance tuning, workload recipes); GDR path confirmed in the NCCL log.

Allocatable nvidia.com/gpu increased by the expected per-node GPU count once admitted:

```
kubectl get nodes -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' | grep "$NEW"
```

Telemetry flowing (telemetry and monitoring): DCGM exporter scraping the new node, dashboards and alerts populated.
Only then admit: kubectl uncordon "$NEW" (Slurm: scontrol update nodename=<n> state=resume).
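
The last two checks and the uncordon can be chained so that admission cannot happen without them. This is a minimal sketch: the helper names, the expected GPU count, and the metrics URL are assumptions, and `DCGM_FI_DEV_GPU_TEMP` is used only as a representative metric from the exporter's default set.

```shell
#!/usr/bin/env bash
# Sketch: final gate before uncordon. Helper names are hypothetical.

# Print the node's allocatable nvidia.com/gpu count (empty if the resource is absent).
allocatable_gpus() {
  kubectl get node "$1" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
}

# Succeed if the Prometheus endpoint at $1 serves at least one DCGM GPU metric sample.
exporter_has_gpu_metrics() {
  curl -fsS --max-time 5 "$1" | grep -q '^DCGM_FI_DEV_GPU_TEMP'
}

# admit_node NODE EXPECTED_GPUS METRICS_URL
admit_node() {
  local node="$1" expected="$2" metrics_url="$3" got
  got="$(allocatable_gpus "$node")" || return 1
  if [ "$got" != "$expected" ]; then
    echo "HOLD $node: allocatable nvidia.com/gpu=${got:-0}, expected $expected" >&2
    return 1
  fi
  if ! exporter_has_gpu_metrics "$metrics_url"; then
    echo "HOLD $node: no DCGM metrics at $metrics_url" >&2
    return 1
  fi
  kubectl uncordon "$node"
}
```

For example, `admit_node "$NEW" 8 "http://<exporter-address>:9400/metrics"`, where the address of the DCGM exporter serving that node is site-specific. A node that fails either check stays cordoned.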

Rollback

Additive change: back out by withholding, not by destroying running capacity. Keep the new node cordoned and remove it from the active inventory/partition until acceptance passes; existing capacity is untouched.

```
kubectl cordon "$NEW"     # or never uncordoned if acceptance failed
# Slurm: scontrol update nodename=<n> state=drain reason="failed acceptance"
```

If ibdiagnet/2-node busbw shows degradation, treat it as a fabric or node fault before any further attempt (networking fabric, the NCCL-hang runbook); persistent GPU diag failure diverts to RMA (the GPU-fault runbook).

Related runbooks:

the driver-upgrade runbook: Rolling driver / CUDA upgrade (shares the Ansible + cordon/drain primitives).
the GPU-fault runbook: GPU fault, drain, reset, RMA (when a new node fails diag).
the NCCL-hang runbook: NCCL hang / collective stall (when the 2-node busbw is wrong).
operational runbooks: Operational runbooks index (RB-2).

References

NVIDIA commissioning / acceptance practices: https://docs.nvidia.com/dgx-superpod/index.html
InfiniBand diagnostics (ibdiagnet, in the NVIDIA networking / OFED docs): https://docs.nvidia.com/networking/
nccl-tests (all_reduce_perf, -b/-e/-f/-g flags): https://github.com/NVIDIA/nccl-tests
DCGM diagnostics (run levels): https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html
kubeadm join (add a node): https://kubernetes.io/docs/reference/setup-tools/kubeadm/kubeadm-join/
NVIDIA GPU Operator (node validation, allocatable GPUs): https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html