Runbook: add GPU capacity

Scope: safely add GPU capacity (new nodes or scale-up) to a running cluster: burn-in, fabric and health validation, then admit to scheduling, with a rollback path.

Run this to commission new GPU nodes into an existing cluster, whether new racks/nodes or a cloud node-pool scale-up. Severity: planned, additive change, with no impact on running capacity as long as the new nodes stay cordoned until accepted. New nodes are admitted only after they pass acceptance; nothing is scheduled onto unproven hardware.

The commands below are reference templates built on real APIs; pin versions and validate them before production use.

This is the longform procedure for RB-2 in operational runbooks. It is the same Ansible and acceptance machinery used at first bring-up, applied incrementally to a live cluster: facility/fabric (datacentre readiness, networking fabric), provisioning (Ansible bring-up, provisioning and scheduling), and acceptance (commissioning).

Trigger

  • New racks / nodes delivered and physically installed, cabled, and powered.
  • Cloud node-pool scale-up: a managed GPU node pool grows and new nodes need to join the scheduler and pass acceptance before taking work.

Pre-checks

  • Facility signed off (datacentre readiness): power budget, cooling (CDU flow / rack inlet) confirmed for the added draw; rack and PDU capacity headroom.
  • Fabric signed off (networking fabric): new node ports cabled to the correct rails/leaf switches, subnet manager sees them, ibdiagnet clean (no link errors, correct topology, no rate/width degradation).
  • Hostnames, BMC/IPMI, and DNS in place; nodes reachable over SSH for Ansible.
  • Inventory change prepared in git (the new hosts), not applied to the fleet yet (SRE and MLOps practices).
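
The DNS and SSH pre-check can be scripted so that Ansible is never pointed at a host it cannot reach. The following is a minimal sketch, not part of the original runbook: `precheck_host` and `precheck_all` are hypothetical helper names, and BMC/IPMI reachability is not covered.

```bash
#!/usr/bin/env bash
# Sketch: confirm each new host resolves in DNS and accepts non-interactive
# SSH before provisioning. Function names are illustrative, not a standard tool.

# precheck_host HOST -> prints "ok HOST" or "FAIL HOST <reason>"; returns 0 on ok.
precheck_host() {
  local host=$1
  if ! getent hosts "$host" >/dev/null; then
    echo "FAIL $host dns"
    return 1
  fi
  # BatchMode never prompts for a password; ConnectTimeout fails fast on dead hosts.
  if ! ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true </dev/null >/dev/null 2>&1; then
    echo "FAIL $host ssh"
    return 1
  fi
  echo "ok $host"
}

# precheck_all HOST... -> checks every host; returns non-zero if any failed.
precheck_all() {
  local rc=0 host
  for host in "$@"; do
    precheck_host "$host" || rc=1
  done
  return "$rc"
}
```

Source the file and run `precheck_all "$NEW"` (or pass every host in the new inventory group); a non-zero exit means at least one host is not ready for Ansible.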

Flow

flowchart LR
    A["Facility / fabric check"] --> B["Ansible provision (site.yml)"]
    B --> C["Join scheduler (k8s / Slurm)"]
    C --> D["Acceptance suite"]
    D --> E["2-node NCCL test vs existing"]
    E -->|"busbw at line rate"| F["Admit to pool"]
    E -.->|"degraded / fail"| G["Cordon, hold out of inventory"]
    D -.->|"diag fail"| G

Procedure

NEW is the new host (or an inventory group of new hosts).

NEW=gpu-17.dc1.internal
  1. Fabric clean: from a host on the fabric, confirm the new ports are healthy and the topology is as designed (networking fabric):
    ibdiagnet --pc            # run the fabric report and reset port counters; expect no errors
    ibstat | grep -c "State: Active"   # counts Active ports on this host only; compare with its expected port count
    
  2. Provision with Ansible: add the host(s) to the inventory and run the same site.yml used at bring-up (driver, branch-matched Fabric Manager, DOCA-OFED, container toolkit, DCGM, host tuning, ACS-disable). It is idempotent and ends in the validate role (Ansible bring-up):
    ansible-playbook -i inventory/hosts.ini site.yml --limit "$NEW"
    
  3. Join the scheduler (provisioning and scheduling). On Kubernetes, join, then keep the node cordoned until acceptance passes; the GPU Operator labels and validates it (the Kubernetes platform):
    # on the new node: kubeadm join <control-plane>:6443 --token ... --discovery-token-ca-cert-hash sha256:...
    kubectl cordon "$NEW"
    # Slurm equivalent: add to slurm.conf / partition, then
    #   scontrol update nodename=<n> state=drain reason="acceptance"
    
  4. Run the acceptance suite (commissioning): per-GPU burn-in and the long HW diagnostic:
    ssh "$NEW" 'dcgmi diag -r 3'     # Long: PCIe/NVLink, GPU memory, mem bandwidth, NCCL, stress
    ssh "$NEW" nvidia-smi --query-gpu=name,driver_version,ecc.errors.uncorrected.aggregate.total --format=csv,noheader
    
  5. 2-node NCCL test against existing nodes: the decisive check that the new node talks to the existing fabric at line rate (workload recipes). Build/launch nccl-tests all_reduce_perf across one new + one established node:
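    # 2 nodes x 8 GPUs (nccl-tests built with MPI=1), scan 8B..8GiB doubling; pin HCA + GDR
    # Open MPI: -x forwards the NCCL_* settings to the ranks on the remote host
    NCCL_IB_HCA=mlx5 NCCL_NET_GDR_LEVEL=SYS NCCL_DEBUG=INFO \
    mpirun -np 16 -H gpu-01:8,gpu-17:8 \
      -x NCCL_IB_HCA -x NCCL_NET_GDR_LEVEL -x NCCL_DEBUG \
      ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1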
    
    Read the reported busbw and confirm NCCL_DEBUG shows the IB/GDRDMA transport (NET/IB/.../GDRDMA), not a NET/Socket TCP fallback.
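
The pass/fail reading of steps 4 and 5 can be scripted rather than eyeballed. The sketch below is an illustration under stated assumptions, not the runbook's own tooling: the function names are hypothetical, the busbw parser assumes the 13-column result rows that nccl-tests prints (out-of-place busbw in column 8), and the transport check assumes channel lines of the form `via NET/IB/.../GDRDMA` and `via NET/Socket`, which can vary between NCCL versions.

```bash
#!/usr/bin/env bash
# Sketch: turn acceptance output into exit codes. All names are illustrative.

# ecc_clean: read the step-4 nvidia-smi CSV on stdin; succeed only if there is
# at least one GPU row and every row's last field (uncorrected ECC total) is 0.
ecc_clean() {
  awk -F', *' 'NF { rows++; if ($NF != "0") bad++ }
       END { exit !(rows > 0 && bad == 0) }'
}

# peak_busbw: read all_reduce_perf output on stdin and print the highest
# out-of-place busbw in GB/s. Assumes 13-field result rows, busbw in field 8.
peak_busbw() {
  awk '$1 ~ /^[0-9]+$/ && NF == 13 { if ($8 + 0 > max) max = $8 + 0 }
       END { printf "%.2f\n", max + 0 }'
}

# busbw_ok PEAK BASELINE TOLERANCE_PCT: succeed if PEAK is within
# TOLERANCE_PCT percent below BASELINE (or above it).
busbw_ok() {
  awk -v p="$1" -v b="$2" -v t="$3" 'BEGIN { exit !(p >= b * (1 - t / 100)) }'
}

# transport_ok: read the NCCL_DEBUG=INFO log on stdin; succeed only if an
# IB GDRDMA channel appears and no channel fell back to sockets.
transport_ok() {
  awk '/via NET\/IB\/.*GDRDMA/ { gdr = 1 } /via NET\/Socket/ { sock = 1 }
       END { exit !(gdr && !sock) }'
}
```

With the step-5 output captured to a file, `busbw_ok "$(peak_busbw < nccl.log)" "$BASELINE_GBPS" 10 && transport_ok < nccl.log` gates on both conditions; `BASELINE_GBPS` and the 10% tolerance are placeholders for the fleet's own measured baseline and policy.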

Verification

  • busbw on the 2-node test is at line rate for the new nodes, within tolerance of the established fleet baseline (performance tuning, workload recipes); GDR path confirmed in the NCCL log.
  • Allocatable nvidia.com/gpu on the new node equals the expected per-node GPU count (the device plugin reports it while the node is still cordoned, so check before admitting):
    kubectl get nodes -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' | grep "$NEW"
    
  • Telemetry flowing (telemetry and monitoring): DCGM exporter scraping the new node, dashboards and alerts populated.
  • Only then admit: kubectl uncordon "$NEW" (Slurm: scontrol update nodename=<n> state=resume).
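
The last two Kubernetes checks can be combined into a single gate so that a node is uncordoned only when its GPU count is right. This is a minimal sketch: `admit_if_ready` is a hypothetical helper, the expected count is supplied by the operator, and the busbw and telemetry checks are assumed to have been confirmed separately.

```bash
#!/usr/bin/env bash
# Sketch: uncordon a node only if its allocatable GPU count matches.
# admit_if_ready is an illustrative name, not an existing tool.

# admit_if_ready NODE EXPECTED_GPUS
#   returns 0 after uncordoning, 1 if the count does not match, 2 on kubectl error.
admit_if_ready() {
  local node=$1 expected=$2 got
  got=$(kubectl get node "$node" \
        -o jsonpath='{.status.allocatable.nvidia\.com/gpu}') || return 2
  if [ "$got" != "$expected" ]; then
    echo "hold $node: allocatable nvidia.com/gpu=${got:-none}, expected $expected" >&2
    return 1
  fi
  kubectl uncordon "$node"
}
```

For an 8-GPU node this would be called as `admit_if_ready "$NEW" 8`; on a mismatch the node stays cordoned and the reason is printed to stderr.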

Rollback

Additive change: back out by withholding, not by destroying running capacity. Keep the new node cordoned and remove it from the active inventory/partition until acceptance passes; existing capacity is untouched.

kubectl cordon "$NEW"     # or never uncordoned if acceptance failed
# Slurm: scontrol update nodename=<n> state=drain reason="failed acceptance"
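
When holding a node out after a failed acceptance, it can help to confirm the hold actually took effect. A minimal sketch for the Kubernetes case follows; `hold_node` is a hypothetical helper that relies on the node's `spec.unschedulable` field, which is `true` while a node is cordoned.

```bash
#!/usr/bin/env bash
# Sketch: cordon a node and verify the scheduler will not place work on it.
# hold_node is an illustrative name, not an existing tool.

# hold_node NODE
#   returns 0 if the node is confirmed unschedulable, 1 if not, 2 on kubectl error.
hold_node() {
  local node=$1 state
  kubectl cordon "$node" >/dev/null || return 2
  state=$(kubectl get node "$node" -o jsonpath='{.spec.unschedulable}') || return 2
  [ "$state" = "true" ]
}
```

`hold_node "$NEW" || echo "WARNING: $NEW is still schedulable"` is one way to surface a hold that did not stick.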

If ibdiagnet/2-node busbw shows degradation, treat it as a fabric or node fault before any further attempt (networking fabric, the NCCL-hang runbook); persistent GPU diag failure diverts to RMA (the GPU-fault runbook).

References

  • NVIDIA commissioning / acceptance practices: https://docs.nvidia.com/dgx-superpod/index.html
  • InfiniBand diagnostics (ibdiagnet, in the NVIDIA networking / OFED docs): https://docs.nvidia.com/networking/
  • nccl-tests (all_reduce_perf, -b/-e/-f/-g flags): https://github.com/NVIDIA/nccl-tests
  • DCGM diagnostics (run levels): https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html
  • kubeadm join (add a node): https://kubernetes.io/docs/reference/setup-tools/kubeadm/kubeadm-join/
  • NVIDIA GPU Operator (node validation, allocatable GPUs): https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html

Related: Networking Fabric · Datacentre Physical · Commissioning · Provisioning · Ansible · Workload Bring-Up · Operational Runbooks · Driver Upgrade · Glossary