Runbook: thermal / cooling emergency

Scope: respond to a thermal or cooling emergency (GPU throttling or a CDU alarm) to protect hardware and restore service.

Run this when GPUs are throttling on temperature across many nodes, or a facility cooling alarm fires (CDU fault, coolant flow/temperature out of range). Severity is high: thermal slowdown precedes thermal shutdown, and at B300 density liquid cooling is mandatory, so a cooling loss escalates fast. Protect the hardware: shed heat proactively, do not wait for an emergency power-off.

The commands below are reference templates built on real APIs; pin versions and validate before production use.

This is the longform procedure for RB-8 in operational runbooks. The facility cooling layer (CDUs, coolant loops, per-rack heat rejection, and why liquid cooling is mandatory at B300 density) is in datacentre readiness; thermal throttle/shutdown behaviour and the power-cap perf/W lever are in reliability and RAS and performance tuning.

Trigger

Pre-checks

  • Confirm scope immediately: this decides everything below. One node (a single GPU or a node-local pump/cold-plate issue) versus hall-wide (a CDU, a coolant loop, or a facility chilled-water failure). Read temps fleet-wide and the facility/BMS alarms together:
    # per-node GPU temps + throttle reasons (one row per GPU); run on every node to build the fleet view
    nvidia-smi --query-gpu=index,temperature.gpu,clocks_throttle_reasons.active --format=csv,noheader
    # PromQL: hottest GPUs across the fleet
    topk(20, DCGM_FI_DEV_GPU_TEMP)
    
    Many nodes hot at once ⇒ facility/CDU. One node hot, neighbours cool ⇒ node-local.
  • Loop in facility/DC operations now if scope is hall-wide: the fix is partly physical (CDU, valves, chilled water) and outside the cluster's control (datacentre readiness). Do not diagnose a CDU from the cluster side alone.
  • Have the drain primitives ready (the GPU-fault runbook): cordon/drain on the scheduler, and root for nvidia-smi power-limit changes.
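The scope rule above (many nodes hot versus one) can be sketched as a small shell helper. This is a minimal sketch, not part of the runbook's tooling: the function name `classify_scope`, the input format (one `node max_temp_c` pair per line on stdin), and both thresholds are assumptions to adapt to your fleet.

```shell
# classify_scope HOT_THRESHOLD_C MANY_NODES
# Reads "node max_temp_c" lines on stdin and prints one of:
#   hall-wide   at least MANY_NODES nodes are at or above the threshold
#   node-local  at least one node, but fewer than MANY_NODES, is hot
#   none        no node is at or above the threshold
# Both arguments are hypothetical tuning knobs, not vendor values.
classify_scope() (
  threshold=$1
  many=$2
  # Count nodes whose reported temperature meets or exceeds the threshold.
  hot=$(awk -v t="$threshold" '$2 + 0 >= t { n++ } END { print n + 0 }')
  if [ "$hot" -ge "$many" ]; then
    echo hall-wide
  elif [ "$hot" -ge 1 ]; then
    echo node-local
  else
    echo none
  fi
)
```

For example, `printf 'n1 91\nn2 93\nn3 90\n' | classify_scope 85 3` prints `hall-wide`. The result is only a first cut: facility/BMS alarms still take precedence over anything inferred from GPU temperatures.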

Flow

stateDiagram-v2
    [*] --> Scope
    Scope --> HallWide: many nodes hot
    Scope --> NodeLocal: one node hot
    HallWide --> Facility: engage DC ops on CDU
    HallWide --> Shed: power-cap and drain to shed heat
    Facility --> Verify: cooling restored
    Shed --> Verify: temps falling
    NodeLocal --> DrainNode: cordon and drain
    DrainNode --> Verify: heat removed from node
    Verify --> Restore: temps under threshold
    Restore --> [*]: power limits restored

Procedure

The procedure splits on scope: hall-wide means shed heat across the affected racks while facility fixes the loop; node-local means isolate the one node.

NODE=gpu-07.dc1.internal
  1. Hall-wide (CDU / coolant / facility): coordinate with facility on the physical fault and, from the cluster side, proactively shed heat before any forced shutdown. Reducing the GPU power limit cuts heat output immediately and buys the cooling system margin; the perf/W curve is non-linear, so a large heat cut costs only modest throughput (performance tuning, datacentre readiness). Apply a power cap to every node in the affected racks, then drain to remove load:

    # power-cap every GPU on the node to shed heat (root). Value in watts, between min and max
    # reported by: nvidia-smi --query-gpu=power.min_limit,power.max_limit --format=csv
    ssh "$NODE" 'sudo nvidia-smi -pl <reduced-watts>'        # repeat across affected nodes
    # then drain load off the node so its GPUs idle and cool (repeat per node in the rack)
    kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --timeout=10m
    
    Power-capping first (heat down in seconds) then draining (load off) sheds heat before temperatures reach the thermal-shutdown threshold; a controlled drawdown beats an uncontrolled emergency power-off that risks in-flight jobs and hardware.

  2. Node-local (single GPU / node pump / cold-plate): isolate the one node so its heat source is removed without touching the rest of the hall (the GPU-fault runbook):

    kubectl cordon "$NODE"
    kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --timeout=10m
    # Slurm: scontrol update nodename=<n> state=drain reason="thermal"
    
    A single GPU that stays hot after the drain is a hardware issue, so divert it to the GPU-fault / RMA path (the GPU-fault runbook, reliability and RAS). A node-local pump, flow, or cold-plate fault is also hardware, but it is a serviceable component: raise it for component service, not a GPU RMA.

  3. Watch temps fall as load drops and (hall-wide) cooling is restored. Throttle reasons should clear and DCGM_FI_DEV_GPU_TEMP should fall back under the throttle threshold (observability, telemetry and monitoring). Do not restore load or power limits until temperatures are stable with margin.
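Step 1's power cap has to fall between the minimum and maximum limits the GPU reports, and it has to be repeated on every affected node. A minimal sketch of that arithmetic plus a dry-run loop follows. The names `clamp_cap` and `plan_power_cap`, the percentage-of-max approach, and the wattages in the usage example are illustrative assumptions, not values from this runbook or from any GPU datasheet.

```shell
# clamp_cap MIN_W MAX_W PERCENT
# Prints PERCENT% of MAX_W in whole watts, clamped into [MIN_W, MAX_W].
# MIN_W and MAX_W are meant to come from power.min_limit / power.max_limit.
clamp_cap() (
  min=$1
  max=$2
  pct=$3
  cap=$(( max * pct / 100 ))
  if [ "$cap" -lt "$min" ]; then cap=$min; fi
  if [ "$cap" -gt "$max" ]; then cap=$max; fi
  printf '%s\n' "$cap"
)

# plan_power_cap WATTS NODE...
# Dry run: prints one power-cap command per node instead of executing it,
# so the plan can be reviewed before anything touches the hardware.
plan_power_cap() (
  watts=$1
  shift
  for node in "$@"; do
    printf "ssh %s 'sudo nvidia-smi -pl %s'\n" "$node" "$watts"
  done
)
```

With hypothetical limits of 200 W and 1000 W, `plan_power_cap "$(clamp_cap 200 1000 60)" gpu-07 gpu-08` prints two `nvidia-smi -pl 600` commands. Printing first and piping to `sh` only after review keeps a mistyped cap from going out to a whole rack.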

Verification

  • Temps under the throttle threshold with margin, and clocks_throttle_reasons.active reports no thermal/HW-slowdown bit:
    nvidia-smi --query-gpu=index,temperature.gpu,clocks_throttle_reasons.active --format=csv,noheader
    # expect no thermal or HW-slowdown bit set; a drained, idle GPU typically reports 0x0000000000000001 (GPU idle), not all zeros
    # newer driver branches expose this as clocks_event_reasons.active (throttle name deprecated)
    
  • No thermal shutdowns: no further GPUs fall off the bus from over-temperature (reliability and RAS); the GPUThermalThrottle alert clears.
  • Cooling within spec (hall-wide): facility confirms CDU coolant flow and supply temperature back in range (datacentre readiness) before any uncap.
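The first check above can be scripted against the CSV that `nvidia-smi` emits. The sketch below assumes the thermal-related bits of the reasons bitmask are 0x08 (HW slowdown), 0x20 (SW thermal slowdown), and 0x40 (HW thermal slowdown), as defined in NVML's throttle-reason constants; verify them against the `nvml.h` shipped with your driver. The function names, the threshold, and the margin are hypothetical.

```shell
# has_thermal_throttle MASK
# Succeeds if MASK (hex, e.g. 0x0000000000000040) has any thermal-related bit.
# 0x68 = 0x08 | 0x20 | 0x40 (assumed NVML bit values, see the note above).
has_thermal_throttle() {
  [ $(( $1 & 0x68 )) -ne 0 ]
}

# unrecovered_gpus THRESHOLD_C MARGIN_C
# Reads "index, temperature, mask" CSV rows on stdin (the noheader query
# output) and prints every GPU that still has a thermal bit set or is hotter
# than THRESHOLD_C - MARGIN_C. No output means every GPU passed.
unrecovered_gpus() (
  limit=$(( $1 - $2 ))
  while IFS=', ' read -r idx temp mask; do
    if has_thermal_throttle "$mask" || [ "$temp" -gt "$limit" ]; then
      printf 'GPU %s: temp=%sC reasons=%s\n' "$idx" "$temp" "$mask"
    fi
  done
)
```

Usage on a node, with an assumed 85 C threshold and 10 C margin: `nvidia-smi --query-gpu=index,temperature.gpu,clocks_throttle_reasons.active --format=csv,noheader | unrecovered_gpus 85 10`. Masking for thermal bits, not comparing against zero, avoids false alarms from the idle and power-cap bits that are normal during a drawdown.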

Rollback

This is a protective drawdown, not a config change; "rollback" means restoring the reduced power limits once cooling is confirmed healthy and temperatures are stable. Reset the power limit to its default, then uncordon and re-admit load gradually, watching temps as load returns:

# the default limit is reported by: nvidia-smi --query-gpu=power.default_limit --format=csv
# -rgc resets locked clocks only; run it in addition if clocks were locked with -lgc, it does not restore the power limit
ssh "$NODE" 'sudo nvidia-smi -pl <default-max-watts>'
kubectl uncordon "$NODE"
# re-check temps as load ramps back; do not bulk-uncordon a hall faster than cooling can absorb

Restore power limits and re-admit incrementally, not all at once: a hall-wide simultaneous return to full power can re-trip a cooling system that is only just back in range (datacentre readiness).
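One way to make the incremental return concrete is to split the node list into fixed-size batches and restore one batch at a time. A minimal sketch, assuming a hypothetical helper named `batch_nodes` and a batch size you choose with facility, follows; the runbook itself does not prescribe either.

```shell
# batch_nodes SIZE NODE...
# Prints the nodes SIZE per line, so each output line is one restore batch.
batch_nodes() (
  size=$1
  shift
  line=""
  n=0
  for node in "$@"; do
    line="${line:+$line }$node"
    n=$(( n + 1 ))
    if [ "$n" -eq "$size" ]; then
      printf '%s\n' "$line"
      line=""
      n=0
    fi
  done
  # Emit the final, possibly short, batch.
  if [ -n "$line" ]; then
    printf '%s\n' "$line"
  fi
)
```

For example, `batch_nodes 2 gpu-01 gpu-02 gpu-03` prints `gpu-01 gpu-02` and then `gpu-03`. Restore and uncordon the nodes on one line, re-check temps and the facility readings, and only then move to the next line.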

References

  • nvidia-smi manual (-pl power limit, -lgc/-rgc clock control, query fields): https://docs.nvidia.com/deploy/nvidia-smi/
  • DCGM (GPU temperature, throttle reasons, exporter fields): https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html
  • NVIDIA GPU debug guidelines (thermal slowdown vs shutdown): https://docs.nvidia.com/deploy/gpu-debug-guidelines/index.html
  • kubectl drain (safely drain a node): https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/
  • GB300 NVL72 power, cooling, CDU/liquid-cooling requirements: https://introl.com/blog/why-nvidia-gb300-nvl72-blackwell-ultra-matters

Related: Datacentre Physical · Observability · Reliability · Optimization · Telemetry · GPU Fault/RMA · Operational Runbooks · Glossary