
Runbook: thermal / cooling emergency

Scope: respond to a thermal or cooling emergency (GPU throttling or a CDU alarm) to protect hardware and restore service.

Run this when GPUs are throttling on temperature across many nodes, or when a facility cooling alarm fires (CDU fault, coolant flow or temperature out of range). Severity is high: thermal slowdown precedes thermal shutdown, and liquid cooling is mandatory at B300 density, so a cooling loss escalates fast. Protect the hardware by shedding heat proactively; do not wait for an emergency power-off.

The commands below are reference templates built on real APIs; pin versions and validate them before production use.

This is the longform procedure for RB-8 in operational runbooks. The facility cooling layer (CDUs, coolant loops, per-rack heat rejection, and why liquid cooling is mandatory at B300 density) is in datacentre readiness; thermal throttle/shutdown behaviour and the power-cap perf/W lever are in reliability and RAS and performance tuning.

Trigger

GPUThermalThrottle cluster-wide: clocks_throttle_reasons shows thermal/HW slowdown on many GPUs at once (observability, reliability and RAS), or DCGM_FI_DEV_GPU_TEMP rising across a hall (telemetry and monitoring).
A facility cooling alarm: a CDU fault, coolant flow below threshold, or supply-temperature high (datacentre readiness). A liquid-cooling fault typically appears as throttle, then shutdown (reliability and RAS).
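
The throttle-reasons field is a bitmask, so "thermal/HW slowdown" has to be read from individual bits rather than from the whole value. The sketch below decodes the thermal-related bits; the function name `thermal_reasons` is hypothetical, and the bit values are taken from NVML's `nvmlClocksThrottleReason*` constants, which should be checked against the `nvml.h` that ships with your driver. HW slowdown (0x8) is reported as well, although it can also be raised by a power-brake event.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the thermal-related bits set in a
# clocks_throttle_reasons.active bitmask (hex string from nvidia-smi).
# Bit values assumed from NVML: 0x08 HW slowdown, 0x20 SW thermal slowdown,
# 0x40 HW thermal slowdown. Other bits (0x01 GPU idle, 0x04 SW power cap, ...)
# are ignored here because they are not thermal events.
thermal_reasons() {
  local mask=$(( $1 ))            # bash arithmetic accepts 0x... hex input
  local out=()
  (( mask & 0x08 )) && out+=("hw_slowdown")
  (( mask & 0x20 )) && out+=("sw_thermal_slowdown")
  (( mask & 0x40 )) && out+=("hw_thermal_slowdown")
  if (( ${#out[@]} == 0 )); then
    echo "none"
  else
    local IFS=,
    echo "${out[*]}"              # comma-joined list of active thermal bits
  fi
}

# Example (run on a GPU node):
#   nvidia-smi --query-gpu=clocks_throttle_reasons.active --format=csv,noheader |
#     while read -r m; do thermal_reasons "$m"; done
```

An idle or power-capped GPU reports a non-zero mask without any of these bits set, which is why the helper prints `none` for such values instead of treating every non-zero mask as a thermal event.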

Pre-checks

Confirm scope immediately: this decides everything below. One node (a single GPU or a node-local pump/cold-plate issue) versus hall-wide (a CDU, a coolant loop, or a facility chilled-water failure). Read temps fleet-wide and the facility/BMS alarms together:
```
# per-node GPU temps + throttle reasons (one row per GPU; nvidia-smi is node-local, so run it on each node)
nvidia-smi --query-gpu=index,temperature.gpu,clocks_throttle_reasons.active --format=csv,noheader
# PromQL (run in Prometheus, not in the shell): hottest GPUs across the fleet
topk(20, DCGM_FI_DEV_GPU_TEMP)
```
Many nodes hot at once ⇒ facility/CDU. One node hot, neighbours cool ⇒ node-local.
Loop in facility/DC operations now if scope is hall-wide: the fix is partly physical (CDU, valves, chilled water) and outside the cluster's control (datacentre readiness). Do not diagnose a CDU from the cluster side alone.
Have the drain primitives ready (the GPU-fault runbook): cordon/drain on the scheduler, and root for nvidia-smi power-limit changes.
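
The scope decision above can be sketched as a small classifier over collected readings. This is a minimal illustration, not part of the original procedure: the function name `classify_scope`, the `node,temp` input format, and the example values of 85 °C and 3 nodes are all assumptions. Take the real temperature threshold from your GPUs (for example the slowdown temperature shown by `nvidia-smi -q -d TEMPERATURE`) and choose the node count to suit your hall layout.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify thermal scope from "node,temp_c" lines on stdin.
# Usage: classify_scope THRESHOLD_C MIN_HOT_NODES < readings.csv
# A node counts as hot if any of its rows is at or above THRESHOLD_C.
classify_scope() {
  local threshold=$1 min_hot=$2
  awk -F, -v t="$threshold" -v m="$min_hot" '
    $2 + 0 >= t { hot[$1] = 1 }        # remember each hot node once
    END {
      n = 0
      for (k in hot) n++               # number of distinct hot nodes
      if (n == 0)      print "none"
      else if (n >= m) print "hall-wide"
      else             print "node-local"
    }'
}

# Example with made-up readings:
#   printf 'gpu-01,88\ngpu-02,90\ngpu-03,91\n' | classify_scope 85 3
```

The output is only a first hint; facility/BMS alarms remain the authority for declaring a hall-wide event.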

Flow

```mermaid
stateDiagram-v2
    [*] --> Scope
    Scope --> HallWide: many nodes hot
    Scope --> NodeLocal: one node hot
    HallWide --> Facility: engage DC ops on CDU
    HallWide --> Shed: power-cap and drain to shed heat
    Facility --> Verify: cooling restored
    Shed --> Verify: temps falling
    NodeLocal --> DrainNode: cordon and drain
    DrainNode --> Verify: heat removed from node
    Verify --> Restore: temps under threshold
    Restore --> [*]: power limits restored
```

Procedure

The split is scope: hall-wide means shed heat across the affected racks while facility fixes the loop; node-local means isolate the one node.

```
NODE=gpu-07.dc1.internal
```

Hall-wide (CDU / coolant / facility): coordinate with facility on the physical fault and, from the cluster side, proactively shed heat before any forced shutdown. Reducing the GPU power limit cuts heat output immediately and buys the cooling system margin; the perf/W curve is non-linear, so a large heat cut costs only modest throughput (performance tuning, datacentre readiness). Apply a power cap to every node in the affected racks, then drain to remove load:
```
# power-cap every GPU on the node to shed heat (root). Value in watts, between min and max
# reported by: nvidia-smi --query-gpu=power.min_limit,power.max_limit --format=csv
ssh "$NODE" 'sudo nvidia-smi -pl <reduced-watts>'        # repeat across affected nodes
# then drain load off the rack so GPUs idle and cool
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --timeout=10m
```
Power-capping first (heat down in seconds) then draining (load off) sheds heat before temperatures reach the thermal-shutdown threshold; a controlled drawdown beats an uncontrolled emergency power-off that risks in-flight jobs and hardware.
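
Because `-pl` only accepts a value between the reported minimum and maximum, it helps to compute the reduced limit rather than guess it. The sketch below does that arithmetic; the function name `reduced_power_limit` and the 60% figure in the example are assumptions for illustration, not a recommended cap.

```shell
#!/usr/bin/env bash
# Hypothetical helper: pick a reduced power limit as a percentage of the
# maximum, clamped to the [min, max] range that nvidia-smi reports.
# Usage: reduced_power_limit MIN_W MAX_W PERCENT   -> integer watts
reduced_power_limit() {
  local min=${1%.*} max=${2%.*} pct=$3   # drop decimals such as "200.00"
  local target=$(( max * pct / 100 ))
  (( target < min )) && target=$min      # never go below the supported floor
  (( target > max )) && target=$max      # never exceed the supported ceiling
  echo "$target"
}

# Example (run on the node; 60 is an illustrative percentage):
#   min=$(nvidia-smi -i 0 --query-gpu=power.min_limit --format=csv,noheader,nounits)
#   max=$(nvidia-smi -i 0 --query-gpu=power.max_limit --format=csv,noheader,nounits)
#   sudo nvidia-smi -pl "$(reduced_power_limit "$min" "$max" 60)"
```
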
Node-local (single GPU / node pump / cold-plate): isolate the one node so its heat source is removed without touching the rest of the hall (the GPU-fault runbook):
```
kubectl cordon "$NODE"
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --timeout=10m
# Slurm: scontrol update nodename=<n> state=drain reason="thermal"
```
A GPU that stays hot after the drain, or a node-local pump/flow fault, is a hardware issue. Divert a persistently hot GPU to the GPU-fault / RMA path (the GPU-fault runbook, reliability and RAS); treat a cold-plate or node pump fault as a serviceable-component repair, not a GPU RMA.
Watch temps fall as load drops and (hall-wide) cooling is restored. Throttle reasons should clear and DCGM_FI_DEV_GPU_TEMP should fall back under the throttle threshold (observability, telemetry and monitoring). Do not restore load or power limits until temperatures are stable with margin.
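
"Stable with margin" can be made concrete as: the last few samples all sit at or below the threshold minus a margin. The sketch below checks exactly that over a list of integer readings; the function name `temps_stable`, the sample count, and the numbers in the example are assumptions to adapt to your own thresholds and polling interval.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the last SAMPLES readings are all
# at or below LIMIT_C - MARGIN_C. Readings are integer degrees Celsius,
# oldest first.
# Usage: temps_stable LIMIT_C MARGIN_C SAMPLES READING...
temps_stable() {
  local limit=$1 margin=$2 samples=$3
  shift 3
  local readings=("$@")
  local n=${#readings[@]}
  (( n >= samples )) || return 1              # not enough history yet
  local i
  for (( i = n - samples; i < n; i++ )); do
    (( readings[i] <= limit - margin )) || return 1
  done
  return 0
}

# Example with made-up readings (limit 85, margin 10, last 3 samples):
#   temps_stable 85 10 3 90 80 74 73 72 && echo "stable"
```

Requiring several consecutive samples avoids restoring load on a single low reading taken while the cooling loop is still oscillating.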

Verification

Temps under the throttle threshold with margin, and clocks_throttle_reasons.active reports no thermal/HW-slowdown bit:

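```
nvidia-smi --query-gpu=index,temperature.gpu,clocks_throttle_reasons.active --format=csv,noheader
# expect no thermal/HW-slowdown bits on recovered GPUs (NVML: 0x08 HW slowdown, 0x20 SW thermal, 0x40 HW thermal)
# the mask is not necessarily 0x0: a drained, idle GPU typically reports 0x0000000000000001 (GPU idle)
# newer driver branches expose this as clocks_event_reasons.active (throttle name deprecated)
```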

No thermal shutdowns: no further GPUs fall off the bus from over-temperature (reliability and RAS); the GPUThermalThrottle alert clears.
Cooling within spec (hall-wide): facility confirms CDU coolant flow and supply temperature back in range (datacentre readiness) before any uncap.

Rollback

This is a protective drawdown, not a config change; "rollback" means restoring the reduced power limits once cooling is confirmed healthy and temperatures are stable. Restore the default power limit (and reset clocks only if they were locked during the incident), then uncordon and re-admit load gradually, watching temps as load returns:

```
# default limit in watts is reported by: nvidia-smi --query-gpu=power.default_limit --format=csv
ssh "$NODE" 'sudo nvidia-smi -pl <default-watts>'
# only if clocks were locked with -lgc: ssh "$NODE" 'sudo nvidia-smi -rgc'
kubectl uncordon "$NODE"
# Slurm: scontrol update nodename=<n> state=resume
# re-check temps as load ramps back; do not bulk-uncordon a hall faster than cooling can absorb
```

Restore power limits and re-admit incrementally, not all at once: a hall-wide simultaneous return to full power can re-trip a cooling system that is only just back in range (datacentre readiness).
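
One way to keep the return incremental is to plan it in fixed-size batches and hold between them. The sketch below only prints the plan and runs nothing against the cluster; the function name `plan_readmit` and the batch size are assumptions, and the right size depends on how much heat the cooling loop can absorb per step.

```shell
#!/usr/bin/env bash
# Hypothetical helper: split a node list into fixed-size batches for a
# staged re-admit. It prints one line per batch and changes nothing.
# Usage: plan_readmit BATCH_SIZE NODE...
plan_readmit() {
  local size=$1
  shift
  (( size > 0 )) || return 1                 # guard against an endless loop
  local nodes=("$@")
  local i batch=1
  for (( i = 0; i < ${#nodes[@]}; i += size )); do
    echo "batch ${batch}: ${nodes[*]:i:size}"
    batch=$(( batch + 1 ))
  done
}

# Example with made-up node names:
#   plan_readmit 2 gpu-01 gpu-02 gpu-03 gpu-04 gpu-05
```

For each batch, restore the power limit and uncordon its nodes, then wait until temperatures hold under the threshold before starting the next batch.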

the GPU-fault runbook: GPU fault, drain, reset, RMA (node-local thermal fault that turns out to be hardware).
the driver-upgrade runbook: Rolling driver / CUDA upgrade (shares the cordon/drain primitives).
the inference-SLO runbook: Inference SLO breach (a throttling GPU surfaces there as a latency regression).
operational runbooks: Operational runbooks index (RB-8).

References

nvidia-smi manual (-pl power limit, -lgc/-rgc clock control, query fields): https://docs.nvidia.com/deploy/nvidia-smi/
DCGM (GPU temperature, throttle reasons, exporter fields): https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html
NVIDIA GPU debug guidelines (thermal slowdown vs shutdown): https://docs.nvidia.com/deploy/gpu-debug-guidelines/index.html
kubectl drain (safely drain a node): https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/
GB300 NVL72 power, cooling, CDU/liquid-cooling requirements: https://introl.com/blog/why-nvidia-gb300-nvl72-blackwell-ultra-matters