
GPU power, clocks, and thermal tuning

Scope: controlling GPU clocks, power draw, and heat for deterministic benchmarks and better performance-per-watt. Covers locking core/memory clocks (nvidia-smi -lgc / -lmc), capping board power (-pl), reading thermal throttling from throttle-reason flags and violation counters, the perf-per-watt tradeoff, and the liquid-cooling envelope that lets dense Blackwell racks hold clocks at all.

flowchart TB
    Tune["Operator tunes the operating point"]
    Clocks["Lock core/mem clocks (nvidia-smi -lgc / -lmc)"]
    Cap["Cap board power (nvidia-smi -pl, watts)"]
    Cool["Liquid cooling: CDU, 150-200 L/min, holds 50-70 C"]
    Work["Workload runs kernels, draws power"]
    Heat["Power dissipated -> GPU temperature rises"]
    Limit{"Over power cap or thermal limit?"}
    Throttle["GPU throttles: clocks drop (clocks_throttle_reasons.* = Active)"]
    Drop["Throughput drops; benchmark numbers wander"]
    Sweet["Hold the sweet spot: max sustained clocks at the perf/watt knee"]
    Observe["Observe: throttle reasons, dmon, DCGM thermal-violation time"]

    Tune --> Clocks
    Tune --> Cap
    Tune --> Cool
    Clocks --> Work
    Cap --> Work
    Cool --> Work
    Work --> Heat
    Heat --> Limit
    Limit -- "Yes" --> Throttle
    Throttle --> Drop
    Limit -- "No" --> Sweet
    Drop --> Observe
    Sweet --> Observe
    Observe --> Tune

What it is

A data-center GPU does not run at one fixed frequency. It boosts opportunistically up to a board-power ceiling and a thermal ceiling, and clocks down (throttles) whenever it hits either. Left at defaults, two back-to-back runs of the same kernel can land at different clocks, so benchmark numbers wander even with identical code.

Three independent controls, all exposed by nvidia-smi, let you pin the operating point:

-lgc / --lock-gpu-clocks: clamp the SM/core clock to a single value or a min,max MHz pair, removing boost variance (NVIDIA nvidia-smi).
-lmc / --lock-memory-clocks: clamp the HBM clock the same way. Note: this does not work on NVIDIA Hopper GPUs, and on Blackwell HBM clock control is similarly restricted; treat memory-clock locking as Volta–Ampere-era tooling and verify support on your part (NVIDIA nvidia-smi).
-pl / --power-limit: set the board (TGP) power cap in watts, between the device's reported Min and Max limits (NVIDIA nvidia-smi).

Throttling is the GPU lowering clocks on its own to stay inside a power or temperature limit. It is observable: the driver exposes per-reason clocks_throttle_reasons.* flags and DCGM exposes a cumulative thermal violation time counter. In the book's framing, liquid cooling exists precisely so the chip never has to throttle: "By keeping the GPU and CPU temperatures much lower than they would be with air, liquid cooling reduces thermal GPU throttling. The GPUs can sustain their maximum clocks without hitting temperature limits" (Fregly, Ch. 2).

Why it matters

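Reproducibility. Boost clocks float with temperature and instantaneous power headroom, so an unpinned GPU gives noisy throughput. Locking core (and, where supported, memory) clocks to a fixed value removes that boost variance, so a kernel's time depends on the kernel rather than on the GPU's recent thermal and power history. This is the precondition for trusting an A/B benchmark or a regression gate. A lock is a request, not a guarantee: the GPU still drops below the locked value if it hits a power or thermal limit, so check the throttle reasons alongside the numbers. See Performance Optimization and Tuning and GPU Diagnostics and Validation.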

Performance-per-watt. Frequency-vs-power is nonlinear: the last few hundred MHz of boost cost disproportionate watts, and leakage power rises with temperature, so a hotter chip burns more for the same work (Fregly, Ch. 2). The book is explicit that you can trade a small amount of throughput for a large power saving: "if you're running a smaller job that doesn't need every last drop of performance, you could dial down GPU clocks to improve efficiency (measured in performance per watt) and still meet your throughput requirement. This could save kilowatts of power... Over weeks of training, this can translate to significant savings" (Fregly, Ch. 2). Capping power (-pl) or clocks is how an operator harvests that saving.

Density makes it physical. A fully loaded GB200/GB300 NVL72 draws up to ~130 kW per rack, with each Blackwell GPU dissipating on the order of ~1,200 W (Fregly, Ch. 2). Air cannot remove that, so the rack is fully liquid-cooled and holds GPU temperatures in the 50–70 °C range under load, cool enough that the GPUs sustain max clocks instead of thermal-throttling (Fregly, Ch. 2). Power and thermal tuning is therefore inseparable from facility design; see Datacentre Physical Readiness and GPU Performance and Health.

Transients. 72 GPUs ramping idle→full simultaneously can pull tens of kW in milliseconds; the book notes the system may "stagger the GPU boost clocks by tiny intervals so they don't all spike at exactly the same microsecond," and that power transients can momentarily engage hardware slowdown (Fregly, Ch. 2). This is why a transient hw_slowdown is not always a fault.

When it is needed (and when not)

Lock clocks / cap power when:

Benchmarking or profiling. Always pin clocks before collecting numbers; otherwise variance swamps the effect you are measuring.
Efficiency campaigns. A job that already meets its throughput target wastes watts if left at stock boost; cap it to the knee of the perf/watt curve.
Power-envelope or thermal constraints. A rack or row over its provisioned circuit/cooling budget can be brought under cap with -pl instead of shedding GPUs.
Transient management. Locking clocks removes the synchronized boost spikes that stress PDUs.

Do not bother (or expect little) when:

You want maximum absolute throughput and have the power and cooling headroom: stock boost is already optimal; locking to a lower fixed clock only loses performance.
The GPU is liquid-cooled and already holding max clocks (NVL72 at 50–70 °C); there is no thermal throttling to recover, so clock-locking buys only determinism, not speed (Fregly, Ch. 2).
Memory-clock locking on Hopper/Blackwell: -lmc is unsupported on Hopper; do not script it as a hard dependency (NVIDIA nvidia-smi).
A higher layer already owns the knob: under Kubernetes the GPU Operator / DCGM and under Slurm the prolog may set power limits per job; setting them twice fights the scheduler. See Containers and Kubernetes for GPUs and Kubernetes for GPU Clusters.

Power/clock control is reference tooling here; this page does not claim hardware-tested values. Validate every limit against your own device's reported Min/Max before scripting it.

How: implement, integrate, maintain

The setters below (-pm, -lgc, -lmc, -pl, and their resets) require root and a loaded NVIDIA driver; the read-only queries do not need root. Enable persistence mode first so the driver stays resident and your settings are not lost when the last client exits (NVIDIA nvidia-smi):

sudo nvidia-smi -pm 1                      # persistence mode on (Linux only; not persistent across reboot)

1. Read the current operating point and supported clocks

nvidia-smi -q -d CLOCK,POWER,TEMPERATURE   # current/max clocks, power limits, temps
nvidia-smi --query-supported-clocks=gpu_name,mem,gr --format=csv   # legal clock values
nvidia-smi -q -d POWER | grep "Power Limit"   # current, default, enforced, min, and max limits (labels vary by driver version)

-pl must fall between the reported Min and Max Power Limit; -lgc/-lmc should target a value the device lists as supported.
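Both checks can be scripted before any setter runs. The following is a minimal sketch, not a vetted tool: the helper names nearest_clock and power_in_range are hypothetical, and the usage comments assume the query fields power.min_limit and power.max_limit and plain numeric output with csv,noheader,nounits.

```shell
# Hypothetical helpers; they only parse text, so they can be dry-run without a GPU.

# Print the supported clock (MHz, one value per line on stdin) closest to the target.
nearest_clock() {
  awk -v t="$1" '
    { d = $1 - t; if (d < 0) d = -d
      if (NR == 1 || d < best) { best = d; pick = $1 } }
    END { if (NR > 0) print pick }'
}

# Succeed only if the requested watts lie within [min, max].
power_in_range() {
  awk -v w="$1" -v lo="$2" -v hi="$3" 'BEGIN { exit !(w >= lo && w <= hi) }'
}

# Usage on a GPU host (not executed here):
#   nvidia-smi -i 0 --query-supported-clocks=gr --format=csv,noheader,nounits | nearest_clock 1400
#   limits=$(nvidia-smi -i 0 --query-gpu=power.min_limit,power.max_limit --format=csv,noheader,nounits)
#   power_in_range 650 "${limits%%,*}" "${limits##*, }" && sudo nvidia-smi -i 0 -pl 650
```

Keeping the parsing separate from the nvidia-smi calls lets the checks be exercised with canned input on a machine that has no GPU.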

2. Lock GPU (core) clocks for deterministic benchmarks

# Single target frequency (closest supported is applied):
sudo nvidia-smi -i 0 -lgc 1410

# Or an explicit min,max pair:
sudo nvidia-smi -i 0 -lgc 1410,1410

-lgc takes a singular value (e.g. <GpuClockValue>) or a MIN_GPU_CLOCK,MAX_GPU_CLOCK pair (e.g. 1500,1500); requires Volta or newer (NVIDIA nvidia-smi). An optional --mode selects intent: --mode=0 (default) for highest clock accuracy, --mode=1 for improved performance-per-watt (NVIDIA nvidia-smi):

sudo nvidia-smi -i 0 -lgc 1410 --mode=1    # bias the lock toward perf/watt

Reset to default boost behavior when done:

sudo nvidia-smi -i 0 -rgc                   # reset GPU clocks
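For benchmark runs it helps to pair the lock and the reset so that a failed benchmark cannot leave the GPU pinned. This is a sketch under assumptions: with_locked_clocks and the SMI override variable are hypothetical names, and ./bench in the usage comment is a placeholder for your own binary.

```shell
# SMI is a hypothetical override hook so the wrapper can be dry-run with a stub.
: "${SMI:=sudo nvidia-smi}"

# with_locked_clocks GPU_INDEX MHZ COMMAND [ARGS...]
# Locks the core clock, runs the command, then resets clocks with -rgc
# whether or not the command succeeded.
with_locked_clocks() {
  gpu=$1; mhz=$2; shift 2
  $SMI -i "$gpu" -lgc "$mhz,$mhz" || return 1    # refuse to benchmark unpinned
  rc=0
  "$@" || rc=$?                                   # keep the benchmark's exit status
  $SMI -i "$gpu" -rgc || echo "warning: clock reset failed on GPU $gpu" >&2
  return "$rc"
}

# Usage on a GPU host (not executed here):
#   with_locked_clocks 0 1410 ./bench --iters 100
```

The wrapper returns the benchmark's own exit status, so it can sit in front of an existing regression gate without changing how failures are reported.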

3. Lock memory clocks (Volta–Ampere era; unsupported on Hopper)

sudo nvidia-smi -i 0 -lmc 9501              # illustrative value; use one the device lists as supported. Singular, or e.g. -lmc 5100,5100
sudo nvidia-smi -i 0 -rmc                   # reset memory clocks

-lmc accepts a singular value or MIN_MEMORY_CLOCK,MAX_MEMORY_CLOCK pair and does not work on NVIDIA Hopper GPUs; guard for it rather than assuming it applies on modern data-center parts (NVIDIA nvidia-smi).
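One way to guard the call is to treat a rejected -lmc as a warning rather than a fatal error. The sketch below assumes nvidia-smi exits non-zero when the device refuses the setting, which you should confirm on your driver; lock_mem_if_supported and the SMI override variable are hypothetical names.

```shell
# SMI is a hypothetical override hook so the guard can be dry-run with a stub.
: "${SMI:=sudo nvidia-smi}"

# lock_mem_if_supported GPU_INDEX MHZ
# Prints "locked" on success, "skipped" if the device rejects the request.
lock_mem_if_supported() {
  if $SMI -i "$1" -lmc "$2" >/dev/null 2>&1; then
    echo "locked"
  else
    echo "skipped"
    echo "GPU $1: -lmc $2 rejected; continuing with default memory clocks" >&2
  fi
}

# Usage on a GPU host (not executed here; 1215 is an illustrative value):
#   lock_mem_if_supported 0 1215
```

A benchmark script can log the returned word next to its results, so runs with and without a memory-clock lock are not compared as if they were equivalent.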

4. Cap board power for efficiency or envelope

sudo nvidia-smi -i 0 -pl 700               # set TGP cap in watts (must be within Min..Max)

-pl accepts integer or floating-point watts and requires Kepler or newer (NVIDIA nvidia-smi). To recover the perf/watt win the book describes, step the cap down and measure throughput at each point; stop at the knee, the first step where the fractional throughput loss exceeds the fractional power saving (Fregly, Ch. 2). Restore by setting the cap back to the reported default (or max) power limit.
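The sweep can be reduced to a small calculation once the measurements exist. The sketch below is one possible reading of "the knee": it returns the last cap before a step whose fractional throughput loss exceeds its fractional power saving. The function name find_knee is hypothetical, and the sample figures in the usage comment are invented for illustration, not measured.

```shell
# find_knee: reads "WATTS THROUGHPUT" pairs on stdin, sorted by descending watts.
# Prints the last cap before a step where throughput falls faster than power.
find_knee() {
  awk '
    NR > 1 {
      dp = (pw - $1) / pw          # fractional power saved by this step
      dt = (pt - $2) / pt          # fractional throughput lost by this step
      if (dt > dp) { print pw; found = 1; exit }
    }
    { pw = $1; pt = $2 }           # remember the previous point
    END { if (!found) print pw }   # no knee seen: lowest cap measured is still worthwhile
  '
}

# Usage with made-up sweep data (watts, samples/s):
#   printf '700 1000\n600 980\n500 940\n400 800\n300 560\n' | find_knee
```

With the made-up data above, the 400 W to 300 W step saves 25% power but loses 30% throughput, so the function reports 400.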

5. Detect throttling: clocks and violation counters

Watch the live operating point with dmon (power, temp, SM/mem clocks, util by default):

nvidia-smi dmon -s pucv                     # p: power+temp, u: util, c: SM/mem clocks, v: power/thermal violations; one row/sec

Query why clocks dropped. The driver exposes a per-reason flag for each cause: sustained Active on a thermal reason means you are heat-limited, and sustained Active on sw_power_cap means you hit your -pl cap:

nvidia-smi --query-gpu=clocks_throttle_reasons.sw_power_cap,\
clocks_throttle_reasons.hw_slowdown,\
clocks_throttle_reasons.sw_thermal_slowdown,\
clocks_throttle_reasons.hw_thermal_slowdown \
  --format=csv

Interpretation (NVIDIA nvidia-smi):

sw_power_cap = Active: the SW power-scaling algorithm reduced clocks because the board exceeded its power cap (often the intended effect of your -pl).
sw_thermal_slowdown / hw_thermal_slowdown = Active: clocks reduced because GPU temperature crossed the max-operating / hardware threshold. On a liquid-cooled NVL72 holding 50–70 °C you should not see this; if you do, suspect coolant flow or a CDU fault (Fregly, Ch. 2).
hw_slowdown = Active: hardware slowdown engaged (power-brake or thermal). A brief transient under a load spike can be benign on Volta+ parts; sustained Active is a real fault (NVIDIA nvidia-smi).
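The interpretation above can be folded into a small classifier for logs or alerts. This sketch assumes one CSV row per GPU with the four fields in the order queried above (sw_power_cap, hw_slowdown, sw_thermal_slowdown, hw_thermal_slowdown) and the values "Active" / "Not Active"; the name classify_throttle and its output labels are hypothetical.

```shell
# classify_throttle: reads rows like "Active, Not Active, Not Active, Not Active"
# and prints one label per row. Thermal reasons are checked first because
# hw_slowdown can also be raised by a thermal event.
classify_throttle() {
  awk -F', *' '{
    if ($3 == "Active" || $4 == "Active") print "thermal"
    else if ($2 == "Active")              print "hw_slowdown"
    else if ($1 == "Active")              print "power_cap"
    else                                  print "none"
  }'
}

# Usage on a GPU host (not executed here): append ",noheader" to the
# --format=csv option of the query above and pipe its output into classify_throttle.
```

A single sample is only a snapshot; the sustained-versus-transient distinction still requires sampling over time.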

For fleet-wide trending use DCGM, which the book names as the operator tool for "GPU utilization percentage, memory usage, temperature, and NVLink throughput" (Fregly, Ch. 2). DCGM exposes cumulative power-violation and thermal-violation time (ns) fields; a nonzero, growing thermal-violation value flags a node that is silently losing clocks to heat (NVIDIA DCGM Field Identifiers):

dcgmi dmon -e 155,156,100,150,240,241        # power, energy, SM clock, GPU temp, power-violation time, thermal-violation time
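Because the violation counter is cumulative, the useful signal is its growth between two samples. The sketch below converts two readings into the share of the interval spent throttled; it assumes the counter is in nanoseconds as stated above (check the field-identifier reference for your DCGM version), and violation_pct is a hypothetical helper name.

```shell
# violation_pct PREV_NS CUR_NS INTERVAL_SECONDS
# Prints the percentage of the sampling interval spent in violation.
violation_pct() {
  awk -v a="$1" -v b="$2" -v s="$3" \
    'BEGIN { printf "%.2f\n", (b - a) / (s * 1e9) * 100 }'
}

# Example: the counter grew by 0.5 s of violation time over a 10 s interval.
violation_pct 0 500000000 10    # prints 5.00
```

A steady nonzero percentage on one node while its neighbours read zero is the pattern worth alerting on.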

See GPU Performance and Health, Reliability, RAS and Failure Modes, and nvidia-smi Reference.

6. Maintain: the cooling context that keeps clocks up

Power/clock tuning only works if the heat has somewhere to go. On dense Blackwell racks that path is liquid, not air (Fregly, Ch. 2):

Each Grace Blackwell superchip and each NVSwitch carries a cold plate; coolant circulates through manifolds to a Coolant Distribution Unit (CDU) that exchanges heat into the facility loop.
Facilities supply chilled water at 20–30 °C; "warm-water" designs run ~30 °C in / ~45 °C out, letting evaporative towers cool without active refrigeration (higher facility efficiency).
Removing ~130 kW needs roughly 150–200 L/min of coolant flow at a 10–12 °C rise; this holds GPUs at 50–70 °C so they never thermal-throttle (Fregly, Ch. 2).
Leak, pressure, and coolant-temperature sensors can isolate or shut a section down; the practical maintenance signal is: if *_thermal_slowdown starts firing or DCGM thermal-violation time climbs on a liquid-cooled node, treat it as a cooling/CDU problem, not a GPU defect (Fregly, Ch. 2). See Datacentre Physical Readiness.
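The flow figure in the list above follows from the heat balance Q = m_dot x cp x dT. The sketch below reproduces it under a loud assumption: the coolant behaves like plain water (cp = 4.186 kJ/(kg·K), density 1 kg/L). Real loops often use a glycol mix with lower heat capacity, which needs somewhat more flow. The name coolant_lpm is hypothetical.

```shell
# coolant_lpm HEAT_KW DELTA_T_C  -> required coolant flow in L/min
# Assumes water: cp = 4.186 kJ/(kg*K), density = 1 kg/L.
coolant_lpm() {
  awk -v q="$1" -v dt="$2" 'BEGIN { printf "%.0f\n", q / (4.186 * dt) * 60 }'
}

coolant_lpm 130 10    # prints 186
coolant_lpm 130 12    # prints 155
```

Both results fall inside the 150–200 L/min range quoted for a ~130 kW rack at a 10–12 °C rise.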

For efficiency at the facility level, the book frames the same tradeoff at rack scale: 130 kW is high, but per useful FLOP an NVL72 is more efficient than spreading the GPUs across less-densely-cooled racks (Fregly, Ch. 2). This is the macro version of the per-GPU perf-per-watt knob you set with -pl. See Performance Optimization and Tuning.

References

Chris Fregly, AI Systems Performance Engineering (O'Reilly), Chapter 2, "AI System Hardware Overview" — sections "Compute Density and Power Requirements", "Liquid Cooling Versus Air Cooling", and "Performance Monitoring and Utilization in Practice". Source of the ~130 kW rack power, ~1,200 W per-GPU dissipation, 50–70 °C under-load range, CDU / 20–30 °C chilled-water / 30→45 °C warm-water / 150–200 L/min figures, the "liquid cooling reduces thermal throttling / sustain max clocks" claim, the "dial down clocks for performance-per-watt / save kilowatts" guidance, the staggered-boost transient note, and DCGM as the monitoring tool.
NVIDIA, nvidia-smi documentation — exact syntax and constraints for -lgc/--lock-gpu-clocks (singular or MIN,MAX MHz pair; --mode=0|1; Volta+, root), -rgc/--reset-gpu-clocks, -lmc/--lock-memory-clocks (pair or singular; does not work on Hopper), -rmc/--reset-memory-clocks, -pl/--power-limit (watts, Kepler+, within Min..Max), -pm/--persistence-mode, and the clocks_throttle_reasons.{sw_power_cap,hw_slowdown,sw_thermal_slowdown,hw_thermal_slowdown} query fields and their SW-power-scaling / max-operating-temp semantics. https://docs.nvidia.com/deploy/nvidia-smi/index.html
NVIDIA, NVML Clocks Throttle Reasons reference — definitions of nvmlClocksThrottleReasonSwPowerCap, ...SwThermalSlowdown, ...HwThermalSlowdown, ...HwSlowdown (power-brake vs thermal). https://docs.nvidia.com/deploy/nvml-api/group__nvmlClocksThrottleReasons.html
NVIDIA DCGM, Field Identifiers — power usage, energy, SM/memory clock, GPU temperature, and the cumulative power/thermal violation-time fields used with dcgmi dmon -e. https://docs.nvidia.com/datacenter/dcgm/latest/dcgm-api/dcgm-api-field-ids.html