nvidia-smi reference¶
Scope: nvidia-smi (the NVIDIA System Management Interface) as a working instrument: inspection (-q, --query-gpu ... --format=csv), live monitoring (dmon, pmon), and control (persistence, power limit, locked clocks, compute mode, ECC, reset, MIG). The flags an operator actually types, what each returns, and where the sharp edges are. For the daemon side of persistence see Persistence Mode; for the fabric readiness it reports, Fabric Manager.
Every command below is a reference template, not hardware-tested. Field availability and exact output vary by GPU architecture and driver branch; run nvidia-smi --help-query-gpu on the target node to confirm supported fields before scripting against them. Verify destructive control flags (-r, -e, -mig) against your driver version first.
What it is¶
nvidia-smi is the command-line front end to NVML (NVIDIA Management Library), shipped inside every NVIDIA datacenter and professional driver. One binary does three jobs:
- Inspect. Board inventory, driver/VBIOS versions, memory, clocks, ECC counters, throttle reasons, MIG layout. Human table (nvidia-smi), full detail (-q), or machine-parseable (--query-gpu ... --format=csv).
- Monitor. Rolling device metrics (dmon) and per-process accounting (pmon), each up to 16 devices.1
- Control. Persistence, power cap, locked clocks, compute mode, ECC enable, GPU reset, and MIG partitioning. Every control flag requires root.1
It is the lowest-common-denominator tool: present on every node, no agent, no dependencies. For fleet-scale telemetry it is the wrong layer; scrape NVML through DCGM / dcgm-exporter instead (see GPU Diagnostics and Validation). nvidia-smi is for the node in front of you.
The bare command (nvidia-smi with no arguments) prints driver version, CUDA driver-API version, and a per-GPU table (index, name, persistence, power/temp, memory, utilization, compute mode) plus the running-process list.
Why it's needed (and when)¶
It is the first command on any GPU box and the first command in almost every runbook. Reach for it when:
- Triage. "Is the GPU there, healthy, and at the right clocks?" nvidia-smi answers driver-loaded, ECC state, throttle reasons, and XID-adjacent symptoms in one screen. If the table itself fails to render, the problem is below CUDA, in the driver/kernel-module; see NVIDIA Kernel Modules and Kernel Upgrade, GPU Missing.
- Node bring-up. Set persistence, lock clocks for a benchmark, confirm Fabric.State is ready on NVSwitch parts before trusting NVLink (Fabric Manager, NVSwitch and NVLink).
- Provisioning. Enable/disable MIG, set compute mode EXCLUSIVE_PROCESS for single-tenant jobs, toggle ECC.
- Scripting. --query-gpu ... --format=csv,noheader,nounits is the portable way to pull a metric without parsing the human table.
When not to use it: continuous fleet monitoring (use DCGM), or anything needing per-kernel detail (use Nsight / dcgmi). nvidia-smi polling at high frequency adds driver overhead and still only samples NVML counters.
How it's installed & managed¶
nvidia-smi is installed by the driver package; there is no separate package to add. If the driver is present, the binary is on PATH (/usr/bin/nvidia-smi). Its NVML version tracks the installed driver, so the tool and the driver branch are never out of step on a single node (driver branches: Driver Versions and Branches).
Confirm presence and version:
nvidia-smi --version # nvidia-smi + NVML + driver versions
nvidia-smi -L # list GPUs with UUIDs, one line each
-L, --list-gpus lists each NVIDIA GPU in the system along with its UUID.1 Use the UUID (not the index, since indices renumber across reboots and MIG changes) when pinning a control action to a specific board.
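To pin automation to UUIDs rather than indices, the -L listing can be turned into an index-to-UUID map. The sketch below is a minimal example, assuming the usual line shape "GPU 0: <name> (UUID: GPU-...)"; the function name gpu_uuid_map is hypothetical, and indented MIG device lines are deliberately skipped.

```shell
# Print "index uuid" pairs from `nvidia-smi -L` output read on stdin.
# Assumption: each physical GPU line looks like
#   GPU 0: <product name> (UUID: GPU-xxxxxxxx-...)
# Lines that do not start with "GPU " (for example indented MIG devices)
# are ignored.
gpu_uuid_map() {
    sed -n 's/^GPU \([0-9][0-9]*\):.*(UUID: \(GPU-[^)]*\)).*$/\1 \2/p'
}

# Usage on a GPU node:
#   nvidia-smi -L | gpu_uuid_map
```

The function only parses text, so it can be exercised against a saved listing on a machine with no GPU.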
There is nothing to "manage" about the binary itself. What you manage through it is node state, and that state is not durable unless you make it so:
- Control flags set runtime driver state that is lost on driver unload / reboot. Persist the ones you need via a oneshot systemd unit or, for persistence specifically, the nvidia-persistenced daemon (Persistence Mode).
- The one exception is ECC, which -e writes as a persistent setting that survives reboot (see below).
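A oneshot unit needs something to run. The sketch below is one possible boot-time helper, not a tested recipe: the function name apply_gpu_state, the DRY_RUN switch, and the example values are all hypothetical, and only the -pl and -c flags described on this page are used. The systemd unit that would call it is not shown.

```shell
# Hypothetical helper: re-apply non-durable GPU state after boot.
#   $1 = board power limit in watts (passed to -pl)
#   $2 = compute mode integer (passed to -c)
# Set DRY_RUN=1 to print the commands instead of executing them.
# Real runs require root and act on all GPUs (no -i is given).
apply_gpu_state() {
    power_limit_w=$1
    compute_mode=$2
    run=""
    [ "${DRY_RUN:-0}" = "1" ] && run="echo"
    $run nvidia-smi -pl "$power_limit_w" || return 1
    $run nvidia-smi -c "$compute_mode" || return 1
}

# Usage (dry run first, then for real as root):
#   DRY_RUN=1; apply_gpu_state 350 3
```

Running it with DRY_RUN=1 first shows exactly which commands would be issued before anything touches the driver.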
Targeting and selection¶
Every command takes -i, --id=ID to target one device. ID may be the GPU index, board serial, UUID, or PCI bus ID.1 Prefer UUID for anything written into automation:
nvidia-smi -i GPU-1d2c3b4a-... -q # full detail for one GPU by UUID
nvidia-smi -i 0,1 --query-gpu=name --format=csv
Validated usage & tests¶
The blocks below are copy-correct reference invocations. Output shapes are described; specific numbers are intentionally omitted because they are node-specific.
Inspection¶
Full detail, optionally narrowed to one report section with -d, --display=TYPE:
nvidia-smi -q # everything NVML knows, per GPU
nvidia-smi -q -d ECC # ECC mode + volatile/aggregate error counts
nvidia-smi -q -d ROW_REMAPPER # row-remap state (Ampere+); pending/failure flags
nvidia-smi -q -d CLOCK,POWER # clocks and power sections only
-d/--display accepts a comma-separated list from: MEMORY, UTILIZATION, ECC, TEMPERATURE, POWER, CLOCK, COMPUTE, PIDS, PERFORMANCE, SUPPORTED_CLOCKS, PAGE_RETIREMENT, ACCOUNTING, ENCODER_STATS, ROW_REMAPPER, GSP_FIRMWARE_VERSION, among others.1 Expect a long key/value block per GPU; absent features read N/A.
Scriptable query¶
--query-gpu takes a comma-separated field list and requires --format=csv (mandatory).1 Add noheader,nounits for clean parsing. Discover valid fields with nvidia-smi --help-query-gpu.2
# Fleet-style one-liner: identity, power, thermals, memory, utilization
nvidia-smi --query-gpu=index,name,uuid,driver_version,pstate,\
power.draw,power.limit,temperature.gpu,utilization.gpu,utilization.memory,\
memory.total,memory.used,memory.free \
--format=csv,noheader,nounits
# Why is it slow? Throttle reasons + current vs max clocks
nvidia-smi --query-gpu=clocks.current.sm,clocks.max.sm,\
clocks_event_reasons.hw_thermal_slowdown,clocks_event_reasons.sw_power_cap,\
clocks_event_reasons.hw_power_brake_slowdown \
--format=csv
# Health: ECC mode + uncorrectable aggregate, remapped rows, link width
nvidia-smi --query-gpu=ecc.mode.current,\
ecc.errors.uncorrected.aggregate.total,\
remapped_rows.uncorrectable,remapped_rows.pending,\
pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current \
--format=csv
# NVSwitch readiness as seen by the GPU (Hopper/Blackwell HGX)
nvidia-smi --query-gpu=fabric.state,fabric.status --format=csv
All of clocks.current.sm, clocks.max.sm, the clocks_event_reasons.* group, ecc.mode.current, ecc.errors.uncorrected.aggregate.total, remapped_rows.*, pcie.link.gen.current/max, pcie.link.width.current, and fabric.state/status are real fields in the driver's --help-query-gpu output.2 clocks_event_reasons.* is the programmatic form of the human "Clocks Throttle Reasons", the fastest way to attribute a clock drop to thermal, power-cap, or HW power-brake without eyeballing the table.
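The throttle-reason fields can be reduced to a short list of what is currently active. The sketch below is illustrative only: it assumes the query is run with index first and the three clocks_event_reasons.* fields in the order shown, with --format=csv,noheader, and that each reason reads either "Active" or "Not Active". The function name active_clock_events is hypothetical.

```shell
# Read CSV rows of the form
#   <index>, <hw_thermal_slowdown>, <sw_power_cap>, <hw_power_brake_slowdown>
# on stdin and print one line per reason that is currently "Active".
# Assumption: values are the literal strings "Active" / "Not Active".
active_clock_events() {
    awk -F', *' '
        BEGIN {
            name[2] = "hw_thermal_slowdown"
            name[3] = "sw_power_cap"
            name[4] = "hw_power_brake_slowdown"
        }
        {
            for (i = 2; i <= 4; i++)
                if ($i == "Active") print "GPU " $1 ": " name[i]
        }'
}

# Usage on a GPU node:
#   nvidia-smi --query-gpu=index,clocks_event_reasons.hw_thermal_slowdown,clocks_event_reasons.sw_power_cap,clocks_event_reasons.hw_power_brake_slowdown \
#     --format=csv,noheader | active_clock_events
```

No output means none of the three reasons was active at the moment of the sample.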
Repeat sampling with -l SEC (probe every SEC until Ctrl+C) or -lms ms (milliseconds); log to a file with -f FILE:1
nvidia-smi --query-gpu=timestamp,index,power.draw,temperature.gpu,utilization.gpu \
--format=csv -l 1 -f /var/log/gpu-watch.csv
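A log written this way can be summarised after the fact. The sketch below reports peak power draw per GPU, assuming the column order of the command above (timestamp, index, power.draw, ...) and that power values carry a unit suffix such as "123.45 W"; header rows are skipped by requiring a numeric index. The function name peak_power_per_gpu is hypothetical.

```shell
# Read a --format=csv log (timestamp, index, power.draw, ...) on stdin
# and print "<index> <peak watts>" per GPU, sorted by index.
# Rows whose second field is not an integer (the CSV header) are skipped.
# awk's numeric conversion turns "123.45 W" into 123.45; a non-numeric
# value such as "[N/A]" counts as 0.
peak_power_per_gpu() {
    awk -F', *' '
        $2 ~ /^[0-9]+$/ {
            w = $3 + 0
            if (!($2 in max) || w > max[$2]) max[$2] = w
        }
        END { for (g in max) printf "%s %.2f\n", g, max[g] }' | sort -n
}

# Usage:
#   peak_power_per_gpu < /var/log/gpu-watch.csv
```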
Monitoring: dmon (device) and pmon (process)¶
dmon rolls one line per sample per device (up to 16). Select metric groups with -s, set interval with -d, bound the run with -c:1
nvidia-smi dmon -s pucm -d 1 # power/temp, util, clocks, memory @1s
nvidia-smi dmon -s puct -c 60 -i 0 # GPU0, add PCIe throughput, 60 samples then exit
-s letters: p = power usage (W) and GPU/memory temperature (C); u = utilization (SM, memory, encoder, decoder, JPEG, OFA, %); m = frame buffer / Bar1 / protected memory (MB); c = processor and memory clocks (MHz); e = ECC aggregated single/double-bit errors and PCIe replay errors; v = power violations (%) and thermal violations (boolean flag); t = PCIe Rx/Tx throughput (MB/s).1 Columns include pwr gtemp mtemp sm mem enc dec mclk pclk depending on the groups selected; sm is the percentage of time at least one SM was busy (a rough occupancy proxy, not FLOP utilization).2
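Because the dmon column set depends on the -s groups and the driver version, a parser is safer if it locates columns from the header row instead of hard-coding positions. The sketch below averages the sm column per GPU on that basis; it assumes header rows begin with "#" and that the first header row starts with "gpu". The function name dmon_mean_sm is hypothetical.

```shell
# Read `nvidia-smi dmon` output on stdin and print "<gpu> <mean sm %>".
# The sm column is found by name in the header row, so the result does
# not depend on which -s groups were selected (as long as u is included).
# Samples where sm is not an integer (for example "-") are ignored.
dmon_mean_sm() {
    awk '
        /^#/ {
            sub(/^#[ \t]*/, "")                 # drop the leading "#"
            if ($1 == "gpu")
                for (i = 1; i <= NF; i++) if ($i == "sm") col = i
            next                                # skip header and units rows
        }
        col && $col ~ /^[0-9]+$/ { sum[$1] += $col; n[$1]++ }
        END { for (g in n) printf "%s %.1f\n", g, sum[g] / n[g] }' | sort -n
}

# Usage on a GPU node (60 one-second samples):
#   nvidia-smi dmon -s pu -d 1 -c 60 | dmon_mean_sm
```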
pmon attributes utilization to PIDs. Columns: gpu pid type sm mem enc dec command, where type is C (CUDA/compute) or G (graphics).2 Use it to find which process owns the SMs before killing or reprioritising.
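To answer "which process owns the SMs" from a script, the pmon rows can be filtered by the sm column. The sketch below relies on the column order given above (gpu, pid, type, sm first; command last) and treats any extra middle columns as irrelevant; the function name busy_pids and the default threshold of 50 are hypothetical choices.

```shell
# Read `nvidia-smi pmon` output on stdin and print
#   <gpu> <pid> <type> <sm%> <command>
# for every process whose sm utilization is at least $1 (default 50).
# Header rows (starting with "#") and idle rows (sm shown as "-") are skipped.
busy_pids() {
    min_sm=${1:-50}
    awk -v min="$min_sm" '
        /^#/ { next }
        $4 ~ /^[0-9]+$/ && $4 + 0 >= min + 0 { print $1, $2, $3, $4, $NF }'
}

# Usage on a GPU node (single sample):
#   nvidia-smi pmon -c 1 | busy_pids 50
```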
Control¶
All control flags require root and act on all GPUs unless narrowed with -i.
# Persistence (LEGACY — prefer the daemon; see note below)
nvidia-smi -pm 1 # enable persistence mode; -pm 0 disables
# Power cap (Kepler+); watts, integer or float
nvidia-smi -pl 350 # set board power limit to 350 W
nvidia-smi -i 0 -pl 700 # cap GPU 0 only
# Locked clocks for reproducible benchmarks (Volta+)
nvidia-smi -lgc 1410,1410 # lock graphics/SM clock to a fixed MIN,MAX
nvidia-smi -lmc 1593,1593 # lock memory clock MIN,MAX (see Hopper caveat)
nvidia-smi -rgc # reset locked GPU clocks to default
# Compute mode
nvidia-smi -c 3 # EXCLUSIVE_PROCESS: one context per device
nvidia-smi -c 0 # DEFAULT: multiple contexts allowed
# ECC (persistent; needs reboot to take effect)
nvidia-smi -e 1 # enable ECC; -e 0 disables
# Reset (no clients may be using the device)
nvidia-smi -r # same as --gpu-reset
nvidia-smi -i 0 -r
What each returns / requires:
- -pl POWER: sets the maximum board power limit in watts; supported from Kepler, requires root.1 Verify the accepted range first via power.min_limit / power.max_limit in --query-gpu; out-of-range values are rejected.
- -lgc MIN,MAX: locks the graphics/SM clock; Volta+, root.1 Pin MIN=MAX for a stable benchmark clock; readback via clocks.current.sm.
- -lmc MIN,MAX: locks the memory clock; does not work on GPUs based on the NVIDIA Hopper architecture per the manual.1 On Hopper, control the SM clock with -lgc and leave memory clock alone.
- -rgc: resets locked GPU clocks (Volta+, root).1 Always unlock after a benchmark so production reclaims boost.
- -c MODE: compute mode integers: 0=DEFAULT, 2=PROHIBITED, 3=EXCLUSIVE_PROCESS; 1 (EXCLUSIVE_THREAD) is deprecated.5 Use 3 to stop a second process landing on a single-tenant training GPU.
- -e CONFIG: 1=enabled, 0=disabled; the change is persistent and takes effect after the next reboot.1 Confirm with ecc.mode.current vs ecc.mode.pending, which differ until the reboot lands. Recovery if it sticks: ECC Toggle Recovery.
- -r / --gpu-reset: resets the GPU; there can be no applications using the device (stop jobs, Fabric Manager, MPS, persistence first), requires root.1
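The range check recommended for -pl can be scripted before the limit is applied. The sketch below is a minimal example, assuming the query is run as --query-gpu=index,power.min_limit,power.max_limit with --format=csv,noheader,nounits; the function name power_limit_in_range is hypothetical.

```shell
# Read "<index>, <min_limit>, <max_limit>" rows on stdin and check that
# the requested wattage ($1) lies inside every GPU's accepted range.
# Prints one line per offending GPU and returns non-zero if any is found.
# A non-numeric limit such as "[N/A]" converts to 0 and is reported as
# out of range.
power_limit_in_range() {
    want=$1
    awk -F', *' -v want="$want" '
        want + 0 < $2 + 0 || want + 0 > $3 + 0 {
            printf "GPU %s: %s W outside [%s, %s]\n", $1, want, $2, $3
            bad = 1
        }
        END { exit bad + 0 }'
}

# Usage on a GPU node:
#   nvidia-smi --query-gpu=index,power.min_limit,power.max_limit \
#     --format=csv,noheader,nounits | power_limit_in_range 350 \
#     && sudo nvidia-smi -pl 350
```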
MIG control¶
MIG partitioning is driven through nvidia-smi mig (and the -mig mode flag). Supported only on the MIG-capable tier (MIG). Enable mode, then create instances:3
sudo nvidia-smi -i 0 -mig 1 # enable MIG on GPU 0 (omit -i for all GPUs)
sudo nvidia-smi mig -lgip # list GPU-instance profiles (IDs + names)
sudo nvidia-smi mig -cgi 9,9 -C # create two GPU instances by profile ID; -C adds compute instances
sudo nvidia-smi mig -lgi # list created GPU instances
sudo nvidia-smi mig -lci # list compute instances
sudo nvidia-smi mig -dci && sudo nvidia-smi mig -dgi # tear down: CIs first, then GIs
sudo nvidia-smi -i 0 -mig 0 # disable MIG on GPU 0
On Ampere, enabling MIG may require a GPU reset to take effect (nvidia-smi --gpu-reset); on Hopper and later it takes effect immediately.3 Profile names look like 1g.5gb, 2g.10gb, 3g.20gb, 7g.40gb (slices.memory); the -cgi argument takes the numeric profile ID from -lgip, not the name.3 Stale MIG state confusing the scheduler is a known failure; see Stale MIG State.
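Since -cgi wants the numeric ID, a script usually has to translate a profile name first. The sketch below looks the ID up in the -lgip table; it assumes the row layout shown in the MIG User Guide examples ("| <gpu> MIG <name> <ID> ..."), which may differ across driver versions, and the function name mig_profile_id is hypothetical.

```shell
# Read `nvidia-smi mig -lgip` output on stdin and print the numeric
# profile ID for the profile name given as $1 (for example 3g.20gb).
# Assumption: profile rows look like
#   |   0  MIG 3g.20gb        9     2/2   ...   |
# Only the first matching row is used; returns non-zero if none matches.
mig_profile_id() {
    profile=$1
    awk -v p="$profile" '
        $3 == "MIG" && $4 == p { print $5; found = 1; exit }
        END { exit !found }'
}

# Usage on a MIG-enabled GPU:
#   id=$(sudo nvidia-smi mig -lgip | mig_profile_id 3g.20gb) \
#     && sudo nvidia-smi mig -cgi "$id,$id" -C
```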
Legacy note: -pm vs nvidia-persistenced¶
nvidia-smi -pm 1 is the legacy way to hold the driver resident. NVIDIA documents persistence-mode-via-nvidia-smi under "Persistence Mode (Legacy)" and ships nvidia-persistenced as the preferred, more reliable replacement: a user-space daemon that holds the NVIDIA character device files open instead of setting a kernel flag. The legacy mode is described as "near end-of-life and will be eventually deprecated in favor of the Persistence Daemon."4 Neither survives a reboot on its own; run the daemon from system init for durability. Prefer the daemon on every server; full setup is in Persistence Mode.
Expected behaviour to sanity-check (qualitative only; timings are node-specific): with persistence off, the first CUDA call after an idle period pays a driver re-init cost and clocks ramp from idle; with it on, the device stays initialized and persistence_mode reads Enabled in nvidia-smi. Confirm via nvidia-smi --query-gpu=persistence_mode --format=csv.
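The persistence_mode readback can be turned into a per-node check. The sketch below lists GPUs that do not report Enabled, assuming the query is run as --query-gpu=index,persistence_mode with --format=csv,noheader; the function name persistence_off_gpus is hypothetical.

```shell
# Read "<index>, <persistence_mode>" rows on stdin and print the index
# of every GPU whose persistence mode is anything other than "Enabled".
persistence_off_gpus() {
    awk -F', *' '$2 != "Enabled" { print $1 }'
}

# Usage on a GPU node:
#   nvidia-smi --query-gpu=index,persistence_mode --format=csv,noheader \
#     | persistence_off_gpus
```

Empty output means every GPU on the node reported Enabled at the time of the query.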
Failure modes¶
Brief; each links its runbook.
- nvidia-smi itself errors / "No devices were found" / "couldn't communicate with the NVIDIA driver." The driver or kernel module is not loaded (often a kernel upgrade with no DKMS rebuild, or a GSP firmware/driver mismatch). This sits below the CUDA layer; see Kernel Upgrade, GPU Missing and GSP Firmware / Driver Mismatch.
- Fabric.State not ready / GPUs isolated on NVLink. Fabric Manager down or version-mismatched on an NVSwitch system; check it before blaming NVLink or NCCL: Fabric Manager Failure.
- ECC change won't apply / current ≠ pending. Reboot pending (by design), or it sticks: ECC Toggle Recovery.
- -r fails with "in use." A client still holds the device; stop jobs, MPS, Fabric Manager, and persistence, then retry: Persistence Mode / Clock Bounce.
- MIG layout wrong or scheduler sees stale slices. Tear down CIs/GIs and re-create, or reset: Stale MIG State.
- Persistence keeps resetting to off. Legacy -pm doesn't survive reboot; install the daemon: Persistence Mode / Clock Bounce, Persistence Mode.
References¶
- NVIDIA System Management Interface (nvidia-smi) manual (flag syntax, dmon/pmon, control options): https://docs.nvidia.com/deploy/nvidia-smi/index.html
- nvidia-smi(1) man page (general options, -s dmon metric groups, compute-mode/ECC values): https://man.archlinux.org/man/nvidia-smi.1.en
- Useful nvidia-smi Queries / --help-query-gpu field reference: https://nvidia.custhelp.com/app/answers/detail/a_id/3751/~/useful-nvidia-smi-queries
- NVML API Reference, Device Enums (nvmlComputeMode_t integer values): https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceEnumvs.html
- Driver Persistence (Persistence Mode "Legacy" vs Persistence Daemon): https://docs.nvidia.com/deploy/driver-persistence/index.html
- nvidia-persistenced utility (daemon holds device files open; replaces legacy mode): https://download.nvidia.com/XFree86/Linux-x86_64/396.51/README/nvidia-persistenced.html
- NVIDIA Multi-Instance GPU (MIG) User Guide (-mig, mig -lgip/-cgi/-C/-lgi/-lci/-dci/-dgi, reset behaviour): https://docs.nvidia.com/datacenter/tesla/mig-user-guide/latest/getting-started-with-mig.html
Related: Persistence Mode · Fabric Manager · MIG · ECC · Glossary
1. NVIDIA System Management Interface manual: -pm/--persistence-mode (Linux only, requires root, does not persist across reboots); -pl/--power-limit watts (Kepler+, root); -lgc/--lock-gpu-clocks=MIN,MAX and -rgc/--reset-gpu-clocks (Volta+, root); -lmc/--lock-memory-clocks "does not work on GPUs based on NVIDIA Hopper architectures"; -c/--compute-mode; -e/--ecc-config "takes effect after the next reboot and is persistent"; -r/--gpu-reset (no apps using device, root); -q/--query, -d/--display=TYPE, --query-gpu, --format=csv (MANDATORY); dmon/pmon up to 16 devices; general options -i/--id, -l/--loop=SEC, -lms/--loop-ms, -L/--list-gpus, -f/--filename; dmon -s groups p/u/m/c/e/v/t, -d delay, -c count. https://docs.nvidia.com/deploy/nvidia-smi/index.html
2. nvidia-smi --help-query-gpu field list (real driver output): confirms index, name, uuid, driver_version, pstate, power.draw, power.limit, enforced.power.limit, power.min_limit, power.max_limit, temperature.gpu, temperature.memory, utilization.gpu, utilization.memory, memory.total/used/free, clocks.current.sm, clocks.max.sm, clocks_event_reasons.*, ecc.mode.current/pending, ecc.errors.uncorrected.aggregate.total, remapped_rows.*, pcie.link.gen.current/max, pcie.link.width.current, fabric.state, fabric.status, persistence_mode, compute_mode, mig.mode.current/pending; dmon/pmon column semantics (sm = % time >=1 SM active; pmon type C=compute / G=graphics). https://nvidia.custhelp.com/app/answers/detail/a_id/3751/~/useful-nvidia-smi-queries
3. NVIDIA MIG User Guide, Getting Started: enable with nvidia-smi -i <ids> -mig 1 (omit -i for all GPUs); Ampere may need nvidia-smi --gpu-reset for mode change, Hopper+ takes effect immediately; mig -lgip lists GPU-instance profiles, mig -cgi <profileID> -C creates GI + compute instances, mig -lgi / -lci list, mig -dci / -dgi destroy; profile names 1g.5gb / 2g.10gb / 3g.20gb / 7g.40gb. https://docs.nvidia.com/datacenter/tesla/mig-user-guide/latest/getting-started-with-mig.html
4. NVIDIA Driver Persistence doc structures persistence as "Persistence Mode (Legacy)" vs "Persistence Daemon"; the "Persistence Mode (Legacy)" page states this solution is "near end-of-life and will be eventually deprecated in favor of the Persistence Daemon." The nvidia-persistenced README separately states the daemon holds the NVIDIA character device files open. https://docs.nvidia.com/deploy/driver-persistence/persistence-mode-legacy.html and https://download.nvidia.com/XFree86/Linux-x86_64/396.51/README/nvidia-persistenced.html
5. NVML API Reference, Device Enums: nvmlComputeMode_t: NVML_COMPUTEMODE_DEFAULT = 0, NVML_COMPUTEMODE_EXCLUSIVE_THREAD = 1, NVML_COMPUTEMODE_PROHIBITED = 2, NVML_COMPUTEMODE_EXCLUSIVE_PROCESS = 3. The integer compute-mode values originate from this enum. https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceEnumvs.html