
GPU firmware & GSP

Scope: the firmware a GPU runs, namely the board-resident VBIOS and the GSP (GPU System Processor) firmware the kernel driver loads at init. This page covers how to read each one, where each lives, and the partial-upgrade failure in which GSP firmware and driver branch disagree and the modules refuse to load.

Code below is a reference template, not hardware-tested. Pin versions and validate on a canary before touching a fleet.

What it is

A modern NVIDIA GPU runs two distinct pieces of firmware, and they have different owners and different update paths:

  • VBIOS: the board's BIOS, flashed to a SPI ROM on the GPU itself. It comes up with the board, holds power/clock/PCIe init tables, board ID and the InfoROM, and is independent of the host driver. nvidia-smi -q describes the VBIOS Version field plainly as "The BIOS of the GPU board." [1] Updating it means reflashing the board: vendor VBIOS tooling on standalone cards, or the system firmware bundle (nvfwupd) on DGX/HGX where the GPU tray is updated as a unit.

  • GSP firmware: a signed blob the kernel driver pushes onto the GPU System Processor, an on-GPU RISC-V microcontroller, during module init. Turing introduced GSP; from that generation NVIDIA progressively moved driver/RM logic off the CPU and onto the GSP core, and on Hopper/Blackwell-class datacenter GPUs GSP is the normal path, not optional tuning. The GSP firmware is shipped inside the driver package and is not versioned separately. The driver README states the GSP firmware "will be used by default for all Turing and later GPUs." [2] nvidia-smi -q reports it under GSP Firmware Version, described as "Firmware version of GSP. This is an alphanumeric string" [1]; in observed behaviour the field reads N/A when GSP is disabled.

The operational consequence is the whole reason this page exists: VBIOS is board-resident and driver-independent; GSP firmware is part of the driver and must match the driver branch. Get those two facts straight and most "the GPU just disappeared after a driver change" tickets become obvious.

flowchart LR
  ROM["VBIOS in board SPI ROM"] --> BOARD["GPU board init at power-on"]
  PKG["Driver package: .deb / .rpm / runfile"] --> KMOD["nvidia.ko loads at init"]
  PKG --> FWDIR["/lib/firmware/nvidia/DRIVER_VER/gsp_*.bin"]
  KMOD --> PUSH["Driver pushes GSP blob to on-GPU RISC-V core"]
  FWDIR --> PUSH
  PUSH --> READY["GPU usable, GSP Firmware Version populated"]

Why it's needed (and when)

You read and reason about GPU firmware in four situations:

  • Acceptance / commissioning. Record VBIOS and GSP versions per board as part of the inventory baseline, so drift and silent RMA-swaps are detectable later. A board that returns from RMA on a different VBIOS is a real source of "one node behaves differently."
  • Driver upgrades. GSP firmware moves with the driver. The hazard is a partial upgrade: new kernel modules are installed but the matching gsp_*.bin is not in place, or a stale firmware directory is left behind. The result is a module-load failure, not a graceful fallback. Recovery is covered in the GSP firmware mismatch runbook.
  • Security / errata. VBIOS and platform firmware carry their own advisories (EROT, BMC, NVSwitch on DGX/HGX). These are reflash events, separate from the driver cadence, and on multi-GPU NVSwitch systems they go through the system firmware bundle.
  • Field debugging. When a GPU is missing, throwing Xid/init errors, or one node disagrees with its siblings, VBIOS and GSP versions are early triage fields: cheap to read and high signal.

When not to touch it: GSP firmware is not a knob you tune. Do not chase NVreg_EnableGpuFirmware=0 (the documented switch to disable GSP [2]) as a fix on datacenter GPUs. Disabling GSP on hardware that expects it is a debugging probe, not a configuration, and it changes the driver's operating model. Treat it as out of scope for production unless NVIDIA support directs it.

How it's installed & managed

GSP firmware: it ships with the driver

There is no separate GSP install step. When you install a datacenter driver, the GSP blobs land under a version-stamped firmware directory and the driver loads the one matching the GPU architecture. Per the driver README, the files gsp_*.bin are installed in /lib/firmware/nvidia/<driver-version>/. [2] Each blob is named for the GPU architecture it serves, for example gsp_tu10x.bin (Turing) and gsp_ga10x.bin (Ampere); each file covers one or more architectures. [2][3]

Reference template, not hardware-tested. Inspect what is actually on a node:

# Driver branch the host is running
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# GSP firmware blobs shipped by that driver (directory is named by driver version)
ls -1 /lib/firmware/nvidia/
ls -1 /lib/firmware/nvidia/"$(cat /sys/module/nvidia/version)"/

The single load-bearing invariant: the directory under /lib/firmware/nvidia/ whose name matches the running driver version must exist and contain the architecture's gsp_*.bin. A package state where /sys/module/nvidia/version reports one branch but no matching firmware directory is present is exactly the partial-upgrade trap. Use DKMS so modules rebuild on kernel upgrades, and reinstall the driver as a unit (do not hand-copy firmware files) so modules and firmware stay in lockstep. The rolling driver upgrade runbook applies the same discipline to Fabric Manager.
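
Reference template, not hardware-tested. The invariant above can be sketched as a small shell check. The function name gsp_dir_ok and its arguments are hypothetical, and the check only confirms that some gsp_*.bin is present, not that the blob matches the GPU architecture:

```shell
# Sketch only: gsp_dir_ok is a hypothetical helper, not part of the driver.
# usage: gsp_dir_ok <driver-version> [firmware-root]
gsp_dir_ok() {
  local ver="$1" root="${2:-/lib/firmware/nvidia}"
  if [ -z "$ver" ]; then
    echo "FAIL: no driver version given (is the nvidia module loaded?)"
    return 2
  fi
  local dir="$root/$ver"
  if [ ! -d "$dir" ]; then
    echo "FAIL: $dir is missing"
    return 1
  fi
  # An unmatched glob stays literal, so test whether the first expansion exists.
  set -- "$dir"/gsp_*.bin
  if [ ! -e "$1" ]; then
    echo "FAIL: $dir contains no gsp_*.bin"
    return 1
  fi
  echo "OK: $dir holds $# GSP blob(s)"
}
```

A typical call would be gsp_dir_ok "$(cat /sys/module/nvidia/version 2>/dev/null)". An empty version (module not loaded) returns status 2 so it can be told apart from a missing directory.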

VBIOS: board reflash, two paths

VBIOS lives on the board and is updated by reflashing, never by the driver. Which tool depends on the platform:

  • Standalone / PCIe-attached cards: vendor-supplied VBIOS update package and flashing utility, obtained from NVIDIA or the board partner. Treat any VBIOS flash as a maintenance-window, one-board-at-a-time operation with a known-good image and the ability to recover; a bad flash bricks the board. Specifics are board- and advisory-dependent. Verify against the exact update package for your GPU.

  • DGX / HGX and multi-node NVLink systems: the nvfwupd tool drives platform firmware updates, including the GPU tray (VBIOS, NVSwitch, EROT, FPGA, retimers). It is a Linux ELF binary that talks to the system BMC over the Redfish API (out-of-band), targeting the system rather than a single card. [4][5] Valid servertype values include DGX, HGX, GB200, GB300 and related platform types. [4]

Reference template, not hardware-tested. Read installed versions before deciding to update, then apply a vendor firmware package:

# Show installed firmware component versions (compare against the target package)
nvfwupd -t ip=<bmc-ip> user=<bmc-user> password=<bmc-password> show_version

# Apply a vendor PLDM firmware package (-p), prompt-free (-y)
nvfwupd -t ip=<bmc-ip> user=<bmc-user> password=<bmc-password> \
  update_fw -p <firmware-package-file> -y

update_fw communicates over Redfish and prints progress; -b/--background returns a Task ID instead, monitored with show_update_progress. [4] Firmware packages come from the NVIDIA Enterprise Support portal, and the exact components, sequence and reboot/AC-cycle requirements are platform-specific. Always follow the firmware update guide for the precise system (DGX H100/H200, B200, B300, GB200/GB300 NVL) rather than generalising across them. [4][5]
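
Reference template, not hardware-tested. A dry-run guard around the update command is one way to keep a canary-first habit. The wrapper name fw_update, the --apply switch and the BMC_IP, BMC_USER and BMC_PASSWORD variables are all assumptions for illustration; the nvfwupd arguments are only those shown above:

```shell
# Sketch only: fw_update is a hypothetical wrapper around the nvfwupd
# invocations shown above. Credentials come from BMC_IP, BMC_USER and
# BMC_PASSWORD (assumed variable names).
# usage: fw_update <firmware-package-file> [--apply]
fw_update() {
  local pkg="$1" mode="${2:-}"
  if [ ! -f "$pkg" ]; then
    echo "FAIL: firmware package '$pkg' not found"
    return 1
  fi
  if [ -z "${BMC_IP:-}" ] || [ -z "${BMC_USER:-}" ] || [ -z "${BMC_PASSWORD:-}" ]; then
    echo "FAIL: set BMC_IP, BMC_USER and BMC_PASSWORD first"
    return 1
  fi
  if [ "$mode" != "--apply" ]; then
    # Dry run: print the command with the password masked and change nothing.
    echo "DRY-RUN: nvfwupd -t ip=$BMC_IP user=$BMC_USER password=*** update_fw -p $pkg -y"
    return 0
  fi
  # Record the installed versions first, then apply the package.
  nvfwupd -t ip="$BMC_IP" user="$BMC_USER" password="$BMC_PASSWORD" show_version &&
    nvfwupd -t ip="$BMC_IP" user="$BMC_USER" password="$BMC_PASSWORD" \
      update_fw -p "$pkg" -y
}
```

Without --apply the function prints what it would run and returns. That makes it safe to loop over a host list first and read the output before committing to a single canary system.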

Validated usage & tests

The primary instrument is nvidia-smi -q, which prints both fields in its per-GPU detail. Reference template, not hardware-tested:

# Both firmware fields, full per-GPU detail
nvidia-smi -q | grep -E "VBIOS Version|GSP Firmware Version"

Expected: one VBIOS Version line per GPU showing the board's flashed VBIOS string, and one GSP Firmware Version line. On a healthy datacenter GPU running GSP, the GSP value is a populated alphanumeric version string; where GSP is disabled or unsupported it reads N/A. Do not hard-code an expected version. Read it from the node and compare against your pinned baseline.

Machine-readable form for fleet sweeps, using the CSV query interface:

# VBIOS per GPU, machine-readable
nvidia-smi --query-gpu=name,vbios_version --format=csv,noheader

vbios_version is a valid --query-gpu field; the full list is available via nvidia-smi --help-query-gpu. [6] Confirm the GSP field name on your driver version from that help output rather than assuming it.
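
Reference template, not hardware-tested. One way to diff a sweep against the acceptance baseline, assuming both files hold the output of nvidia-smi --query-gpu=uuid,vbios_version --format=csv,noheader (uuid is another field listed by --help-query-gpu). The helper name vbios_drift and the two-file layout are assumptions:

```shell
# Sketch only: vbios_drift is a hypothetical helper.
# usage: vbios_drift <baseline.csv> <current.csv>
# Each file: one "uuid, vbios_version" line per GPU. Exit status 1 on any drift.
vbios_drift() {
  awk -F', *' -v basefile="$1" '
    BEGIN {
      # Load the baseline into base[uuid] = vbios_version.
      while ((getline line < basefile) > 0) {
        if (split(line, f, /, */) >= 2) base[f[1]] = f[2]
      }
    }
    $1 == "" { next }                       # skip blank lines
    {
      seen[$1] = 1
      if (!($1 in base)) {
        print "NEW " $1 " " $2; rc = 1
      } else if (base[$1] != $2) {
        print "CHANGED " $1 " " base[$1] " -> " $2; rc = 1
      }
    }
    END {
      for (u in base) if (!(u in seen)) { print "MISSING " u; rc = 1 }
      exit rc + 0
    }
  ' "$2"
}
```

Keying on the GPU UUID rather than the slot index means a board swapped during RMA shows up as one MISSING and one NEW entry, while a reflashed board shows up as CHANGED.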

Cross-check against the kernel's view of the GPU. This /proc entry reports the loaded GSP firmware independently of nvidia-smi and is useful when nvidia-smi itself is unhappy:

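# Per-GPU info node; -H prefixes each match with its file so GPUs stay distinguishable.
# Case-insensitive "firmware" matches the line whether it is labelled GPU Firmware or GSP Firmware.
grep -HiE "firmware|model" /proc/driver/nvidia/gpus/*/information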

The driver README documents this /proc/driver/nvidia/gpus/<PCI-BUS-ID>/information node and that it shows the GSP firmware version. [2] What to assert in a check: the GSP Firmware Version is non-N/A on hardware that should be running GSP, and the matching gsp_*.bin exists under /lib/firmware/nvidia/<driver-version>/. A fuller node-health diagnostic (PCIe/NVLink, memory, NCCL) is dcgmi diag -r 3, covered in the GPU software stack.
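
Reference template, not hardware-tested. The first assertion can be sketched as a filter over nvidia-smi -q output. The helper name gsp_populated is hypothetical, and the check only makes sense on hardware that should be running GSP:

```shell
# Sketch only: gsp_populated is a hypothetical helper.
# usage: nvidia-smi -q | gsp_populated
# Exit status: 0 all populated, 1 at least one N/A, 2 field not found.
gsp_populated() {
  awk -F': *' '
    /GSP Firmware Version/ {
      total++
      v = $2
      gsub(/[ \t\r]+$/, "", v)              # trim trailing whitespace
      if (v == "" || v == "N/A") bad++
    }
    END {
      if (total == 0) { print "FAIL: no GSP Firmware Version line in input"; exit 2 }
      if (bad > 0)    { printf "FAIL: %d of %d GPU(s) report N/A\n", bad, total; exit 1 }
      printf "OK: %d GPU(s) report a GSP firmware version\n", total
    }
  '
}
```

Reading from standard input keeps the parser separate from the tool, so the same function can run against saved nvidia-smi -q captures from nodes where the driver is no longer loading.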

Failure modes

  • GSP / driver mismatch after a partial upgrade: modules fail to load. The headline failure: the new nvidia.ko is in place but the matching gsp_*.bin is missing or stale (firmware directory not updated, an interrupted upgrade, or hand-edited firmware). The driver cannot load its GSP firmware and module init fails; nvidia-smi reports no devices, and dmesg typically shows a GSP/firmware load error referencing a path under /lib/firmware/nvidia/<version>/. Recovery is to bring driver modules and GSP firmware back to one consistent branch (reinstall the driver as a unit, rebuild DKMS, reboot). The full procedure is in the GSP firmware mismatch runbook.
  • Stale firmware directory from an old branch. Leftover /lib/firmware/nvidia/<old-version>/ directories are harmless on their own but mask intent; the only one that matters is the directory matching the currently loaded driver version, and it must be present and complete.
  • VBIOS drift across nominally identical boards. A board on a different VBIOS (often post-RMA) can behave subtly differently. Caught by recording vbios_version at acceptance and diffing the fleet, not by runtime errors. See reliability and RAS.
  • Bad VBIOS / platform-firmware flash. A failed reflash can brick a board or leave a DGX/HGX GPU tray inconsistent. Always flash from a known-good image in a maintenance window, one board at a time, following the platform's firmware update guide.
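
Reference template, not hardware-tested. To make leftover firmware directories visible, a listing along these lines can help. The helper name stale_fw_dirs is hypothetical. The digit filter is an assumption: on hosts that also carry linux-firmware, /lib/firmware/nvidia/ can hold chip-named directories for nouveau, which are not driver leftovers:

```shell
# Sketch only: stale_fw_dirs is a hypothetical helper. It lists, never deletes.
# usage: stale_fw_dirs <running-driver-version> [firmware-root]
# Prints version-named directories other than the running one.
stale_fw_dirs() {
  local running="$1" root="${2:-/lib/firmware/nvidia}" d name
  for d in "$root"/*/; do
    [ -d "$d" ] || continue                 # glob matched nothing
    d="${d%/}"
    name="${d##*/}"
    case "$name" in
      [0-9]*) ;;                            # looks like a driver version
      *) continue ;;                        # assumed not a driver directory
    esac
    [ "$name" = "$running" ] || echo "$d"
  done
}
```

The function only reports. Removal is better left to the package manager, in line with the rule of not hand-editing firmware files.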

References

  1. nvidia-smi manual — VBIOS Version ("The BIOS of the GPU board") and GSP Firmware Version ("Firmware version of GSP. This is an alphanumeric string"): https://docs.nvidia.com/deploy/nvidia-smi/index.html
  2. NVIDIA driver README, GSP Firmware chapter — /lib/firmware/nvidia/<version>/, gsp_*.bin, "used by default for all Turing and later GPUs", NVreg_EnableGpuFirmware, /proc/driver/nvidia/gpus/<PCI-BUS-ID>/information: https://download.nvidia.com/XFree86/Linux-x86_64/570.86.16/README/gsp.html
  3. GSP firmware blob naming (gsp_tu10x.bin Turing, gsp_ga10x.bin Ampere) — Debian firmware-nvidia-gsp package filelist: https://packages.debian.org/sid/amd64/firmware-nvidia-gsp/filelist
  4. About the nvfwupd Tool — DGX H100/H200 Firmware Update Guide (-t/--target with servertype, show_version, update_fw -p, Redfish, --background): https://docs.nvidia.com/dgx/dgxh100-fw-update-guide/nvfwupd-reference.html
  5. NVIDIA Firmware Update Guide (multi-node NVLink: GB200/GB300 NVL, Grace Hopper/Blackwell) — OOB Redfish via BMC: https://docs.nvidia.com/multi-node-nvlink-systems/nvfupd-guide/introduction.html
  6. Useful nvidia-smi Queries — nvidia-smi --query-gpu=vbios_version --format=csv and --help-query-gpu: https://nvidia.custhelp.com/app/answers/detail/a_id/3751/~/useful-nvidia-smi-queries

Related: GPU Software Stack · Kernel Modules · CUDA Driver · nvidia-smi Reference · Driver Versions & Branches · GSP Firmware Mismatch Runbook · Glossary