Out-of-band management & BMC

Scope: the baseboard management controller (BMC) and the lights-out plane it serves (remote power, serial-over-LAN console, sensor telemetry, firmware update, and inventory), reachable independent of the host OS, NIC, and bootloader. The two access protocols (IPMI, Redfish) and why a physically separate OOB network is mandatory for a GPU fleet.

Reference templates, grounded in the ipmitool manual, the DMTF Redfish specification and schema guide, and the NVIDIA DGX H100/H200 BMC documentation. Nothing here was executed on hardware. Substitute real BMC addresses, credentials, system and chassis IDs; confirm the exact URIs and ResetType values your BMC advertises before scripting against a fleet.

This page is the BMC reference the rest of the provisioning stack points back to. The access protocols have their own deep-dives (IPMI and Redfish), and the physical OOB network is covered in OOB network infrastructure; the fleet-level provisioning flow that depends on all three lives in provisioning and scheduling. When the BMC itself is unreachable, jump to the OOB-unreachable runbook.

flowchart LR
  ADMIN["Admin / automation"] -->|"IPMI or Redfish"| BMC["BMC (lights-out)"]
  BMC --> PWR["Power: on / off / cycle"]
  BMC --> SOL["Serial-over-LAN console"]
  BMC --> SENS["Sensors: temp / power / fan"]
  BMC --> FW["Firmware update"]
  BMC --> INV["Inventory: FRU / SEL"]
  BMC -.->|"sideband"| HOST["Host: CPU / GPU / NIC"]

What it is

A BMC is an autonomous service processor on the motherboard with its own CPU, RAM, flash, and dedicated NIC. It runs whenever the chassis has standby power (before POST, while the host OS is hung, and after a kernel panic), which is the entire point: management does not depend on the thing being managed. "IPMI is an open standard for monitoring, logging, recovery, inventory, and control of hardware" that operates "independently of the operating system, the BIOS and the system's main CPU".1

Two protocols ride the BMC, and a GPU node typically exposes both:

  • IPMI is the legacy binary protocol. ipmitool reaches a remote BMC over lanplus (IPMI v2.0 / RMCP+, UDP 623), while IPMI v1.5 is served by the older lan interface.1 IPMI is the reliable path for serial-over-LAN and for low-level sensor/event-log access. See IPMI.
  • Redfish is the DMTF successor: a RESTful, schema-based, JSON-over-HTTPS model defined by DSP0266 as an "interoperable, multivendor, remote, and out-of-band capable interface" for "scalable platform management".3 It is the modern default for power, inventory, telemetry, and firmware; vendors model GPU-specific resources under it. See Redfish.

The capability set is the same regardless of protocol: remote power control, console redirection (serial-over-LAN; KVM/virtual media on richer BMCs), sensor telemetry, system event log (SEL), field-replaceable-unit (FRU) inventory, and out-of-band firmware update. On a DGX/HGX system the BMC also surfaces the GPU baseboard and per-GPU environmental data (power, temperature) without touching the host (see Validated usage).

This is distinct from in-band telemetry. nvidia-smi and DCGM (NVIDIA Data Center GPU Manager) read the GPU through the driver on a running host; the BMC reads board and inlet sensors over sideband and stays up when the host is down. The two are complementary: in-band tells you the GPU is throttling, the BMC tells you the inlet is 40 °C and lets you power-cycle a node that has stopped answering SSH.

Why it's needed (and when)

A separate OOB plane is not optional at fleet scale; it is the only management path that survives the failures you actually have.

  • Hung hosts are steady-state, not exceptional. At thousands of GPUs, hardware failure is continuous, and the RAS loop ends in a node reset. The in-band reset nvidia-smi -r requires a host that still schedules; when a GPU has "fallen off the bus" (Xid 79) or the kernel has wedged, the only remaining lever is a BMC power-cycle. Without OOB you are dispatching a human to the rack.
  • Provisioning needs power and console before any OS exists. Network-boot a bare node, watch the firmware screen over serial-over-LAN, and flip the boot order, all before the disk is touched. This is the precondition for bare-metal PXE and the imaging stack in image management; the orchestrators in provisioning tools (MAAS, Ironic, Warewulf, xCAT) drive nodes through exactly this interface.
  • Health gating depends on out-of-band power. Draining an unhealthy node (GPU health gating) and recycling it with a clean power cycle is a BMC operation when the host is uncooperative.
  • Firmware and inventory live below the OS. BMC/BIOS/GPU firmware baselines and FRU inventory are read and written out-of-band, so a node with no working OS can still be audited and re-flashed (install lifecycle).

When you do not lean on it: routine, in-band GPU telemetry and resets on a healthy host belong to DCGM and nvidia-smi, not the BMC. Reach for OOB when the host is unreachable, headless, not yet provisioned, or being power-controlled, and design as if every node will be in that state regularly, because at fleet scale some always are.
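The rule above reduces to a reachability test. A minimal sketch, assuming key-based SSH to the host and the IPMI_PASSWORD export shown in the setup section below. The function name recover_node and the 5-second timeout are invented for this example, and a production loop would drain the node and rate-limit resets before touching power:

```shell
# recover_node HOST BMC_ADDR
# Hypothetical helper: stay in-band while the host answers, go out-of-band when it does not.
recover_node() {
  node=$1
  bmc=$2
  if ssh -o BatchMode=yes -o ConnectTimeout=5 "$node" true 2>/dev/null; then
    # Host still schedules: leave telemetry and resets to DCGM / nvidia-smi.
    echo "in-band"
  else
    # Host is silent: the BMC is the only remaining lever.
    ipmitool -I lanplus -H "$bmc" -U admin -E chassis power cycle >/dev/null || return 1
    echo "power-cycled"
  fi
}
```

Called as `recover_node gpu-node-01 10.0.0.21`, it prints which path it took, so a caller can log the decision.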

The OOB network itself must be physically (or at least logically) isolated from the data plane. NVIDIA's DGX SuperPOD runs out-of-band management as a separate fabric.7 The BMC is a powerful, historically under-hardened attack surface: full pre-auth compromises exist, e.g. the IPMI 2.0 cipher-suite-0 / RAKP authentication bypass (CVE-2013-4782, CVSS 10.0), which lets a remote attacker run arbitrary IPMI commands with any password.8 An exposed BMC is remote, persistent, host-independent code execution. Keep it off the data and tenant networks; details in OOB network infrastructure.

How it's set up & managed

Reference templates, not hardware-tested. Substitute your own addresses, credentials, and IDs.

Credentials and access hygiene

Never pass the BMC password on the command line: it is visible in the process list. The ipmitool manual states that specifying the password as a command-line option is not recommended; use -E to read it from the IPMI_PASSWORD environment variable instead.2

# Reference template, not hardware-tested.
export IPMI_PASSWORD='...'          # ipmitool reads this with -E; keeps it out of `ps`
BMC=10.0.0.21                       # OOB address of the target node's BMC
ipmitool -I lanplus -H "$BMC" -U admin -E mc info   # confirm reachability + BMC firmware

-I lanplus selects IPMI v2.0 (RMCP+, encrypted); plain -I lan is v1.5 and should be avoided.1 Force a strong cipher suite with -C: suites 3 and 17 are the integrity/confidentiality options to use; cipher suite 0 means no authentication and must be disabled on the BMC, per the cipher-zero bypass above.8 Set the session privilege level explicitly with -L (USER, OPERATOR, ADMINISTRATOR).2
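One way to keep these flags consistent across scripts is a thin wrapper. This is a sketch, not part of ipmitool: bmc_ipmi and the BMC_USER, BMC_CIPHER, and BMC_PRIV variables are names invented here, and the defaults (suite 17, OPERATOR) are assumptions to check against what your BMC accepts.

```shell
# bmc_ipmi BMC_ADDR SUBCOMMAND...
# Hypothetical wrapper: pins transport, cipher suite, and privilege level in one place.
# The password still comes from IPMI_PASSWORD via -E, never from the command line.
bmc_ipmi() {
  bmc_host=$1
  shift
  ipmitool -I lanplus -H "$bmc_host" \
    -U "${BMC_USER:-admin}" -E \
    -C "${BMC_CIPHER:-17}" \
    -L "${BMC_PRIV:-OPERATOR}" \
    "$@"
}
```

`bmc_ipmi "$BMC" chassis power status` then runs with the pinned options; export BMC_PRIV=ADMINISTRATOR for commands the BMC refuses at the lower level.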

Power control

The single most important OOB operation. chassis power (aliased to power) takes status, on, off, cycle, reset, and soft.2

# IPMI. Reference template, not hardware-tested.
ipmitool -I lanplus -H "$BMC" -U admin -E chassis power status   # -> "Chassis Power is on/off"
ipmitool -I lanplus -H "$BMC" -U admin -E chassis power cycle    # hard off, then on
ipmitool -I lanplus -H "$BMC" -U admin -E chassis power reset    # hard reset, no off phase
ipmitool -I lanplus -H "$BMC" -U admin -E chassis power soft     # ACPI graceful (asks the OS)

The Redfish equivalent is a POST to the system's ComputerSystem.Reset action. The canonical target is /redfish/v1/Systems/{id}/Actions/ComputerSystem.Reset, and ResetType is an enumeration whose members include On, ForceOff, GracefulShutdown, GracefulRestart, ForceRestart, Nmi, ForceOn, PushPowerButton, and PowerCycle.4 Not every implementation accepts every member, so discover the values your BMC supports rather than assuming: read the action's ResetType@Redfish.AllowableValues annotation, or the ActionInfo resource it links through @Redfish.ActionInfo (commonly .../ResetActionInfo). On DGX H100/H200 the system id is DGX and NVIDIA documents ForceRestart, On, GracefulRestart, ForceOff, GracefulShutdown, PowerCycle:6

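# Redfish (DGX H100/H200). Reference template, not hardware-tested.
# -k skips TLS certificate verification; drop it once the BMC presents a trusted certificate.
# -u places the password in curl's arguments; prefer a session token for anything scripted.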
curl -k -u admin:"$IPMI_PASSWORD" --request POST \
  "https://$BMC/redfish/v1/Systems/DGX/Actions/ComputerSystem.Reset" \
  --header 'Content-Type: application/json' \
  --data '{"ResetType": "ForceRestart"}'
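
Before sending a reset, a script can ask the BMC which ResetType values it advertises. The sketch below assumes the ActionInfo resource sits at the conventional .../ResetActionInfo path (the authoritative link is the action's @Redfish.ActionInfo annotation), that $TOKEN holds a Redfish session token such as the one created in the session example below, and that jq is installed. Both function names are invented for this example.

```shell
# redfish_reset_types SYSTEM_ID
# Print the ResetType values the BMC advertises, one per line.
redfish_reset_types() {
  curl -sk -H "X-Auth-Token: $TOKEN" \
    "https://$BMC/redfish/v1/Systems/$1/ResetActionInfo" |
    jq -r '.Parameters[] | select(.Name == "ResetType") | .AllowableValues[]'
}

# redfish_supports_reset SYSTEM_ID RESET_TYPE
# Succeed only if RESET_TYPE is among the advertised values.
redfish_supports_reset() {
  redfish_reset_types "$1" | grep -qx -- "$2"
}
```

A guard such as `redfish_supports_reset DGX PowerCycle || echo "PowerCycle not advertised"` keeps a fleet script from posting a value the BMC would reject.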

Prefer a Redfish session over basic auth for anything scripted: POST credentials to the session service and reuse the returned token. Creating a session at /redfish/v1/SessionService/Sessions with {"UserName": ..., "Password": ...} returns HTTP 201 and an X-Auth-Token header to send on every subsequent request; DELETE the session location when done.5

# Redfish session auth. Reference template, not hardware-tested.
# jq builds the JSON body from the exported IPMI_PASSWORD, so the password is escaped
# correctly and never appears in an argument list.
TOKEN=$(jq -n '{UserName: "admin", Password: env.IPMI_PASSWORD}' \
  | curl -sk -D - -o /dev/null -H 'Content-Type: application/json' -d @- \
      "https://$BMC/redfish/v1/SessionService/Sessions" \
  | awk 'tolower($1) == "x-auth-token:" {print $2}' | tr -d '\r')   # header names are case-insensitive
curl -sk -H "X-Auth-Token: $TOKEN" "https://$BMC/redfish/v1/Systems/DGX" | jq '.PowerState, .Status'
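
Deleting the session when done needs the session's Location header as well as the token, so one approach is to keep the response headers from the login. A sketch under stated assumptions: redfish_header, redfish_login, and redfish_logout are names invented here, the headers are held in a mktemp scratch file, and the BMC may return Location either as a path or as an absolute URL.

```shell
# redfish_header NAME FILE -> value of that response header (names are case-insensitive)
redfish_header() {
  awk -v want="$(printf '%s:' "$1" | tr '[:upper:]' '[:lower:]')" \
    'tolower($1) == want { print $2 }' "$2" | tr -d '\r'
}

# Create a session; sets TOKEN and SESSION for later calls.
redfish_login() {
  HDRS=$(mktemp) || return 1
  jq -n '{UserName: "admin", Password: env.IPMI_PASSWORD}' |
    curl -sk -D "$HDRS" -o /dev/null -H 'Content-Type: application/json' -d @- \
      "https://$BMC/redfish/v1/SessionService/Sessions" || return 1
  TOKEN=$(redfish_header X-Auth-Token "$HDRS")
  SESSION=$(redfish_header Location "$HDRS")
  [ -n "$TOKEN" ] && [ -n "$SESSION" ]
}

# DELETE the session resource so it does not count against the BMC's session limit.
redfish_logout() {
  case $SESSION in
    http*) url=$SESSION ;;
    *)     url="https://$BMC$SESSION" ;;
  esac
  curl -sk -X DELETE -H "X-Auth-Token: $TOKEN" "$url"
  rm -f "$HDRS"
}
```

Pairing `redfish_login` with `trap redfish_logout EXIT` in a script is one way to make sure sessions are released even when a later step fails.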

Serial-over-LAN console

The headless console. IPMI is the dependable path: NVIDIA does not document a Redfish SOL/console endpoint for DGX H100/H200.6 Check the SOL channel configuration, then attach:

# IPMI SOL. Reference template, not hardware-tested.
ipmitool -I lanplus -H "$BMC" -U admin -E sol info        # show SOL channel config
ipmitool -I lanplus -H "$BMC" -U admin -E sol activate     # attach console (exit with ~. )
# if a stale session holds the channel:
ipmitool -I lanplus -H "$BMC" -U admin -E sol deactivate

SOL requires the host serial console to be redirected to the right port (BIOS/firmware "Console Redirection" to the same COM/baud the BMC expects, and a kernel console=ttyS0,115200 for boot logs). Without that redirection you attach to a silent channel.
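The kernel half of that requirement can be checked from a running host before relying on SOL. A small sketch: has_serial_console is a name invented here, and it only tests for some console=ttyS<N> argument, so matching the port and baud rate against what the BMC expects remains a manual step.

```shell
# has_serial_console "KERNEL CMDLINE"
# Succeed if the command line carries a serial console= argument (console=ttyS0,115200 and similar).
has_serial_console() {
  case " $1 " in
    *" console=ttyS"[0-9]*) return 0 ;;
    *) return 1 ;;
  esac
}
```

On the host, `has_serial_console "$(cat /proc/cmdline)" || echo "no serial console configured"` flags a node whose SOL session would stay blank after the firmware hands off to the kernel.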

Sensors, inventory, and the event log

# IPMI inventory + telemetry. Reference template, not hardware-tested.
ipmitool -I lanplus -H "$BMC" -U admin -E sensor list     # all sensors with thresholds
ipmitool -I lanplus -H "$BMC" -U admin -E sdr type Temperature   # temperatures only
ipmitool -I lanplus -H "$BMC" -U admin -E sel elist        # extended system event log
ipmitool -I lanplus -H "$BMC" -U admin -E fru print        # board/PSU/chassis FRU data

In Redfish, sensors live in the chassis Sensors collection: /redfish/v1/Chassis/{id}/Sensors.4 The older monolithic Thermal and Power resources are deprecated in favor of the Sensors, ThermalSubsystem, and PowerSubsystem resources, so prefer the Sensors collection on current BMCs.4 On DGX the chassis id is DGX and the collection paginates 75 members at a time (use $skip):6

# Redfish sensors (DGX). Reference template, not hardware-tested.
curl -sk -H "X-Auth-Token: $TOKEN" "https://$BMC/redfish/v1/Chassis/DGX/Sensors" | jq '.Members | length'
# per-GPU power/temperature over the BMC (GPU_SXM_1 .. GPU_SXM_8):
curl -sk -H "X-Auth-Token: $TOKEN" \
  "https://$BMC/redfish/v1/Systems/HGX_Baseboard_0/Processors/GPU_SXM_1/EnvironmentMetrics" | jq

Network and firmware

DGX H100/H200 ships the BMC preconfigured on the link-local address 169.254.0.17 (within 169.254.0.0/16); set a routable OOB address before fleet use.6 BMC/BIOS/GPU firmware is updated out-of-band via the Redfish UpdateService (vendor-specific image and task flow). Pin versions and stage per the install lifecycle and the image-drift runbook; confirm the exact UpdateService payload and task URIs against your BMC's documentation before scripting.

Validated usage & tests

Reference templates, not hardware-tested. Expected output is described, never invented; run these against your own BMC and compare.

  1. Reachability. ipmitool ... mc info against a healthy BMC returns the management controller's firmware revision, IPMI version, manufacturer and product IDs, and the list of additional device support. A timeout instead means the OOB address, credentials, cipher suite, or UDP 623 path is wrong; start the OOB-unreachable runbook.

  2. Power state agrees across views. On a running node, chassis power status reports power on (ipmitool prints a single Chassis Power is on line), the Redfish Systems/{id} PowerState reads On, and the host answers in-band. Disagreement among these views is itself a finding: trust the BMC for the physical power state.

  3. Telemetry is plausible. sdr type Temperature lists inlet and component temperatures with their lower/upper non-critical and critical thresholds; nothing should be reading at or past a critical threshold on a healthy node, and no sensor should report ns (no reading) where one is expected. The Redfish Sensors collection should return the same physical quantities. Treat absent or pegged sensors as a fault, not noise.

  4. Console attaches. sol activate brings up a live serial console; on a booting node you see firmware and kernel messages scroll, and after login a shell prompt. A blank session usually means host console redirection is not set to the BMC's serial port (fix the BIOS/kernel console= config), not that SOL is broken. Detach with the escape sequence (~.); do not leave the channel held, or the next sol activate is refused because the SOL payload is still active on another session, until you sol deactivate.

  5. Hard recovery works end-to-end. On a node that has stopped answering SSH, chassis power cycle (Redfish PowerCycle) or chassis power reset (Redfish ForceRestart) returns the node to a clean boot; confirm by watching it over SOL and re-checking power state. This is the path the RAS drain/reset loop falls back to when nvidia-smi -r cannot reach the host.
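
Check 3 lends itself to a small filter. The sketch below assumes the pipe-separated row layout that ipmitool sdr type prints (name, sensor id, status, entity, reading) and treats any status other than ok as worth a look; flag_bad_sensors is a name invented here, and the layout should be confirmed against your ipmitool version before gating on it.

```shell
# Reads sdr rows on stdin:  name | id | status | entity | reading
# Prints every row whose status is not "ok" (for example ns, nc, cr, nr);
# exit status is 1 if any such row was seen, 0 otherwise.
flag_bad_sensors() {
  awk -F'|' '
    NF >= 5 {
      status = $3
      gsub(/ /, "", status)
      if (status != "ok") { print; bad = 1 }
    }
    END { exit bad + 0 }
  '
}
```

Usage would be `ipmitool -I lanplus -H "$BMC" -U admin -E sdr type Temperature | flag_bad_sensors`, with a non-zero exit marking the node for follow-up.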

Do not record absolute sensor readings or firmware strings from this page as ground truth; capture your fleet's real baselines during commissioning and gate against those.
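
Capturing a Redfish baseline means walking the whole Sensors collection, which on DGX arrives in pages. A sketch of that walk, assuming $TOKEN holds a session token, jq is installed, and the collection reports Members@odata.count; redfish_sensor_ids is a name invented here. The backslash in \$skip keeps the shell from expanding it as a variable.

```shell
# Print the @odata.id of every member of the DGX chassis Sensors collection.
redfish_sensor_ids() {
  base="https://$BMC/redfish/v1/Chassis/DGX/Sensors"
  skip=0
  total=1            # placeholder until the first page reports the real count
  while [ "$skip" -lt "$total" ]; do
    page=$(curl -sk -H "X-Auth-Token: $TOKEN" "$base?\$skip=$skip") || return 1
    total=$(printf '%s' "$page" | jq -r '."Members@odata.count"') || return 1
    n=$(printf '%s' "$page" | jq '.Members | length') || return 1
    [ "$n" -gt 0 ] || break          # defensive: never spin on an empty page
    printf '%s' "$page" | jq -r '.Members[]."@odata.id"'
    skip=$((skip + n))
  done
}
```

Redirecting the output to a per-node file during commissioning gives a member list to diff against later, without recording any readings from this page.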

Failure modes

  • BMC unreachable, no response to mc info / Redfish: wrong OOB subnet or VLAN, UDP 623 filtered, cipher-suite mismatch, exhausted BMC sessions, or a wedged BMC needing a cold reset. Triage with the OOB-unreachable runbook.
  • Stale SOL session: sol activate reports the channel already active because a prior session was not cleanly detached; clear it with sol deactivate.
  • Power action no-op or wrong ResetType: the BMC rejects a ResetType it does not implement; enumerate ResetActionInfo and use a supported member. A GracefulShutdown on a hung host may time out, so escalate to ForceOff/PowerCycle.
  • Blank console: SOL attaches but shows nothing, because host serial redirection or the kernel console= line does not match the BMC's expected port/baud.
  • Exposed or default-credential BMC: a BMC on the data/tenant network, or with cipher suite 0 enabled or factory credentials, is a critical pre-auth foothold (CVE-2013-4782). Isolate the OOB plane (OOB network infrastructure) and disable cipher zero.
  • Firmware/inventory drift: BMC/BIOS/GPU firmware diverges across the fleet, causing non-reproducible faults; reconcile via the image-drift runbook.
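
The escalation described in the power-action bullet can be sketched over IPMI as a graceful request with a bounded wait, followed by a forced power-off. The function name graceful_then_force, the 10-second poll, and the 120-second default budget are assumptions of this example, not values from any vendor document.

```shell
# graceful_then_force BMC_ADDR [BUDGET_SECONDS]
# Ask the OS to shut down; if the chassis is still on when the budget runs out, force it off.
graceful_then_force() {
  host=$1
  budget=${2:-120}
  waited=0

  ipmitool -I lanplus -H "$host" -U admin -E chassis power soft || return 1

  while [ "$waited" -lt "$budget" ]; do
    state=$(ipmitool -I lanplus -H "$host" -U admin -E chassis power status)
    case $state in
      *"is off"*) echo "graceful"; return 0 ;;
    esac
    sleep 10
    waited=$((waited + 10))
  done

  # The OS never completed the ACPI shutdown: force the chassis off.
  ipmitool -I lanplus -H "$host" -U admin -E chassis power off || return 1
  echo "forced"
}
```

The printed word records which path was taken; a node that repeatedly needs the forced path is a candidate for the drain and RAS workflow rather than another retry.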

References

  • ipmitool project (canonical upstream; IPMI 1.5 lan and 2.0/RMCP+ lanplus): https://github.com/ipmitool/ipmitool
  • ipmitool(1) manual (subcommands, -I/-H/-U/-E/-L/-C options, password-on-CLI warning): https://man.archlinux.org/man/extra/ipmitool/ipmitool.1.en
  • DMTF Redfish specification DSP0266 (out-of-band scalable platform management): https://redfish.dmtf.org/schemas/DSP0266_1.15.1.html
  • DMTF Redfish landing page (standard overview): https://www.dmtf.org/standards/redfish
  • DMTF Redfish Resource and Schema Guide DSP2046 (ComputerSystem.Reset URI + ResetType enum; Thermal/Power deprecated for Sensors): https://redfish.dmtf.org/schemas/v1/DSP2046_2025.3.html
  • Redfish session authentication (SessionService/Sessions -> X-Auth-Token): https://www.dmtf.org/sites/default/files/standards/documents/DSP2060_1.0.0.pdf
  • NVIDIA DGX H100/H200 Redfish API support (system id DGX, sensors, per-GPU EnvironmentMetrics, default BMC IP 169.254.0.17, no Redfish SOL): https://docs.nvidia.com/dgx/dgxh100-user-guide/redfish-api-supp.html
  • NVIDIA DGX SuperPOD reference architecture (separate out-of-band management fabric): https://docs.nvidia.com/dgx-superpod/reference-architecture/scalable-infrastructure-b300-xdr/latest/dgx-superpod-components.html
  • CVE-2013-4782 (IPMI 2.0 cipher-suite-0 / RAKP authentication bypass, CVSS 10.0): https://nvd.nist.gov/vuln/detail/CVE-2013-4782

Related: IPMI · Redfish · OOB network · Provisioning & scheduling · OOB-unreachable runbook · Glossary


  1. ipmitool project README: "ipmitool is a utility for managing and configuring devices that support the Intelligent Platform Management Interface. IPMI is an open standard for monitoring, logging, recovery, inventory, and control of hardware." IPMI v1.5 via the lan interface and IPMI v2.0 (RMCP+) via the lanplus interface. https://github.com/ipmitool/ipmitool 

  2. ipmitool(1) manual: chassis power (alias power) subcommands status|on|off|cycle|reset|soft; global options -I lanplus, -H, -U, -P (deprecated/visible in process list — "specifying the password as a command line option is not recommended"), -E (read from IPMI_PASSWORD), -L <privlvl>, -C <ciphersuite>; sensor list, sdr type, sel elist, fru print, sol info|activate|deactivate, mc info. https://man.archlinux.org/man/extra/ipmitool/ipmitool.1.en 

  3. DMTF Redfish Specification DSP0266: Redfish defines "an interoperable, multivendor, remote, and out-of-band capable interface" for "scalable platform management", based on out-of-band systems management with a RESTful interface and a schema-based data model. https://redfish.dmtf.org/schemas/DSP0266_1.15.1.html 

  4. DMTF Redfish Resource and Schema Guide (DSP2046): ComputerSystem.Reset targets /redfish/v1/Systems/{id}/Actions/ComputerSystem.Reset with ResetType enum values On, ForceOff, GracefulShutdown, GracefulRestart, ForceRestart, Nmi, ForceOn, PushPowerButton, PowerCycle; chassis Sensors collection at /redfish/v1/Chassis/{id}/Sensors; Thermal and Power resources deprecated in favor of Sensors/ThermalSubsystem/PowerSubsystem. https://redfish.dmtf.org/schemas/v1/DSP2046_2025.3.html 

  5. Redfish session authentication: POST {"UserName":..., "Password":...} to /redfish/v1/SessionService/Sessions returns HTTP 201 with an X-Auth-Token header used on subsequent requests; DELETE the session location to log out. https://www.dmtf.org/sites/default/files/standards/documents/DSP2060_1.0.0.pdf 

  6. NVIDIA DGX H100/H200 User Guide, Redfish APIs Support: base URI https://<bmc-ip-address>/redfish/v1/ with basic auth; reset at /redfish/v1/Systems/DGX/Actions/ComputerSystem.Reset (ResetType: ForceRestart, On, GracefulRestart, ForceOff, GracefulShutdown, PowerCycle); sensors at /redfish/v1/Chassis/DGX/Sensors (75 members per page, $skip pagination); per-GPU metrics at /redfish/v1/Systems/HGX_Baseboard_0/Processors/GPU_SXM_<1-8>/EnvironmentMetrics; BMC preconfigured at 169.254.0.17 on 169.254.0.0/16; no documented Redfish SOL/console endpoint. https://docs.nvidia.com/dgx/dgxh100-user-guide/redfish-api-supp.html 

  7. NVIDIA DGX SuperPOD reference architecture: out-of-band management is a separate fabric alongside compute, storage, and in-band management. https://docs.nvidia.com/dgx-superpod/reference-architecture/scalable-infrastructure-b300-xdr/latest/dgx-superpod-components.html 

  8. CVE-2013-4782 (NVD): IPMI 2.0 cipher suite 0 ("cipher zero") / RAKP authentication bypass — a remote attacker can bypass authentication and execute arbitrary IPMI commands using cipher suite 0 and an arbitrary password; CVSS v2 base score 10.0. Cipher suite 0 provides no authentication and must be disabled. https://nvd.nist.gov/vuln/detail/CVE-2013-4782