
Runbook: OOB / BMC unreachable

Scope: a node has hung and its baseboard management controller (BMC) does not answer (no ping; ipmitool and Redfish both time out), so there is no lights-out path to power-cycle the node or reach its console. The goal is to restore the out-of-band (OOB) channel, or escalate to hands-on, without making blind power changes.

Every command below is a reference template, not hardware-tested. BMC IPs, channel numbers, user IDs, the Redfish SystemId/ManagerId, and DHCP/VLAN details are site-specific. Substitute your own and validate on one node before relying on this.

The OOB channel is the recovery path of last resort: it works when the OS is wedged, the GPUs are hung, or the host NIC is down. When it is also gone, the node is dark: no remote power, no serial console, no sensor read. This runbook does not re-explain what a BMC is or how the OOB network is built; those topics are covered on the OOB management / BMC and OOB network infrastructure pages, with IPMI and Redfish protocol detail on the IPMI / Redfish pages. The conceptual hub is provisioning and scheduling. Here we only diagnose and restore reachability.

A loud caveat up front: "BMC unreachable" and "host hung" are two faults that often arrive together but are independent. The BMC is a separate microcontroller with its own power, its own NIC, and its own IP; it normally stays up across a host hang, a host power-off, even a host kernel panic. If the BMC is unreachable, treat that as a second incident on its own management plane. Do not assume fixing one fixes the other, and never power-cycle a host you cannot observe.

Trigger

Open this runbook when a node is unresponsive on its OOB channel and you therefore have no lights-out recovery path. Concretely, any of:

BMC IP does not ping. ping <bmc_ip> is 100% loss while the same subnet's other BMCs answer.
ipmitool times out. A LAN+ session to the BMC fails to establish (Error: Unable to establish IPMI v2 / RMCP+ session or a hang) rather than returning mc info. ipmitool uses the lanplus interface over UDP/IPv4 with the RMCP+ protocol of the IPMI v2.0 spec; a timeout here means the BMC's IPMI stack is not answering.¹
Redfish times out or refuses the connection. GET https://<bmc_ip>/redfish/v1/ (the Redfish service root) does not return: connection refused, TLS handshake hang, or no route.²
The host is hung and you reached for the BMC to recover it, and found the BMC also dark. This is the headline case: a wedged node you cannot drain, reset, or even console because its lights-out controller is gone too.

Scope boundary: if the BMC answers but power control or console misbehaves (auth fails, sol activate drops, a sensor reads bad), that is a different fault: the management plane is up. Stay on the protocol pages (IPMI / Redfish); this runbook is specifically for no answer at all.
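The trigger conditions and the scope boundary above can be read as a small decision over three probe results. The sketch below is one possible reading, not an official tool: `triage_oob` and its four output labels are hypothetical names introduced here, and the arguments are simply the exit statuses of your ping, `mc info`, and Redfish service-root probes.

```shell
#!/usr/bin/env bash
# triage_oob <ping_rc> <ipmi_rc> <redfish_rc>
# Hypothetical helper (not part of any tool): each argument is the exit status
# of one probe, 0 meaning "answered" and non-zero meaning "failed or timed out".
triage_oob() {
  local ping_rc="$1" ipmi_rc="$2" redfish_rc="$3"
  if [ "$ipmi_rc" -eq 0 ] && [ "$redfish_rc" -eq 0 ]; then
    echo "management-plane-up"   # both protocols answer: wrong runbook
  elif [ "$ipmi_rc" -eq 0 ] || [ "$redfish_rc" -eq 0 ]; then
    echo "partial"               # one protocol survives: usable for a BMC reset
  elif [ "$ping_rc" -eq 0 ]; then
    echo "service-fault"         # pings, but neither protocol answers
  else
    echo "dark"                  # nothing answers: this runbook applies
  fi
}

# Example with made-up exit codes (ping, IPMI and Redfish all failed):
triage_oob 1 1 7   # prints: dark
```

Only the `dark` and `partial` outcomes belong in this runbook; the other two point back to the IPMI / Redfish pages.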

Pre-checks

Goal: localise the break. Is it the BMC, the management link/switch, the management VLAN, or address assignment (DHCP)? Run these from a host that sits on the same OOB management network. Change nothing yet.

Reachability and L2 adjacency. Ping the BMC, then check whether the OOB network even has an ARP entry for it. An IP that pings is reachable; an IP with no ARP entry and no ping is not on the wire as far as this host can see:
```
BMC_IP=10.20.0.37          # this node's BMC address
ping -c 3 -W 2 "$BMC_IP"
ip neigh show "$BMC_IP"    # FAILED / no entry => no L2 reply on the mgmt segment
```
A BMC that pings but refuses IPMI/Redfish is a service problem: a different fault, covered on the IPMI / Redfish pages. A BMC that does not ping is a network or BMC-down problem; continue with steps 2-5.
IPMI session, explicitly over LAN+. Confirm the timeout is the BMC's IPMI stack, not a tooling mistake. mc info is the canonical liveness probe; it returns the BMC device/firmware revision and IPMI version:¹
```
ipmitool -I lanplus -H "$BMC_IP" -U "$BMC_USER" -f /run/bmc.pw mc info
# -f reads the password from a file so it is not in shell history / argv
```
-f <file> supplies the password from a file rather than -P on the command line, keeping it out of argv and history.¹ A clean mc info means the BMC is actually up and you are likely in the wrong runbook; a timeout confirms the OOB channel is dead.
Redfish liveness. Cross-check with the modern path. Some BMCs disable IPMI-over-LAN but keep Redfish (or vice versa), so probe both before concluding the controller is down. The Redfish service root is GET /redfish/v1/:²
```
curl -sk --max-time 5 -u "$BMC_USER:$(cat /run/bmc.pw)" \
  "https://$BMC_IP/redfish/v1/" | head
# connection refused / TLS hang / no route => Redfish is not answering either
```
Management switch port for this BMC's link. If neither protocol answers, look at the physical mgmt link from the switch side: is the port up, and is a MAC learned on it? (Syntax is vendor-specific; these are the NX-OS / Cumulus / generic shapes.)
```
# On the OOB/management switch (example forms — match your NOS):
show interface status | include Eth1/37        # NX-OS: port up/down, speed
show mac address-table interface Ethernet1/37   # is the BMC's MAC learned?
# Cumulus/SONiC:  show interface swp37   /   bridge fdb show dev swp37
```
Port down/notconnect or an empty MAC table for that port points at cabling, the port, or a powered-off BMC, not at the BMC's IP config.
Address assignment (DHCP) on the management segment. Many estates hand BMC addresses out of DHCP on the OOB VLAN. Confirm the BMC currently holds (or recently held) a lease; a missing or expired lease explains a BMC that is powered and linked but has no usable IP:
```
# On the DHCP server serving the OOB VLAN (ISC dhcpd / Kea shapes):
grep -i "<BMC_MAC>" /var/lib/dhcp/dhcpd.leases     # ISC dhcpd lease DB
journalctl -u kea-dhcp4 --no-pager | grep -i "<BMC_MAC>" | tail
```
For statically-addressed BMCs this step is N/A; record that the BMC is static and move on. The fix for a lost lease is on the network side (steps in Procedure), not on the (unreachable) BMC.
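A plain `grep` for the MAC in the lease database shows only the matching `hardware ethernet` line, not the address or expiry of the lease it belongs to. The sketch below walks the ISC `dhcpd.leases` block structure with awk and prints the most recent entry for a MAC. `latest_lease_for_mac` is a hypothetical helper name, and it assumes dynamically leased BMCs: ISC dhcpd typically does not write lease-file entries for `fixed-address` reservations, so for reserved BMCs the server log is the place to look.

```shell
#!/usr/bin/env bash
# latest_lease_for_mac <mac> <leases_file>
# Hypothetical helper: prints "<ip> <binding state> ends <date time>" for the
# last lease block in an ISC dhcpd.leases file whose MAC matches; exits 1 if none.
latest_lease_for_mac() {
  local mac="$1" leases="$2"
  awk -v mac="$(printf '%s' "$mac" | tr 'A-Z' 'a-z')" '
    # A new lease block starts: remember its IP and clear per-block fields.
    $1 == "lease" && $3 == "{" { ip = $2; ends = ""; state = ""; hw = "" }
    # "ends <weekday> <date> <time>;" or "ends never;"
    $1 == "ends" { ends = ($2 == "never;") ? "never" : ($3 " " $4); sub(/;$/, "", ends) }
    $1 == "binding" && $2 == "state" { state = $3; sub(/;$/, "", state) }
    $1 == "hardware" && $2 == "ethernet" { hw = tolower($3); sub(/;$/, "", hw) }
    # Block closes: the file is append-only, so the last match wins.
    $1 == "}" && hw == mac { last = ip " " state " ends " ends }
    END { if (last != "") print last; else exit 1 }
  ' "$leases"
}

# Usage (path as in the pre-check above; adjust for your distribution):
#   latest_lease_for_mac "$BMC_MAC" /var/lib/dhcp/dhcpd.leases
```

A binding state other than `active`, or an `ends` time in the past, is consistent with a BMC that is powered and linked but holds no usable address.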

At the end of pre-checks you should be able to name the break as one of: (a) mgmt switch/VLAN/cabling, (b) address/DHCP, or (c) BMC itself hung/down. That choice drives the Procedure.

```mermaid
flowchart LR
  A["BMC unreachable: no ping, ipmitool and Redfish time out"] --> B{"Ping or ARP reply?"}
  B -->|"No reply"| C{"Switch port up and MAC learned?"}
  B -->|"Pings but IPMI/Redfish refuse"| Z["Service fault, not this runbook: see IPMI / Redfish pages"]
  C -->|"Port down or no MAC"| D["Fix mgmt switch, VLAN or cabling"]
  C -->|"Port up, MAC learned"| E{"Valid DHCP lease or static IP?"}
  E -->|"No lease"| F["Fix DHCP on OOB VLAN"]
  E -->|"Lease or static OK"| G["BMC is hung: cold-reset BMC"]
  D --> H["Re-probe mc info and Redfish root"]
  F --> H
  G --> H
  H -->|"Still dark"| I["Escalate: dispatch hands to the rack"]
```
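The no-reply branch of the decision tree above can be sketched as a small function that turns the pre-check observations into one of the three break categories. This is illustrative only: `classify_break`, its `yes`/`no` and `lease`/`static`/`none` inputs, and the output labels are hypothetical conventions chosen for this example.

```shell
#!/usr/bin/env bash
# classify_break <port_up: yes|no> <mac_learned: yes|no> <addressing: lease|static|none>
# Hypothetical helper mirroring the pre-check outcome (a), (b) or (c).
classify_break() {
  local port_up="$1" mac_learned="$2" addressing="$3"
  if [ "$port_up" != yes ] || [ "$mac_learned" != yes ]; then
    echo "switch-vlan-cabling"   # (a) fix the mgmt switch, VLAN or cabling
  elif [ "$addressing" = none ]; then
    echo "dhcp"                  # (b) fix address assignment on the OOB VLAN
  else
    echo "bmc-hung"              # (c) linked and addressed, still silent
  fi
}

# Example: port is up and the MAC is learned, but there is no lease.
classify_break yes yes none   # prints: dhcp
```

The ordering matters: a down port or missing MAC is checked first, because a lease lookup says little about a BMC that is not on the wire at all.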

Procedure

Cordon and drain the node at the scheduler layer first, before any power or network mutation. The host may be hung and not draining cleanly, but marking it down stops the scheduler placing new work on it and records the incident; do it even if eviction stalls. NODE is the scheduler's node name; BMC_IP / BMC_MAC are the OOB identifiers from pre-checks.

```
NODE=gpu-07.dc1.internal
BMC_IP=10.20.0.37
BMC_MAC=aa:bb:cc:dd:ee:37
BMC_USER=admin
```

Cordon and drain at the scheduler. Stop new placement first; a hung host may not evict, and that is expected here.
```
# Kubernetes:
kubectl cordon "$NODE"
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --timeout=10m
# Slurm:
scontrol update nodename=gpu-07 state=drain reason="oob/bmc unreachable"
```
If drain hangs because the kubelet is gone with the host, do not force-delete pods blindly; note it and proceed, because you cannot safely fence a node whose power state you cannot read.

(Break = switch/VLAN) Restore the management link. Work the mgmt switch, not the BMC (you cannot reach it). Confirm the port is enabled and in the correct management VLAN, then bounce the port to force re-link and MAC re-learn:

```
# On the OOB/management switch (NX-OS form; match your NOS):
show running-config interface Ethernet1/37     # confirm it is in the mgmt VLAN, not shut
configure terminal
  interface Ethernet1/37
    switchport access vlan <OOB_VLAN_ID>       # correct mgmt VLAN
    shutdown                                    # bounce: down...
    no shutdown                                 # ...then up, forces re-link/relearn
end
show mac address-table interface Ethernet1/37   # BMC MAC should reappear
```

Re-probe (ping, mc info) before going further. A wrong-VLAN or shut port is the most common "BMC down" that is really a network fault.

(Break = DHCP) Restore address assignment. If the BMC links but has no IP, fix it on the DHCP side. Confirm the reservation maps BMC_MAC to the intended IP, (re)start the scope, and let the BMC re-request. Reference template, not hardware-tested:
```
# ISC dhcpd: confirm a host reservation exists for this BMC
grep -A3 -i "$BMC_MAC" /etc/dhcp/dhcpd.conf
# host gpu-07-bmc { hardware ethernet aa:bb:cc:dd:ee:37; fixed-address 10.20.0.37; }
sudo systemctl restart isc-dhcp-server        # or: kea-dhcp4
```
You generally cannot force a remote, unreachable BMC to renew over the network; a BMC cold reset (step 4, when physically accessible) or a hands dispatch re-triggers DHCP from the BMC side.
(Break = BMC hung) Cold-reset the BMC. A BMC that is powered, linked, and addressed but still not answering is itself wedged. The reset hierarchy, least to most disruptive:

Redfish Manager.Reset, only if Redfish still answers but IPMI does not. This restarts the BMC and, by design, does not touch the running OS.⁴ Prefer GracefulRestart; use ForceRestart only if the controller is unresponsive.⁴

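```
# Try GracefulRestart first; change ResetType to ForceRestart only if the
# graceful request is rejected or the controller stays unresponsive.
curl -sk -u "$BMC_USER:$(cat /run/bmc.pw)" \
  -H "Content-Type: application/json" -X POST \
  -d '{"ResetType":"GracefulRestart"}' \
  "https://$BMC_IP/redfish/v1/Managers/<ManagerId>/Actions/Manager.Reset"
```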

IPMI mc reset cold, only if IPMI still answers but Redfish does not. This reinitialises the BMC; warm is the gentler option, but a wedged BMC usually needs cold:¹ ⁵
```
ipmitool -I lanplus -H "$BMC_IP" -U "$BMC_USER" -f /run/bmc.pw mc reset cold
```
A warm/cold BMC reset restarts the management controller only; it does not reboot the host CPU, and the OS keeps running.⁵ (If both protocols are dead, neither remote reset is possible, so go to Rollback / dispatch hands.)
Physical BMC reset, when no protocol answers. The BMC runs on standby/aux power, so the chassis power button does not necessarily reset it; most server BMCs only fully reset on a full AC power drain. Removing all PSU power for ~30 s (for example by pulling both power cords) drops standby and forces the BMC to cold-boot. An AC drain also hard-powers-off the host, which is why it is left to hands who can confirm host state at the rack. This is hands-on work (the platform-specific steps are on OOB management / BMC) and is the bridge to the dispatch in Rollback.
(If creds are the suspect, and a protocol is back) Reset BMC credentials. If the BMC answers post-reset but rejects your account (lockout, drifted password), reset the management user from a path that is working. Over IPMI, find the user ID with user list, set the password, and confirm the user's access on the LAN channel:¹
```
ipmitool -I lanplus -H "$BMC_IP" -U "$BMC_USER" -f /run/bmc.pw user list 1
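# then, for the target user id (e.g. 2); "..." stands for the same connection options as above.
# Omitting the password argument makes ipmitool prompt for it, keeping it out of argv and history:
ipmitool ... user set password 2
ipmitool ... channel getaccess 1 2        # confirm the user is enabled on LAN channel 1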
```
If only Redfish is reachable, reset the account through the AccountService instead (PATCH the Password property of the account resource); see Redfish. Where the account DB itself is corrupt, the dependable fix is a factory/default reset of the BMC user config, which is platform-specific and usually done hands-on or in-band from the host (ipmitool ... user run on the host itself); document the exact step on OOB management / BMC.
Confirm the channel is back, both protocols. Re-probe IPMI and Redfish liveness before touching host power:
```
ipmitool -I lanplus -H "$BMC_IP" -U "$BMC_USER" -f /run/bmc.pw mc info
curl -sk --max-time 5 -u "$BMC_USER:$(cat /run/bmc.pw)" "https://$BMC_IP/redfish/v1/" | head
```
Only once the BMC answers do you have a lights-out path again, and only now should you consider recovering the (possibly still-hung) host through it.
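After a BMC reset or a switch-port bounce, the controller usually does not answer immediately, so a single re-probe can give a false "still dark". A minimal retry wrapper, as a sketch: `retry_probe` is a hypothetical helper, and the attempt count and delay in the usage line are assumptions to tune per platform, not vendor figures.

```shell
#!/usr/bin/env bash
# retry_probe <attempts> <delay_seconds> <command...>
# Hypothetical helper: runs the command until it exits 0 or attempts run out.
retry_probe() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "answered on attempt $i"
      return 0
    fi
    # Wait between attempts, but not after the final one.
    if [ "$i" -lt "$attempts" ]; then sleep "$delay"; fi
    i=$((i + 1))
  done
  echo "no answer after $attempts attempts"
  return 1
}

# Usage (20 attempts, 15 s apart, are illustrative values):
#   retry_probe 20 15 ipmitool -I lanplus -H "$BMC_IP" -U "$BMC_USER" -f /run/bmc.pw mc info
```

A non-zero return after the full window is the point at which the "still dark" branch, and the hands dispatch in Rollback, applies.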

Verification

The single proof that the OOB channel is genuinely restored is end-to-end power control: you can read the host's power state and, if needed, command it, through the BMC. Liveness (mc info) alone is necessary but not sufficient: a BMC can answer mc info yet have lost host power control. Verify a read first (non-destructive); only issue a power change if the incident requires recovering the host.

Read host power state: non-destructive, do this first.

```
# IPMI:
ipmitool -I lanplus -H "$BMC_IP" -U "$BMC_USER" -f /run/bmc.pw chassis power status
ipmitool -I lanplus -H "$BMC_IP" -U "$BMC_USER" -f /run/bmc.pw chassis status
# Redfish (read PowerState off the ComputerSystem):
curl -sk -u "$BMC_USER:$(cat /run/bmc.pw)" \
  "https://$BMC_IP/redfish/v1/Systems/<SystemId>" | grep -i '"PowerState"'
```

chassis power status returns the current on/off state; chassis status reports power, buttons, cooling, drives and faults.¹ A correct read here is the proof the management plane controls the host again.
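For a scripted check, the Redfish read is easier to act on if it yields just the state value rather than a matching line of JSON. The sketch below extracts `PowerState` with `tr` and `sed` only; `power_state_from_json` is a hypothetical helper, and where `jq` is installed, `jq -r .PowerState` is the more robust choice because it parses JSON properly.

```shell
#!/usr/bin/env bash
# power_state_from_json
# Hypothetical helper: reads a ComputerSystem JSON document on stdin and prints
# the value of its "PowerState" property (for example On or Off); exits 1 if absent.
power_state_from_json() {
  local state
  # Join lines so pretty-printed and single-line responses are handled alike.
  state=$(tr -d '\n' |
    sed -n 's/.*"PowerState"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p')
  [ -n "$state" ] || return 1
  printf '%s\n' "$state"
}

# Usage:
#   curl -sk -u "$BMC_USER:$(cat /run/bmc.pw)" \
#     "https://$BMC_IP/redfish/v1/Systems/<SystemId>" | power_state_from_json
```

An empty result with a non-zero exit means the response carried no `PowerState`, which should be treated as "cannot read power state", not as "off".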

Command power only if recovering the hung host. This is disruptive, but the node is already cordoned and drained, so this is the right time. Prefer a graceful path; force-reset only a confirmed-hung host. IPMI chassis power reset is a hard reset; Redfish ComputerSystem.Reset takes a ResetType:¹ ³
```
# IPMI hard reset (host CPU reset, BMC stays up):
ipmitool -I lanplus -H "$BMC_IP" -U "$BMC_USER" -f /run/bmc.pw chassis power reset
# Redfish equivalent (GracefulRestart if the OS may respond, ForceRestart if wedged):
curl -sk -u "$BMC_USER:$(cat /run/bmc.pw)" -H "Content-Type: application/json" -X POST \
  -d '{"ResetType":"ForceRestart"}' \
  "https://$BMC_IP/redfish/v1/Systems/<SystemId>/Actions/ComputerSystem.Reset"
```
Redfish ResetType allowable values include On, ForceOff, GracefulShutdown, GracefulRestart, ForceRestart, Nmi, and PushPowerButton.³ An Nmi is the right call when you want a kernel crashdump out of the hung host before resetting it, rather than discarding its state.
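The choice among those ResetType values can be written down as a small rule, which helps keep the decision consistent under incident pressure. This is a sketch of the guidance above, not a Redfish feature: `choose_reset_type` and its `yes`/`no` inputs are hypothetical. It also assumes that `Nmi` is only useful when the host was configured beforehand to panic on NMI and capture a dump (for example with kdump).

```shell
#!/usr/bin/env bash
# choose_reset_type <os_responsive: yes|no> <want_crashdump: yes|no>
# Hypothetical helper: prints the ResetType to send to ComputerSystem.Reset.
choose_reset_type() {
  local os_responsive="$1" want_crashdump="$2"
  if [ "$want_crashdump" = yes ]; then
    echo "Nmi"               # preserve state for a crashdump before any reset
  elif [ "$os_responsive" = yes ]; then
    echo "GracefulRestart"   # the OS may still honour a clean restart
  else
    echo "ForceRestart"      # confirmed-hung host
  fi
}

# Example: hung host, no dump needed.
choose_reset_type no no   # prints: ForceRestart
```

Not every platform implements every value; the ComputerSystem resource advertises what it accepts in the `ResetType@Redfish.AllowableValues` annotation of the reset action, so check the chosen value against that list before sending it.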

Console proves it end-to-end. Re-attach Serial-over-LAN; seeing POST/boot output is the strongest single proof that OOB is fully back and the host is recovering:¹

```
ipmitool -I lanplus -H "$BMC_IP" -U "$BMC_USER" -f /run/bmc.pw sol activate
# exit SOL with the escape sequence (default '~.'); or from another session:
ipmitool ... sol deactivate
```

Return to service only after the host is healthy. OOB being back does not mean the node is fit to schedule. Run the node's normal health gate before uncordon, GPU health especially (GPU diagnostics, and the GPU health gating page). Then:
```
kubectl uncordon "$NODE"
# Slurm: scontrol update nodename=gpu-07 state=resume
```

Rollback

There is no "rollback" of a reachability fix in the usual sense: the network/DHCP/credential changes are forward fixes, reverted by GitOps/config-management if wrong. The real fallback is escalation to physical hands, which is the expected outcome whenever the BMC cannot be reached remotely.

Both IPMI and Redfish dead after a switch/VLAN/DHCP fix → dispatch hands. The BMC needs a physical reset (full AC drain) or is faulty. Leave the node cordoned and drained (never uncordon a node you cannot observe) and hand off with: rack/elevation, NODE, BMC_IP/BMC_MAC, the pre-check evidence (no ping, no ARP, switch port state, lease state), and the explicit ask: AC-drain the BMC (~30 s, both cords), reseat the mgmt cable, confirm the BMC link LED, then call back for a remote re-probe.
Do not blind-power-cycle the host to "fix" the BMC. Pulling host power when you cannot read its state risks corrupting a node mid-write and does not reliably reset a BMC on standby power anyway. The host stays untouched until the OOB channel is back or hands confirm state at the rack.
Suspected BMC hardware fault (no link even after reseat/AC-drain) → RMA path. A BMC that will not come up after a clean power drain and cable reseat is a hardware fault on the management controller; route it through the node fault / RMA flow (reliability and RAS). The node remains drained until the board is repaired or replaced.
If the host was hung independently, recovering OOB only gives you the lever; the host's own recovery (reset, crashdump triage, then health-gate) is a separate step and a separate fault to close out.
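The hands dispatch described above goes more smoothly when the handoff always carries the same fields. A minimal sketch of a note generator follows; `handoff_note` and the layout of the note are hypothetical, and the field list is taken from the dispatch item in this section.

```shell
#!/usr/bin/env bash
# handoff_note <rack/elevation> <node> <bmc_ip> <bmc_mac> <evidence>
# Hypothetical helper: prints a plain-text dispatch request for the rack team.
handoff_note() {
  local rack="$1" node="$2" bmc_ip="$3" bmc_mac="$4" evidence="$5"
  cat <<EOF
OOB/BMC unreachable: hands dispatch request
Rack/elevation: $rack
Node: $node
BMC: $bmc_ip ($bmc_mac)
Evidence: $evidence
Ask: AC-drain the BMC (~30 s, both cords), reseat the mgmt cable,
     confirm the BMC link LED, then call back for a remote re-probe.
Node state: cordoned and drained; do not return to service.
EOF
}

# Usage (the rack location is a placeholder to fill in):
#   handoff_note "<rack/elevation>" "$NODE" "$BMC_IP" "$BMC_MAC" \
#     "no ping; no ARP; switch port up, MAC learned; lease expired"
```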

OOB management / BMC: what the BMC is, standby-power behaviour, and platform-specific physical/cred reset steps referenced above.
OOB network infrastructure: the separate management network, VLAN, and switch fabric this runbook works against.
IPMI / Redfish: protocol detail for the probes and resets used here (use these when the BMC answers but a specific operation misbehaves).
GPU health gating and GPU diagnostics: the health gate that must pass before a recovered node is uncordoned.
reliability and RAS: escalation when the BMC itself is a hardware fault (RMA).
provisioning and scheduling: the OOB/imaging/scheduling hub this runbook sits under.
operational runbooks: the runbook index.

References

ipmitool(1) manual (Arch) — -I lanplus (RMCP+ / IPMI v2.0 over UDP/IPv4), -H/-U/-P/-f, mc info, mc reset warm|cold, lan print, lan set <ch> ipaddr/netmask/defgw ipaddr/ipsrc, user list/user set name/user set password, channel getaccess, chassis status, chassis power status/on/off/cycle/reset, sol activate/sol deactivate: https://man.archlinux.org/man/extra/ipmitool/ipmitool.1.en
DMTF Redfish — ComputerSystem.Reset action, ResetType allowable values (On, ForceOff, GracefulShutdown, GracefulRestart, ForceRestart, Nmi, PushPowerButton), and ComputerSystem.PowerState: https://www.dmtf.org/sites/default/files/standards/documents/DSP2046_2024.2.html
DMTF Redfish — service root at /redfish/v1/ (Redfish User Guide): https://www.dmtf.org/sites/default/files/standards/documents/DSP2060_1.0.0.pdf
DMTF Redfish-Tacklebox — rf_power_reset.py, supported ResetType values incl. GracefulRestart default: https://github.com/DMTF/Redfish-Tacklebox/blob/main/docs/rf_power_reset.md
DMTF Redfish — Manager.Reset action (reset/reboot the manager, i.e. the BMC), ResetType incl. GracefulRestart/ForceRestart: https://redfish.dmtf.org/schemas/v1/Manager.v1_25_0.json
BMC reset guidance — ipmitool mc reset warm|cold, warm preferred first, resets the BMC not the host CPU (Exxact): https://support.exxactcorp.com/hc/en-us/articles/31728599437847-Resetting-the-BMC-Using-ipmitool-on-Linux
OpenBMC ipmitool cheatsheet — mc reset, lan print, user, chassis power command forms: https://github.com/openbmc/docs/blob/master/IPMITOOL-cheatsheet.md
kubectl drain (safely drain a node): https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/

¹ ipmitool(1) manual (Arch): "The lanplus interface communicates with the BMC over an Ethernet LAN connection using UDP under IPv4" using "the RMCP+ protocol as described in the IPMI v2.0 specification"; -H "Remote server address … required for lan and lanplus interfaces"; -f "Specifies a file containing the remote server password"; mc info "Displays information about the BMC hardware … device revision, firmware revision, IPMI version"; mc reset <warm|cold> "Instructs the BMC to perform a warm or cold reset"; lan print [channel]; lan set <channel> ipaddr|netmask|defgw ipaddr|ipsrc; user list, user set name <id> <name>, user set password <id> [password], channel getaccess <channel> [userid]; chassis power status "Show current chassis power status", power reset "perform a hard reset"; chassis status "Status information related to power, buttons, cooling, drives and faults"; sol activate enters Serial Over LAN mode. https://man.archlinux.org/man/extra/ipmitool/ipmitool.1.en
² DMTF Redfish: the Redfish service is rooted at /redfish/v1/ (the service root resource a client GETs first). https://www.dmtf.org/sites/default/files/standards/documents/DSP2060_1.0.0.pdf
³ DMTF Redfish: the ComputerSystem.Reset action (POST to /redfish/v1/Systems/{SystemId}/Actions/ComputerSystem.Reset) takes a ResetType whose allowable values include On, ForceOff, GracefulShutdown, GracefulRestart, ForceRestart, Nmi, and PushPowerButton. https://www.dmtf.org/sites/default/files/standards/documents/DSP2046_2024.2.html
⁴ DMTF Redfish Manager schema, Reset action (POST to /redfish/v1/Managers/{ManagerId}/Actions/Manager.Reset): "The reset action resets/reboots the manager", i.e. it restarts the management controller (BMC) itself, not the host. The ResetType allowable values include GracefulRestart ("Shut down gracefully and restart the unit") and ForceRestart ("Shut down immediately and non-gracefully and restart the unit"); prefer graceful, use forced only if the controller is unresponsive. https://redfish.dmtf.org/schemas/v1/Manager.v1_25_0.json
⁵ BMC reset guidance (Exxact): ipmitool mc reset warm / mc reset cold; "A warm reset should be preferred first unless otherwise instructed by your hardware vendor"; cold "will reinitialize the BMC hardware more aggressively but still does not reboot the server itself"; both "restart the BMC but not the main CPU." https://support.exxactcorp.com/hc/en-us/articles/31728599437847-Resetting-the-BMC-Using-ipmitool-on-Linux