Runbook: stale MIG state
Scope: a node whose actual MIG geometry (nvidia-smi -L / nvidia-smi mig -lgi) has drifted from what the scheduler believes: pods stuck Pending on a MIG resource, the wrong profile advertised, or instances that survived a reconfigure that Kubernetes never saw. Recover by resetting the geometry under cordon/drain and forcing the device plugin to re-enumerate. Partition mechanics and profile tables live in MIG.
Reference templates, not hardware-tested. Validate every command against the MIG User Guide and your driver / GPU Operator release before running in production.
This is the longform procedure for the stale-MIG failure modes called out in MIG. It assumes the GPU Operator MIG manager owns geometry; the manual nvidia-smi mig path at the bottom applies when no Operator is present.
Trigger
- The scheduler sees the wrong MIG instances: kubectl describe node advertises MIG resources that do not match nvidia-smi mig -lgi on the box (e.g. it offers nvidia.com/mig-3g.20gb but the GPU is actually all-1g.10gb, or whole-GPU nvidia.com/gpu after a wiped layout).
- Pods cannot get the requested MIG profile: they stay Pending with Insufficient nvidia.com/mig-<p> even though the node "should" have it.
- nvidia-smi -L mismatches Kubernetes: the device plugin enumerated a geometry that no longer exists, typically after a reconfigure, a -mig 0/1 toggle, or a reboot.
  - On Hopper, MIG mode is not InfoROM-persistent. A rebooted node can come back as one whole GPU while the cluster still expects slices (MIG).[1]
- nvidia.com/mig.config.state is stuck at pending, rebooting, or failed instead of success (Operator-managed nodes).[2]
Pre-checks
Read the actual on-box state before touching anything. NODE is the Kubernetes node name.
NODE=gpu-07.dc1.internal
# Ground truth on the node (run via your node-access path, e.g. ssh / debug pod):
nvidia-smi -L # CUDA-visible devices + MIG-<UUID> per instance
nvidia-smi mig -lgi # GPU instances that actually exist
nvidia-smi mig -lci # compute instances (add -gi N to scope)
nvidia-smi mig -lgip # profiles the card can still create (remaining slice budget)
# What Kubernetes believes:
kubectl describe node "$NODE" | grep -E 'nvidia.com/(gpu|mig)|mig.config'
kubectl get node "$NODE" -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config}{"\n"}'
kubectl get node "$NODE" -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}{"\n"}'
# Operator MIG manager state and recent decisions:
kubectl logs -n gpu-operator -l app=nvidia-mig-manager -c nvidia-mig-manager --tail=200
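Comparing the two views by eye gets error-prone on multi-GPU nodes. A minimal sketch, assuming nvidia-smi -L prints one "MIG <profile> Device N: (UUID: MIG-<...>)" line per instance (the format quoted under Verification); the function name count_mig_profiles is hypothetical:

```shell
# Hypothetical helper: tally MIG devices per profile from `nvidia-smi -L` text on stdin.
# Assumes lines of the form "  MIG <profile> Device N: (UUID: MIG-<...>)".
count_mig_profiles() {
  awk '$1 == "MIG" && $3 == "Device" { n[$2]++ }
       END { for (p in n) printf "%s=%d\n", p, n[p] }' | sort
}

# Usage on the node:
#   nvidia-smi -L | count_mig_profiles
```

An empty result means no MIG devices are enumerated (whole GPU or wiped layout). The per-profile counts can then be set side by side with the nvidia.com/mig-<p> quantities from kubectl describe node.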
Decide which way the drift goes:
- Geometry on box is correct, Kubernetes is stale → you only need to re-enumerate (restart the device plugin / GFD). Skip the destroy steps.
- mig.config.state=failed → the manager could not apply the requested layout (often clients still attached, or a profile that does not fit the remaining slice budget). Fix the blocker, then re-drive the label.
- Geometry is wrong/partial → reset and recreate (full Procedure below).
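The triage above can be written down as a small decision function. This is a sketch only: classify_drift and its return strings are hypothetical, and the geometry arguments are whatever normalized strings you choose to compare (for example "1g.10gb=7"). The failed state is checked first because it blocks either of the other two paths.

```shell
# Hypothetical triage helper. Arguments are plain strings you produce yourself:
#   $1 on-box geometry, $2 geometry Kubernetes advertises, $3 intended geometry,
#   $4 value of nvidia.com/mig.config.state (empty when no Operator is present).
classify_drift() {
  if [ "$4" = "failed" ]; then
    echo "fix-blocker-then-relabel"
  elif [ "$1" != "$3" ]; then
    echo "reset-and-recreate"
  elif [ "$2" != "$1" ]; then
    echo "re-enumerate-only"
  else
    echo "no-drift"
  fi
}

# Usage:
#   classify_drift "1g.10gb=7" "3g.40gb=2" "1g.10gb=7" "success"
```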
Confirm capacity headroom before draining so you stay above the healthy quorum, and note the current nvidia.com/mig.config value. That is your rollback target.
Procedure
Per node. Always cordon and drain before mutating MIG geometry. On Ampere, enabling or disabling MIG mode triggers a GPU reset (starting with Hopper, enabling MIG mode no longer requires one), and that reset, like any layout change, is refused while any client is attached.[1]
1. Cordon the node (kubectl cordon) so the scheduler stops placing work.
2. Drain GPU pods with kubectl drain, keeping the Operator DaemonSets running.
3. Clear any remaining GPU clients on the node. A reset / MIG mutation is refused with "In use by another client" while a CUDA app or a stray nvidia-smi holds the device.[1] Confirm nothing is attached before continuing.
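Steps 1 to 3 can be sketched as follows. The drain flags and the 300s timeout are assumptions to tune per cluster, and both function names are hypothetical. The client check only covers compute processes that nvidia-smi reports, so treat a clean result as necessary rather than sufficient.

```shell
# Sketch only: flags and the timeout are assumptions; tune them to your workloads.
# $1 = Kubernetes node name.
cordon_and_drain() {
  kubectl cordon "$1" || return 1
  # --ignore-daemonsets leaves DaemonSet pods (including the Operator's) in place;
  # --delete-emptydir-data lets pods that use emptyDir volumes be evicted.
  kubectl drain "$1" --ignore-daemonsets --delete-emptydir-data --timeout=300s
}

# Reads `nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader`
# output on stdin; succeeds (exit 0) only when no compute process is listed.
no_gpu_clients() {
  ! grep -q '[^[:space:]]'
}

# Usage:
#   cordon_and_drain "$NODE"
#   nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader | no_gpu_clients \
#     || echo "GPU clients still attached"
```

Non-CUDA holders, such as a stray nvidia-smi, do not show up in that query; listing open handles on the device nodes (for example sudo lsof /dev/nvidia*) is one way to look for them.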
Path A: Operator-managed (preferred)
4a. Re-drive the MIG manager by setting the node label to the intended profile. Bounce to all-disabled first only if the state is wedged; otherwise set the target directly. The manager stops nvidia-device-plugin-daemonset, gpu-feature-discovery and nvidia-dcgm-exporter, applies the geometry with mig-parted, then restarts those pods [2]:
# force a clean cycle if mig.config.state is failed/stuck:
kubectl label node "$NODE" nvidia.com/mig.config=all-disabled --overwrite
# then the desired layout (single-strategy example):
kubectl label node "$NODE" nvidia.com/mig.config=all-1g.10gb --overwrite
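Valid label values are the keys of the MIG manager's mig-parted ConfigMap (for example all-disabled, all-1g.10gb, all-balanced, or a custom key like five-1g-one-2g); the label value must exist as a key in that ConfigMap.[2]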
5a. Watch the manager converge to success (it sets nvidia.com/mig.config.state through pending → optionally rebooting → success/failed) [2]:
kubectl get node "$NODE" -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}{"\n"}' -w
kubectl logs -n gpu-operator -l app=nvidia-mig-manager -c nvidia-mig-manager -f
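The -w watch above never exits on its own, which is awkward inside automation. A minimal polling sketch with a bounded wait, assuming the state label only ever ends in success or failed; wait_for_mig_state and its default poll count and interval are hypothetical:

```shell
# Hypothetical helper: poll nvidia.com/mig.config.state until it is terminal.
# $1 = node, $2 = max polls (default 60), $3 = seconds between polls (default 10).
# Returns 0 on success, 1 on failed, 2 on timeout.
wait_for_mig_state() {
  local tries="${2:-60}"
  local pause="${3:-10}"
  local state=""
  while [ "$tries" -gt 0 ]; do
    state="$(kubectl get node "$1" \
      -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}')"
    case "$state" in
      success) echo "$state"; return 0 ;;
      failed)  echo "$state"; return 1 ;;
    esac
    tries=$((tries - 1))
    sleep "$pause"
  done
  echo "timeout (last state: ${state:-unset})"
  return 2
}

# Usage:
#   wait_for_mig_state "$NODE" || echo "MIG manager did not converge; check its logs"
```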
Path B: Manual nvidia-smi mig (no Operator)
4b. Tear down the stale layout: compute instances first, then GPU instances; a GI will not delete while it still owns CIs.[1] To wipe the whole card:
sudo nvidia-smi mig -dci && sudo nvidia-smi mig -dgi
# scope to one GPU/instance instead: -i <gpu> -gi <GI> -ci <CI>
# disable, then re-enable MIG mode on GPU 0 (on Ampere each toggle triggers a GPU reset):
sudo nvidia-smi -i 0 -mig 0
sudo nvidia-smi -i 0 -mig 1 # Hopper+: re-enable on boot; mode is not InfoROM-persistent
5b. Recreate the intended geometry (-C creates the GI and its compute instance so CUDA can see the slice; -cgi alone leaves it unusable) [1]:
sudo nvidia-smi mig -cgi 1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb -C
sudo nvidia-smi mig -lgi # confirm the instances exist
nvidia-smi -L # confirm MIG-<UUID> devices are now enumerated
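Typing a long comma-separated -cgi list by hand invites miscounts. A small sketch that builds the list for a uniform layout; repeat_profile is a hypothetical helper, not part of nvidia-smi:

```shell
# Hypothetical helper: repeat one profile name N times, comma-separated.
# $1 = profile (e.g. 1g.10gb), $2 = count.
repeat_profile() {
  local i=0
  local out=""
  while [ "$i" -lt "$2" ]; do
    out="${out:+$out,}$1"
    i=$((i + 1))
  done
  printf '%s\n' "$out"
}

# Usage (equivalent to the seven-instance command above):
#   sudo nvidia-smi mig -cgi "$(repeat_profile 1g.10gb 7)" -C
```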
6. Force Kubernetes to re-enumerate. The device plugin does not reliably clean up its advertised GPU/MIG resources after an enable/disable or reconfigure. Restart it (and GFD) so it re-reads the geometry [3]:
# GPU Operator deployment:
kubectl rollout restart daemonset -n gpu-operator nvidia-device-plugin-daemonset
kubectl rollout restart daemonset -n gpu-operator gpu-feature-discovery
# standalone k8s-device-plugin (adjust namespace/name to your install):
kubectl rollout restart daemonset -n kube-system nvidia-device-plugin-daemonset
In Path A the MIG manager already restarts these pods; only restart manually if the advertised resources are still stale after mig.config.state=success.
7. Uncordon (kubectl uncordon) once the node advertises the correct resources (see Verification first).
Verification
Prove all three: geometry, advertisement, and a real bind.
1. On-box geometry matches intent. nvidia-smi -L lists the expected MIG <profile> Device N: (UUID: MIG-<...>) lines (one per compute instance), and nvidia-smi mig -lgi shows the right GIs.[1]
2. Kubernetes advertises the matching resource and the label state is success.
3. A test pod actually binds a MIG device, and nvidia-smi -L inside the pod reports a single instance of the requested profile (mixed-strategy example; for single-strategy request nvidia.com/gpu: 1):
kubectl run mig-probe --rm -it --restart=Never --image=nvidia/cuda:12.4.1-base-ubuntu22.04 \
  --overrides='{"spec":{"nodeName":"'"$NODE"'","containers":[{"name":"mig-probe","image":"nvidia/cuda:12.4.1-base-ubuntu22.04","command":["nvidia-smi","-L"],"resources":{"limits":{"nvidia.com/mig-1g.10gb":"1"}}}]}}'
Expect exactly one MIG 1g.10gb Device 0: (UUID: MIG-<...>) line and a clean exit. The pod being admitted and running at all (no Insufficient nvidia.com/mig-<p>, and no kubelet admission rejection, since nodeName bypasses the scheduler and the cordon) is the proof the advertisement is now correct.
4. Optional hardware proof of the slice. dcgmi diag is not supported on a MIG-configured GPU (it needs the whole GPU and fails with "GPU N's MIG configuration is incompatible with the diagnostic because it prevents access to the entire GPU"), so monitor the MIG instance entity instead, or run a CUDA workload inside the test pod (diagnostics tools).[4]
Do not uncordon until 1–3 pass.
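Check 2 lends itself to a scripted gate. The sketch below compares the node's allocatable count for one MIG resource against an expected number and refuses when an Operator state label is present but not success. check_advertisement is a hypothetical name, and the jsonpath key escaping mirrors the label queries used in Pre-checks.

```shell
# Hypothetical gate for check 2. $1 = node, $2 = resource name
# (e.g. nvidia.com/mig-1g.10gb), $3 = expected allocatable count.
check_advertisement() {
  local key state have
  # Escape dots so kubectl's jsonpath treats the resource name as a single key.
  key="$(printf '%s' "$2" | sed 's/\./\\./g')"
  state="$(kubectl get node "$1" \
    -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}')"
  have="$(kubectl get node "$1" -o "jsonpath={.status.allocatable.${key}}")"
  # Nodes without the Operator carry no state label; only a non-success value blocks.
  if [ -n "$state" ] && [ "$state" != "success" ]; then
    echo "mig.config.state=$state"
    return 1
  fi
  if [ "$have" != "$3" ]; then
    echo "$2: allocatable=${have:-0}, expected=$3"
    return 1
  fi
  echo "ok"
}

# Usage:
#   check_advertisement "$NODE" nvidia.com/mig-1g.10gb 7
```

Run kubectl uncordon "$NODE" only after this check and checks 1 and 3 have passed; the gate does not replace the test pod.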
Rollback
Single-variable. The rollback target is the prior nvidia.com/mig.config value you recorded in Pre-checks.
# Operator: restore the previous profile label; the manager re-applies it.
kubectl label node "$NODE" nvidia.com/mig.config=<previous-profile> --overwrite
kubectl get node "$NODE" -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}{"\n"}' -w # -> success
# Manual: re-create the previous geometry, then re-enumerate.
sudo nvidia-smi mig -dci && sudo nvidia-smi mig -dgi
sudo nvidia-smi mig -cgi <previous-profile-list> -C
kubectl rollout restart daemonset -n kube-system nvidia-device-plugin-daemonset # standalone plugin; adjust namespace/name to your install
If a reconfigure repeatedly lands in mig.config.state=failed after clients are cleared, suspect a layout that exceeds the card's slice budget (a large profile leaves no room for the rest; check nvidia-smi mig -lgip) or an ECC/MIG reset blocked by another client; reboot the node to clear state and re-drive the label. If MIG mode itself will not enable, treat it as a driver/firmware issue (GSP firmware, kernel modules) rather than a geometry problem.
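A quick pre-flight can catch the most obvious slice-budget overrun before the label is re-driven. This sketch sums the leading compute-slice count of each profile and compares it to a budget, assuming 7 compute slices per GPU (A100/H100 class; pass a different budget for other cards). It is a necessary condition only: memory slices and placement rules can still reject a layout that passes here, so nvidia-smi mig -lgip remains the authority. fits_compute_slices is a hypothetical name.

```shell
# Hypothetical pre-flight: sum the leading "<N>g" of each profile in a comma-separated list.
# $1 = profile list (e.g. 3g.40gb,4g.40gb), $2 = compute-slice budget (default 7, an assumption).
# Necessary, not sufficient: memory slices and placement are not modelled.
fits_compute_slices() {
  local total=0
  local p n
  for p in $(printf '%s' "$1" | tr ',' ' '); do
    n="${p%%g.*}"            # "3g.40gb" -> "3"
    total=$((total + n))
  done
  [ "$total" -le "${2:-7}" ]
}

# Usage:
#   fits_compute_slices "4g.40gb,4g.40gb" || echo "layout exceeds the compute-slice budget"
```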
Related runbooks
- MIG: partition model, profile tables, nvidia-smi mig lifecycle, Kubernetes single/mixed strategies (the conceptual reference for this runbook).
- ECC toggle recovery: sequence ECC before laying out MIG; both need a reset that attached clients block.
- Persistence mode: keep persistence on so MIG/driver state survives the last client exiting.
- GSP firmware mismatch: when MIG enable fails at the driver/firmware layer, not the geometry.
- Operational runbooks: runbook index.
References
- MIG User Guide: https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html
- Getting started with MIG (-mig 1/0, -cgi/-C/-cci, -dci/-dgi, -lgi/-lci/-lgip, MIG UUIDs, "In use by another client"): https://docs.nvidia.com/datacenter/tesla/mig-user-guide/getting-started-with-mig.html
- GPU Operator with MIG (nvidia.com/mig.config, mig.config.state, mig-manager pod-stop/restart sequence, mig.strategy): https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-mig.html
- NVIDIA k8s-device-plugin (DaemonSet, nvidia.com/mig-<p> resources, MIG strategy): https://github.com/NVIDIA/k8s-device-plugin
- k8s-device-plugin issue #240: device plugin does not clean up advertised GPUs after MIG enable/disable/reconfigure (why a restart is required): https://github.com/NVIDIA/k8s-device-plugin/issues/240
- kubectl drain (safely drain a node): https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/
- DCGM diagnostics (run levels): https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html
- DCGM diag incompatible with MIG-configured GPUs (NVIDIA/DCGM issues #94, #70): https://github.com/NVIDIA/DCGM/issues/94
[1] MIG User Guide, Getting Started: nvidia-smi -i <ids> -mig 1/0, destroy order (-dci before -dgi, scoping with -i/-gi/-ci), -cgi ... -C vs -cgi alone, -lgi / -lci -gi N / -lgip, nvidia-smi -L MIG-UUID format, Hopper non-persistence across reboot (no InfoROM status bit; Ampere is persistent), the Ampere-only GPU reset on MIG-mode toggle, and the "In use by another client" reset refusal. https://docs.nvidia.com/datacenter/tesla/mig-user-guide/getting-started-with-mig.html
[2] NVIDIA GPU Operator with MIG: nvidia.com/mig.config label and --overwrite, all-disabled / all-1g.10gb / all-balanced values, the mig-manager stopping nvidia-device-plugin-daemonset / gpu-feature-discovery / nvidia-dcgm-exporter then applying geometry via mig-parted and restarting them, nvidia.com/mig.config.state (pending/rebooting/success/failed), and mig.strategy single/mixed. https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-mig.html
[3] NVIDIA/k8s-device-plugin issue #240: the device plugin does not reliably clean up advertised GPU/MIG resources after a MIG enable/disable or reconfiguration, so it must be restarted to re-enumerate. https://github.com/NVIDIA/k8s-device-plugin/issues/240
[4] NVIDIA/DCGM issues #94 and #70: dcgmi diag is incompatible with a MIG-configured GPU because it requires access to the entire GPU, failing with "GPU N's MIG configuration is incompatible with the diagnostic because it prevents access to the entire GPU"; monitor the MIG instance entity with dcgmi dmon instead. https://github.com/NVIDIA/DCGM/issues/94