
Runbook: stale MIG state

Scope: a node whose actual MIG geometry (nvidia-smi -L / nvidia-smi mig -lgi) has drifted from what the scheduler believes: pods stuck Pending on a MIG resource, the wrong profile advertised, or instances that survived a reconfigure that Kubernetes never saw. Recover by resetting the geometry under cordon/drain and forcing the device plugin to re-enumerate. Partition mechanics and profile tables live in MIG.

Reference templates, not hardware-tested. Validate every command against the MIG User Guide and your driver / GPU Operator release before running in production.

This is the longform procedure for the stale-MIG failure modes called out in MIG. It assumes the GPU Operator MIG manager owns geometry; the manual nvidia-smi mig path (Path B in the Procedure) applies when no Operator is present.

Trigger

The scheduler sees the wrong MIG instances: kubectl describe node advertises resources that do not match nvidia-smi mig -lgi on the box (e.g. it offers nvidia.com/mig-3g.40gb but the GPU is actually all-1g.10gb, or it offers whole-GPU nvidia.com/gpu after the layout was wiped).
Pods cannot get the requested MIG profile: they stay Pending with Insufficient nvidia.com/mig-<p> even though the node "should" have it.
nvidia-smi -L mismatches Kubernetes: the device plugin enumerated a geometry that no longer exists, typically after a reconfigure, a -mig 0/1 toggle, or a reboot.
On Hopper, MIG mode is not InfoROM-persistent. A rebooted node can come back as one whole GPU while the cluster still expects slices (MIG).¹
nvidia.com/mig.config.state is stuck at pending, rebooting, or failed instead of success (Operator-managed nodes).²

Pre-checks

Read the actual on-box state before touching anything. NODE is the Kubernetes node name.

```
NODE=gpu-07.dc1.internal

# Ground truth on the node (run via your node-access path, e.g. ssh / debug pod):
nvidia-smi -L                       # CUDA-visible devices + MIG-<UUID> per instance
nvidia-smi mig -lgi                 # GPU instances that actually exist
nvidia-smi mig -lci                 # compute instances (add -gi N to scope)
nvidia-smi mig -lgip                # profiles the card can still create (remaining slice budget)

# What Kubernetes believes:
kubectl describe node "$NODE" | grep -E 'nvidia.com/(gpu|mig)|mig.config'
kubectl get node "$NODE" -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config}{"\n"}'
kubectl get node "$NODE" -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}{"\n"}'

# Operator MIG manager state and recent decisions:
kubectl logs -n gpu-operator -l app=nvidia-mig-manager -c nvidia-mig-manager --tail=200
```
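To make the comparison mechanical, both views can be reduced to the same shape and diffed. The following is a minimal sketch, not a tested tool: `onbox_mig_counts` and `advertised_mig_counts` are hypothetical helper names, and the parsing assumes the `MIG <profile> Device N:` line format of nvidia-smi -L and the `nvidia.com/mig-<profile>` resource names of the mixed strategy (under the single strategy the node advertises nvidia.com/gpu instead, so the second helper prints nothing).

```shell
# Sketch: normalise both views to sorted "<profile> <count>" lines so they can be diffed.

# Reads `nvidia-smi -L` output on stdin; counts "MIG <profile> Device N:" lines per profile.
onbox_mig_counts() {
  awk '$1 == "MIG" && $3 == "Device" { n[$2]++ }
       END { for (p in n) print p, n[p] }' | sort
}

# Reads `kubectl describe node` output on stdin; keeps non-zero nvidia.com/mig-* entries
# from the Allocatable section only (Capacity repeats the same keys).
advertised_mig_counts() {
  awk '/^Allocatable:/ { in_alloc = 1; next }
       /^[^ \t]/       { in_alloc = 0 }
       in_alloc && $1 ~ /^nvidia\.com\/mig-/ && $2 + 0 > 0 {
         p = $1; sub(/^nvidia\.com\/mig-/, "", p); sub(/:$/, "", p)
         print p, $2
       }' | sort
}

# Example (bash; ssh stands in for whatever node-access path you use):
# diff <(ssh "$NODE" nvidia-smi -L | onbox_mig_counts) \
#      <(kubectl describe node "$NODE" | advertised_mig_counts)
```

An empty diff means the two views agree on profile counts; any output is the drift to resolve.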

Decide which way the drift goes:

Geometry on box is correct, Kubernetes is stale → you only need to re-enumerate (restart the device plugin / GFD). Skip the destroy steps.
mig.config.state=failed → the manager could not apply the requested layout (often clients still attached, or a profile that does not fit the remaining slice budget). Fix the blocker, then re-drive the label.
Geometry is wrong/partial → reset and recreate (full Procedure below).
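The three branches above can be written down as a small decision function. This is a sketch under stated assumptions: `classify_drift` and its return strings are hypothetical, the geometry arguments are any comparable summaries (for example sorted `<profile> <count>` lines), and checking the failed state first is an ordering choice made here, not something the runbook mandates.

```shell
# classify_drift STATE ONBOX ADVERTISED INTENDED
# STATE is the nvidia.com/mig.config.state label value; the other three are
# comparable summaries of the geometry. Prints the branch to take.
classify_drift() {
  local state="$1" onbox="$2" advertised="$3" intended="$4"
  if [ "$state" = "failed" ]; then
    echo "fix-blocker-then-redrive-label"   # manager could not apply the layout
  elif [ "$onbox" != "$intended" ]; then
    echo "reset-and-recreate"               # geometry on the box is wrong/partial
  elif [ "$advertised" != "$onbox" ]; then
    echo "re-enumerate-only"                # box is right, Kubernetes is stale
  else
    echo "no-drift"
  fi
}
```

For example, `classify_drift success "1g.10gb 7" "3g.40gb 2" "1g.10gb 7"` prints `re-enumerate-only`, because the box already matches the intended layout and only the advertisement is stale.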

Confirm capacity headroom before draining so you stay above the healthy quorum, and note the current nvidia.com/mig.config value. That is your rollback target.

Procedure

Per node. Always cordon/drain before mutating MIG geometry. On Ampere, enabling or disabling MIG mode triggers a GPU reset (starting with Hopper, enabling MIG mode no longer requires one). That reset, like any layout change, is refused while any client is attached.¹

1. Cordon so the scheduler stops placing work:
```
kubectl cordon "$NODE"
```

2. Drain GPU pods (keep the Operator DaemonSets):
```
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --timeout=15m
```

3. Clear any remaining GPU clients on the node. A reset / MIG mutation is refused with "In use by another client" while a CUDA app or a stray nvidia-smi holds the device.¹ Confirm nothing is attached:
```
nvidia-smi                        # the "Processes" table must be empty
sudo fuser -k /dev/nvidia*        # last resort: kill stragglers, then re-check nvidia-smi
```
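After a drain, clients can take a few seconds to exit, so a bounded poll is often more reliable than a single look at the Processes table. The sketch below assumes `nvidia-smi --query-compute-apps=pid --format=csv,noheader` prints one line per compute process on your driver (it does not show every holder of /dev/nvidia*, so treat it as a first check only); `wait_until_empty` is a hypothetical helper name.

```shell
# wait_until_empty TRIES DELAY CMD [ARGS...]
# Re-runs CMD until it succeeds and prints nothing but whitespace.
# Returns 0 once that happens, 1 if TRIES attempts pass first.
# A failing CMD counts as "not empty", so a missing binary never reads as idle.
wait_until_empty() {
  local tries="$1" delay="$2" i=0 out
  shift 2
  while [ "$i" -lt "$tries" ]; do
    if out="$("$@" 2>/dev/null)" &&
       [ -z "$(printf '%s' "$out" | tr -d '[:space:]')" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example (on the node): allow roughly 60 s for the compute-process list to drain.
# wait_until_empty 30 2 nvidia-smi --query-compute-apps=pid --format=csv,noheader \
#   || echo "GPU clients still attached; investigate before any MIG change" >&2
```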

Path A: Operator-managed (preferred)

4a. Re-drive the MIG manager by setting the node label to the intended profile. Bounce to all-disabled first only if the state is wedged; otherwise set the target directly. The manager stops nvidia-device-plugin-daemonset, gpu-feature-discovery and nvidia-dcgm-exporter, applies the geometry with mig-parted, then restarts those pods:²

```
# force a clean cycle if mig.config.state is failed/stuck:
kubectl label node "$NODE" nvidia.com/mig.config=all-disabled --overwrite
# let mig.config.state settle, then set the desired layout (single-strategy example):
kubectl label node "$NODE" nvidia.com/mig.config=all-1g.10gb --overwrite
```

Mixed/custom layouts use a named config from the MIG manager ConfigMap (e.g. all-balanced, or a custom name like five-1g-one-2g); the label value must match a config name defined in that ConfigMap (an entry under mig-configs in its config file).²
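Before applying a custom label it can help to confirm the name is actually defined. The sketch below lists config names from a mig-parted style file; it assumes two-space indentation under a top-level `mig-configs:` key, and the ConfigMap name `default-mig-parted-config` and data key `config.yaml` in the usage comment are assumptions about a default GPU Operator install that you should check against your own. `list_mig_configs` is a hypothetical helper name.

```shell
# Reads a mig-parted style config file on stdin and prints the config names
# defined directly under the top-level "mig-configs:" key.
list_mig_configs() {
  awk '/^mig-configs:/ { in_cfg = 1; next }
       /^[^ #]/        { in_cfg = 0 }
       in_cfg && /^  [A-Za-z0-9._-]+:/ {
         name = $1; sub(/:.*$/, "", name); print name
       }'
}

# Example: fail early if the intended label value is not a defined config.
# kubectl get configmap -n gpu-operator default-mig-parted-config \
#   -o jsonpath='{.data.config\.yaml}' | list_mig_configs | grep -Fx five-1g-one-2g
```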

5a. Watch the manager converge to success (it sets nvidia.com/mig.config.state through pending → optionally rebooting → success/failed):²

```
kubectl get node "$NODE" -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}{"\n"}' -w
kubectl logs -n gpu-operator -l app=nvidia-mig-manager -c nvidia-mig-manager -f
```
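For unattended runs, the watch can be replaced by a bounded poll that turns the state label into an exit code. This is a sketch, with `wait_for_mig_state` as a hypothetical helper; the state-reading command is passed in so the function itself makes no assumption about how the label is fetched. One caveat worth assuming: immediately after relabeling, the label may still read success from the previous layout, so pause briefly or confirm it has left success before trusting the result.

```shell
# wait_for_mig_state TRIES DELAY CMD [ARGS...]
# CMD must print the current nvidia.com/mig.config.state value.
# Returns 0 on "success", 1 on "failed", 2 if TRIES polls pass without either.
wait_for_mig_state() {
  local tries="$1" delay="$2" i=0 state=""
  shift 2
  while [ "$i" -lt "$tries" ]; do
    state="$("$@" 2>/dev/null)"
    case "$state" in
      success) return 0 ;;
      failed)  echo "mig.config.state=failed" >&2; return 1 ;;
    esac
    i=$((i + 1))
    sleep "$delay"
  done
  echo "timed out; last state: ${state:-unknown}" >&2
  return 2
}

# Example: poll every 10 s for up to 15 minutes.
# wait_for_mig_state 90 10 kubectl get node "$NODE" \
#   -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}'
```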

Path B: Manual `nvidia-smi mig` (no Operator)

4b. Tear down the stale layout: compute instances first, then GPU instances; a GI will not delete while it still owns CIs.¹ To wipe the whole card:

```
sudo nvidia-smi mig -dci && sudo nvidia-smi mig -dgi
# scope to one GPU/instance instead: -i <gpu> -gi <GI> -ci <CI>
```

If the layout itself is corrupt (or you are changing ECC or MIG state on Ampere), toggle MIG mode. On Ampere this resets the GPU:
```
sudo nvidia-smi -i 0 -mig 0
sudo nvidia-smi -i 0 -mig 1     # Hopper+: re-enable on boot; mode is not InfoROM-persistent
```

5b. Recreate the intended geometry (-C creates the GI and its compute instance so CUDA can see the slice; -cgi alone leaves it unusable):¹

```
sudo nvidia-smi mig -cgi 1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb -C
sudo nvidia-smi mig -lgi        # confirm the instances exist
nvidia-smi -L                   # confirm MIG-<UUID> devices are now enumerated
```
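Typing the same profile seven times invites miscounts, so the comma-separated argument for -cgi can be generated instead. A minimal sketch follows; `profile_list` is a hypothetical helper, and whether a given count fits the card is still decided by the driver (check nvidia-smi mig -lgip), not by this function.

```shell
# profile_list PROFILE COUNT
# Prints PROFILE repeated COUNT times, comma-separated, for use with -cgi.
profile_list() {
  local profile="$1" count="$2" out="" i=0
  while [ "$i" -lt "$count" ]; do
    out="${out:+$out,}$profile"   # add a comma only between entries
    i=$((i + 1))
  done
  printf '%s\n' "$out"
}

# Example: the all-1g.10gb layout used above.
# sudo nvidia-smi mig -cgi "$(profile_list 1g.10gb 7)" -C
```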

6. Force Kubernetes to re-enumerate. The device plugin does not reliably clean up its advertised GPU/MIG resources after an enable/disable or reconfigure. Restart it (and GFD) so it re-reads the geometry:³
```
# GPU Operator deployment:
kubectl rollout restart daemonset -n gpu-operator nvidia-device-plugin-daemonset
kubectl rollout restart daemonset -n gpu-operator gpu-feature-discovery
# standalone k8s-device-plugin (adjust namespace/name to your install):
kubectl rollout restart daemonset -n kube-system nvidia-device-plugin-daemonset
```

In Path A the MIG manager already restarts these pods; only restart manually if the advertised resources are still stale after mig.config.state=success.

7. Uncordon once the node advertises the correct resources (see Verification first):
```
kubectl uncordon "$NODE"
```

Verification

Prove all three: geometry, advertisement, and a real bind.

1. On-box geometry matches intent. nvidia-smi -L lists the expected MIG <profile> Device N: (UUID: MIG-<...>) lines (one per compute instance), and nvidia-smi mig -lgi shows the right GIs:¹
```
nvidia-smi -L
nvidia-smi mig -lgi
```

2. Kubernetes advertises the matching resource and the label state is success:
```
kubectl get node "$NODE" -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}{"\n"}'   # success
kubectl describe node "$NODE" | grep -E 'nvidia.com/(gpu|mig)'   # Capacity/Allocatable match the layout
```

3. A test pod actually binds a MIG device and nvidia-smi -L inside the pod reports a single instance of the requested profile (mixed-strategy example; for single-strategy request nvidia.com/gpu: 1):
```
kubectl run mig-probe --rm -it --restart=Never --image=nvidia/cuda:12.4.1-base-ubuntu22.04 \
  --overrides='{"spec":{"nodeName":"'"$NODE"'","containers":[{"name":"mig-probe","image":"nvidia/cuda:12.4.1-base-ubuntu22.04","command":["nvidia-smi","-L"],"resources":{"limits":{"nvidia.com/mig-1g.10gb":"1"}}}]}}'
```
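Expect exactly one MIG 1g.10gb Device 0: (UUID: MIG-<...>) line and a clean exit. Because the override pins spec.nodeName, the pod bypasses the scheduler (which is why it can start on a still-cordoned node), so a clean run proves the kubelet and device plugin can allocate the slice. The scheduler's view is covered by the Capacity/Allocatable check above.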
4. Optional hardware proof of the slice: dcgmi diag is not supported on a MIG-configured GPU. It needs the whole GPU and fails with "GPU N's MIG configuration is incompatible with the diagnostic because it prevents access to the entire GPU". Monitor the MIG instance entity instead, or run a CUDA workload inside the test pod (diagnostics tools):⁴
```
dcgmi dmon -i i:<gi> -e 203,252     # GPU util (203) + FB used MB (252) on one MIG instance (i:<entityId>)
```

Do not uncordon until 1–3 pass.
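The "exactly one device of the requested profile" check on the probe output can be automated rather than read by eye. The sketch below assumes the probe prints standard nvidia-smi -L lines; `check_probe_output` is a hypothetical helper, and any extra lines kubectl adds (such as the pod-deleted message) are ignored.

```shell
# check_probe_output PROFILE
# Reads the probe pod's `nvidia-smi -L` output on stdin. Succeeds only if
# exactly one MIG device is listed and it has the requested profile.
check_probe_output() {
  local want="$1"
  awk -v want="$want" '
    $1 == "MIG" && $3 == "Device" { total++; if ($2 == want) hits++ }
    END { exit !(total == 1 && hits == 1) }'
}

# Example: capture the probe output to a file, then gate the uncordon on it.
# check_probe_output 1g.10gb < mig-probe.out && kubectl uncordon "$NODE"
```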

Rollback

Single-variable. The rollback target is the prior nvidia.com/mig.config value you recorded in Pre-checks.

```
# Operator: restore the previous profile label; the manager re-applies it.
kubectl label node "$NODE" nvidia.com/mig.config=<previous-profile> --overwrite
kubectl get node "$NODE" -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}{"\n"}' -w  # -> success

# Manual (no Operator): re-create the previous geometry, then re-enumerate.
sudo nvidia-smi mig -dci && sudo nvidia-smi mig -dgi
sudo nvidia-smi mig -cgi <previous-profile-list> -C
# standalone k8s-device-plugin (adjust namespace/name to your install):
kubectl rollout restart daemonset -n kube-system nvidia-device-plugin-daemonset
```

If a reconfigure repeatedly lands in mig.config.state=failed after clients are cleared, suspect a layout that exceeds the card's slice budget (a large profile leaves no room for the rest; check nvidia-smi mig -lgip) or an ECC/MIG reset blocked by another client; reboot the node to clear state and re-drive the label. If MIG mode itself will not enable, treat it as a driver/firmware issue (GSP firmware, kernel modules) rather than a geometry problem.

MIG: partition model, profile tables, nvidia-smi mig lifecycle, Kubernetes single/mixed strategies (the conceptual reference for this runbook).
ECC toggle recovery: sequence ECC before laying out MIG; both need a reset that attached clients block.
Persistence mode: keep persistence on so MIG/driver state survives the last client exiting.
GSP firmware mismatch: when MIG enable fails at the driver/firmware layer, not the geometry.
Operational runbooks: runbook index.

References

MIG User Guide: https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html
Getting started with MIG (-mig 1/0, -cgi/-C/-cci, -dci/-dgi, -lgi/-lci/-lgip, MIG UUIDs, In use by another client): https://docs.nvidia.com/datacenter/tesla/mig-user-guide/getting-started-with-mig.html
GPU Operator with MIG (nvidia.com/mig.config, mig.config.state, mig-manager pod-stop/restart sequence, mig.strategy): https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-mig.html
NVIDIA k8s-device-plugin (DaemonSet, nvidia.com/mig-<p> resources, MIG strategy): https://github.com/NVIDIA/k8s-device-plugin
k8s-device-plugin issue #240 (device plugin does not clean up advertised GPUs after MIG enable/disable/reconfigure; why a restart is required): https://github.com/NVIDIA/k8s-device-plugin/issues/240
kubectl drain (safely drain a node): https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/
DCGM diagnostics (run levels): https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html
DCGM diag incompatible with MIG-configured GPUs (NVIDIA/DCGM issues #94, #70): https://github.com/NVIDIA/DCGM/issues/94


¹ MIG User Guide, Getting Started: nvidia-smi -i <ids> -mig 1/0, destroy order (-dci before -dgi, scoping with -i/-gi/-ci), -cgi ... -C vs -cgi alone, -lgi/-lci -gi N/-lgip, nvidia-smi -L MIG-UUID format, Hopper non-persistence across reboot (no InfoROM status bit; Ampere is persistent), the Ampere-only GPU reset on MIG-mode toggle, and the "In use by another client" reset refusal. https://docs.nvidia.com/datacenter/tesla/mig-user-guide/getting-started-with-mig.html
² NVIDIA GPU Operator with MIG: nvidia.com/mig.config label and --overwrite, all-disabled/all-1g.10gb/all-balanced values, the mig-manager stopping nvidia-device-plugin-daemonset/gpu-feature-discovery/nvidia-dcgm-exporter then applying geometry via mig-parted and restarting them, nvidia.com/mig.config.state (pending/rebooting/success/failed), and mig.strategy single/mixed. https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-mig.html
³ NVIDIA/k8s-device-plugin issue #240: the device plugin does not reliably clean up advertised GPU/MIG resources after a MIG enable/disable or reconfiguration, so it must be restarted to re-enumerate. https://github.com/NVIDIA/k8s-device-plugin/issues/240
⁴ NVIDIA/DCGM issues #94 and #70: dcgmi diag is incompatible with a MIG-configured GPU because it requires access to the entire GPU, failing with "GPU N's MIG configuration is incompatible with the diagnostic because it prevents access to the entire GPU"; monitor the MIG instance entity with dcgmi dmon instead. https://github.com/NVIDIA/DCGM/issues/94