Manifest: MIG mode (single & mixed)

Scope: driving MIG declaratively through the GPU Operator's mig-manager. Covers the nvidia.com/mig.config node label, the default-mig-parted-config ConfigMap, mig.strategy single vs mixed, and the pod resource-request syntax that follows from the strategy (nvidia.com/gpu under single, nvidia.com/mig-1g.10gb under mixed). Verification: nvidia-smi -L shows the new geometry after relabel, and a pod binds exactly one MIG instance. This is the cluster-side counterpart to the host-driven nvidia-smi mig lifecycle; part of Kubernetes & Helm: GPU Platform.

Reference templates from the GPU Operator Helm chart and the upstream mig-parted config. Pin chart and image versions; apply via GitOps (SRE and MLOps practices), not by hand. Not hardware-tested. Validate every label, profile name, and helm value against the GPU Operator MIG docs and your driver release before production.

What it is

The GPU Operator ships a mig-manager DaemonSet that watches one node label and applies the matching MIG geometry to every GPU on that node. You never SSH in to run nvidia-smi mig: you label the node and the manager reconciles.¹

Two moving parts:

- nvidia.com/mig.config is the label you set. Its value is the name of a config defined in the mig-parted ConfigMap (e.g. all-1g.10gb, all-balanced, all-disabled). The manager reads the named config, enables MIG mode if needed, and creates the GPU instances.¹
- nvidia.com/mig.config.state is the label the manager writes back: pending while reconfiguring, success when applied, failed on error. This is your status signal.¹

How the resource is then exposed to schedulers is a separate axis: the strategy, set once cluster-wide in the ClusterPolicy:

- single: all GPUs on a node carry the same profile; MIG slices are advertised as plain nvidia.com/gpu. Pod specs are unchanged; one "GPU" just means one slice. This is what the platform hub installs (mig.strategy=single).⁴
- mixed: each profile is advertised as its own extended resource, nvidia.com/mig-<slices>g.<mem>gb (e.g. nvidia.com/mig-1g.10gb), requested explicitly. Required when a node runs heterogeneous profiles or mixes MIG and whole GPUs.⁴

The label drives geometry; the strategy drives request syntax. They are configured independently.
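The strategy-to-request mapping described above can be sketched as a small shell helper. This is an illustration only: mig_resource_name is a hypothetical function name, not part of the GPU Operator, and it assumes the profile string is passed exactly as it appears in the mig-parted config (e.g. 1g.10gb).

```shell
# mig_resource_name STRATEGY PROFILE
# Print the extended-resource name a pod should request under each strategy.
# Hypothetical helper for illustration; not shipped by the GPU Operator.
mig_resource_name() {
  strategy=$1
  profile=$2
  case "$strategy" in
    single) printf 'nvidia.com/gpu\n' ;;                   # slices look like plain GPUs
    mixed)  printf 'nvidia.com/mig-%s\n' "$profile" ;;     # one resource per profile
    *)      printf 'unknown strategy: %s\n' "$strategy" >&2; return 1 ;;
  esac
}
```

For example, mig_resource_name mixed 1g.10gb prints nvidia.com/mig-1g.10gb, while mig_resource_name single 1g.10gb prints nvidia.com/gpu regardless of the profile.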

flowchart TB
  LABEL["kubectl label node: nvidia.com/mig.config=all-1g.10gb"] --> MGR["mig-manager DaemonSet reconciles"]
  CM["ConfigMap: default-mig-parted-config (version v1, mig-configs)"] --> MGR
  MGR --> EVICT["Stop GPU pods, enable MIG mode, create GIs"]
  EVICT --> STATE["Writes nvidia.com/mig.config.state=success"]
  STATE --> GFD["GFD relabels node with MIG geometry"]
  GFD --> SINGLE["strategy single: advertise nvidia.com/gpu"]
  GFD --> MIXED["strategy mixed: advertise nvidia.com/mig-1g.10gb"]
  SINGLE --> POD["Pod binds one MIG instance"]
  MIXED --> POD

Prerequisites

- A MIG-capable GPU (A30, A100, H100, H200, B200, RTX PRO 6000/5000/4500 Blackwell) on the node. Consumer GeForce is excluded (MIG silicon table).⁵
- The GPU Operator installed with migManager.enabled=true (the chart default). The manager, device plugin, and GPU Feature Discovery (GFD) must be running.¹
- A driver that supports MIG (datacenter driver; driver by tier). On Ampere, enabling MIG triggers a GPU reset; on Hopper/Blackwell it does not, but the mode is not InfoROM-persistent and the manager re-establishes it (MIG lifecycle).¹
- No user GPU workloads on the node being reconfigured. The mig-manager requires the GPUs idle and stops all GPU pods before changing MIG mode or geometry; cordon/drain first in production.¹
- Decide the strategy once (single or mixed); it is cluster-wide in the ClusterPolicy, not per node.⁴

Install

The strategy lives in the ClusterPolicy; set it at install on the GPU Operator Helm release (the hub already does this with mig.strategy=single):¹

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm upgrade --install -n gpu-operator --create-namespace gpu-operator nvidia/gpu-operator \
  --version <pinned> \
  --set migManager.enabled=true \
  --set mig.strategy=single          # or mixed; see Configuration

mig-manager defaults: image nvcr.io/nvidia/cloud-native/k8s-mig-manager:v0.14.2, migManager.config.default="all-disabled", migManager.config.name="" (empty -> the operator-supplied default-mig-parted-config is used).² Switch strategy live by patching the ClusterPolicy instead of re-running Helm:¹

kubectl patch clusterpolicies.nvidia.com/cluster-policy --type='json' \
  -p='[{"op":"replace","path":"/spec/mig/strategy","value":"mixed"}]'

The mig-parted ConfigMap

The label value must name an entry in the mig-parted config. The operator's built-in default-mig-parted-config already defines the common ones (all-disabled, all-enabled, all-1g.10gb, all-2g.10gb, all-3g.20gb, all-balanced, plus per-SKU variants). To add a custom geometry, ship your own ConfigMap and point migManager.config.name at it. The structure is version: v1 followed by mig-configs, a map from config name to a list of device selectors:¹³

# Custom mig-parted config. version and mig-devices keys are exact; profile
# names (1g.10gb) must be valid for the GPU. Reference template.
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-mig-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-disabled:
        - devices: all
          mig-enabled: false
      all-1g.10gb:                    # H100-80GB / A100-80GB: seven 1g.10gb slices
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.10gb": 7
      mixed-1g-3g:                    # heterogeneous -> requires strategy=mixed
        - devices: all
          mig-enabled: true
          mig-devices:                # 4x1g + 1x3g = 8 of 8 memory, 7 of 7 compute slices -> fits
            "1g.10gb": 4
            "3g.40gb": 1

Reference it on the release:

helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator --reuse-values \
  --set migManager.config.name=custom-mig-config \
  --set migManager.config.default=all-disabled
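Before shipping a custom entry such as mixed-1g-3g, the slice arithmetic in its comment can be checked mechanically. The sketch below is a minimal budget check under stated assumptions: it covers only 80GB A100/H100 cards (7 compute slices, 8 memory slices), the function names profile_slices and geometry_fits are hypothetical, and fitting the budget is necessary but not sufficient, since MIG also restricts where each profile may be placed. mig-parted and the driver remain the authority when the config is applied.

```shell
# profile_slices PROFILE -> prints "<compute> <memory>" slice cost.
# Table assumes an 80GB A100/H100; other SKUs use different profile names.
profile_slices() {
  case "$1" in
    1g.10gb) echo "1 1" ;;
    1g.20gb) echo "1 2" ;;
    2g.20gb) echo "2 2" ;;
    3g.40gb) echo "3 4" ;;
    4g.40gb) echo "4 4" ;;
    7g.80gb) echo "7 8" ;;
    *) return 1 ;;                      # unknown profile for this table
  esac
}

# geometry_fits "PROFILE=COUNT ..." -> exit 0 if totals fit one GPU,
# exit 1 if over budget, exit 2 if a profile is not in the table.
geometry_fits() {
  compute=0
  memory=0
  for item in $1; do
    profile=${item%%=*}
    count=${item##*=}
    cost=$(profile_slices "$profile") || return 2
    compute=$((compute + ${cost% *} * count))
    memory=$((memory + ${cost#* } * count))
  done
  [ "$compute" -le 7 ] && [ "$memory" -le 8 ]
}
```

Usage: geometry_fits "1g.10gb=4 3g.40gb=1" succeeds (7 compute, 8 memory), while geometry_fits "3g.40gb=2 1g.10gb=1" fails because it would need 9 memory slices.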

Configuration

| Key / field | Where | Example | Meaning |
| --- | --- | --- | --- |
| `mig.strategy` | helm / `ClusterPolicy.spec.mig.strategy` | `single` \| `mixed` | How MIG slices are advertised: `single` -> `nvidia.com/gpu`; `mixed` -> `nvidia.com/mig-<profile>`.⁴ |
| `migManager.enabled` | helm | `true` | Run the `mig-manager` DaemonSet (chart default).² |
| `migManager.config.name` | helm | `custom-mig-config` | ConfigMap holding `mig-parted` configs; empty uses `default-mig-parted-config`.² |
| `migManager.config.default` | helm | `all-disabled` | Config applied to a node with no `nvidia.com/mig.config` label.² |
| `nvidia.com/mig.config` | node label | `all-1g.10gb` | Names the `mig-parted` config to apply on this node.¹ |
| `nvidia.com/mig.config.state` | node label (read) | `success` | Reconcile status: `pending` / `success` / `failed`.¹ |
| `WITH_REBOOT` | `migManager.env[0]` | `"true"` | Reboot the node if a reboot is needed to enable MIG (some CSPs).¹ |
| `version` | ConfigMap `config.yaml` | `v1` | `mig-parted` schema version.³ |
| `mig-enabled` | per `mig-configs` entry | `true` | Turn MIG mode on for matched devices.³ |
| `mig-devices` | per `mig-configs` entry | `{"1g.10gb": 7}` | Profile -> instance count to create.³ |

WITH_REBOOT (only if your platform needs a reboot to enter MIG mode):

helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator --reuse-values \
  --set 'migManager.env[0].name=WITH_REBOOT' --set-string 'migManager.env[0].value=true'   # quoted so the shell does not glob the brackets

Apply & verify

1. Label the node with the desired geometry (matches the hub's example):¹

kubectl label node <gpu-node> nvidia.com/mig.config=all-1g.10gb --overwrite

2. Watch the manager reconcile. Expected signal: state goes pending then success. If it sticks on pending or flips to failed, GPU pods are likely still running on the node.¹

kubectl get node <gpu-node> -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}{"\n"}'
# expect: success
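In automation, the one-shot query above is usually wrapped in a poll. A minimal sketch, assuming a 600-second default timeout and a 5-second poll interval (both arbitrary choices, not values from the GPU Operator); classify_mig_state and wait_mig_state are hypothetical helper names.

```shell
# classify_mig_state STATE -> 0 = applied, 1 = failed, 2 = still in progress.
classify_mig_state() {
  case "$1" in
    success) return 0 ;;
    failed)  return 1 ;;
    *)       return 2 ;;   # pending, empty, or any other in-between value
  esac
}

# wait_mig_state NODE [TIMEOUT_SECONDS]
# Poll nvidia.com/mig.config.state until it settles or the timeout expires.
# Timeout (600s) and interval (5s) are assumptions; tune for your nodes.
wait_mig_state() {
  node=$1
  deadline=$(( $(date +%s) + ${2:-600} ))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    state=$(kubectl get node "$node" \
      -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}') || state=""
    rc=0
    classify_mig_state "$state" || rc=$?
    if [ "$rc" -ne 2 ]; then
      return "$rc"         # settled: 0 on success, 1 on failed
    fi
    sleep 5
  done
  echo "timed out waiting for mig.config.state on $node" >&2
  return 2
}
```

Usage: wait_mig_state <gpu-node> 900 || echo "reconcile did not succeed". A return of 1 or 2 is the cue to check for GPU pods still running on the node.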

3. Confirm the geometry on the GPU. Check the mig-manager logs for errors, then run nvidia-smi -L in the driver pod; it must list one MIG <profile> Device N line per instance. For all-1g.10gb on an 80GB card, that is seven MIG 1g.10gb lines per GPU:¹

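# ds/... execs into an arbitrary driver pod; on a multi-node cluster, target the driver pod running on <gpu-node>
kubectl exec -n gpu-operator ds/nvidia-driver-daemonset -- nvidia-smi -L
# MIG 1g.10gb  Device 0: (UUID: MIG-...)
# ... seven lines per GPU on H100-80GB / A100-80GB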

If nvidia-smi -L shows the parent GPU but no MIG devices, the geometry was not applied; recheck that the label value names a real config entry and that the state is success (MIG state runbook).
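Counting the MIG lines by eye gets error-prone on multi-GPU nodes. The sketch below counts them from nvidia-smi -L output on stdin; count_mig_devices is a hypothetical helper, and it assumes each MIG device is printed on its own line whose first two fields are MIG and the profile name, as in the expected output above.

```shell
# count_mig_devices [PROFILE]
# Read `nvidia-smi -L` output on stdin and print the number of MIG device
# lines, optionally restricted to one profile (e.g. 1g.10gb).
count_mig_devices() {
  awk -v p="${1:-}" '
    $1 == "MIG" && (p == "" || $2 == p) { n++ }
    END { print n + 0 }               # print 0 rather than an empty string
  '
}
```

Usage: kubectl exec -n gpu-operator ds/nvidia-driver-daemonset -- nvidia-smi -L | count_mig_devices 1g.10gb. Under all-1g.10gb the result should be seven times the number of 80GB GPUs on the node; 0 means the geometry was not applied.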

4. Confirm the advertised resource. Under single the node still reports nvidia.com/gpu (count = number of slices); under mixed it reports nvidia.com/mig-1g.10gb:⁴

kubectl get node <gpu-node> -o jsonpath='{.status.allocatable}{"\n"}' | tr ',' '\n' | grep nvidia.com
# single -> "nvidia.com/gpu":"7"            (7 per 80GB GPU under all-1g.10gb)
# mixed  -> "nvidia.com/mig-1g.10gb":"7"    (7 per 80GB GPU under all-1g.10gb)
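To read a single count instead of grepping the whole allocatable map, the resource name can be turned into a JSONPath expression with its dots escaped, the same escaping the state-label query in step 2 uses. A minimal sketch; allocatable_jsonpath is a hypothetical helper name.

```shell
# allocatable_jsonpath RESOURCE
# Print a kubectl JSONPath expression selecting one allocatable resource.
# Dots inside the key are backslash-escaped so kubectl treats it as one key.
allocatable_jsonpath() {
  escaped=$(printf '%s' "$1" | sed 's/\./\\./g')
  printf '{.status.allocatable.%s}\n' "$escaped"
}
```

Usage: kubectl get node <gpu-node> -o jsonpath="$(allocatable_jsonpath nvidia.com/mig-1g.10gb)". An empty result means the resource is not advertised on that node.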

5. Bind a pod to one MIG instance. Request syntax depends on the strategy.

Under single (request looks identical to a whole GPU; you get one slice):

apiVersion: v1
kind: Pod
metadata:
  name: mig-smoke-single
  namespace: gpu-operator
spec:
  restartPolicy: Never
  containers:
    - name: smi
      image: nvcr.io/nvidia/cuda:13.0.0-base-ubuntu24.04
      command: ["nvidia-smi", "-L"]
      resources:
        limits:
          nvidia.com/gpu: 1            # single strategy -> one MIG slice

Under mixed (request the profile explicitly):

apiVersion: v1
kind: Pod
metadata:
  name: mig-smoke-mixed
  namespace: gpu-operator
spec:
  restartPolicy: Never
  containers:
    - name: smi
      image: nvcr.io/nvidia/cuda:13.0.0-base-ubuntu24.04
      command: ["nvidia-smi", "-L"]
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1    # mixed strategy -> name the profile
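The two manifests differ only in the pod name and the resource key, so a test harness can render either from one template. A minimal sketch, assuming the same image and namespace as the manifests above; render_smoke_pod is a hypothetical helper name.

```shell
# render_smoke_pod NAME RESOURCE
# Print a smoke-test Pod manifest that requests one unit of RESOURCE.
# Image and namespace are copied from the reference manifests; pin your own.
render_smoke_pod() {
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: $1
  namespace: gpu-operator
spec:
  restartPolicy: Never
  containers:
    - name: smi
      image: nvcr.io/nvidia/cuda:13.0.0-base-ubuntu24.04
      command: ["nvidia-smi", "-L"]
      resources:
        limits:
          $2: 1
EOF
}
```

Usage: render_smoke_pod mig-smoke-mixed nvidia.com/mig-1g.10gb | kubectl apply -f -. Under single, pass nvidia.com/gpu as the second argument instead.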

Apply and read the logs. Expected signal: the pod is scheduled, runs to completion, and nvidia-smi -L inside it prints exactly one MIG 1g.10gb device, which proves it bound a single instance rather than the whole card:¹

kubectl apply -f mig-smoke-single.yaml
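# The container exits right after nvidia-smi -L, so wait for completion rather than Ready (needs kubectl >= 1.23)
kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/mig-smoke-single -n gpu-operator --timeout=120s || true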
kubectl logs mig-smoke-single -n gpu-operator
# MIG 1g.10gb  Device 0: (UUID: MIG-...)   <- exactly one MIG device

A pod stuck Pending with 0/N nodes are available: ... Insufficient nvidia.com/... means the requested resource is not advertised in sufficient quantity: the strategy and request syntax are mismatched, the geometry never applied, or every slice is already allocated.

Failure modes

- Pod requests nvidia.com/gpu but cluster is mixed (or requests nvidia.com/mig-1g.10gb under single). The resource is not advertised on the MIG node, so the pod stays Pending forever. Match request syntax to the strategy.⁴
- mig.config.state=failed / stuck pending. Almost always GPU pods still running on the node; the mig-manager will not reconfigure while GPUs are in use. Cordon and drain (or evict GPU pods) first.¹
- Label value names a config that does not exist in the mig-parted ConfigMap; reconcile fails. The value must match an entry under mig-configs (e.g. all-1g.10gb), not a bare profile like 1g.10gb.¹
- Profile invalid for the GPU. E.g. an A100-40GB cannot host 1g.10gb×7. Use a config whose mig-devices are valid for that SKU (profiles by GPU).³
- Geometry needs a reboot but WITH_REBOOT is unset. Node never enters MIG mode on some CSPs. Set WITH_REBOOT=true on the manager.¹
- MIG layout gone after reboot (Hopper/Blackwell). Mode is not InfoROM-persistent; the manager re-establishes it from the label, but a missing/changed label leaves the node as one whole GPU and pods Pending (MIG state runbook).¹
- Expecting NVLink/P2P between MIG slices. There is none; multi-GPU collectives and model parallelism run on whole GPUs, not slices (NVSwitch/NVLink).
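The missing-config failure above can be caught before labeling by checking that the intended value is a key under mig-configs. The sketch below is a plain text match, not a YAML parser: has_mig_config is a hypothetical helper, and it assumes the config is formatted like the templates in this page (mig-configs at column 0, entry names indented two spaces).

```shell
# has_mig_config NAME
# Read a mig-parted config.yaml on stdin; exit 0 if NAME is a key directly
# under mig-configs, exit 1 otherwise. Assumes two-space indentation.
has_mig_config() {
  awk -v name="$1" '
    /^mig-configs:/ { inside = 1; next }
    inside && /^[^[:space:]#]/ { inside = 0 }   # a new top-level key ends the section
    inside {
      line = $0
      sub(/[[:space:]]*#.*$/, "", line)         # drop trailing comments
      if (line == "  " name ":") found = 1
    }
    END { exit(found ? 0 : 1) }
  '
}
```

Usage, assuming the operator's ConfigMap stores the config under the config.yaml key as the custom template does: kubectl get configmap default-mig-parted-config -n gpu-operator -o jsonpath='{.data.config\.yaml}' | has_mig_config all-1g.10gb && kubectl label node <gpu-node> nvidia.com/mig.config=all-1g.10gb --overwrite. A bare profile such as 1g.10gb is rejected because it is not an entry name.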

References

- GPU Operator, MIG support (mig.config label, mig-manager, strategies, WITH_REBOOT): https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-mig.html
- GPU Operator Helm values (mig.strategy, migManager.*, image k8s-mig-manager): https://github.com/NVIDIA/gpu-operator/blob/main/deployments/gpu-operator/values.yaml
- default-mig-parted-config ConfigMap (real mig-configs entries): https://github.com/NVIDIA/gpu-operator/blob/main/assets/state-mig-manager/0400_configmap.yaml
- mig-parted config format (version: v1, mig-configs, mig-devices): https://github.com/NVIDIA/mig-parted
- MIG support in Kubernetes (single vs mixed, nvidia.com/mig-<profile>): https://docs.nvidia.com/datacenter/cloud-native/kubernetes/latest/index.html
- MIG User Guide (profiles, nvidia-smi -L, supported GPUs): https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html

1. GPU Operator, MIG Support: nvidia.com/mig.config label, nvidia.com/mig.config.state (pending/success/failed), mig-manager stops GPU pods and requires idle GPUs before reconfiguring, ConfigMap-named configs, WITH_REBOOT, kubectl patch of clusterpolicies.nvidia.com strategy. https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-mig.html
2. GPU Operator Helm values.yaml: mig.strategy: single, migManager.enabled: true, migManager.image: k8s-mig-manager, migManager.version: v0.14.2, migManager.config.default: "all-disabled", migManager.config.name: "". https://github.com/NVIDIA/gpu-operator/blob/main/deployments/gpu-operator/values.yaml
3. mig-parted config schema and the operator's default-mig-parted-config: version: v1, mig-configs, per-entry devices/mig-enabled/mig-devices (e.g. "1g.10gb": 7). https://github.com/NVIDIA/gpu-operator/blob/main/assets/state-mig-manager/0400_configmap.yaml
4. MIG in Kubernetes: single advertises nvidia.com/gpu, mixed advertises nvidia.com/mig-<slices>g.<mem>gb; strategy is cluster-wide. https://docs.nvidia.com/datacenter/cloud-native/kubernetes/latest/index.html
5. MIG User Guide, Supported GPUs: A30, A100, H100, H200, B200, RTX PRO 6000/5000/4500 Blackwell; GeForce excluded. https://docs.nvidia.com/datacenter/tesla/mig-user-guide/supported-gpus.html