Ansible role: mig

Scope: enable MIG mode and lay out the requested profile (or comma-separated list of profiles) on each GPU via nvidia-smi mig, wrapped so the role is idempotent. It reads current state (nvidia-smi --query-gpu=mig.mode.current, nvidia-smi -L) before mutating, gates on node readiness, and converges to mig_profile without re-cutting an already-correct geometry. This is the optional mig role in the bring-up site.yml, run after nvidia_stack. Partition mechanics, profile tables, and the nvidia-smi mig lifecycle live in MIG.

Reference template, not hardware-tested. Validate every command and profile name against the MIG User Guide and your driver / GPU SKU before a fleet roll. Run on one node first.

flowchart LR
  GATE["Gate: mig_enabled and datacenter tier and GPUs visible"] --> MODE["nvidia-smi --query-gpu=mig.mode.current"]
  MODE -->|"Disabled"| ENABLE["nvidia-smi -i ID -mig 1"]
  ENABLE -->|"pending=Enabled, Ampere"| REBOOT["notify reboot node"]
  MODE -->|"Enabled"| HAVE["nvidia-smi -L"]
  ENABLE --> HAVE
  HAVE -->|"profile missing on GPU ID"| CREATE["nvidia-smi mig -i ID -cgi mig_profile -C"]
  HAVE -->|"profile present"| OK["no change"]
  CREATE --> VERIFY["nvidia-smi -L lists MIG-UUID devices"]

What it does

Takes a node whose driver stack is already up (the nvidia_stack role ran, nvidia-smi enumerates GPUs) and brings the requested MIG geometry into existence, idempotently:

Gate. Run only when mig_enabled | bool and the node is gpu_tier == 'datacenter' (MIG is a datacenter / RTX PRO capability, not consumer GeForce); otherwise every task is skipped and the role is a no-op. If the gate passes but no GPU is visible, the readiness assert fails before anything is mutated, leaving no partial state.
Read mode. Query mig.mode.current per GPU. Enable MIG mode (nvidia-smi -i <id> -mig 1) only where it reads Disabled, so a re-run on an already-enabled node changes nothing.
Handle the reset. On Ampere enabling MIG triggers a GPU reset and the mode is persistent across reboots (InfoROM status bit); if the reset cannot complete in-band the change lands in mig.mode.pending=Enabled and a reboot applies it. On Hopper and newer no reset is needed, but the mode is not reboot-persistent, so the role must re-enable on every boot (re-run site.yml after reboot, or front it with the stale-MIG runbook).
Read geometry. Parse nvidia-smi -i <id> -L per target GPU. Create instances (nvidia-smi mig -i <id> -cgi <mig_profile> -C) only where the requested profile is not already present, so the create path is skipped on a converged GPU.
Verify. Confirm nvidia-smi -L enumerates MIG-<UUID> devices for the requested profile.

The role deliberately does not tear down and recreate on every run; reshaping a live layout is a drain-gated operation owned by the stale-MIG runbook, not by routine convergence. It also does not manage the GPU Operator's MIG manager: when the Operator owns geometry (nvidia.com/mig.config), leave mig_enabled=false and let the Operator drive it.
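The read-mode and reset-handling decisions above can be sketched outside Ansible as two small filters over the CSV rows that nvidia-smi prints. This is a minimal sketch, not part of the role: the function names gpus_needing_enable and gpus_pending_reset are hypothetical, and the row shapes ("index, current" and "index, current, pending") are assumed from the queries used in the tasks.

```shell
#!/bin/sh
# Hypothetical helpers; input is read from stdin so they can be tried
# against captured nvidia-smi output without a GPU.

# Print the index of every GPU whose current MIG mode is Disabled.
# Expected rows: "0, Disabled" or "0, Enabled" (index, mig.mode.current).
# Any other value in the second column is ignored.
gpus_needing_enable() {
  awk -F', *' '$2 == "Disabled" { print $1 }'
}

# Print the index of every GPU where the enable is waiting on a reset:
# current is still Disabled while pending already reads Enabled.
# Expected rows: "0, Disabled, Enabled" (index, current, pending).
gpus_pending_reset() {
  awk -F', *' '$2 == "Disabled" && $3 == "Enabled" { print $1 }'
}
```

On a node, the first filter would be fed from nvidia-smi --query-gpu=index,mig.mode.current --format=csv,noheader, and the second from the same query with mig.mode.pending appended.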

Variables

Role and inventory variables (set in inventory/hosts.ini [gpu_nodes:vars] or host_vars/). mig_enabled and gpu_tier are shared with the hub inventory; mig_profile is introduced by this role.

| Variable | Default | Scope | Meaning |
| --- | --- | --- | --- |
| `mig_enabled` | `false` | inventory (hub) | Master gate. Role is a no-op unless `true`. Only datacenter (A30, A100, H100/H200, B-series) or RTX PRO Blackwell SKUs. |
| `gpu_tier` | `datacenter` | inventory (hub) | `datacenter \| workstation \| consumer`. MIG tasks run only on `datacenter`. |
| `mig_profile` | `1g.10gb` | role | Profile applied to each GPU on the node, by name or numeric profile ID from `-lgip`. Comma-separate to cut several instances per card, e.g. `3g.20gb,3g.20gb` or `1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb`. Must fit the card's slice budget; must be valid for the SKU (see profile tables in MIG). |
| `mig_gpu_ids` | `all` | role | Which GPUs to target. `all` omits `-i` for the mode-enable step and derives IDs from `nvidia-smi --query-gpu=index`; a comma list such as `0,1` is passed as `nvidia-smi -i 0,1 -mig 1` for mode enable and looped as `nvidia-smi mig -i <id> -cgi ...` for instance creation. `-i all` is not valid syntax, so the token is dropped when the value is `all`. |
| `mig_create_compute_instances` | `true` | role | Pass `-C` so each GI also gets its compute instance (CUDA sees the slice). `false` cuts GIs only; almost never what you want, since CUDA enumerates nothing without a CI. |
| `mig_reboot_on_reset` | `true` | role | Ampere only: if enabling MIG cannot reset in-band, allow the role to notify a reboot handler to apply `mig.mode.pending`. Set `false` to fail loudly instead of rebooting. |

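Profile defaults are intentionally conservative (1g.10gb). Override per node group; do not assume one profile name fits every SKU, because memory-slice sizes differ: the smallest slice is 1g.10gb on A100 80GB and H100 80GB, 1g.18gb on H200, 1g.5gb on A100 40GB, and 1g.6gb on A30, so the default is not valid on the last three. The full per-GPU profile tables are in MIG.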
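How mig_gpu_ids maps onto command lines can be illustrated with a short sketch. The helper names mig_enable_cmd and mig_create_cmds are hypothetical; they only print the commands the role would issue, and they assume the comma list contains no spaces.

```shell
#!/bin/sh
# Hypothetical helpers that print (not run) the nvidia-smi command lines.

# Mode enable: "all" drops the -i token, a comma list is passed through as-is.
mig_enable_cmd() {
  if [ "$1" = "all" ]; then
    echo "nvidia-smi -mig 1"
  else
    echo "nvidia-smi -i $1 -mig 1"
  fi
}

# Instance creation: one command per GPU ID, so every listed card is cut.
# $1: comma list of GPU IDs (already resolved, never "all"); $2: mig_profile.
mig_create_cmds() {
  ids=$1
  profile=$2
  old_ifs=$IFS
  IFS=,
  for id in $ids; do
    echo "nvidia-smi mig -i $id -cgi $profile -C"
  done
  IFS=$old_ifs
}
```

For example, mig_create_cmds 0,1 1g.10gb prints one create command for GPU 0 and one for GPU 1, which mirrors the per-ID loop in the tasks file.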

Tasks

roles/mig/tasks/main.yml. Every nvidia-smi call goes through ansible.builtin.command (no shell metacharacters needed) with task-level register / changed_when / failed_when for idempotency; ansible.builtin.assert provides the readiness gate, ansible.builtin.set_fact resolves the target GPU IDs, and meta: flush_handlers applies a pending reboot. The whole file is guarded by a block-level when, so nothing runs off-tier.

# roles/mig/tasks/main.yml
- name: MIG configuration
  when: mig_enabled | bool and gpu_tier == 'datacenter'
  block:

    - name: Readiness gate - GPUs are visible to the driver
      ansible.builtin.command: nvidia-smi -L
      register: mig_smi_list
      changed_when: false
      failed_when: mig_smi_list.rc != 0

    - name: Readiness gate - assert at least one GPU enumerated
      ansible.builtin.assert:
        that: "'GPU 0' in mig_smi_list.stdout"
        fail_msg: "No GPU enumerated by nvidia-smi -L; nvidia_stack role must complete before mig."

    - name: Read current MIG mode per GPU
      ansible.builtin.command: >-
        nvidia-smi --query-gpu=index,mig.mode.current --format=csv,noheader
      register: mig_mode
      changed_when: false
      failed_when: mig_mode.rc != 0
      # stdout rows: "0, Disabled" | "0, Enabled". GPUs without MIG support
      # report "[N/A]"; the block 'when' already excludes those by gpu_tier.

    - name: Enable MIG mode where currently Disabled
      ansible.builtin.command: >-
        nvidia-smi {{ ('-i ' ~ mig_gpu_ids) if mig_gpu_ids != 'all' else '' }} -mig 1
      register: mig_enable
      when: "'Disabled' in mig_mode.stdout"
      changed_when: "'Enabled MIG Mode' in mig_enable.stdout"
      failed_when: >-
        mig_enable.rc != 0
        and 'In use by another client' not in (mig_enable.stderr | default(''))
      # On Ampere this resets the GPU; the reset is refused while a CUDA app or
      # a stray nvidia-smi holds the device ("In use by another client").
      # Drain/clear clients first (see runbook-mig-state-stale).

    - name: Re-read MIG mode after enable (detect pending reset, Ampere)
      ansible.builtin.command: >-
        nvidia-smi --query-gpu=index,mig.mode.current,mig.mode.pending --format=csv,noheader
      register: mig_mode_post
      changed_when: false
      failed_when: mig_mode_post.rc != 0
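      # Run whenever the enable task ran, not only when it reported changed:
      # a refused reset ("In use by another client") leaves changed=false but
      # can still set pending=Enabled, which the reboot task below must see.
      when: mig_enable is not skipped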

    - name: Reboot to apply pending MIG mode (Ampere, reset not yet in effect)
      ansible.builtin.command: "true"
      changed_when: true
      notify: reboot node
      when:
        - mig_reboot_on_reset | bool
        - mig_mode_post is defined
        - mig_mode_post.stdout is defined
        - "'Enabled' in mig_mode_post.stdout"
        - "'Disabled, Enabled' in mig_mode_post.stdout"   # current=Disabled, pending=Enabled
      # Hopper+ needs no reset (current flips immediately); this fires only when
      # current is still Disabled but pending is Enabled. Flush handlers before
      # creating instances so the node is back up first.

    - name: Apply pending reboot now (before creating instances)
      ansible.builtin.meta: flush_handlers

    - name: Resolve target GPU IDs for instance creation
      ansible.builtin.set_fact:
        mig_target_gpu_ids: >-
          {{ (mig_mode.stdout_lines
              | map('regex_replace', '^\\s*([^,]+),.*$', '\\1')
              | map('trim')
              | list)
             if mig_gpu_ids == 'all'
             else (mig_gpu_ids.split(',') | map('trim') | list) }}

    - name: List existing MIG devices per target GPU
      ansible.builtin.command: nvidia-smi -i {{ item }} -L
      register: mig_existing
      changed_when: false
      failed_when: mig_existing.rc != 0
      loop: "{{ mig_target_gpu_ids }}"

    - name: Create GPU instances for the requested profile (with compute instances)
      ansible.builtin.command: >-
        nvidia-smi mig -i {{ item.item }} -cgi {{ mig_profile }}
        {{ '-C' if mig_create_compute_instances | bool else '' }}
      register: mig_create
      # Idempotency: only cut instances on a GPU where the requested profile is
      # not already present in that GPU's `nvidia-smi -i <id> -L` output.
      loop: "{{ mig_existing.results }}"
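      # Whitespace-tolerant match: some drivers pad the columns in -L output.
      when: item.stdout is not search('MIG\\s+' ~ (mig_profile.split(',')[0] | trim | regex_escape) ~ '\\s+Device')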
      changed_when: "'Successfully created' in mig_create.stdout"
      failed_when: >-
        mig_create.rc != 0
        and 'Insufficient resources' not in (mig_create.stderr | default(''))

    - name: Verify MIG devices are enumerated per target GPU
      ansible.builtin.command: nvidia-smi -i {{ item }} -L
      register: mig_verify
      loop: "{{ mig_target_gpu_ids }}"
      changed_when: false
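      # Whitespace-tolerant match: some drivers pad the columns in -L output.
      failed_when: >-
        mig_verify.stdout is not search('MIG\\s+' ~ (mig_profile.split(',')[0] | trim | regex_escape) ~ '\\s+Device')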

# roles/mig/handlers/main.yml
- name: reboot node
  ansible.builtin.reboot:
    reboot_timeout: 1200
    post_reboot_delay: 30

Notes on the idempotency contract:

Read-before-write everywhere. mig.mode.current gates the enable; nvidia-smi -L gates the create. No step mutates without first proving the target state is absent.
The when guards skip the mutating tasks on a converged GPU, so a re-run reports them as skipped. changed_when is keyed on success strings (Enabled MIG Mode, Successfully created), not on exit code, so a tolerated non-zero exit reports ok, not changed.
The create loop targets each GPU ID explicitly with nvidia-smi mig -i <id> ...; a multi-GPU node is not left with only GPU 0 cut. The presence check matches only the first profile token in mig_profile, and only by name: a numeric profile ID never appears in nvidia-smi -L output, so the check always reads as absent. For a heterogeneous layout (e.g. 3g.40gb,1g.10gb) the presence check is therefore partial; to reshape, prefer the explicit, drain-gated teardown/recreate path in the stale-MIG runbook over re-running this role.
flush_handlers forces the reboot (if any) to complete before the create step, so instances are cut on a GPU whose MIG mode is actually in effect.
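The partial presence check noted above could be tightened by counting devices per profile name instead of looking only for the first token. The sketch below is one way to do that and is not what the role implements; layout_converged is a hypothetical name, it handles profile names only (not numeric IDs), and it does not detect extra instances of profiles that were not requested.

```shell
#!/bin/sh
# Hypothetical check: does a GPU already hold exactly the requested number
# of devices for every profile name in mig_profile?
# $1: mig_profile value, e.g. "3g.40gb,1g.10gb,1g.10gb"
# $2: captured output of `nvidia-smi -i <id> -L`
# Returns 0 when every requested profile count matches, 1 otherwise.
layout_converged() {
  want=$1
  have=$2
  for p in $(printf '%s\n' "$want" | tr ',' '\n' | sort -u); do
    # How many times the profile was requested.
    need=$(printf '%s\n' "$want" | tr ',' '\n' | grep -c -x -F -- "$p")
    # How many "MIG <profile> Device" lines exist; awk's default field
    # splitting absorbs any column padding.
    got=$(printf '%s\n' "$have" |
      awk -v p="$p" '$1 == "MIG" && $2 == p && $3 == "Device" { n++ } END { print n + 0 }')
    [ "$got" -eq "$need" ] || return 1
  done
  return 0
}
```

A check like this would still only decide whether to create; removing surplus instances remains a drain-gated operation for the stale-MIG runbook.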

Apply & verify

Run the hub playbook scoped to one node, or the role directly:

# whole bring-up, one node, MIG on, explicit profile:
ansible-playbook -i inventory/hosts.ini site.yml --limit gpu-07.dc1.internal \
  -e mig_enabled=true -e mig_profile=1g.10gb

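# dry run first. Caveat: ansible.builtin.command tasks do not execute under
# --check unless they set check_mode: false, so as written the nvidia-smi
# queries return no output and the readiness assert fails rather than
# previewing the enable/create as would-change: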
ansible-playbook -i inventory/hosts.ini site.yml --limit gpu-07.dc1.internal \
  -e mig_enabled=true --check --diff

# tags, if site.yml tags the role:
ansible-playbook -i inventory/hosts.ini site.yml --limit gpu-07.dc1.internal --tags mig

Validation command and expected signal. On the node, MIG mode is Enabled and nvidia-smi -L lists one MIG-<UUID> device per compute instance:

nvidia-smi --query-gpu=index,mig.mode.current --format=csv,noheader
# expect every row: "<n>, Enabled"

nvidia-smi -L
# expect one line per slice, e.g.:
#   GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-...)
#     MIG 1g.10gb Device 0: (UUID: MIG-c7384736-a75d-5afc-978f-d2f1294409fd)

The MIG UUID (MIG-<...>), not the bare GPU index, is what pins a process to one instance via CUDA_VISIBLE_DEVICES. If MIG mode is Enabled but no MIG-<UUID> lines appear for the requested profile, the likely cause is GIs created without compute instances; confirm -C (i.e. mig_create_compute_instances=true).
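Pulling the MIG UUIDs out of nvidia-smi -L output is a one-line filter. The sketch below assumes the line shape shown in the expected output above; mig_uuids is a hypothetical helper name and ./my_app stands in for any CUDA program.

```shell
#!/bin/sh
# Hypothetical helper: print every MIG UUID found on stdin, one per line.
# Lines for whole GPUs carry "UUID: GPU-..." and are not matched.
mig_uuids() {
  sed -n 's/.*(UUID: \(MIG-[^)]*\)).*/\1/p'
}

# Usage on a node (not run here):
#   CUDA_VISIBLE_DEVICES=$(nvidia-smi -L | mig_uuids | head -n 1) ./my_app
```

Selecting by UUID keeps the pinning stable even if instance ordering differs between nodes.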

Idempotency check (design intent, not yet hardware-verified). Because every mutating step is guarded by a read-before-write when/changed_when (enable fires only on Disabled; create fires only when the profile is absent), a second identical run on an already-converged node is expected to report changed=0. Run it and confirm on your hardware; this role is a reference template and has not been validated on a live GPU:

ansible-playbook -i inventory/hosts.ini site.yml --limit gpu-07.dc1.internal \
  -e mig_enabled=true -e mig_profile=1g.10gb
# expect PLAY RECAP -> ... changed=0 ... once the node is converged
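For a scripted version of that second-run check, the PLAY RECAP lines can be inspected mechanically. This is a sketch under assumptions: recap_is_converged is a hypothetical name, the input is uncolored ansible-playbook output saved or piped from the run, and the recap fields follow the usual "changed=N unreachable=N failed=N" layout.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if every host line after "PLAY RECAP"
# shows changed=0, unreachable=0 and failed=0. Reads playbook output on stdin.
recap_is_converged() {
  awk '
    /^PLAY RECAP/ { in_recap = 1; next }
    in_recap && /changed=/ {
      hosts++
      if ($0 !~ /changed=0( |$)/)     bad = 1
      if ($0 !~ /unreachable=0( |$)/) bad = 1
      if ($0 !~ /failed=0( |$)/)      bad = 1
    }
    END {
      if (hosts > 0 && !bad) exit 0
      exit 1
    }
  '
}

# Usage (not run here):
#   ansible-playbook ... | tee run2.log; recap_is_converged < run2.log
```

An empty or truncated log counts as not converged, since no host line was seen.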

Run the hub's health role (this role's validation counterpart) after mig to assert geometry fleet-wide.

Failure modes

Enable refused, In use by another client. On Ampere the reset needed to enable MIG is blocked by an attached CUDA app or a stray nvidia-smi. The role tolerates this stderr so a fleet run does not abort, but the GPU stays Disabled. Drain the node / kill clients (or reboot), then re-run. Runbook: stale-MIG state.
MIG layout gone after reboot (Hopper / Blackwell). Mode is not InfoROM-persistent on Hopper+; a rebooted node comes back as one whole GPU while the scheduler still expects slices, and pods stay Pending. Re-run site.yml on boot (or let a systemd unit / the GPU Operator re-enable). Runbook: stale-MIG state.
mig.mode.pending=Enabled but current=Disabled and mig_reboot_on_reset=false. The Ampere reset never applied; the create step then fails because MIG is not actually on. Reboot the node and re-run. Runbook: stale-MIG state.
Insufficient resources on create. mig_profile exceeds the card's slice budget (7 compute / 8 memory slices on A100- and H100-class cards; 4 / 4 on A30), uses a profile invalid for the SKU, or collides with placement constraints. Check nvidia-smi mig -lgip for the remaining budget and the per-GPU tables in MIG. The role tolerates this stderr only to surface it in the verify step's failure.
Stale / partial geometry vs. what Kubernetes advertises. Device-plugin labels disagree with the on-box nvidia-smi mig -lgi layout (typically after a partial reconfigure or a -mig 0 that left CIs/GIs behind). This role does not reshape live geometry; use the drain-gated teardown in stale-MIG state.
Operator and host both managing MIG. If the GPU Operator's MIG manager owns nvidia.com/mig.config and this role also runs, they fight over geometry. Pick one: leave mig_enabled=false when the Operator is present.
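The slice-budget failure above can be caught before a run with simple arithmetic on the profile names. The sketch below sums only the compute-slice count (the leading digit before "g."); it is a necessary check, not a sufficient one, because memory slices and placement constraints are not modeled. fits_compute_budget is a hypothetical name, and the default budget of 7 is an assumption that fits A100- and H100-class cards.

```shell
#!/bin/sh
# Hypothetical pre-flight check on a mig_profile value.
# $1: comma list of profile names, e.g. "3g.40gb,1g.10gb"
# $2: compute-slice budget (defaults to 7)
# Returns 0 if the summed compute slices fit, 1 if they exceed the budget,
# 2 if a token is not a profile name (for example a numeric profile ID).
fits_compute_budget() {
  profiles=$1
  budget=${2:-7}
  total=0
  for p in $(printf '%s\n' "$profiles" | tr ',' ' '); do
    case $p in
      [1-9]g.*) n=${p%%g.*} ;;   # "3g.40gb" -> "3"
      *) return 2 ;;
    esac
    total=$((total + n))
  done
  [ "$total" -le "$budget" ]
}
```

A layout that passes this check can still be rejected by the driver; nvidia-smi mig -lgip and the per-GPU tables remain the authority.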

References

MIG User Guide: https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html
Getting Started with MIG (-mig 1/0, -cgi ... -C, -lgi/-lci/-lgip, -dci/-dgi, mig.mode.current/mig.mode.pending, MIG-UUID format, Ampere reset + InfoROM persistence vs. Hopper+ non-persistence, In use by another client): https://docs.nvidia.com/datacenter/tesla/mig-user-guide/getting-started-with-mig.html
Supported MIG profiles (per-GPU profile tables, slice budget, +me/+gfx suffixes): https://docs.nvidia.com/datacenter/tesla/mig-user-guide/supported-mig-profiles.html
Supported GPUs (A30, A100, H100, H200, B200, RTX PRO Blackwell, Thor iGPU): https://docs.nvidia.com/datacenter/tesla/mig-user-guide/supported-gpus.html
ansible.builtin.command (creates/removes, no shell, task-level changed_when/failed_when/register): https://docs.ansible.com/ansible/latest/collections/ansible/builtin/command_module.html
ansible.builtin.assert: https://docs.ansible.com/ansible/latest/collections/ansible/builtin/assert_module.html
ansible.builtin.reboot: https://docs.ansible.com/ansible/latest/collections/ansible/builtin/reboot_module.html
ansible.builtin.meta (flush_handlers): https://docs.ansible.com/ansible/latest/collections/ansible/builtin/meta_module.html
NVIDIA GPU Operator with MIG (when the Operator owns geometry instead): https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-mig.html