NVIDIA Container Toolkit & CDI

Scope: how the NVIDIA Container Toolkit and the Container Device Interface (CDI) expose host GPUs to OCI containers. Covers installation, runtime wiring (Docker/containerd), CDI spec generation and lifecycle, and the smoke test that proves a container sees the GPUs.

What it is

The NVIDIA Container Toolkit (package nvidia-container-toolkit, formerly nvidia-docker2) is the layer that injects the host GPU driver into a container at start time. It does not put a driver in the image. The container ships everything above libcuda.so (the CUDA runtime, cuDNN, NCCL; see Frameworks); the toolkit mounts the matching host driver user-space libraries and device nodes (/dev/nvidia*) so the container's CUDA runtime can reach the GPU. You need the NVIDIA driver installed on the host; you do not need the CUDA toolkit on the host.³ This is exactly the boundary described in CUDA Driver and GPU Software Stack and Node Administration.

The toolkit has two delivery mechanisms into the container, and the distinction is the single most important operational fact on this page:

Legacy hook / runtime path. A custom OCI runtime (nvidia-container-runtime) wraps runc and, via libnvidia-container, performs the driver injection. This is what docker run --gpus all drives.³
CDI (Container Device Interface). A vendor-neutral spec (github.com/cncf-tags/container-device-interface) describing exactly which device nodes, libraries and mounts a named device (nvidia.com/gpu=all, nvidia.com/gpu=0) requires. The toolkit generates the spec once; any CDI-aware runtime then consumes it. CDI is the modern path and the one Kubernetes, Podman and recent Docker/containerd standardise on.²
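
The device names above follow the CDI shape vendor/class=name. As a minimal sketch, assuming only that shape, the helper below splits a name into its three parts so scripts can validate a device request before handing it to a runtime. The function name cdi_split is hypothetical and not part of any NVIDIA or CDI tool.

```shell
# Split a fully qualified CDI device name (vendor/class=name) into its parts.
# Prints "vendor class name" on success; returns 1 if the shape is wrong.
# cdi_split is an illustrative helper, not a toolkit command.
cdi_split() (
  fq="$1"
  case "$fq" in
    */*=*) ;;            # needs a "/" followed later by an "="
    *) exit 1 ;;
  esac
  kind="${fq%%=*}"       # e.g. nvidia.com/gpu
  name="${fq#*=}"        # e.g. all or 0
  vendor="${kind%%/*}"   # e.g. nvidia.com
  class="${kind#*/}"     # e.g. gpu
  [ -n "$vendor" ] && [ -n "$class" ] && [ -n "$name" ] || exit 1
  printf '%s %s %s\n' "$vendor" "$class" "$name"
)
```

For example, cdi_split nvidia.com/gpu=all prints "nvidia.com gpu all", while a bare string such as gpu0 is rejected with a non-zero status.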

flowchart LR
  DRV["Host NVIDIA driver libcuda.so + /dev/nvidia*"] --> CTK["nvidia-container-toolkit"]
  CTK --> HOOK["Legacy runtime hook: docker --gpus"]
  CTK --> GEN["nvidia-ctk cdi generate"]
  GEN --> SPEC["CDI spec /etc/cdi or /var/run/cdi"]
  SPEC --> CDIRT["CDI-aware runtime: Docker 25+, containerd, Podman, CRI"]
  HOOK --> CON["Container CUDA runtime"]
  CDIRT --> CON

Why it's needed (and when)

Driver/image decoupling. The image must not pin a driver. Drivers are reboot-class, fleet-managed, and tier-pinned (Driver and Feature Support by GPU Tier, Driver Install and Lifecycle). The toolkit lets one image run unchanged across nodes on different driver branches, as long as the host driver satisfies the image's CUDA floor (Driver Versions and Branches).
Use CDI when the consumer is Kubernetes or Podman, or when you want one declarative device description shared across runtimes. CDI is also what the GPU Operator and device plugin emit. Prefer CDI on any new build; the legacy --gpus hook remains suitable for simple single-node Docker.²
Use the --gpus hook when you are on plain Docker, want the terse one-liner, and do not need MIG-device selection by CDI name. Both paths ultimately call libnvidia-container.³
MIG and partitioning. CDI is the clean way to hand a specific MIG instance to a container by name (the generated spec enumerates each MIG device). After any MIG reconfiguration the spec is stale and must be regenerated. See Failure modes and MIG.²
NVSwitch nodes. On NVLink/NVSwitch platforms the container path assumes the node prerequisites are already met: Fabric Manager running and version-matched, Persistence Mode on. The toolkit does not bring those up.
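
The driver/image decoupling point reduces to one comparison: is the host driver at least as new as the image's CUDA floor? The sketch below shows one way to script that check with a version-aware sort. The function name driver_meets_floor is hypothetical, and the floor value in the usage comment is a placeholder; take the real floor from the CUDA release notes for your image.

```shell
# Succeeds when host driver version $1 is >= the image's minimum $2.
# Uses sort -V so that 550.100.1 correctly ranks above 550.54.14.
# driver_meets_floor is an illustrative helper, not a toolkit command.
driver_meets_floor() (
  have="$1"
  need="$2"
  lowest="$(printf '%s\n%s\n' "$have" "$need" | sort -V | head -n 1)"
  [ "$lowest" = "$need" ]
)

# On a GPU node (the floor below is a placeholder, not a stated requirement):
#   have="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
#   driver_meets_floor "$have" "550.54.14" || echo "host driver is below the image floor"
```

A plain string comparison would misorder multi-digit components, which is why the sketch relies on sort -V (GNU coreutils).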

How it's installed & managed

Reference template, not hardware-tested. Pin and re-verify the version and package names against the official install guide in References before rollout. Latest stable at time of writing: NVIDIA Container Toolkit 1.19.1 (May 2025).⁶

1. Add the repository (Ubuntu/Debian). Single block, verbatim from the install guide.¹

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

2. Install, pinned. Pin all four packages to the same version on managed fleets; never float the toolkit independently of libnvidia-container.¹

sudo apt-get update
export NVIDIA_CONTAINER_TOOLKIT_VERSION=1.19.1-1
sudo apt-get install -y \
    nvidia-container-toolkit=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
    nvidia-container-toolkit-base=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
    libnvidia-container-tools=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
    libnvidia-container1=${NVIDIA_CONTAINER_TOOLKIT_VERSION}
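
Because the four packages must stay on one version, a post-install check is cheap insurance. The sketch below reads "package version" pairs and succeeds only when every pair carries the same version. The function name same_version is hypothetical; the dpkg-query and apt-mark invocations in the comments are standard Debian tooling, shown as one possible way to feed and enforce the check.

```shell
# Reads "package version" lines on stdin; succeeds only if every line
# carries the same version string. Empty input counts as a failure.
# same_version is an illustrative helper, not a toolkit command.
same_version() {
  awk '{ seen[$2] = 1 }
       END { n = 0; for (v in seen) n++; exit (n == 1 ? 0 : 1) }'
}

# On a Debian/Ubuntu host:
#   dpkg-query -W -f='${Package} ${Version}\n' \
#     nvidia-container-toolkit nvidia-container-toolkit-base \
#     libnvidia-container-tools libnvidia-container1 | same_version \
#     || echo "toolkit packages are on mixed versions"
# Optionally hold them so an unattended upgrade cannot float one package:
#   sudo apt-mark hold nvidia-container-toolkit nvidia-container-toolkit-base \
#     libnvidia-container-tools libnvidia-container1
```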

3a. Wire Docker (legacy --gpus path). The command writes the nvidia runtime into /etc/docker/daemon.json; restart Docker afterwards.¹

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

3b. Wire containerd (Kubernetes path). The command creates a drop-in at /etc/containerd/conf.d/99-nvidia.toml; restart containerd afterwards.¹

sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd

4. CDI: generate the spec. Generate into /etc/cdi (static) or /var/run/cdi (regenerated, ephemeral). On recent toolkit installs the nvidia-cdi-refresh systemd unit auto-generates /var/run/cdi/nvidia.yaml on install, driver update, and reboot; the manual command is below.²

# Generate a CDI spec for all GPUs/MIG devices on the host
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# List the device names the spec exposes (nvidia.com/gpu=0, nvidia.com/gpu=all, MIG names)
nvidia-ctk cdi list
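
After generating the spec, a script can confirm that a particular device name is actually exposed. The sketch below assumes nvidia-ctk cdi list prints one fully qualified device name per line (verify on your toolkit version) and matches whole lines only, so any log lines in the output are ignored. The function name cdi_has_device is hypothetical.

```shell
# Succeeds if the CDI device name $1 appears as a whole line on stdin.
# -F: fixed string, -x: whole-line match, -q: no output.
# cdi_has_device is an illustrative helper, not a toolkit command.
cdi_has_device() {
  grep -Fqx -- "$1"
}

# On a host with a generated spec:
#   nvidia-ctk cdi list 2>/dev/null | cdi_has_device nvidia.com/gpu=all \
#     || echo "nvidia.com/gpu=all is not in any CDI spec"
```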

5. Enable CDI in the runtime.

Docker: CDI has been enabled by default since Docker Engine 28.3.0. On Docker 25.0–28.2, enable it in /etc/docker/daemon.json and restart the daemon:⁴

{
  "features": { "cdi": true }
}
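
Whether the daemon.json flag is needed depends only on the Engine version, so fleet tooling can decide it mechanically. The sketch below classifies a version string using the thresholds stated in this step (28.3.0 and 25.0). The function name docker_cdi_status and its three output labels are hypothetical, and it expects a full x.y.z version.

```shell
# Classify a Docker Engine version (x.y.z) by the CDI thresholds in this step.
# Prints one of: default-on | needs-flag | too-old
# docker_cdi_status is an illustrative helper, not a Docker command.
docker_cdi_status() (
  ver="${1%%[-+]*}"   # drop suffixes such as "-rc.1" or "+build"
  at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
  }
  if at_least "$ver" 28.3.0; then
    echo default-on
  elif at_least "$ver" 25.0.0; then
    echo needs-flag     # set "features": { "cdi": true } and restart
  else
    echo too-old        # below the range this page covers
  fi
)

# On a Docker host:
#   docker_cdi_status "$(docker version --format '{{.Server.Version}}')"
```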

containerd: enable CDI and point it at the spec directories in the containerd config (CRI plugin). The key path can differ between containerd major versions, so confirm it against the containerd config docs for your version:

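# containerd config.toml, CRI plugin table as named in containerd 1.x (config version 2).
# Other containerd versions name this table differently; verify against your version.
[plugins."io.containerd.grpc.v1.cri"]
  enable_cdi = true
  cdi_spec_dirs = ["/etc/cdi", "/var/run/cdi"]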

Podman: CDI-native; no daemon flag. It reads /etc/cdi and /var/run/cdi directly.²

To force CDI mode in the NVIDIA runtime itself (rather than in the engine), run sudo nvidia-ctk config --in-place --set nvidia-container-runtime.mode=cdi. Verify against the install guide before using it on a fleet.¹

Validated usage & tests

All examples below are reference templates, not hardware-tested. The smoke test is always the same: run a container that has nvidia-smi and confirm it prints the host GPU table. The driver/CUDA versions, GPU model, and process list in the output depend on the node. Do not expect fixed numbers.

Legacy --gpus path. Expose all GPUs; the utility capability is what makes nvidia-smi available in the container.³

# All GPUs
docker run -it --rm --gpus all ubuntu nvidia-smi

# Make nvidia-smi available explicitly via the utility capability
docker run --rm --gpus 'all,capabilities=utility' ubuntu nvidia-smi

# Specific GPU by index, multiple by index, or by UUID (from: nvidia-smi -L)
docker run -it --rm --gpus device=0 ubuntu nvidia-smi
docker run -it --rm --gpus '"device=0,2"' ubuntu nvidia-smi
docker run -it --rm --gpus device=GPU-3a23c669-1f69-c64e-cf85-44e9b07e7a2a ubuntu nvidia-smi
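
The quoting in the multi-device form is easy to get wrong in scripts: the value passed to --gpus must itself contain double quotes when it lists more than one device. The sketch below builds the value from a list of indices or UUIDs and reproduces the three forms shown above. The function name gpus_value is hypothetical.

```shell
# Build the value for --gpus from zero or more GPU indices or UUIDs.
# No arguments -> all; one argument -> device=X;
# several -> "device=X,Y" including the literal double quotes.
# gpus_value is an illustrative helper, not a Docker command.
gpus_value() (
  [ "$#" -eq 0 ] && { echo all; exit 0; }
  [ "$#" -eq 1 ] && { echo "device=$1"; exit 0; }
  IFS=,                            # "$*" joins arguments with the first IFS char
  printf '"device=%s"\n' "$*"
)

# On a Docker host:
#   docker run --rm --gpus "$(gpus_value 0 2)" ubuntu nvidia-smi
```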

CUDA base image (closer to a real workload). NVIDIA publishes CUDA base images on NGC (nvcr.io/nvidia/cuda) and Docker Hub (nvidia/cuda). Pick a tag whose CUDA series the host driver supports.⁵

# Verify GPU visibility from a CUDA runtime image (pin the tag to one the host driver supports)
docker run --rm --gpus all nvcr.io/nvidia/cuda:13.0.0-base-ubuntu24.04 nvidia-smi

CDI path (Docker 25+ / Podman). Request the device by its CDI name; no --gpus, no --runtime needed once CDI is enabled.²

# Docker (CDI enabled): inject by CDI device name
docker run --rm --device nvidia.com/gpu=all nvcr.io/nvidia/cuda:13.0.0-base-ubuntu24.04 nvidia-smi

# Podman (CDI-native)
podman run --rm --device nvidia.com/gpu=all --security-opt=label=disable ubuntu nvidia-smi -L

Expected result (all paths). The container prints the nvidia-smi table: driver version and CUDA version in the header, one row per visible GPU (name, memory, utilisation), and usually No running processes found. Index/UUID selection should show only the requested GPUs; with --device nvidia.com/gpu=0, exactly one GPU appears. For the meaning of each field see nvidia-smi Reference. A non-zero exit, empty output, or a Failed to initialize NVML line means the path is broken; go to Failure modes.
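
To turn the expected result into a pass/fail check, a script can count GPUs in nvidia-smi -L output rather than eyeballing the table. The sketch below assumes each physical GPU is listed on a line starting with "GPU <index>:" and that nested lines (for example MIG devices) are indented, so they are not counted. The function names count_gpus and expect_gpus are hypothetical.

```shell
# Count physical GPUs in `nvidia-smi -L` output read from stdin.
# grep -c exits non-zero on zero matches, so "|| true" keeps the status clean.
count_gpus() {
  grep -c '^GPU [0-9][0-9]*:' || true
}

# Succeeds when the listing on stdin shows exactly $1 GPUs.
expect_gpus() {
  [ "$(count_gpus)" -eq "$1" ]
}

# On a GPU node with CDI enabled:
#   docker run --rm --device nvidia.com/gpu=0 ubuntu nvidia-smi -L | expect_gpus 1 \
#     || echo "container does not see exactly one GPU"
```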

Failure modes

nvidia-smi: command not found inside the container: the image lacks the binary or the utility capability was not injected. Add capabilities=utility (or use a CUDA base image), or set NVIDIA_DRIVER_CAPABILITIES appropriately.³
Failed to initialize NVML: Unknown Error, or could not select device driver "" with capabilities: [[gpu]]: the runtime is not wired. Re-run the matching nvidia-ctk runtime configure command and restart the daemon. If the host itself shows no devices, fix the kernel/driver first (Kernel Upgrade: GPU Missing, NVIDIA Kernel Modules).
unresolvable CDI devices nvidia.com/gpu=all: unknown: CDI is not enabled in the engine, or no spec is present. Enable CDI in the engine (Docker features.cdi), generate the spec if it is missing, and confirm with nvidia-ctk cdi list.²
Stale CDI spec after a driver or MIG change: nvidia-cdi-refresh regenerates /var/run/cdi/nvidia.yaml on driver update but does not handle driver removal or MIG reconfiguration. In those cases, regenerate manually with nvidia-ctk cdi generate (see Rolling Driver / CUDA Upgrade, MIG).²
CDI hook conflict: if both the legacy hook and CDI inject, you get duplicated devices; remove /usr/share/containers/oci/hooks.d/oci-nvidia-hook.json if present.²
NVSwitch node, container sees GPUs but NCCL/peer-to-peer fails: this is a node-level fabric problem, not a toolkit issue. Check Fabric Manager and Fabric Manager Failure.
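
The first three entries are keyed on specific error strings, so a wrapper script can point an operator at the right entry automatically. The sketch below matches a captured error line against those strings and prints a short label. The function name triage and the labels are hypothetical, and the matching covers only the messages quoted in this list.

```shell
# Map a captured error line to the matching failure-mode entry.
# Prints one of: utility-capability | runtime-not-wired | cdi-missing | unknown
# triage is an illustrative helper, not a toolkit command.
triage() {
  case "$1" in
    *"nvidia-smi: command not found"*)
      echo utility-capability ;;   # add capabilities=utility or use a CUDA image
    *"Failed to initialize NVML"*|*"could not select device driver"*)
      echo runtime-not-wired ;;    # re-run nvidia-ctk runtime configure, restart
    *"unresolvable CDI devices"*)
      echo cdi-missing ;;          # enable CDI or generate the spec
    *)
      echo unknown ;;
  esac
}

# Example with a captured smoke-test error:
#   err="$(docker run --rm --gpus all ubuntu nvidia-smi 2>&1 >/dev/null)"
#   triage "$err"
```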

References

Installing the NVIDIA Container Toolkit (apt repo, pinned install, nvidia-ctk runtime configure for docker/containerd, drop-in path): https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
Support for Container Device Interface (nvidia-ctk cdi generate/list, /etc/cdi & /var/run/cdi, nvidia-cdi-refresh, Podman/Docker examples, regeneration after MIG/driver changes, hook conflict): https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/cdi-support.html
Running a Sample Workload (the nvidia-smi container verification): https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/sample-workload.html
Docker GPU access (--gpus all, device=, UUID, capabilities=utility; requires toolkit + driver): https://docs.docker.com/engine/containers/gpu/
Docker dockerd reference (features.cdi, cdi-spec-dirs, default-on since Engine 28.3.0): https://docs.docker.com/reference/cli/dockerd/
NVIDIA Container Toolkit releases (latest stable 1.19.1): https://github.com/NVIDIA/nvidia-container-toolkit/releases
CNCF Container Device Interface spec: https://github.com/cncf-tags/container-device-interface
NVIDIA CUDA container images (NGC / Docker Hub): https://catalog.ngc.nvidia.com/orgs/nvidia/containers/cuda

¹ NVIDIA Container Toolkit install guide: apt repo configuration (gpgkey + nvidia-container-toolkit.list), apt-get update, pinned install of nvidia-container-toolkit{,-base}/libnvidia-container-tools/libnvidia-container1 at 1.19.1-1; nvidia-ctk runtime configure --runtime=docker edits /etc/docker/daemon.json; --runtime=containerd writes /etc/containerd/conf.d/99-nvidia.toml; nvidia-ctk config --in-place --set nvidia-container-runtime.mode=cdi forces CDI mode. https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
² NVIDIA Container Toolkit CDI support: sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml (or /var/run/cdi/nvidia.yaml), nvidia-ctk cdi list; spec dirs /etc/cdi (static) and /var/run/cdi (generated); nvidia-cdi-refresh auto-generates on install/driver-update/reboot but does NOT handle driver removal or MIG reconfiguration (regenerate manually); Podman podman run --device nvidia.com/gpu=all; remove /usr/share/containers/oci/hooks.d/oci-nvidia-hook.json to avoid hook+CDI double injection. https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/cdi-support.html
³ Docker GPU access: requires installed NVIDIA driver + NVIDIA Container Toolkit; docker run -it --rm --gpus all ubuntu nvidia-smi; --gpus device=0; --gpus '"device=0,2"'; --gpus device=GPU-<uuid> (UUIDs from nvidia-smi -L); docker run --gpus 'all,capabilities=utility' --rm ubuntu nvidia-smi. https://docs.docker.com/engine/containers/gpu/
⁴ Docker dockerd reference: {"features":{"cdi":true}} enables CDI device injection; cdi-spec-dirs defaults to /etc/cdi/ and /var/run/cdi; "CDI is currently only supported for Linux containers and is enabled by default since Docker Engine 28.3.0." https://docs.docker.com/reference/cli/dockerd/
⁵ NVIDIA CUDA images are published on NGC as nvcr.io/nvidia/cuda (and nvidia/cuda on Docker Hub) with -base/-runtime/-devel variants per CUDA version and Ubuntu base, e.g. 13.0.0-base-ubuntu24.04. Pick a tag whose CUDA series the host driver supports. https://catalog.ngc.nvidia.com/orgs/nvidia/containers/cuda
⁶ NVIDIA Container Toolkit releases: latest stable v1.19.1 (May 2025), a unified release of libnvidia-container 1.19.1 and nvidia-container-toolkit 1.19.1. https://github.com/NVIDIA/nvidia-container-toolkit/releases