NVIDIA container toolkit & CDI
Scope: how the NVIDIA Container Toolkit and Container Device Interface (CDI) expose host GPUs to OCI containers: install, runtime wiring (Docker/containerd), CDI spec generation and lifecycle, and the smoke test that proves a container sees the GPUs.
What it is
The NVIDIA Container Toolkit (package nvidia-container-toolkit, formerly nvidia-docker2) is the layer that injects the host GPU driver into a container at start time. It does not put a driver in the image. The container ships everything above libcuda.so (the CUDA runtime, cuDNN, NCCL; see Frameworks); the toolkit mounts the matching host driver user-space libraries and device nodes (/dev/nvidia*) so the container's CUDA runtime can reach the GPU. You need the NVIDIA driver installed on the host; you do not need the CUDA toolkit on the host.3 This is exactly the boundary described in CUDA Driver and GPU Software Stack and Node Administration.
The toolkit has two delivery mechanisms into the container, and the distinction is the single most important operational fact on this page:
- Legacy hook / runtime path. A custom OCI runtime (nvidia-container-runtime) wraps runc and, via libnvidia-container, performs the driver injection. This is what docker run --gpus all drives.3
- CDI (Container Device Interface). A vendor-neutral spec (github.com/cncf-tags/container-device-interface) describing exactly which device nodes, libraries and mounts a named device (nvidia.com/gpu=all, nvidia.com/gpu=0) requires. The toolkit generates the spec once; any CDI-aware runtime then consumes it. CDI is the modern path and the one Kubernetes, Podman and recent Docker/containerd standardise on.2
flowchart LR
DRV["Host NVIDIA driver libcuda.so + /dev/nvidia*"] --> CTK["nvidia-container-toolkit"]
CTK --> HOOK["Legacy runtime hook: docker --gpus"]
CTK --> GEN["nvidia-ctk cdi generate"]
GEN --> SPEC["CDI spec /etc/cdi or /var/run/cdi"]
SPEC --> CDIRT["CDI-aware runtime: Docker 25+, containerd, Podman, CRI"]
HOOK --> CON["Container CUDA runtime"]
CDIRT --> CON
Why it's needed (and when)
- Driver/image decoupling. The image must not pin a driver. Drivers are reboot-class, fleet-managed, and tier-pinned (Driver and Feature Support by GPU Tier, Driver Install and Lifecycle). The toolkit lets one image run unchanged across nodes on different driver branches, as long as the host driver satisfies the image's CUDA floor (Driver Versions and Branches).
- Use CDI when the consumer is Kubernetes or Podman, or when you want one declarative device description shared across runtimes. CDI is also what the GPU Operator and device plugin emit. Reach for CDI on any new build; the legacy --gpus hook remains for simple single-node Docker.2
- Use the --gpus hook when you are on plain Docker, want the terse one-liner, and do not need MIG-device selection by CDI name. Both paths ultimately call libnvidia-container.3
- MIG and partitioning. CDI is the clean way to hand a specific MIG instance to a container by name (the generated spec enumerates each MIG device). After any MIG reconfiguration the spec is stale and must be regenerated. See Failure modes and MIG.2
- NVSwitch nodes. On NVLink/NVSwitch platforms the container path assumes the node prerequisites are already met: Fabric Manager running and version-matched, Persistence Mode on. The toolkit does not bring those up.
How it's installed & managed
Reference template, not hardware-tested. Pin and re-verify the version and package names against the official install guide in References before rollout. Latest stable at time of writing: NVIDIA Container Toolkit 1.19.1 (May 2025).6
1. Add the repository (Ubuntu/Debian). Single block, verbatim from the install guide.1
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
| sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
| sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
| sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
2. Install, pinned. Pin all four packages to the same version on managed fleets; never float the toolkit independently of libnvidia-container.1
sudo apt-get update
export NVIDIA_CONTAINER_TOOLKIT_VERSION=1.19.1-1
sudo apt-get install -y \
nvidia-container-toolkit=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
nvidia-container-toolkit-base=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
libnvidia-container-tools=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
libnvidia-container1=${NVIDIA_CONTAINER_TOOLKIT_VERSION}
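One way to confirm the pin held is to check that all four packages report the same version. The sketch below is an illustration, not part of the install guide: check_pins is a hypothetical helper name, and it reads "package version" pairs on stdin so it can be fed from dpkg-query or from a fleet inventory export.

```shell
# check_pins WANT (hypothetical helper): read "package version" pairs on stdin
# and succeed only if every listed package is at version WANT.
check_pins() {
  local want="$1" pkg ver bad=0 seen=0
  while read -r pkg ver; do
    [ -n "$pkg" ] || continue
    seen=$((seen + 1))
    if [ "$ver" != "$want" ]; then
      echo "MISMATCH $pkg: have ${ver:-none}, want $want"
      bad=1
    fi
  done
  if [ "$seen" -eq 0 ]; then
    echo "no packages reported"
    return 1
  fi
  return "$bad"
}

# Possible use on a Debian/Ubuntu node (assumes the export from step 2 is still set):
#   dpkg-query -W -f='${Package} ${Version}\n' \
#     nvidia-container-toolkit nvidia-container-toolkit-base \
#     libnvidia-container-tools libnvidia-container1 \
#     | check_pins "$NVIDIA_CONTAINER_TOOLKIT_VERSION"
```

A non-zero return means at least one package floated away from the pin, or nothing was installed at all.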
3a. Wire Docker (legacy --gpus path). Run sudo nvidia-ctk runtime configure --runtime=docker, which writes the nvidia runtime into /etc/docker/daemon.json, then restart Docker.1
3b. Wire containerd (Kubernetes path). Run sudo nvidia-ctk runtime configure --runtime=containerd, which creates a drop-in at /etc/containerd/conf.d/99-nvidia.toml, then restart containerd.1
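Steps 3a and 3b both come down to one nvidia-ctk runtime configure call followed by a service restart. The following is a minimal sketch that defaults to a dry run, so it only prints what it would execute. The run and wire_runtime functions and the DRY_RUN variable are illustrative assumptions, not toolkit features; the commands they wrap are the ones cited from the install guide.

```shell
# run (hypothetical wrapper): print the command in dry-run mode, execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# wire_runtime ENGINE (hypothetical helper): ENGINE is docker or containerd.
wire_runtime() {
  case "$1" in
    docker|containerd)
      # Register the nvidia runtime with the engine, then restart it.
      run sudo nvidia-ctk runtime configure --runtime="$1"
      run sudo systemctl restart "$1"
      ;;
    *)
      echo "unsupported engine: $1" >&2
      return 2
      ;;
  esac
}
```

Call wire_runtime docker for step 3a or wire_runtime containerd for step 3b. Review the printed commands first, then set DRY_RUN=0 to execute them.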
4. CDI: generate the spec. Generate into /etc/cdi (static) or /var/run/cdi (regenerated, ephemeral). On recent toolkit installs the nvidia-cdi-refresh systemd unit auto-generates /var/run/cdi/nvidia.yaml on install, driver update, and reboot; the manual command is below.2
# Generate a CDI spec for all GPUs/MIG devices on the host
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
# List the device names the spec exposes (nvidia.com/gpu=0, nvidia.com/gpu=all, MIG names)
nvidia-ctk cdi list
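Because the spec can live in either directory, it can help to see which copy is actually present before debugging a runtime. This is a small sketch only: list_cdi_specs is a hypothetical helper, and it looks just for the nvidia.yaml file name used on this page.

```shell
# list_cdi_specs [DIR...] (hypothetical helper): print each nvidia.yaml found in
# the given directories; defaults to the two spec locations named above.
# Returns non-zero when no spec file is found.
list_cdi_specs() {
  [ "$#" -gt 0 ] || set -- /etc/cdi /var/run/cdi
  local dir found=1
  for dir in "$@"; do
    if [ -f "$dir/nvidia.yaml" ]; then
      echo "$dir/nvidia.yaml"
      found=0
    fi
  done
  return "$found"
}
```

With no arguments it checks /etc/cdi and /var/run/cdi. Two lines of output mean both a static and a generated spec exist, which is worth knowing before regenerating either.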
5. Enable CDI in the runtime.
- Docker: CDI is enabled by default since Docker Engine 28.3.0. On Docker 25.0–28.2, set {"features": {"cdi": true}} in /etc/docker/daemon.json and restart the daemon.4
- containerd: enable CDI and point it at the spec dirs in the containerd config (CRI plugin). Confirm the exact key path for your containerd major version against the containerd config docs:
# containerd config.toml (CRI plugin) — verify key path against your containerd version
enable_cdi = true
cdi_spec_dirs = ["/etc/cdi", "/var/run/cdi"]
- Podman: CDI-native; no daemon flag. It reads /etc/cdi and /var/run/cdi directly.2
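The Docker version boundaries in the list above can be turned into a quick check. This is a sketch under assumptions: version_ge and cdi_action are hypothetical helpers, the comparison relies on GNU sort -V, and the input is expected to be a plain X.Y.Z version without distribution suffixes.

```shell
# version_ge A B (hypothetical helper): true when A >= B under GNU sort -V ordering.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# cdi_action VERSION (hypothetical helper): classify a Docker Engine version
# using the boundaries stated above (default-on from 28.3.0, flag from 25.0).
cdi_action() {
  if version_ge "$1" 28.3; then
    echo "default-on"
  elif version_ge "$1" 25.0; then
    echo "set features.cdi=true in daemon.json"
  else
    echo "no CDI support documented here"
  fi
}

# Possible use on a Docker host:
#   cdi_action "$(docker version --format '{{.Server.Version}}')"
```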
Forcing CDI mode for the NVIDIA runtime itself (rather than the engine) is available via sudo nvidia-ctk config --in-place --set nvidia-container-runtime.mode=cdi. Verify against the install guide before using on a fleet.1
Validated usage & tests
All examples below are reference templates, not hardware-tested. The smoke test is always the same: run a container that has nvidia-smi and confirm it prints the host GPU table. The driver/CUDA versions, GPU model, and process list in the output depend on the node. Do not expect fixed numbers.
Legacy --gpus path. Expose all GPUs; the utility capability is what makes nvidia-smi available in the container.3
# All GPUs
docker run -it --rm --gpus all ubuntu nvidia-smi
# Make nvidia-smi available explicitly via the utility capability
docker run --rm --gpus 'all,capabilities=utility' ubuntu nvidia-smi
# Specific GPU by index, multiple by index, or by UUID (from: nvidia-smi -L)
docker run -it --rm --gpus device=0 ubuntu nvidia-smi
docker run -it --rm --gpus '"device=0,2"' ubuntu nvidia-smi
docker run -it --rm --gpus device=GPU-3a23c669-1f69-c64e-cf85-44e9b07e7a2a ubuntu nvidia-smi
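Selecting by UUID usually means copying the value out of nvidia-smi -L by hand. The sketch below automates that lookup. gpu_uuid is a hypothetical helper, and it assumes each GPU line has the shape "GPU N: name (UUID: GPU-...)"; check the output on your node before relying on it.

```shell
# gpu_uuid INDEX (hypothetical helper): read `nvidia-smi -L` output on stdin and
# print the UUID of the GPU at INDEX. Assumes lines shaped like
#   GPU 0: <name> (UUID: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
gpu_uuid() {
  sed -n "s/^GPU $1: .*(UUID: \(GPU-[^)]*\)).*/\1/p"
}

# Possible use on a GPU host:
#   uuid=$(nvidia-smi -L | gpu_uuid 0)
#   docker run -it --rm --gpus "device=$uuid" ubuntu nvidia-smi
```

An empty result means no line matched the requested index, so test for that before passing the value to --gpus.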
CUDA base image (closer to a real workload). NVIDIA publishes CUDA base images on NGC (nvcr.io/nvidia/cuda) and Docker Hub (nvidia/cuda). Pick a tag whose CUDA series the host driver supports.5
# Verify GPU visibility from a CUDA runtime image (pin the tag to one the host driver supports)
docker run --rm --gpus all nvcr.io/nvidia/cuda:13.0.0-base-ubuntu24.04 nvidia-smi
CDI path (Docker 25+ / Podman). Request the device by its CDI name; no --gpus, no --runtime needed once CDI is enabled.2
# Docker (CDI enabled): inject by CDI device name
docker run --rm --device nvidia.com/gpu=all nvcr.io/nvidia/cuda:13.0.0-base-ubuntu24.04 nvidia-smi
# Podman (CDI-native)
podman run --rm --device nvidia.com/gpu=all --security-opt=label=disable ubuntu nvidia-smi -L
Expected result (all paths). The container prints the nvidia-smi table: driver version and CUDA version in the header, one row per visible GPU (name, memory, utilisation), and usually No running processes found. Index/UUID selection should show only the requested GPUs. If you used --device nvidia.com/gpu=0, exactly one GPU appears. For the meaning of each field see nvidia-smi Reference. A non-zero exit or an empty/Failed to initialize NVML line means the path is broken. Go to Failure modes.
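The pass/fail rule above is simple enough to script for fleet checks. This is a minimal sketch, assuming the caller has already captured the container's exit status and combined output; smoke_ok is a hypothetical helper, not a toolkit command.

```shell
# smoke_ok STATUS OUTPUT (hypothetical helper): apply the rule above to a
# captured exit status and captured output from the nvidia-smi container.
smoke_ok() {
  local status="$1" output="$2"
  [ "$status" -eq 0 ] || { echo "FAIL: exit status $status"; return 1; }
  [ -n "$output" ] || { echo "FAIL: empty output"; return 1; }
  case "$output" in
    *"Failed to initialize NVML"*)
      echo "FAIL: NVML did not initialise"
      return 1
      ;;
  esac
  echo "PASS"
}

# Possible use on a GPU host:
#   out=$(docker run --rm --gpus all ubuntu nvidia-smi 2>&1); smoke_ok "$?" "$out"
```

It deliberately does not parse the table, since the driver version, GPU model and process list vary by node.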
Failure modes
- "nvidia-smi: command not found" inside the container: the image lacks the binary or the utility capability was not injected. Add capabilities=utility (or use a CUDA base image), or set NVIDIA_DRIVER_CAPABILITIES appropriately.3
- "Failed to initialize NVML: Unknown Error" / "could not select device driver "" with capabilities: [[gpu]]": runtime not wired. Re-run the matching nvidia-ctk runtime configure and restart the daemon. If the host itself shows no devices, the kernel/driver is the problem first (Kernel Upgrade: GPU Missing, NVIDIA Kernel Modules).
- "unresolvable CDI devices nvidia.com/gpu=all: unknown": CDI not enabled in the engine, or no spec present. Enable CDI (Docker features.cdi) or generate the spec, then confirm with nvidia-ctk cdi list.2
- Stale CDI spec after a driver upgrade or MIG change: nvidia-cdi-refresh does not handle driver removal or MIG reconfiguration; regenerate manually with nvidia-ctk cdi generate (see Rolling Driver / CUDA Upgrade, MIG).2
- CDI hook conflict: if both the legacy hook and CDI inject, you get duplicated devices; remove /usr/share/containers/oci/hooks.d/oci-nvidia-hook.json if present.2
- NVSwitch node, container sees GPUs but NCCL/peer-to-peer fails: this is a node-level fabric problem, not a toolkit issue. Check Fabric Manager and Fabric Manager Failure.
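The message-based entries in the list above lend themselves to a first-pass lookup. The sketch below only matches the literal messages quoted on this page and maps them to the remedies given here. triage is a hypothetical helper; it is no substitute for reading the full log.

```shell
# triage (hypothetical helper): read container or engine output on stdin and
# print the matching entry from the failure list above.
triage() {
  local text
  text="$(cat)"
  case "$text" in
    *"nvidia-smi: command not found"*)
      echo "image lacks nvidia-smi or the utility capability was not injected"
      ;;
    *"Failed to initialize NVML"*|*"could not select device driver"*)
      echo "runtime not wired: re-run nvidia-ctk runtime configure and restart the daemon"
      ;;
    *"unresolvable CDI devices"*)
      echo "CDI not enabled or no spec present: check with nvidia-ctk cdi list"
      ;;
    *)
      echo "no known signature matched"
      return 1
      ;;
  esac
}

# Possible use:
#   docker run --rm --gpus all ubuntu nvidia-smi 2>&1 | triage
```

Stale specs, hook conflicts and fabric problems have no single error string, so they fall through to the unmatched branch and need the manual checks listed above.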
References
- Installing the NVIDIA Container Toolkit (apt repo, pinned install, nvidia-ctk runtime configure for docker/containerd, drop-in path): https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
- Support for Container Device Interface (CDI) (nvidia-ctk cdi generate/list, /etc/cdi & /var/run/cdi, nvidia-cdi-refresh, Podman/Docker examples, regeneration after MIG/driver changes, hook conflict): https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/cdi-support.html
- Running a Sample Workload (the nvidia-smi container verification): https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/sample-workload.html
- Docker: GPU access (--gpus all, device=, UUID, capabilities=utility; requires toolkit + driver): https://docs.docker.com/engine/containers/gpu/
- Docker: dockerd reference (features.cdi, cdi-spec-dirs, default-on since Engine 28.3.0): https://docs.docker.com/reference/cli/dockerd/
- NVIDIA Container Toolkit releases (latest stable 1.19.1): https://github.com/NVIDIA/nvidia-container-toolkit/releases
- CNCF Container Device Interface spec: https://github.com/cncf-tags/container-device-interface
- NVIDIA CUDA container images (NGC / Docker Hub): https://catalog.ngc.nvidia.com/orgs/nvidia/containers/cuda
Related: GPU Software Stack and Node Administration · Frameworks · CUDA Driver · MIG · Container Image Provenance · Glossary
1. NVIDIA Container Toolkit install guide: apt repo configuration (gpgkey + nvidia-container-toolkit.list), apt-get update, pinned install of nvidia-container-toolkit{,-base} / libnvidia-container-tools / libnvidia-container1 at 1.19.1-1; nvidia-ctk runtime configure --runtime=docker edits /etc/docker/daemon.json; --runtime=containerd writes /etc/containerd/conf.d/99-nvidia.toml; nvidia-ctk config --in-place --set nvidia-container-runtime.mode=cdi forces CDI mode. https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
2. NVIDIA Container Toolkit CDI support: sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml (or /var/run/cdi/nvidia.yaml), nvidia-ctk cdi list; spec dirs /etc/cdi (static) and /var/run/cdi (generated); nvidia-cdi-refresh auto-generates on install/driver-update/reboot but does NOT handle driver removal or MIG reconfiguration (regenerate manually); Podman podman run --device nvidia.com/gpu=all; remove /usr/share/containers/oci/hooks.d/oci-nvidia-hook.json to avoid hook+CDI double injection. https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/cdi-support.html
3. Docker GPU access: requires installed NVIDIA driver + NVIDIA Container Toolkit; docker run -it --rm --gpus all ubuntu nvidia-smi; --gpus device=0; --gpus '"device=0,2"'; --gpus device=GPU-<uuid> (UUIDs from nvidia-smi -L); docker run --gpus 'all,capabilities=utility' --rm ubuntu nvidia-smi. https://docs.docker.com/engine/containers/gpu/
4. Docker dockerd reference: {"features":{"cdi":true}} enables CDI device injection; cdi-spec-dirs defaults to /etc/cdi/ and /var/run/cdi; "CDI is currently only supported for Linux containers and is enabled by default since Docker Engine 28.3.0." https://docs.docker.com/reference/cli/dockerd/
5. NVIDIA CUDA images are published on NGC as nvcr.io/nvidia/cuda (and nvidia/cuda on Docker Hub) with -base/-runtime/-devel variants per CUDA version and Ubuntu base, e.g. 13.0.0-base-ubuntu24.04. Pick a tag whose CUDA series the host driver supports. https://catalog.ngc.nvidia.com/orgs/nvidia/containers/cuda
6. NVIDIA Container Toolkit releases: latest stable v1.19.1 (May 2025), a unified release of libnvidia-container 1.19.1 and nvidia-container-toolkit 1.19.1. https://github.com/NVIDIA/nvidia-container-toolkit/releases