GPU containerization performance¶

Scope: Reaching near-bare-metal GPU throughput inside containers via host-driver/container-CUDA splitting, OverlayFS I/O avoidance, slim images, and rootless Apptainer.

flowchart TB
    App["GPU workload<br/>(CUDA runtime: libcudart / cuDNN / NCCL)"]
    Runtime["Container runtime<br/>(Docker / containerd / CRI-O / Podman / Apptainer)"]
    Toolkit["NVIDIA Container Toolkit / CDI<br/>injects driver libs + device nodes"]
    Driver["Host NVIDIA kernel + user-mode driver<br/>(shared, no hypervisor)"]
    GPU["GPU"]

    App --> Runtime
    Runtime --> Toolkit
    Toolkit -->|"libcuda.so, /dev/nvidia*, gsp*.bin"| Driver
    Driver --> GPU

    Compat["Driver branch >= container-CUDA minimum"]
    Bind["Bind mounts bypass OverlayFS CoW"]
    Compat -. "mismatch fails CUDA init" .-> App
    Bind -. "avoid I/O tax" .-> Runtime

    GPU --> Result["~2% overhead vs bare metal"]

What it is¶

A GPU container shares the host kernel and the host NVIDIA kernel driver directly; there is no hypervisor and no device emulation between the workload and the GPU. The container ships only the CUDA runtime layer (libcudart.so, cuDNN, NCCL, cuBLAS); the host supplies the driver layer. The NVIDIA Container Toolkit bridges the two: at container start it injects the host's user-mode driver libraries (libcuda.so, libnvidia-ml.so) plus the device nodes (/dev/nvidia0, /dev/nvidiactl, /dev/nvidia-uvm) into the container's mount namespace.
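One way to confirm the injection described above is to check, from inside a running container, for the device nodes and driver libraries the toolkit is expected to provide. The sketch below is illustrative and not part of the toolkit: missing_gpu_nodes is a hypothetical helper, its optional root-prefix argument exists only to make it easy to exercise, and /dev/nvidia0 assumes that GPU index 0 is the one exposed.

```bash
# Hypothetical helper: print every expected NVIDIA device node that is absent.
# ROOT is an optional path prefix (leave it empty inside a real container).
missing_gpu_nodes() {
  local root="${1:-}" node
  for node in /dev/nvidiactl /dev/nvidia-uvm /dev/nvidia0; do
    [ -e "$root$node" ] || echo "$node"
  done
}

# Typical use inside the container:
#   missing_gpu_nodes      # no output means all three nodes are present
#   ldconfig -p | grep -E 'libcuda\.so|libnvidia-ml\.so'   # injected user-mode libs
```

Empty output from the helper, plus both libraries appearing in the ldconfig listing, suggests the driver layer was injected as expected.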

Because the kernel and driver are shared, a kernel launched from inside the container executes as if launched from the host. Measured GPU performance in a properly configured container is within ~2% of bare metal. MLPerf Inference v5.0 results produced on Kubernetes/OpenShift support the same conclusion: containers need not compromise GPU efficiency or latency.¹

On modern toolkit versions this injection is expressed through the Container Device Interface (CDI): a generated spec at /var/run/cdi/nvidia.yaml enumerates the exact driver libraries, firmware (gsp*.bin), IPC sockets, and device nodes to mount. CDI is runtime-agnostic and works with Docker, containerd, CRI-O, and Podman.²
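As a rough illustration of working with CDI specs: the CDI specification's default search directories are /etc/cdi and /var/run/cdi, and the commands in the trailing comments follow NVIDIA's CDI documentation. find_cdi_specs is a hypothetical helper written for this sketch, and the ubuntu image in the Podman example is an arbitrary choice.

```bash
# Hypothetical helper: list CDI spec files in the given directories
# (defaults to the two standard CDI locations).
find_cdi_specs() {
  [ "$#" -gt 0 ] || set -- /etc/cdi /var/run/cdi
  local d f
  for d in "$@"; do
    for f in "$d"/*.yaml "$d"/*.json; do
      # An unmatched glob stays literal, so keep only real files.
      if [ -f "$f" ]; then echo "$f"; fi
    done
  done
}

# Generating and using a spec (requires the NVIDIA Container Toolkit):
#   sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
#   nvidia-ctk cdi list
#   podman run --rm --device nvidia.com/gpu=all ubuntu nvidia-smi -L
```

If find_cdi_specs prints nothing, no spec has been generated yet and a CDI device request such as nvidia.com/gpu=all will not resolve.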

Why it matters¶

Two failure modes erase the near-zero overhead:

Driver/CUDA mismatch. If the host driver is older than the CUDA runtime in the image requires, CUDA initialization fails outright, typically with CUDA_ERROR_INSUFFICIENT_DRIVER ("CUDA driver version is insufficient for CUDA runtime version"). Images that bundle the forward-compatibility package may instead report CUDA_ERROR_SYSTEM_DRIVER_MISMATCH or "forward compatibility was attempted on non supported HW". Either way, the job never starts.
OverlayFS write amplification. Container images stack read-only layers under a writable layer via a union filesystem (OverlayFS). Path lookups have to search the stacked layers (extra metadata work), and the first modification of a file that lives in a lower layer triggers copy-on-write: the whole file is copied up to the writable layer before the write lands.¹ Heavy dataset and checkpoint I/O routed through this layer pays that overhead and can starve the GPU.

At cluster scale, system-level tuning like this can yield double-digit percentage throughput gains and save tens to hundreds of thousands of dollars in compute time on a large run.¹

When it is needed (and when not)¶

Needed for any reproducible GPU training/inference deployment: containers pin CUDA/cuDNN/NCCL versions and kill the "works on my machine" class of bug.
Bind mounts needed whenever the workload does heavy dataset reads or checkpoint writes, i.e. essentially all training.
Apptainer/Singularity preferred in HPC/multi-tenant centers that forbid a root container daemon, or where the host filesystem must be used directly with no union-FS layer in between.
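Not a memory-isolation tool. Docker's --gpus flag selects and exposes GPUs but provides no GPU-memory limit.¹ For hardware-enforced compute and memory isolation use MIG (exposed in Kubernetes as nvidia.com/mig-* resources). MPS shares one GPU among processes with softer per-client limits, so treat it as a sharing mechanism rather than hard isolation. Container resource limits provide neither.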
Image-slimming is low priority for long jobs: a multi-minute pull is negligible against hours-to-days of training. Still worth doing for disk and registry hygiene, but do not over-invest.

How: implement, integrate, maintain¶

1. Respect the host-driver >= container-CUDA-minimum rule¶

The host driver branch must be at least the minimum required by the CUDA version inside the image. Per NVIDIA's CUDA compatibility documentation:

| CUDA toolkit in image | Minimum host driver branch (Linux) |
|---|---|
| CUDA 13.x | R580 or newer |
| CUDA 12.x | R525 or newer |

A newer CUDA runtime on an older driver fails to initialize.¹ Always consult the supported-drivers matrix when bumping the toolkit, and upgrade the host driver first.

Verify the host driver and the in-container view:

# Host
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# Inside a container (toolkit injects the host driver libs)
docker run --rm --gpus all nvcr.io/nvidia/pytorch:25.04-py3 nvidia-smi

Prefer NGC base images (nvcr.io/nvidia/pytorch, etc.): they bundle matched CUDA/cuDNN/NCCL and document their minimum driver. The toolkit then injects the host driver at runtime, so the driver itself need not be baked into the image.
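The minimum-driver rule can be turned into a pre-flight gate. The following is a minimal sketch under stated assumptions: it encodes only the two rows of the table above, compares driver branches (the leading number of the version string) and ignores patch-level minimums, and both function names are hypothetical. NVIDIA's supported-drivers matrix remains the authority.

```bash
# Hypothetical helper: minimum Linux driver branch for a CUDA major version,
# taken from the two-row table above. Unknown majors return non-zero.
min_driver_branch() {
  case "$1" in
    13) echo 580 ;;
    12) echo 525 ;;
    *)  return 1 ;;
  esac
}

# driver_satisfies DRIVER_VERSION CUDA_VERSION
# Exit 0 if the driver branch is new enough, 1 if too old, 2 if CUDA major is unknown.
driver_satisfies() {
  local driver_branch="${1%%.*}"   # "570.86.15" -> "570"
  local cuda_major="${2%%.*}"      # "12.8"      -> "12"
  local min
  min="$(min_driver_branch "$cuda_major")" || return 2
  [ "$driver_branch" -ge "$min" ]
}

# Usage on a GPU host:
#   drv="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
#   driver_satisfies "$drv" 12.8 || echo "upgrade the host driver first"
```

Running the gate in a job prologue fails fast on the host, before an image pull and a CUDA initialization error inside the container.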

2. Bypass OverlayFS with bind mounts¶

Do not bake multi-terabyte datasets into the image. Bring data and output directories in through bind mounts, which write straight to the host filesystem and bypass the union-FS copy-on-write layer entirely, delivering the underlying device's full bandwidth (NVMe, NFS, Lustre).¹

docker run --rm --gpus all \
  -v /data/dataset:/mnt/dataset:ro \
  -v /data/checkpoints:/mnt/ckpt \
  nvcr.io/nvidia/pytorch:25.04-py3 \
  python train.py --data /mnt/dataset --out /mnt/ckpt

:ro mounts the dataset read-only (documented Docker bind-mount option).³ The checkpoint mount stays writable. Keep all heavy reads/writes on bind mounts, never on the container's writable layer.
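To check that a path really is bind-mounted rather than sitting on the container's writable layer, one option is to inspect the filesystem type that backs it from inside the container. This is a sketch, assuming GNU df (its --output option is not available in BSD df); is_union_fs and check_io_path are hypothetical helper names.

```bash
# Hypothetical helper: succeed if the filesystem type is a union filesystem.
is_union_fs() {
  case "$1" in
    overlay|overlayfs|aufs) return 0 ;;
    *) return 1 ;;
  esac
}

# check_io_path PATH
# Exit 0 if PATH is on a non-union filesystem, 1 if on a union FS, 2 if unknown.
check_io_path() {
  local fstype
  fstype="$(df --output=fstype "$1" 2>/dev/null | tail -n 1)"
  if [ -z "$fstype" ]; then
    echo "UNKNOWN $1"; return 2
  elif is_union_fs "$fstype"; then
    echo "WARN $1 is on $fstype (container writable layer)"; return 1
  fi
  echo "OK $1 is on $fstype"
}

# Usage inside the container started above:
#   check_io_path /mnt/dataset && check_io_path /mnt/ckpt
```

A WARN line for /mnt/dataset or /mnt/ckpt would indicate that the corresponding -v flag was dropped and I/O is going through OverlayFS.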

3. Keep images slim¶

Use multi-stage builds and leave compilers and build artifacts out of the runtime image. Smaller images pull and start faster and take less registry and disk space. This is secondary to correctness, but it is cheap hygiene.
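A multi-stage build might look roughly like the sketch below, written as a shell function that emits the Dockerfile so it stays in the same language as the other examples. Everything specific here is an assumption for illustration: the nvidia/cuda image tags, the vector_add.cu source file, and the write_multistage_dockerfile name are placeholders.

```bash
# Hypothetical sketch: write a two-stage Dockerfile into DIR (default: current dir).
# Image tags and vector_add.cu are placeholders chosen for illustration.
write_multistage_dockerfile() {
  local dir="${1:-.}"
  cat > "$dir/Dockerfile" <<'EOF'
# Stage 1: full CUDA toolchain (nvcc, headers). Never shipped.
FROM nvidia/cuda:12.4.1-devel-ubuntu22.04 AS build
WORKDIR /src
COPY vector_add.cu .
RUN mkdir -p /out && nvcc -O2 -o /out/vector_add vector_add.cu

# Stage 2: minimal base image; only the compiled binary is copied in.
FROM nvidia/cuda:12.4.1-base-ubuntu22.04
COPY --from=build /out/vector_add /usr/local/bin/vector_add
ENTRYPOINT ["/usr/local/bin/vector_add"]
EOF
}

# Usage:
#   write_multistage_dockerfile . && docker build -t vector-add:slim .
```

The compiler and headers exist only in the build stage, so the final image carries the binary plus the base image's runtime libraries; the host driver is still injected at run time as described in step 1.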

4. Rootless Apptainer for HPC / no-daemon environments¶

Apptainer (formerly Singularity) runs images in user space with no root daemon, executing with the calling user's privileges. It uses the host filesystem directly, so overhead beyond the OS is effectively nil.¹ GPU access uses --nv, which exposes the host NVIDIA driver libraries and device files inside the container; the host needs only the driver installed, not the CUDA toolkit.⁴

# Pull/convert an NGC image to an immutable SIF, then run rootless with GPUs
apptainer pull pytorch.sif docker://nvcr.io/nvidia/pytorch:25.04-py3
apptainer exec --nv \
  --bind /data/dataset:/mnt/dataset:ro \
  pytorch.sif python train.py --data /mnt/dataset

Prefer SIF images over sandbox directories on network filesystems: with sandbox layouts, rootless uid mapping can cause permission-denied errors against NFS/Lustre/GPFS.⁴

Maintain¶

Re-pin and re-test the host-driver/CUDA pair on every toolkit bump; treat the compatibility matrix as a gate.
Keep the NVIDIA driver persistent across jobs (persistence mode); unloading it forces the next job to pay driver-reload cost.¹
Audit that production launch specs use bind mounts for all dataset/checkpoint paths. A regression here silently routes I/O back through OverlayFS CoW.
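The persistence-mode item can be checked mechanically. The sketch below assumes nvidia-smi's persistence_mode query field, which reports Enabled or Disabled per GPU; all_persistent is a hypothetical helper that reads those values from stdin. Enabling the mode with nvidia-smi -pm 1 works, although NVIDIA's driver documentation favors the nvidia-persistenced daemon for this purpose.

```bash
# Hypothetical helper: read one persistence_mode value per line from stdin.
# Succeeds only if at least one line was read and every line is "Enabled".
all_persistent() {
  local line seen=0
  while IFS= read -r line; do
    seen=1
    [ "$line" = "Enabled" ] || return 1
  done
  [ "$seen" -eq 1 ]
}

# Usage on a GPU host:
#   nvidia-smi --query-gpu=persistence_mode --format=csv,noheader | all_persistent \
#     || sudo nvidia-smi -pm 1
```

Treating empty input as a failure means a host where nvidia-smi prints nothing is flagged rather than silently passing.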

References¶

Chris Fregly, AI Systems Performance Engineering (O'Reilly), Chapter 3, "OS, Docker, and Kubernetes Tuning for GPU-Based Environments" — Container Runtime Optimizations for GPUs; NVIDIA Container Toolkit and CUDA Compatibility; Avoiding Container Overlay Filesystem Overhead; Reduce Image Size.
NVIDIA Container Toolkit — Install Guide: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
NVIDIA Container Toolkit — CDI support: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/cdi-support.html
NVIDIA CUDA Compatibility: https://docs.nvidia.com/deploy/cuda-compatibility/latest/
NVIDIA Data Center Drivers — Supported Drivers and CUDA Toolkit Versions: https://docs.nvidia.com/datacenter/tesla/drivers/supported-drivers-and-cuda-toolkit-versions.html
Docker — Bind mounts (read-only :ro): https://docs.docker.com/engine/storage/bind-mounts/
Apptainer User Guide — GPU support (--nv) and SIF: https://apptainer.org/docs/user/main/gpu.html

¹ Fregly, AI Systems Performance Engineering, Ch. 3: Container Runtime Optimizations for GPUs (near bare-metal, <2% overhead, MLPerf v5.0), CUDA/driver minimums and runtime/driver library split, OverlayFS CoW overhead and bind-mount workaround, --gpus has no memory limit, image-size guidance, Apptainer rootless host-filesystem use.
² NVIDIA Container Toolkit: Container Device Interface support. https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/cdi-support.html
³ Docker: Bind mounts; :ro/readonly option. https://docs.docker.com/engine/storage/bind-mounts/
⁴ Apptainer User Guide: GPU support (--nv) and SIF vs sandbox on network filesystems. https://apptainer.org/docs/user/main/gpu.html