Smoke tests: GPU platform

Scope: a consolidated acceptance suite for a freshly built GPU platform. It covers five checks: a CUDA pod running nvidia-smi, an nccl-tests Job over RDMA (GDRDMA confirmed via NCCL_DEBUG=INFO), a MIG-instance pod, a DRA claim binding, and a gang Job that places all pods or none. Each check comes with the one signal that means pass.

The manifests are reference templates adapted from the GPU Operator and Network Operator docs, the Kubernetes DRA docs, Volcano, the Kubeflow MPI Operator, and the NVIDIA/nccl-tests and coreweave/nccl-tests repos. Nothing here was executed on hardware. Pin every chart, image, and CRD version; substitute real node names, resource keys (rdma/ib vs rdma/roce), and HCA filters before applying. Run on one node (or one node pair) before a fleet roll. This is the acceptance gate for the build in Kubernetes & Helm: GPU platform; it complements the host-level fabric proof in fabric bring-up and benchmarking.

flowchart LR
  T1["1 CUDA pod: nvidia-smi"] --> T2["2 nccl-tests over RDMA"]
  T2 --> T3["3 MIG-instance pod"]
  T3 --> T4["4 DRA claim binds"]
  T4 --> T5["5 Gang Job places all pods"]
  T5 --> OK{"all 5 pass?"}
  OK -->|"yes"| ACCEPT["Platform accepted"]
  OK -->|"no"| TRIAGE["Triage the failed layer"]

What it is

Five tests, one per platform layer, run bottom-up. Each assumes the layers beneath it have already passed, so a failure points at one component rather than the whole stack:

1. CUDA pod. The GPU Operator advertised nvidia.com/gpu and the runtime injects a working GPU. Proves device plugin + container toolkit + driver.
2. nccl-tests Job over RDMA. The Network Operator wired an RDMA device into pods and NCCL takes the GPUDirect RDMA path, not TCP. Proves the fabric is usable from inside a pod.
3. MIG-instance pod. A partitioned GPU is schedulable as the resource its strategy advertises. Proves the sharing model.
4. DRA claim. A ResourceClaim binds to a real device and the pod runs. Proves the Dynamic Resource Allocation path (Kubernetes 1.34+).⁷
5. Gang Job. A multi-pod job starts only when every pod can be placed. Proves the scheduler does all-or-nothing placement, so a distributed job cannot half-start and deadlock.

This is GPU-platform acceptance inside Kubernetes. The host-level proof (links up, subnet manager converged, perftest/nccl-tests at line rate on bare metal) lives in fabric bring-up and benchmarking; the GPU-health adjudication (dcgmi diag, nvbandwidth, gpu-burn) lives in diagnostics tools. Run those first: a smoke test should fail on the platform plumbing, not on silicon that was never healthy.

Prerequisites

The platform is built and each layer already passed its own install check (Kubernetes & Helm: GPU platform): GPU Operator, Network Operator + NicClusterPolicy, a sharing model (MIG or time-slicing), the DRA driver, and Volcano.
kubectl against the target cluster; a namespace to run in (smoke below). Create it once: kubectl create namespace smoke.
Nodes report GPU capacity: kubectl get nodes -o custom-columns=NODE:.metadata.name,GPU:.status.allocatable.'nvidia\.com/gpu' shows a non-zero count.¹
For test 2: the Network Operator is installed and a NicClusterPolicy exposes an RDMA resource (here rdma/ib; use rdma/roce on RoCE) and a secondary network for the pods.³ The MPI Operator (kubeflow.org/v2beta1) is installed for the MPIJob kind.⁴
For test 3: MIG is enabled on at least one node via the nvidia.com/mig.config label and the GPU Operator mig.strategy is set; this page assumes single.²
For test 4: Kubernetes 1.34+ with the DynamicResourceAllocation feature, the NVIDIA DRA driver installed, and a DeviceClass (gpu.nvidia.com) present (kubectl get deviceclasses).⁷ ⁹
For test 5: Volcano is installed and kubectl get pods -n volcano-system shows the scheduler, controller, and admission pods Running.¹¹
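The node-capacity prerequisite can be scripted. The following is a minimal sketch, assuming the two-column table printed by the custom-columns command in the list above; the function name nodes_without_gpu is illustrative and not part of any tool. CPU-only nodes legitimately show <none>, so treat the output as a list to review rather than an automatic failure.

```shell
# Reads the "NODE GPU" table (header line plus one row per node) on stdin
# and prints every node whose GPU column is empty, <none>, or 0.
# Returns 1 if at least one such node was printed, 0 otherwise.
nodes_without_gpu() {
  awk 'NR > 1 && ($2 == "" || $2 == "<none>" || $2 == "0") { print $1; bad = 1 }
       END { exit bad }'
}

# Usage against a live cluster (same command as in the prerequisite):
# kubectl get nodes -o custom-columns=NODE:.metadata.name,GPU:.status.allocatable.'nvidia\.com/gpu' \
#   | nodes_without_gpu
```

If every node that is supposed to carry GPUs is absent from the output, the prerequisite holds.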

The manifests

1. CUDA pod: `nvidia-smi`

apiVersion: v1
kind: Pod
metadata:
  name: smoke-cuda
  namespace: smoke
spec:
  restartPolicy: Never
  containers:
    - name: smi
      image: nvidia/cuda:13.0.0-base-ubuntu24.04   # pin to your CUDA base
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1

2. `nccl-tests` Job over RDMA (GDRDMA via `NCCL_DEBUG=INFO`)

Two-worker all_reduce_perf under the MPI Operator. The launcher sets NCCL_DEBUG=INFO so the log states which transport NCCL chose; the workers request nvidia.com/gpu and the RDMA resource. Resource keys, NCCL_IB_HCA, slotsPerWorker, and the test size mirror the coreweave/nccl-tests H100 example; adjust to your GPUs-per-node and HCA naming.⁵

apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: smoke-nccl
  namespace: smoke
spec:
  slotsPerWorker: 8                 # = GPUs per worker node
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
            - name: launcher
              image: ghcr.io/coreweave/nccl-tests:13.2.1-devel-ubuntu24.04-nccl2.30.4-1-61f74e3
              command: ["/bin/bash", "-c"]
              args:
                - >
                  mpirun --allow-run-as-root -np 16 -bind-to none
                  -x LD_LIBRARY_PATH
                  -x NCCL_DEBUG=INFO
                  -x NCCL_SOCKET_IFNAME=eth0
                  -x NCCL_IB_HCA=ibp
                  /opt/nccl_tests/build/all_reduce_perf -b 512M -e 8G -f 2 -g 1
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - name: worker
              image: ghcr.io/coreweave/nccl-tests:13.2.1-devel-ubuntu24.04-nccl2.30.4-1-61f74e3
              resources:
                limits:
                  nvidia.com/gpu: 8
                  rdma/ib: 1          # RoCE clusters: rdma/roce
                  memory: 960Gi
              volumeMounts:
                - { name: dshm, mountPath: /dev/shm }
          volumes:
            - name: dshm
              emptyDir: { medium: Memory }
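
The -np value in the launcher has to agree with the worker count and slotsPerWorker: with -g 1 each MPI rank drives one GPU, so the manifest above needs 2 x 8 = 16 ranks. A small sketch of that arithmetic, useful when scaling the template to a different node count; the function names are illustrative.

```shell
# Total MPI ranks for the launcher: one rank per GPU across all workers.
# $1 = Worker replicas, $2 = slotsPerWorker (GPUs per worker node).
mpi_total_ranks() {
  echo $(( $1 * $2 ))
}

# Succeeds only if the -np value ($3) equals replicas x slotsPerWorker.
check_np() {
  [ "$(mpi_total_ranks "$1" "$2")" -eq "$3" ]
}

# Example: check_np 2 8 16 succeeds; check_np 2 8 64 fails, because 64 ranks
# would need 8 workers of 8 GPUs each.
```

A mismatch here tends to show up as mpirun refusing to start or as idle GPUs, which is easy to misread as a fabric problem.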

3. MIG-instance pod

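Under the MIG single strategy the device plugin advertises MIG slices as the ordinary nvidia.com/gpu resource, so the pod spec is unchanged; only the meaning of one "GPU" shrinks to one slice.² Pin the pod to a MIG node with the nvidia.com/mig.config.state: success label, which the MIG manager sets once it has finished applying the geometry. A node whose config is a non-MIG one such as all-disabled can report success as well, so on a mixed fleet also select on the nvidia.com/mig.config value you expect.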

apiVersion: v1
kind: Pod
metadata:
  name: smoke-mig
  namespace: smoke
spec:
  restartPolicy: Never
  nodeSelector:
    nvidia.com/mig.config.state: success     # MIG manager finished applying the geometry
  containers:
    - name: smi
      image: nvidia/cuda:13.0.0-base-ubuntu24.04
      command: ["nvidia-smi", "-L"]          # lists the MIG device UUID
      resources:
        limits:
          nvidia.com/gpu: 1

Under MIG mixed strategy, request the profile resource instead, e.g. limits: { nvidia.com/mig-1g.10gb: 1 }.²

4. DRA claim binding

A ResourceClaimTemplate requests one GPU with at least 40 GiB of memory; a Job references it. resource.k8s.io/v1 is the API version that went GA in Kubernetes 1.34; the exactly block, deviceClassName, and the CEL selectors are the documented claim fields.⁷ ⁸

apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: smoke-gpu-40g
  namespace: smoke
spec:
  spec:
    devices:
      requests:
        - name: gpu
          exactly:
            deviceClassName: gpu.nvidia.com
            selectors:
              - cel:
                  expression: 'device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("40Gi")) >= 0'
---
apiVersion: batch/v1
kind: Job
metadata:
  name: smoke-dra
  namespace: smoke
spec:
  template:
    spec:
      restartPolicy: Never
      resourceClaims:
        - name: gpu
          resourceClaimTemplateName: smoke-gpu-40g
      containers:
        - name: smi
          image: nvidia/cuda:13.0.0-base-ubuntu24.04
          command: ["nvidia-smi", "-L"]
          resources:
            claims:
              - name: gpu

5. Gang Job: place all pods or none

A Volcano Job with minAvailable equal to the worker count starts only when every pod can be scheduled together; schedulerName: volcano routes it to the gang scheduler.¹⁰ Set nvidia.com/gpu so placement actually competes for GPUs.

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: smoke-gang
  namespace: smoke
spec:
  minAvailable: 4               # all 4 pods, or none, start
  schedulerName: volcano
  queue: default
  plugins:
    svc: []
    env: []
  tasks:
    - replicas: 4
      name: worker
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: smi
              image: nvidia/cuda:13.0.0-base-ubuntu24.04
              command: ["nvidia-smi"]
              resources:
                limits:
                  nvidia.com/gpu: 1

Configuration

The fields that change per cluster. Get these wrong and the test fails for the wrong reason:

| Field | Where | Meaning | Set to |
|---|---|---|---|
| `nvidia.com/gpu` | pod `resources.limits` | whole-GPU (or MIG `single` slice) request² | `1` per device |
| `nvidia.com/mig-<g>.<mem>gb` | pod `resources.limits` | MIG `mixed`-strategy profile resource² | one profile per container |
| `nvidia.com/mig.config` | node label | MIG geometry the MIG manager applies² | e.g. `all-1g.10gb` |
| `mig.strategy` | GPU Operator Helm value | how MIG devices are advertised² | `single` or `mixed` |
| `rdma/ib` · `rdma/roce` | worker `resources.limits` | RDMA device from the `NicClusterPolicy`³ | `1`; key matches the policy |
| `slotsPerWorker` | `MPIJob.spec`⁴ | MPI slots per worker = GPUs/node | GPUs per worker |
| `NCCL_IB_HCA` | launcher env | which HCAs NCCL uses⁶ | your HCA prefix, e.g. `ibp` / `mlx5` |
| `NCCL_DEBUG` | launcher env | NCCL log verbosity⁶ | `INFO` (to read the transport) |
| `deviceClassName` | `ResourceClaim` request⁸ | DRA device class to match | `gpu.nvidia.com` |
| `selectors[].cel.expression` | `ResourceClaim` request⁸ | CEL device filter (e.g. memory) | per workload |
| `minAvailable` | Volcano `Job.spec`¹⁰ | gang size: pods that must co-schedule | = total worker replicas |
| `schedulerName` | Volcano `Job.spec`¹⁰ | scheduler that places the job | `volcano` |

Apply & verify

Run in order; do not proceed past a failure.
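The stop-at-first-failure rule can be enforced with a small wrapper. This is a sketch only: run_in_order and the step names (test_cuda and so on) are hypothetical, and each step is assumed to be a shell function that wraps one test's apply and verify commands and returns non-zero on failure.

```shell
# Runs each named step in the order given and stops at the first one that
# returns non-zero, printing which step failed.
run_in_order() {
  for step in "$@"; do
    if "$step"; then
      echo "PASS $step"
    else
      echo "FAIL $step (stopping)"
      return 1
    fi
  done
  echo "ALL PASS"
}

# Example step (hypothetical wrapper around the test 1 commands):
# test_cuda() {
#   kubectl apply -f smoke-cuda.yaml &&
#   kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/smoke-cuda -n smoke --timeout=120s
# }
# run_in_order test_cuda test_nccl test_mig test_dra test_gang
```

Stopping early keeps the failure attributable to one layer; later tests would only fail for inherited reasons.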

1. CUDA pod

kubectl apply -f smoke-cuda.yaml
kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/smoke-cuda -n smoke --timeout=120s
kubectl logs smoke-cuda -n smoke

Pass: the nvidia-smi table prints with the expected GPU model and driver. A pod stuck Pending with Insufficient nvidia.com/gpu means the device plugin did not advertise the resource; fix the GPU Operator before anything else.¹

2. nccl-tests over RDMA

kubectl apply -f smoke-nccl.yaml
kubectl get mpijob smoke-nccl -n smoke
kubectl logs -f job/smoke-nccl-launcher -n smoke

Pass: two signals. First, the log shows GPUDirect RDMA engaged, with lines of the form NET/IB/<n>/GDRDMA and GPU Direct RDMA Enabled for GPU <id> / HCA <id>. Second, all_reduce_perf completes with a busbw figure, ending with # Out of bounds values : 0 OK and an # Avg bus bandwidth line.⁶ A NET/Socket line (or no GDRDMA) means NCCL fell back to TCP, so the RDMA device or NCCL_IB_HCA is wrong; the collective will run an order of magnitude slower (see fabric bring-up, diagnostics tools).
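Both signals can be checked mechanically from a saved launcher log. A minimal sketch, assuming the log line shapes quoted above; check_nccl_log is an illustrative name, and the patterns may need adjusting if your NCCL or nccl-tests version formats these lines differently.

```shell
# Reads a launcher log on stdin and checks the two pass signals.
# Prints one line per finding; returns 0 only if both are present.
check_nccl_log() {
  log=$(cat)
  rc=0
  if printf '%s\n' "$log" | grep -q 'NET/IB/[0-9][0-9]*/GDRDMA'; then
    echo "ok: GDRDMA transport in use"
  else
    echo "fail: no NET/IB/<n>/GDRDMA line (TCP fallback or GPUDirect RDMA off)"
    rc=1
  fi
  if printf '%s\n' "$log" |
     grep -Eq 'Out of bounds values[[:space:]]*:[[:space:]]*0([[:space:]]|$)'; then
    echo "ok: 0 out-of-bounds values"
  else
    echo "fail: run did not finish with 0 out-of-bounds values"
    rc=1
  fi
  return "$rc"
}

# Usage: kubectl logs job/smoke-nccl-launcher -n smoke | check_nccl_log
```

The check confirms the transport and correctness only; whether the reported bus bandwidth is acceptable still has to be judged against the fabric's expected rate.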

3. MIG pod

kubectl apply -f smoke-mig.yaml
kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/smoke-mig -n smoke --timeout=120s
kubectl logs smoke-mig -n smoke

Pass: nvidia-smi -L lists exactly one MIG device, e.g. MIG 1g.10gb Device 0: (UUID: MIG-...), proving the pod got a partition, not the whole GPU. Pending on a node with free whole GPUs but no MIG geometry means the nvidia.com/mig.config label has not been applied or has not reached state: success.²
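The "exactly one MIG device" condition is easy to assert from the pod log. A sketch under the assumption that MIG devices appear in nvidia-smi -L output with a UUID beginning MIG-, as in the example line above; check_mig_log is an illustrative name.

```shell
# Reads `nvidia-smi -L` output on stdin, counts the lines that carry a MIG
# UUID, prints the count, and returns 0 only if exactly one was found.
check_mig_log() {
  n=$(grep -c 'UUID: MIG-' || true)
  echo "MIG devices visible: $n"
  [ "$n" -eq 1 ]
}

# Usage: kubectl logs smoke-mig -n smoke | check_mig_log
```

A count of 0 means the pod saw a whole GPU; a count above 1 means more than one slice was exposed to the container.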

4. DRA claim

kubectl apply -f smoke-dra.yaml
kubectl get resourceclaims -n smoke          # generated claim should show allocated
kubectl describe resourceclaim -n smoke | sed -n '/Status/,/Events/p'
kubectl wait --for=condition=Complete job/smoke-dra -n smoke --timeout=180s
kubectl logs job/smoke-dra -n smoke          # nvidia-smi -L output from the bound device

Pass: the generated ResourceClaim reports a populated status.allocation (the device was bound) and the Job completes with nvidia-smi -L output.⁸ Because the pod exits within seconds, the generated claim may already have been released by the time you inspect it; in that case the Job completion and its log are the signal. A claim stuck with empty status.allocation and the pod Pending (waiting for resource claim) means the DRA driver is absent, the DeviceClass is missing, or the CEL selector matched no device; confirm kubectl get deviceclasses and the driver pods.⁹
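Claims with an empty allocation can be listed without reading describe output by eye. A sketch, assuming the claims are dumped as tab-separated "name, allocated devices" lines; the jsonpath in the usage comment relies on the status.allocation.devices.results[].device field of the resource.k8s.io/v1 API, and unallocated_claims is an illustrative name.

```shell
# Reads "claim-name<TAB>allocated-device-names" lines on stdin and prints
# each claim whose device column is empty.
# Exit status: 0 = every claim allocated, 1 = at least one unallocated,
# 2 = no claims were listed at all.
unallocated_claims() {
  awk -F '\t' 'NF { seen = 1 }
               NF && $2 == "" { print $1; bad = 1 }
               END { exit (bad ? 1 : (seen ? 0 : 2)) }'
}

# Usage:
# kubectl get resourceclaims -n smoke \
#   -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocation.devices.results[*].device}{"\n"}{end}' \
#   | unallocated_claims
```

Status 2 is kept separate from status 1 because an empty list after the Job has finished can simply mean the generated claim was cleaned up, not that allocation failed.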

5. Gang Job

kubectl apply -f smoke-gang.yaml
kubectl get podgroup -n smoke                 # Volcano creates a PodGroup for the job
kubectl get pods -n smoke -l volcano.sh/job-name=smoke-gang

Pass: all four pods transition to Running/Succeeded together; the PodGroup reaches phase Running (it sits Inqueue/Pending until minAvailable can be met).¹⁰ The proof of gang behaviour: on a cluster with only three free GPUs, no pod starts (not three), and kubectl describe podgroup shows not enough resources to schedule minMember. That "all-or-nothing" is the test; partial placement is a fail.
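The all-or-nothing verdict can be derived from a snapshot of pod phases. A minimal sketch, assuming "pod-name phase" lines as input; gang_verdict and its three verdict strings are illustrative. Because the pods here exit within seconds, take the snapshot once phases have settled rather than mid-transition.

```shell
# Reads "pod-name phase" lines on stdin; $1 is minAvailable.
# PASS: at least minAvailable pods Running/Succeeded and none Pending.
# HELD: no pod started, i.e. the gang is waiting as a whole (correct
#       behaviour when resources are short).
# FAIL: anything else, including partial placement or Failed pods.
gang_verdict() {
  awk -v min="$1" '
    $2 == "Running" || $2 == "Succeeded" { started++ }
    $2 == "Pending"                      { pending++ }
    $2 == "Failed"  || $2 == "Unknown"   { broken++ }
    END {
      if (broken > 0)                     { print "FAIL"; exit 1 }
      if (started >= min && pending == 0) { print "PASS"; exit 0 }
      if (started == 0)                   { print "HELD"; exit 0 }
      print "FAIL"; exit 1
    }'
}

# Usage:
# kubectl get pods -n smoke -l volcano.sh/job-name=smoke-gang --no-headers \
#   -o custom-columns=NAME:.metadata.name,PHASE:.status.phase | gang_verdict 4
```

On a cluster deliberately short of GPUs, HELD is the expected outcome and demonstrates the gang behaviour; FAIL with some pods started is the partial placement this test exists to catch.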

Teardown

kubectl delete -f smoke-gang.yaml -f smoke-dra.yaml -f smoke-mig.yaml -f smoke-nccl.yaml -f smoke-cuda.yaml

Failure modes

Test 1 Pending, 0/N nodes are available: Insufficient nvidia.com/gpu. Device plugin never advertised the resource. GPU Operator driver/toolkit/plugin pods not Ready, or a host driver fighting the container driver. Fix the operator; this gates everything.¹
Test 2 shows NET/Socket, not GDRDMA. NCCL on TCP fallback: wrong rdma/ib key for the NicClusterPolicy, wrong NCCL_IB_HCA, or no RDMA device injected. Collectives run ~10x slower (fabric bring-up).⁶
Test 2 hangs at init / SHM error. Missing /dev/shm emptyDir: { medium: Memory }, or MPI launcher cannot reach workers (SSH/NCCL_SOCKET_IFNAME). NCCL bootstrap needs the shared-memory mount and a reachable bootstrap interface.⁵
Test 3 Pending with whole GPUs free. MIG geometry not applied: nvidia.com/mig.config unset or not state: success, or mixed strategy requested while the pod asks for plain nvidia.com/gpu. Match the request to the strategy.²
Test 4 claim never allocates. DRA driver not installed, cluster below 1.34, DeviceClass missing, or the CEL selector matched nothing (e.g. 40Gi on 24 GB cards). Confirm kubectl get deviceclasses and loosen the selector.⁷ ⁹
Test 5 pods partially Running instead of all-or-nothing. Job landed on default-scheduler, not Volcano. Either schedulerName is unset or Volcano is not installed. The default scheduler has no gang semantics and will partial-place, idling GPUs.¹⁰
Test 5 stuck Inqueue forever. minAvailable exceeds schedulable GPUs, or the queue has no capacity. Lower minAvailable to a placeable size or check the Volcano Queue.¹⁰

References

GPU Operator — getting started / verify GPU resources: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html ¹
GPU Operator — MIG support (single vs mixed strategy, nvidia.com/mig.config): https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-mig.html ²
Network Operator — RDMA/NicClusterPolicy, RDMA shared device plugin: https://docs.nvidia.com/networking/display/cokan10/network+operator ³
Kubeflow MPI Operator — MPIJob (kubeflow.org/v2beta1, slotsPerWorker, mpiReplicaSpecs): https://github.com/kubeflow/mpi-operator ⁴
coreweave/nccl-tests — MPIJob example (launcher mpirun/all_reduce_perf, worker rdma/ib, /dev/shm): https://github.com/coreweave/nccl-tests ⁵
NCCL environment variables (NCCL_DEBUG, NCCL_IB_HCA) and GPUDirect-RDMA log signal (NET/IB/.../GDRDMA): https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html and https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting/networking_troubleshooting.html ⁶
Kubernetes DRA — concept, GA in 1.34, resource.k8s.io/v1: https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/ ⁷
Kubernetes — Allocate Devices to Workloads with DRA (ResourceClaimTemplate, exactly/selectors, pod resourceClaims, status.allocation): https://kubernetes.io/docs/tasks/configure-pod-container/assign-resources/allocate-devices-dra/ ⁸
NVIDIA DRA driver for GPUs (gpu.nvidia.com device class): https://github.com/NVIDIA/k8s-dra-driver-gpu ⁹
Volcano — VolcanoJob (batch.volcano.sh/v1alpha1, minAvailable, schedulerName, tasks): https://volcano.sh/en/docs/vcjob/ and install: https://volcano.sh/en/docs/ ¹⁰¹¹

¹ GPU Operator getting-started: after install, nodes advertise nvidia.com/gpu and a sample CUDA pod (nvidia-smi) validates the stack; missing capacity means the device plugin/driver is not Ready. https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html
² GPU Operator MIG: single strategy advertises MIG slices as nvidia.com/gpu (pod spec unchanged); mixed advertises per-profile resources nvidia.com/mig-<g>.<mem>gb. Geometry set via the nvidia.com/mig.config node label (e.g. all-1g.10gb); strategy set via --set mig.strategy=single|mixed. https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-mig.html
³ Network Operator deploys host RDMA, the RDMA shared device plugin, and secondary networks via NicClusterPolicy; pods then request an RDMA resource (e.g. rdma/ib) and attach the secondary network. https://docs.nvidia.com/networking/display/cokan10/network+operator
⁴ Kubeflow MPI Operator provides the MPIJob CRD at apiVersion: kubeflow.org/v2beta1 with slotsPerWorker, runPolicy.cleanPodPolicy, and mpiReplicaSpecs (Launcher/Worker). https://github.com/kubeflow/mpi-operator
⁵ coreweave/nccl-tests publishes ghcr.io/coreweave/nccl-tests images and mpi-operator/ MPIJob examples; the H100 example uses slotsPerWorker: 8, launcher mpirun -np 64 -bind-to none -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_IB_HCA=ibp ... all_reduce_perf -b 512M -e 8G -f 2 -g 1, worker limits nvidia.com/gpu: 8 + rdma/ib: 1, and a /dev/shm emptyDir{medium: Memory}. https://github.com/coreweave/nccl-tests
⁶ NCCL env vars: NCCL_DEBUG=INFO logs the selected transport; NCCL_IB_HCA filters HCAs. With GPUDirect RDMA active NCCL logs GPU Direct RDMA Enabled for GPU <id> / HCA <id> and connections NET/IB/<n>/GDRDMA; a NET/Socket line means TCP fallback. https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html and https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting/networking_troubleshooting.html
⁷ Dynamic Resource Allocation is GA in Kubernetes v1.34; all DRA kinds are in the resource.k8s.io/v1 API group. https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/
⁸ K8s "Allocate Devices to Workloads with DRA": ResourceClaimTemplate/ResourceClaim use spec.devices.requests[].exactly with deviceClassName and CEL selectors; a pod declares spec.resourceClaims[] (resourceClaimTemplateName/resourceClaimName) and containers[].resources.claims[]; kubectl describe resourceclaim shows status.allocation when bound. https://kubernetes.io/docs/tasks/configure-pod-container/assign-resources/allocate-devices-dra/
⁹ NVIDIA DRA driver for GPUs registers the gpu.nvidia.com device class and the driver pods needed for claims to allocate; requires K8s 1.34+ and CDI. https://github.com/NVIDIA/k8s-dra-driver-gpu
¹⁰ Volcano Job (batch.volcano.sh/v1alpha1) supports minAvailable (gang size), schedulerName: volcano, queue, plugins, and tasks[]; Volcano creates a PodGroup and starts pods only when minAvailable can be co-scheduled. https://volcano.sh/en/docs/vcjob/
¹¹ Volcano install (Helm/YAML) and scheduler/controller/admission components. https://volcano.sh/en/docs/