Smoke tests: GPU platform¶
Scope: a consolidated acceptance suite for a freshly built GPU platform. It covers five tests: a CUDA pod that runs nvidia-smi, an nccl-tests Job over RDMA (GDRDMA confirmed via NCCL_DEBUG=INFO), a MIG-instance pod, a DRA claim binding, and a gang Job that places all pods or none. Each test comes with the signal that means pass.
Reference templates from the GPU Operator / Network Operator docs, Kubernetes DRA docs, Volcano, the Kubeflow MPI Operator, and the NVIDIA/nccl-tests / coreweave/nccl-tests repos. Nothing here was executed on hardware. Pin every chart, image, and CRD version; substitute real node names, resource keys (rdma/ib vs rdma/roce), and HCA filters before applying. Run on one node/pair before a fleet roll. This is the acceptance gate for the build in Kubernetes & Helm: GPU platform; the host-level fabric proof it complements is fabric bring-up and benchmarking.
flowchart LR
T1["1 CUDA pod: nvidia-smi"] --> T2["2 nccl-tests over RDMA"]
T2 --> T3["3 MIG-instance pod"]
T3 --> T4["4 DRA claim binds"]
T4 --> T5["5 Gang Job places all pods"]
T5 --> OK{"all 5 pass?"}
OK -->|"yes"| ACCEPT["Platform accepted"]
OK -->|"no"| TRIAGE["Triage the failed layer"]
What it is¶
Five tests, one per platform layer, run bottom-up. Each isolates the layer beneath it, so a failure points at one component rather than the whole stack:
- CUDA pod. The GPU Operator advertised nvidia.com/gpu and the runtime injects a working GPU. Proves device plugin + container toolkit + driver.
- nccl-tests Job over RDMA. The Network Operator wired an RDMA device into pods and NCCL takes the GPUDirect RDMA path, not TCP. Proves the fabric is usable from inside a pod.
- MIG-instance pod. A partitioned GPU is schedulable as the resource its strategy advertises. Proves the sharing model.
- DRA claim. A ResourceClaim binds to a real device and the pod runs. Proves the Dynamic Resource Allocation path (Kubernetes 1.34+).[7]
- Gang Job. A multi-pod job starts only when every pod can be placed. Proves the scheduler does all-or-nothing placement, so a distributed job cannot half-start and deadlock.
This is GPU-platform acceptance inside Kubernetes. The host-level proof (links up, subnet manager converged, perftest/nccl-tests at line rate on bare metal) lives in fabric bring-up and benchmarking; the GPU-health adjudication (dcgmi diag, nvbandwidth, gpu-burn) lives in diagnostics tools. Run those first: a smoke test should fail on the platform plumbing, not on silicon that was never healthy.
Prerequisites¶
- The platform is built and each layer already passed its own install check (Kubernetes & Helm: GPU platform): GPU Operator, Network Operator + NicClusterPolicy, a sharing model (MIG or time-slicing), the DRA driver, and Volcano.
- kubectl against the target cluster; a namespace to run in (smoke below). Create it once: kubectl create namespace smoke.
- Nodes report GPU capacity: kubectl get nodes -o custom-columns=NODE:.metadata.name,GPU:.status.allocatable.'nvidia\.com/gpu' shows a non-zero count.[1]
- For test 2: the Network Operator is installed and a NicClusterPolicy exposes an RDMA resource (here rdma/ib; use rdma/roce on RoCE) and a secondary network for the pods.[3] The MPI Operator (kubeflow.org/v2beta1) is installed for the MPIJob kind.[4]
- For test 3: MIG is enabled on at least one node via the nvidia.com/mig.config label and the GPU Operator mig.strategy is set; this page assumes single.[2]
- For test 4: Kubernetes 1.34+ with the DynamicResourceAllocation feature, the NVIDIA DRA driver installed, and a DeviceClass (gpu.nvidia.com) present (kubectl get deviceclasses).[7][9]
- For test 5: Volcano is installed and kubectl get pods -n volcano-system shows the scheduler, controller, and admission pods Running.[11]
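The GPU-capacity prerequisite can be turned into a pass/fail gate. This is a minimal sketch rather than part of any NVIDIA tooling: gpu_capacity_summary is a hypothetical helper name, and it assumes the two-column NODE/GPU output of the custom-columns command above, where kubectl prints <none> for a node that does not advertise the resource.

```shell
# Hypothetical helper: reads the NODE/GPU custom-columns table on stdin,
# lists nodes without GPU capacity, and fails if no node has any.
gpu_capacity_summary() {
  awk '
    NR == 1 { next }                               # skip the NODE/GPU header row
    $2 ~ /^[0-9]+$/ && $2 + 0 > 0 { gpu++; next }  # node advertises at least one GPU
    { other++; print "no GPU capacity: " $1 }      # <none>, 0, or anything else
    END {
      printf "gpu_nodes=%d other_nodes=%d\n", gpu, other
      exit (gpu > 0 ? 0 : 1)
    }'
}

# Usage (needs cluster access):
#   kubectl get nodes -o custom-columns=NODE:.metadata.name,GPU:.status.allocatable.'nvidia\.com/gpu' \
#     | gpu_capacity_summary
```

CPU-only nodes are listed but do not fail the check on their own; the function only fails when no node reports GPU capacity at all.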
The manifests¶
1. CUDA pod: nvidia-smi¶
apiVersion: v1
kind: Pod
metadata:
  name: smoke-cuda
  namespace: smoke
spec:
  restartPolicy: Never
  containers:
    - name: smi
      image: nvidia/cuda:13.0.0-base-ubuntu24.04   # pin to your CUDA base
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
2. nccl-tests Job over RDMA (GDRDMA via NCCL_DEBUG=INFO)¶
Two-worker all_reduce_perf under the MPI Operator. The launcher sets NCCL_DEBUG=INFO so the log states which transport NCCL chose; the workers request nvidia.com/gpu and the RDMA resource. Resource keys, NCCL_IB_HCA, slotsPerWorker, and the test size mirror the coreweave/nccl-tests H100 example; adjust to your GPUs-per-node and HCA naming.[5]
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: smoke-nccl
  namespace: smoke
spec:
  slotsPerWorker: 8            # = GPUs per worker node
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
            - name: launcher
              image: ghcr.io/coreweave/nccl-tests:13.2.1-devel-ubuntu24.04-nccl2.30.4-1-61f74e3
              command: ["/bin/bash", "-c"]
              args:
                - >
                  mpirun --allow-run-as-root -np 16 -bind-to none
                  -x LD_LIBRARY_PATH
                  -x NCCL_DEBUG=INFO
                  -x NCCL_SOCKET_IFNAME=eth0
                  -x NCCL_IB_HCA=ibp
                  /opt/nccl_tests/build/all_reduce_perf -b 512M -e 8G -f 2 -g 1
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - name: worker
              image: ghcr.io/coreweave/nccl-tests:13.2.1-devel-ubuntu24.04-nccl2.30.4-1-61f74e3
              resources:
                limits:
                  nvidia.com/gpu: 8
                  rdma/ib: 1          # RoCE clusters: rdma/roce
                  memory: 960Gi
              volumeMounts:
                - { name: dshm, mountPath: /dev/shm }
          volumes:
            - name: dshm
              emptyDir: { medium: Memory }
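When the manifest is adapted to a different node count or GPU count, the three numbers that must agree are the mpirun -np value, the Worker replicas, and slotsPerWorker. The sketch below assumes the launch shape used above (-g 1, so one MPI rank per GPU); check_np is a hypothetical helper name, not part of the MPI Operator.

```shell
# Hypothetical helper: succeeds when mpirun -np equals Worker replicas x slotsPerWorker.
# Assumes one rank per GPU (all_reduce_perf -g 1), as in the manifest above.
check_np() {
  np=$1; replicas=$2; slots=$3
  expected=$((replicas * slots))
  if [ "$np" -eq "$expected" ]; then
    echo "OK: -np $np = $replicas workers x $slots slots"
  else
    echo "MISMATCH: -np $np, expected $expected ($replicas workers x $slots slots)"
    return 1
  fi
}

check_np 16 2 8    # values from the manifest above; prints the OK line
```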
3. MIG-instance pod¶
Under MIG single strategy the device plugin advertises MIG slices as the ordinary nvidia.com/gpu, so the pod spec is unchanged; only the meaning of one "GPU" shrinks to one slice.[2] The nodeSelector pins the pod to a node whose nvidia.com/mig.config.state label is success, meaning the MIG manager has finished applying the geometry.
apiVersion: v1
kind: Pod
metadata:
  name: smoke-mig
  namespace: smoke
spec:
  restartPolicy: Never
  nodeSelector:
    nvidia.com/mig.config.state: success   # MIG manager finished applying the geometry
  containers:
    - name: smi
      image: nvidia/cuda:13.0.0-base-ubuntu24.04
      command: ["nvidia-smi", "-L"]          # lists the MIG device UUID
      resources:
        limits:
          nvidia.com/gpu: 1
Under MIG mixed strategy, request the profile resource instead, e.g. limits: { nvidia.com/mig-1g.10gb: 1 }.[2]
4. DRA claim binding¶
A ResourceClaimTemplate requests one GPU with at least 40 GiB; a Job references it. apiVersion: resource.k8s.io/v1 is the GA group in Kubernetes 1.34; exactly, deviceClassName, and the CEL selectors are the documented claim fields.[7][8]
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: smoke-gpu-40g
  namespace: smoke
spec:
  spec:
    devices:
      requests:
        - name: gpu
          exactly:
            deviceClassName: gpu.nvidia.com
            selectors:
              - cel:
                  expression: 'device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("40Gi")) >= 0'
---
apiVersion: batch/v1
kind: Job
metadata:
  name: smoke-dra
  namespace: smoke
spec:
  template:
    spec:
      restartPolicy: Never
      resourceClaims:
        - name: gpu
          resourceClaimTemplateName: smoke-gpu-40g
      containers:
        - name: smi
          image: nvidia/cuda:13.0.0-base-ubuntu24.04
          command: ["nvidia-smi", "-L"]
          resources:
            claims:
              - name: gpu
5. Gang Job: place all pods or none¶
A Volcano Job with minAvailable equal to the worker count starts only when every pod can be scheduled together; schedulerName: volcano routes it to the gang scheduler.[10] Set nvidia.com/gpu so placement actually competes for GPUs.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: smoke-gang
  namespace: smoke
spec:
  minAvailable: 4              # all 4 pods, or none, start
  schedulerName: volcano
  queue: default
  plugins:
    svc: []
    env: []
  tasks:
    - replicas: 4
      name: worker
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: smi
              image: nvidia/cuda:13.0.0-base-ubuntu24.04
              command: ["nvidia-smi"]
              resources:
                limits:
                  nvidia.com/gpu: 1
Configuration¶
The fields that change per cluster. Get these wrong and the test fails for the wrong reason:
| Field | Where | Meaning | Set to |
|---|---|---|---|
| nvidia.com/gpu | pod resources.limits | whole-GPU (or MIG single slice) request[2] | 1 per device |
| nvidia.com/mig-<g>.<mem>gb | pod resources.limits | MIG mixed-strategy profile resource[2] | one profile per container |
| nvidia.com/mig.config | node label | MIG geometry the MIG manager applies[2] | e.g. all-1g.10gb |
| mig.strategy | GPU Operator Helm value | how MIG devices are advertised[2] | single or mixed |
| rdma/ib · rdma/roce | worker resources.limits | RDMA device from the NicClusterPolicy[3] | 1; key matches the policy |
| slotsPerWorker | MPIJob.spec[4] | MPI slots per worker = GPUs/node | GPUs per worker |
| NCCL_IB_HCA | launcher env | which HCAs NCCL uses[6] | your HCA prefix, e.g. ibp / mlx5 |
| NCCL_DEBUG | launcher env | NCCL log verbosity[6] | INFO (to read the transport) |
| deviceClassName | ResourceClaim request[8] | DRA device class to match | gpu.nvidia.com |
| selectors[].cel.expression | ResourceClaim request[8] | CEL device filter (e.g. memory) | per workload |
| minAvailable | Volcano Job.spec[10] | gang size: pods that must co-schedule | = total worker replicas |
| schedulerName | Volcano Job.spec[10] | scheduler that places the job | volcano |
Apply & verify¶
Run in order; do not proceed past a failure.
1. CUDA pod
kubectl apply -f smoke-cuda.yaml
kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/smoke-cuda -n smoke --timeout=120s
kubectl logs smoke-cuda -n smoke
Pass: the nvidia-smi table prints with the expected GPU model and driver. A pod stuck Pending with Insufficient nvidia.com/gpu means the device plugin did not advertise the resource; fix the GPU Operator before anything else.[1]
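The pass criterion for test 1 (expected model and driver) can be checked mechanically instead of by eye. A minimal sketch: check_smi_log is a hypothetical helper, and it assumes the usual nvidia-smi header containing the text "Driver Version: <x.y.z>" and a GPU row that contains the model string you pass in.

```shell
# Hypothetical helper: reads nvidia-smi table output on stdin and compares it
# against the model substring ($1) and driver version ($2) you expect.
check_smi_log() {
  want_model=$1
  want_driver=$2
  log=$(cat)
  # Pull the version that follows "Driver Version:" in the header line.
  got_driver=$(printf '%s\n' "$log" \
    | sed -n 's/.*Driver Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1)
  if [ "$got_driver" != "$want_driver" ]; then
    echo "FAIL: driver '$got_driver', expected '$want_driver'"
    return 1
  fi
  if ! printf '%s\n' "$log" | grep -qF -- "$want_model"; then
    echo "FAIL: model '$want_model' not in nvidia-smi output"
    return 1
  fi
  echo "PASS: $want_model on driver $got_driver"
}

# Usage (needs cluster access; model and driver are placeholders for your fleet):
#   kubectl logs smoke-cuda -n smoke | check_smi_log "<model>" "<driver-version>"
```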
2. nccl-tests over RDMA
kubectl apply -f smoke-nccl.yaml
kubectl get mpijob smoke-nccl -n smoke
kubectl logs -f job/smoke-nccl-launcher -n smoke
Pass: two signals. First, the log shows GPUDirect RDMA engaged, with lines of the form NET/IB/<n>/GDRDMA and GPU Direct RDMA Enabled for GPU <id> / HCA <id>. Second, all_reduce_perf completes with a busbw figure, # Out of bounds values : 0 OK, and an # Avg bus bandwidth line.[6] A NET/Socket line (or no GDRDMA) means NCCL fell back to TCP, so the RDMA device or NCCL_IB_HCA is wrong; the collective will run roughly an order of magnitude slower (see fabric bring-up and diagnostics tools).
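The two pass signals for test 2 can be reduced to a single verdict over the launcher log. This is a sketch under the assumptions stated in the text: the patterns are the ones named above (NET/IB/<n>/GDRDMA, NET/Socket, Out of bounds values : 0), exact log wording can differ between NCCL versions, and nccl_log_verdict is a hypothetical helper name.

```shell
# Hypothetical helper: reads the launcher log on stdin and reports which
# transport NCCL used and whether the benchmark finished cleanly.
nccl_log_verdict() {
  log=$(cat)
  # grep -c prints 0 but exits non-zero on no match; "|| true" keeps that harmless.
  gdr=$(printf '%s\n' "$log" | grep -c 'NET/IB/[0-9][0-9]*/GDRDMA' || true)
  sock=$(printf '%s\n' "$log" | grep -c 'NET/Socket' || true)
  clean=$(printf '%s\n' "$log" | grep -cE 'Out of bounds values : 0( |$)' || true)
  if [ "$sock" -gt 0 ]; then
    echo "FAIL: NET/Socket seen, NCCL fell back to TCP"; return 1
  elif [ "$gdr" -eq 0 ]; then
    echo "FAIL: no NET/IB/<n>/GDRDMA line"; return 1
  elif [ "$clean" -eq 0 ]; then
    echo "FAIL: no 'Out of bounds values : 0' line"; return 1
  fi
  echo "PASS: $gdr GDRDMA line(s), no TCP fallback, 0 out-of-bounds values"
}

# Usage (needs cluster access):
#   kubectl logs job/smoke-nccl-launcher -n smoke | nccl_log_verdict
```

The checks run in the same order as the triage: a TCP fallback is reported before a missing benchmark summary, because a slow run on sockets can still finish with zero out-of-bounds values.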
3. MIG pod
kubectl apply -f smoke-mig.yaml
kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/smoke-mig -n smoke --timeout=120s
kubectl logs smoke-mig -n smoke
Pass: nvidia-smi -L lists exactly one MIG device, e.g. MIG 1g.10gb Device 0: (UUID: MIG-...), proving the pod got a partition, not the whole GPU. Pending on a node with free whole GPUs but no MIG geometry means the nvidia.com/mig.config label has not been applied or has not reached state: success.[2]
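The "exactly one MIG device" criterion is a line count over the nvidia-smi -L output. A minimal sketch, assuming MIG devices appear as (possibly indented) lines that begin with "MIG " as in the example above; mig_device_verdict is a hypothetical helper name.

```shell
# Hypothetical helper: reads `nvidia-smi -L` output on stdin and passes only
# when exactly one MIG device line is present.
mig_device_verdict() {
  n=$(grep -c '^ *MIG ' || true)   # grep reads the function's stdin
  if [ "$n" -eq 1 ]; then
    echo "PASS: exactly one MIG device visible"
  else
    echo "FAIL: expected 1 MIG device, saw $n"
    return 1
  fi
}

# Usage (needs cluster access):
#   kubectl logs smoke-mig -n smoke | mig_device_verdict
```

A count of 0 means the pod received a whole GPU (or none); a count above 1 means the pod sees more partitions than it requested.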
4. DRA claim
kubectl apply -f smoke-dra.yaml
kubectl get resourceclaims -n smoke # generated claim should show allocated
kubectl describe resourceclaim -n smoke | sed -n '/Status/,/Events/p'
kubectl wait --for=condition=Complete job/smoke-dra -n smoke --timeout=180s
Pass: the generated ResourceClaim reports a populated status.allocation (the device was bound) and the Job completes with nvidia-smi -L output.[8] A claim stuck with empty status.allocation and the pod Pending (waiting for resource claim) means the DRA driver is absent, the DeviceClass is missing, or the CEL selector matched no device; confirm kubectl get deviceclasses and the driver pods.[9]
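Whether status.allocation is populated can also be checked without reading kubectl describe output. The sketch below assumes the claims are dumped as tab-separated "name, first allocated device" lines using the resource.k8s.io/v1 field path status.allocation.devices.results[0].device; dra_claim_verdict is a hypothetical helper, and the jsonpath in the usage comment should be verified against your cluster's API version.

```shell
# Hypothetical helper: reads "claim-name<TAB>device" lines on stdin.
# An empty second field means the claim has no allocation yet.
dra_claim_verdict() {
  awk -F '\t' '
    $1 != "" {
      total++
      if ($2 == "") { print "UNALLOCATED: " $1; bad++ }
      else          { print "allocated: " $1 " -> " $2 }
    }
    END {
      if (total == 0) { print "FAIL: no ResourceClaims found"; exit 1 }
      exit (bad > 0 ? 1 : 0)
    }'
}

# Usage (needs cluster access):
#   kubectl get resourceclaims -n smoke -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocation.devices.results[0].device}{"\n"}{end}' \
#     | dra_claim_verdict
```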
5. Gang Job
kubectl apply -f smoke-gang.yaml
kubectl get podgroup -n smoke # Volcano creates a PodGroup for the job
kubectl get pods -n smoke -l volcano.sh/job-name=smoke-gang
Pass: all four pods transition to Running/Succeeded together; the PodGroup reaches phase Running (it sits Inqueue/Pending until minAvailable can be met).[10] The proof of gang behaviour: on a cluster with only three free GPUs, no pod starts (not three), and kubectl describe podgroup shows not enough resources to schedule minMember. That "all-or-nothing" is the test; partial placement is a fail.
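The all-or-nothing rule has three outcomes: every pod started, no pod started, or a partial start. A minimal sketch that classifies a list of pod phases accordingly; gang_verdict is a hypothetical helper, and the return codes (0 pass, 2 held, 1 partial) are a convention chosen here, not something Volcano defines.

```shell
# Hypothetical helper: $1 is the gang size (minAvailable); stdin is one pod
# phase per line (Pending, Running, Succeeded, ...).
gang_verdict() {
  want=$1
  phases=$(cat)
  started=$(printf '%s\n' "$phases" | grep -cE '^(Running|Succeeded)$' || true)
  if [ "$started" -eq "$want" ]; then
    echo "PASS: all $want pods placed"
  elif [ "$started" -eq 0 ]; then
    echo "HELD: no pod started (gang is waiting for $want)"
    return 2
  else
    echo "FAIL: partial placement ($started of $want)"
    return 1
  fi
}

# Usage (needs cluster access):
#   kubectl get pods -n smoke -l volcano.sh/job-name=smoke-gang \
#     -o jsonpath='{range .items[*]}{.status.phase}{"\n"}{end}' | gang_verdict 4
```

HELD is the expected result on a cluster with fewer free GPUs than minAvailable; only the partial case is a failure of gang semantics.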
Teardown
kubectl delete -f smoke-gang.yaml -f smoke-dra.yaml -f smoke-mig.yaml -f smoke-nccl.yaml -f smoke-cuda.yaml
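The ordering rule for this section (run in order, do not proceed past a failure) can be sketched as a small driver. run_in_order is a hypothetical helper, and the step names in the usage comment are placeholders for functions you would write around the commands of each test.

```shell
# Hypothetical driver: runs each named command or function in order and stops
# at the first one that returns non-zero, so later layers are never tested on
# top of a broken one.
run_in_order() {
  for step in "$@"; do
    echo "== $step"
    if ! "$step"; then
      echo "STOP: $step failed; fix this layer before running later tests"
      return 1
    fi
  done
  echo "all steps passed"
}

# Usage (test_* are placeholder functions wrapping the commands above):
#   run_in_order test_cuda test_nccl test_mig test_dra test_gang
```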
Failure modes¶
- Test 1 Pending, 0/N nodes available: insufficient nvidia.com/gpu. Device plugin never advertised the resource. GPU Operator driver/toolkit/plugin pods not Ready, or a host driver fighting the container driver. Fix the operator; this gates everything.[1]
- Test 2 shows NET/Socket, not GDRDMA. NCCL on TCP fallback: wrong rdma/ib key for the NicClusterPolicy, wrong NCCL_IB_HCA, or no RDMA device injected. Collectives run ~10x slower (fabric bring-up).[6]
- Test 2 hangs at init / SHM error. Missing /dev/shm emptyDir: { medium: Memory }, or MPI launcher cannot reach workers (SSH/NCCL_SOCKET_IFNAME). NCCL bootstrap needs the shared-memory mount and a reachable bootstrap interface.[5]
- Test 3 Pending with whole GPUs free. MIG geometry not applied: nvidia.com/mig.config unset or not state: success, or mixed strategy requested while the pod asks for plain nvidia.com/gpu. Match the request to the strategy.[2]
- Test 4 claim never allocates. DRA driver not installed, cluster below 1.34, DeviceClass missing, or the CEL selector matched nothing (e.g. 40Gi on 24 GB cards). Confirm kubectl get deviceclasses and loosen the selector.[7][9]
- Test 5 pods partially Running instead of all-or-nothing. Job landed on default-scheduler, not Volcano. Either schedulerName is unset or Volcano is not installed. The default scheduler has no gang semantics and will partial-place, idling GPUs.[10]
- Test 5 stuck Inqueue forever. minAvailable exceeds schedulable GPUs, or the queue has no capacity. Lower minAvailable to a placeable size or check the Volcano Queue.[10]
References¶
- GPU Operator, getting started / verify GPU resources: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html [1]
- GPU Operator, MIG support (single vs mixed strategy, nvidia.com/mig.config): https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-mig.html [2]
- Network Operator, RDMA/NicClusterPolicy, RDMA shared device plugin: https://docs.nvidia.com/networking/display/cokan10/network+operator [3]
- Kubeflow MPI Operator, MPIJob (kubeflow.org/v2beta1, slotsPerWorker, mpiReplicaSpecs): https://github.com/kubeflow/mpi-operator [4]
- coreweave/nccl-tests, MPIJob example (launcher mpirun/all_reduce_perf, worker rdma/ib, /dev/shm): https://github.com/coreweave/nccl-tests [5]
- NCCL environment variables (NCCL_DEBUG, NCCL_IB_HCA) and GPUDirect-RDMA log signal (NET/IB/.../GDRDMA): https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html and https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting/networking_troubleshooting.html [6]
- Kubernetes DRA, concept, GA in 1.34, resource.k8s.io/v1: https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/ [7]
- Kubernetes, Allocate Devices to Workloads with DRA (ResourceClaimTemplate, exactly/selectors, pod resourceClaims, status.allocation): https://kubernetes.io/docs/tasks/configure-pod-container/assign-resources/allocate-devices-dra/ [8]
- NVIDIA DRA driver for GPUs (gpu.nvidia.com device class): https://github.com/NVIDIA/k8s-dra-driver-gpu [9]
- Volcano, VolcanoJob (batch.volcano.sh/v1alpha1, minAvailable, schedulerName, tasks): https://volcano.sh/en/docs/vcjob/ and install: https://volcano.sh/en/docs/ [10][11]
Related: Kubernetes & Helm: GPU platform · Fabric bring-up & benchmarking · Diagnostics tools · Kubernetes for GPUs · MIG partitioning · Glossary
1. GPU Operator getting-started: after install, nodes advertise nvidia.com/gpu and a sample CUDA pod (nvidia-smi) validates the stack; missing capacity means the device plugin/driver is not Ready. https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html
2. GPU Operator MIG: single strategy advertises MIG slices as nvidia.com/gpu (pod spec unchanged); mixed advertises per-profile resources nvidia.com/mig-<g>.<mem>gb. Geometry set via the nvidia.com/mig.config node label (e.g. all-1g.10gb); strategy set via --set mig.strategy=single|mixed. https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-mig.html
3. Network Operator deploys host RDMA, the RDMA shared device plugin, and secondary networks via NicClusterPolicy; pods then request an RDMA resource (e.g. rdma/ib) and attach the secondary network. https://docs.nvidia.com/networking/display/cokan10/network+operator
4. Kubeflow MPI Operator provides the MPIJob CRD at apiVersion: kubeflow.org/v2beta1 with slotsPerWorker, runPolicy.cleanPodPolicy, and mpiReplicaSpecs (Launcher/Worker). https://github.com/kubeflow/mpi-operator
5. coreweave/nccl-tests publishes ghcr.io/coreweave/nccl-tests images and mpi-operator/ MPIJob examples; the H100 example uses slotsPerWorker: 8, launcher mpirun -np 64 -bind-to none -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_IB_HCA=ibp ... all_reduce_perf -b 512M -e 8G -f 2 -g 1, worker limits nvidia.com/gpu: 8 + rdma/ib: 1, and a /dev/shm emptyDir {medium: Memory}. https://github.com/coreweave/nccl-tests
6. NCCL env vars: NCCL_DEBUG=INFO logs the selected transport; NCCL_IB_HCA filters HCAs. With GPUDirect RDMA active NCCL logs GPU Direct RDMA Enabled for GPU <id> / HCA <id> and connections NET/IB/<n>/GDRDMA; a NET/Socket line means TCP fallback. https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html and https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting/networking_troubleshooting.html
7. Dynamic Resource Allocation is GA in Kubernetes v1.34; all DRA kinds are in the resource.k8s.io/v1 API group. https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/
8. K8s "Allocate Devices to Workloads with DRA": ResourceClaimTemplate/ResourceClaim use spec.devices.requests[].exactly with deviceClassName and CEL selectors; a pod declares spec.resourceClaims[] (resourceClaimTemplateName/resourceClaimName) and containers[].resources.claims[]; kubectl describe resourceclaim shows status.allocation when bound. https://kubernetes.io/docs/tasks/configure-pod-container/assign-resources/allocate-devices-dra/
9. NVIDIA DRA driver for GPUs registers the gpu.nvidia.com device class and the driver pods needed for claims to allocate; requires K8s 1.34+ and CDI. https://github.com/NVIDIA/k8s-dra-driver-gpu
10. Volcano Job (batch.volcano.sh/v1alpha1) supports minAvailable (gang size), schedulerName: volcano, queue, plugins, and tasks[]; Volcano creates a PodGroup and starts pods only when minAvailable can be co-scheduled. https://volcano.sh/en/docs/vcjob/
11. Volcano install (Helm/YAML) and scheduler/controller/admission components. https://volcano.sh/en/docs/