Helm: NVIDIA network operator
Scope: helm install network-operator with the RDMA shared device plugin and the secondary-network path, so pods get an IB/RoCE device and GPUDirect RDMA engages instead of NCCL falling back to TCP. Pairs with manifest: NicClusterPolicy; the runnable slice of the GPU platform hub.
Reference templates from the upstream chart and CRDs (Network Operator v25.7.0). Versions are pinned for reproducibility, not hardware-tested here. Apply via GitOps (SRE/MLOps) rather than helm install by hand in production. NIC interface names (ens1f0np0, mlx5_0) and IPAM ranges are placeholders; substitute your fabric's values (RDMA fabric).
flowchart TB
HELM["helm install network-operator"] --> NCP["NicClusterPolicy (mellanox.com/v1alpha1)"]
NCP --> OFED["DOCA/OFED driver DaemonSet"]
NCP --> RDP["rdmaSharedDevicePlugin"]
NCP --> SEC["secondaryNetwork: multus + whereabouts"]
RDP --> RES["node advertises rdma/rdma_shared_device_a"]
SEC --> NAD["MacvlanNetwork -> NetworkAttachmentDefinition"]
RES --> POD["Pod: rdma resource + network annotation"]
NAD --> POD
POD --> GDR["GPUDirect RDMA via nvidia-peermem"]
What it is
The Network Operator manages everything between a bare NIC and an RDMA-capable pod: the DOCA/OFED kernel driver, the device plugin that advertises RDMA HCAs as schedulable resources, Multus for a second pod interface, and an IPAM plugin to address it. One Helm release installs the controller; one NicClusterPolicy CR declares which sub-components to deploy. The operator reconciles the CR into DaemonSets and a NetworkAttachmentDefinition, then reports a single state: ready.
Two device-plugin modes exist. rdmaSharedDevicePlugin advertises one HCA to many pods (up to rdmaHcaMax pods per HCA); it is the simplest path to GPUDirect RDMA and the one covered here. sriovDevicePlugin carves SR-IOV VFs for hard isolation; see security/multi-tenancy. GPUDirect RDMA itself relies on the nvidia-peermem kernel module, which gives the HCA peer-to-peer DMA into GPU memory; it ships with the NVIDIA driver and is enabled by the GPU Operator, not this chart (see Prerequisites).
Prerequisites
- A Kubernetes cluster with Mellanox/NVIDIA ConnectX-5+ or BlueField HCAs on the GPU nodes. Confirm with lspci | grep Mellanox on a node (RDMA fabric bring-up).
- GPU Operator already installed (helm: GPU Operator) with GPUDirect RDMA on (--set driver.rdma.enabled=true, per the GPU Operator RDMA guide). This builds and loads the nvidia-peermem module. Since the Network Operator delivers the DOCA driver as a DaemonSet here (not baked into the host OS), leave driver.rdma.useHostMofed at its default false; set it true only when MOFED is installed directly on the host OS. Never run a host MOFED and the operator OFED driver at once.
- Node Feature Discovery: the chart bundles it (nfd.enabled=true); disable only if NFD is already cluster-wide.
- A free, cabled NIC port per node for the secondary network, not the one carrying Kubernetes pod/node traffic.
- helm 3.x and cluster-admin to install CRDs (RBAC for GPU operators).
Install
Add the repo and install the controller. NFD ships in-chart; SR-IOV operator stays off for the shared-device-plugin path.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install network-operator nvidia/network-operator \
-n nvidia-network-operator --create-namespace \
--version v25.7.0 \
--set nfd.enabled=true \
--set sriovNetworkOperator.enabled=false
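Since production installs are expected to go through GitOps, the two --set flags above can be kept in a values file instead. A minimal sketch: the filename values-network-operator.yaml is an assumption, and only the two keys already shown on the command line are set.

```yaml
# values-network-operator.yaml (hypothetical filename)
# Equivalent to: --set nfd.enabled=true --set sriovNetworkOperator.enabled=false
nfd:
  enabled: true     # bundle Node Feature Discovery with the chart
sriovNetworkOperator:
  enabled: false    # shared-device-plugin path, so the SR-IOV operator stays off
```

Pass it to Helm with -f values-network-operator.yaml in place of the --set flags, or reference it from whichever GitOps tool manages the release.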
The chart installs the controller and the CRDs (NicClusterPolicy, MacvlanNetwork, HostDeviceNetwork, IPoIBNetwork) but deploys no networking components until a NicClusterPolicy exists; per the v25.7.0 deployment guide, creating it is a separate step. Declare the sub-components in a CR and apply it (GitOps-friendly, since the operator reconciles the rest):
# nic-cluster-policy.yaml (reference template; pin to your tested release)
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy # operator expects this exact name
spec:
  # DOCA/OFED kernel driver DaemonSet
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: doca3.1.0-25.07-0.9.7.0-0
    env:
      - name: RESTORE_DRIVER_ON_POD_TERMINATION
        value: "true"
      - name: UNLOAD_STORAGE_MODULES
        value: "true"
  # advertise each HCA to many pods as rdma/<resourceName>
  rdmaSharedDevicePlugin:
    image: k8s-rdma-shared-dev-plugin
    repository: nvcr.io/nvidia/mellanox
    version: network-operator-v25.7.0
    config: |
      {
        "configList": [
          {
            "resourceName": "rdma_shared_device_a",
            "rdmaHcaMax": 63,
            "selectors": { "ifNames": ["ens1f0np0"] }
          }
        ]
      }
  # second pod interface + IPAM (Multus + whereabouts)
  secondaryNetwork:
    cniPlugins:
      image: plugins
      repository: nvcr.io/nvidia/mellanox
      version: network-operator-v25.7.0
    multus:
      image: multus-cni
      repository: nvcr.io/nvidia/mellanox
      version: network-operator-v25.7.0
    ipamPlugin:
      image: whereabouts
      repository: nvcr.io/nvidia/mellanox
      version: network-operator-v25.7.0
The full CR, NV-IPAM IPPool, and IPoIBNetwork variant live in manifest: NicClusterPolicy.
Then attach the secondary network so pods can reference it. The operator translates a MacvlanNetwork into a NetworkAttachmentDefinition (RoCE/Ethernet shown; for InfiniBand use kind: IPoIBNetwork):
apiVersion: mellanox.com/v1alpha1
kind: MacvlanNetwork
metadata:
  name: rdma-net
spec:
  networkNamespace: "default"
  master: "ens1f0np0" # same NIC the device plugin selects
  mode: "bridge"
  mtu: 9000
  ipam: |
    {
      "type": "whereabouts",
      "range": "192.168.2.0/24"
    }
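For an InfiniBand fabric, the equivalent object would be an IPoIBNetwork. The following is a minimal sketch, assuming the CRD takes the same networkNamespace, master, and ipam fields as the MacvlanNetwork above. The name rdma-net-ib, the interface ibs1f0, and the address range are placeholders, and the field set should be checked against the CRD reference for your release.

```yaml
apiVersion: mellanox.com/v1alpha1
kind: IPoIBNetwork
metadata:
  name: rdma-net-ib            # hypothetical name
spec:
  networkNamespace: "default"
  master: "ibs1f0"             # placeholder IB netdev; substitute your own
  ipam: |
    {
      "type": "whereabouts",
      "range": "192.168.3.0/24"
    }
```

With this variant, pods would reference rdma-net-ib in the k8s.v1.cni.cncf.io/networks annotation, and selectors.ifNames in the device-plugin config would need to name the same InfiniBand netdev.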
Configuration
Key values (Helm) and CRD fields. Defaults are from chart v25.7.0. Image versions are the operator's tested set: change them together, not piecemeal.
| Key / field | Where | Purpose | Reference value |
|---|---|---|---|
| nfd.enabled | values | Bundle Node Feature Discovery to label HCA nodes | true |
| sriovNetworkOperator.enabled | values | SR-IOV path (VF isolation); off for shared-device plugin | false |
| spec.ofedDriver | NicClusterPolicy | Deploy DOCA/OFED driver DaemonSet (block present = deployed) | image/repository/version |
| spec.ofedDriver.version | NicClusterPolicy | DOCA driver image tag | doca3.1.0-25.07-0.9.7.0-0 |
| spec.ofedDriver.env[].RESTORE_DRIVER_ON_POD_TERMINATION | NicClusterPolicy | Restore the host (inbox) driver when the driver pod stops | "true" |
| spec.rdmaSharedDevicePlugin | NicClusterPolicy | Advertise HCAs as rdma/* resources (block present = deployed) | image/repository/version |
| spec.rdmaSharedDevicePlugin.config.resourceName | NicClusterPolicy (JSON) | Name -> resource rdma/<name> | rdma_shared_device_a |
| spec.rdmaSharedDevicePlugin.config.rdmaHcaMax | NicClusterPolicy (JSON) | Max concurrent pods per HCA | 63 |
| spec.rdmaSharedDevicePlugin.config.selectors.ifNames | NicClusterPolicy (JSON) | NIC netdevs to expose | ["ens1f0np0"] |
| spec.secondaryNetwork | NicClusterPolicy | Install Multus + CNI + IPAM (block present = deployed) | multus/cniPlugins/ipamPlugin |
| spec.secondaryNetwork.ipamPlugin.image | NicClusterPolicy | IP address management CNI | whereabouts |
| apiVersion / kind | CRD | NicClusterPolicy, MacvlanNetwork, IPoIBNetwork, HostDeviceNetwork | mellanox.com/v1alpha1 |
| MacvlanNetwork.spec.master | CRD | Host netdev backing the secondary net | ens1f0np0 |
| MacvlanNetwork.spec.mtu | CRD | Jumbo frames for RoCE | 9000 |
| MacvlanNetwork.spec.ipam | CRD | IPAM JSON (type + range) | whereabouts |
GPUDirect RDMA enablement is a GPU Operator concern: driver.rdma.enabled=true builds nvidia-peermem. Leave driver.rdma.useHostMofed default (false) when the Network Operator supplies the DOCA driver as a DaemonSet; set it true only for MOFED installed on the host OS. There is no gpuDirectRDMA field on NicClusterPolicy.
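Expressed as a values fragment for the GPU Operator release (not the Network Operator one), the two settings named above would look roughly like this. It is a sketch covering only those two keys; everything else in that chart's values is left at whatever you already run.

```yaml
# GPU Operator values fragment (belongs to the GPU Operator release, not this chart)
driver:
  rdma:
    enabled: true        # build and load nvidia-peermem
    useHostMofed: false  # default; the DOCA driver comes from the Network Operator DaemonSet
```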
Apply & verify
# 1. Controller up
kubectl get pods -n nvidia-network-operator
# nvidia-network-operator-controller-manager-... 2/2 Running
# 2. NicClusterPolicy reconciled to ready (logical AND of all sub-states)
kubectl get nicclusterpolicy nic-cluster-policy -n nvidia-network-operator \
-o json | jq -r '.status.state, (.status.appliedStates[]? | "\(.name)=\(.state)")'
# ready
# state-OFED=ready, state-RDMA-device-plugin=ready, state-multus-cni=ready (printed one per line)
# 3. Node advertises the RDMA resource
kubectl get nodes -o json \
| jq -r '.items[].status.allocatable | keys[] | select(startswith("rdma/"))'
# rdma/rdma_shared_device_a
# 4. nvidia-peermem loaded on a GPU node (GPU Operator path)
kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset -c nvidia-peermem-ctr --tail=-1 \
| grep -m1 "successfully loaded nvidia-peermem module"
Expected signal: controller 2/2 Running, NicClusterPolicy state: ready, every GPU node lists rdma/rdma_shared_device_a as allocatable, and the driver pod confirms nvidia-peermem loaded.
Prove the data path with two pods that claim the RDMA resource and the secondary network, then run ib_write_bw with CUDA (only after the CR is ready):
apiVersion: v1
kind: Pod
metadata:
  name: rdma-server
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: rdma-net
spec:
  restartPolicy: Never
  containers:
    - name: perftest
      image: "<cuda-perftest-image>@sha256:<digest>" # replace with a tested, pinned perftest image
      securityContext:
        capabilities: { add: ["IPC_LOCK"] } # RDMA needs locked memory
      command: ["sh", "-c"]
      args: ["ib_write_bw --use_cuda=0 --use_cuda_dmabuf -d mlx5_0 -a -F --report_gbits -q 1"]
      resources:
        limits:
          rdma/rdma_shared_device_a: 1
          nvidia.com/gpu: 1
The client pod runs the same image against the server's secondary-network IP:
ib_write_bw -n 5000 --use_cuda=0 --use_cuda_dmabuf -d mlx5_0 -a -F --report_gbits -q 1 <server-secondary-ip>
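A minimal sketch of the client pod, mirroring the server manifest. The name rdma-client is an assumption, the image is the same unpinned placeholder, and <server-secondary-ip> stays a placeholder to fill in from the server pod.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: rdma-client   # hypothetical name
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: rdma-net
spec:
  restartPolicy: Never
  containers:
    - name: perftest
      image: "<cuda-perftest-image>@sha256:<digest>" # same placeholder image as the server
      securityContext:
        capabilities: { add: ["IPC_LOCK"] } # RDMA needs locked memory
      command: ["sh", "-c"]
      # same flags as the server, plus an iteration count and the server address
      args: ["ib_write_bw -n 5000 --use_cuda=0 --use_cuda_dmabuf -d mlx5_0 -a -F --report_gbits -q 1 <server-secondary-ip>"]
      resources:
        limits:
          rdma/rdma_shared_device_a: 1
          nvidia.com/gpu: 1
```

The server's secondary address is typically visible in the k8s.v1.cni.cncf.io/network-status annotation that Multus writes on the server pod. Start the client only once the server pod is running and listening.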
Expected signal: --report_gbits BW approaching line rate (e.g. ~390+ Gb/s on a 400G NDR port). For NCCL-level proof at multi-node scale, run nccl-tests and confirm NCCL_DEBUG=INFO prints [GDRDMA] (fabric benchmarking).
Failure modes
- No rdma/* on the node. selectors.ifNames does not match a real netdev (check ip link on the node), or ofedDriver is not ready yet, since the device plugin waits on the driver. Re-check nicclusterpolicy sub-states.
- NicClusterPolicy stuck not-ready. Usually state-OFED failing: kernel-headers mismatch or a host MOFED already present. Inspect the OFED DaemonSet pod logs; reconcile host vs operator driver (RDMA fabric).
- Pod has the resource but NCCL still uses TCP. Secondary network not attached: missing k8s.v1.cni.cncf.io/networks annotation or the MacvlanNetwork/NAD absent. kubectl exec and check for a second interface.
- ib_write_bw fails to register CUDA memory. nvidia-peermem not loaded (GPU Operator driver.rdma.enabled off), or pod lacks IPC_LOCK; RDMA cannot pin GPU memory without it.
- Throughput far below line rate. ACS not disabled on PCIe switches, or MTU not jumbo end-to-end (disable ACS, fabric benchmarking).
- Driver/plugin version skew. Mixing image tags across ofedDriver/rdmaSharedDevicePlugin/secondaryNetwork is untested by NVIDIA; keep them on one release's defaults.
References
- Getting started with Kubernetes (Network Operator): https://docs.nvidia.com/networking/display/kubernetes2570/getting-started-with-kubernetes.html
- Deployment guide (Kubernetes), v25.7.0: https://docs.nvidia.com/networking/display/kubernetes2570/deployment-guide-kubernetes.html
- Helm chart customization options: https://docs.nvidia.com/networking/display/kubernetes2570/customizations/customization.html
- Network Operator CRD API reference: https://docs.nvidia.com/networking/display/kubernetes2570/customizations/crds.html
- Network Operator source (CRDs, values.yaml): https://github.com/Mellanox/network-operator
- GPUDirect RDMA & GPUDirect Storage (GPU Operator): https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-rdma.html
- Deploying GPUDirect RDMA with the Network Operator (NVIDIA blog): https://developer.nvidia.com/blog/deploying-gpudirect-rdma-on-egx-stack-with-the-network-operator/
Related: GPU platform hub · NicClusterPolicy manifest · GPU Operator · RDMA fabric · Fabric benchmarking · Security · Glossary