Helm: NVIDIA network operator

Scope: helm install network-operator with the RDMA shared device plugin and the secondary-network path, so pods get an IB/RoCE device and GPUDirect RDMA engages instead of NCCL falling back to TCP. Pairs with manifest: NicClusterPolicy; the runnable slice of the GPU platform hub.

Reference templates from the upstream chart and CRDs (Network Operator v25.7.0). Versions are pinned for reproducibility, not hardware-tested here. Apply via GitOps (SRE/MLOps) rather than helm install by hand in production. NIC interface names (ens1f0np0, mlx5_0) and IPAM ranges are placeholders; substitute your fabric's values (RDMA fabric).

flowchart TB
  HELM["helm install network-operator"] --> NCP["NicClusterPolicy (mellanox.com/v1alpha1)"]
  NCP --> OFED["DOCA/OFED driver DaemonSet"]
  NCP --> RDP["rdmaSharedDevicePlugin"]
  NCP --> SEC["secondaryNetwork: multus + whereabouts"]
  RDP --> RES["node advertises rdma/rdma_shared_device_a"]
  SEC --> NAD["MacvlanNetwork -> NetworkAttachmentDefinition"]
  RES --> POD["Pod: rdma resource + network annotation"]
  NAD --> POD
  POD --> GDR["GPUDirect RDMA via nvidia-peermem"]

What it is

The Network Operator manages everything between a bare NIC and an RDMA-capable pod: the DOCA/OFED kernel driver, the device plugin that advertises RDMA HCAs as schedulable resources, Multus for a second pod interface, and an IPAM plugin to address it. One Helm release installs the controller; one NicClusterPolicy CR declares which sub-components to deploy. The operator reconciles the CR into DaemonSets and a NetworkAttachmentDefinition, then reports a single state: ready.

Two device-plugin modes exist. rdmaSharedDevicePlugin advertises one HCA to many pods (up to rdmaHcaMax pods per resource on each node); it is the simplest path to GPUDirect RDMA and the one covered here. sriovDevicePlugin carves SR-IOV VFs for hard isolation; see security/multi-tenancy. GPUDirect RDMA itself relies on the nvidia-peermem kernel module, which gives the HCA peer-to-peer DMA into GPU memory; it ships with the NVIDIA driver and is enabled by the GPU Operator, not this chart (see Prerequisites).

Prerequisites

  • A Kubernetes cluster with Mellanox/NVIDIA ConnectX-5+ or BlueField HCAs on the GPU nodes. Confirm with lspci | grep Mellanox on a node (RDMA fabric bring-up).
  • GPU Operator already installed (helm: GPU Operator) with GPUDirect RDMA on (--set driver.rdma.enabled=true, per the GPU Operator RDMA guide). This builds and loads the nvidia-peermem module. Since the Network Operator delivers the DOCA driver as a DaemonSet here (not baked into the host OS), leave driver.rdma.useHostMofed at its default false; set it true only when MOFED is installed directly on the host OS. Never run a host MOFED and the operator OFED driver at once.
  • Node Feature Discovery: the chart bundles it (nfd.enabled=true); disable only if NFD is already cluster-wide.
  • A free, cabled NIC port per node for the secondary network, not the one carrying Kubernetes pod/node traffic.
  • helm 3.x and cluster-admin to install CRDs (RBAC for GPU operators).

Install

Add the repo and install the controller. NFD ships in-chart; SR-IOV operator stays off for the shared-device-plugin path.

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update

helm install network-operator nvidia/network-operator \
  -n nvidia-network-operator --create-namespace \
  --version v25.7.0 \
  --set nfd.enabled=true \
  --set sriovNetworkOperator.enabled=false

This installs the controller and CRDs (NicClusterPolicy, MacvlanNetwork, HostDeviceNetwork, IPoIBNetwork) and nothing else: per the v25.7.0 deployment guide, no networking components are deployed until a NicClusterPolicy exists. Declare the sub-components in a CR and apply it as a separate step (GitOps-friendly, since the operator reconciles the rest):

# nic-cluster-policy.yaml  (reference template; pin to your tested release)
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy            # operator expects this exact name
spec:
  # DOCA/OFED kernel driver DaemonSet
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: doca3.1.0-25.07-0.9.7.0-0
    env:
      - name: RESTORE_DRIVER_ON_POD_TERMINATION
        value: "true"
      - name: UNLOAD_STORAGE_MODULES
        value: "true"

  # advertise each HCA to many pods as rdma/<resourceName>
  rdmaSharedDevicePlugin:
    image: k8s-rdma-shared-dev-plugin
    repository: nvcr.io/nvidia/mellanox
    version: network-operator-v25.7.0
    config: |
      {
        "configList": [
          {
            "resourceName": "rdma_shared_device_a",
            "rdmaHcaMax": 63,
            "selectors": { "ifNames": ["ens1f0np0"] }
          }
        ]
      }

  # second pod interface + IPAM (Multus + whereabouts)
  secondaryNetwork:
    cniPlugins:
      image: plugins
      repository: nvcr.io/nvidia/mellanox
      version: network-operator-v25.7.0
    multus:
      image: multus-cni
      repository: nvcr.io/nvidia/mellanox
      version: network-operator-v25.7.0
    ipamPlugin:
      image: whereabouts
      repository: nvcr.io/nvidia/mellanox
      version: network-operator-v25.7.0

kubectl apply -f nic-cluster-policy.yaml

Applying this CR is what makes the operator roll out the driver, device-plugin, and secondary-network DaemonSets. The full CR, NV-IPAM IPPool, and IPoIBNetwork variant live in manifest: NicClusterPolicy.
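In a pipeline, the apply step is usually followed by a blocking wait so later steps do not race the operator. A minimal sketch, assuming kubectl 1.23 or newer (needed for `--for=jsonpath`); the function name and the 15m default timeout are illustrative and not part of the chart:

```shell
# Hypothetical helper: block until the NicClusterPolicy reports state "ready".
# $1 is an optional timeout (default 15m, an arbitrary choice for driver builds).
wait_for_nic_cluster_policy() {
  kubectl wait nicclusterpolicy/nic-cluster-policy \
    --for=jsonpath='{.status.state}'=ready \
    --timeout="${1:-15m}"
}

# Usage:
#   kubectl apply -f nic-cluster-policy.yaml && wait_for_nic_cluster_policy 20m
```

A non-zero exit means the timeout expired before the CR became ready; the sub-state checks under Apply & verify show which component is lagging.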

Then attach the secondary network so pods can reference it. The operator translates a MacvlanNetwork into a NetworkAttachmentDefinition (RoCE/Ethernet shown; for InfiniBand use kind: IPoIBNetwork):

apiVersion: mellanox.com/v1alpha1
kind: MacvlanNetwork
metadata:
  name: rdma-net
spec:
  networkNamespace: "default"
  master: "ens1f0np0"          # same NIC the device plugin selects
  mode: "bridge"
  mtu: 9000
  ipam: |
    {
      "type": "whereabouts",
      "range": "192.168.2.0/24"
    }
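
The master netdev here has to be one the device plugin exposes, and a typo in either place only surfaces later as a missing resource. A rough pre-apply lint, assuming the plugin's JSON config has been saved to a file; the function name is hypothetical, and the text matching is deliberately simple (a JSON-aware tool such as jq would be more robust):

```shell
# Hypothetical lint: succeed only if netdev "$1" appears in an "ifNames" list
# of the rdmaSharedDevicePlugin JSON config read from stdin.
master_in_ifnames() {
  tr -d ' \t\n' \
    | grep -o '"ifNames":\[[^]]*]' \
    | grep -qF "\"$1\""
}

# Usage (config.json holds the JSON from spec.rdmaSharedDevicePlugin.config):
#   master_in_ifnames ens1f0np0 < config.json \
#     || echo "MacvlanNetwork master is not exposed by the device plugin"
```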

Configuration

Key values (Helm) and CRD fields. Defaults from chart v25.7.0; image versions are the operator's tested set; change them together, not piecemeal.

| Key / field | Where | Purpose | Reference value |
|---|---|---|---|
| nfd.enabled | values | Bundle Node Feature Discovery to label HCA nodes | true |
| sriovNetworkOperator.enabled | values | SR-IOV path (VF isolation); off for shared-device plugin | false |
| spec.ofedDriver | NicClusterPolicy | Deploy DOCA/OFED driver DaemonSet (block present = deployed) | image/repository/version |
| spec.ofedDriver.version | NicClusterPolicy | DOCA driver image tag | doca3.1.0-25.07-0.9.7.0-0 |
| spec.ofedDriver.env[].RESTORE_DRIVER_ON_POD_TERMINATION | NicClusterPolicy | Restore the host's driver when the driver pod stops | "true" |
| spec.rdmaSharedDevicePlugin | NicClusterPolicy | Advertise HCAs as rdma/* resources (block present = deployed) | image/repository/version |
| spec.rdmaSharedDevicePlugin.config.resourceName | NicClusterPolicy (JSON) | Name -> resource rdma/<name> | rdma_shared_device_a |
| spec.rdmaSharedDevicePlugin.config.rdmaHcaMax | NicClusterPolicy (JSON) | Max concurrent pods per HCA | 63 |
| spec.rdmaSharedDevicePlugin.config.selectors.ifNames | NicClusterPolicy (JSON) | NIC netdevs to expose | ["ens1f0np0"] |
| spec.secondaryNetwork | NicClusterPolicy | Install Multus + CNI + IPAM (block present = deployed) | multus/cniPlugins/ipamPlugin |
| spec.secondaryNetwork.ipamPlugin.image | NicClusterPolicy | IP address management CNI | whereabouts |
| apiVersion / kind | CRD | NicClusterPolicy, MacvlanNetwork, IPoIBNetwork, HostDeviceNetwork | mellanox.com/v1alpha1 |
| MacvlanNetwork.spec.master | CRD | Host netdev backing the secondary net | ens1f0np0 |
| MacvlanNetwork.spec.mtu | CRD | Jumbo frames for RoCE | 9000 |
| MacvlanNetwork.spec.ipam | CRD | IPAM JSON (type + range) | whereabouts |

GPUDirect RDMA enablement is a GPU Operator concern: driver.rdma.enabled=true builds nvidia-peermem. Leave driver.rdma.useHostMofed default (false) when the Network Operator supplies the DOCA driver as a DaemonSet; set it true only for MOFED installed on the host OS. There is no gpuDirectRDMA field on NicClusterPolicy.
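
For orientation, the GPU Operator side of that split might look like the sketch below. The release name and the gpu-operator namespace are assumptions (the namespace matches the one queried in the verification step on this page), and the chart version is deliberately left to the caller; pass `--version` with your tested release:

```shell
# Hypothetical wrapper: install or upgrade the GPU Operator with GPUDirect RDMA on.
# driver.rdma.useHostMofed is left at its default (false) because the Network
# Operator supplies the DOCA driver as a DaemonSet. Extra helm flags pass through.
install_gpu_operator_rdma() {
  helm upgrade --install gpu-operator nvidia/gpu-operator \
    -n gpu-operator --create-namespace \
    --set driver.rdma.enabled=true \
    "$@"
}

# Usage:
#   install_gpu_operator_rdma --version <tested-gpu-operator-version>
```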

Apply & verify

# 1. Controller up
kubectl get pods -n nvidia-network-operator
#   nvidia-network-operator-controller-manager-...  2/2  Running

# 2. NicClusterPolicy reconciled to ready (logical AND of all sub-states)
kubectl get nicclusterpolicy nic-cluster-policy -n nvidia-network-operator \
  -o json | jq -r '.status.state, (.status.appliedStates[]? | "\(.name)=\(.state)")'
#   ready
#   state-OFED=ready
#   state-RDMA-device-plugin=ready
#   state-multus-cni=ready

# 3. Node advertises the RDMA resource
kubectl get nodes -o json \
  | jq -r '.items[].status.allocatable | keys[] | select(startswith("rdma/"))'
#   rdma/rdma_shared_device_a

# 4. nvidia-peermem loaded on a GPU node (GPU Operator path)
kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset --tail=-1 \
  | grep -m1 "successfully loaded nvidia-peermem module"

Expected signal: controller 2/2 Running, NicClusterPolicy state: ready, every GPU node lists rdma/rdma_shared_device_a as allocatable, and the driver pod confirms nvidia-peermem loaded.
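
To turn the step 2 output into a pass/fail gate, a small filter over the name=state lines is enough. This is a sketch with a hypothetical function name; it assumes sub-components that are not declared in the CR may be reported as "ignore" and treats that as acceptable, which you should confirm against your operator version:

```shell
# Hypothetical gate: read "name=state" lines on stdin, print every sub-state
# that is neither "ready" nor "ignore", and fail if any exist or none were seen.
all_states_ready() {
  awk -F= '
    NF < 2 { next }                          # skip blanks and the bare overall state line
    { seen++ }
    $2 != "ready" && $2 != "ignore" { print "not ready: " $1; bad = 1 }
    END { exit ((seen > 0 && !bad) ? 0 : 1) }
  '
}

# Usage:
#   kubectl get nicclusterpolicy nic-cluster-policy -o json \
#     | jq -r '.status.appliedStates[]? | "\(.name)=\(.state)"' \
#     | all_states_ready
```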

Prove the data path with two pods that claim the RDMA resource and the secondary network, then run ib_write_bw with CUDA (only after the CR is ready):

apiVersion: v1
kind: Pod
metadata:
  name: rdma-server
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: rdma-net
spec:
  restartPolicy: Never
  containers:
    - name: perftest
      image: "<cuda-perftest-image>@sha256:<digest>"   # replace with a tested, pinned perftest image
      securityContext:
        capabilities: { add: ["IPC_LOCK"] }    # RDMA needs locked memory
      command: ["sh", "-c"]
      args: ["ib_write_bw --use_cuda=0 --use_cuda_dmabuf -d mlx5_0 -a -F --report_gbits -q 1"]
      resources:
        limits:
          rdma/rdma_shared_device_a: 1
          nvidia.com/gpu: 1

The client pod runs the same image against the server's secondary-network IP:

ib_write_bw -n 5000 --use_cuda=0 --use_cuda_dmabuf -d mlx5_0 -a -F --report_gbits -q 1 <server-secondary-ip>

Expected signal: --report_gbits BW approaching line rate (e.g. ~390+ Gb/s on a 400G NDR port). For NCCL-level proof at multi-node scale, run nccl-tests and confirm NCCL_DEBUG=INFO prints [GDRDMA] (fabric benchmarking).
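
The NCCL check can be scripted against a saved log. A minimal sketch, assuming the nccl-tests output (run with NCCL_DEBUG=INFO) was captured to a file; the function name is hypothetical, and it matches the bare string GDRDMA so it works whether or not the marker is bracketed in your NCCL version:

```shell
# Hypothetical check: print how many log lines mention GDRDMA.
# Exit status is non-zero when the count is 0 (grep -c semantics).
gdrdma_channels() {
  grep -cF 'GDRDMA' "$1"
}

# Usage:
#   gdrdma_channels nccl-allreduce.log || echo "no GDRDMA transport found; check peermem and the secondary network"
```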

Failure modes

  • No rdma/* on the node. selectors.ifNames does not match a real netdev (check ip link on the node), or ofedDriver is not ready yet, since the device plugin waits on the driver. Re-check nicclusterpolicy sub-states.
  • NicClusterPolicy stuck not-ready. Usually state-OFED failing: kernel-headers mismatch or a host MOFED already present. Inspect the OFED DaemonSet pod logs; reconcile host vs operator driver (RDMA fabric).
  • Pod has the resource but NCCL still uses TCP. Secondary network not attached: missing k8s.v1.cni.cncf.io/networks annotation or the MacvlanNetwork/NAD absent. kubectl exec and check for a second interface.
  • ib_write_bw fails to register CUDA memory. nvidia-peermem not loaded (GPU Operator driver.rdma.enabled off), or pod lacks IPC_LOCK; RDMA cannot pin GPU memory without it.
  • Throughput far below line rate. ACS not disabled on PCIe switches, or MTU not jumbo end-to-end (disable ACS, fabric benchmarking).
  • Driver/plugin version skew. Mixing image tags across ofedDriver/rdmaSharedDevicePlugin/secondaryNetwork is untested by NVIDIA; keep them on one release's defaults.
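
For the missing-interface and MTU cases above, one quick probe is to read the secondary interface's MTU from inside the pod. This sketch assumes Multus names the first secondary interface net1 (its default) and that the image ships the ip tool; the function name is hypothetical:

```shell
# Hypothetical probe: read `ip -o link` output on stdin and print the MTU of the
# named interface (default net1). Fails if the interface is absent.
secondary_iface_mtu() {
  awk -v ifc="${1:-net1}" '
    { name = $2; sub(/[@:].*/, "", name) }   # "net1@if5:" -> "net1"
    name == ifc {
      for (i = 3; i < NF; i++) if ($i == "mtu") { print $(i + 1); found = 1 }
    }
    END { exit (found ? 0 : 1) }
  '
}

# Usage:
#   kubectl exec rdma-server -- ip -o link | secondary_iface_mtu \
#     || echo "no net1 in pod: check the networks annotation and the NAD"
```

An MTU below the 9000 set on the MacvlanNetwork points at the jumbo-frame bullet; a failure points at the annotation or NAD bullet.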

References

  • Getting started with Kubernetes (Network Operator): https://docs.nvidia.com/networking/display/kubernetes2570/getting-started-with-kubernetes.html
  • Deployment guide (Kubernetes), v25.7.0: https://docs.nvidia.com/networking/display/kubernetes2570/deployment-guide-kubernetes.html
  • Helm chart customization options: https://docs.nvidia.com/networking/display/kubernetes2570/customizations/customization.html
  • Network Operator CRD API reference: https://docs.nvidia.com/networking/display/kubernetes2570/customizations/crds.html
  • Network Operator source (CRDs, values.yaml): https://github.com/Mellanox/network-operator
  • GPUDirect RDMA & GPUDirect Storage (GPU Operator): https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-rdma.html
  • Deploying GPUDirect RDMA with the Network Operator (NVIDIA blog): https://developer.nvidia.com/blog/deploying-gpudirect-rdma-on-egx-stack-with-the-network-operator/

Related: GPU platform hub · NicClusterPolicy manifest · GPU Operator · RDMA fabric · Fabric benchmarking · Security · Glossary