
Manifest: NicClusterPolicy

Scope: the NicClusterPolicy CRD that drives the NVIDIA Network Operator. It wires the OFED/DOCA driver, the RDMA shared device plugin (resourceName, ifNames), and a secondary network (Multus + NV-IPAM). The page ends with a multi-homed pod that requests the RDMA resource; verification checks that the node advertises rdma/shared_ib as allocatable and that the pod sees a secondary ibX interface. Pairs with NVIDIA Network Operator; the sharing/scheduler layers sit downstream in the GPU platform hub.

Reference templates from the NVIDIA Network Operator v25.7 quick-start CRDs. Pinned chart/image versions and ifNames are placeholders; ifNames must match the real interface names on your hosts. Apply via GitOps; never hand-edit in production. Not validated on this hardware.

flowchart TB
  NCP["NicClusterPolicy (one per cluster)"] --> OFED["ofedDriver (DOCA/MOFED)"]
  NCP --> RDP["rdmaSharedDevicePlugin -> rdma/shared_ib"]
  NCP --> SEC["secondaryNetwork: multus + ipoib + nv-ipam"]
  SEC --> NAD["IPoIBNetwork / NetworkAttachmentDefinition"]
  RDP --> POD["Multi-homed pod: rdma/shared_ib + ibX"]
  NAD --> POD

What it is

NicClusterPolicy is the single cluster-scoped CRD the Network Operator reconciles into a working RDMA data plane. It is defined once per cluster. Each top-level key is a sub-state the operator deploys and tracks:

  • ofedDriver: the DOCA/MOFED kernel-module container, built and loaded per node (omit if the driver is host-installed).
  • rdmaSharedDevicePlugin: advertises one host RDMA device to many pods as an extended resource rdma/<resourceName> (shared, not exclusive; rdmaHcaMax caps concurrent consumers).
  • secondaryNetwork: Multus, the reference CNI plugins, and the IPoIB CNI, so a pod can attach a second NIC beyond the cluster pod network.
  • nvIpam: the NV-IPAM controller and CNI that assign addresses on the secondary network. It is a sibling of secondaryNetwork under spec, not a child of it.

The shared device plugin gives a pod the /dev/infiniband verbs device; the secondary network gives it the L3 ibX interface. A distributed-training pod needs both: the resource alone yields no IP, and the interface alone yields no verbs handle. If either is missing, NCCL falls back to TCP (performance tuning).

apiVersion: mellanox.com/v1alpha1, kind: NicClusterPolicy. The CRD and operator live in the nvidia-network-operator namespace.

Prerequisites

  • Network Operator installed and Ready; see NVIDIA Network Operator (chart nvidia/network-operator v25.7.0).
  • Mellanox/NVIDIA ConnectX or BlueField NICs (PCI vendor 15b3). Confirm host interface names (ip -br link / ibstat) before setting ifNames.
  • Node Feature Discovery running so the operator schedules onto NIC-bearing nodes (--set nfd.enabled=true at install).
  • RDMA subsystem in shared netns mode on hosts for the shared device plugin (rdma system reports netns shared); the host driver must already be loaded if you omit ofedDriver.
  • Multus enabled in the policy (below) before any k8s.v1.cni.cncf.io/networks annotation will resolve.
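The two host-side prerequisites above (shared netns mode, real interface names) can be checked mechanically before the policy is applied. The sketch below is a minimal illustration, not part of the operator: the function names rdma_netns_mode and missing_ifnames are hypothetical, and it assumes `rdma system` prints a line containing `netns <mode>` and that `ip -br link` prints the interface name in the first column.

```shell
# Print the RDMA netns mode ("shared" or "exclusive") found in
# `rdma system` output read from stdin.
rdma_netns_mode() {
  awk '{ for (i = 1; i < NF; i++) if ($i == "netns") { print $(i + 1); exit } }'
}

# Print every wanted interface name that does NOT appear in
# `ip -br link` output read from stdin (first column, "@parent" suffix stripped).
missing_ifnames() {
  local present name
  present="$(awk '{ sub(/@.*/, "", $1); print $1 }')"
  for name in "$@"; do
    printf '%s\n' "$present" | grep -qxF -- "$name" || echo "$name"
  done
}
```

On a NIC node, `rdma system | rdma_netns_mode` should print shared, and `ip -br link | missing_ifnames ibs1f0 ibs1f1` should print nothing; any name it does print is one the ifNames selector would fail to match.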

The manifest

One NicClusterPolicy, one NV-IPAM IPPool, one IPoIBNetwork attachment, and the multi-homed test pod. Each numbered block is applied as its own file in Apply & verify. Pin every image; set ifNames to your real IB interfaces.

# 1) Cluster-wide RDMA data plane. apiVersion/kind/fields per NVIDIA NetOp v25.7 quick-start.
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy            # operator expects this exact name
spec:
  ofedDriver:                          # OMIT this whole block if the driver is host-installed
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: doca3.1.0-25.07-0.9.7.0-0
  rdmaSharedDevicePlugin:
    image: k8s-rdma-shared-dev-plugin
    repository: nvcr.io/nvidia/mellanox
    version: network-operator-v25.7.0
    config: |
      {
        "configList": [
          {
            "resourceName": "shared_ib",
            "rdmaHcaMax": 63,
            "selectors": {
              "vendors": ["15b3"],
              "ifNames": ["ibs1f0", "ibs1f1"]
            }
          }
        ]
      }
  secondaryNetwork:
    cniPlugins:
      image: plugins
      repository: nvcr.io/nvidia/mellanox
      version: network-operator-v25.7.0
    multus:
      image: multus-cni
      repository: nvcr.io/nvidia/mellanox
      version: network-operator-v25.7.0
    ipoib:
      image: ipoib-cni
      repository: nvcr.io/nvidia/mellanox
      version: network-operator-v25.7.0
  nvIpam:
    image: nvidia-k8s-ipam
    repository: nvcr.io/nvidia/mellanox
    version: network-operator-v25.7.0
    enableWebhook: false
---
# 2) NV-IPAM pool the IPoIB attachment draws addresses from.
apiVersion: nv-ipam.nvidia.com/v1alpha1
kind: IPPool
metadata:
  name: ipoib-pool-a
  namespace: nvidia-network-operator   # NV-IPAM watches this namespace
spec:
  subnet: 192.168.5.0/24
  perNodeBlockSize: 50
  gateway: 192.168.5.1
---
# 3) Secondary IPoIB network. master must be a real IB interface (an ifName from above).
apiVersion: mellanox.com/v1alpha1
kind: IPoIBNetwork
metadata:
  name: ipoib-network-a
spec:
  networkNamespace: "default"          # namespace where consuming pods run
  master: "ibs1f0"
  ipam: |
    {
      "type": "nv-ipam",
      "poolName": "ipoib-pool-a"
    }
---
# 4) Multi-homed pod: requests rdma/shared_ib AND attaches the IPoIB secondary network.
apiVersion: v1
kind: Pod
metadata:
  name: rdma-test-pod-a
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: ipoib-network-a   # -> secondary ibX interface
spec:
  restartPolicy: Never
  containers:
    - name: test
      image: "<rping-test-image>@sha256:<digest>"        # replace with a tested, pinned RDMA test image
      command: ["/bin/bash", "-c", "sleep infinity"]
      securityContext:
        capabilities:
          add: ["IPC_LOCK"]                          # required for RDMA memory pinning
      resources:
        requests:
          rdma/shared_ib: 1                          # rdma/<resourceName> from configList
        limits:
          rdma/shared_ib: 1

resourceName: "shared_ib" is what makes the extended resource surface as rdma/shared_ib. If you change the name, the node allocatable key changes with it, and every pod requests/limits key must be updated to match.
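Because the pod keys have to track resourceName by hand, it can help to derive the expected keys from the plugin config rather than retype them. The following is a rough sketch under two assumptions: the plugin uses the rdma/ prefix shown in this page, and the config is the JSON string from the manifest. The function name rdma_resource_keys is hypothetical, and the text matching is deliberately simple rather than a full JSON parse.

```shell
# Print "rdma/<resourceName>" for every resourceName found in a
# rdmaSharedDevicePlugin config JSON read from stdin.
rdma_resource_keys() {
  grep -o '"resourceName"[[:space:]]*:[[:space:]]*"[^"]*"' \
    | sed -E 's/.*:[[:space:]]*"([^"]*)"$/rdma\/\1/'
}
```

Feeding it the live config, for example `kubectl get nicclusterpolicy nic-cluster-policy -o jsonpath='{.spec.rdmaSharedDevicePlugin.config}' | rdma_resource_keys`, lists the keys that pod requests/limits should use; grepping pod manifests for those strings then shows which ones are stale after a rename.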

Configuration

| Field | Where | Meaning | Note |
| --- | --- | --- | --- |
| spec.ofedDriver | NicClusterPolicy | DOCA/MOFED driver container (image/repository/version) | Omit entirely when the host driver is installed; never run both (Ansible bring-up) |
| spec.rdmaSharedDevicePlugin.config | NicClusterPolicy | JSON string: configList[] of device groups | A string, not a YAML map; keep it as a block scalar |
| configList[].resourceName | plugin config | Suffix of the advertised resource rdma/<resourceName> | shared_ib -> rdma/shared_ib |
| configList[].rdmaHcaMax | plugin config | Max concurrent pods sharing the device | Shared semantics; not isolation |
| configList[].selectors.ifNames | plugin config | Host interfaces to back the resource | Must match real names (ip -br link); mismatch = 0 allocatable |
| configList[].selectors.vendors | plugin config | PCI vendor filter | 15b3 = Mellanox/NVIDIA |
| spec.secondaryNetwork.multus | NicClusterPolicy | Meta-CNI that attaches the second NIC | Required for the k8s.v1.cni.cncf.io/networks annotation |
| spec.secondaryNetwork.ipoib | NicClusterPolicy | IPoIB CNI image | Use macvlan path instead for RoCE/Ethernet |
| spec.nvIpam | NicClusterPolicy | NV-IPAM controller + CNI | Backs "type": "nv-ipam" in attachment ipam |
| IPoIBNetwork.spec.master | IPoIBNetwork | Parent IB interface for the secondary link | One of the ifNames |
| IPPool.spec.perNodeBlockSize | IPPool | Addresses carved per node from subnet | Sized to max pods/node |
| Pod annotations k8s.v1.cni.cncf.io/networks | Pod | Names the attachment(s) to add | Comma-separate for multiple NICs |
| Pod resources.limits rdma/<name> | Pod | Claim one share of the RDMA device | IPC_LOCK capability also required |
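The perNodeBlockSize row has a consequence that is easy to miss: the subnet size divided by the block size bounds how many nodes can receive a block at all. The arithmetic below is a back-of-the-envelope sketch, not NV-IPAM's actual allocator; the function name ippool_max_nodes is hypothetical, and it assumes the network, broadcast, and gateway addresses are not handed out.

```shell
# Estimate how many nodes an IPv4 IPPool can serve.
# Args: <subnet prefix length> <perNodeBlockSize>
# Assumption: 3 addresses (network, broadcast, gateway) are unusable.
ippool_max_nodes() {
  local prefix_len="$1" block_size="$2"
  local usable=$(( (1 << (32 - prefix_len)) - 3 ))
  echo $(( usable / block_size ))
}
```

With the manifest's values, `ippool_max_nodes 24 50` gives 5, so under these assumptions a /24 with 50-address blocks covers roughly five NIC nodes; a larger cluster would need a wider subnet or a smaller block.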

For RoCE/Ethernet fabrics, swap the ipoib sub-state and IPoIBNetwork for the MacVLAN path: secondaryNetwork.cniPlugins + a MacvlanNetwork (spec.master, mode: bridge, mtu, ipam). The rdmaSharedDevicePlugin block is identical.
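As a rough illustration of the MacVLAN path, the helper below renders a MacvlanNetwork with the fields named above (master, mode: bridge, mtu, ipam). It is a sketch, not a validated manifest: the function name render_macvlan_network and the example names roce-net-a, ens2f0np0, and roce-pool-a are hypothetical placeholders, the default MTU of 1500 is an assumption, and an IPPool with the given name must already exist.

```shell
# Emit a MacvlanNetwork manifest on stdout.
# Args: <name> <master-interface> <ippool-name> [mtu, default 1500]
render_macvlan_network() {
  local name="$1" master="$2" pool="$3" mtu="${4:-1500}"
  cat <<EOF
apiVersion: mellanox.com/v1alpha1
kind: MacvlanNetwork
metadata:
  name: ${name}
spec:
  networkNamespace: "default"
  master: "${master}"
  mode: "bridge"
  mtu: ${mtu}
  ipam: |
    {
      "type": "nv-ipam",
      "poolName": "${pool}"
    }
EOF
}
```

For example, `render_macvlan_network roce-net-a ens2f0np0 roce-pool-a 9000 > macvlan-network.yaml` writes a file that can go through the same GitOps flow as the other manifests; the pod annotation then names roce-net-a instead of ipoib-network-a.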

Apply & verify

kubectl apply -f nic-cluster-policy.yaml
kubectl apply -f ippool.yaml
kubectl apply -f ipoib-network.yaml

# 1) Policy converged — global state and every active sub-state must read "ready".
kubectl get nicclusterpolicy nic-cluster-policy -o jsonpath='{.status.state}'; echo
kubectl get nicclusterpolicy nic-cluster-policy \
  -o jsonpath='{range .status.appliedStates[*]}{.name}{"="}{.state}{"\n"}{end}'
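# expect: state-OFED=ready  state-RDMA-device-plugin=ready
#         state-multus-cni=ready  state-nv-ipam-cni=ready
#         the cniPlugins and ipoib sub-states also read "ready" (both are set in this policy);
#         sub-states this policy does not configure read "ignore"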

# 2) Resource advertised on NIC nodes.
kubectl get nodes -o json \
  | jq '.items[].status.allocatable | with_entries(select(.key|test("rdma")))'
# expect on NIC nodes: { "rdma/shared_ib": "63" }; nodes without a matching NIC print {}

# 3) Run the multi-homed pod and inspect inside it.
kubectl apply -f rdma-test-pod.yaml
kubectl wait --for=condition=Ready pod/rdma-test-pod-a --timeout=120s

kubectl exec rdma-test-pod-a -- ip -br addr show     # expect a secondary net1/ibX with a 192.168.5.x IP
kubectl exec rdma-test-pod-a -- ibstat               # expect an HCA, State: Active, Physical state: LinkUp
kubectl exec rdma-test-pod-a -- ibv_devinfo          # expect PORT_ACTIVE
kubectl exec rdma-test-pod-a -- ls /dev/infiniband   # expect uverbs*, rdma_cm

Expected signal: status.state: ready; rdma/shared_ib non-zero on each NIC node; inside the pod a secondary ibX/net1 interface with an IPPool address, ibstat Active/LinkUp, and /dev/infiniband populated. Prove end-to-end with a two-pod ib_write_bw or a 2-node nccl-tests run showing NCCL_DEBUG=INFO selecting the IB HCA, not [socket] (fabric bring-up & benchmarking).
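For the nccl-tests check, the transport can be read out of the NCCL_DEBUG=INFO log rather than eyeballed. The filter below is a minimal sketch with a hypothetical function name, nccl_transport; it assumes the log contains either a "NET/IB : Using" or "Using network IB" line when the HCA is selected, and the Socket equivalents otherwise. The exact wording varies between NCCL versions, so treat the patterns as a starting point.

```shell
# Classify the transport NCCL selected, from NCCL_DEBUG=INFO output on stdin.
# Prints "IB", "Socket", or "unknown". A "NET/IB : No device found" line
# does not count as IB because it lacks the "Using" marker.
nccl_transport() {
  awk '
    /NET\/IB : Using|Using network IB/         { ib = 1 }
    /NET\/Socket : Using|Using network Socket/ { sock = 1 }
    END {
      if (ib) print "IB"
      else if (sock) print "Socket"
      else print "unknown"
    }'
}
```

Piping a saved job log through it (for example `nccl_transport < nccl-job.log`, where the file name is a placeholder) should print IB on a healthy fabric; Socket means the run fell back to TCP.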

Failure modes

  • ifNames mismatch. Selector lists an interface the host does not have; plugin advertises 0, pods sit Pending on rdma/shared_ib. Fix the names to match ip -br link.
  • config as a YAML map instead of a JSON string. The plugin ignores it; no resource appears. Keep it a | block scalar of valid JSON.
  • Driver double-load. ofedDriver set while a host driver is present; module conflict, node NotReady. Drop the ofedDriver block when host-installed (Ansible bring-up).
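  • RDMA subsystem in exclusive mode. The shared device plugin cannot expose one device to several pod network namespaces, so pods that are granted the resource still cannot use the HCA. Set the host RDMA netns mode to shared (rdma system set netns shared).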
  • Missing IPC_LOCK. Verbs apps fail to pin memory (Cannot allocate memory / mlx5: …) even with the resource granted.
  • Pod gets the resource but no IP. The networks annotation is absent or the attachment/IPPool is missing; the verbs device is present but ip addr shows no ibX. Apply the IPoIBNetwork + IPPool and add the annotation.
  • status.state: notReady. Read status.appliedStates to find the sub-state that is not ready, then check the corresponding operator pod logs in the nvidia-network-operator namespace.
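The notReady triage above can be scripted against the name=state lines that the appliedStates jsonpath query in Apply & verify prints. This is a small sketch with a hypothetical function name, stuck_substates; it treats ready and ignore as healthy and reports everything else.

```shell
# Read "name=state" lines from stdin and print each sub-state that is
# neither "ready" nor "ignore", with its current state in parentheses.
stuck_substates() {
  awk -F= 'NF == 2 && $2 != "ready" && $2 != "ignore" { print $1 " (" $2 ")" }'
}
```

Empty output means every configured sub-state has converged; any line it prints names the sub-state whose pods and logs in nvidia-network-operator are worth reading first.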

References

  • NicClusterPolicy CRD (sub-states, status): https://docs.nvidia.com/networking/display/kubernetes2640/life-cycle-management.html
  • Customization Options: https://docs.nvidia.com/networking/display/kubernetes2570/customizations/customization.html
  • Network Operator CRD API reference: https://docs.nvidia.com/networking/display/kubernetes2570/customizations/crds.html
  • IPoIB + RDMA shared device quick-start (NicClusterPolicy, IPoIBNetwork, IPPool, test pod): https://docs.nvidia.com/networking/display/kubernetes2570/quick-start/ipoib-rdma-shared.html
  • MacVLAN + RDMA shared device quick-start (RoCE path, MacvlanNetwork): https://docs.nvidia.com/networking/display/kubernetes2570/quick-start/macvlan-rdma-shared.html
  • Deployment Guide with Kubernetes (Helm, namespace): https://docs.nvidia.com/networking/display/kubernetes2570/deployment-guide-kubernetes.html
  • Network Operator source and chart: https://github.com/Mellanox/network-operator

Related: Network Operator · GPU Platform hub · Kubernetes for GPUs · Fabric bring-up · Ansible · Glossary