Manifest: NicClusterPolicy

Scope: the NicClusterPolicy CRD that drives the NVIDIA Network Operator. It wires the OFED/DOCA driver, the RDMA shared device plugin (resourceName, ifNames), and a secondary network (Multus + NV-IPAM). A multi-homed pod then requests the RDMA resource and is checked for a secondary ibX interface, with rdma/shared_ib showing in node allocatable. Pairs with NVIDIA Network Operator; the sharing/scheduler layers sit downstream in the GPU platform hub.

Reference templates come from the NVIDIA Network Operator v25.7 quick-start CRDs. Pinned chart/image versions and ifNames are placeholders; ifNames must match the real interface names on your hosts. Apply via GitOps and never hand-edit in production. These manifests have not been validated on this hardware.

flowchart TB
  NCP["NicClusterPolicy (one per cluster)"] --> OFED["ofedDriver (DOCA/MOFED)"]
  NCP --> RDP["rdmaSharedDevicePlugin -> rdma/shared_ib"]
  NCP --> SEC["secondaryNetwork (multus + ipoib) + nvIpam"]
  SEC --> NAD["IPoIBNetwork / NetworkAttachmentDefinition"]
  RDP --> POD["Multi-homed pod: rdma/shared_ib + ibX"]
  NAD --> POD

What it is

NicClusterPolicy is the single cluster-scoped CRD the Network Operator reconciles into a working RDMA data plane. It is defined once per cluster. Each top-level key is a sub-state the operator deploys and tracks:

ofedDriver: the DOCA/MOFED kernel-module container, built and loaded per node (omit if the driver is host-installed).
rdmaSharedDevicePlugin: advertises the selected host RDMA device(s) to many pods as an extended resource rdma/<resourceName> (shared, not exclusive; rdmaHcaMax caps concurrent consumers).
secondaryNetwork: Multus, the reference CNI plugins, and the IPoIB CNI, so a pod can attach a second NIC beyond the cluster pod network.
nvIpam: the NV-IPAM controller and CNI that assign addresses on that second NIC (a sibling of secondaryNetwork under spec, not a child of it).

The shared device plugin gives a pod the /dev/infiniband verbs device; the secondary network gives it the L3 ibX interface. A distributed-training pod needs both: the resource alone yields no IP, and the interface alone yields no verbs handle. If either is missing, NCCL falls back to TCP (see performance tuning).

apiVersion: mellanox.com/v1alpha1, kind: NicClusterPolicy. The resource itself is cluster-scoped; the operator and the workloads it deploys run in the nvidia-network-operator namespace.

Prerequisites

Network Operator installed and Ready; see NVIDIA Network Operator (chart nvidia/network-operator v25.7.0).
Mellanox/NVIDIA ConnectX or BlueField NICs (PCI vendor 15b3). Confirm host interface names (ip -br link / ibstat) before setting ifNames.
Node Feature Discovery running so the operator schedules onto NIC-bearing nodes (--set nfd.enabled=true at install).
RDMA subsystem in shared mode on hosts for the shared device plugin (rdma system shows netns shared); host driver already loaded if you omit ofedDriver.
Multus enabled in the policy (below) before any k8s.v1.cni.cncf.io/networks annotation will resolve.
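
The host-side prerequisites above can be checked mechanically. The following is a minimal sketch, not an NVIDIA tool: the function names are hypothetical, and it assumes `rdma system` prints a `netns <mode>` pair and that `lspci -n` prints `slot class: vendor:device` per line.

```shell
# Hypothetical preflight helpers (names are illustrative only).

# Reads `rdma system` output on stdin and prints the word after "netns"
# (expected: "shared" or "exclusive").
rdma_netns_mode() {
  awk '{ for (i = 1; i < NF; i++) if ($i == "netns") { print $(i + 1); exit } }'
}

# Reads `lspci -n` output on stdin and prints how many devices carry
# PCI vendor 15b3 (Mellanox/NVIDIA) in the vendor:device column.
count_mellanox_devices() {
  awk 'tolower($3) ~ /^15b3:/ { n++ } END { print n + 0 }'
}

# Intended use on a NIC node (not executed here):
#   rdma system | rdma_netns_mode
#   lspci -n    | count_mellanox_devices
```

A node is a candidate for the shared device plugin when the first helper prints shared and the second prints a non-zero count.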

The manifest

One NicClusterPolicy, one NV-IPAM IPPool, and one IPoIBNetwork attachment. Pin every image; set ifNames to your real IB interfaces.

# 1) Cluster-wide RDMA data plane. apiVersion/kind/fields per NVIDIA NetOp v25.7 quick-start.
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy            # operator expects this exact name
spec:
  ofedDriver:                          # OMIT this whole block if the driver is host-installed
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: doca3.1.0-25.07-0.9.7.0-0
  rdmaSharedDevicePlugin:
    image: k8s-rdma-shared-dev-plugin
    repository: nvcr.io/nvidia/mellanox
    version: network-operator-v25.7.0
    config: |
      {
        "configList": [
          {
            "resourceName": "shared_ib",
            "rdmaHcaMax": 63,
            "selectors": {
              "vendors": ["15b3"],
              "ifNames": ["ibs1f0", "ibs1f1"]
            }
          }
        ]
      }
  secondaryNetwork:
    cniPlugins:
      image: plugins
      repository: nvcr.io/nvidia/mellanox
      version: network-operator-v25.7.0
    multus:
      image: multus-cni
      repository: nvcr.io/nvidia/mellanox
      version: network-operator-v25.7.0
    ipoib:
      image: ipoib-cni
      repository: nvcr.io/nvidia/mellanox
      version: network-operator-v25.7.0
  nvIpam:
    image: nvidia-k8s-ipam
    repository: nvcr.io/nvidia/mellanox
    version: network-operator-v25.7.0
    enableWebhook: false

---
# 2) NV-IPAM pool the IPoIB attachment draws addresses from.
apiVersion: nv-ipam.nvidia.com/v1alpha1
kind: IPPool
metadata:
  name: ipoib-pool-a
  namespace: nvidia-network-operator   # NV-IPAM watches this namespace
spec:
  subnet: 192.168.5.0/24
  perNodeBlockSize: 50
  gateway: 192.168.5.1

---
# 3) Secondary IPoIB network. master must be a real IB interface (an ifName from above).
apiVersion: mellanox.com/v1alpha1
kind: IPoIBNetwork
metadata:
  name: ipoib-network-a
spec:
  networkNamespace: "default"          # namespace where consuming pods run
  master: "ibs1f0"
  ipam: |
    {
      "type": "nv-ipam",
      "poolName": "ipoib-pool-a"
    }

---
# 4) Multi-homed pod: requests rdma/shared_ib AND attaches the IPoIB secondary network.
apiVersion: v1
kind: Pod
metadata:
  name: rdma-test-pod-a
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: ipoib-network-a   # -> secondary ibX interface
spec:
  restartPolicy: Never
  containers:
    - name: test
      image: "<rping-test-image>@sha256:<digest>"        # replace with a tested, pinned RDMA test image
      command: ["/bin/bash", "-c", "sleep infinity"]
      securityContext:
        capabilities:
          add: ["IPC_LOCK"]                          # required for RDMA memory pinning
      resources:
        requests:
          rdma/shared_ib: 1                          # rdma/<resourceName> from configList
        limits:
          rdma/shared_ib: 1

resourceName: "shared_ib" is what makes the extended resource surface as rdma/shared_ib. Rename it and the node allocatable key changes with it, so every pod requests/limits key must be updated to match.
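
Because the resource key is derived from resourceName, a rename that misses a pod manifest leaves that pod Pending. A minimal consistency check might look like the sketch below; the helper names are hypothetical, and it matches text line by line rather than parsing JSON or YAML, so it assumes each `"resourceName": "..."` pair sits on one line as in the manifest above.

```shell
# Hypothetical helpers (names are illustrative only).

# Reads the policy manifest (or just the plugin config) on stdin and prints
# one "rdma/<resourceName>" key per configList entry found.
resource_key_from_config() {
  sed -n 's/.*"resourceName"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/rdma\/\1/p'
}

# pod_requests_resource KEY: reads a pod manifest on stdin and succeeds
# if it contains a "KEY:" entry under requests or limits.
pod_requests_resource() {
  grep -q -F -- "$1:"
}

# Intended use against the files from this page (not executed here):
#   key=$(resource_key_from_config < nic-cluster-policy.yaml)
#   pod_requests_resource "$key" < rdma-test-pod.yaml || echo "pod does not request $key"
```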

Configuration

Field	Where	Meaning	Note
`spec.ofedDriver`	NicClusterPolicy	DOCA/MOFED driver container (image/repository/version)	Omit entirely when the host driver is installed; never run both (see Ansible bring-up)
`spec.rdmaSharedDevicePlugin.config`	NicClusterPolicy	JSON string: `configList[]` of device groups	A string, not a YAML map; keep it as a block scalar
`configList[].resourceName`	plugin config	Suffix of the advertised resource `rdma/<resourceName>`	`shared_ib` -> `rdma/shared_ib`
`configList[].rdmaHcaMax`	plugin config	Max concurrent pods sharing the device	Shared semantics; not isolation
`configList[].selectors.ifNames`	plugin config	Host interfaces to back the resource	Must match real names (`ip -br link`); mismatch = 0 allocatable
`configList[].selectors.vendors`	plugin config	PCI vendor filter	`15b3` = Mellanox/NVIDIA
`spec.secondaryNetwork.multus`	NicClusterPolicy	Meta-CNI that attaches the second NIC	Required for the `k8s.v1.cni.cncf.io/networks` annotation
`spec.secondaryNetwork.ipoib`	NicClusterPolicy	IPoIB CNI image	Use `macvlan` path instead for RoCE/Ethernet
`spec.nvIpam`	NicClusterPolicy	NV-IPAM controller + CNI	Backs `"type": "nv-ipam"` in attachment `ipam`
`IPoIBNetwork.spec.master`	IPoIBNetwork	Parent IB interface for the secondary link	One of the `ifNames`
`IPPool.spec.perNodeBlockSize`	IPPool	Addresses carved per node from `subnet`	Size to at least the maximum number of attaching pods per node
Pod `annotations` `k8s.v1.cni.cncf.io/networks`	Pod	Names the attachment(s) to add	Comma-separate for multiple NICs
Pod `resources.limits` `rdma/<name>`	Pod	Claim one share of the RDMA device	`IPC_LOCK` capability also required
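
perNodeBlockSize also bounds how many nodes one pool can serve, since each node takes a whole block out of subnet. As a rough estimate only (it assumes the network and broadcast addresses are excluded and blocks are carved contiguously; NV-IPAM's exact accounting, including the gateway address, may differ), the arithmetic for the pool above can be sketched as:

```shell
# pool_node_capacity PREFIX_LEN BLOCK_SIZE
# Hypothetical helper: estimates how many per-node blocks fit in an IPv4 subnet.
pool_node_capacity() {
  local prefix=$1 block=$2
  local usable=$(( (1 << (32 - prefix)) - 2 ))   # drop network and broadcast addresses
  echo $(( usable / block ))                     # whole blocks only
}

pool_node_capacity 24 50   # 192.168.5.0/24 with perNodeBlockSize 50 -> prints 5
```

Under these assumptions a /24 with blocks of 50 covers about five NIC nodes; a larger cluster needs a wider subnet or a smaller block.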

For RoCE/Ethernet fabrics, swap the ipoib sub-state and IPoIBNetwork for the MacVLAN path: secondaryNetwork.cniPlugins + a MacvlanNetwork (spec.master, mode: bridge, mtu, ipam). The rdmaSharedDevicePlugin block keeps the same structure, with ifNames pointing at the Ethernet interfaces.
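
For orientation, a MacvlanNetwork for the RoCE path could be rendered as in the sketch below. This is an assumption-laden illustration: the function name, the resource name roce-network-a, the interface ens2f0, the pool roce-pool-a, and the default MTU of 1500 are all hypothetical placeholders, and the field layout simply mirrors the IPoIBNetwork above plus the fields named in the text. Check it against the CRD API reference before use.

```shell
# render_macvlan_network NAME MASTER POOL [MTU]
# Hypothetical helper: prints a MacvlanNetwork manifest to stdout.
render_macvlan_network() {
  local name=$1 master=$2 pool=$3 mtu=${4:-1500}
  cat <<EOF
apiVersion: mellanox.com/v1alpha1
kind: MacvlanNetwork
metadata:
  name: ${name}
spec:
  networkNamespace: "default"
  master: "${master}"
  mode: "bridge"
  mtu: ${mtu}
  ipam: |
    {
      "type": "nv-ipam",
      "poolName": "${pool}"
    }
EOF
}

# Intended use (placeholders, not executed here):
#   render_macvlan_network roce-network-a ens2f0 roce-pool-a > macvlan-network.yaml
```

The pod side is unchanged apart from the attachment name in the k8s.v1.cni.cncf.io/networks annotation.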

Apply & verify

kubectl apply -f nic-cluster-policy.yaml
kubectl apply -f ippool.yaml
kubectl apply -f ipoib-network.yaml

# 1) Policy converged — global state and every active sub-state must read "ready".
kubectl get nicclusterpolicy nic-cluster-policy -o jsonpath='{.status.state}'; echo
kubectl get nicclusterpolicy nic-cluster-policy \
  -o jsonpath='{range .status.appliedStates[*]}{.name}{"="}{.state}{"\n"}{end}'
# expect: state-OFED=ready  state-RDMA-device-plugin=ready
#         state-multus-cni=ready  state-nv-ipam-cni=ready
#         plus the cniPlugins and ipoib sub-states configured above; unconfigured ones read "ignore"

# 2) Resource advertised on NIC nodes.
kubectl get nodes -o json \
  | jq '.items[].status.allocatable | with_entries(select(.key|test("rdma")))'
# expect on each NIC node: { "rdma/shared_ib": "63" }  (nodes without a matching NIC print {})

# 3) Run the multi-homed pod and inspect inside it.
kubectl apply -f rdma-test-pod.yaml
kubectl wait --for=condition=Ready pod/rdma-test-pod-a --timeout=120s

kubectl exec rdma-test-pod-a -- ip -br addr show     # expect a secondary net1/ibX with a 192.168.5.x IP
kubectl exec rdma-test-pod-a -- ibstat               # expect an HCA, State: Active, Physical state: LinkUp
kubectl exec rdma-test-pod-a -- ibv_devinfo          # expect PORT_ACTIVE
kubectl exec rdma-test-pod-a -- ls /dev/infiniband   # expect uverbs*, rdma_cm

Expected signal: status.state reads ready; rdma/shared_ib is non-zero on each NIC node; inside the pod there is a secondary ibX/net1 interface with an IPPool address, ibstat reports Active/LinkUp, and /dev/infiniband is populated. Prove it end to end with a two-pod ib_write_bw run, or a 2-node nccl-tests run whose NCCL_DEBUG=INFO output shows NCCL selecting the IB HCA rather than [socket] (see fabric bring-up & benchmarking).
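
For a GitOps gate or CI step, the sub-state check can be turned into a pass/fail filter. The sketch below uses a hypothetical helper name and assumes the `name=state` line format produced by the jsonpath query in step 1, treating ready and ignore as the only acceptable states.

```shell
# Hypothetical helper: reads "name=state" lines on stdin, prints every
# sub-state that is neither "ready" nor "ignore", and returns non-zero
# if at least one such line was seen.
unready_substates() {
  awk -F= 'NF && $2 != "ready" && $2 != "ignore" { print; bad = 1 } END { exit bad }'
}

# Intended use against a live cluster (not executed here):
#   kubectl get nicclusterpolicy nic-cluster-policy \
#     -o jsonpath='{range .status.appliedStates[*]}{.name}{"="}{.state}{"\n"}{end}' \
#     | unready_substates || echo "policy has not converged"
```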

Failure modes

ifNames mismatch. Selector lists an interface the host does not have; plugin advertises 0, pods sit Pending on rdma/shared_ib. Fix the names to match ip -br link.
config as a YAML map instead of a JSON string. The plugin ignores it; no resource appears. Keep it a | block scalar of valid JSON.
Driver double-load. ofedDriver is set while a host driver is present; the modules conflict and the node goes NotReady. Drop the ofedDriver block when the driver is host-installed (see Ansible bring-up).
RDMA subsystem in exclusive mode. The shared device plugin cannot multiplex; only one pod ever schedules. Set the host RDMA netns to shared (rdma system set netns shared).
Missing IPC_LOCK. Verbs apps fail to pin memory (Cannot allocate memory / mlx5: …) even with the resource granted. Add IPC_LOCK to the container's securityContext capabilities.
Pod gets the resource but no IP. The networks annotation is absent or the attachment/IPPool is missing; the verbs device is present but ip addr shows no ibX. Apply the IPoIBNetwork + IPPool and add the annotation.
status.state: notReady. Read status.appliedStates to find the sub-state that is not ready, then check the corresponding operator pod logs in nvidia-network-operator.
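
The ifNames mismatch at the top of this list is the easiest failure to catch before applying anything. A minimal sketch, assuming the first column of `ip -br link` is the interface name (optionally suffixed with `@parent`) and using a hypothetical helper name:

```shell
# missing_ifnames IFNAME...
# Hypothetical helper: reads `ip -br link` output on stdin, prints each
# requested interface name that the host does not have, and returns
# non-zero if any are missing.
missing_ifnames() {
  local host want rc=0
  host=$(awk '{ sub(/@.*/, "", $1); print $1 }')   # keep the name column, drop any @parent suffix
  for want in "$@"; do
    if ! printf '%s\n' "$host" | grep -q -x -F -- "$want"; then
      echo "$want"
      rc=1
    fi
  done
  return $rc
}

# Intended use on each NIC node (not executed here):
#   ip -br link | missing_ifnames ibs1f0 ibs1f1
```

Any name it prints will never back the resource on that node, so either fix the selector or exclude the node.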

References

NicClusterPolicy CRD (sub-states, status): https://docs.nvidia.com/networking/display/kubernetes2640/life-cycle-management.html
Customization Options: https://docs.nvidia.com/networking/display/kubernetes2570/customizations/customization.html
Network Operator CRD API reference: https://docs.nvidia.com/networking/display/kubernetes2570/customizations/crds.html
IPoIB + RDMA shared device quick-start (NicClusterPolicy, IPoIBNetwork, IPPool, test pod): https://docs.nvidia.com/networking/display/kubernetes2570/quick-start/ipoib-rdma-shared.html
MacVLAN + RDMA shared device quick-start (RoCE path, MacvlanNetwork): https://docs.nvidia.com/networking/display/kubernetes2570/quick-start/macvlan-rdma-shared.html
Deployment Guide with Kubernetes (Helm, namespace): https://docs.nvidia.com/networking/display/kubernetes2570/deployment-guide-kubernetes.html
Network Operator source and chart: https://github.com/Mellanox/network-operator