Manifest: NicClusterPolicy
Scope: the NicClusterPolicy CRD that drives the NVIDIA Network Operator. The manifest wires up the OFED/DOCA driver, the RDMA shared device plugin (resourceName, ifNames), and a secondary network (Multus + NV-IPAM). A multi-homed pod then requests the RDMA resource, and verification confirms that the pod sees a secondary ibX interface and that NIC nodes advertise rdma/shared_ib as allocatable. Pairs with NVIDIA Network Operator; the sharing/scheduler layers sit downstream in the GPU platform hub.
Reference templates from the NVIDIA Network Operator v25.7 quick-start CRDs. Pinned chart/image versions and ifNames are placeholders; ifNames must match the real interface names on your hosts. Apply via GitOps; never hand-edit in production. Not validated on this hardware.
```mermaid
flowchart TB
    NCP["NicClusterPolicy (one per cluster)"] --> OFED["ofedDriver (DOCA/MOFED)"]
    NCP --> RDP["rdmaSharedDevicePlugin -> rdma/shared_ib"]
    NCP --> SEC["secondaryNetwork: multus + ipoib + nv-ipam"]
    SEC --> NAD["IPoIBNetwork / NetworkAttachmentDefinition"]
    RDP --> POD["Multi-homed pod: rdma/shared_ib + ibX"]
    NAD --> POD
```
What it is
NicClusterPolicy is the single cluster-scoped CRD the Network Operator reconciles into a working RDMA data plane. It is defined once per cluster. Each top-level key is a sub-state the operator deploys and tracks:
- ofedDriver: the DOCA/MOFED kernel-module container, built and loaded per node (omit if the driver is host-installed).
- rdmaSharedDevicePlugin: advertises one host RDMA device to many pods as an extended resource rdma/<resourceName> (shared, not exclusive; rdmaHcaMax caps concurrent consumers).
- secondaryNetwork: Multus, the reference CNI plugins, and the IPoIB CNI, so a pod can attach a second NIC beyond the cluster pod network. NV-IPAM, which assigns that NIC's address, is configured under the sibling spec.nvIpam key.
The shared device plugin gives a pod the /dev/infiniband verbs device; the secondary network gives it the L3 ibX interface. A distributed-training pod needs both: the resource alone yields no IP, the interface alone yields no verbs handle. If either is missing, NCCL falls back to TCP (see performance tuning).
apiVersion: mellanox.com/v1alpha1, kind: NicClusterPolicy. The resource itself is cluster-scoped and has no namespace; the operator and the components it deploys run in the nvidia-network-operator namespace.
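The "needs both" rule can be checked mechanically before a pod is submitted. The following is a minimal sketch, not part of the operator: the helper name missing_rdma_pieces and its input (a pod manifest already loaded into a Python dict) are assumptions for illustration, and the default resource name is the one used on this page.

```python
NETWORKS_ANNOTATION = "k8s.v1.cni.cncf.io/networks"


def missing_rdma_pieces(pod, resource="rdma/shared_ib"):
    """Return a list describing which of the two required pieces a pod lacks.

    Hypothetical helper: `pod` is a pod manifest loaded as a dict.
    """
    missing = []

    # Piece 1: the Multus annotation that attaches the secondary network.
    annotations = (pod.get("metadata") or {}).get("annotations") or {}
    if not annotations.get(NETWORKS_ANNOTATION):
        missing.append("networks annotation: no secondary ibX interface, no IP")

    # Piece 2: a limit on the RDMA extended resource in at least one container.
    containers = (pod.get("spec") or {}).get("containers") or []
    claims_resource = any(
        resource in ((c.get("resources") or {}).get("limits") or {})
        for c in containers
    )
    if not claims_resource:
        missing.append(resource + " limit: no /dev/infiniband verbs device")

    return missing
```

An empty list means the manifest asks for both the verbs device and the secondary interface; it says nothing about whether the cluster can actually satisfy them.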
Prerequisites
- Network Operator installed and Ready; see NVIDIA Network Operator (chart nvidia/network-operator v25.7.0).
- Mellanox/NVIDIA ConnectX or BlueField NICs (PCI vendor 15b3). Confirm host interface names (ip -br link / ibstat) before setting ifNames.
- Node Feature Discovery running so the operator schedules onto NIC-bearing nodes (--set nfd.enabled=true at install).
- RDMA subsystem in shared mode on hosts for the shared device plugin (rdma system shows netns shared); host driver loaded if you omit ofedDriver.
- Multus enabled in the policy (below) before any k8s.v1.cni.cncf.io/networks annotation will resolve.
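Confirming interface names before setting ifNames can be scripted. The sketch below parses the text output of ip -br link and reports configured names the host does not have. The function names are hypothetical, and the sample output is hand-written for illustration (placeholder link addresses), not captured from a real host.

```python
def host_interfaces(ip_br_link_output):
    """Extract interface names from the text output of `ip -br link`."""
    names = set()
    for line in ip_br_link_output.splitlines():
        fields = line.split()
        if fields:
            # Names may appear as "name@parent"; keep the part before "@".
            names.add(fields[0].split("@", 1)[0])
    return names


def unknown_if_names(if_names, ip_br_link_output):
    """Return configured ifNames that are absent from the host."""
    present = host_interfaces(ip_br_link_output)
    return [name for name in if_names if name not in present]


# Illustrative sample only; real output will differ per host.
SAMPLE = """\
lo      UNKNOWN  00:00:00:00:00:00 <LOOPBACK,UP,LOWER_UP>
ibs1f0  UP       00:00:00:00:00:01 <BROADCAST,MULTICAST,UP,LOWER_UP>
"""

print(unknown_if_names(["ibs1f0", "ibs1f1"], SAMPLE))  # ['ibs1f1']
```

Any name this reports would, per the failure modes on this page, leave the plugin advertising zero allocatable for that selector.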
The manifest
One NicClusterPolicy, one NV-IPAM IPPool, one IPoIBNetwork attachment, and the test pod that consumes them. Pin every image; set ifNames to your real IB interfaces.
```yaml
# 1) Cluster-wide RDMA data plane. apiVersion/kind/fields per NVIDIA NetOp v25.7 quick-start.
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy        # operator expects this exact name
spec:
  ofedDriver:                     # OMIT this whole block if the driver is host-installed
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: doca3.1.0-25.07-0.9.7.0-0
  rdmaSharedDevicePlugin:
    image: k8s-rdma-shared-dev-plugin
    repository: nvcr.io/nvidia/mellanox
    version: network-operator-v25.7.0
    config: |
      {
        "configList": [
          {
            "resourceName": "shared_ib",
            "rdmaHcaMax": 63,
            "selectors": {
              "vendors": ["15b3"],
              "ifNames": ["ibs1f0", "ibs1f1"]
            }
          }
        ]
      }
  secondaryNetwork:
    cniPlugins:
      image: plugins
      repository: nvcr.io/nvidia/mellanox
      version: network-operator-v25.7.0
    multus:
      image: multus-cni
      repository: nvcr.io/nvidia/mellanox
      version: network-operator-v25.7.0
    ipoib:
      image: ipoib-cni
      repository: nvcr.io/nvidia/mellanox
      version: network-operator-v25.7.0
  nvIpam:                         # spec-level key, sibling of secondaryNetwork
    image: nvidia-k8s-ipam
    repository: nvcr.io/nvidia/mellanox
    version: network-operator-v25.7.0
    enableWebhook: false
```
```yaml
# 2) NV-IPAM pool the IPoIB attachment draws addresses from.
apiVersion: nv-ipam.nvidia.com/v1alpha1
kind: IPPool
metadata:
  name: ipoib-pool-a
  namespace: nvidia-network-operator   # NV-IPAM watches this namespace
spec:
  subnet: 192.168.5.0/24
  perNodeBlockSize: 50
  gateway: 192.168.5.1
```
```yaml
# 3) Secondary IPoIB network. master must be a real IB interface (an ifName from above).
apiVersion: mellanox.com/v1alpha1
kind: IPoIBNetwork
metadata:
  name: ipoib-network-a
spec:
  networkNamespace: "default"     # namespace where consuming pods run
  master: "ibs1f0"
  ipam: |
    {
      "type": "nv-ipam",
      "poolName": "ipoib-pool-a"
    }
```
```yaml
# 4) Multi-homed pod: requests rdma/shared_ib AND attaches the IPoIB secondary network.
apiVersion: v1
kind: Pod
metadata:
  name: rdma-test-pod-a
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: ipoib-network-a   # -> secondary ibX interface
spec:
  restartPolicy: Never
  containers:
    - name: test
      image: "<rping-test-image>@sha256:<digest>"  # replace with a tested, pinned RDMA test image
      command: ["/bin/bash", "-c", "sleep infinity"]
      securityContext:
        capabilities:
          add: ["IPC_LOCK"]                        # required for RDMA memory pinning
      resources:
        requests:
          rdma/shared_ib: 1                        # rdma/<resourceName> from configList
        limits:
          rdma/shared_ib: 1
```
resourceName: "shared_ib" is what makes the extended resource surface as rdma/shared_ib. Change the name and every pod requests/limits key and the node allocatable change with it.
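The link between resourceName and the advertised resource can be sanity-checked before the policy is applied. This is a minimal sketch under stated assumptions: advertised_resources is a hypothetical helper, it takes the value of rdmaSharedDevicePlugin.config exactly as it would load from YAML, and it assumes the rdma/ prefix described on this page. It does not reproduce the plugin's own validation.

```python
import json


def advertised_resources(config):
    """Map each expected extended-resource name to its rdmaHcaMax.

    Raises TypeError if `config` is not a string (for example it was written
    as a YAML map and loaded as a dict) and ValueError if the JSON is invalid
    or an entry has no resourceName.
    """
    if not isinstance(config, str):
        raise TypeError("config must be a JSON string (YAML block scalar)")
    parsed = json.loads(config)  # json.JSONDecodeError subclasses ValueError
    if not isinstance(parsed, dict):
        raise ValueError("top level of config must be a JSON object")

    resources = {}
    for entry in parsed.get("configList", []):
        if "resourceName" not in entry:
            raise ValueError("configList entry without resourceName")
        # Assumption: the resource is exposed under the "rdma/" prefix.
        resources["rdma/" + entry["resourceName"]] = entry.get("rdmaHcaMax")
    return resources


CONFIG = """
{
  "configList": [
    {
      "resourceName": "shared_ib",
      "rdmaHcaMax": 63,
      "selectors": {"vendors": ["15b3"], "ifNames": ["ibs1f0", "ibs1f1"]}
    }
  ]
}
"""

print(advertised_resources(CONFIG))  # {'rdma/shared_ib': 63}
```

The keys of the returned dict are the names a pod must use under requests/limits.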
Configuration
| Field | Where | Meaning | Note |
|---|---|---|---|
| spec.ofedDriver | NicClusterPolicy | DOCA/MOFED driver container (image/repository/version) | Omit entirely when the host driver is installed; never run both (Ansible bring-up) |
| spec.rdmaSharedDevicePlugin.config | NicClusterPolicy | JSON string: configList[] of device groups | A string, not a YAML map; keep it as a block scalar |
| configList[].resourceName | plugin config | Suffix of the advertised resource rdma/<resourceName> | shared_ib -> rdma/shared_ib |
| configList[].rdmaHcaMax | plugin config | Max concurrent pods sharing the device | Shared semantics; not isolation |
| configList[].selectors.ifNames | plugin config | Host interfaces to back the resource | Must match real names (ip -br link); mismatch = 0 allocatable |
| configList[].selectors.vendors | plugin config | PCI vendor filter | 15b3 = Mellanox/NVIDIA |
| spec.secondaryNetwork.multus | NicClusterPolicy | Meta-CNI that attaches the second NIC | Required for the k8s.v1.cni.cncf.io/networks annotation |
| spec.secondaryNetwork.ipoib | NicClusterPolicy | IPoIB CNI image | Use macvlan path instead for RoCE/Ethernet |
| spec.nvIpam | NicClusterPolicy | NV-IPAM controller + CNI | Backs "type": "nv-ipam" in attachment ipam |
| IPoIBNetwork.spec.master | IPoIBNetwork | Parent IB interface for the secondary link | One of the ifNames |
| IPPool.spec.perNodeBlockSize | IPPool | Addresses carved per node from subnet | Sized to max pods/node |
| Pod annotations k8s.v1.cni.cncf.io/networks | Pod | Names the attachment(s) to add | Comma-separate for multiple NICs |
| Pod resources.limits rdma/<name> | Pod | Claim one share of the RDMA device | IPC_LOCK capability also required |
For RoCE/Ethernet fabrics, swap the ipoib sub-state and IPoIBNetwork for the MacVLAN path: secondaryNetwork.cniPlugins + a MacvlanNetwork (spec.master, mode: bridge, mtu, ipam). The rdmaSharedDevicePlugin block is identical.
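The perNodeBlockSize row has a sizing consequence worth making explicit: the subnet bounds how many nodes can receive a block. The sketch below computes only an upper bound, assuming NV-IPAM carves non-overlapping blocks of perNodeBlockSize addresses out of the subnet; it does not model how reserved addresses (network, broadcast, gateway) are handled, so the real figure may be lower. The function name is hypothetical.

```python
import ipaddress


def max_node_blocks(subnet, per_node_block_size):
    """Upper bound on the number of per-node blocks that fit in the subnet."""
    network = ipaddress.ip_network(subnet)
    return network.num_addresses // per_node_block_size


# Values from the IPPool on this page: a /24 has 256 addresses, 256 // 50 = 5.
print(max_node_blocks("192.168.5.0/24", 50))  # 5
```

With the pool as written, at most five nodes can be served; a larger cluster would need a wider subnet or a smaller block size.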
Apply & verify
```bash
kubectl apply -f nic-cluster-policy.yaml
kubectl apply -f ippool.yaml
kubectl apply -f ipoib-network.yaml

# 1) Policy converged: global state and every active sub-state must read "ready".
kubectl get nicclusterpolicy nic-cluster-policy -o jsonpath='{.status.state}'; echo
kubectl get nicclusterpolicy nic-cluster-policy \
  -o jsonpath='{range .status.appliedStates[*]}{.name}{"="}{.state}{"\n"}{end}'
# expect: state-OFED=ready  state-RDMA-device-plugin=ready
#         state-multus-cni=ready  state-nv-ipam-cni=ready
#         (the cniPlugins and ipoib sub-states also read "ready"; unconfigured ones read "ignore")

# 2) Resource advertised on NIC nodes.
kubectl get nodes -o json \
  | jq '.items[].status.allocatable | with_entries(select(.key|test("rdma")))'
# expect: { "rdma/shared_ib": "63" }

# 3) Run the multi-homed pod and inspect inside it.
kubectl apply -f rdma-test-pod.yaml
kubectl wait --for=condition=Ready pod/rdma-test-pod-a --timeout=120s
kubectl exec rdma-test-pod-a -- ip -br addr show     # expect a secondary net1/ibX with a 192.168.5.x IP
kubectl exec rdma-test-pod-a -- ibstat               # expect an HCA, State: Active, Physical state: LinkUp
kubectl exec rdma-test-pod-a -- ibv_devinfo          # expect PORT_ACTIVE
kubectl exec rdma-test-pod-a -- ls /dev/infiniband   # expect uverbs*, rdma_cm
```
Expected signal: status.state: ready; rdma/shared_ib non-zero on each NIC node; inside the pod a secondary ibX/net1 interface with an IPPool address, ibstat Active/LinkUp, and /dev/infiniband populated. Prove end-to-end with a two-pod ib_write_bw or a 2-node nccl-tests run showing NCCL_DEBUG=INFO selecting the IB HCA, not [socket] (fabric bring-up & benchmarking).
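The "non-zero on each NIC node" check can be automated over the output of kubectl get nodes -o json. This is a sketch with a hypothetical function name; it reads only status.allocatable and treats a missing key as zero, so nodes without NICs also show up as 0 and need to be filtered by the caller.

```python
import json


def rdma_allocatable(nodes_json, resource="rdma/shared_ib"):
    """Return {node name: allocatable count} for one extended resource."""
    result = {}
    for node in json.loads(nodes_json).get("items", []):
        name = node["metadata"]["name"]
        allocatable = (node.get("status") or {}).get("allocatable") or {}
        # Kubernetes reports quantities as strings; absent means not advertised.
        result[name] = int(allocatable.get(resource, "0"))
    return result
```

Typical use would be to pipe the kubectl output into a script that calls rdma_allocatable(sys.stdin.read()) and flags any NIC node whose count is 0.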
Failure modes
- ifNames mismatch. The selector lists an interface the host does not have; the plugin advertises 0 and pods sit Pending on rdma/shared_ib. Fix the names to match ip -br link.
- config as a YAML map instead of a JSON string. The plugin ignores it; no resource appears. Keep it a | block scalar of valid JSON.
- Driver double-load. ofedDriver is set while a host driver is present; the modules conflict and the node goes NotReady. Drop the ofedDriver block when the driver is host-installed (Ansible bring-up).
- RDMA subsystem in exclusive mode. The shared device plugin cannot multiplex; only one pod ever schedules. Set the host RDMA netns to shared (rdma system set netns shared).
- Missing IPC_LOCK. Verbs apps fail to pin memory (Cannot allocate memory / mlx5: …) even with the resource granted.
- Pod gets the resource but no IP. The networks annotation is absent or the attachment/IPPool is missing; the verbs device is present but ip addr shows no ibX. Apply the IPoIBNetwork + IPPool and add the annotation.
- status.state: notReady. Read status.appliedStates for the sub-state stuck off ready, then the corresponding operator pod logs in nvidia-network-operator.
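For the notReady case, finding the stuck sub-state is a simple filter over status.appliedStates. The sketch below assumes the input is the output of kubectl get nicclusterpolicy nic-cluster-policy -o json and that each applied state carries the name and state fields used by the jsonpath query on this page; the function name is hypothetical.

```python
import json


def stuck_sub_states(policy_json):
    """Return {sub-state name: state} for states that are neither ready nor ignore."""
    status = json.loads(policy_json).get("status") or {}
    return {
        s["name"]: s["state"]
        for s in status.get("appliedStates", [])
        if s.get("state") not in ("ready", "ignore")
    }
```

An empty result with status.state still off ready suggests the status has not been refreshed yet; otherwise the returned names indicate which component's pod logs to read first.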
References
- NicClusterPolicy CRD (sub-states, status): https://docs.nvidia.com/networking/display/kubernetes2640/life-cycle-management.html
- Customization Options: https://docs.nvidia.com/networking/display/kubernetes2570/customizations/customization.html
- Network Operator CRD API reference: https://docs.nvidia.com/networking/display/kubernetes2570/customizations/crds.html
- IPoIB + RDMA shared device quick-start (NicClusterPolicy, IPoIBNetwork, IPPool, test pod): https://docs.nvidia.com/networking/display/kubernetes2570/quick-start/ipoib-rdma-shared.html
- MacVLAN + RDMA shared device quick-start (RoCE path, MacvlanNetwork): https://docs.nvidia.com/networking/display/kubernetes2570/quick-start/macvlan-rdma-shared.html
- Deployment Guide with Kubernetes (Helm, namespace): https://docs.nvidia.com/networking/display/kubernetes2570/deployment-guide-kubernetes.html
- Network Operator source and chart: https://github.com/Mellanox/network-operator
Related: Network Operator · GPU Platform hub · Kubernetes for GPUs · Fabric bring-up · Ansible · Glossary