Kubernetes network drivers: DRA-based networking¶
Scope: the Kubernetes Network Driver (KND) model, which manages NICs as first-class, schedulable Kubernetes resources through Dynamic Resource Allocation (DRA) and the Node Resource Interface (NRI) instead of the legacy CNI + device-plugin composition, and DraNet, its open-source reference implementation. This page covers why the legacy model cannot express NIC topology, how a claim flows from CEL selector to a device in the pod netns, and what the topology alignment is worth (measured NCCL numbers on B200 + RoCE nodes). It extends the DRA material in the GPU DRA driver and DRA ResourceClaim from GPUs to the network, and complements the fabric view in RDMA and RoCE tuning.
The YAML manifests are reference templates, unexecuted here; attribute names are driver-specific, so confirm them against the ResourceSlices your driver publishes and against your cluster's DRA API version. The Python example is executed and asserted. Benchmark numbers are the paper's, not reproduced locally.
flowchart TB
DISC["KND daemon per node discovers NICs and topology (PCI root, NUMA node)"] --> SLICE["ResourceSlice objects with qualitative attributes"]
CLAIM["Pod references a ResourceClaim (CEL selector: RDMA NIC on the same PCI root as the GPU)"] --> SCHED["Scheduler matches claim against slices, picks a node"]
SLICE --> SCHED
SCHED --> PREP["kubelet calls NodePrepareResources: driver prepares the device, claim carries config (push model, no API-server callback)"]
PREP --> SANDBOX["NRI RunPodSandbox hook: interface moved into the pod netns"]
SANDBOX --> CONT["NRI CreateContainer hook: RDMA char devices (/dev/infiniband/uverbsN) exposed"]
CONT --> RUN["Pod runs on a topology-aligned NIC"]
What it is¶
The Kubernetes Network Driver (KND) model (Ojea, arXiv 2506.23628) is an architecture for high-performance pod networking built on two first-class Kubernetes-era APIs instead of the container-runtime-delegated CNI path:
- DRA (Dynamic Resource Allocation) publishes each NIC as a device in a ResourceSlice with qualitative attributes (PCI root, NUMA node, RDMA capability), and lets pods request one through a ResourceClaim whose selector is a CEL expression. The scheduler sees those attributes, so placement can honor network topology. A NodePrepareResources hook runs slow setup before pod start, and the claim carries opaque driver configuration with it, a push model that removes API-server lookups from the pod-startup critical path.
- NRI (Node Resource Interface) gives drivers event-driven hooks into the container runtime lifecycle. A network driver subscribes to RunPodSandbox to move a prepared interface into the pod's network namespace and to CreateContainer to present device nodes such as /dev/infiniband/uverbsN. Multiple drivers (GPU, network) act on the same pod in parallel with no ordering dependency, unlike CNI chaining.
- Recent OCI runtime-spec work (PR #1271) adds declarative network-interface attachment, so the runtime itself performs the privileged netlink moves and the driver sheds capabilities.
DraNet is the reference implementation: a per-node daemon that discovers host interfaces and their topology, publishes ResourceSlices, and fulfils claims for anything from a plain veth-free host NIC to a Mellanox RoCE device. For GPU-aligned networking it composes with the NVIDIA DRA GPU driver as two independent drivers over the same standard API, replacing the traditional three-component chain (Multus meta-plugin + SR-IOV device plugin + RDMA CNI plugin).
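The discovery step can be sketched as a pure function from a NIC inventory to a ResourceSlice-shaped object. This is a minimal illustration, not DraNet's code: the attribute names (rdma, numaNode, pciRoot), the driver name, and the helper names are hypothetical, and the object layout follows one reading of resource.k8s.io/v1, so check it against the slices your driver actually publishes.

```python
# Sketch only: build a ResourceSlice-shaped dict from a discovered NIC inventory.
# Attribute names and the driver name are hypothetical; verify the schema
# against your cluster's DRA API version.
from typing import Any


def typed_attr(value: object) -> dict[str, object]:
    """Wrap a Python value in a typed attribute form (bool, int, or string)."""
    if isinstance(value, bool):  # test bool first: bool is a subclass of int
        return {"bool": value}
    if isinstance(value, int):
        return {"int": value}
    return {"string": str(value)}


def build_slice(node: str, driver: str, nics: list[dict[str, Any]]) -> dict[str, Any]:
    """Publish every NIC as one device whose attributes carry its topology."""
    devices = [
        {
            "name": nic["name"],
            "attributes": {k: typed_attr(v) for k, v in nic.items() if k != "name"},
        }
        for nic in nics
    ]
    return {
        "apiVersion": "resource.k8s.io/v1",
        "kind": "ResourceSlice",
        "metadata": {"name": f"{node}-{driver}"},
        "spec": {
            "driver": driver,
            "nodeName": node,
            "pool": {"name": node, "generation": 1, "resourceSliceCount": 1},
            "devices": devices,
        },
    }


inventory = [
    {"name": "eth1", "rdma": True, "numaNode": 0, "pciRoot": "pci0000:00"},
    {"name": "eth0", "rdma": False, "numaNode": 0, "pciRoot": "pci0000:80"},
]
node_slice = build_slice("node-a", "dra.net", inventory)
print([d["name"] for d in node_slice["spec"]["devices"]])
```

The point of the sketch is that topology becomes data the scheduler can read: each device carries attributes rather than contributing to an opaque count.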
Why use it¶
- The scheduler finally sees the network. CNI is invoked by the container runtime after placement, so it cannot influence scheduling; device plugins advertise only an opaque count. KND exposes NUMA node and PCI root as schedulable attributes, which is exactly what GPU-NIC alignment needs.
- Measured performance, not folklore. On two GKE a4-highgpu-8g nodes (8x B200 and 8 RoCE NICs each), the paper's topologically aligned claims reached 46.59 GB/s NCCL all-gather bus bandwidth at 8 GB messages versus 29.20 GB/s (with 5.62 GB/s standard deviation) when the GPU came from the topology-blind device plugin: up to 59.6% higher all-gather and 58.1% higher all-reduce throughput, and far lower variance. The unaligned setup is a 1-in-8 lottery on whether GPU and NIC share a PCI root.
- Fewer moving parts. Two composable drivers (GPU + network) replace the Multus + device plugin + CNI-plugin chain, removing the annotation-passing that previously synchronized allocation with configuration, and the shim-binary-to-daemon lifecycle mismatch that fails pod startup when the daemon restarts.
- Fast, predictable startup. Claimed-NIC pods started in 1.8 s median (P99 2.3 s) in the paper's 100-run benchmark, against minutes for re-provisioning VMs to fix topology out of band.
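The headline percentages follow directly from the paper's all-gather figures quoted above. A short worked calculation, using only those three numbers (the variable names are illustrative):

```python
# Worked arithmetic on the paper's 8 GB all-gather figures (GB/s).
aligned = 46.59          # topology-aligned claims
unaligned_mean = 29.20   # GPU from the topology-blind device plugin
unaligned_std = 5.62     # standard deviation of the unaligned runs

gain = (aligned - unaligned_mean) / unaligned_mean   # relative to unaligned
loss = (aligned - unaligned_mean) / aligned          # relative to aligned
cv = unaligned_std / unaligned_mean                  # coefficient of variation

print(f"aligned is {gain:.1%} faster than unaligned")      # 59.6%
print(f"unaligned loses {loss:.1%} of aligned bandwidth")  # 37.3%
print(f"unaligned run-to-run spread (CV): {cv:.1%}")       # 19.2%
```

The same gap reads as "59.6% higher" or "37.3% lower" depending on the baseline, so state which one a dashboard or report uses.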
When to use it (and when not)¶
- Use it for multi-NIC AI/ML nodes where GPUDirect RDMA throughput depends on GPU/NIC PCI locality, and for Telco/NFV workloads that need NUMA-aligned NICs and deterministic latency.
- Use it when you are already on DRA for GPUs (Kubernetes 1.34+ with the GA structured-parameters API); adding the network side keeps one allocation model for all devices.
- Do not use it for ordinary pod east-west traffic: the primary CNI still provides the default pod network, network policy, and Services. KND manages additional, specialized interfaces; it does not replace Calico or Cilium.
- Hold off if you must run cluster versions that only ship DRA as beta feature gates, or if your workloads cannot tolerate an evolving device-status API. The paper itself ran on GKE v1.33.1 with the DynamicResourceAllocation feature gate enabled; the API is GA in 1.34+, but network-specific status reporting (KEP-4817) is newer, so verify against your version's documentation.
Architecture¶
Legacy composition fails high-performance networking in four documented ways: CNI runs below the scheduler's line of sight; device plugins are purely quantitative (a count, no attributes or topology); device plugins are per-container while a NIC is a pod-level resource; and nothing synchronizes the device plugin's allocation with the CNI plugin's configuration except fragile annotations. Two SIG-Network proposals to fix multi-networking natively (KEP-3698 Multi-Network, KEP-4410 KNI) identified the gap but were never implemented.
KND recomposes the stack. Discovery: the driver daemon inventories host interfaces and topology and publishes ResourceSlices. Claiming: the scheduler matches a CEL selector (for example, an RDMA NIC on the same PCI root as the claimed GPU) against slice attributes and binds pod and devices to one node. Preparation: NodePrepareResources runs before the sandbox exists, with configuration pushed inside the claim. Attachment: NRI's RunPodSandbox hook moves the interface into the pod netns (pod-level), then CreateContainer exposes RDMA character devices (container-level). The same mechanism models virtual and logical resources (SR-IOV VFs, an MPLS tunnel, a 5G RAN slice), which is where the paper's "galaxy of drivers" vision points, coordinated through the standardized claim device status of KEP-4817.
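The claiming step's alignment rule (an RDMA NIC on the same PCI root as the claimed GPU) can be modeled without a cluster. The sketch below is a simplified stand-in for what the scheduler evaluates, not its implementation; the pciRoot attribute name and the device names are hypothetical.

```python
# Sketch only: choose a GPU and an RDMA NIC that share one topology attribute.
# "pciRoot" and all device names are hypothetical placeholders.
from itertools import product
from typing import Optional

Device = dict[str, object]


def aligned_pair(
    gpus: list[Device], nics: list[Device], attribute: str = "pciRoot"
) -> Optional[tuple[str, str]]:
    """Return the first (gpu, nic) pair with equal `attribute`, else None."""
    for gpu, nic in product(gpus, nics):
        same_root = gpu.get(attribute) is not None and gpu.get(attribute) == nic.get(attribute)
        if nic.get("rdma") is True and same_root:
            return str(gpu["name"]), str(nic["name"])
    return None  # no aligned pair on this node: the pod must not land here


gpus = [{"name": "gpu0", "pciRoot": "pci0000:00"}, {"name": "gpu1", "pciRoot": "pci0000:80"}]
nics = [
    {"name": "eth0", "rdma": False, "pciRoot": "pci0000:00"},
    {"name": "eth2", "rdma": True, "pciRoot": "pci0000:80"},
]
print(aligned_pair(gpus, nics))  # gpu0 only has a non-RDMA NIC on its root
```

Returning None rather than an unaligned pair mirrors the design intent: a node that cannot satisfy the alignment is rejected at scheduling time instead of silently running slower.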
How to use it¶
Reference template (unexecuted): claim one RDMA NIC per pod with a CEL selector, and consume it next to a GPU claim. Attribute names below are illustrative; list the real ones from your driver with kubectl get resourceslices -o yaml.
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: rdma-nic
spec:
  spec:
    devices:
      requests:
      - name: nic
        exactly:
          deviceClassName: dranet
          selectors:
          - cel:
              expression: 'device.attributes["dra.net"].rdma == true'
---
apiVersion: v1
kind: Pod
metadata:
  name: nccl-worker
spec:
  resourceClaims:
  - name: nic
    resourceClaimTemplateName: rdma-nic
  - name: gpu
    resourceClaimTemplateName: single-gpu  # from the NVIDIA DRA driver
  containers:
  - name: worker
    image: nvcr.io/nvidia/pytorch:25.06-py3  # pin your tested tag
    resources:
      claims:
      - name: nic
      - name: gpu
Install DraNet per its repo (a DaemonSet plus DeviceClass; it composes with the NVIDIA DRA GPU driver without ordering constraints). The paper's benchmark methodology is reproducible with nccl-tests (all_gather_perf -b 8 -e 8G -f 2 -n 100 -w 50, one GPU and one NIC per process so NVLink cannot mask fabric performance).
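One easy mistake in manifests like the template above is a container that consumes a claim name the pod never declares. A small pre-apply check, sketched here on plain dicts (load your YAML with whatever parser you already use; the function name is hypothetical):

```python
# Sketch only: every name under containers[].resources.claims must be declared
# in spec.resourceClaims. Operates on an already-parsed manifest (plain dicts).
from typing import Any


def undeclared_claims(pod: dict[str, Any]) -> dict[str, list[str]]:
    """Map container name -> claim names it uses that the pod does not declare."""
    spec = pod.get("spec", {})
    declared = {claim["name"] for claim in spec.get("resourceClaims", [])}
    problems: dict[str, list[str]] = {}
    for container in spec.get("containers", []):
        used = [c["name"] for c in container.get("resources", {}).get("claims", [])]
        missing = [name for name in used if name not in declared]
        if missing:
            problems[container["name"]] = missing
    return problems


pod = {
    "spec": {
        "resourceClaims": [
            {"name": "nic", "resourceClaimTemplateName": "rdma-nic"},
            {"name": "gpu", "resourceClaimTemplateName": "single-gpu"},
        ],
        "containers": [
            {"name": "worker", "resources": {"claims": [{"name": "nic"}, {"name": "gpu"}]}}
        ],
    }
}
print(undeclared_claims(pod))  # empty dict: the manifest is consistent
```

Running it in CI next to the GitOps-managed templates catches a renamed claim before the API server rejects the pod.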
How to develop with it¶
A KND has three contracts: publish devices with truthful attributes (discovery), allocate exclusively against selectors (scheduling), and configure at the right lifecycle hook (NRI). The scheduling semantics are the part most implementers get wrong, and they are checkable without a cluster. The following model is executed and asserted: conjunctive selector matching, exclusive ownership, NUMA preference with fallback, and the adversarial case where an unsatisfiable claim must surface as a defined pending state rather than a partial allocation:
# knd_alloc.py - validated: structured-parameter matching and exclusive allocation
# semantics of DRA network-device claims. Pure stdlib; models the API objects, not a cluster.
from typing import Optional

Device = dict[str, object]  # published attributes: name, driver, rdma, numaNode, speedGbps


def matches(device: Device, selectors: dict[str, object]) -> bool:
    """Conjunctive attribute selection (what a CEL selector expresses)."""
    return all(device.get(key) == want for key, want in selectors.items())


def allocate(
    devices: list[Device],
    selectors: dict[str, object],
    allocated: set[str],
    prefer_numa: Optional[int] = None,
) -> Optional[str]:
    """First-fit with exclusive ownership; NUMA preference reorders candidates.

    Returns the chosen device name, or None: the claim stays pending (a defined
    result the scheduler retries), never a silent partial allocation.
    """
    candidates = [d for d in devices if matches(d, selectors) and d["name"] not in allocated]
    if not candidates:
        return None
    if prefer_numa is not None:
        # Stable sort: devices on the preferred node (key False) move to the front.
        candidates.sort(key=lambda d: d["numaNode"] != prefer_numa)
    chosen = str(candidates[0]["name"])
    allocated.add(chosen)
    return chosen


devices: list[Device] = [
    {"name": "eth1", "driver": "mlx5_core", "rdma": True, "numaNode": 0, "speedGbps": 400},
    {"name": "eth2", "driver": "mlx5_core", "rdma": True, "numaNode": 1, "speedGbps": 400},
    {"name": "eth0", "driver": "gve", "rdma": False, "numaNode": 0, "speedGbps": 100},
]
allocated: set[str] = set()
rdma_sel: dict[str, object] = {"driver": "mlx5_core", "rdma": True}

# 1. Selectors are conjunctive: only the two RDMA-capable mlx5 NICs match.
assert [d["name"] for d in devices if matches(d, rdma_sel)] == ["eth1", "eth2"]

# 2. NUMA preference picks the NIC on the requested node when one is free.
first = allocate(devices, rdma_sel, allocated, prefer_numa=1)
assert first == "eth2", first

# 3. Exclusive ownership: an allocated device is never handed out twice;
#    the second claim falls back to the off-NUMA NIC instead of sharing eth2.
second = allocate(devices, rdma_sel, allocated, prefer_numa=1)
assert second == "eth1", second

# 4. Adversarial: an unsatisfiable claim yields a defined pending result.
assert allocate(devices, rdma_sel, allocated) is None  # both RDMA NICs taken
assert allocate(devices, {"rdma": True, "speedGbps": 800}, set()) is None  # no such device

print("allocation semantics validated:", {"first": first, "second": second})
Output: allocation semantics validated: {'first': 'eth2', 'second': 'eth1'}. When a real driver diverges from these semantics (shared ownership for multi-tenant VFs, scoring instead of first-fit), that divergence should be explicit in the DeviceClass, not implicit in driver code.
How to maintain it¶
- Track API maturity deliberately. The claim/slice API is resource.k8s.io/v1 on 1.34+; earlier clusters serve beta groups behind feature gates. Upgrades that bump the served version require re-applying DeviceClasses and templates; keep them in GitOps and diff against the driver's shipped defaults.
- Watch the KEP-4817 device status. Standardized network-interface status in the claim (status.devices) is what lets independent drivers compose and what your tooling should read for the allocated interface name, MAC, and IPs. Until every driver you run reports it, keep per-driver fallbacks out of shared automation.
- Version the runtime side too. NRI hooks depend on containerd/CRI-O versions that ship the NRI revision your driver expects (pod IP visibility landed via NRI PR #119); the OCI declarative-interface path needs a runtime that implements runtime-spec PR #1271. Record both in the node image spec, next to the driver and toolkit pins.
- Re-run the alignment benchmark after upgrades. Alignment regressions are silent: everything schedules, NCCL just gets slower and noisier. A periodic NCCL fabric benchmark comparing aligned-claim pods against the historical baseline catches a broken selector or attribute rename within a day.
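The periodic benchmark comparison can be as small as the sketch below. The thresholds (10% bandwidth drop, 10% coefficient of variation) and the sample values are hypothetical placeholders, not figures from the paper; the baseline is the paper's aligned 8 GB all-gather number, which you would replace with your own fleet's history.

```python
# Sketch only: flag an alignment regression from a handful of busbw samples.
# Thresholds and sample values are hypothetical; tune them on your own history.
from statistics import mean, stdev


def alignment_regressions(
    baseline_gbps: float,
    samples_gbps: list[float],
    max_drop: float = 0.10,
    max_cv: float = 0.10,
) -> list[str]:
    """Return human-readable reasons the run looks unaligned (empty if healthy)."""
    reasons: list[str] = []
    avg = mean(samples_gbps)
    drop = 1.0 - avg / baseline_gbps
    if drop > max_drop:
        reasons.append(f"mean busbw down {drop:.0%} versus baseline")
    if len(samples_gbps) > 1:
        cv = stdev(samples_gbps) / avg
        if cv > max_cv:
            reasons.append(f"run-to-run variation {cv:.0%} exceeds {max_cv:.0%}")
    return reasons


healthy = alignment_regressions(46.59, [46.1, 46.8, 46.4])
degraded = alignment_regressions(46.59, [29.0, 35.0, 24.0])
print(healthy, degraded)
```

Checking variance as well as the mean matters because a broken selector shows up as a lottery: some runs still land on an aligned NIC, so the mean alone can understate the problem.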
Production¶
- Make alignment the default template. Publish one blessed ResourceClaimTemplate pair (GPU + same-PCI-root NIC) per node shape, and treat raw nvidia.com/gpu counter requests on multi-NIC nodes as a lint failure; the paper's data shows the counter path degrades mean bandwidth by more than a third and multiplies variance.
- Admission and quota still apply. Claims consume devices exclusively; pair them with Kueue quota or gang scheduling so a half-scheduled distributed job cannot strand claimed NICs.
- Observability. Alert on claims pending with no matching device (capacity signal), on pods whose NCCL busbw sits far below the aligned baseline (alignment signal), and on NRI hook error rates in runtime logs (configuration signal). Wire these into the SLOs for cluster and fabric.
- Startup-latency budget. Pods with claimed NICs took seconds to start (P50 1.8 s, P99 2.3 s) in the paper's runs; if a batch platform assumes sub-second pod starts, claim preparation (NodePrepareResources and the NRI hooks) is where to look first.
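The lint for raw counter requests can be sketched as a check over a parsed pod manifest. This is a minimal illustration; where it runs (CI, an admission webhook, a policy engine) and the function name are assumptions, not something the paper or DraNet prescribes.

```python
# Sketch only: find containers that request GPUs through the device-plugin
# counter (nvidia.com/gpu) instead of a DRA claim. Input is a parsed manifest.
from typing import Any


def raw_gpu_counter_containers(
    pod: dict[str, Any], resource: str = "nvidia.com/gpu"
) -> list[str]:
    """Return names of containers whose limits or requests name `resource`."""
    spec = pod.get("spec", {})
    flagged: list[str] = []
    for container in spec.get("initContainers", []) + spec.get("containers", []):
        resources = container.get("resources", {})
        if any(resource in resources.get(section, {}) for section in ("limits", "requests")):
            flagged.append(container["name"])
    return flagged


legacy_pod = {
    "spec": {
        "containers": [
            {"name": "trainer", "resources": {"limits": {"nvidia.com/gpu": 1}}},
            {"name": "sidecar", "resources": {"limits": {"cpu": "1"}}},
        ]
    }
}
print(raw_gpu_counter_containers(legacy_pod))
```

Scope the rule to multi-NIC node shapes, since on single-NIC nodes the counter path has no alignment to lose.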
Failure modes¶
- Claim pending forever, no matching device. The CEL selector references an attribute the driver does not publish (or a renamed one after a driver upgrade). Diagnose by diffing the selector against kubectl get resourceslices -o yaml on a candidate node.
- Driver/kubelet version skew. A driver publishing a newer slice schema than the kubelet's DRA version silently fails allocation; keep the driver's supported-version matrix pinned in the platform repo.
- NRI hook failure leaves a half-configured pod. If the sandbox hook moved the interface but the container hook failed (or the runtime lacks the expected NRI revision), the pod holds a NIC it cannot use; the runtime logs the failed hook, and deleting the pod must return the device to the pool. Verify that reclaim actually happens under fault injection before trusting it in production.
- Topology mismatch despite claims. Requesting the NIC via DRA but the GPU via the legacy device plugin reintroduces the 1-in-8 alignment lottery the paper measured (29.20 GB/s mean with 5.62 GB/s stddev versus 46.59 GB/s aligned at 8 GB all-gather). Both devices must come from topology-aware claims.
- Default-network confusion. KND interfaces are additional; deleting or misconfiguring the primary CNI still takes down pod networking. Keep the boundary explicit in runbooks so on-call does not debug Cilium for a DraNet claim failure or vice versa.
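The selector-versus-slice diff for the pending-claim case can be partly automated. The sketch below is deliberately narrow: it only recognizes the device.attributes["domain"].name form used in the template on this page, it does not parse CEL in general, and the set of published names is assumed to have been collected from your ResourceSlices beforehand in a hypothetical "domain/name" form.

```python
# Sketch only: list attributes a selector references that no device publishes.
# Handles just the device.attributes["domain"].name pattern; not a CEL parser.
import re

ATTR_REF = re.compile(r'device\.attributes\["([^"]+)"\]\.([A-Za-z_]\w*)')


def unpublished_attributes(expression: str, published: set[str]) -> list[str]:
    """Return referenced 'domain/name' attributes missing from `published`."""
    referenced = [f"{domain}/{name}" for domain, name in ATTR_REF.findall(expression)]
    return [ref for ref in dict.fromkeys(referenced) if ref not in published]


selector = (
    'device.attributes["dra.net"].rdma == true && '
    'device.attributes["dra.net"].pciRoot == "pci0000:00"'
)
published = {"dra.net/rdma", "dra.net/numaNode"}  # hypothetical, from your slices
print(unpublished_attributes(selector, published))
```

An empty result does not prove the claim is satisfiable (values may still not match, or devices may be taken); it only rules out the renamed-attribute case quickly.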
References¶
- Ojea, The Kubernetes Network Driver Model: A Composable Architecture for High-Performance Networking (arXiv 2506.23628): https://arxiv.org/abs/2506.23628 (HF mirror: https://huggingface.co/papers/2506.23628)
- DraNet, reference KND implementation: https://github.com/google/dranet
- Kubernetes documentation, Dynamic Resource Allocation: https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/
- KEP-4381, DRA with structured parameters: https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/4381-dra-structured-parameters
- KEP-4817, DRA ResourceClaim status with standardized network interface data: https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/4817-resource-claim-device-status/README.md
- Node Resource Interface (NRI): https://github.com/containerd/nri (pod IP visibility: https://github.com/containerd/nri/pull/119)
- OCI runtime-spec declarative network interfaces (PR #1271): https://github.com/opencontainers/runtime-spec/pull/1271
- NVIDIA DRA driver for GPUs: https://github.com/NVIDIA/k8s-dra-driver-gpu
- Device plugins (the model KND supersedes for NICs): https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/
- Hockin, Why Kubernetes doesn't use libnetwork (CNI origin): https://kubernetes.io/blog/2016/01/why-kubernetes-doesnt-use-libnetwork/
- NCCL tests (benchmark workload): https://github.com/NVIDIA/nccl-tests
Related: Kubernetes GPU scheduling · Helm: DRA driver · Manifest: DRA ResourceClaim · Helm: Network Operator · Manifest: NicClusterPolicy · RDMA and RoCE tuning · Kubernetes · Topology-aware K8s scheduling · BlueField DPUs · Networking fabric