Kubernetes pod networking over a WireGuard hybrid fleet

Scope: making a real Kubernetes pod network work when the nodes are not on a shared L2 fabric but are stitched together by a WireGuard overlay. This is the specific, sharp-edged case of a CNI (Flannel VXLAN) encapsulated inside WireGuard. It covers the double-encapsulation MTU math, route injection for a hybrid cloud-VPC + remote-NAT fleet, the VXLAN "wedge" failure mode, and how to monitor tunnel/VXLAN health so a silent break does not take down every GPU pod. This is the CNI layer on top of the overlay substrate; the substrate itself (WireGuard, keepalive, collectives) is on the overlay & mesh networking page.

What it is

A single-datacenter cluster runs its CNI over a flat, trusted L2/L3 fabric. A geo-distributed or multi-provider GPU cluster has no such fabric (nodes sit behind different NATs), so the WireGuard overlay provides node-to-node reachability and the CNI must run on top of it. The common, reliable choice is Flannel's VXLAN backend: pod traffic is encapsulated by Flannel (the flannel.1 VXLAN device), and that encapsulated packet is itself encapsulated by WireGuard (wg0). Two layers of tunnel, stacked:

pod payload → [VXLAN: flannel.1, UDP 8472] → [WireGuard: wg0, UDP 51820] → underlay (internet/VPC)

That stack is what lets a pod on a node at provider A reach a pod on a node at provider B by ordinary pod IP: Services, the operator's API calls, apiserver → kubelet, and pod-IP collective traffic all ride it. It also introduces two failure surfaces a single-DC cluster never sees: MTU stacking (each layer steals header bytes) and VXLAN/tunnel health (the encapsulation can wedge while the link looks up).

flowchart LR
  subgraph NA["Node A (provider A, NAT)"]
    PA["Pod A"] --> FA["flannel.1 (VXLAN)"] --> WA["wg0 (WireGuard)"]
  end
  subgraph NB["Node B (provider B, NAT)"]
    WB["wg0 (WireGuard)"] --> FB["flannel.1 (VXLAN)"] --> PB["Pod B"]
  end
  WA <-->|"UDP 51820 over internet"| WB
  WD["VXLAN-health watchdog<br/>(textfile exporter)"] -.->|"FDB, neighbors, route checks"| FA
  WD -.-> FB

Why it matters

On a hybrid GPU fleet the pod network is load-bearing for everything: the GPU-allocation operator reaching kubelets, Services resolving, distributed-training ranks talking over pod IPs. When the CNI-over-overlay stack breaks, it breaks in ways that are uniquely hard to diagnose because partial connectivity survives:

MTU black holes drop only large packets. SSH, pings, and health checks pass; a checkpoint write, an image pull, or a gradient all-reduce hangs mid-stream. The cluster looks healthy and isn't.
One-way NAT breaks sever apiserver → kubelet:10250 while the reverse direction keeps working. The node looks Ready from its own heartbeat, but the control plane cannot scrape it or exec into it (a 502 signature).
VXLAN wedge (stale forwarding/neighbor entries on flannel.1) makes pods on a node unreachable, a 503 signature for every workload there, even though WireGuard's handshake is fresh.

Because these are silent and partial, they need active monitoring, not just "is the node Ready?". Getting the MTU and the health-gating right is the difference between a geo fleet that runs GPU workloads and one that mysteriously stalls them.

When to use it (and when not)

Use CNI-over-WireGuard when:

A single Kubernetes/K3s cluster spans multiple NATs, providers, or regions and needs pod-to-pod networking across them (K3s, overlay & mesh).
A cloud-VPC control plane must reach remote bare-metal/rented GPU nodes behind NAT (the hybrid split-plane architecture).

Do not use it when:

All nodes share a real datacenter fabric. Run a standard CNI (Cilium/Calico/Flannel) directly on the L2/L3 network; an overlay-over-overlay adds latency, an MTU hit, and a failure surface for nothing (networking fabric).
The performance-critical path is synchronous collective traffic. Keep that on RDMA/RoCE/IB, not a double-encapsulated WAN tunnel; reserve the overlay for control traffic and low-frequency DiLoCo-style sync.

How to build and operate it

The values below are reference values that have not been executed, and the MTU figures depend entirely on your underlay. Measure your actual path MTU and confirm the Flannel/WireGuard/K3s flags for your versions before relying on any number.

1. Compute the Flannel MTU from the stack, do not leave it default

Each encapsulation layer subtracts header bytes. Size flannel.1 so the fully-encapsulated packet still fits the underlay path MTU:

flannel.1 MTU = path_MTU − WireGuard_overhead − VXLAN_overhead
              = path_MTU − (60 IPv4 | 80 IPv6) − 50

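1500-byte internet path: flannel.1 ≈ 1500 − 80 − 50 = 1370 (using the conservative 80-byte IPv6 figure; an IPv4-only underlay leaves 1500 − 60 − 50 = 1390).
9000-byte jumbo path (e.g. within/between cloud VPCs that support it): flannel.1 ≈ 9000 − 80 − 50 = 8870. A value near 8871 (for example, what a 9001-byte underlay MTU would give) is the jumbo-underlay case; it is not a universal constant and only works if the entire path carries ~9000-byte frames.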
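
The arithmetic above can be wrapped in a small helper so the MTU is derived from a measured path MTU rather than typed by hand. This is a minimal sketch: the function name `flannel_mtu` and its arguments are hypothetical, and the per-header breakdown in the comments assumes VXLAN over IPv4 inside the tunnel.

```shell
# Sketch: derive the flannel.1 MTU from the underlay path MTU.
# Names are hypothetical; overheads follow the formula in the text.

WG_OVERHEAD_V4=60   # outer IPv4 (20) + UDP (8) + WireGuard data header and auth tag (32)
WG_OVERHEAD_V6=80   # outer IPv6 (40) + UDP (8) + WireGuard data header and auth tag (32)
VXLAN_OVERHEAD=50   # outer IPv4 (20) + UDP (8) + VXLAN (8) + inner Ethernet (14)

# flannel_mtu PATH_MTU [4|6] -> prints the largest flannel.1 MTU that still fits
flannel_mtu() {
  local path_mtu="$1"
  local family="${2:-6}"            # default to the IPv6 overhead: the conservative choice
  local wg_overhead="$WG_OVERHEAD_V6"
  if [ "$family" = "4" ]; then
    wg_overhead="$WG_OVERHEAD_V4"
  fi
  echo $(( path_mtu - wg_overhead - VXLAN_OVERHEAD ))
}

flannel_mtu 1500 6   # 1370
flannel_mtu 1500 4   # 1390
flannel_mtu 9000 6   # 8870
```

Defaulting to the IPv6 overhead costs 20 bytes on an IPv4-only underlay but avoids a black hole if any WireGuard endpoint is reached over IPv6.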

Set it explicitly in the Flannel/K3s config; never inherit 1500 on a tunnelled interface. Clamp TCP MSS on the pod path as a second line of defence so TCP negotiates a safe segment even if a path MTU is mis-estimated (overlay & mesh).
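
As a rough illustration of the MSS clamp, the sketch below computes the TCP MSS that fits a given flannel.1 MTU and prints (rather than runs) an iptables rule that pins it. The function names are hypothetical, and matching on `-o flannel.1` in the `FORWARD` chain is an assumption about where pod traffic is forwarded on your nodes; verify against your own ruleset before applying anything.

```shell
# Sketch: compute a safe TCP MSS and print a clamp rule for review (dry run).

# mss_for_mtu MTU [4|6] -> TCP MSS that fits in one packet of that MTU
mss_for_mtu() {
  local mtu="$1"
  local family="${2:-4}"
  local hdr=40                      # IPv4 header (20) + TCP header (20), no options
  if [ "$family" = "6" ]; then
    hdr=60                          # IPv6 header (40) + TCP header (20)
  fi
  echo $(( mtu - hdr ))
}

# print_mss_clamp_rule MTU [DEV] -> prints an iptables command, does not execute it
print_mss_clamp_rule() {
  local mtu="$1"
  local dev="${2:-flannel.1}"       # assumed egress device for the pod path
  echo "iptables -t mangle -A FORWARD -o $dev -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss $(mss_for_mtu "$mtu" 4)"
}

print_mss_clamp_rule 1370
```

With the 1370-byte MTU from the 1500-byte example, the printed rule sets an MSS of 1330. The `TCPMSS` target also accepts `--clamp-mss-to-pmtu`, which follows the route MTU instead of a fixed number.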

2. Inject routes for the hybrid topology

In a mixed cloud-VPC + remote-node fleet, the pod CIDR and the WireGuard subnet must be routed correctly from each segment. Remote spokes route the cluster pod CIDR and the WireGuard CIDR back through the server/gateway; cloud-side nodes need per-VPC-subnet routes to the WireGuard network so control-plane traffic reaches the spokes. A missing route leaves a pod CIDR unreachable from one segment: pods schedule but cannot talk.
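
A dry-run helper makes the intended routes reviewable before they touch a node. The sketch below only prints `ip route replace` commands; the function name, the WireGuard CIDR `10.8.0.0/24`, the gateway address `10.0.1.10`, and the device `eth0` are all made-up placeholders, not values from this fleet. On the WireGuard side the peer's AllowedIPs must also cover any CIDR routed onto wg0, otherwise the packet is dropped inside the tunnel device.

```shell
# Sketch: print (not apply) the routes for each segment.

# print_route CIDR DEV [VIA] -> prints one "ip route replace" command
print_route() {
  local cidr="$1"
  local dev="$2"
  local via="${3:-}"
  if [ -n "$via" ]; then
    echo "ip route replace $cidr via $via dev $dev"
  else
    echo "ip route replace $cidr dev $dev"   # L3 tunnel device: no next hop needed
  fi
}

# Remote spoke: send the (hypothetical) WireGuard CIDR into the tunnel
print_route 10.8.0.0/24 wg0

# Cloud-side node: reach the WireGuard CIDR through the gateway's VPC address
print_route 10.8.0.0/24 eth0 10.0.1.10
```

Using `replace` rather than `add` keeps the command idempotent, so a reconcile loop can re-run it safely.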

3. Deploy a VXLAN/tunnel-health watchdog

flannel.1 can wedge without WireGuard reporting any problem. Run a per-node watchdog that inspects the VXLAN device and emits metrics via the node-exporter textfile collector (no privileged exporter needed):

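# Watchdog sketch → /var/lib/node-exporter/textfile/vxlan_health.prom (illustrative)
# Stale neighbor/FDB entries and pod-CIDR routes that escaped onto wg0 are the wedge signatures.
out=/var/lib/node-exporter/textfile/vxlan_health.prom
{
  echo "vxlan_fdb_entries_total $(bridge fdb show dev flannel.1 | wc -l)"
  echo "vxlan_stale_neighbor_entries $(ip neigh show dev flannel.1 | grep -c STALE)"
  # 10.42.0.1 is an example pod-CIDR address; a non-zero count means the pod CIDR is misrouted onto wg0
  echo "flannel_route_via_wg0 $(ip route get 10.42.0.1 | grep -c wg0)"
} > "$out.tmp" && mv "$out.tmp" "$out"   # rename atomically so the collector never reads a partial file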

4. Alert on the partial-failure signatures

Standard "node Ready" checks miss every failure above. Add explicit alerts (observability & monitoring, alerting & burn-rate):

flannel.1 missing on a GPU node → every pod there returns 503; page if it affects all GPU nodes.
Node Ready but kubelet:10250 unreachable → the one-way WireGuard/NAT break (502); the control plane is blind to a node that thinks it is fine.
Stale VXLAN neighbors / misrouted pod CIDR → wedge precursor; reconcile the route/FDB.
WireGuard reconcile stale / handshake age high → the substrate under the CNI is degrading (overlay & mesh).
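
The handshake-age signal in the last item can be derived from `wg show wg0 latest-handshakes`, which prints one public key and one Unix timestamp per peer (0 when no handshake has happened). The sketch below is a filter over that output; the function name and the 180-second threshold are assumptions to tune for your keepalive settings.

```shell
# Sketch: list WireGuard peers whose last handshake is too old.

# stale_peers NOW MAX_AGE -> reads "<public-key> <unix-timestamp>" lines on stdin,
# prints the key of every peer that never completed a handshake or is older than MAX_AGE seconds
stale_peers() {
  awk -v now="$1" -v max="$2" '$2 == 0 || now - $2 > max { print $1 }'
}

# Example with canned input (keys shortened for readability):
printf 'peerA\t1000\npeerB\t0\npeerC\t1190\n' | stale_peers 1200 180
# prints peerA and peerB; peerC shook hands 10 seconds before "now"
```

On a node this would be fed live data, for example `wg show wg0 latest-handshakes | stale_peers "$(date +%s)" 180`, and the count of output lines exported as a gauge.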

Tie a sustained failure into health gating: taint/cordon a node whose tunnel or VXLAN is unhealthy so the scheduler stops landing GPU pods on an island that cannot serve them.
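
One way to wire that gate, sketched as a dry run: given a node name and a count of failed health checks, print the `kubectl` commands an operator or controller would run. The function name and the taint key `example.com/tunnel-unhealthy` are hypothetical; how "sustained" is decided (consecutive failures, alert state) is left to the caller.

```shell
# Sketch: print the gating commands for an unhealthy node (dry run, nothing is executed).

# print_gate_cmds NODE UNHEALTHY_COUNT -> prints kubectl commands, or nothing if the node is healthy
print_gate_cmds() {
  local node="$1"
  local unhealthy="$2"
  if [ "$unhealthy" -gt 0 ]; then
    echo "kubectl cordon $node"
    # hypothetical taint key; NoSchedule keeps new pods off without evicting running ones
    echo "kubectl taint nodes $node example.com/tunnel-unhealthy=true:NoSchedule --overwrite"
  fi
}

print_gate_cmds gpu-node-1 2
```

Printing instead of executing keeps the blast radius visible: the same output can be reviewed in a drill before a controller is trusted to act on it.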

5. Keep the substrate healthy

This page assumes a working overlay. The keepalive, roaming-endpoint, and MTU concerns of the WireGuard layer itself (including the PersistentKeepalive that prevents the stale-NAT-endpoint one-way break) live on the overlay & mesh networking page; fix those first.

Failure modes

Default MTU on flannel.1. Large packets black-hole; checkpoints and image pulls hang while pings pass. Compute MTU from the stack and clamp MSS.
VXLAN wedge. Stale FDB/neighbor entries on flannel.1 → pods unreachable (503) despite a fresh WireGuard handshake. Watchdog + reconcile.
One-way apiserver → kubelet. Stale NAT endpoint breaks the hub→spoke direction (502); node looks Ready, control plane can't reach it. Keepalive on the WireGuard layer.
Missing route injection. A pod CIDR unreachable from one segment; pods schedule but cannot communicate.
Monitoring blindness. Relying on "node Ready" misses all of the above. Add the partial-failure alerts.
Overlay-over-real-fabric. Running this stack inside a single DC that already has a fabric is pure overhead and extra failure surface. Use a direct CNI.

Open questions & validation

Measured path MTU between every provider pair (it is not uniform), and the resulting per-segment flannel.1 MTU, verified with a don't-fragment probe rather than assumed.
Whether Flannel VXLAN, or an alternative (Cilium/Calico over WireGuard, or WireGuard-native CNI), best fits your fleet's scale and observability needs.
Watchdog coverage: every GPU node reports VXLAN health, and the AllDown/Unreachable alerts actually fire in a drill.
Recovery runbook for a wedged flannel.1 (reconcile vs. interface bounce vs. node cordon) and its blast radius.
Interaction with the overlay substrate's keepalive/reconcile cadence.
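
For the path-MTU item, a don't-fragment probe on Linux can be built from `ping -M do`, which sets the DF bit so an oversized packet fails instead of being fragmented. The sketch below prints the command for a candidate MTU; the function name and the address `203.0.113.10` are placeholders, and the 28-byte subtraction assumes an IPv4 underlay.

```shell
# Sketch: build a don't-fragment ping for a candidate path MTU (IPv4, Linux iputils).

# df_probe_cmd HOST MTU -> prints a ping that only succeeds if MTU-byte packets pass unfragmented
df_probe_cmd() {
  local host="$1"
  local mtu="$2"
  local payload="$(( mtu - 28 ))"   # IPv4 header (20) + ICMP header (8)
  echo "ping -M do -c 3 -s $payload $host"
}

df_probe_cmd 203.0.113.10 1500   # ping -M do -c 3 -s 1472 203.0.113.10
```

Run the printed command against the underlay address of each provider pair, lowering the MTU until the probe succeeds; the largest passing value is the path MTU to feed into the flannel.1 calculation.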

References

Flannel — VXLAN backend and MTU handling: https://github.com/flannel-io/flannel/blob/master/Documentation/backends.md
WireGuard (overlay substrate, MTU/overhead): https://www.wireguard.com/
K3s networking (Flannel, CNI options, --flannel-backend): https://docs.k3s.io/networking
Kubernetes — cluster networking model: https://kubernetes.io/docs/concepts/cluster-administration/networking/
Kubernetes — kubelet ports (:10250, the apiserver → kubelet path): https://kubernetes.io/docs/reference/networking/ports-and-protocols/
Prometheus node_exporter — textfile collector (host-level metrics without a privileged exporter): https://github.com/prometheus/node_exporter#textfile-collector