Overlay & mesh networking for distributed GPU clusters

Scope: the encrypted overlay network that stitches GPU nodes living behind different providers' NATs and firewalls (across regions, clouds, or a decentralized pool) into one flat, mutually-authenticated, addressable cluster, so a scheduler, a rendezvous store, and cross-worker communication can treat them as a single fabric. Covers WireGuard hub-and-spoke vs full mesh, NAT traversal and keepalive, overlay addressing, the MTU/MSS trap, WAN rendezvous, and which traffic must not cross the overlay. This is the network substrate beneath geo-distributed training; for the algorithm see DiLoCo and the geo-distributed recipe.

What it is

An overlay network is a virtual L3 network laid over the public internet: each node gets a stable private address regardless of where it physically sits, and traffic between nodes is encrypted and authenticated end to end. For multi-provider GPU clusters the de-facto tool is WireGuard, an in-kernel VPN built on the Noise protocol framework. It runs over a single UDP port (51820 by convention), identifies peers by static public keys, and supports roaming endpoints: a peer's address is learned from the source of its authenticated packets, so the tunnel self-heals when a provider hands out a new IP. [WireGuard]

Two topologies:

Hub-and-spoke: every node ("spoke") peers only with a hub that has a public IP. Spokes need no public inbound (they dial out), the hub forwards traffic between them, and total peer config is O(n): one entry per spoke on the hub and a single hub entry on each spoke. Simplest to operate behind hostile NATs; the hub's bandwidth is the ceiling.
Full mesh: every pair of nodes peers directly. Lowest latency and no relay bottleneck, but peer config grows as O(n²) and every pair needs direct reachability or NAT hole-punching.
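The scaling difference between the two topologies can be made concrete with a small counting sketch. It assumes the hub is a separate machine that is not itself one of the GPU nodes; the function name and return keys are illustrative only.

```python
def peer_entries(nodes: int, topology: str) -> dict:
    """Count tunnels and [Peer] entries needed for `nodes` GPU nodes.

    Assumption: in hub-and-spoke the hub is an extra machine, not a GPU node.
    """
    if nodes < 1:
        raise ValueError("need at least one node")
    if topology == "hub-and-spoke":
        tunnels = nodes                       # one tunnel per spoke, to the hub
        entries = 2 * nodes                   # hub lists each spoke; each spoke lists the hub
        per_node = 1                          # a spoke only knows the hub
    elif topology == "full-mesh":
        tunnels = nodes * (nodes - 1) // 2    # one tunnel per unordered pair
        entries = nodes * (nodes - 1)         # every node lists every other node
        per_node = nodes - 1
    else:
        raise ValueError(f"unknown topology: {topology!r}")
    return {"tunnels": tunnels, "peer_entries": entries, "entries_per_node": per_node}


if __name__ == "__main__":
    for topology in ("hub-and-spoke", "full-mesh"):
        print(topology, peer_entries(100, topology))
```

At 100 nodes the mesh needs 4950 tunnels against 100 for hub-and-spoke, which is why full mesh is normally paired with automated peer reconciliation.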

A production pattern reconciles peers automatically: a per-node agent watches a desired-state record (e.g. a Kubernetes CRD describing each edge) and runs wg set peer … to converge the local config as nodes join and leave. This is the same controller pattern used by the rest of the platform.

flowchart TB
  subgraph WAN["Public internet (untrusted)"]
    HUB["Hub (public IP)<br/>:51820/udp"]
  end
  subgraph P1["Provider A (NAT)"]
    S1["GPU node<br/>wg0 10.42.0.11"]
  end
  subgraph P2["Provider B (NAT)"]
    S2["GPU node<br/>wg0 10.42.0.12"]
  end
  S1 <-->|"encrypted UDP + keepalive 10s"| HUB
  S2 <-->|"encrypted UDP + keepalive 10s"| HUB
  S1 -. "overlay 10.42.0.0/16: scheduler, rendezvous, Gloo/NCCL-socket" .- S2

Why it matters

The economics of geo-distributed and neocloud GPU pools depend on using whatever capacity is cheapest, wherever it is, but those nodes do not share a fabric. They sit behind provider NATs with no public inbound, dynamic IPs, and restrictive firewalls, in different administrative domains. Three things break without an overlay:

Connectivity. A scheduler control plane cannot reach a kubelet behind NAT; workers in different clouds cannot open sockets to each other. The overlay gives every node a stable, routable address.
Trust. Inside a datacenter the fabric is physically controlled; across the public internet it is not. WireGuard authenticates every peer by public key and encrypts all traffic, restoring the trust an in-DC fabric gives for free.
Identity stability. Distributed-training ranks, rendezvous endpoints, and service discovery must survive a provider re-assigning a node's public IP. Binding the cluster to overlay IPs decouples identity from the churny underlay.

This is what makes cross-DC DiLoCo practical: the algorithm tolerates a slow WAN link by syncing every H steps, but it still needs the workers to reach each other securely in the first place. The overlay is that substrate.

When to use it (and when not)

Use an overlay/mesh when:

GPU nodes span multiple providers, regions, or NATs and must act as one cluster: geo-distributed DiLoCo, bursting across several neoclouds, or a decentralized marketplace stitching rented nodes.
A central control plane must reach nodes behind NAT (scheduler → agent, apiserver → kubelet).
Traffic crosses the public internet and therefore must be encrypted and peer-authenticated.

Do not put it on the hot path when:

All nodes already share one high-bandwidth InfiniBand/RoCE fabric in a single datacenter. The overlay adds encryption cost and an MTU hit for zero benefit; run native collectives on the real fabric (networking-fabric, RDMA & RoCE tuning).
The traffic is latency-sensitive synchronous collective communication (per-step all-reduce for FSDP/DDP). A WAN overlay will stall it, which is exactly the regime DiLoCo exists to avoid. The decision is per-link: fast fabric inside an island → run FSDP there; slow link between islands → put DiLoCo's low-frequency sync on the overlay.
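The per-link decision can be sanity-checked with rough arithmetic. The sketch below is a deliberately simple model, not a benchmark: it assumes each sync moves one fixed payload across the slow link at the link's full rate, with no overlap between compute and communication and no compression. All parameter names and the example numbers are hypothetical.

```python
def sync_overhead(payload_bytes: float, link_bits_per_s: float,
                  step_seconds: float, h_steps: int) -> float:
    """Fraction of wall-clock time spent syncing when syncing every h_steps.

    Simplifying assumptions: no compute/communication overlap, the payload
    crosses the slow link once per sync, and the link runs at full rate.
    """
    sync_seconds = payload_bytes * 8 / link_bits_per_s
    compute_seconds = h_steps * step_seconds
    return sync_seconds / (compute_seconds + sync_seconds)


if __name__ == "__main__":
    payload = 2e9   # hypothetical: about 1B parameters at 2 bytes each
    link = 1e9      # hypothetical: 1 Gbit/s usable overlay throughput
    for h in (1, 50, 500):
        print(f"H={h}: {sync_overhead(payload, link, 1.0, h):.1%} of time in sync")
```

With these illustrative inputs a per-step sync spends most of its time waiting on the link, while H=500 brings the overhead down to a few percent, which is the regime where the overlay is a reasonable transport.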

How to build and operate it

Reference templates below, unexecuted. Confirm MTU, keepalive, and env-var names against WireGuard and your framework docs before relying on them; values that work on one provider's NAT may not on another.

1. Address the overlay and stand up peers

Pick a private CIDR (e.g. 10.42.0.0/16), assign one address per node, and bind the cluster (k3s/k8s, the rendezvous store, collective sockets) to those overlay IPs so identities are stable across underlay IP changes.
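One way to keep overlay addresses stable as the fleet changes is to allocate them from the CIDR while never moving an address that is already assigned. This is a minimal sketch using only the Python standard library; the function name and the node names are hypothetical, and a real allocator would persist the mapping somewhere durable.

```python
import ipaddress


def assign_overlay_ips(cidr, nodes, existing=None):
    """Give each node one host address from `cidr`, keeping prior assignments.

    `existing` maps node name -> address already handed out; those entries
    are never changed, so a node's overlay identity survives fleet churn.
    """
    network = ipaddress.ip_network(cidr)
    assigned = dict(existing or {})
    used = set(assigned.values())
    # Lazily walk host addresses in order, skipping ones already in use.
    free = (str(host) for host in network.hosts() if str(host) not in used)
    for node in nodes:
        if node in assigned:
            continue
        try:
            assigned[node] = next(free)
        except StopIteration:
            raise RuntimeError(f"overlay CIDR {cidr} is exhausted") from None
    return assigned


if __name__ == "__main__":
    first = assign_overlay_ips("10.42.0.0/16", ["gpu-a", "gpu-b"])
    later = assign_overlay_ips("10.42.0.0/16", ["gpu-a", "gpu-b", "gpu-c"], first)
    print(later)
```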

# /etc/wireguard/wg0.conf  (spoke side) — reference shape
[Interface]
Address = 10.42.0.11/16
PrivateKey = <spoke-private-key>
MTU = 1420                       # leave headroom for WG overhead (see §3)

[Peer]                           # the hub
PublicKey = <hub-public-key>
Endpoint = hub.example.net:51820
AllowedIPs = 10.42.0.0/16        # route the whole overlay via the hub
PersistentKeepalive = 10         # REQUIRED behind NAT — see §2

2. Keep NAT mappings alive (the detail that bites)

A spoke behind NAT must send traffic more often than the provider's NAT idle timeout, or the NAT reaps the idle UDP mapping. When that happens the hub's cached endpoint for the spoke goes stale, and the hub→spoke direction silently breaks (e.g. apiserver → kubelet:10250, or a coordinator trying to reach a worker) while spoke→hub still works, producing a baffling one-way partition. The fix is PersistentKeepalive (a low value such as 10s is a safe default against aggressive NATs): the periodic keepalive packets keep the mapping open, and the hub relearns the spoke's roaming endpoint from each authenticated packet. Treat last-handshake age as a per-peer health signal.
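Handshake age can be turned into a health check by parsing the output of wg show wg0 latest-handshakes, which prints one line per peer with the public key and a Unix timestamp (0 when no handshake has completed). The sketch below is a parser only; the 180-second threshold is an assumption chosen because an active tunnel normally renews its handshake roughly every two minutes, and it should be tuned per environment.

```python
def stale_peers(latest_handshakes: str, now: float, max_age_s: float = 180.0) -> dict:
    """Return {public_key: age_seconds or None} for peers that look unhealthy.

    `latest_handshakes` is the text printed by `wg show <iface> latest-handshakes`.
    An age of None means the peer has never completed a handshake.
    The 180 s default is an assumption, not a WireGuard-mandated value.
    """
    stale = {}
    for line in latest_handshakes.splitlines():
        if not line.strip():
            continue
        public_key, timestamp = line.split()
        seconds = int(timestamp)
        age = None if seconds == 0 else now - seconds
        if age is None or age > max_age_s:
            stale[public_key] = age
    return stale


if __name__ == "__main__":
    import subprocess
    import time

    # Requires WireGuard tools and sufficient privileges on the node.
    output = subprocess.run(
        ["wg", "show", "wg0", "latest-handshakes"],
        check=True, capture_output=True, text=True,
    ).stdout
    print(stale_peers(output, time.time()))
```

Exporting the age per peer, rather than only a boolean, makes it easier to spot a keepalive interval that is close to a provider's NAT timeout.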

3. Set MTU and clamp MSS, or large transfers black-hole

WireGuard adds roughly 60 bytes (IPv4) / 80 bytes (IPv6) of encapsulation. If the tunnel MTU is left at 1500, large packets fragment or get dropped, and because the drops hit only big packets, the symptom is the cruel one: SSH and pings work, but a checkpoint or dataset-shard transfer hangs mid-stream (a PMTU black hole). Set the tunnel MTU with headroom: 1420 (1500 - 80) is safe over a 1500-byte underlay whether the outer transport is IPv4 or IPv6, and it must go lower when the underlay MTU is itself below 1500, as on PPPoE links. Then clamp TCP MSS on the overlay path so TCP negotiates a safe segment size.

# Clamp MSS to the path MTU on the WireGuard interface (reference).
iptables -t mangle -A FORWARD -o wg0 -p tcp --tcp-flags SYN,RST SYN \
  -j TCPMSS --clamp-mss-to-pmtu
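The arithmetic behind these values is simple enough to compute per underlay. The sketch below uses the encapsulation overheads quoted above (60 bytes over IPv4, 80 over IPv6) and the standard 20-byte TCP header with no options; the function names are illustrative.

```python
WG_OVERHEAD = {4: 60, 6: 80}        # WireGuard encapsulation bytes by outer IP version
INNER_HEADERS = {4: 20 + 20, 6: 40 + 20}  # inner IP header + TCP header (no options)


def tunnel_mtu(underlay_mtu: int, outer_ip_version: int) -> int:
    """Largest inner packet that fits in one underlay packet."""
    return underlay_mtu - WG_OVERHEAD[outer_ip_version]


def tcp_mss(mtu: int, inner_ip_version: int = 4) -> int:
    """TCP payload per segment for a given tunnel MTU."""
    return mtu - INNER_HEADERS[inner_ip_version]


if __name__ == "__main__":
    for name, mtu, version in [("ethernet/IPv4", 1500, 4),
                               ("ethernet/IPv6", 1500, 6),
                               ("PPPoE/IPv6", 1492, 6)]:
        t = tunnel_mtu(mtu, version)
        print(f"{name}: tunnel MTU {t}, IPv4 TCP MSS {tcp_mss(t)}")
```

Using the IPv6 overhead everywhere gives one value that is safe regardless of which address family a peer's endpoint resolves to, at a cost of 20 bytes per packet on IPv4 paths.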

4. Rendezvous and collectives over the overlay

Distributed training needs a coordination/rendezvous store every worker can reach over the overlay. A subtle trap: PyTorch torchelastic's c10d backend uses rank-0-as-master semantics regardless of the endpoint string, so a single dedicated "rendezvous pod" does not buy elasticity; it is an inert middleman. For a churny, fault-tolerant geo-distributed pool, back rendezvous with etcd (torchelastic's etcd-v2 backend) so workers can join and leave without restarting the world. [PyTorch Elastic]
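As an illustration of an etcd-backed launch, the helper below only assembles a torchrun command line; it does not start anything. The etcd address, job id, node range, and script name are hypothetical placeholders, and the flag spellings should be checked against the torchrun documentation for the installed PyTorch version.

```python
def torchrun_argv(etcd_endpoint: str, job_id: str, min_nodes: int,
                  max_nodes: int, gpus_per_node: int, script: str) -> list:
    """Build a torchrun command that uses the etcd-v2 rendezvous backend.

    The min:max node range is what lets workers join and leave between
    restarts; every worker must use the same rendezvous id and endpoint.
    """
    if not 1 <= min_nodes <= max_nodes:
        raise ValueError("need 1 <= min_nodes <= max_nodes")
    return [
        "torchrun",
        f"--nnodes={min_nodes}:{max_nodes}",
        f"--nproc_per_node={gpus_per_node}",
        "--rdzv_backend=etcd-v2",
        f"--rdzv_endpoint={etcd_endpoint}",   # etcd reachable on the overlay
        f"--rdzv_id={job_id}",
        script,
    ]


if __name__ == "__main__":
    # Hypothetical values: etcd listening on an overlay address.
    print(" ".join(torchrun_argv("10.42.0.1:2379", "diloco-demo", 2, 8, 8, "train.py")))
```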

For the cross-worker collective, NCCL-over-IB is unavailable across an overlay; pin collectives to the WireGuard interface and use the TCP/socket transport (or gloo). For DiLoCo the outer-loop sync is low-frequency, so TCP over the overlay is appropriate; do not place a step-synchronous group here.

# Bind collective transports to the overlay interface (reference).
export GLOO_SOCKET_IFNAME=wg0
export NCCL_SOCKET_IFNAME=wg0
export NCCL_IB_DISABLE=1          # no IB across a WAN overlay; use sockets
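The same binding can be done from Python before the process group is initialized, with a guard that fails fast when the overlay interface is missing. This is a sketch: the helper name is hypothetical, and the env and interfaces parameters exist only so the function can be exercised without touching the real environment.

```python
import os
import socket


def bind_collectives_to_overlay(ifname="wg0", env=None, interfaces=None):
    """Point Gloo and NCCL socket transports at the overlay interface.

    Must run before the collective backend is initialized, because the
    variables are read at init time. Raises if the interface is absent,
    which is clearer than a hang on the wrong network.
    """
    env = os.environ if env is None else env
    if interfaces is None:
        interfaces = [name for _index, name in socket.if_nameindex()]
    if ifname not in interfaces:
        raise RuntimeError(f"overlay interface {ifname!r} not found: {interfaces}")
    env["GLOO_SOCKET_IFNAME"] = ifname
    env["NCCL_SOCKET_IFNAME"] = ifname
    env["NCCL_IB_DISABLE"] = "1"    # no InfiniBand across a WAN overlay
    return env


if __name__ == "__main__":
    try:
        bind_collectives_to_overlay()
        print("collectives bound to wg0")
    except RuntimeError as exc:
        print(exc)
```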

5. Operate the mesh

Reconcile peers as the fleet changes (controller/DaemonSet writing wg peers from desired state), rotate keys on a schedule, and export per-peer handshake age, overlay RTT, and goodput to observability & monitoring. Ensure the underlay firewall permits outbound UDP/51820 from every spoke and inbound UDP/51820 on the hub; a blocked UDP port is the most common "the mesh won't come up" cause.

6. Reconcile the mesh as CRDs at scale

Hand-edited peer lists rot the moment the fleet churns. The production pattern is to make each mesh edge a declarative object (a MeshEdge custom resource naming the node pair, endpoints, and keys) and run a per-node agent (a Kubernetes controller) that watches the edges touching its node and converges the local WireGuard config (wg set peer on add, peer removal on delete) so the kernel state always matches the declared topology. Three disciplines make this reliable:

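Active + passive split. One DaemonSet applies peers (the agent); a second probes reachability through the tunnel (for example, an echo to each peer's overlay address, recording whether it answered). Probe over the overlay rather than at the raw WireGuard port: WireGuard stays silent to unauthenticated packets, so a bare UDP probe to a peer's endpoint never gets a reply. The apply path converges state; the probe path observes it.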
Multi-writer status via Server-Side Apply. The agent writes "applied on node A/B"; the probe writes "reachable A→B / B→A". Distinct field managers let both write one object without clobbering, the same SSA discipline a GPU-allocation operator uses.
Canonical pair naming + idempotent apply. Name each edge by its lexicographically-ordered node pair (nodeA--nodeB) so it is created once regardless of which side requests it, and skip wg set when kernel state already matches (distinguish a real change from a no-op) to avoid netlink churn.
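Canonical naming and the idempotent diff are both small pure functions. The sketch below assumes peer state is represented as a mapping from public key to a settings dict; that representation and the function names are illustrative, not the platform's actual schema.

```python
def edge_name(node_a: str, node_b: str) -> str:
    """Name an edge by its lexicographically ordered node pair.

    Both sides compute the same name, so the edge object is created once
    no matter which node requests it.
    """
    if node_a == node_b:
        raise ValueError("an edge needs two distinct nodes")
    first, second = sorted((node_a, node_b))
    return f"{first}--{second}"


def plan_changes(desired: dict, actual: dict):
    """Diff desired peers against kernel state.

    Returns (to_apply, to_remove). Peers whose settings already match are
    left out of to_apply, so a no-op reconcile issues no wg set calls.
    """
    to_apply = {key: cfg for key, cfg in desired.items() if actual.get(key) != cfg}
    to_remove = sorted(key for key in actual if key not in desired)
    return to_apply, to_remove


if __name__ == "__main__":
    print(edge_name("gpu-b", "gpu-a"))
    desired = {"K1": {"allowed_ips": ["10.42.0.11/32"]},
               "K2": {"allowed_ips": ["10.42.0.12/32"]}}
    actual = {"K1": {"allowed_ips": ["10.42.0.11/32"]},
              "K3": {"allowed_ips": ["10.42.0.13/32"]}}
    print(plan_changes(desired, actual))
```

Here K1 is untouched, K2 is applied, and K3 is removed; an empty plan means the agent can skip netlink entirely.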

Centralize the peer-rendering logic in one pure module shared by both the provisioning-time path (writing /etc/wireguard/wg0.conf when a node is first brought up) and the runtime path (the agent's wg set peer), so the two cannot drift. A stale, agent-failed, or never-converged edge is the signal that a pair has silently fallen back to a hub relay instead of a direct peer; alert on apply error rate and wg set latency.
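A shared rendering module might look like the following sketch: one immutable peer description and two pure renderers, one producing a [Peer] section for wg0.conf and one producing the argument list for wg set. The Peer class and function names are hypothetical; the wg set keywords (peer, endpoint, persistent-keepalive, allowed-ips) are the ones documented in the wg(8) manual.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Peer:
    public_key: str
    allowed_ips: Tuple[str, ...]
    endpoint: Optional[str] = None      # host:port, omitted for dial-in peers
    keepalive: Optional[int] = None     # seconds, omitted when not behind NAT


def conf_section(peer: Peer) -> str:
    """Render one [Peer] section for the provisioning-time config file."""
    lines = ["[Peer]", f"PublicKey = {peer.public_key}"]
    if peer.endpoint:
        lines.append(f"Endpoint = {peer.endpoint}")
    lines.append(f"AllowedIPs = {', '.join(peer.allowed_ips)}")
    if peer.keepalive:
        lines.append(f"PersistentKeepalive = {peer.keepalive}")
    return "\n".join(lines) + "\n"


def wg_set_argv(interface: str, peer: Peer) -> list:
    """Render the equivalent runtime command for the agent to execute."""
    argv = ["wg", "set", interface, "peer", peer.public_key]
    if peer.endpoint:
        argv += ["endpoint", peer.endpoint]
    if peer.keepalive:
        argv += ["persistent-keepalive", str(peer.keepalive)]
    argv += ["allowed-ips", ",".join(peer.allowed_ips)]
    return argv


if __name__ == "__main__":
    hub = Peer("<hub-public-key>", ("10.42.0.0/16",), "hub.example.net:51820", 10)
    print(conf_section(hub))
    print(wg_set_argv("wg0", hub))
```

Because both renderers read the same Peer value, a field added for one path cannot be forgotten in the other without a test noticing.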

Failure modes

No keepalive behind NAT. Idle UDP mapping reaped → stale hub endpoint → one-way hub→spoke break (control-plane and coordinator traffic fail while the spoke thinks it is fine). Set PersistentKeepalive.
MTU/MSS left at defaults. Small packets pass, large transfers black-hole. Lower tunnel MTU and clamp MSS (§3).
Synchronous collectives on the overlay. Per-step all-reduce over WAN stalls the job; keep step-synchronous training on a real fabric and only low-frequency DiLoCo sync on the overlay.
Single-point rendezvous. A lone TCPStore pod cannot deliver torchelastic elasticity; use an etcd-backed rendezvous for a churny pool (§4).
Peer/key drift. As nodes join and leave, hand-edited peer lists rot; reconcile from desired state.
Hub saturation. In hub-and-spoke, all inter-spoke traffic relays through the hub, and its NIC is the ceiling; consider direct mesh edges or a fatter hub for heavy cross-spoke flows.
Blocked UDP. Provider/underlay firewall drops UDP/51820, so there is no handshake and no overlay.
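For the hub-saturation case, a back-of-envelope bound helps decide when direct edges are worth the effort. The sketch assumes a full-duplex hub NIC, every active spoke sending to other spokes at the same rate, and all of that traffic relayed through the hub; encryption CPU cost is ignored, so real ceilings will be lower.

```python
def per_spoke_ceiling(hub_nic_bits_per_s: float, active_spokes: int) -> float:
    """Upper bound on each spoke's relayed send rate, in bits per second.

    Every relayed byte enters and leaves the hub once, so with a
    full-duplex NIC the sum of all spoke send rates cannot exceed the
    NIC rate in either direction.
    """
    if active_spokes < 1:
        raise ValueError("need at least one active spoke")
    return hub_nic_bits_per_s / active_spokes


if __name__ == "__main__":
    # Hypothetical: a 10 Gbit/s hub shared by 8 busy spokes.
    print(f"{per_spoke_ceiling(10e9, 8) / 1e9:.2f} Gbit/s per spoke")
```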

Open questions & validation

Measured overlay throughput and added latency vs. the raw inter-provider link, per peer pair.
Keepalive interval vs. each provider's actual NAT idle timeout (tune empirically; 10s is conservative).
Hub bandwidth headroom, or whether to hole-punch toward a partial/full mesh for the heaviest edges.
Automated MTU discovery across heterogeneous underlays (IPv4/IPv6/PPPoE).
Whether app-layer TLS over an already-encrypted overlay is worth the double-encryption (usually yes for defense in depth; measure the cost).

References

WireGuard — Donenfeld, WireGuard: Next Generation Kernel Network Tunnel (NDSS 2017), protocol and roaming: https://www.wireguard.com/papers/wireguard.pdf · Project & docs: https://www.wireguard.com/
PyTorch Elastic (torchelastic) — rendezvous backends (c10d, etcd-v2), torchrun: https://pytorch.org/docs/stable/elastic/run.html
NCCL environment variables (NCCL_SOCKET_IFNAME, NCCL_IB_DISABLE): https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
DiLoCo — Distributed Low-Communication Training of Language Models, arXiv:2311.08105: https://arxiv.org/abs/2311.08105
Streaming DiLoCo — arXiv:2501.18512 (further bandwidth reduction): https://arxiv.org/abs/2501.18512