DiLoCo (distributed low-communication)¶
Scope: a two-level optimisation method that lets workers train mostly independently and synchronise rarely. It is the low-bandwidth alternative to per-step data parallelism for multi-DC and heterogeneous-link training. Contrast with FSDP; part of the distributed-training survey in distributed training; the runnable geo-distributed recipe is Recipe: DiLoCo geo-distributed.
Reference templates on real
torch.distributedand Hivemind/DHT APIs: pin versions and validate before production use. The two numpy blocks (outer-step math and communication volume) are executed and asserted; the torch and shell blocks are unexecuted reference templates.
What it is¶
DiLoCo (DeepMind, arXiv 2311.08105) is a local-SGD training method with two nested optimisers:
- An inner optimiser (AdamW) runs H local steps on each worker's own data, with no cross-worker communication.
- Every H steps, each worker computes its pseudo-gradient (the delta between the parameters at the last sync and its current parameters), and these are all-reduced across workers.
- An outer optimiser (SGD with Nesterov momentum) applies the averaged pseudo-gradient to the global parameters, which are broadcast back as the new starting point. Repeat.
Because workers only exchange one parameter-sized tensor every H steps (not every step), DiLoCo communicates roughly 500x less than standard data-parallel/FSDP, trading communication frequency for a small quality gap. It targets settings where the inter-worker link, not compute, is the bottleneck.
Why use it¶
- Multi-datacentre / WAN training. Sync cost is amortised over H steps, so workers can sit in different DCs or regions on ordinary internet links rather than a shared IB fabric (cloud and cost).
- Heterogeneous, decentralised GPU pools. Loosely-coupled, fault-tolerant by design, which suits commodity / spot / volunteer hardware where a synchronous all-reduce every step is impossible.
- Bandwidth-bound clusters. When the network, not the GPUs, gates throughput, cutting comms ~500x recovers utilisation (OpenDiLoCo reported 90-95% compute utilisation training across continents).
When to use it (and when not)¶
- Choose DiLoCo over FSDP/HSDP (FSDP) when workers are separated by slow or heterogeneous links (cross-DC, WAN, spot pools); FSDP's per-step all-gather/reduce-scatter assumes a fast intra-cluster fabric and will stall over such links.
- Choose FSDP when all workers share a high-bandwidth IB/RoCE fabric in one cluster, where per-step sync gives the cleanest convergence and DiLoCo's overhead buys nothing.
- DiLoCo and FSDP compose: run FSDP inside each DiLoCo worker (a worker can itself be a multi-GPU shard) and DiLoCo between islands. The decision is per-link: fast fabric picks FSDP, slow link picks DiLoCo on top.
Architecture¶
flowchart TB
subgraph W0["Worker 0 (DC A)"]
I0["Inner AdamW: H local steps"]
P0["pseudo-grad = theta_global - theta_0"]
I0 --> P0
end
subgraph W1["Worker 1 (DC B)"]
I1["Inner AdamW: H local steps"]
P1["pseudo-grad = theta_global - theta_1"]
I1 --> P1
end
P0 -->|"all-reduce every H steps"| AR["Average pseudo-gradients"]
P1 -->|"all-reduce every H steps"| AR
AR --> OUT["Outer Nesterov SGD -> new theta_global"]
OUT -.->|"broadcast, repeat"| I0
OUT -.->|"broadcast, repeat"| I1
How to use it¶
The method is a thin loop wrapped around an ordinary training step: keep a snapshot of the global parameters, run H inner steps, all-reduce the delta, step the outer optimiser. (OpenDiLoCo provides a reference implementation; the skeleton below is the algorithm, not its full code.)
# diloco_skeleton.py: algorithm sketch (torch reference template); see OpenDiLoCo for a
# complete impl. Validate torch.distributed signatures against your installed build.
import torch, torch.distributed as dist
dist.init_process_group("gloo") # WAN-friendly backend
inner = torch.optim.AdamW(model.parameters(), lr=4e-4)
outer = torch.optim.SGD(model.parameters(), lr=0.7, momentum=0.9, nesterov=True)
H = 500 # local steps between syncs
while not done:
global_params = [p.detach().clone() for p in model.parameters()] # snapshot
for _ in range(H): # inner: no communication
loss = model(next(loader)).loss
inner.zero_grad(); loss.backward(); inner.step()
for p, g0 in zip(model.parameters(), global_params):
p.grad = (g0 - p.data) # pseudo-gradient (delta)
dist.all_reduce(p.grad, op=dist.ReduceOp.AVG) # the ONLY cross-worker comm
p.data.copy_(g0) # reset to snapshot...
outer.step() # ...then outer applies avg delta
The defining part is the outer step, and its math is exact enough to validate with numpy alone. The block below reproduces the pseudo-gradient, the averaging all-reduce, and the outer SGD update, then asserts the identities that make DiLoCo correct: with outer_lr=1 and no momentum it reduces exactly to FedAvg parameter averaging (equivalence to a slow reference); it is a convex blend at any outer_lr; outer_lr=0 is a no-op; Nesterov momentum provably accelerates repeated syncs; and the published sign moves toward worker consensus while the flipped sign (the classic reset/sign bug) provably diverges.
# diloco_outer_math.py: numpy validation of the DiLoCo outer step (companion to the
# torch.distributed reference). Asserts the pseudo-gradient/average/outer-SGD identity,
# the FedAvg equivalence, the outer_lr=0 boundary, Nesterov acceleration, and that the
# published sign converges toward worker consensus while the flipped sign diverges.
import numpy as np
def pseudo_grads(theta_global, thetas_local):
# DiLoCo pseudo-gradient per worker: snapshot minus post-inner params (paper's sign).
return [theta_global - t for t in thetas_local]
def outer_step(theta, grad, lr, momentum=0.0, nesterov=False, buf=None):
# PyTorch-style SGD (dampening=0); buf carries outer momentum across sync rounds.
buf = grad.copy() if buf is None else momentum * buf + grad
d = grad + momentum * buf if nesterov else buf
return theta - lr * d, buf
def diloco_round(theta_global, thetas_local, lr, momentum=0.0, nesterov=False, buf=None):
avg = np.mean(pseudo_grads(theta_global, thetas_local), axis=0) # all-reduce AVG
return outer_step(theta_global, avg, lr, momentum, nesterov, buf)
rng = np.random.default_rng(0)
D, K = 256, 4 # param dim, worker count
theta_g = rng.standard_normal(D) # global params at last sync
locals_ = [theta_g + rng.standard_normal(D) for _ in range(K)] # each drifted by H inner steps
consensus = np.mean(locals_, axis=0)
# 1. FedAvg equivalence (slow reference): outer_lr=1, no momentum == plain param averaging.
new, _ = diloco_round(theta_g, locals_, lr=1.0)
assert np.allclose(new, consensus), "outer_lr=1 must reduce to the FedAvg mean of worker params"
# 2. Convex-combination identity for vanilla outer SGD at a general lr.
lr = 0.7
new, _ = diloco_round(theta_g, locals_, lr=lr)
assert np.allclose(new, (1 - lr) * theta_g + lr * consensus), "outer step is a convex blend"
# 3. Boundary: outer_lr=0 -> base unchanged (a stalled outer opt discards worker progress).
new0, _ = diloco_round(theta_g, locals_, lr=0.0)
assert np.array_equal(new0, theta_g), "lr=0 must not move the global params"
# 4. Nesterov acceleration (adversarial): same pseudo-grad each round -> momentum path must
# travel strictly farther than plain SGD after >=2 rounds (why the outer opt has momentum).
g = np.mean(pseudo_grads(theta_g, locals_), axis=0) # fixed averaged pseudo-gradient
p_plain = theta_g.copy(); p_nest = theta_g.copy(); buf = None
for _ in range(3):
p_plain, _ = outer_step(p_plain, g, lr=0.3, momentum=0.0)
p_nest, buf = outer_step(p_nest, g, lr=0.3, momentum=0.9, nesterov=True, buf=buf)
assert np.linalg.norm(p_nest - theta_g) > np.linalg.norm(p_plain - theta_g), "momentum must accelerate"
# 5. Sign check (adversarial): the published sign (global - local) moves toward consensus;
# the flipped sign (local - global) provably moves away -> catches the reset/sign bug.
good, _ = diloco_round(theta_g, locals_, lr=0.5)
bad_avg = np.mean([t - theta_g for t in locals_], axis=0) # WRONG sign
bad, _ = outer_step(theta_g, bad_avg, lr=0.5)
assert np.linalg.norm(good - consensus) < np.linalg.norm(theta_g - consensus), "correct sign converges"
assert np.linalg.norm(bad - consensus) > np.linalg.norm(theta_g - consensus), "flipped sign diverges"
print("outer-step OK: fedavg_equiv=True convex=True lr0_noop=True nesterov_faster=True sign_converges=True")
Running it prints outer-step OK: fedavg_equiv=True convex=True lr0_noop=True nesterov_faster=True sign_converges=True.
How to integrate with it¶
DiLoCo changes only the between-worker sync, so the inner loop is an ordinary training step and integration is about wiring the process group and composing with intra-worker parallelism.
- Backend. Use a WAN/TCP backend (
gloo) or a DHT overlay for the cross-worker group, not NCCL-over-IB; there is no fast fabric between workers to justify NCCL. - FSDP inside, DiLoCo across. Each "worker" can be a single GPU or a multi-GPU FSDP shard group running under
torchrunon its own fast fabric, while a separategloogroup carries the cross-DC pseudo-gradient. Develop the inner step exactly as a normal single-replica loop, then wrap the outer sync. - Sweep axes. H, the outer LR, and the worker topology are the primary knobs. Expose H and the outer LR through the environment so a campaign can vary them without code edits; the outer momentum (
0.9, Nesterov) is what stabilises infrequent updates and is the published default.
# reference template (torch): expose H and the outer LR as the primary sweep axes.
import os, torch
H = int(os.environ.get("DILOCO_H", 500)) # 50 / 125 / 500 reported in OpenDiLoCo
outer = torch.optim.SGD(model.parameters(),
lr=float(os.environ.get("OUTER_LR", 0.7)),
momentum=0.9, nesterov=True)
Launch one worker per DC as a torch.distributed rank reached over WAN: point every rank at a routable MASTER_ADDR, set WORLD_SIZE to the number of workers, assign RANK per site, and pin the WAN-facing NIC for gloo.
# reference template: run at DC-A (rank 0, rendezvous host), DC-B (rank 1), DC-C (rank 2)...
export MASTER_ADDR=dc-a.example.net # routable from every worker over WAN
export MASTER_PORT=29500
export WORLD_SIZE=3 # number of geo-distributed workers
export RANK=0 # 0 at DC-A, 1 at DC-B, 2 at DC-C
export DILOCO_H=500 # local steps between syncs; tune to link bandwidth
export GLOO_SOCKET_IFNAME=eth0 # pin the WAN-facing NIC for the gloo backend
python diloco_train.py
How to run it in production¶
The output is an ordinary checkpoint identical in shape to any other run, but the loop has extra state and a correctness gate that a per-step trainer does not.
- Checkpoint both optimiser states at outer-step boundaries. Save the global parameters, the inner AdamW state, and the outer SGD state (including its momentum buffer) so a resume re-enters the loop cleanly. Saving on the wrong boundary, or dropping the outer momentum buffer, corrupts the resume (checkpoint-recovery runbook).
# reference template (torch): one writer; global params are identical across ranks.
import torch, torch.distributed as dist
def save_outer_ckpt(model, inner, outer, outer_step, path):
if dist.get_rank() != 0:
return
torch.save({"model": model.state_dict(),
"inner": inner.state_dict(),
"outer": outer.state_dict(), # outer momentum buffer matters on resume
"outer_step": outer_step}, path)
- Gate on convergence parity. DiLoCo's correctness check is that the loss curve tracks an FSDP/DDP single-DC baseline at the same token budget. Emit
diloco_lossanddiloco_loss_baselineand alert on drift; a regression follows the same triage as the MFU regression runbook. Wire it into telemetry / monitoring / alerting.
# fire when DiLoCo loss runs >5% above the single-DC baseline for 1h (drift; suspect H too large)
(
avg_over_time(diloco_loss[1h])
/
avg_over_time(diloco_loss_baseline[1h])
) > 1.05
- Fault tolerance. A fixed
WORLD_SIZEgloogroup blocks the all-reduce on any slow or departing worker. For a churny, join/leave decentralised pool use the DHT-backed, fault-tolerantprimestack rather than a fixed world size.
How to maintain it¶
Maintenance is mostly tuning H to the link as conditions change and re-validating after every change:
- Raise H when the network gates throughput (fewer, larger syncs); lower H when the loss drifts from the baseline. Larger H widens the divergence window, so re-run the baseline comparison after any H change rather than assuming the old setting still holds.
- Adopt Streaming DiLoCo (DeepMind, arXiv 2501.18512) when even the per-H burst saturates the WAN link: it syncs parameter subsets in sequence and overlaps sync with compute, reporting roughly 400x less peak bandwidth.
- Track the framework. OpenDiLoCo is no longer maintained; migrate to its successor PRIME (
PrimeIntellect-ai/prime) for the improved fault tolerance and bandwidth use, and re-verify flag names against the current README (they vary by version).
How to scale it¶
DiLoCo scales across instances and datacentres, not just across a rack. The open implementations carry the inter-worker exchange over a peer-to-peer DHT so workers can join/leave:
- OpenDiLoCo (Prime Intellect, github PrimeIntellect-ai/OpenDiloco, arXiv 2407.07852) implements DiLoCo on a Hivemind DHT, demonstrated training across two continents and three countries at ~90-95% utilisation. Note: OpenDiLoCo is no longer maintained; its successor is PRIME (github PrimeIntellect-ai/prime), with improved fault tolerance and bandwidth use.
- INTELLECT-1 (Prime Intellect) scaled this to a 10B-parameter model trained over globally distributed workers, the reference point for low-communication training at scale.
The reason this scales over a WAN at all is the communication volume. DiLoCo moves one parameter-sized all-reduce every H steps against per-step data parallel's one every step, so the saving ratio is exactly H and is independent of model size. The block below asserts that identity, the ~500x headline at H=500, and the H=1 boundary where DiLoCo degenerates back to per-step DP with no saving.
# diloco_comms.py: numpy check of the headline "~500x less communication" and how it
# scales. DiLoCo moves one parameter-sized all-reduce every H steps; per-step data
# parallel moves one every step. The saving ratio is exactly H, independent of model size.
import numpy as np
def diloco_bytes_per_step(params, H, dtype_bytes=4):
return params * dtype_bytes / H # one pseudo-grad all-reduce amortized over H
def dp_bytes_per_step(params, dtype_bytes=4):
return params * dtype_bytes # per-step DP all-reduces the gradient every step
for params in (125_000_000, 10_000_000_000): # 125M and INTELLECT-1's 10B scale
for H in (50, 125, 500):
ratio = dp_bytes_per_step(params) / diloco_bytes_per_step(params, H)
assert np.isclose(ratio, H), "comms saving must equal H, independent of param count"
# the page's headline: H=500 -> ~500x less communication than per-step data parallel.
assert np.isclose(dp_bytes_per_step(1_000_000_000) / diloco_bytes_per_step(1_000_000_000, 500), 500.0)
# boundary: H=1 (sync every step) recovers exactly the per-step DP volume (no saving).
assert np.isclose(diloco_bytes_per_step(1_000_000_000, 1), dp_bytes_per_step(1_000_000_000))
print("comms OK: ratio==H for all sizes; H=500 -> 500x; H=1 -> parity with per-step DP")
Running it prints comms OK: ratio==H for all sizes; H=500 -> 500x; H=1 -> parity with per-step DP. Bootstrap the DHT overlay and point each worker at it:
# OpenDiLoCo (reference; pin a commit and verify on the repo as of mid-2026)
hivemind-dht --host_maddrs /ip4/0.0.0.0/tcp/0 # bootstrap the DHT peer
# then launch each worker pointing at the DHT, e.g. (8 workers, H=500):
# torchrun ... train_fsdp.py --per-device ... --hv.local_steps 500 \
# --hv.initial_peers <DHT_MADDR> # flags vary by version; check the README
Inference¶
Not applicable. DiLoCo is purely a training-time optimisation method; it produces a single set of global parameters identical in shape to any other checkpoint, which is then served by the normal stacks (inference serving, serving open-weight models). There is no DiLoCo-specific inference path.
Fine-tuning¶
DiLoCo applies directly to distributed fine-tuning over WAN / heterogeneous pools: the inner AdamW loop runs any fine-tuning objective (continued pretraining, SFT) while the outer sync keeps geographically separated workers in agreement. For parameter-efficient and RL post-training methods (which usually run inside one cluster and lean on FSDP) see fine-tuning and post-training, SFT and LoRA, and the RL libraries in RL libraries.
Optimised hardware¶
- Built for non-IB networks. DiLoCo's whole premise is tolerating slow/heterogeneous links; it does not need NVLink or InfiniBand between workers, only enough bandwidth to move one parameter-sized all-reduce every H steps. That is the explicit contrast with FSDP, whose per-step all-gather/reduce-scatter demands a fast intra-cluster fabric.
- Inside a worker, ordinary fast-fabric rules still apply: if a worker is an FSDP shard group, it wants NVLink intra-node and IB intra-island with GDR (
NCCL_NET_GDR_LEVEL=SYS,[GDRDMA]inNCCL_DEBUG=INFO) (performance tuning). - The inter-worker sync runs over TCP/WAN; backend is typically
glooor a DHT overlay, not NCCL-over-IB.
Cookbook (common use cases)¶
1. The DiLoCo outer loop (drop H inner steps between syncs)
# reference template (torch): the outer loop validated numerically in "How to use it".
for round in range(num_rounds):
snap = [p.detach().clone() for p in model.parameters()]
for _ in range(H): train_one_step(model, inner, loader) # no comms
for p, g0 in zip(model.parameters(), snap):
p.grad = g0 - p.data
dist.all_reduce(p.grad, op=dist.ReduceOp.AVG)
p.data.copy_(g0)
outer.step() # one sync / round
2. Bootstrap an OpenDiLoCo / Hivemind run (multi-DC)
hivemind-dht --host_maddrs /ip4/0.0.0.0/tcp/0 # note the printed multiaddr
# start N workers in different DCs, each with --hv.initial_peers <that multiaddr>
# and --hv.local_steps 500 ; exact flags per the repo README (verify, mid-2026)
3. Multi-DC layout: FSDP inside, DiLoCo across
DC-A: 8xGPU FSDP shard group ─┐
DC-B: 8xGPU FSDP shard group ─┼─ DiLoCo all-reduce of pseudo-grad every H steps
DC-C: 8xGPU FSDP shard group ─┘ (fast fabric intra-DC, WAN inter-DC)
Failure modes¶
- H too large moves workers apart until the outer update fights divergence and quality drops. Tune H to the link, not just to minimise comms.
- Forgetting the snapshot/reset computes or applies the pseudo-gradient against the wrong base; you must reset params to the pre-inner snapshot before the outer step. The numpy sign check above is exactly this bug: the flipped sign provably diverges.
- Using NCCL-over-IB assumptions stalls, because there is no fast fabric between workers; a step-synchronous design (FSDP) here is exactly the case DiLoCo exists to avoid.
- Stragglers / churn in a decentralised pool block a fixed-
WORLD_SIZEall-reduce; use the fault-tolerant successor (PRIME) rather than the unmaintained OpenDiLoCo for production. - Treating DiLoCo as a serving feature mistakes a training-only method for an inference path; the output is just a normal checkpoint.
References¶
- DiLoCo (Douillard et al., DeepMind, 2023): https://arxiv.org/abs/2311.08105
- Streaming DiLoCo with overlapping communication (DeepMind, 2025): https://arxiv.org/abs/2501.18512
- OpenDiLoCo (Jaghouar et al., Prime Intellect, 2024): https://arxiv.org/abs/2407.07852
- OpenDiLoCo repo (note: superseded by PRIME): https://github.com/PrimeIntellect-ai/OpenDiloco · PRIME: https://github.com/PrimeIntellect-ai/prime
- INTELLECT-1 (first globally-distributed 10B training): https://www.primeintellect.ai/blog/intellect-1
- Hivemind (decentralised training DHT): https://github.com/learning-at-home/hivemind
- PyTorch torch.distributed (backends, all_reduce): https://docs.pytorch.org/docs/stable/distributed.html
Related: Distributed Training · Recipe: DiLoCo geo-distributed · Delta weight sync · Training Recipes · Overlay & Mesh Networking · Cloud / Neoclouds / Cost · FSDP · Pipeline Parallel · Telemetry · Glossary