Delta weight sync for RL

Scope: the weight-synchronization bottleneck in disaggregated RL post-training and the sparse-delta technique that cuts it by roughly two orders of magnitude, losslessly and bit-identically. When the trainer and the rollout engine sit on separate GPUs, every optimizer step ships fresh weights to the rollout ranks; this page covers what delta sync is, why and when to use it, how to encode and apply the delta (with a runnable round-trip and the concrete slime flags), and why low-precision quantization breaks it. It is the efficiency companion to async and disaggregated RL, which frames the weight-sync cost, and to rollout redundancy; the same sparsity idea also applies to the geo-distributed DiLoCo sync.

Techniques below track fast-moving 2026 papers and framework PRs; for the in-engine path the sender and receiver engine must match, so pin both and verify bit-identical reconstruction before production. The Python example is executed and asserted (numpy); the framework flags are reference templates verified against the slime doc.

flowchart LR
  TRAIN["Trainer (after optimizer step)"] --> DIFF["Diff vs pinned snapshot: ~1-3% of bytes changed"]
  DIFF --> ENC["Encode sparse delta: positions + values (or XOR), zstd"]
  ENC --> T{"Apply path"}
  T -->|"disk (slime): patch local checkpoint"| RELOAD["Reload via update_weights_from_disk (no engine change)"]
  T -->|"NCCL (SGLang PR 26519): in-engine"| MASK["NaN-masked overwrite into live weights"]
  RELOAD --> ROLL["Rollout engine: bit-identical weights"]
  MASK --> ROLL

What it is

Disaggregated RL runs the trainer and the rollout engine on separate GPU pools (async RL), and every optimizer step the updated policy must reach every rollout rank before the next generation. A naive full-weight broadcast costs bandwidth proportional to model size, and on commodity networks a full broadcast of even an 8B model can take 100+ seconds.
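
As a rough sanity check on that figure, the arithmetic is just bytes over bandwidth. The sketch below assumes an effective link speed of about 1 Gbit/s, which is an illustrative assumption and not a number from the sources; only the 8B model size and BF16 width come from the text.

```python
# Back-of-envelope cost of a full-weight broadcast: seconds = bytes * 8 / link_bits_per_second.
PARAMS = 8e9          # 8B-parameter model (from the text)
BYTES_PER_PARAM = 2   # BF16 weights
LINK_BPS = 1e9        # ASSUMPTION: ~1 Gbit/s effective commodity link


def transfer_seconds(num_bytes: float, link_bps: float = LINK_BPS) -> float:
    """Time to move num_bytes over a link of link_bps bits per second."""
    return num_bytes * 8 / link_bps


full_bytes = PARAMS * BYTES_PER_PARAM
full_s = transfer_seconds(full_bytes)
print(f"full broadcast: {full_bytes / 1e9:.0f} GB -> {full_s:.0f} s")
```

Under that assumed link speed a single 16 GB broadcast already takes about two minutes, and the cost scales linearly with parameter count.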

The 2026 observation, converged on independently across academia, industry, and open source, is that after one RL step only a small fraction of the weights actually change: in BF16, about 99% of elements are bit-identical to the previous step, because Adam updates at typical RL post-training learning rates fall below the BF16 rounding threshold and vanish on the cast.[1] PULSE names this "compute-visible sparsity": transmit only the updates that would change the next forward pass. Density is typically about 1-3% (Fireworks measured an average delta of 20.3 GiB, 1.98% of a 1024 GiB model), so delta weight sync ships only the changed positions and their values, cutting sync traffic by roughly two orders of magnitude.[1][2][3]
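
The rounding effect can be reproduced in a few lines. This is a minimal sketch, not a measurement: numpy has no BF16 dtype, so the cast is emulated by rounding float32 to its top 16 bits (round-to-nearest-even), and the weight scale (0.02), the learning rate (1e-6), and the "each Adam step is roughly plus or minus the learning rate" model are all illustrative assumptions.

```python
import numpy as np


def to_bf16_bits(x: np.ndarray) -> np.ndarray:
    """Round float32 to BF16 (round-to-nearest-even) and return the 16-bit patterns."""
    u = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    round_bias = np.uint32(0x7FFF) + ((u >> np.uint32(16)) & np.uint32(1))
    return ((u + round_bias) >> np.uint32(16)).astype(np.uint16)


rng = np.random.default_rng(0)
N = 1_000_000
master = (0.02 * rng.standard_normal(N)).astype(np.float32)       # ASSUMPTION: weight scale ~0.02
lr = 1e-6                                                         # ASSUMPTION: RL post-training LR
step = (lr * rng.choice([-1.0, 1.0], size=N)).astype(np.float32)  # ASSUMPTION: |Adam step| ~ lr

before = to_bf16_bits(master)          # what the rollout engine holds
after = to_bf16_bits(master + step)    # what the trainer would ship next
density = float(np.mean(before != after))
print(f"BF16 elements changed by one step: {density:.2%}")
```

Every FP32 master weight moves, but only the few that cross a BF16 rounding boundary change on the wire. The exact fraction depends on the assumed scale and learning rate.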

The delta is an overwrite, not an addition: the receiver writes the trainer's exact bytes at the changed positions. That is lossless and bit-identical, and it avoids the cross-step floating-point accumulation drift an additive scheme would suffer, so no periodic full re-sync is required.

Why use it

  • Roughly 100x less sync traffic, losslessly. PULSESync reports over 100x reduction with bit-identical reconstruction; SparrowRL reports about 79x payload reduction and 2.4-9.5x throughput on commodity networks; SparseRL-Sync reports about 100x, lossless.[1][2]
  • Proven in production. Fireworks reports that 98% of BF16 weights are bit-equivalent between adjacent checkpoints and that delta sync cut cross-region transfer by about 94% over a 50-step window. It used this to run Cursor's Composer 2 training across three to four global clusters, each reconstructing weights from a shared delta chain.[3]
  • Decoupled and cross-region RL becomes practical. A few-GB delta moves over a shared filesystem where a full multi-hundred-GB broadcast never could, so RL can span datacenters without RDMA.
  • Orthogonal to RDMA, not a replacement. Over NCCL, shipping only the 1-3% of bytes that changed stacks on top of RDMA bandwidth; delta cuts bytes further when RDMA is present and keeps RL decoupled when it is not.
  • No drift. Byte-for-byte overwrite means the rollout weights are exactly the trainer's, with no accumulation error and no scheduled full re-sync.

When to use it (and when not)

  • Use delta sync for disaggregated RL where the sync is a real cost: large models, long sync intervals, or transport that is not full-fat RDMA (commodity networks, cross-datacenter shared filesystems).
  • Use it for BF16 and FP8 block-wise rollouts, where the canonical weights are approximately the runtime layout (see "Quantization: where delta breaks").
  • Skip it when colocated. A colocated engine receives weights through a CUDA-IPC handle of about 64 bytes, so there are no wire bytes to save.
  • Skip it for NVFP4 and other repacking quantization, where the delta densifies and the runtime layout moves with parallelism (see "Quantization: where delta breaks"); fall back to full canonical sync.
  • Pin the versions. The disk path reloads through the ordinary endpoint and needs no engine change, but the NCCL in-engine path's apply logic lives in the inference engine (SGLang PR #26519), so a sender/receiver version skew there silently breaks convergence rather than erroring.

Architecture

slime's delta weight sync is disk-transport only and requires no delta-aware changes to the inference engine, which is what keeps it portable across model-parallelism layouts and backward-compatible.[5] It runs in four phases:

  • Seed: capture a CPU snapshot from the base checkpoint (not the current GPU weights), so the diff stays correct even when Megatron-to-HF roundtrips are not byte-identical.
  • Publish: diff the gathered HF tensors against the snapshot and write zstd-compressed deltas to a shared filesystem as versioned weight_v{N:06d}/ directories.
  • Apply: each rollout host patches its host-local checkpoint in place, per-tensor parallelized.
  • Reload: the engine reloads through the native update_weights_from_disk endpoint.
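
The four phases can be sketched end to end on a local filesystem. This is an illustration under stated assumptions, not slime's on-disk format: the `delta.bin` file name, the payload layout (int64 positions followed by float32 values), the single flat tensor, and the use of zlib as a stand-in for zstd are all hypothetical. Only the `weight_v{N:06d}/` directory naming comes from the slime doc.

```python
import os
import tempfile
import zlib

import numpy as np


def publish(root, version, cur, snap):
    """Trainer: diff against the snapshot and write one versioned delta directory."""
    pos = np.flatnonzero(cur.view(np.uint32) != snap.view(np.uint32)).astype(np.int64)
    vdir = os.path.join(root, f"weight_v{version:06d}")
    os.makedirs(vdir)
    payload = pos.tobytes() + cur[pos].tobytes()
    with open(os.path.join(vdir, "delta.bin"), "wb") as f:   # hypothetical file name
        f.write(zlib.compress(payload, 1))                   # zlib stands in for zstd level 1
    snap[pos] = cur[pos]                                     # advance the trainer snapshot


def apply(root, version, ckpt_path):
    """Rollout host: patch the host-local checkpoint in place."""
    with open(os.path.join(root, f"weight_v{version:06d}", "delta.bin"), "rb") as f:
        payload = zlib.decompress(f.read())
    n = len(payload) // (8 + 4)                              # int64 position + float32 value
    pos = np.frombuffer(payload, dtype=np.int64, count=n)
    vals = np.frombuffer(payload, dtype=np.float32, count=n, offset=8 * n)
    ckpt = np.load(ckpt_path, mmap_mode="r+")                # memory-mapped, writable
    ckpt[pos] = vals
    ckpt.flush()
    del ckpt


with tempfile.TemporaryDirectory() as tmp:
    shared = os.path.join(tmp, "shared")
    os.makedirs(shared)
    ckpt_path = os.path.join(tmp, "rollout_ckpt.npy")
    rng = np.random.default_rng(0)
    base = rng.standard_normal(10_000).astype(np.float32)
    np.save(ckpt_path, base)          # rollout host starts from the base checkpoint
    snap = base.copy()                # Seed: snapshot taken from the base checkpoint
    cur = base.copy()
    for version in (1, 2):            # two optimizer steps
        idx = rng.choice(cur.size, 200, replace=False)
        cur[idx] += np.float32(0.5)
        publish(shared, version, cur, snap)   # Publish
        apply(shared, version, ckpt_path)     # Apply
    reloaded = np.load(ckpt_path)             # Reload (stand-in for update_weights_from_disk)
    bit_identical = bool(np.array_equal(reloaded.view(np.uint32), cur.view(np.uint32)))

print(f"bit_identical={bit_identical}")
```

Because the engine only ever sees an ordinary checkpoint on local disk, nothing in the reload step needs to know a delta was involved.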

A separate, lower-latency approach applies the sparse delta directly into live GPU weights with an in-engine NaN-masked overwrite. That receiver is SGLang PR #26519, which adds support for Miles delta weight sync (companion Miles PR #1235) and handles both disk delta files and NCCL delta payloads in the model runner; it trades the disk path's zero-engine-change portability for speed.
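
The sources cited here do not spell out the mechanics of the NaN-masked overwrite, so the following is one plausible reading rather than the PR's actual code: scatter the changed values into a NaN-filled buffer and keep the live weight wherever the buffer is still NaN. It uses numpy on the CPU for illustration; the function name is hypothetical.

```python
import numpy as np


def nan_masked_apply(live, pos, vals):
    """Overwrite live[pos] with vals, using NaN as the 'no change here' marker."""
    patch = np.full(live.shape, np.nan, dtype=live.dtype)
    patch[pos] = vals                              # only changed positions become non-NaN
    return np.where(np.isnan(patch), live, patch)  # keep live weights under the NaN mask


live = np.array([0.1, 0.2, 0.3, 0.4], dtype=np.float32)
pos = np.array([1, 3])
vals = np.array([9.0, -1.0], dtype=np.float32)
out = nan_masked_apply(live, pos, vals)
print(out)  # positions 1 and 3 replaced, the rest untouched
```

One consequence of this reading is that a changed value that is itself NaN cannot be expressed, which is acceptable for healthy weights but worth knowing when debugging.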

How to use it

The core is a three-step pipeline: diff through an integer view (exact, arithmetic-free), gap-encode the changed positions, and scatter the verbatim values onto the receiver's snapshot. This runnable round-trip is executed and asserted (bit-identical reconstruction, ~33x reduction at 2% density on a 1M-element tensor, plus an adversarial check that a corrupted delta is caught):

# delta_roundtrip.py — validated diff + gap-encode + decode; asserts bit-identical
# reconstruction AND adversarially checks that a corrupted delta is caught.
import numpy as np

def encode_delta(cur, snap):                       # sender: exact integer-view diff, then gap-encode
    pos = np.flatnonzero(cur.view(np.uint32) != snap.view(np.uint32))
    if pos.size == 0:                               # nothing changed: empty delta (gaps[0] would raise)
        return np.empty(0, np.uint16), cur[pos]
    gaps = np.empty(pos.size, np.int64)
    gaps[0], gaps[1:] = pos[0], np.diff(pos) - 1   # store gap between changed indices, not absolute
    assert gaps.max() < 65536                       # fits uint16 at ~1-3% density; else uint32 fallback
    return gaps.astype(np.uint16), cur[pos]         # positions (2B each) + values verbatim (orig dtype)

def apply_delta(snap, gaps, vals):                  # receiver: invert gaps and overwrite onto snapshot
    pos = (gaps.astype(np.int64) + 1).cumsum() - 1
    out = snap.copy(); out[pos] = vals; return out

def checksum(vals):                                 # integrity hash carried on the metadata channel
    return int(np.bitwise_xor.reduce(vals.view(np.uint32))) if vals.size else 0

rng = np.random.default_rng(0); N = 1_000_000
snap = rng.standard_normal(N).astype(np.float32)    # trainer's previous-step snapshot
cur  = snap.copy()
chg  = rng.choice(N, int(0.02 * N), replace=False)  # 2% of weights move this step
cur[chg] = rng.standard_normal(chg.size).astype(np.float32)

gaps, vals = encode_delta(cur, snap); ck = checksum(vals)
recon = apply_delta(snap, gaps, vals)
assert np.array_equal(recon.view(np.uint32), cur.view(np.uint32))   # happy path: bit-identical, not "close"

# adversarial: one corrupted value on the wire must fail the checksum (never apply silently)
bad = vals.copy(); bad[0] += np.float32(1.0)
assert checksum(bad) != ck, "checksum missed corruption"
print(f"reduction={N*4/(gaps.nbytes+vals.nbytes):.1f}x  bit_identical=True  corruption_detected=True")
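
The assert in encode_delta stands in for the overflow case. A hedged sketch of the uint32 fallback it mentions is below; the function names are hypothetical, and it assumes the chosen width travels with the payload (here it is simply the array's dtype, while a wire format would need a width tag in its metadata).

```python
import numpy as np


def encode_positions(pos):
    """Gap-encode sorted indices: uint16 when every gap fits, otherwise uint32."""
    pos = np.asarray(pos, dtype=np.int64)
    if pos.size == 0:
        return np.empty(0, np.uint16)
    gaps = np.empty(pos.size, np.int64)
    gaps[0] = pos[0]
    gaps[1:] = np.diff(pos) - 1                  # idx[k] - idx[k-1] - 1
    width = np.uint16 if gaps.max() < 2**16 else np.uint32
    return gaps.astype(width)


def decode_positions(gaps):
    """Invert the gap encoding back to absolute indices."""
    return (gaps.astype(np.int64) + 1).cumsum() - 1


dense_pos = np.array([3, 10, 11, 500])           # small gaps: 2 bytes per position
sparse_pos = np.array([5, 70_000, 200_000])      # a gap >= 65536 forces the 4-byte fallback
enc_dense = encode_positions(dense_pos)
enc_sparse = encode_positions(sparse_pos)
print(enc_dense.dtype, enc_sparse.dtype)
```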

In production you do not write this; slime implements it. The simplest path is delta over disk, which needs no engine change (verified flags from the slime doc; pin the slime version):[5]

# slime delta-over-disk: apply into a host-local checkpoint, reload via update_weights_from_disk.
# --update-weight-delta-encoding: xor (new^old, apply exactly once) | overwrite (idempotent)
# --update-weight-delta-checksum: xxh3-128 | blake3 | adler32 (deltas are always zstd level 1)
# Comments sit above the command: a comment after a trailing backslash breaks the line continuation.
python train.py \
  --update-weight-mode delta \
  --update-weight-transport disk \
  --update-weight-disk-dir /shared/fs/delta-updates \
  --update-weight-local-checkpoint-dir /local/nvme/rollout-ckpt \
  --update-weight-delta-encoding xor \
  --update-weight-delta-checksum xxh3-128

For the lower-latency NCCL path, the receiver instead writes the sparse delta directly into live GPU weights; that logic lives in SGLang (PR #26519), so pin an SGLang build that carries it.

How to develop with it

Two implementations expose two different encoding choices; pick by transport bandwidth.

  • Value encoding (disk path). xor writes new ^ old (smallest, fastest, must be applied exactly once); overwrite writes the changed positions' absolute values (larger, but idempotent and re-appliable). Deltas are always zstd level 1.[5]
  • Position encoding (NCCL in-engine path). Absolute int32 indices (4 bytes each) for fast transport, or uint16 gap deltas storing idx[k] - idx[k-1] - 1 (about 2 bytes at 2% density, uint32 fallback on overflow) for mid-band. Values travel verbatim in the original dtype.
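
The practical difference between the two value encodings is what happens on a retried apply. The sketch below is a minimal numpy illustration with hypothetical helper names, not slime's implementation: XOR is an involution, so a second apply silently restores the old weights, while overwrite can be replayed safely.

```python
import numpy as np


def changed_positions(new, old):
    return np.flatnonzero(new.view(np.uint32) != old.view(np.uint32))


def xor_encode(new, old):
    pos = changed_positions(new, old)
    return pos, new.view(np.uint32)[pos] ^ old.view(np.uint32)[pos]   # new ^ old, bitwise


def xor_apply(weights, pos, xor_bits):
    out = weights.copy()
    bits = out.view(np.uint32)          # same memory, integer view
    bits[pos] = bits[pos] ^ xor_bits
    return out


def overwrite_encode(new, old):
    pos = changed_positions(new, old)
    return pos, new[pos]                # absolute values


def overwrite_apply(weights, pos, vals):
    out = weights.copy()
    out[pos] = vals
    return out


old = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
new = np.array([1.0, 2.5, 3.0, -4.0], dtype=np.float32)

pos, xor_bits = xor_encode(new, old)
once = xor_apply(old, pos, xor_bits)
twice = xor_apply(once, pos, xor_bits)      # duplicated apply

pos_o, vals = overwrite_encode(new, old)
ow_twice = overwrite_apply(overwrite_apply(old, pos_o, vals), pos_o, vals)

print("xor once == new:       ", np.array_equal(once, new))
print("xor twice == old:      ", np.array_equal(twice, old))
print("overwrite twice == new:", np.array_equal(ow_twice, new))
```

XOR output is mostly zero bits for small updates, which is why it compresses best; overwrite gives up that advantage in exchange for retry safety.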

slime's transports are nccl and disk for full sync, but delta is disk only; the delta + nccl in-engine path is the separate SGLang PR #26519 approach, not slime:

| Mode | Transport | Behavior |
| --- | --- | --- |
| full | nccl | baseline full-weight broadcast over the trainer-engine NCCL group |
| full | disk | write full checkpoint; engine reloads via update_weights_from_disk |
| delta | disk | slime: write sparse versioned diffs, apply into local checkpoint, ordinary reload |
| delta | nccl | separate (SGLang PR #26519): broadcast sparse positions/values, in-engine NaN-masked apply |

Receiver and sender have a few tuning knobs: --update-weight-delta-read-workers (disk reader concurrency, default 4), --update-weight-delta-chunk-bytes (receiver VRAM chunk cap, default 512MB), and --update-weight-buffer-size (sender bucket size).

How to maintain it

Guard every sync with the checksum: slime recomputes it after applying and raises on any mismatch, and refuses out-of-order applies, so a corrupt or reordered delta fails loudly rather than silently corrupting weights.[5] Still force a full re-sync on any failed apply, because if the receiver fails but the sender has already advanced its snapshot, the rollout weights drift permanently until restart (the known silent-drift defect over cross-DC disk). For the NCCL in-engine path, treat the sender and the SGLang receiver as one versioned unit: pin both and re-run the bit-identical round-trip after any engine or framework upgrade, since a changed apply path can alter reconstruction. Monitor realized delta density: if a run's density jumps from 1-3% toward dense, a shared-scale re-quantization has defeated the diff, so fall back to full canonical sync until it settles.
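
The guard logic described above can be sketched as a small receiver state machine. This is an illustration, not slime's code: the class and exception names are hypothetical, and the XOR-fold checksum is a stand-in for a real hash such as xxh3-128 or blake3.

```python
import numpy as np


class OutOfOrderDelta(Exception):
    """Delta version is not exactly previous + 1; weights are left untouched."""


class FullResyncRequired(Exception):
    """An apply failed after touching weights; only a full sync restores a known state."""


def checksum(vals):
    # XOR-fold stand-in for a real hash (xxh3-128, blake3, ...)
    return int(np.bitwise_xor.reduce(vals.view(np.uint32))) if vals.size else 0


class Receiver:
    def __init__(self, weights):
        self.weights = weights.copy()
        self.version = 0
        self.needs_full_resync = False

    def apply(self, version, pos, vals, expected):
        if self.needs_full_resync:
            raise FullResyncRequired("a previous apply failed")
        if version != self.version + 1:
            raise OutOfOrderDelta(f"got v{version}, expected v{self.version + 1}")
        self.weights[pos] = vals
        if checksum(self.weights[pos]) != expected:   # recompute AFTER applying
            self.needs_full_resync = True
            raise FullResyncRequired(f"checksum mismatch at v{version}")
        self.version = version

    def full_resync(self, weights, version):
        self.weights = weights.copy()
        self.version = version
        self.needs_full_resync = False


trainer = np.arange(8, dtype=np.float32)
rx = Receiver(trainer)
events = []

pos = np.array([1, 4])
vals = np.array([5.0, 6.0], dtype=np.float32)
trainer[pos] = vals
rx.apply(1, pos, vals, checksum(vals))              # v1 applies cleanly

try:
    rx.apply(3, pos, vals, checksum(vals))          # v2 was skipped
except OutOfOrderDelta:
    events.append("refused out-of-order")

pos2 = np.array([2, 7])
vals2 = np.array([5.0, 6.0], dtype=np.float32)
trainer[pos2] = vals2
corrupted = np.array([6.0, 6.0], dtype=np.float32)  # one value damaged in transit
try:
    rx.apply(2, pos2, corrupted, checksum(vals2))
except FullResyncRequired:
    events.append("corruption detected")
    rx.full_resync(trainer, version=2)              # recover from the trainer's full weights

print(events, np.array_equal(rx.weights, trainer))
```

The design point is that a failed apply latches: the receiver refuses all further deltas until a full sync clears the flag, which is what prevents the silent drift described above.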

How to run it in production

Choose the apply path by fabric. On an RDMA cluster, the in-engine NCCL approach (SGLang PR #26519) writes the delta straight into live weights and stacks the byte reduction on bandwidth. For cross-datacenter or fragmented capacity, slime's disk path is what makes it work without RDMA: Fireworks distributed Cursor's Composer 2 training across three to four global clusters (US Ohio, Virginia, EU Frankfurt), each independently reconstructing weights from a shared delta chain and cutting cross-region transfer by about 94%.[3]

A third transport, used by Hugging Face TRL, publishes the sparse delta as safetensors to a content-addressed Hub bucket that the rollout server fetches asynchronously, fully decoupling trainer, inference, and environment across clouds; on Llama-3.1-405B the per-step payload drops from 810 GB to about 6 GB and the in-cluster inference pause falls about 4x.[4]

Use the cross-DC visibility hooks (--custom-delta-pre-push-path, --custom-delta-pre-read-path) to gate on filesystem persistence before notifying the engine.[5] Size the deployment for the pinned-CPU snapshot (about one model in host memory) and chunk the receiver decode (default 512MB cap) to bound the VRAM peak.

Match the precision to what each framework supports: slime online sync covers BF16, FP8, and INT4 (compressed-tensors), with a "quantize then diff" path for FP8 (delta over disk); the miles fork adds MXFP8, and its NVFP4 quantizer is merged but not yet wired into online weight sync (full update only).

Quantization: where delta breaks

Delta holds cleanly for BF16 and FP8 block-wise, and gets hard for finer quantization:

  • Shared scales densify the diff. FP8 and NVFP4 share a block or global scale recomputed from the group amax. If one weight nudges that maximum, the whole block re-quantizes and every value under it changes, so the byte diff is no longer sparse. FP8's high-precision block scale rarely flips; NVFP4's 4-bit steps and extra global scale almost always trigger a whole-block re-quantization.
  • Canonical is not runtime. A quantized weight in engine memory is often a runtime form (4-bit interleaved packing, backend-specific swizzle, folded global scales) that depends on the parallelism layout, and rollout autoscaling makes that layout a moving target, so the delta is logically unalignable.
  • FP8 block-wise is the exception. It is natively 8-bit-addressable with no packing, no swizzle, and only per-block scales, so storage layout equals kernel layout and canonical is approximately runtime, independent of parallelism.
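
The densification in the first bullet can be seen with a toy quantizer. This is a deliberately simplified sketch: a symmetric integer quantizer with one float scale per 32-element block, not real FP8 or NVFP4, and the 5% nudge to the block maximum is exaggerated for visibility. How dense the diff gets in practice depends on the nudge size and on the precision of the stored scale.

```python
import numpy as np


def quantize_block(w, qmax=7):
    """Toy symmetric 4-bit-style quantizer: one shared scale per block (amax / qmax)."""
    scale = np.abs(w).max() / qmax
    return np.round(w / scale).astype(np.int8), scale


w_old = np.linspace(-1.0, 1.0, 32)     # one block of canonical weights
w_new = w_old.copy()
w_new[-1] *= 1.05                      # a single weight moves, and it is the block maximum

codes_old, scale_old = quantize_block(w_old)
codes_new, scale_new = quantize_block(w_new)

n_weights_changed = int(np.count_nonzero(w_old != w_new))
n_codes_changed = int(np.count_nonzero(codes_old != codes_new))
print(f"canonical weights changed: {n_weights_changed} of {w_old.size}")
print(f"quantized codes changed:   {n_codes_changed} of {codes_old.size} (plus the scale)")
```

One changed canonical weight turns into several changed codes plus a changed scale, so a diff taken on the quantized bytes is no longer proportional to the number of weights that moved.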

The rule: for low-precision rollout in general, weight sync goes through the full canonical weights with the engine quantizing on load; delta only pays off where canonical is approximately runtime (BF16, FP8 block-wise). The same sparsity idea also sparsifies DiLoCo pseudo-gradient sync with error feedback, cutting that trainer-to-trainer traffic by a reported 17x.[1]

Failure modes

  • Silent drift on a failed apply. If the receiver fails to apply but the sender has already advanced its snapshot, the rollout weights drift permanently until restart. Over NCCL a failure kills the job (safe); over cross-DC disk the risk is real, so force a full re-sync on any failed apply.
  • XOR delta applied twice. The xor value encoding must be applied exactly once; a retried or duplicated apply corrupts the weights. Use overwrite where idempotency matters.
  • Quantization densifies the delta. A shared-scale re-quantization turns a 2% diff dense; detect it and fall back to full sync rather than shipping a "delta" that is most of the model.
  • Snapshot memory pressure. The host-resident full snapshot plus per-parameter decode buffers can dominate host and VRAM; chunk the apply to bound the peak.
  • Assuming delta replaces RDMA. On a colocated or single-fast-fabric setup the byte savings are marginal or zero; delta earns its keep on constrained or cross-region transport.
  • Version skew after an engine upgrade. For the NCCL path the receiver logic lives in the inference engine; an engine bump can change the apply path, so re-verify bit-identical reconstruction after upgrades.
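
The densification failure mode lends itself to a cheap pre-flight check on the sender. The sketch below compares the estimated delta payload against a full sync and falls back when the delta stops paying off; the function name and the 0.25 cut-off are illustrative assumptions, and the byte sizes assume BF16 values with 2-byte gap-encoded positions before compression.

```python
def plan_sync(changed, total, value_bytes=2, position_bytes=2, max_ratio=0.25):
    """Return ("delta" | "full", ratio) where ratio = delta bytes / full bytes."""
    delta_bytes = changed * (value_bytes + position_bytes)   # position + value per changed element
    full_bytes = total * value_bytes
    ratio = delta_bytes / full_bytes
    return ("delta" if ratio <= max_ratio else "full"), ratio   # max_ratio is an ASSUMED threshold


healthy = plan_sync(changed=20_000, total=1_000_000)      # 2% density
densified = plan_sync(changed=600_000, total=1_000_000)   # re-quantization made the diff dense
print(healthy, densified)
```

At 60% density the "delta" would be larger than the full weights, because every changed element also carries its position.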

References

  • Changyi, "Delta Weight Sync" (primary walk-through of the slime and SGLang implementation): https://changyi.fun/posts/delta-weight-sync/
  • slime (THUDM) delta-weight-sync doc (disk path, flags, encodings, checksums): https://github.com/THUDM/slime/blob/main/docs/en/advanced/delta-weight-sync.md
  • SGLang PR #26519, delta weight update receiver (in-engine NaN-masked apply): https://github.com/sgl-project/sglang/pull/26519
  • Fireworks, "Frontier RL Is Cheaper Than You Think" (production cross-region delta chain): https://fireworks.ai/blog/frontier-rl-is-cheaper-than-you-think
  • Hugging Face TRL, "Delta Weight Sync" (Hub-bucket transport, BF16 change detector, vLLM patch extension): https://huggingface.co/blog/delta-weight-sync
  • Miahi & Belilovsky, Understanding and Exploiting Weight Update Sparsity for Communication-Efficient Distributed RL (PULSE): https://arxiv.org/abs/2602.03839
  • SparrowRL: RL over Commodity Networks, Overcoming the Bandwidth Barrier with Lossless Sparse Deltas: https://arxiv.org/abs/2602.11456
  • SparseRL-Sync: Lossless Weight Synchronization with ~100x Less Communication: https://arxiv.org/abs/2605.07330

Related: Async and disaggregated RL · Rollout redundancy · GRPO · DiLoCo · DiLoCo geo-distributed recipe · slime · KV-cache transfer (NIXL) · Networking fabric · Model weight loading · Quantization for inference · Disaggregated inference · Glossary


  1. PULSE (arXiv 2602.03839), Miahi & Belilovsky: about 99% of per-step BF16 weight updates are invisible after the cast because Adam updates fall below the rounding threshold at RL post-training learning rates; PULSESync sends sparse BF16 patches for over 100x lower sync communication with bit-identical reconstruction, and PULSELoCo sparsifies DiLoCo pseudo-gradient sync with error feedback for over 17x lower trainer communication. 

  2. Independent corroboration of roughly 100x reduction: SparrowRL (arXiv 2602.11456) reports about 79x payload reduction and 2.4-9.5x throughput over full-weight broadcast on commodity networks; SparseRL-Sync (arXiv 2605.07330) reports about 100x less communication, lossless. 

  3. Fireworks, "Frontier RL Is Cheaper Than You Think": 98% of BF16 weights bit-equivalent between adjacent checkpoints; average delta 20.3 GiB, 1.98% of a 1024 GiB model; about 94% less cross-region transfer over a 50-step window; used for Cursor's Composer 2 training distributed across three to four global clusters that reconstruct weights from a shared delta chain. 

  4. Hugging Face TRL, "Delta Weight Sync": about 99% of BF16 weights are bit-identical between steps; a BF16ChangeDetector hook emits sparse safetensors to a content-addressed Hub bucket, and a roughly 30-line vLLM extension applies the (indices, values) patches asynchronously; per-step payload drops from 1.2 GB to 20-35 MB for a 0.6B model and from 810 GB to about 6 GB for Llama-3.1-405B, cutting in-cluster inference pause about 4x and enabling multi-region rollout fleets. 

  5. slime delta-weight-sync doc (from the repo): delta is disk-transport only and needs no delta-aware engine changes, so it is portable across parallelism layouts. Four phases: seed a CPU snapshot from the base checkpoint (correct even when Megatron-to-HF roundtrips are not byte-identical); publish the tensor diff as zstd-level-1 deltas to a shared filesystem in weight_v{N:06d}/ directories; apply into the host-local checkpoint in place; reload via update_weights_from_disk. Flags: --update-weight-mode delta, --update-weight-transport disk, --update-weight-disk-dir, --update-weight-local-checkpoint-dir; value encoding --update-weight-delta-encoding xor (involution, apply exactly once) or overwrite (idempotent); integrity --update-weight-delta-checksum xxh3-128|blake3|adler32, recomputed after apply and raising on mismatch, out-of-order refused; cross-host visibility hooks --custom-delta-pre-push-path / --custom-delta-pre-read-path. The NCCL in-engine receiver is a separate approach in SGLang PR #26519.