
Looped and recurrent-depth transformers

Scope: the architecture family that reaches greater effective depth by applying the same transformer block repeatedly rather than stacking distinct layers. This page covers what a looped (weight-tied, recurrent-depth) transformer is, why iterating a shared block is a distinct scaling axis and a parameter-efficiency lever, when it helps versus a conventional stack or a mixture of experts, and how adaptive computation spends more loops on harder inputs. The most recent instance is Looped World Models (LoopWM); the lineage runs through Universal Transformers and Adaptive Computation Time. The page contrasts directly with the params-vs-compute tradeoff in "MoE sparsity and scaling" and sits beside the scaling-axis discussion in "Scaling toward 100T parameters".

Concept and figures track a 2026 report (LoopWM, 2606.18208) and its lineage; a looped model is trained, not configured, so treat the mechanics as an architecture choice to validate, not a serving flag. The Python example is executed and asserted (numpy); it validates the parameter-count and iterative-refinement math.

flowchart LR
  X["Input latent state"] --> BLOCK["Shared transformer block (one set of weights)"]
  BLOCK --> HALT{"Converged / halt?"}
  HALT -->|"no: loop again (more depth)"| BLOCK
  HALT -->|"yes"| OUT["Output"]
  STACK["Contrast: a stacked model uses K distinct blocks = Kx the parameters"] -.-> BLOCK

What it is

A looped transformer applies one weight-tied block to a latent state over and over, X_{k+1} = f(X_k), so effective depth comes from the number of iterations rather than from stacking K distinct layers. Three ideas define the family:

  • Iterative latent depth. Reasoning happens by refining a latent state across loop steps, which the LoopWM report frames as "iterative latent depth as a new scaling axis," orthogonal to the usual axes of model size and training data.[1]
  • Weight tying. Because the block is shared across loop steps, the parameter count is independent of the loop count; a looped model reaches depth-K computation at the parameter cost of a single block, which LoopWM reports as up to 100x parameter efficiency over a conventional stack.[1]
  • Adaptive computation. The loop count can vary per input: spend more iterations on a hard prediction and fewer on an easy one, halting when the latent state stops changing. This is the Adaptive Computation Time idea applied to depth.[3]

The lineage is Universal Transformers (a weight-tied transformer that recurs in depth with a dynamic halting mechanism) and, more recently, recurrent-depth latent reasoning that scales test-time compute by iterating in latent space rather than emitting more tokens.[2]

Why use it

  • A scaling axis you do not pay for in parameters. Adding loop iterations adds compute and effective depth without adding weights, so a small model can be given more "thinking depth" at inference. This is the mirror image of MoE, which adds parameters at fixed compute; looped models add compute at fixed parameters.
  • Parameter efficiency. For memory-constrained settings (edge, on-device, or fitting a bigger effective model in the same footprint), reaching depth-K behavior at one block's parameters is a large saving, up to Kx versus the equivalent stack.[1]
  • Latent reasoning / test-time compute. Iterating in latent space is a way to spend more compute at inference to improve a hard answer without generating intermediate tokens, complementary to token-level test-time compute.
  • Input-adaptive cost. Adaptive halting means easy inputs are cheap and only hard ones pay for extra depth, rather than every input paying the worst-case depth of a fixed stack.
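The parameter-versus-compute contrast in the bullets above can be made concrete with a little arithmetic. The sketch below is illustrative only: the helper names (`block_params`, `stacked`, `looped`, `moe`) are hypothetical, the 12·d² per-block count (4·d² attention plus 8·d² for a 4x-wide MLP) is an assumed proxy, and per-token FLOPs are approximated as 2 × active parameters, ignoring the attention-score term. None of these figures come from the LoopWM report.

```python
# Hypothetical accounting helpers: parameters vs. per-token compute for three designs.
# Assumptions (not from the source): 12*d^2 weights per block, FLOPs ~= 2 * active params.

def block_params(d):
    return 12 * d * d                      # 4*d^2 attention + 8*d^2 MLP (4x expansion)

def stacked(d, depth):
    p = depth * block_params(d)            # K distinct blocks
    return {"params": p, "flops_per_token": 2 * p}

def looped(d, loops):
    p = block_params(d)                    # one shared block, regardless of loop count
    return {"params": p, "flops_per_token": 2 * p * loops}

def moe(d, depth, experts, active):
    attn, mlp = 4 * d * d, 8 * d * d
    p = depth * (attn + experts * mlp)     # every expert is stored...
    active_p = depth * (attn + active * mlp)   # ...but only `active` experts run per token
    return {"params": p, "flops_per_token": 2 * active_p}

d, K = 1024, 8
for name, m in [("stacked", stacked(d, K)),
                ("looped", looped(d, K)),
                ("moe", moe(d, K, experts=8, active=1))]:
    print(f"{name:8s} params={m['params']:>12,d}  flops/token={m['flops_per_token']:>12,d}")
```

Under these assumptions all three designs spend the same compute per token at K = 8, while the looped model stores one eighth of the stack's weights and the MoE stores several times more. That is the sense in which looping and MoE are mirror images.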

When to use it (and when not)

  • Use it when you want more effective depth or reasoning compute without more parameters, when memory is the binding constraint, or when per-input adaptive depth is valuable.
  • Prefer a conventional stack when representational diversity across layers matters: distinct layers can learn distinct functions, whereas a shared block must do one job well at every depth, which can cap peak quality.
  • Watch convergence. The recurrence must be stable (contractive enough to settle), or extra loops add cost without refining the state; a non-convergent loop is worse than a fixed stack.
  • Mind serving. Variable depth per input complicates batching and the KV cache (below), so the inference-side simplicity of a fixed stack is sometimes worth its parameter cost.

Architecture

The structural choice is loop versus stack. A stacked model has K distinct blocks and K times the parameters; a looped model has one block applied K times, with the loop count either fixed or chosen per input by a halting head. The state carried between iterations is the latent, so the "depth" is a trajectory of refinements of one representation rather than a feed-forward pass through different representations. Adaptive computation attaches a halting signal (a learned probability or a convergence threshold on the latent's change) that decides, per input or per token, when to stop looping.
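As one way to picture the learned-probability variant of the halting signal, here is a minimal per-token sketch in the spirit of Adaptive Computation Time: each token accumulates a halting probability across loops, stops once the total reaches 1 - eps (or the loop cap), and returns the halting-weighted mix of its intermediate states. The function name `act_loop`, the linear halting head (`halt_w`, `halt_b`), and the toy block are assumptions for illustration, not the LoopWM mechanism.

```python
import numpy as np

def act_loop(x, block, halt_w, halt_b, eps=0.01, max_loops=12):
    """ACT-style adaptive depth. x: (tokens, d). Returns (output, loops used per token)."""
    n = x.shape[0]
    state = x.copy()
    cum = np.zeros(n)                      # accumulated halting probability per token
    out = np.zeros_like(x)                 # halting-weighted mix of intermediate states
    loops = np.zeros(n, dtype=int)
    running = np.ones(n, dtype=bool)
    for k in range(1, max_loops + 1):
        new_state = block(state)           # the SAME block at every iteration
        state = np.where(running[:, None], new_state, state)   # halted tokens stay frozen
        h = 1.0 / (1.0 + np.exp(-(state @ halt_w + halt_b)))   # halting head (sigmoid)
        stops = running & ((cum + h >= 1.0 - eps) | (k == max_loops))
        p = np.where(stops, 1.0 - cum, h) * running   # final step takes the remainder
        out += p[:, None] * state
        cum += p
        loops += running
        running &= ~stops
        if not running.any():
            break
    return out, loops

rng = np.random.default_rng(0)
d = 16
W = rng.standard_normal((d, d))
W *= 0.7 / np.linalg.norm(W, 2)            # keep the toy block contractive
x = rng.standard_normal((5, d))
out, loops = act_loop(x, lambda s: np.tanh(s @ W.T), rng.standard_normal(d), 0.0)
print("loops per token:", loops)
```

Because the per-token weights always sum to one, the output is a convex combination of the states visited. The loop cap bounds worst-case depth even when the halting head never fires.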

How to use it

Two load-bearing properties can be validated directly: weight tying makes the parameter count independent of depth, and iterating a contraction refines the latent toward a fixed point, which is what permits adaptive early stopping. This runnable block asserts both. Read it with the adversarial caveat in mind: the loop counts it prints are compute, and compute (not parameters) is what scales with depth:

# looped.py — validated: weight tying makes params independent of loop depth; iterating a shared
# block refines a latent and halts adaptively (fewer loops on easier inputs). numpy only.
import numpy as np

def params(d, layers):
    return layers * (4 * d * d)                        # per-layer weight-count proxy (~4 d^2: the Q, K, V, O projections; MLP weights omitted)

d = 256
for K in (2, 4, 8, 16):                                # a stacked K-layer model costs Kx the parameters
    assert params(d, K) == K * params(d, 1)            # ...while a looped model keeps one block's params at any K

rng = np.random.default_rng(0)
W = rng.standard_normal((d, d))
W *= 0.7 / np.linalg.norm(W, 2)                        # spectral norm 0.7; tanh is 1-Lipschitz, so the block is a contraction
b = rng.standard_normal(d)

def refine(x0, tol=1e-4, max_loops=500):               # apply the SAME block until the latent settles
    x = x0.copy()
    deltas = []                                        # per-loop change ||x_{k+1} - x_k||
    for k in range(1, max_loops + 1):
        xn = np.tanh(W @ x + b)
        deltas.append(np.linalg.norm(xn - x))
        x = xn
        if deltas[-1] < tol:                           # adaptive halt: stop when refinement converges
            return k, x, deltas
    return max_loops, x, deltas

_, xstar, _ = refine(np.zeros(d), tol=1e-10)           # the fixed point the loop converges to
easy_k, _, ed = refine(xstar + 0.01 * rng.standard_normal(d))    # near the fixed point: easy
hard_k, _, _  = refine(xstar + 5.0 * rng.standard_normal(d))     # far from it: hard
assert all(b_ <= a + 1e-9 for a, b_ in zip(ed, ed[1:]))          # residual shrinks monotonically (contraction)
assert easy_k < hard_k                                          # adaptive computation: easy input halts sooner
print(f"stacked(8) params = {params(d, 8) // params(d, 1)}x looped; adaptive loops easy={easy_k} hard={hard_k}")

How to develop with it

The knobs are the loop budget, the tying scope, and the halting rule. The maximum loop count bounds worst-case depth and compute; the weight-tying scope decides whether one block or a small set of blocks is reused (fully tied is most parameter-efficient but hardest to fit); and the halting mechanism (Adaptive Computation Time's learned halting probability, or a convergence threshold on the latent) sets how per-input depth is chosen. The central risk is stability: the recurrence should refine the state toward a useful fixed point, so training must keep it from diverging or collapsing, which is why looped models often add normalization or a residual toward the input at each step. Because depth is decoupled from parameters, you can also scale loops at test time beyond the training budget, but validate that extra loops still help rather than drift.
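The stability point can be illustrated with a toy recurrence. The sketch below deliberately builds a block that is not a contraction and compares a plain residual update with one that re-injects the input and normalizes at every step. The names (`rms_norm`, `raw_step`, `stable_step`) and the linear toy block are assumptions for illustration; real looped models choose their own stabilizers.

```python
import numpy as np

def rms_norm(x):
    return x / np.sqrt(np.mean(x * x) + 1e-8)     # rescale to unit root-mean-square

def run(step, x0, loops):
    """Apply `step` repeatedly and record the latent's norm after each loop."""
    x = x0.copy()
    norms = []
    for _ in range(loops):
        x = step(x)
        norms.append(float(np.linalg.norm(x)))
    return x, norms

rng = np.random.default_rng(0)
d = 64
W = rng.standard_normal((d, d))
W *= 1.5 / np.linalg.norm(W, 2)      # spectral norm 1.5: deliberately NOT a contraction
inp = rng.standard_normal(d)         # the input embedding, re-injected at every step

def raw_step(x):
    return x + W @ x                 # plain residual update, no stabilizer

def stable_step(x):
    return rms_norm(x + W @ x + inp) # residual toward the input, then normalization

_, raw_norms = run(raw_step, inp, 32)       # 32 loops, e.g. past a smaller training budget
_, stable_norms = run(stable_step, inp, 32)
print(f"raw norm after 32 loops:    {raw_norms[-1]:.3e}")
print(f"stable norm after 32 loops: {stable_norms[-1]:.3f}")
```

A bounded norm is necessary but not sufficient: normalization stops the blow-up, yet the latent can still wander, so the per-step change remains the quantity to validate when scaling loops past the training budget.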

How to maintain it

Treat convergence as a monitored property, not an assumption. Track the per-step change in the latent across loops on a held-out set; if it stops shrinking or oscillates, the recurrence is not contractive and extra loops burn compute without refining the answer. Because the same weights run at every depth, a training or fine-tuning change can shift the fixed-point behavior, so re-check the convergence curve and the adaptive halting distribution after any update. Keep the maximum loop count pinned so a regression cannot silently inflate inference cost, and watch the halting head's calibration so easy inputs keep halting early.
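A monitoring check along these lines might look like the sketch below, which classifies a recorded sequence of per-loop latent changes. The function name `classify_convergence`, the labels, and the `shrink` and `window` thresholds are hypothetical; tune them against your own held-out curves.

```python
def classify_convergence(deltas, shrink=0.95, window=4):
    """Classify a per-loop sequence of latent changes ||x_{k+1} - x_k||.

    Looks at the last `window` step-to-step ratios. Thresholds are illustrative.
    """
    if len(deltas) < window + 1:
        return "too_short"
    tail = deltas[-(window + 1):]
    ratios = [b / a for a, b in zip(tail, tail[1:]) if a > 0]
    if not ratios:
        return "converged"                 # the change is already exactly zero
    if all(r <= shrink for r in ratios):
        return "converging"                # shrinking by at least (1 - shrink) per loop
    ups = sum(r > 1.0 for r in ratios)
    downs = sum(r < 1.0 for r in ratios)
    if ups and downs:
        return "oscillating"               # change alternates between growing and shrinking
    if ups == len(ratios):
        return "diverging"
    return "stalled"                       # flat, or shrinking too slowly to be worth more loops

for name, seq in [("geometric", [0.5 ** k for k in range(10)]),
                  ("flat", [1.0] * 10),
                  ("alternating", [1.0, 2.0] * 5)]:
    print(name, "->", classify_convergence(seq))
```

Running this over a held-out set after every training or fine-tuning update gives a cheap regression signal: a shift from "converging" toward "stalled" or "oscillating" means extra loops have stopped paying for themselves.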

How to run it in production

At inference, compute scales with the realized loop count, so serving a looped model is a compute-versus-latency decision rather than a memory one: a small-parameter model can still be slow if it loops many times. The operational wrinkle is that variable depth per input breaks the uniform-batch assumption, since different sequences in a batch may want different loop counts; either cap depth for batching simplicity or schedule by realized depth the way you would for variable-length generation (continuous batching). The KV cache interacts too: because the shared block re-attends at every iteration, the cache is no longer written and read once per layer in a single feed-forward pass, and its access pattern changes accordingly. The upside for deployment is the parameter footprint, which is what makes looped and recurrent-depth designs attractive for consumer or edge hardware where memory, not compute, is the binding constraint. Present the efficiency honestly: it is a parameter saving, not a compute saving, so quote both.
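The batching trade-off can be estimated before touching a serving stack. The sketch below counts shared-block applications for one batch under two schedules: a uniform batch in which every sequence runs to the deepest one, and a depth-aware schedule in which sequences leave as they halt. The helper name `block_applications` and the example loop counts are made up for illustration.

```python
def block_applications(loop_counts, cap=None):
    """Count shared-block applications for one batch under two schedules.

    loop_counts: realized loop count per sequence; cap: optional maximum depth.
    """
    ks = [min(k, cap) if cap is not None else k for k in loop_counts]
    padded = len(ks) * max(ks)      # uniform batch: everyone runs to the deepest sequence
    by_depth = sum(ks)              # depth-aware: each sequence pays only for its own loops
    return {"padded": padded, "by_depth": by_depth, "utilization": by_depth / padded}

counts = [2, 3, 3, 4, 16]           # one hard input dominates the batch
print("uncapped:", block_applications(counts))
print("cap=8:   ", block_applications(counts, cap=8))
```

A single hard input drags utilization down for the whole uniform batch. Capping depth recovers some of it at the cost of underthinking that input, which is the same trade the halting calibration has to manage.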

Failure modes

  • Non-convergent recurrence. If the shared block is not contractive, more loops do not refine the latent and just add cost; monitor the per-step change and add stabilizing structure.
  • Lost layer diversity. A single shared block cannot specialize by depth the way distinct layers can, which can cap peak quality on tasks that need heterogeneous computation.
  • Compute mistaken for free. The parameter efficiency is real, but each loop is a full block of compute; a "100x smaller" model that loops 100 times is not 100x cheaper to run.
  • Variable-depth serving stalls. Per-input loop counts break naive uniform batching; cap depth or schedule by realized depth.
  • Uncalibrated halting. A halting head that stops too early underthinks hard inputs; one that stops too late overspends on easy ones. Calibrate on held-out data.

References

  • LoopWM: Looped World Models (parameter-shared iterative latent refinement, adaptive depth): https://arxiv.org/abs/2606.18208
  • Dehghani et al., Universal Transformers (weight-tied recurrent-depth transformer with dynamic halting): https://arxiv.org/abs/1807.03819
  • Graves, Adaptive Computation Time for Recurrent Neural Networks: https://arxiv.org/abs/1603.08983

Related: MoE sparsity and scaling · Scaling toward 100T parameters · MoE routing and load balancing · Activation checkpointing and offloading · Roofline and arithmetic intensity · Continuous batching internals · vLLM on consumer GPUs · Glossary


  1. LoopWM (arXiv 2606.18208): iteratively refines latent environment states through a parameter-shared transformer block; frames iterative latent depth as a scaling axis orthogonal to model size and data, with adaptive computation that scales depth to each prediction step's complexity, reporting up to 100x parameter efficiency over conventional approaches. 

  2. Dehghani et al., Universal Transformers: a weight-tied transformer that recurs in depth with a per-position dynamic halting mechanism, the direct ancestor of looped/recurrent-depth designs; recent recurrent-depth latent-reasoning work extends the idea to scale test-time compute by iterating in latent space rather than emitting more tokens. 

  3. Graves, Adaptive Computation Time: a learned halting probability lets a recurrent model spend a variable number of computation steps per input, the mechanism looped transformers use to make depth input-adaptive.