Skip to content
Markdown

DwarfStar (ds4): DeepSeek V4 local inference

Scope: DwarfStar (ds4), Salvatore Sanfilippo's self-contained C inference engine that runs DeepSeek V4 Flash (284B parameters, 13B activated) on a single 96-128 GB machine and DeepSeek V4 PRO (1.6T parameters) on paired Mac Studios. This page covers the asymmetric quantization mix, SSD expert streaming, the disk-first KV cache, the OpenAI/Anthropic-compatible server, and layer-split distributed inference, plus the validation methodology that makes a one-model engine trustworthy. It extends running local coding agents and open-weight serving; the quantization background is in quantization for inference.

Shell commands and configs are reference templates from the ds4 repository as of 2026-07; the project is explicitly beta quality and moves fast, so verify flags against the commit you build. Speed numbers are the repository's own single-run benchmarks, not reproduced here. The numpy blocks are self-contained and are executed and asserted in this page.

What it is

DwarfStar is a local inference engine for exactly one model family: DeepSeek V4 Flash, with PRO support on very high-memory machines. It is not a general GGUF runner and refuses arbitrary DeepSeek GGUFs; it only loads the quantized weights published for it (Hugging Face antirez/deepseek-v4-gguf), because the engine, the quantization mix, the prompt rendering, the KV handling, and the validation suite are optimized together as one product. Backends are Metal (primary), CUDA with specific care for the DGX Spark, and ROCm for Strix Halo unified-memory machines. The CPU path exists only for correctness checks.

The problem it solves is arithmetic: 284B parameters at 16 bits is 568 GB of weights, and RAM is normally a binary constraint. ds4 gets the 2-bit Flash build down to 81 GB, and when even that does not fit, SSD streaming keeps the model running at reduced speed instead of not running at all. The repository credits llama.cpp and GGML for the quantization formats and kernel groundwork it builds on, while remaining a from-scratch, dependency-free implementation.

Rough capability picture from the repo's own benchmarks (Metal, 32k context, greedy, long prompts): 21-27 tokens/s generation and 250-468 tokens/s prefill for the 2-bit Flash build across M3 Max, M5 Max, and M3 Ultra machines, and 13.8 tokens/s generation on a DGX Spark GB10.

Why use it

  • Frontier-adjacent capability offline: DeepSeek V4 Flash is a 284B MoE with 1M-token context; the 2-bit build runs it on one 128 GB MacBook at interactive speeds, with tool calling reliable enough for coding agents.
  • The integration is the product: one team controls the engine, the GGUF layout, the calibration corpus, the validation vectors, the server, and the agent. Every layer assumes the others, which is how 2-bit experts stay usable.
  • RAM becomes a speed dial: with --ssd-streaming, 128 GB runs at full speed, 96 GB slightly slower, 64 GB slower still. The model always runs; only the expert cache hit rate changes (validated below).
  • Sessions are durable: the KV cache is a first-class disk citizen. Saved sessions resume with zero re-prefill, and agent restarts reuse cached prefixes instead of reprocessing a 25k-token system prompt.
  • Validated, not vibes: GGUFs are gated on token-level agreement with logit vectors captured from the official DeepSeek API, plus a 92-question capability regression suite that runs the same inference path users run.

When to use it (and when not)

  • Use it when you want the strongest open-weight model that fits personal hardware (MacBook, Mac Studio, DGX Spark, Strix Halo) for offline or private coding-agent work, and you accept a single-model engine. Wire it to a harness as in running local coding agents.
  • Use SSD streaming when the 81 GB build does not fit your RAM (64 GB MacBooks) or when you need to reserve memory for a very long context.
  • Do not use it to serve a fleet or multiple tenants: the server runs one graph worker, does not batch independent requests, and keeps a single live KV session. For throughput serving of open-weight models on datacenter GPUs, use vLLM or SGLang (inference serving, DeepSeek on vLLM).
  • Do not use it if you need model choice: it runs the published DeepSeek V4 GGUFs and nothing else, and the README states old model support may be dropped when a better fit appears.
  • Do not expose it: the HTTP server and the distributed protocol have no authentication or encryption; both are for trusted machines on trusted networks.

Architecture

flowchart TB
  subgraph SSD["GGUF on SSD"]
    EXP["Routed experts, 2-bit<br/>(IQ2_XXS up/gate, Q2_K down)"]
    KVD["Disk KV cache<br/>(sha1-of-prefix .kv files)"]
  end
  subgraph RAM["Resident in RAM, 8-bit skeleton"]
    ATT["Attention + projections"]
    RTR["Router + shared experts"]
    OUT["Output head"]
    EC["Pinned expert cache<br/>(mlocked, hotness eviction, hot preload)"]
    KV["Live KV checkpoint<br/>(raw window + compressed history)"]
  end
  RTR -->|"top-k expert ids"| EC
  EC -->|"cache miss: read expert"| EXP
  KV <-->|"cold / continued / evict / shutdown saves"| KVD
  CLI["ds4 CLI / ds4-agent"] --> RTR
  SRV["ds4-server<br/>/v1/chat/completions · /v1/responses · /v1/messages"] --> RTR
  WRK["Distributed workers<br/>(layer slices over TCP)"] -.->|"activations"| RTR

Two decisions dominate the design. First, precision follows load-bearing structure: every token passes through attention, the router, the shared experts, and the output head, so those stay at 8-bit; each token touches only a few of the 256 routed experts per layer (43 layers in Flash), so the experts absorb the 2-bit damage. The official model release already points this way, shipping experts in FP4 and most other tensors in FP8; ds4 pushes the same asymmetry one level further down. Second, weights and KV state are file-first: experts can stay on SSD and stream through a pinned cache, and KV checkpoints serialize to disk so sessions survive restarts.

How it works (validated core mechanics)

Three mechanisms carry the design: block quantization (why 2-bit is brutal), the asymmetric mix (why it works anyway), and the streaming expert cache (why RAM becomes a dial). Each is validated below in plain numpy, including edge and adversarial cases. The blocks are self-contained; run them with any numpy-only Python.

2-bit block quantization and the outlier problem

GGUF-style quantizers store low-bit codes plus per-block scale factors. At 2 bits there are four representable levels per block, so the step size is 85x coarser than 8-bit on identical data, and a single outlier in a block wrecks the resolution for its 31 neighbours. That outlier sensitivity is why ds4 ships imatrix-tuned quants: activation statistics tell the quantizer which columns matter, so scale choices favour the weights the model actually uses.

import numpy as np


def quantize_blocks(x, bits, block=32):
    """Min/max block quantizer: per-block scale+offset, k-bit codes.
    Returns (codes, mins, steps); x length must be a multiple of block."""
    x = np.asarray(x, dtype=np.float64).reshape(-1, block)
    levels = 2**bits - 1
    mins = x.min(axis=1, keepdims=True)
    steps = (x.max(axis=1, keepdims=True) - mins) / levels
    safe = np.where(steps == 0.0, 1.0, steps)         # constant block: any code maps back
    codes = np.clip(np.round((x - mins) / safe), 0, levels).astype(np.int64)
    return codes, mins, np.where(steps == 0.0, 0.0, steps)


def dequantize_blocks(codes, mins, steps):
    return (codes * steps + mins).reshape(-1)


rng = np.random.default_rng(0)
w = rng.normal(size=4096)

# round-trip error bound: uniform quantization error is at most step/2 per element
for bits in (2, 8):
    codes, mins, steps = quantize_blocks(w, bits)
    assert codes.min() >= 0 and codes.max() <= 2**bits - 1
    err = np.abs(dequantize_blocks(codes, mins, steps) - w).reshape(-1, 32)
    assert np.all(err <= steps / 2 + 1e-12), f"{bits}-bit exceeded step/2 bound"

# 2-bit steps are (2^8-1)/(2^2-1) = 85x coarser than 8-bit on the same blocks
_, _, s2 = quantize_blocks(w, 2)
_, _, s8 = quantize_blocks(w, 8)
assert np.allclose(s2 / s8, 85.0), "step ratio must be exactly (255/3)"

# edge: constant block reconstructs exactly, no NaN/Inf from the zero range
const = np.full(32, 3.14)
codes, mins, steps = quantize_blocks(const, 2)
out = dequantize_blocks(codes, mins, steps)
assert np.all(np.isfinite(out)) and np.allclose(out, const)

# edge / adversarial: one outlier inflates the block scale and ruins the other
# 31 values; this is why imatrix-guided scale choice exists for the 2-bit quants
tame = rng.uniform(-1.0, 1.0, size=31)
codes, mins, steps = quantize_blocks(np.concatenate([tame, [50.0]]), 2)
err_tame = np.abs(dequantize_blocks(codes, mins, steps)[:31] - tame)
assert err_tame.max() > 0.5, "outlier should coarsen the 2-bit block scale"
codes, mins, steps = quantize_blocks(np.concatenate([tame, [1.0]]), 2)
err_tame = np.abs(dequantize_blocks(codes, mins, steps)[:31] - tame)
assert err_tame.max() <= steps.max() / 2, "same data without the outlier is fine"
print("block quantizer: all asserts passed")

The asymmetric mix: 8-bit skeleton, 2-bit experts

ds4 quantizes only the routed MoE experts (up and gate projections at IQ2_XXS, down at Q2_K) and leaves attention, the router, the shared experts, and the output head at high precision; the GGUF filenames spell it out (AProjQ8-SExpQ8-OutQ8). The experiment below reproduces the structural claim on a toy MoE layer: the identical 2-bit expert quantizer is tolerable behind an 8-bit skeleton and destructive when the skeleton is 2-bit too, because a corrupted router selects the wrong experts outright. The absolute error levels here belong to the crude min/max toy; the codebook quants plus imatrix land far lower, but the gap between the two configurations is the point.

import numpy as np


def quant(w, bits, block=32):
    """Min/max block quantize-dequantize a weight matrix in place (see the
    quantizer block for the error-bound and edge-case validation)."""
    flat = w.reshape(-1, block)
    levels = 2**bits - 1
    lo = flat.min(axis=1, keepdims=True)
    step = (flat.max(axis=1, keepdims=True) - lo) / levels
    safe = np.where(step == 0.0, 1.0, step)
    return (np.round((flat - lo) / safe) * step + lo).reshape(w.shape)


def moe_forward(x, router, shared, experts, top_k=4):
    """Toy MoE block: shared expert always on, top-k routed experts, softmax
    route weights. Returns (output, per-token top-k index sets)."""
    logits = x @ router                                   # (n, E)
    top = np.argsort(-logits, axis=1)[:, :top_k]          # (n, k)
    z = np.take_along_axis(logits, top, axis=1)
    gate = np.exp(z - z.max(axis=1, keepdims=True))
    gate /= gate.sum(axis=1, keepdims=True)
    out = np.maximum(x @ shared["up"], 0.0) @ shared["down"]
    for j in range(top_k):
        for e in np.unique(top[:, j]):
            rows = top[:, j] == e
            h = np.maximum(x[rows] @ experts[e]["up"], 0.0)
            out[rows] += gate[rows, j : j + 1] * (h @ experts[e]["down"])
    return out, top


rng = np.random.default_rng(42)
d, f, E, n = 64, 128, 32, 512
x = rng.normal(size=(n, d)) / np.sqrt(d)
router = rng.normal(size=(d, E)) / np.sqrt(d)
shared = {"up": rng.normal(size=(d, f)) / np.sqrt(d),
          "down": rng.normal(size=(f, d)) / np.sqrt(f)}
experts = [{"up": rng.normal(size=(d, f)) / np.sqrt(d),
            "down": rng.normal(size=(f, d)) / np.sqrt(f)} for _ in range(E)]

q = lambda p, bits: {k: quant(v, bits) for k, v in p.items()}
ref, top_ref = moe_forward(x, router, shared, experts)

# ds4's mix: routed experts 2-bit, router and shared expert 8-bit
asym, top_asym = moe_forward(x, quant(router, 8), q(shared, 8),
                             [q(e, 2) for e in experts])
# naive: everything 2-bit, router included
all2, top_all2 = moe_forward(x, quant(router, 2), q(shared, 2),
                             [q(e, 2) for e in experts])

rel = lambda y: np.linalg.norm(y - ref) / np.linalg.norm(ref)
agree = lambda t: np.mean([len(set(a) & set(b)) / 4 for a, b in zip(t, top_ref)])

print(f"asym: rel_err={rel(asym):.4f} routing_agreement={agree(top_asym):.4f}")
print(f"all2: rel_err={rel(all2):.4f} routing_agreement={agree(top_all2):.4f}")

# the 8-bit router preserves expert selection; the 2-bit router corrupts it
assert agree(top_asym) >= 0.99, "8-bit router must keep routing intact"
assert agree(top_all2) < agree(top_asym), "2-bit router must lose routing accuracy"
# output error: the same 2-bit expert quantizer does far less damage behind an
# 8-bit skeleton. The absolute numbers are for this crude min/max toy; ds4's
# codebook quants plus imatrix land much lower, but the structural gap is the claim.
assert rel(all2) > 2 * rel(asym), "uniform 2-bit must be much worse than the mix"
assert rel(asym) < 0.3, "asymmetric mix drifted beyond the toy's expected band"
print("asymmetric MoE quantization: all asserts passed")

Executed result: the asymmetric mix keeps 99.7% routing agreement with fp32 and 2.6x lower output error, while uniform 2-bit drops routing agreement to 74.8%. Wrong expert selection is unrecoverable downstream, which is exactly why the router is a load-bearing wall.

SSD streaming: the expert cache turns RAM into a dial

In streaming mode the non-routed weights stay resident, and routed experts live in an in-memory cache backed by the GGUF file on SSD; a router request either hits the cache or reads the expert from disk and evicts an entry: the primary Metal backend evicts the expert with the lowest decayed route-hotness counter (recency as tiebreak), while the CUDA backend uses plain LRU. This works because expert usage is heavily skewed in practice (ds4 preloads hot experts at startup by default). The simulation below replays one routing trace against LRU caches of shrinking capacity (both policies are stack algorithms, so the monotonicity it demonstrates carries over). LRU's inclusion property makes the hit rate exactly monotone in capacity, which is the "speed dial, not a cliff" behaviour, and the adversarial uniform-routing case shows the worst-case floor where a cache holds only its pro-rata share.

import numpy as np


def route_trace(n_tokens, n_experts, k, skew, seed=7):
    """Per-token routed-expert sets under a bounded power-law popularity."""
    rng = np.random.default_rng(seed)
    p = 1.0 / np.arange(1, n_experts + 1) ** skew
    p /= p.sum()
    return [rng.choice(n_experts, size=k, replace=False, p=p) for _ in range(n_tokens)]


def replay(trace, capacity, preload=None):
    """LRU expert cache: returns (hit_rate, per-token miss counts)."""
    cache = dict.fromkeys(preload[:capacity] if preload is not None else [])
    hits, misses_per_token = 0, []
    for token in trace:
        m = 0
        for e in token:
            e = int(e)
            if e in cache:
                del cache[e]                      # move to most-recent position
            else:
                m += 1
                if len(cache) >= capacity:
                    del cache[next(iter(cache))]  # evict least-recently used
            cache[e] = None
        hits += len(token) - m
        misses_per_token.append(m)
    return hits / (len(trace) * len(trace[0])), np.array(misses_per_token)


E, K = 256, 8                                     # one Flash layer: 256 routed experts
trace = route_trace(n_tokens=4000, n_experts=E, k=K, skew=0.9)

t_compute, t_miss = 1.0, 0.5                      # decode step vs one SSD expert read
capacities = [256, 192, 128, 96, 64, 32]
speeds = []
for c in capacities:
    hit, misses = replay(trace, c)
    speeds.append(1.0 / (t_compute + t_miss * misses.mean()))
    print(f"cache={c:3d}/{E} experts  hit_rate={hit:.3f}  tokens/s={speeds[-1]:.3f}")

# LRU's inclusion property makes hit rate exactly monotone in capacity on the
# same trace: shrinking RAM is a speed dial, never a cliff. The model always runs.
hit_rates = [replay(trace, c)[0] for c in capacities]
assert all(a >= b for a, b in zip(hit_rates, hit_rates[1:])), "hit rate not monotone"
assert all(a >= b for a, b in zip(speeds, speeds[1:])), "throughput not monotone"
assert speeds[-1] > 0.2, "even a 1/8 cache must keep generating tokens"
# skewed usage is what makes streaming pay: a 25% cache holds well over 25% of hits
assert hit_rates[capacities.index(64)] > 0.45, "power-law skew should beat pro-rata"

# preloading the popular experts removes the cold-start miss burst
popular = np.arange(E)                            # rank order = popularity order
_, cold = replay(trace, 128)
_, warm = replay(trace, 128, preload=popular)
assert warm[:50].sum() < cold[:50].sum(), "preload must cut cold-start misses"

# adversarial worst case: uniform routing gives no reuse beyond pro-rata capacity
flat = route_trace(n_tokens=4000, n_experts=E, k=K, skew=0.0)
hit_flat, _ = replay(flat, 64)
assert abs(hit_flat - 64 / E) < 0.05, f"uniform routing must approach C/E, got {hit_flat:.3f}"
print("ssd-streaming cache: all asserts passed")

Executed result: hit rate degrades smoothly from 0.992 (full cache) to 0.386 (1/8 cache) and modelled throughput from 0.97 to 0.29 tokens per unit compute, never to zero. Generation is more sensitive than prefill because every new token routes through experts again, which matches the repo's guidance.

How to use it

Download one published GGUF, build for your backend, run. The 2-bit imatrix build is the default for 96/128 GB machines:

git clone https://github.com/antirez/ds4 && cd ds4
./download_model.sh q2-imatrix     # 81 GB; q4-imatrix needs >= 256 GB RAM
make                               # macOS Metal; make cuda-spark for DGX Spark GB10
./ds4 -p "Explain Redis streams in one paragraph."
./ds4                              # interactive chat: /think, /nothink, /ctx N, /read FILE

On a 64 GB MacBook, add SSD streaming with an explicit expert-cache budget; on larger machines prefer the automatic budget (80% of the Metal recommended working set minus non-routed weights):

./ds4 -m ./ds4flash.gguf --ssd-streaming --ssd-streaming-cache-experts 32GB --ctx 32768 --nothink

The value is a budget for whole routed experts, not a byte cache, and oversized requests are capped so the cache stays mlock-able instead of silently paging. Watch the startup cache report line: start conservative, raise it only while the log still reports a lockable cache. On an M5 Max 128 GB running PRO q2, the automatic budget picked about 59 GB and manual 64-75 GB was no better.

The native agent (ds4-agent) runs inference in-process, so sessions are the on-disk KV cache itself: /save, /list, /switch <sha> resume full sessions with no prefill stage, and /strip keeps the text but drops the heavy KV payload. Sessions live in ~/.ds4/kvcache. --power N throttles GPU duty cycle (70 targets about 70%) for heat, battery, and fan noise on all the tools.

How to develop with it

The development loop is anchored on validation artifacts, and any change to kernels, quantization, rendering, or KV handling is expected to pass them:

  • Official-vector tests: tests/test-vectors holds greedy continuation vectors captured from the official DeepSeek API with top logprobs; make test compares local logits token by token (DS4_METAL_PREFILL_CHUNK=2048 pinned), so tokenizer, template, or attention regressions surface before they become generation failures.
  • Capability regression: ds4-eval embeds 92 questions (25 GPQA Diamond, 25 audited SuperGPQA, 25 AIME 2025, 17 COMPSEC vulnerability-localization items) and runs them through the same inference path users run. It is explicitly not a leaderboard; it answers whether a change broke hard reasoning. A deterministic four-question gate (--questions 4 --temp 0 --seed 1) must reproduce exact generated-token counts.
  • Quantization tooling: gguf-tools/ regenerates GGUFs and collects the imatrix. The calibration corpus is 4682 DS4-rendered prompts (about 2.91M tokens) spanning code review, contest math, long documents, DSML tool calls, and both thinking modes. Gate/up tensors record squared FFN-normalized input activations; down tensors record the squared routed SwiGLU row after route weighting, so the quantizer knows which columns the real graph exercises.
  • Debugging: --dump-tokens (tokenization only), --dump-logprobs (greedy continuation with top alternatives, separating sampling from logit bugs), --dump-logits, and ds4-server --trace (rendered prompts, cache decisions, tool-parser events for a whole agent session).

The repo also ships single-vector activation steering (dir-steering/), following the refusal-direction result (References), as a fast alternative to fine-tuning for tone and topic control.

How to integrate it

ds4-server exposes OpenAI chat completions, OpenAI Responses, legacy completions, and Anthropic messages endpoints from one binary:

./ds4-server --ctx 100000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192

Codex CLI uses the Responses wire API, Claude Code the Anthropic endpoint via ANTHROPIC_BASE_URL, opencode and Pi the OpenAI-compatible one; the repo carries worked configs for all four. Set the client's context limit at or below the server's --ctx. Thinking mode is the default and uses DeepSeek's fixed sampling (temperature=1, top_p=1, min_p=0.05), ignoring client knobs, which matches the official API behaviour.

The subtle integration problem is tool calling. The model emits DSML text, but stateless clients send back normalized JSON tool calls; if the server re-rendered them even slightly differently, the byte prefix would no longer match the live KV checkpoint and every turn would re-prefill. ds4 keeps a bounded map from unguessable tool IDs to the exact sampled DSML bytes and replays those verbatim (surviving restarts via the KV files); deterministic canonicalization is only the fallback. During generation, DSML syntax is decoded greedily while argument payloads keep the request's sampling, so tool calls stay parseable without degrading long file bodies into repetition loops.

How to run it in production

Production here means a personal daily driver or a trusted-team box, not a fleet; within that envelope the operational surface is real:

  • Memory budget: weights plus KV plus scratch must fit. The 2-bit build is 81 GB; a full 1M-token context adds roughly 26 GB (the compressed indexer alone is about 22 GB), so on 128 GB configure 100-300k of context rather than the maximum. The KV-cache OOM runbook logic applies unchanged.
  • Disk KV cache on, always, for agents: agent harnesses resend the whole conversation each request and often open with a prompt around 25k tokens. --kv-disk-dir makes prefixes survive session switches and restarts; keys are the SHA1 of the rendered byte prefix, cold saves trim 32 tail tokens and align to 2048-token boundaries to dodge BPE-boundary retokenization misses.
  • Capacity model: one live graph, no cross-request batching; concurrent requests queue. Size expectations accordingly and keep one server per box.
  • Network posture: no authentication or encryption on either the HTTP server or the distributed protocol. Bind to loopback (the default; --host 0.0.0.0 is explicit), use --cors only for local browser clients, and treat any exposure as a tunnel-or-VPN problem.
  • Hygiene: the KV cache directory is disposable and contains verbatim prompts (inspectable with hexdump); wipe it on suspicion and treat it as sensitive data at rest.

How to maintain it

  • Pin the commit. The project is beta, flags change, and the README warns that model support itself is opportunistic and replaceable. Rebuild coordinator and workers from the same commit; the distributed protocol is not release-stable.
  • Re-run the gates after any upgrade: make test (official-vector comparison) and the deterministic ds4-eval four-question gate, then a spot check of your own agent workload with --trace.
  • Cache compatibility: KV checkpoints may be reused across 2-bit and 4-bit expert variants when the rendered prefix matches; pass --kv-cache-reject-different-quant if you want strict same-quant reuse. When behaviour looks odd after an upgrade, deleting the KV cache directory is the first cheap fix.
  • Know your escape hatches: --trace for full session logs when filing issues, --regrade-trace to audit evaluator changes without re-running the model, and the startup expert-cache report as the first thing to read on streaming machines.

How to scale it

Scaling is layer-splitting across machines, not replica count. Each machine loads only its layer slice from the GGUF (--layers 0:30, --layers 31:output, inclusive ranges), one process is the coordinator (tokenization, sampling, the prompt), the rest register as workers, and activations flow worker-to-worker over plain TCP:

# Machine A: coordinator, layers 0..30
./ds4 -m gguf/DeepSeek-V4-Pro-Q4K-Layers00-30.gguf --role coordinator --layers 0:30 --listen 169.254.43.68 1234
# Machine B: worker, layers 31..output (owns the output head, returns logits)
./ds4 -m gguf/DeepSeek-V4-Pro-Q4K-Layers-31-output.gguf --role worker --layers 31:output --coordinator 169.254.43.68 1234

What to expect, from the repo's two-MacBook (M5 Max, Thunderbolt 5, 0.45 ms ping) measurements:

  • Prefill accelerates: chunks pipeline through the machines assembly-line style, so two machines reached 1.38x at 9.4k prompt tokens, 1.66x at 28.7k, 1.85x at 63.8k versus a single process.
  • Generation slows about 19% (30.59 to 24.67 tokens/s on the same setup): decode is strictly autoregressive, every token pays a cross-machine hop, and no pipeline can hide it. Distributed exists for capacity (PRO Q4 across two 512 GB Mac Studios generated at 11.47 tokens/s) and prefill speed, not decode speed.
  • The link is the product: WiFi (77 ms ping) cut generation to 10.7 tokens/s and VPN (152 ms) to 3.6. Latency hurts decode directly; bandwidth pulls down long prefill. --dist-activation-bits 16 halves activation traffic and is the first knob on Ethernet.
  • Failure semantics are explicit: a disconnected worker drops out of the route and in-flight requests can fail; workers validate a rolling 64-bit token-prefix hash on every work item, so a restarted worker cannot silently accept work at the wrong position, and the coordinator replays the transcript to rebuild worker KV state.

The contrast with datacenter disaggregated inference is instructive: same physics (prefill parallelizes, decode is latency-bound), applied to Thunderbolt cables instead of InfiniBand.

Failure modes

  • Expert-cache thrash: streaming with a cache too small for the workload's expert mix collapses decode speed while prefill still looks fine (misses amortize over prefill chunks but repeat every generated token). The uniform-routing case in the simulation above is the floor. Fix: bigger cache budget, or drop to the automatic budget.
  • mlock failure under memory pressure: ds4 refuses pageable expert-cache entries and shrinks to the measured lockable size; if the startup report shows a much smaller cache than requested, other processes are squeezing you.
  • KV overcommit: a 1M context wants about 26 GB on top of 81 GB of weights; on 128 GB machines an over-generous --ctx plus a long agent session out-of-memories late, not at startup. Size the context to the machine.
  • Unexpected full re-prefill: a client that rewrites history (reordered JSON tool arguments, changed whitespace) breaks the rendered-prefix match. Exact DSML replay exists to prevent this; disabling it (--disable-exact-dsml-tool-replay) invites the failure back.
  • Distributed route incomplete: after a worker crash, calls fail until a compatible worker re-registers; mismatched builds between coordinator and workers surface as protocol errors. Same commit everywhere.
  • Wrong GGUF: arbitrary DeepSeek GGUFs will not load; the engine expects its own tensor layout, quant mix, and metadata. This is by design, not a bug to work around.
  • macOS CPU path: current macOS versions can kernel-panic on the CPU inference path (a macOS virtual-memory bug per the README); treat the CPU build as a Linux diagnostics tool.
  • MTP expectations: the speculative decoding path (--mtp) is experimental, greedy-only in practice, and currently yields at most a slight speedup; do not plan capacity around it (speculative decoding).

References

  • DwarfStar repository (antirez/ds4): https://github.com/antirez/ds4
  • Published GGUF weights: https://huggingface.co/antirez/deepseek-v4-gguf
  • DeepSeek-V4-Flash model card (284B/13B, 1M context, FP4+FP8 mixed): https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash
  • DeepSeek-V4 technical report: https://arxiv.org/abs/2606.19348
  • "DwarfStar-4" write-up (engineerprompt.ai): https://engineerprompt.ai/writing/dwarfstar-4/
  • llama.cpp / GGML (quant formats and groundwork ds4 credits): https://github.com/ggml-org/llama.cpp
  • Refusal in Language Models Is Mediated by a Single Direction (steering basis): https://arxiv.org/abs/2406.11717

Related: Running local coding agents · Serving open-weight models · Quantization for inference · LLM inference efficiency · DGX Spark · KV cache token eviction · KV-cache OOM runbook · Disaggregated inference · Speculative decoding · Engine weight loading · vLLM DeepSeek-V3.2 cookbook · Consumer-GPU vLLM cookbook · Inference serving · Glossary