
DSpark speculative decoding (DeepSeek)

Scope: DeepSeek's DSpark drafter and scheduler (DeepSpec release, June 2026). DSpark combines a semi-autoregressive draft model (a heavy parallel backbone plus a lightweight Markov head) that fixes the suffix decay of parallel drafters with confidence-scheduled verification that sizes each request's verify budget against live engine load. Deployed in DeepSeek-V4 serving in place of the MTP-1 baseline, it speeds up per-user generation by 57-85% at matched throughput with the target model unchanged (no retraining, no quantization, output distribution preserved exactly); draft checkpoints and training code are open. The mechanics of speculative decoding itself (accept rule, losslessness, block efficiency) are covered in Speculative Decoding; measuring acceptance across workloads is covered in Evaluating Speculative Decoding.

What it is

DSpark is DeepSeek's speculative decoding framework, published June 2026 as a paper inside the DeepSpec repository together with training code and checkpoints.1 It attacks the two bottlenecks that appear once a drafter proposes long token blocks: parallel drafters lose acceptance rapidly at later positions because positions are predicted independently (suffix decay), and verifying every proposed token indiscriminately wastes target-model batch capacity under high-concurrency load.1 Two components address them:

  • Semi-autoregressive generation: a parallel backbone (DFlash-style) produces hidden states and base logits for the whole block in one forward pass, and a lightweight sequential head (a low-rank Markov transition by default) re-introduces intra-block dependency at negligible latency cost.1
  • Confidence-scheduled verification: a calibrated confidence head estimates, per position, the probability the draft prefix survives verification, and a hardware-aware scheduler picks each request's verification length to maximise expected system throughput given a profiled capacity curve of the engine.1

The target model is untouched: no retraining, no quantization, and the standard rejection-sampling rule preserves its output distribution exactly. The Hugging Face releases DeepSeek-V4-Pro-DSpark and DeepSeek-V4-Flash-DSpark state this explicitly: "not a new model. It is the same checkpoint with an additional speculative decoding module attached."3 The draft module itself is a trained add-on (the target stays frozen during its training).1

In DeepSeek's production serving, DSpark replaced the previous MTP-1 drafter two weeks after the DeepSeek-V4 preview release. At matched aggregate throughput it accelerates per-user generation by 60-85% on V4-Flash and 57-78% on V4-Pro, and it sustains interactivity tiers (120 tok/s/user on Flash, 50 on Pro) where the single-token baseline's capacity collapses.1

Why use it

Plain autoregressive decode is one full target forward per token, memory-bandwidth-bound at low batch sizes (see how the KV cache speeds up inference). Speculative decoding's per-token latency decomposes as L = (T_draft + T_verify) / τ, for accepted tokens per cycle τ. That leaves exactly three levers: draft faster (lower T_draft), draft better (higher τ), and verify smarter (lower effective T_verify).1 Prior drafter families each optimise one lever at another's expense:

  • Autoregressive drafters (Eagle3) condition each draft position on previously sampled tokens, so acceptance holds up deep into the block, but T_draft grows linearly with block size γ. That forces short blocks and shallow draft networks (Eagle3 runs 1 transformer layer in the paper's setup), which caps predictive capacity where it matters most: position 1, whose rejection invalidates the entire block.1
  • Parallel drafters (DFlash) emit all γ positions in a single forward pass, so T_draft is nearly independent of block size and the drafter can afford depth (5 layers in the same setup). The cost is independence: each position marginalises over all possible predecessors instead of conditioning on the one actually sampled, producing incoherent blocks ("of problem", "no course" when the context admits "of course" and "no problem"). This multi-modal collision shows up as suffix decay.1
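
The latency decomposition and the two draft-cost regimes in the bullets above can be sketched numerically. This is an illustration only: every timing and accepted length below is a made-up placeholder (not a measurement from the paper), and the linear and flat draft-cost models are simplifying assumptions that mirror the two bullets.

```python
# Illustrative only: all timings (ms) and accepted lengths are hypothetical placeholders.

def per_token_latency(t_draft, t_verify, tau):
    """L = (T_draft + T_verify) / tau, with tau = accepted tokens per cycle."""
    return (t_draft + t_verify) / tau

def autoregressive_draft_cost(gamma, t_step=1.0):
    """Assumed model: one sequential draft step per block position."""
    return gamma * t_step

def parallel_draft_cost(gamma, t_forward=3.0):
    """Assumed model: one deeper forward pass whose cost ignores block size."""
    return t_forward

for gamma in (4, 8, 16):
    ar = autoregressive_draft_cost(gamma)
    par = parallel_draft_cost(gamma)
    print(f"gamma={gamma:2d}  T_draft autoregressive={ar:5.1f} ms  parallel={par:4.1f} ms")

# The three levers, each changed in isolation from a hypothetical baseline.
base = per_token_latency(t_draft=8.0, t_verify=20.0, tau=3.0)
faster_draft = per_token_latency(t_draft=3.0, t_verify=20.0, tau=3.0)
better_draft = per_token_latency(t_draft=8.0, t_verify=20.0, tau=4.0)
smarter_verify = per_token_latency(t_draft=8.0, t_verify=15.0, tau=3.0)
print(f"baseline {base:.2f}  faster draft {faster_draft:.2f}  "
      f"better draft {better_draft:.2f}  smarter verify {smarter_verify:.2f} ms/token")
```

With these placeholder numbers each lever lowers per-token latency on its own; the point of the section is that the drafter families above cannot pull all three at once.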

The paper's position-wise conditional acceptance measurements (Qwen3-4B target) make the trade-off concrete: DFlash starts higher than Eagle3 at position 1 (0.88 vs 0.81 on math, 0.72 vs 0.53 on chat) thanks to its depth, but decays along the block (0.87 to 0.78 on code, 0.72 to 0.63 on chat), while Eagle3 holds steady or climbs (0.53 to 0.74 on chat). DSpark inherits the deep backbone's position-1 capacity (0.93 on math) and the sequential head removes the decay.1
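
Why the shape of the position-wise curve matters can be seen by turning conditional acceptances into an expected accepted length: under prefix verification, position k only counts if every earlier position survived. The sketch below uses the chat endpoints quoted above, but the linear interpolation between them and the block size of 7 are assumptions, so the totals are illustrative and are not the paper's accepted lengths.

```python
from itertools import accumulate
from operator import mul

def expected_accepted(cond_accept):
    """Expected accepted draft tokens under prefix verification: sum_k prod_{i<=k} alpha_i."""
    return sum(accumulate(cond_accept, mul))

def linear_profile(start, end, gamma):
    """Assumed shape: conditional acceptance interpolated linearly across the block."""
    step = (end - start) / (gamma - 1)
    return [start + step * k for k in range(gamma)]

GAMMA = 7  # assumed block size for the illustration
profiles = {
    "decaying": linear_profile(0.72, 0.63, GAMMA),   # strong position 1, then decay
    "climbing": linear_profile(0.53, 0.74, GAMMA),   # weak position 1, then recovery
    "flat": [0.72] * GAMMA,                          # strong position 1, no decay
}
results = {name: expected_accepted(p) for name, p in profiles.items()}
for name, value in results.items():
    print(f"{name:9s} expected accepted draft tokens = {value:.2f}")
```

Because every later term is multiplied by the position-1 acceptance, a weak first position costs more than a late recovery can win back, and removing the decay from a strong start is worth more than either baseline shape.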

The third lever is verification. Acceptance varies strongly by domain (accepted length 5.57 on math vs 3.49 on chat for the same Qwen3-4B drafter), and the cost of verifying an extra token depends on engine load: nearly free when idle, batch capacity stolen from other requests when saturated. A fixed verification length is therefore wrong in both directions, which is what the confidence-scheduled verification corrects.1

Offline, DSpark improves macro-average accepted length over Eagle3 by 30.9%, 26.7%, and 30.0% on Qwen3-4B/8B/14B targets, and over DFlash by 16.3%, 18.4%, and 18.3%; the gains carry to Gemma4-12B.1

When to use it (and when not)

The general go/no-go rules for speculative decoding (acceptance measured on your traffic, drafter speed margin, load regime) are covered in Speculative Decoding and are unchanged. DSpark-specific considerations:

  • High-concurrency serving with interactivity SLAs is the design point. The scheduler's value comes from load variance: it extends verification budgets when the engine has slack and prunes them as concurrency saturates. A single-user or fixed-low-load deployment gets the semi-autoregressive acceptance gains but little from the scheduler (under light load, verifying extra tokens is nearly free anyway).1
  • Mixed-domain traffic (chat alongside code and math) benefits most from confidence scheduling, because per-request acceptance varies widely and a static block length wastes verify compute on open-ended chat.1
  • Engine integration is not free. The scheduler needs a profiled steps-per-second capacity table and, for the full production design, kernels that handle variable-length verification within a batch plus an asynchronous scheduling path compatible with CUDA-graph replay. DeepSeek implemented this in its internal V4 engine; as of early July 2026 DeepSpec ships training and evaluation code, and the V4-DSpark Hugging Face repos carry a minimal inference example, not a drop-in for vLLM/SGLang/TensorRT-LLM. Verify current engine support before planning a deployment around it.23
  • Domain-specific or thinking-mode serving needs a retrained draft. The released research checkpoints were trained on target outputs generated in non-thinking mode; the README says to fine-tune the draft again for domain-specific use, especially if the target runs in thinking mode.2

Architecture

One decoding cycle: the target's last committed token becomes the anchor; the parallel backbone produces the whole draft block and per-position hidden states in one forward; the sequential head samples the block left to right, adding a transition bias so each token conditions on its sampled predecessors; the confidence head scores every position; the scheduler keeps only the prefix worth verifying under current load; the target verifies that prefix in a single pass and commits the accepted tokens plus a corrected or bonus token.1

flowchart LR
    A["anchor token from previous round"] --> PB["parallel backbone: one forward over the whole block"]
    PB --> H["hidden states h1..hg and base logits U1..Ug"]
    H --> SH["sequential head: sample left to right with transition bias"]
    H --> CH["confidence head: calibrated survival estimates c1..cg"]
    SH --> CH
    CH --> SCH{"hardware-aware prefix scheduler: maximize tau * SPS(B)"}
    SPS["profiled SPS(B) capacity table"] --> SCH
    SCH -->|"keep prefix"| VER["target model verifies the scheduled prefix in one pass"]
    SCH -->|"drop tail"| X["low-confidence suffix: pruned, never verified"]
    VER --> OUT["commit accepted prefix plus corrected or bonus token"]
    OUT --> A

Semi-autoregressive generation

The backbone is DFlash: target hidden states from a set of layers are projected into the draft space and injected into every draft layer's keys and values, and the block positions attend bidirectionally to each other and that context. DSpark makes one modification: the anchor token itself becomes the first prediction position, so γ input tokens (anchor plus γ-1 masks) yield γ draft logits, which saves compute at equal quality. The draft shares the target's frozen embedding layer and LM head.14

The sequential stage turns the backbone's independent base logits U_k into a causal block distribution by adding a prefix-dependent bias inside the softmax: p_k(v | x_0, x_<k) = softmax(U_k(v) + B_k(x_0, x_<k, v)). Because each conditional stays an exact, locally normalised softmax, per-token probabilities remain available for the min(1, p/q) accept rule. That is the property globally normalised alternatives lose: a CRF-style head's partition function prevents exact per-token probabilities, and CTC drafting is restricted to greedy verification.1 Two instantiations:

  • Markov head (default): the bias depends only on the immediately preceding token, a V x V transition matrix approximated by a low-rank factorisation B = W1 @ W2 with rank 256, cheap enough that the sequential sampling loop adds almost nothing to draft latency.1
  • RNN head: a single gated recurrent update accumulates the full within-block prefix. It adds only marginal accepted-length gains, mostly at long blocks, so the Markov head stays the default.1

The block below reproduces the mechanism at toy scale: a parallel drafter with exact per-position marginals still puts 25% of its mass on each incoherent block, halving position-2 acceptance, and a first-order transition bias on the same base logits removes the decay while keeping every conditional an exact softmax. It also demonstrates the paper's low-rank claim (two modes need only a rank-2 transition) and the adversarial baseline: losslessness never breaks even under the incoherent drafter, only efficiency does.

# Runnable on system python3 (numpy). Why parallel drafters suffer suffix decay and how a
# Markov head fixes it. Target emits two-token phrases: "of course" (0.5) and "no problem"
# (0.5). A position-independent drafter learns the exact per-position marginals yet proposes
# incoherent blocks; a first-order transition bias on the same base logits restores the
# intra-block dependency. Acceptance per position is sum min(p, q) = 1 - TV(p, q).
import numpy as np

VOCAB = ["of", "no", "course", "problem"]
i_of, i_no, i_course, i_problem = 0, 1, 2, 3

def tv_accept(p, q):
    """Per-step acceptance probability of speculative sampling: 1 - TV(p, q)."""
    return np.minimum(p, q).sum()

# Target conditionals: two coherent modes, equal mass.
p1 = np.array([0.5, 0.5, 0.0, 0.0])                  # x1: "of" or "no"
p2_given = {i_of: np.array([0.0, 0.0, 1.0, 0.0]),    # "of" -> "course"
            i_no: np.array([0.0, 0.0, 0.0, 1.0])}    # "no" -> "problem"

# Parallel drafter: exact per-position marginals, positions independent.
q1 = np.array([0.5, 0.5, 0.0, 0.0])
q2 = np.array([0.0, 0.0, 0.5, 0.5])

# 1. Multi-modal collision: the independent joint puts 25% mass on EACH incoherent block.
joint = np.outer(q1, q2)
assert abs(joint[i_of, i_problem] - 0.25) < 1e-12    # "of problem": target never emits it
assert abs(joint[i_no, i_course] - 0.25) < 1e-12     # "no course"

# 2. Conditional acceptance: position 1 is perfect, position 2 collapses to 0.5 (suffix decay).
a1 = tv_accept(p1, q1)
a2_parallel = {x1: tv_accept(p2_given[x1], q2) for x1 in (i_of, i_no)}
assert abs(a1 - 1.0) < 1e-12
assert all(abs(a - 0.5) < 1e-12 for a in a2_parallel.values())

# 3. Markov head: base logits U2 from the parallel backbone plus transition bias B[x1].
#    q2(v | x1) = softmax(U2 + B[x1]) stays an exact per-token softmax, so the min(1, p/q)
#    rule still applies (globally normalized CRF-style heads cannot provide this).
U2 = np.log(q2 + 1e-12)
B = np.zeros((4, 4))
B[i_of, i_course] = 12.0
B[i_no, i_problem] = 12.0                            # learned bigram preference

def q2_markov(x1):
    z = U2 + B[x1]
    e = np.exp(z - z.max())
    return e / e.sum()

for x1 in (i_of, i_no):
    qm = q2_markov(x1)
    assert abs(qm.sum() - 1.0) < 1e-12               # exact softmax normalization
    assert tv_accept(p2_given[x1], qm) > 0.99        # suffix decay gone

# 4. Expected accepted draft tokens per block is the prefix survival sum a1 + a1*a2.
e_parallel = a1 + a1 * a2_parallel[i_of]
e_markov = a1 + a1 * tv_accept(p2_given[i_of], q2_markov(i_of))
assert abs(e_parallel - 1.5) < 1e-12 and e_markov > 1.99

# 5. Low-rank factorization (paper: B = W1 @ W2 with r=256 for V ~ 1e5): rank 2 suffices here.
u, s, vt = np.linalg.svd(B)
B_lr = (u[:, :2] * s[:2]) @ vt[:2]
assert np.allclose(B_lr, B, atol=1e-9)               # two modes -> rank-2 transition matrix

# 6. Losslessness is never at stake, only efficiency: rejection sampling commits tokens
#    distributed exactly as the target conditional even under the incoherent parallel drafter.
rng = np.random.default_rng(0)
N = 400_000
x2 = rng.choice(4, size=N, p=q2)
accept = rng.random(N) < np.minimum(1.0, p2_given[i_of][x2] / np.maximum(q2[x2], 1e-12))
resid = np.maximum(p2_given[i_of] - q2, 0.0)
resid /= resid.sum()
commits = np.where(accept, x2, rng.choice(4, size=N, p=resid))
emp = np.bincount(commits, minlength=4) / N
assert np.max(np.abs(emp - p2_given[i_of])) < 0.01   # committed dist == target conditional
assert abs(accept.mean() - 0.5) < 0.01               # but half the draft compute is wasted

print("V1 semi-autoregressive OK:",
      f"pos-2 acceptance parallel={a2_parallel[i_of]:.2f} -> "
      f"markov={tv_accept(p2_given[i_of], q2_markov(i_of)):.4f};",
      f"E[accepted] {e_parallel:.2f} -> {e_markov:.4f};",
      f"committed maxdev={np.max(np.abs(emp - p2_given[i_of])):.4f}")

Running this prints V1 semi-autoregressive OK: pos-2 acceptance parallel=0.50 -> markov=1.0000; E[accepted] 1.50 -> 2.0000; committed maxdev=0.0000: the same base logits go from 1.5 to 2.0 expected accepted tokens once the transition bias conditions position 2 on the sampled position 1, and the committed distribution matches the target either way.
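
The toy script uses a dense 4 x 4 bias. At real vocabulary sizes the transition matrix is only ever touched through its low-rank factors, one row at a time. The following is a minimal sketch of that sampling loop under stated assumptions: the weights are random stand-ins, the toy sizes are arbitrary, and the function name sample_block is hypothetical rather than anything from DeepSpec.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sample_block(U, W1, W2, anchor, rng):
    """Sample a block left to right: q_k = softmax(U[k] + W1[x_prev] @ W2).

    U: (gamma, V) base logits from the parallel backbone (random stand-ins here).
    W1: (V, r) and W2: (r, V) are the low-rank factors of the V x V transition bias.
    Returns the sampled tokens and the exact per-token draft probabilities q_k(x_k).
    """
    prev, tokens, probs = anchor, [], []
    for k in range(U.shape[0]):
        bias = W1[prev] @ W2          # one row of B at cost O(r * V); B is never materialised
        q = softmax(U[k] + bias)
        x = int(rng.choice(len(q), p=q))
        tokens.append(x)
        probs.append(float(q[x]))     # kept for the min(1, p/q) accept rule
        prev = x
    return tokens, probs

rng = np.random.default_rng(0)
V, r, gamma = 1000, 16, 5             # toy sizes, chosen only so the sketch runs quickly
U = rng.normal(size=(gamma, V))
W1 = rng.normal(scale=0.1, size=(V, r))
W2 = rng.normal(scale=0.1, size=(r, V))
tokens, probs = sample_block(U, W1, W2, anchor=0, rng=rng)
print("sampled block:", tokens)

# Parameter count at rank 256; V ~ 1e5 is the round number used in the toy script's comment.
V_big, r_big = 100_000, 256
full, low_rank = V_big * V_big, 2 * V_big * r_big
print(f"full V x V: {full:.2e} entries; low-rank factors: {low_rank:.2e} "
      f"({full / low_rank:.0f}x fewer)")
```

The loop is sequential, but each step is a single embedding lookup and an r x V product, which is consistent with the text's claim that the head adds little to draft latency.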

Confidence-scheduled verification

The confidence head is a linear projection plus sigmoid over the backbone hidden state and the Markov embedding of the previous draft token: c_k = sigmoid(w^T [h_k; W1[x_{k-1}]]). It is trained to predict the analytical per-step acceptance rate c*_k = 1 - TV(p_draft, p_target)/2, where TV has to be read as the full L1 distance sum_v |p_draft(v) - p_target(v)| for this to equal the acceptance rate sum_v min(p_draft(v), p_target(v)). So c_k estimates the conditional probability that position k survives verification given the prefix was accepted; the joint prefix-survival probability is the running product of the c_i.1
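
A minimal sketch of the confidence head and its training label, assuming random stand-in weights and toy dimensions. The function names are hypothetical; only the formula for c_k and the running product come from the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def confidence_scores(h, W1, w, anchor, draft_tokens):
    """c_k = sigmoid(w^T [h_k; W1[x_{k-1}]]) for k = 1..gamma.

    h: (gamma, d) backbone hidden states; W1: (V, r) Markov embedding table;
    w: (d + r,) projection; x_0 is the anchor, x_1..x_gamma the sampled draft tokens.
    """
    prev = [anchor] + list(draft_tokens[:-1])          # x_{k-1} for each position k
    feats = np.concatenate([h, W1[prev]], axis=1)      # (gamma, d + r)
    return sigmoid(feats @ w)

def acceptance_label(p_draft, p_target):
    """Analytic per-step acceptance sum_v min(p, q), the supervision target for c_k."""
    return float(np.minimum(p_draft, p_target).sum())

rng = np.random.default_rng(0)
gamma, d, r, V = 5, 8, 4, 50                           # toy sizes
h = rng.normal(size=(gamma, d))
W1 = rng.normal(size=(V, r))
w = rng.normal(size=d + r)
draft_tokens = [int(t) for t in rng.integers(0, V, size=gamma)]

c = confidence_scores(h, W1, w, anchor=0, draft_tokens=draft_tokens)
survival = np.cumprod(c)                               # joint prefix-survival a_j
print("c_k:", np.round(c, 3))
print("prefix survival:", np.round(survival, 3))

# Label for a two-token vocabulary: draft (0.5, 0.5) against target (0.7, 0.3).
p_t, p_d = np.array([0.7, 0.3]), np.array([0.5, 0.5])
label = acceptance_label(p_d, p_t)
print("acceptance label:", label)
```

The label equals one minus half the L1 distance between the two distributions, and the survival vector is non-increasing by construction, which is the property the scheduler's greedy pass relies on.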

Raw neural confidences are overconfident (ECE 3-8% in the paper's reliability diagrams, ROC-AUC 0.81-0.90), and the scheduler needs absolute probabilities, not just correct rankings. Sequential Temperature Scaling (STS) calibrates the cumulative products left to right, one temperature scalar per position found by 1D grid search minimising expected calibration error on held-out data; this brings average ECE to about 1% without changing the head's rankings (temperature scaling is order-preserving).1
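
The left-to-right calibration described above can be sketched as follows. This is an interpretation on synthetic data, not the paper's procedure: the grid range, the 10-bin ECE, and the way overconfidence is simulated (logits doubled, so the true temperature is 2) are all assumptions, and fit_sts is a hypothetical name.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ece(pred, label, bins=10):
    """Expected calibration error: bin-weighted |mean prediction - empirical rate|."""
    idx = np.minimum((pred * bins).astype(int), bins - 1)
    total = 0.0
    for b in range(bins):
        m = idx == b
        if m.any():
            total += m.mean() * abs(pred[m].mean() - label[m].mean())
    return total

def fit_sts(logits, survived, grid=np.linspace(0.5, 4.0, 36)):
    """Fit one temperature per position, left to right, on held-out data.

    logits: (N, gamma) raw confidence logits; survived: (N, gamma) 0/1 labels,
    survived[n, k] = 1 if the draft prefix through position k was accepted.
    """
    n, gamma = logits.shape
    temps, prefix = [], np.ones(n)
    for k in range(gamma):
        # Earlier temperatures stay frozen; only T_k is searched at this position.
        scores = [ece(prefix * sigmoid(logits[:, k] / t), survived[:, k]) for t in grid]
        t_best = float(grid[int(np.argmin(scores))])
        temps.append(t_best)
        prefix = prefix * sigmoid(logits[:, k] / t_best)
    return temps

# Synthetic held-out set: true conditional acceptances, then overconfident logits.
rng = np.random.default_rng(0)
N, gamma = 20_000, 4
p_true = rng.uniform(0.5, 0.95, size=(N, gamma))
accepted = (rng.random((N, gamma)) < p_true).astype(float)
survived = np.cumprod(accepted, axis=1)
logits = 2.0 * np.log(p_true / (1.0 - p_true))

temps = fit_sts(logits, survived)
raw = np.cumprod(sigmoid(logits), axis=1)
cal = np.cumprod(sigmoid(logits / np.array(temps)), axis=1)
ece_raw = [ece(raw[:, k], survived[:, k]) for k in range(gamma)]
ece_cal = [ece(cal[:, k], survived[:, k]) for k in range(gamma)]
print("temperatures:", [round(t, 2) for t in temps])
print("ECE raw:", [round(e, 3) for e in ece_raw])
print("ECE calibrated:", [round(e, 3) for e in ece_cal])
```

Dividing a logit by a positive temperature is monotone, so each position's ranking of requests is unchanged; only the absolute probabilities that the scheduler consumes move.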

The hardware-aware prefix scheduler turns verification-length selection into throughput maximisation. For R active requests with prefix-survival probabilities a_{r,j}, verification batch size B = sum(1 + l_r) and expected accepted tokens τ = sum(1 + sum a_{r,j}), it maximises Θ = τ * SPS(B), where SPS(B) is the engine's steps-per-second at batch size B, profiled once at engine initialisation into a lookup table. Because a_{r,j} is non-increasing in j, a greedy pass over all candidate tokens sorted by survival probability respects prefix structure, and each admission updates Θ with an O(1) table lookup. The greedy stops the moment Θ fails to improve.1

That early stop is not an optimisation detail; it is the losslessness guard. Admission decisions must be non-anticipating (independent of tokens not yet processed), and the Markov-conditioned confidence for position k+1 depends on the realised token at k, so a retrospective search that peeks ahead leaks the token's value into its own admission decision. The paper's Appendix A constructs the exact failure. The block below implements Algorithm 1, reproduces that counterexample numerically (the retrospective policy shifts a 0.7/0.3 target to 0.85/0.15), and demonstrates the load-adaptive budget mechanism behind the production results.

# Runnable on system python3 (numpy). DSpark's hardware-aware prefix scheduler (Algorithm 1)
# and why its early-stopping break is load-bearing: reproducing the paper's Appendix A
# counterexample shows a retrospective global search skews the output distribution
# (0.85/0.15 instead of the target 0.7/0.3), while the causal variant stays lossless.
import numpy as np

def schedule(conf, sps):
    """Algorithm 1: greedy admission of draft tokens sorted by prefix survival probability.

    conf: per-request confidence sequences c_{r,1..gamma}; sps: profiled SPS(B) step-rate
    table (steps/s at verification batch size B tokens). Returns verification lengths l_r.
    """
    R = len(conf)
    surv = [np.cumprod(c) for c in conf]                    # a_{r,j} = prod_{i<=j} c_{r,i}
    cand = sorted(((a[j], r, j) for r, a in enumerate(surv) for j in range(len(a))
                   if a[j] > 0), key=lambda t: (-t[0], t[1], t[2]))  # ties keep prefix order
    lens, best_lens = [0] * R, [0] * R
    B, tau = R, float(R)                                    # every request commits >= 1 token
    theta_best = R * sps[R]
    for a, r, j in cand:
        lens[r] = j + 1
        B += 1
        tau += a
        theta = tau * sps[B]                                # expected throughput tau * SPS(B)
        if theta > theta_best:
            theta_best, best_lens = theta, lens.copy()
        else:
            break                                           # the causal barrier
    return best_lens

# 1. Appendix A counterexample, exact numbers: one request, gamma=2, a1 = 0.8,
#    SPS(1)=1.0, SPS(2)=0.5, SPS(3)=0.45.
SPS = {1: 1.0, 2: 0.5, 3: 0.45}
assert abs(1 * SPS[1] - 1.0) < 1e-12                        # Theta_0 = 1.0
assert abs((1 + 0.8) * SPS[2] - 0.9) < 1e-12                # Theta_1 = 0.9 < Theta_0
assert abs((1 + 0.8 + 0.8 * 0.9) * SPS[3] - 1.134) < 1e-12  # Theta_2 if x1 lands lucky
# The causal scheduler halts at l=0 (Theta_1 < Theta_0) before c2 is ever observable.
assert schedule([[0.8, 0.9]], SPS) == [0]

# 2. The retrospective policy peeks past the break: it admits x1 iff the realized token
#    leads to high c2. p_t=(0.7, 0.3), p_d=(0.5, 0.5); x1=A -> c2=0.9 -> Theta_2=1.134 -> l=2;
#    x1=B -> c2=0 -> Theta_2=0.81 -> l=0. Admission now depends on the value of x1 itself.
p_t, p_d = np.array([0.7, 0.3]), np.array([0.5, 0.5])
rng = np.random.default_rng(1)
N = 400_000
x1 = rng.choice(2, size=N, p=p_d)
admitted = x1 == 0                                          # only A reaches Theta_2 > Theta_0
accept = rng.random(N) < np.minimum(1.0, p_t[x1] / p_d[x1])  # min(1, 0.7/0.5) = 1 for A
fresh = rng.choice(2, size=N, p=p_t)                        # un-admitted: fresh target sample
committed = np.where(admitted & accept, x1, fresh)
p_A = (committed == 0).mean()
assert abs(p_A - 0.85) < 0.005                              # biased: 0.85 != 0.70
assert abs(p_A - 0.70) > 0.10                               # losslessness provably broken
# Causal policy on the same problem: l=0 for every request, so the target samples fresh.
p_A_causal = (rng.choice(2, size=N, p=p_t) == 0).mean()
assert abs(p_A_causal - 0.70) < 0.005

# 3. Load adaptivity (the mechanism behind the paper's Figure 8): identical per-request
#    confidences, saturating capacity curve; the granted budget shrinks as concurrency grows.
gamma, C, B_half = 6, 8000.0, 96.0
sps_curve = {b: C / (B_half + b) for b in range(1, 4096)}
conf = lambda R: [[0.9, 0.85, 0.8, 0.7, 0.6, 0.5]] * R
avg = {R: sum(schedule(conf(R), sps_curve)) / R for R in (4, 32, 256)}
assert avg[4] > avg[32] > avg[256]                          # budget decreases with load
assert avg[4] >= 5.0 and avg[256] <= 1.0                    # long when idle, short when loaded

# 4. Degenerate inputs: zero confidence -> zero budget; lengths always within [0, gamma].
assert schedule([[0.0] * gamma] * 8, sps_curve) == [0] * 8
assert all(0 <= l <= gamma for l in schedule(conf(64), sps_curve))

print("V2 scheduler OK:",
      f"retrospective P(A)={p_A:.3f} (target 0.700, BIASED), causal P(A)={p_A_causal:.3f};",
      f"avg budget R=4:{avg[4]:.2f} R=32:{avg[32]:.2f} R=256:{avg[256]:.2f}")

Running this prints V2 scheduler OK: retrospective P(A)=0.851 (target 0.700, BIASED), causal P(A)=0.700; avg budget R=4:5.00 R=32:3.00 R=256:1.00: the retrospective search provably breaks the exact-distribution guarantee, the causal one preserves it, and the same scheduler grants a 5-token budget to 4 concurrent requests but only 1 token at 256.

Training

Three position-weighted losses train the drafter and confidence head jointly while the target, its embeddings, and its LM head stay frozen: cross-entropy on the ground-truth next token, a total-variation distribution-matching loss (a direct proxy for acceptance, weight 0.9 vs 0.1 for CE), and binary cross-entropy on the confidence head against the analytic acceptance label. Position weights w_k = exp(-(k-1)/γ) emphasise early block positions, which dominate expected accepted length under prefix verification.1
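
A forward-only sketch of how the three terms and the position weights combine for one draft block. Assumptions are labelled in the code: the total-variation term is written as half the L1 distance, the weighted sum is normalised by the total weight, and the inputs are random stand-ins. The function names are hypothetical and this is not DeepSpec's training code.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def position_weights(gamma):
    """w_k = exp(-(k-1)/gamma) for k = 1..gamma."""
    return np.exp(-np.arange(gamma) / gamma)

def block_loss(draft_logits, target_probs, gt_tokens, conf, w_ce=0.1, w_tv=0.9, w_conf=1.0):
    """Position-weighted CE + TV + confidence BCE for one draft block.

    draft_logits: (gamma, V); target_probs: (gamma, V) from the frozen target;
    gt_tokens: (gamma,) ground-truth next tokens; conf: (gamma,) confidence-head outputs.
    """
    gamma = draft_logits.shape[0]
    q = softmax(draft_logits)
    ce = -np.log(q[np.arange(gamma), gt_tokens] + 1e-12)
    tv = 0.5 * np.abs(q - target_probs).sum(axis=1)     # assumed normalisation: half L1
    c_star = np.minimum(q, target_probs).sum(axis=1)    # analytic acceptance label
    bce = -(c_star * np.log(conf + 1e-12) + (1.0 - c_star) * np.log(1.0 - conf + 1e-12))
    w = position_weights(gamma)
    per_pos = w_ce * ce + w_tv * tv + w_conf * bce
    return float((w * per_pos).sum() / w.sum())         # assumed: normalise by total weight

rng = np.random.default_rng(0)
gamma, V = 7, 20
target_probs = softmax(rng.normal(size=(gamma, V)))
gt_tokens = target_probs.argmax(axis=1)
conf = np.full(gamma, 0.8)

loss_random = block_loss(rng.normal(size=(gamma, V)), target_probs, gt_tokens, conf)
loss_matched = block_loss(np.log(target_probs), target_probs, gt_tokens, conf)
print(f"random draft loss={loss_random:.3f}  draft matching target loss={loss_matched:.3f}")
```

A draft that reproduces the target distribution drives the TV term to zero and the acceptance label to one, which is why the TV term works as a direct proxy for acceptance.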

How to use it

With DeepSeek-V4: the DeepSeek-V4-Pro-DSpark and DeepSeek-V4-Flash-DSpark Hugging Face repos are the production V4 checkpoints with the DSpark module attached; each carries a minimal inference example in its inference/ folder.3 The production configuration is a 3-MoE-layer backbone with manifold-constrained hyper-connections and sliding-window attention of 128, a Markov sequential head, maximum block size γ = 5 (DSpark-5), and an STS-calibrated confidence head.1

Research checkpoints: DeepSpec releases the twelve drafts behind the paper's Table 1, one per algorithm and target, all trained on open-perfectblend prompts with answers regenerated by the target in non-thinking mode:2

| Algorithm | Targets covered | Example checkpoint |
| --- | --- | --- |
| DSpark | Qwen3-4B/8B/14B, Gemma4-12B-it | deepseek-ai/dspark_qwen3_4b_block7 |
| DFlash | Qwen3-4B/8B/14B, Gemma4-12B-it | deepseek-ai/dflash_qwen3_4b_block7 |
| Eagle3 | Qwen3-4B/8B/14B, Gemma4-12B-it | deepseek-ai/eagle3_qwen3_4b_ttt7 |

# Reference template (needs the DeepSpec repo, its requirements, and GPUs). Not executed here.
git clone https://github.com/deepseek-ai/DeepSpec && cd DeepSpec
python -m pip install -r requirements.txt

# Evaluate a released draft checkpoint against its target on the bundled benchmarks
# (gsm8k, math500, aime25, humaneval, mbpp, livecodebench, mt-bench, alpaca, arena-hard-v2):
# set target_name_or_path (e.g. Qwen/Qwen3-4B) and draft_name_or_path
# (e.g. deepseek-ai/dspark_qwen3_4b_block7) as described in the script header.
bash scripts/eval/eval.sh

The README is explicit that cited comparisons are only meaningful under the repo's training settings; align your setup before quoting numbers against these checkpoints.2

How to develop with it

DeepSpec is a three-stage pipeline; each stage feeds the next:2

  1. Data preparation: download and split prompts, regenerate answers with the target model behind an inference engine, and build the target cache. Budget storage first: the cache is about 38 TB for the default Qwen3-4B setting.
  2. Training: bash scripts/train/train.sh launches one worker per visible GPU (default assumes a single 8-GPU node); select algorithm and target by pointing config_path at a config under config/ (e.g. config/dspark/dspark_qwen3_4b.py). The paper's checkpoints trained 10 epochs. Checkpoints land in ~/checkpoints/<project>/<exp>/step_*.
  3. Evaluation: bash scripts/eval/eval.sh measures accepted length on the bundled benchmarks.

Design guidance from the paper's ablations (Qwen3-4B, fixed protocol):1

  • Depth: accepted length grows monotonically with drafter layers, steepest from 1 to 2; a 2-layer DSpark already beats the 5-layer DFlash baseline across all domains. Sequential modelling buys more than stacking parallel layers.
  • Block size: DSpark's advantage over DFlash widens with block length, from about 16%/15%/18% (math/code/chat) at γ = 7 to 30%/26%/22% at γ = 15, because the suffix decay it removes compounds with depth into the block.
  • Head choice: the RNN head adds only marginal gains over the Markov head, mostly at long blocks, and deploys less cleanly; default to Markov.
  • Overhead: the sequential sampling loop is negligible at serving batch sizes; scaling draft length from 4 to 16 added 0.2-1.3% to full-round latency at batch 128 while accepted length improved by up to 30%.

At production scale, training the draft against a large frozen target yields two infrastructure lessons. First, communicate hidden states, not full-vocabulary logits, between target and draft workers: the LM head projection is recomputed locally only at sampled positions, dropping per-token communication to O(hidden-dim). Second, pack fixed-count anchor blocks with token-level attention indices so draft compute decouples from the target's context length.1
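
The communication saving is simple arithmetic. The dimensions below are hypothetical round numbers, not DeepSeek-V4's, and 2 bytes per element assumes a 16-bit transfer format.

```python
def bytes_per_token(width, bytes_per_elem=2):
    """Payload for one token position when sending a vector of the given width."""
    return width * bytes_per_elem

HIDDEN_DIM = 4096       # hypothetical hidden size
VOCAB_SIZE = 128_000    # hypothetical vocabulary size

logits_bytes = bytes_per_token(VOCAB_SIZE)
hidden_bytes = bytes_per_token(HIDDEN_DIM)
ratio = logits_bytes / hidden_bytes
print(f"logits: {logits_bytes / 1024:.0f} KiB/token, hidden state: {hidden_bytes / 1024:.0f} KiB/token, "
      f"{ratio:.2f}x less traffic")
```

The ratio is just vocabulary size over hidden size, so the saving grows with the vocabulary and does not depend on sequence length.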

How to maintain it

  • Recalibrate STS whenever the draft, target, or traffic mix changes. The scheduler consumes absolute cumulative survival probabilities; a miscalibrated head distorts the throughput estimates that drive admission and silently degrades scheduling even when rankings stay correct.1
  • Re-profile the SPS(B) capacity table on any engine, parallelism, or hardware change. It is captured once at engine initialisation and everything downstream trusts it.1
  • Retrain the draft when the target's serving mode shifts. The released drafts saw non-thinking-mode outputs; a target serving long reasoning traces is a distribution shift that erodes acceptance (the README calls this out explicitly).2
  • Track accepted length per domain, not just the average. The math/code vs chat gap is the signal the scheduler exploits; a collapse in one domain hides inside a healthy-looking mean.1
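
The last point can be made concrete with a small monitoring sketch. The traffic mix and accepted lengths are invented for illustration, and accepted_length_report is a hypothetical helper, not part of DeepSpec.

```python
from collections import defaultdict
from statistics import mean

def accepted_length_report(records):
    """records: list of (domain, accepted_tokens_in_cycle) pairs from verify cycles."""
    by_domain = defaultdict(list)
    for domain, accepted in records:
        by_domain[domain].append(accepted)
    per_domain = {d: mean(v) for d, v in by_domain.items()}
    overall = mean(a for _, a in records)
    return per_domain, overall

# Hypothetical traffic: chat is 10% of cycles and its accepted length halves.
before = [("math", 6)] * 50 + [("code", 5)] * 40 + [("chat", 4)] * 10
after = [("math", 6)] * 50 + [("code", 5)] * 40 + [("chat", 2)] * 10

per_before, overall_before = accepted_length_report(before)
per_after, overall_after = accepted_length_report(after)
overall_drop = 1 - overall_after / overall_before
chat_drop = 1 - per_after["chat"] / per_before["chat"]
print(f"overall accepted length: {overall_before:.2f} -> {overall_after:.2f} ({overall_drop:.1%} drop)")
print(f"chat accepted length:    {per_before['chat']:.2f} -> {per_after['chat']:.2f} ({chat_drop:.1%} drop)")
```

In this made-up mix a 50% collapse in chat moves the overall mean by under 4%, so an alert on the average alone would likely miss it.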

How to run it in production

DeepSeek's deployment (V4-Flash and V4-Pro previews, live user traffic, MTP-1 as the prior production baseline) is the reference. MTP-1 had been kept single-token precisely because static multi-token drafting degrades aggregate throughput under high concurrency; DSpark's scheduler is what makes larger blocks safe.1

  • Pareto frontier: at a moderate interactivity SLA (80 tok/s/user on Flash, 35 on Pro), DSpark improves aggregate throughput by 51-52%. At strict SLAs (120 tok/s/user on Flash, 50 on Pro) the baseline nears its operational boundary and sustains only tiny batches, so the nominal ratios (661% and 406%) should be read as the paper reads them: as evidence that DSpark makes those tiers usable at all, not that a well-utilised baseline got 4-7x faster. At matched throughput, per-user speed improves 60-85% (Flash) and 57-78% (Pro).1 Headline summaries quoting a single "roughly 50-400% faster" range collapse two different measurements, the 51-52% moderate-SLA throughput gains and the nominal 406% strict-SLA ratio on V4-Pro, into one number; keep them separate when sizing capacity.1
  • Budget dynamics: under moderate concurrency the scheduler expands verification from MTP-1's static 2 tokens to roughly 4-6 per request; as concurrency saturates the target, the average budget shrinks smoothly and prunes low-confidence tokens before they consume batch capacity.1
  • Asynchronous scheduling: zero-overhead scheduling needs the next step's batch size before the current step completes, so the production scheduler sets each step's truncation capacity from confidence outputs two steps prior, which casts admission as dynamic top-K selection. Current-step candidates are still sorted by up-to-date scores, so selection stays rank-preserving, and because the capacity decision never sees the current tokens, the asynchrony itself forms the causal barrier: DeepSeek then removes the early-stopping break and searches the whole admission path globally, riding over jagged, step-wise SPS cliffs without violating losslessness.1
  • Variable-length verification kernels: naive variable-length batches waste GPU on padding. The V4 engine flattens all tokens across requests into identical independent elements and conveys intra-sequence structure through a marker tensor in the sparse-attention implementation; only the index-attention and compress kernels needed modification.1
  • Known limitation: the parallel backbone always pays the full block's draft cost up front. For requests with inherently low acceptance that compute is unrecoverable; the paper flags difficulty-aware early exit in the drafter as future work.1
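
One way to read the asynchronous scheduling bullet is as a two-stage procedure: a token budget K chosen from stale scores, then a top-K pick over fresh scores. The sketch below is that reading only, not DeepSeek's implementation. It assumes the same request set at both steps, interprets "searches the whole admission path globally" as an argmax over every prefix of the sorted candidate list, and uses hypothetical function names.

```python
def capacity_from_stale(stale_survival, sps):
    """Choose a token budget K from prefix-survival scores produced two steps earlier.

    stale_survival: per-request lists of prefix-survival probabilities from step t-2.
    sps: dict mapping verification batch size B to steps per second.
    Scans every prefix of the sorted candidates with no early stop, so a dip in
    SPS(B) cannot trap the search before a later improvement.
    """
    R = len(stale_survival)
    cand = sorted((a for req in stale_survival for a in req), reverse=True)
    best_k, best_theta, tau = 0, R * sps[R], float(R)
    for k, a in enumerate(cand, start=1):
        tau += a
        theta = tau * sps[R + k]
        if theta > best_theta:
            best_k, best_theta = k, theta
    return best_k

def admit_top_k(current_survival, k):
    """Admit the k best current-step candidates; ties break in prefix order."""
    cand = sorted((-a, r, j) for r, req in enumerate(current_survival)
                  for j, a in enumerate(req) if a > 0)
    lens = [0] * len(current_survival)
    for _, r, j in cand[:k]:
        lens[r] = max(lens[r], j + 1)   # survival is non-increasing, so picks form a prefix
    return lens

# Capacity numbers from the Appendix A example: the early-stopping greedy halts at
# budget 0 because Theta dips to 0.9 first; the full scan reaches Theta = 1.134 at K = 2.
SPS = {1: 1.0, 2: 0.5, 3: 0.45}
K = capacity_from_stale([[0.8, 0.72]], SPS)
print("budget from stale scores:", K)
print("admitted lengths:", admit_top_k([[0.9, 0.8, 0.3], [0.6, 0.2, 0.1]], 3))
```

The budget depends only on scores from an earlier step, so the current tokens cannot influence it; that separation is what the bullet describes as the causal barrier.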

Failure modes

| Failure mode | Cause | Mitigation |
| --- | --- | --- |
| Acceptance decays along the block | Pure parallel drafting: positions marginalise over predecessors (multi-modal collision).1 | Semi-autoregressive head; verify position-wise conditional acceptance, not the mean alone. |
| Scheduler makes poor admission decisions despite good rankings | Overconfident raw confidence scores distort absolute survival estimates (ECE 3-8%).1 | Apply STS calibration; monitor ECE on held-out traffic (target about 1%). |
| Output distribution drifts from the target | Retrospective admission: the decision for token k depends on the realised token (Appendix A counterexample).1 | Keep the early-stopping break, or isolate admission behind the two-steps-prior asynchronous barrier. |
| Throughput collapses at high concurrency | Fixed or threshold-static verification lengths ignore load; rejected tokens occupy batch capacity.1 | Hardware-aware scheduling against the profiled SPS(B) table. |
| Scheduler stuck at short budgets despite idle capacity | Jagged, step-wise SPS(B) cliffs trap the early-stopping greedy in local optima.1 | Unconstrained global search, made admissible by the asynchronous causal barrier. |
| GPU under-utilisation on the verify pass | Standard decode kernels assume fixed query lengths; variable-length prefixes force padding.1 | Flatten tokens and carry sequence structure in a marker tensor (V4: index-attention and compress kernels). |
| Acceptance collapses after a target-mode change | Draft trained on non-thinking outputs serving a thinking-mode target.2 | Retrain the draft on regenerated outputs for the new mode or domain. |
| Data preparation stalls | Target cache underestimated (about 38 TB for the default Qwen3-4B setting).2 | Provision cache storage before starting the pipeline. |
| Misleading benchmark comparisons | Quoting the released checkpoints against a differently-configured setup.2 | Align training settings with the repo before citing Table 1 numbers. |
| Wasted draft compute on hard requests | Fixed up-front block cost with inherently low acceptance.1 | Known limitation; watch per-domain acceptance and consider gating speculation per request class. |

References

  • Cheng, Yu, Shao, Li, Xiong, et al. (DeepSeek-AI and Peking University), "DSpark: Confidence-Scheduled Speculative Decoding with Semi-Autoregressive Generation" (no arXiv ID as of July 2026; PDF in the DeepSpec repository): https://github.com/deepseek-ai/DeepSpec/blob/main/DSpark_paper.pdf
  • DeepSpec, "a full-stack codebase for training and evaluating draft models for speculative decoding" (data pipeline, DSpark/DFlash/Eagle3 implementations, configs, released checkpoints): https://github.com/deepseek-ai/DeepSpec
  • DeepSeek-V4 with DSpark attached: https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro-DSpark and https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash-DSpark; research draft checkpoints, e.g. https://huggingface.co/deepseek-ai/dspark_qwen3_4b_block7
  • Chen, Liang, Liu, "DFlash: Block Diffusion for Flash Speculative Decoding" (the parallel backbone): https://arxiv.org/abs/2602.06036 and https://github.com/z-lab/dflash
  • Li et al., "EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test" (the autoregressive baseline): https://arxiv.org/abs/2503.01840
  • DeepSeek-AI, "DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence" (the deployment target): https://arxiv.org/abs/2606.19348; "DeepSeek-V3 Technical Report" (the MTP-1 production baseline): https://arxiv.org/abs/2412.19437
  • Leviathan, Kalman, Matias, "Fast Inference from Transformers via Speculative Decoding" (accept rule, losslessness): https://arxiv.org/abs/2211.17192; Chen et al., "Accelerating Large Language Model Decoding with Speculative Sampling": https://arxiv.org/abs/2302.01318

Related: Speculative Decoding · Evaluating Speculative Decoding (SPEED-Bench) · Inference Serving · LLM Inference Efficiency · Continuous Batching Internals · KV Cache Inference Speedup · QoS & Admission Control · SLOs: Inference Serving · Disaggregated Inference


  1. "DSpark: Confidence-Scheduled Speculative Decoding with Semi-Autoregressive Generation," DeepSeek-AI and Peking University, June 2026, https://github.com/deepseek-ai/DeepSpec/blob/main/DSpark_paper.pdf. Latency decomposition L = (T_draft + T_verify) / τ (Eq. 1); semi-autoregressive factorisation with locally normalised softmax conditionals (Eq. 4); Markov head rank 256; confidence head c_k = σ(w^T [h_k; W1[x_{k-1}]]) supervised by c* = 1 - TV/2 (Eqs. 7-8); STS calibration (ECE 3-8% to ~1%, ROC-AUC 0.81-0.90); Algorithm 1 and the Appendix A counterexample (retrospective admission yields output (0.85, 0.15) against target (0.7, 0.3)); losses with weights 0.1/0.9/1.0 and position weighting exp(-(k-1)/γ); Table 1 (macro-average accepted length +30.9/26.7/30.0% over Eagle3, +16.3/18.4/18.3% over DFlash on Qwen3-4B/8B/14B); Figure 2 position-wise conditional acceptance (DFlash 0.88 vs Eagle3 0.81 at position 1 on math, 0.72 vs 0.53 on chat; decay 0.87 to 0.78 on code; DSpark starts 0.93 on math); Figures 3-4 (2-layer DSpark beats 5-layer DFlash; gains 16/15/18% at γ=7 growing to 30/26/22% at γ=15; +0.2-1.3% round latency for draft length 4 to 16 at batch 128); Figure 5 threshold sweep (chat acceptance 45.7% to 95.7%); Section 5 deployment (3 MoE layers, mHC, sliding window 128, γ=5, Markov head; hidden-state communication and anchor-bounded packing in HAI-LLM; two-steps-prior asynchronous top-K scheduling under ZOS and CUDA graphs; flattened tokens with marker tensor, index-attention and compress kernels modified); Figures 7-8 live-traffic results (throughput +51%/+661% at 80/120 tok/s/user on V4-Flash, +52%/+406% at 35/50 on V4-Pro, per-user speed +60-85% and +57-78% at matched throughput, budgets 4-6 tokens under moderate load vs MTP-1's static 2; the paper reads the 661%/406% points as frontier extension, not representative speedup); MTP-1 superseded two weeks after the V4-preview release.

  2. DeepSpec README, https://github.com/deepseek-ai/DeepSpec (retrieved 2026-07-03). Three-stage workflow (data preparation, training, evaluation); target cache "roughly 38 TB for the default Qwen/Qwen3-4B setting"; train.sh spawns one worker per visible GPU, default single node with 8 GPUs, configs under config/; eval benchmarks gsm8k, math500, aime25, humaneval, mbpp, livecodebench, mt-bench, alpaca, arena-hard-v2; released checkpoints trained on open-perfectblend data "generated by its corresponding target model in non-thinking mode"; "If you cite these results in a new paper, align your setup with the training settings in this repository; otherwise, the comparison is not meaningful. For domain-specific use, fine-tune the draft model again for better results, especially if the target model is expected to run in thinking mode." MIT licensed; builds on SpecForge (Apache-2.0) and DFlash (MIT). 

  3. Hugging Face model cards, https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro-DSpark and https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash-DSpark (retrieved 2026-07-03): "DeepSeek-V4-Pro-DSpark is not a new model. It is the same checkpoint with an additional speculative decoding module attached. A minimal inference example is available in the inference folder." V4-Pro: 1.6T total / 49B activated parameters; V4-Flash: 284B / 13B; both 1M context. 

  4. Chen, Liang, Liu, "DFlash: Block Diffusion for Flash Speculative Decoding," https://arxiv.org/abs/2602.06036; implementation https://github.com/z-lab/dflash. Parallel drafter with target-context KV injection: hidden states from selected target layers are projected and concatenated into every draft layer's keys and values; block positions attend bidirectionally; the draft shares the target's frozen embeddings and LM head.