Speculative decoding

Scope: accelerating the decode stage by proposing several tokens cheaply (small draft model, n-gram/suffix lookup, or EAGLE/Medusa/MTP heads) and verifying them in one target-model forward pass, accepting the longest valid prefix. Same output distribution, fewer target steps. The latency-side complement to continuous batching under inference serving.

What it is

Autoregressive decode is serial: each token depends on the previous one, so a long reply costs hundreds of sequential target-model forwards, each memory-bandwidth-bound at batch size 1.[1] Speculative decoding breaks the serial chain. A cheap drafter proposes k tokens ahead; the expensive target model then scores all k candidate positions in a single batched forward, raising arithmetic intensity (more FLOPs per byte moved) instead of doing k separate bandwidth-bound steps.[1]

Verification uses rejection sampling, not string equality. For each drafted token with draft probability q(x) and target probability p(x), accept with probability min(1, p(x)/q(x)); on the first rejection, resample that position from the residual distribution norm(max(0, p−q)) and discard the rest of the draft.[2] At least one token always commits per target forward (the resampled or bonus token), so the worst case is one token per step, like plain decode. The key property: the committed sequence is distributed exactly as if sampled from the target alone. Speculative decoding is lossless when sampling settings match, not an approximation.[2] Fregly's book states the same: "the target model's output distribution is preserved ... the final samples match the large model's distribution when sampling settings are aligned."[1]

The theoretical ceiling is roughly k× fewer target steps; the real speedup is governed by the per-token acceptance rate α (which sets the mean number of tokens committed per target forward) and by the drafter's cost. With overhead and rejections, the practical gain is typically ~1.5–2.5× for a two-model setup.[1][3]
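How acceptance and drafter cost trade off can be sketched with the wall-clock model from Leviathan et al.: expected tokens per target forward divided by the cost of one speculative round. The function names and the cost ratio `c` (drafter step time / target forward time) are illustrative, and the model assumes that verifying k+1 positions costs about the same as one plain decode step, which only holds in the bandwidth-bound regime.

```python
def block_efficiency(alpha: float, k: int) -> float:
    """Expected tokens committed per target forward for per-token acceptance alpha, k drafts."""
    if alpha >= 1.0:
        return float(k + 1)  # limit of the closed form as alpha -> 1
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def walltime_speedup(alpha: float, k: int, c: float) -> float:
    """Modelled speedup over plain decode.

    c is an assumed cost ratio: one drafter step / one target forward.
    One round costs k drafter steps plus one target forward (k * c + 1),
    ignoring scheduling overhead and any extra verify cost at large batch.
    """
    return block_efficiency(alpha, k) / (k * c + 1.0)


# Same acceptance, different drafter costs, then a poor drafter.
for alpha, k, c in [(0.8, 4, 0.25), (0.8, 4, 0.05), (0.8, 4, 0.0), (0.3, 4, 0.25)]:
    print(f"alpha={alpha} k={k} c={c} -> {walltime_speedup(alpha, k, c):.2f}x")
```

Under these assumptions a drafter that is 4× faster than the target (c = 0.25) at α = 0.8 and k = 4 models out at about 1.68×, a near-free drafter (c = 0) at about 3.36×, and a drafter with α = 0.3 at about 0.71×, that is, slower than plain decode.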

Why use it

Decode latency dominates interactive serving and is hard to attack any other way: the target's weights still stream from HBM every step regardless of batch size. Speculative decoding is the main lever that cuts per-request TPOT (time per output token) without changing the model's outputs; it is orthogonal to batching, which raises throughput but not single-stream latency.[1] It shifts decode from bandwidth-bound toward compute-bound by verifying many tokens per HBM weight read, exactly where modern GPUs have spare FLOPs.[1]

It is not free. The book is explicit that speculative decoding "adds a draft model" and Medusa "adds multihead parallel decoding," and these are "typically reserved for extreme cases such as ultralong contexts or erratic latency variances," while "lighter-weight methods, including sparsity, batching, and disaggregation, deliver the bulk of benefits in production."[1] A low acceptance rate collapses the benefit: when the drafter mispredicts, rejected drafts are wasted target compute, so a bad drafter can be slower than plain decode under load.[1]

When to use it (and when not)

Use it when:

  • Single-stream latency is the SLO and you have GPU compute headroom at low-to-moderate batch sizes (the regime where decode is bandwidth-bound).[1]
  • Output is partly predictable: code, structured/JSON, long quoting/RAG, agentic boilerplate → n-gram/suffix is almost pure upside.[3]
  • A high-fidelity, much-faster drafter exists. The book's rule: the draft must share the target's tokenizer/vocabulary and be ~4× or more faster, or the speedup evaporates.[1]

Avoid or reconsider when:

  • The server already runs at large batch sizes / high throughput saturation: decode is compute-bound, spare FLOPs are gone, and speculation's extra verify work competes with real requests. Model-based drafters (EAGLE/Medusa) add load at peak; n-gram/suffix is the safer choice there.[3]
  • Acceptance is low for your traffic (high-entropy, creative, low draft–target overlap). Measure before committing.[1]
  • The operational cost (a second model to serve, or retraining heads for Medusa/EAGLE) outweighs the latency win.[1]
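The two lists above can be folded into a per-step gate. The sketch below is a hypothetical policy, not an API of any serving engine: `GatePolicy`, `should_speculate`, and both threshold defaults are made-up placeholders that would need tuning against measured traffic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GatePolicy:
    """Hypothetical thresholds; the defaults are placeholders, not recommendations."""
    max_batch_size: int = 32      # above this, assume decode is compute-bound
    min_acceptance: float = 0.5   # below this, assume speculation does not pay


def should_speculate(batch_size: int, measured_acceptance: float,
                     drafter_uses_gpu: bool, policy: GatePolicy = GatePolicy()) -> bool:
    """Return True if this decode step should run a speculative round."""
    if measured_acceptance < policy.min_acceptance:
        return False  # low acceptance: rejected drafts are wasted target compute
    if batch_size > policy.max_batch_size and drafter_uses_gpu:
        return False  # saturated server: a model-based drafter adds load at peak
    return True       # n-gram/suffix drafters stay enabled under saturation


print(should_speculate(batch_size=4, measured_acceptance=0.8, drafter_uses_gpu=True))
print(should_speculate(batch_size=128, measured_acceptance=0.8, drafter_uses_gpu=True))
```

Keeping the n-gram path enabled at high batch size mirrors the point that it adds no model load at peak; whether its verify work still pays there is something to measure rather than assume.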

Architecture

Two components and a verify step. The drafter proposes k tokens cheaply; the target scores all k positions in one batched forward; the verifier applies the min(1, p/q) accept rule position by position, commits the longest accepted prefix, then either resamples the first rejected position from norm(max(0, p−q)) or, if all k land, appends a bonus token. The drafter is the only interchangeable part; the target and the verify math are fixed.
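The verify step described above can be sketched for a full k-token draft. This is a minimal illustration on explicit probability arrays, assuming the target forward has already produced distributions for the k drafted positions plus one more; `verify_draft` and its argument layout are invented for this example and are not taken from any engine.

```python
import numpy as np


def verify_draft(draft_tokens, q_rows, p_rows, rng):
    """Commit the longest accepted prefix of a k-token draft, plus one more token.

    draft_tokens: k token ids proposed by the drafter
    q_rows: (k, V) drafter distributions the draft tokens were sampled from
    p_rows: (k + 1, V) target distributions for the k drafted positions and the next one
    Returns a list of 1 to k + 1 committed token ids.
    """
    committed = []
    for i, x in enumerate(draft_tokens):
        p, q = p_rows[i], q_rows[i]
        if rng.random() < min(1.0, p[x] / q[x]):
            committed.append(int(x))          # accepted: keep walking the draft
            continue
        # First rejection: p[x] < q[x] here, so the residual has positive mass.
        residual = np.maximum(0.0, p - q)
        committed.append(int(rng.choice(len(p), p=residual / residual.sum())))
        return committed                      # discard the rest of the draft
    # All k drafts accepted: sample the bonus token from the target's last row.
    committed.append(int(rng.choice(p_rows.shape[1], p=p_rows[-1])))
    return committed


rng = np.random.default_rng(0)
k, V = 3, 5
p_rows = np.full((k + 1, V), 1.0 / V)
print(verify_draft([0, 1, 2], p_rows[:k].copy(), p_rows, rng))  # perfect drafter: k + 1 tokens
```

With a perfect drafter (q equal to p) every draft is accepted and the round commits k + 1 tokens; a drafter that proposes a token the target gives zero probability is rejected at that position and the round commits the accepted prefix plus one resampled token.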

Drafter families, cheapest to richest:

  • n-gram / prompt-lookup / suffix: no model at all. Propose continuations by matching the recent suffix against the prompt and prior output. Near-zero cost, wins on repetitive/structured text (code, RAG, summarization with quoting); modest speedup and no extra GPU work at peak load.[3]
  • Separate draft model: a small, fast LLM (distilled or a smaller sibling) sharing the target's tokenizer and vocabulary. The classic Leviathan/Chen scheme.[2][1]
  • EAGLE / EAGLE-2 / EAGLE-3: a lightweight head that drafts at the feature level from the target's own hidden states rather than the token level, raising acceptance. EAGLE reports up to ~3.5× over vanilla for a 4-token draft; EAGLE-2 adds a context-aware dynamic draft tree; EAGLE-3 predicts tokens more directly and fuses low/mid/high layer features, reporting up to ~6.5× over an unoptimized baseline.[1][6] (Independent sources put EAGLE-3 at roughly 4–6× for 70B-class models, so treat the 6.5× as a favorable upper bound, not a typical figure.[6])
  • Medusa: adds extra decoding heads to the target so one forward emits a tree of candidates, verified with tree attention. Published ~2.2–3.6×, practically ~2–3×, but requires training the heads.[1][7]
  • MTP (multi-token prediction): native multi-token heads built into the model (e.g. DeepSeek-V3/R1); the drafter is the model itself.[5]

Draft-and-verify loop (Mermaid flowchart source):

flowchart LR
    subgraph draft["Drafter (cheap)"]
        D["propose k tokens: t1..tk"]
    end
    subgraph target["Target (one batched forward)"]
        V["score all k positions in parallel"]
    end
    D --> V
    V --> A{"accept t_i with min(1, p/q)?"}
    A -->|"all k accepted"| B["commit k + 1 bonus token"]
    A -->|"reject at i"| R["commit t1..t_{i-1} + resample from norm(max(0, p-q))"]
    B --> N["next speculative round"]
    R --> N

The accept rule is the load-bearing math: it is what makes speculation lossless rather than an approximation. The block below makes the guarantee concrete. It runs the single-draft accept/reject/residual step many times and asserts that the committed tokens match the target p to within Monte-Carlo tolerance, that the empirical acceptance rate equals sum min(p, q) = 1 − TV(p, q), and (adversarial case) that when the drafter puts mass on a token the target assigns probability zero, that token is always rejected and never committed.

# Runnable on system python3 (numpy). Core math of speculative sampling. The rejection rule
# min(1, p/q) with residual resampling norm(max(0, p-q)) makes the committed token
# distributed EXACTLY as the target p (losslessness), independent of the drafter q.
# The assertions below check that property empirically; they are not a proof.
import numpy as np


def residual(p, q):
    """Normalized positive part of (p - q); the distribution to resample from on rejection."""
    r = np.maximum(0.0, p - q)
    s = r.sum()
    return r / s if s > 0 else p.copy()  # p==q -> residual never used (accept always)


def spec_step_vec(p, q, n, rng):
    """n independent single-draft speculative steps. Returns (committed_tokens, accepted_mask)."""
    V = len(p)
    x = rng.choice(V, size=n, p=q)                      # drafter proposes from q
    accept = rng.random(n) < np.minimum(1.0, p[x] / q[x])
    commits = x.copy()
    n_rej = int((~accept).sum())
    if n_rej:
        commits[~accept] = rng.choice(V, size=n_rej, p=residual(p, q))  # resample residual
    return commits, accept


rng = np.random.default_rng(0)
V, N = 8, 400_000
p = rng.random(V) + 0.05; p /= p.sum()
q = rng.random(V) + 0.05; q /= q.sum()

commits, accept = spec_step_vec(p, q, N, rng)
emp = np.bincount(commits, minlength=V) / N

# 1. Losslessness / equivalence to slow reference: committed tokens match sampling p DIRECTLY.
assert np.max(np.abs(emp - p)) < 0.01, np.max(np.abs(emp - p))
# 2. Acceptance identity: E[accept] == sum min(p,q) == 1 - TV(p,q). Quantitative reference check.
alpha_hat = accept.mean()
alpha_true = np.minimum(p, q).sum()
tv = 0.5 * np.abs(p - q).sum()
assert abs(alpha_hat - alpha_true) < 0.01, (alpha_hat, alpha_true)
assert abs(alpha_true - (1.0 - tv)) < 1e-12
# 3. Residual is a valid probability distribution.
r = residual(p, q)
assert np.all(r >= 0) and abs(r.sum() - 1.0) < 1e-12

# 4. Edge: perfect drafter q==p -> acceptance is EXACTLY 1.0 (never a rejection).
c2, a2 = spec_step_vec(p, p.copy(), 50_000, rng)
assert a2.all(), "q==p must accept every draft"
assert abs(np.minimum(p, p).sum() - 1.0) < 1e-12

# 5. Adversarial: drafter puts mass where target is ZERO (hallucinated token). Those drafts are
#    always rejected, yet the committed distribution still equals p and NEVER emits the p==0 token.
p3 = np.array([0.0, 0.4, 0.6]); q3 = np.array([0.5, 0.25, 0.25])
c3, a3 = spec_step_vec(p3, q3, 300_000, rng)
emp3 = np.bincount(c3, minlength=3) / c3.size
assert emp3[0] == 0.0, "token with target prob 0 must never be committed"
assert np.max(np.abs(emp3 - p3)) < 0.01, np.max(np.abs(emp3 - p3))
# drafts of token 0 (prob q=0.5) are guaranteed rejected -> acceptance strictly below 1.
assert a3.mean() < 0.9

print("V1 spec-sampling OK:",
      f"maxdev={np.max(np.abs(emp - p)):.4f}, alpha_hat={alpha_hat:.3f}, alpha_true={alpha_true:.3f},",
      f"1-TV={1 - tv:.3f}, zero-mass token committed={emp3[0]:.4f}")

Running this prints V1 spec-sampling OK: maxdev=0.0009, alpha_hat=0.649, alpha_true=0.649, 1-TV=0.649, zero-mass token committed=0.0000, confirming the committed stream matches the target distribution to within Monte-Carlo noise and the acceptance rate equals 1 − TV(p, q).

How to use it

vLLM: enable via speculative_config; the same keys work on vllm serve (--speculative-config '<json>') and the LLM(...) constructor.[3]

# Reference template (needs vllm installed + a GPU). Not executed here.
from vllm import LLM

# n-gram / prompt-lookup: no draft model, propose from the context
LLM(
    model="<target-model>",
    speculative_config={
        "method": "ngram",
        "num_speculative_tokens": 4,
        "prompt_lookup_min": 2,
        "prompt_lookup_max": 5,
    },
)

# separate draft model
LLM(
    model="<target-model>",
    speculative_config={
        "method": "draft_model",
        "model": "<draft-model>",   # must share the target's tokenizer/vocab
        "num_speculative_tokens": 5,
    },
)

# EAGLE-3 head (the draft runs without TP: draft_tensor_parallel_size = 1)
LLM(
    model="<target-model>",
    speculative_config={
        "method": "eagle3",
        "model": "<eagle3-head-repo-or-path>",
        "num_speculative_tokens": 5,
        "draft_tensor_parallel_size": 1,  # draft head stays at TP=1 even if the target uses TP
    },
)

EAGLE draft heads must run with draft_tensor_parallel_size=1 even when the target uses tensor parallelism; set "method": "eagle" for EAGLE-1/2 and "eagle3" for EAGLE-3.[3]

SGLang: pass the algorithm and draft path on the launch command.[4]

python3 -m sglang.launch_server \
  --model meta-llama/Llama-2-7b-chat-hf \
  --speculative-algorithm EAGLE \
  --speculative-draft-model-path lmsys/sglang-EAGLE-llama2-chat-7B \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 4 \
  --speculative-num-draft-tokens 16

--speculative-algorithm accepts EAGLE, EAGLE3, and NGRAM (EAGLE3 recommended for best speed/quality). The NGRAM path is CUDA-only and disables the overlap scheduler and mixed chunked prefill, so verify it fits your throughput budget before enabling.[4]

TensorRT-LLM: the LLM API takes a typed speculative_config object (DraftTargetDecodingConfig, EagleDecodingConfig, Eagle3DecodingConfig, MedusaDecodingConfig, MTPDecodingConfig, NGramDecodingConfig, LookaheadDecodingConfig, ...):[5]

# Reference template (needs tensorrt_llm installed + a GPU). Not executed here.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import NGramDecodingConfig

spec = NGramDecodingConfig(
    max_draft_len=3,
    max_matching_ngram_size=4,
    is_public_pool=True,
)
llm = LLM(
    "/path/to/target_model",
    speculative_config=spec,
    disable_overlap_scheduler=True,  # required for the NGram path
)

The n-gram / prompt-lookup drafter above (vLLM method="ngram", TensorRT-LLM NGramDecodingConfig) uses no model at all: it matches the recent suffix against earlier text and proposes the tokens that followed. The block below implements exactly that matcher, asserts it reproduces the continuation of a repeating phrase, shows why a longer suffix disambiguates where a bare last-token match is wrong, and (adversarial cases) returns an empty proposal rather than fabricating tokens when nothing matches or the suffix only occurs at the end.

# Runnable on system python3 (standard library). Core math of the n-gram / prompt-lookup
# drafter: match the LONGEST recent suffix against an earlier occurrence in the running context
# and propose the tokens that followed it. No fabrication when there is no match.
def ngram_propose(tokens, max_ngram, max_draft):
    n = len(tokens)
    for size in range(min(max_ngram, n - 1), 0, -1):        # prefer the longest suffix match
        suffix = tokens[n - size:]
        for start in range(n - size - 1, -1, -1):           # most recent earlier occurrence first
            if tokens[start:start + size] == suffix:
                return tokens[start + size: start + size + max_draft]
    return []


def last_token_propose(tokens):
    """Naive 1-gram baseline: follow the most recent earlier copy of just the last token."""
    n = len(tokens)
    for start in range(n - 2, -1, -1):
        if tokens[start] == tokens[-1]:
            return tokens[start + 1: start + 2]
    return []


# 1. Repetitive sequence: propose the exact continuation of the repeating phrase (ground truth).
rep = [1, 2, 3, 1, 2, 3, 1, 2]
assert ngram_propose(rep, max_ngram=3, max_draft=3) == [3, 1, 2], ngram_propose(rep, 3, 3)

# 2. Longest-match disambiguation: the bare last token is ambiguous, the 2-gram is not.
#    "1 2" earlier is followed by 3; the naive last-token match would wrongly pick 5.
seq = [1, 2, 3, 4, 2, 5, 6, 1, 2]
assert ngram_propose(seq, max_ngram=3, max_draft=2) == [3, 4], ngram_propose(seq, 3, 2)
assert last_token_propose(seq) == [5]                        # proof the disambiguation matters

# 3. Adversarial: no earlier match -> empty proposal (never fabricate tokens).
assert ngram_propose([1, 2, 3, 4, 5], max_ngram=3, max_draft=4) == []

# 4. Adversarial: the suffix occurs ONLY as the trailing tokens (self-match) -> empty, not a
#    look-past-the-end crash. The search excludes the trailing suffix itself.
assert ngram_propose([9, 8, 7], max_ngram=2, max_draft=2) == []

# 5. Draft length is capped at max_draft, and clipped at the end of the context.
long_rep = [4, 5, 6, 7, 4, 5, 6, 7, 4, 5]
out = ngram_propose(long_rep, max_ngram=2, max_draft=3)
assert out == [6, 7, 4] and len(out) <= 3, out                # "4 5" -> 6,7,4

print("V3 ngram-drafter OK:",
      f"rep->{ngram_propose(rep,3,3)}, disambig 2-gram->{ngram_propose(seq,3,2)} vs 1-gram->{last_token_propose(seq)}")

Running this prints V3 ngram-drafter OK: rep->[3, 1, 2], disambig 2-gram->[3, 4] vs 1-gram->[5]. The max_draft cap here is the same knob as vLLM's num_speculative_tokens and TensorRT-LLM's max_draft_len; max_ngram is prompt_lookup_max / max_matching_ngram_size.

How to integrate it

Speculative decoding composes with continuous batching, KV-cache management, and constrained decoding; under constrained/structured output, the token mask must be applied during verification too, not only during drafting, or the accepted prefix can violate the grammar.[1] The drafter must share the target's tokenizer and vocabulary: the accept rule compares p(x) and q(x) over the same token ids, so a mismatched vocabulary makes the comparison meaningless.[1] With EAGLE the head runs alongside a tensor-parallel target but must itself use draft_tensor_parallel_size=1.[3]
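Why the mask belongs on the verify pass can be shown on a toy vocabulary. The sketch below assumes the grammar is a boolean `allowed` vector over token ids and that the drafter ran unmasked; `masked_probs` and `verify_one` are illustrative helpers, not functions from a constrained-decoding library.

```python
import numpy as np


def masked_probs(logits, allowed):
    """Softmax with disallowed tokens forced to probability 0 (needs >= 1 allowed token)."""
    z = np.where(allowed, logits, -np.inf)
    z = z - z.max()
    e = np.exp(z)                      # exp(-inf) == 0 for masked tokens
    return e / e.sum()


def verify_one(x, p, q, rng):
    """Standard accept rule on one drafted token x, with residual resampling."""
    if rng.random() < min(1.0, p[x] / q[x]):
        return int(x)
    r = np.maximum(0.0, p - q)         # zero wherever p is zero, so masked ids stay out
    return int(rng.choice(len(p), p=r / r.sum()))


rng = np.random.default_rng(0)
target_logits = np.array([2.0, 0.5, 1.0, 3.0])
draft_logits = np.array([2.5, 0.2, 0.8, 2.0])
allowed = np.array([False, True, True, False])      # hypothetical grammar state
p = masked_probs(target_logits, allowed)            # mask applied on the verify pass
q = masked_probs(draft_logits, np.ones(4, dtype=bool))  # drafter left unmasked
committed = [verify_one(rng.choice(4, p=q), p, q, rng) for _ in range(2000)]
print(sorted(set(committed)))
```

Because the masked target assigns probability zero to disallowed ids, any such draft is rejected and the residual cannot resample one, so every committed token satisfies the mask even though the drafter ignored it. Masking the drafter as well does not change correctness; it only raises acceptance.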

How to run it in production

Co-running the drafter on the Grace CPU. On Grace-Blackwell (GB200/GB300) the Grace CPU and Blackwell GPU share a cache-coherent address space over NVLink-C2C (~900 GB/s), so the book calls out "leveraging the Grace CPU in the NVL72 for ... co-running smaller 'draft' models for high-performance inference algorithms such as speculative decoding."[8] This keeps the small drafter off the critical GPU SMs (the GPU spends its cycles on the high-arithmetic-intensity verify forward while the CPU produces the next proposal), and the coherent fabric makes handing draft logits/tokens to the GPU cheap. Treat it as a placement option (drafter on CPU, target on GPU), and profile: a CPU drafter only helps if it still clears the ~4×-faster-than-target bar and keeps acceptance high.[1][8] This page does not claim hardware-measured numbers; validate acceptance rate and end-to-end latency on your own GB200/GB300 and traffic.

Instrument acceptance rate (fraction of drafted tokens accepted), mean tokens committed per verify step, and per-token latency as first-class metrics; they, not the headline speedup figure, predict the real win, and they drift with traffic mix.[1][3]
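A minimal sketch of such counters follows, assuming the serving loop can report how many tokens were drafted and how many were accepted in each verify step. `SpecStats` is a hypothetical helper, not a metric class from vLLM, SGLang, or TensorRT-LLM.

```python
from dataclasses import dataclass


@dataclass
class SpecStats:
    """Running counters for speculative decoding; one record() call per verify step."""
    rounds: int = 0
    drafted: int = 0
    accepted: int = 0

    def record(self, num_drafted: int, num_accepted: int) -> None:
        if not 0 <= num_accepted <= num_drafted:
            raise ValueError("accepted must be between 0 and drafted")
        self.rounds += 1
        self.drafted += num_drafted
        self.accepted += num_accepted

    @property
    def acceptance_rate(self) -> float:
        """Fraction of drafted tokens that were accepted."""
        return self.accepted / self.drafted if self.drafted else 0.0

    @property
    def tokens_per_step(self) -> float:
        """Mean tokens committed per verify step: accepted drafts + 1 resampled/bonus token."""
        return (self.accepted + self.rounds) / self.rounds if self.rounds else 0.0


stats = SpecStats()
for drafted, accepted in [(4, 4), (4, 1), (4, 0)]:
    stats.record(drafted, accepted)
print(f"acceptance={stats.acceptance_rate:.3f} tokens/step={stats.tokens_per_step:.3f}")
```

The drafted-token fraction counts tokens that sit behind an earlier rejection as misses, so it reads lower than the per-token acceptance α used in the block-efficiency formula; tracking both numbers per traffic class makes drift visible.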

How to maintain it

Re-verify losslessness whenever sampling settings (temperature, top-p) change: the distribution-preservation guarantee holds only when draft and target sampling are aligned.[2][1] When acceptance sags for your workload, switch drafter family (n-gram for repetitive text, EAGLE for general chat) rather than raising num_speculative_tokens blindly; deeper drafts waste more compute on rejection.

How to scale it

The scaling knob is draft depth k (num_speculative_tokens / max_draft_len), and it is governed by the acceptance rate, not chosen freely. Leviathan's block efficiency gives the expected tokens committed per target forward as (1 − α^(k+1)) / (1 − α) for per-token acceptance α.[2] The consequence: growing the draft from k to k+1 tokens adds only α^(k+1) expected tokens, so returns diminish fast; at low acceptance a deeper draft mostly adds rejected (wasted) verify compute, while at high acceptance the same extra depth still pays. Tune k to the measured α for your traffic. Across the batch-size axis, the win shrinks as the server saturates: at large batch sizes decode turns compute-bound and speculation's verify work competes with real requests (see When to use it), which is why n-gram/suffix (no peak-time model load) scales more gracefully under saturation than EAGLE/Medusa.[3]
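The closed form follows from a short geometric-sum argument, sketched here under the simplifying assumption (as in Leviathan et al.) that each draft token is accepted independently with the same probability α. Let N be the number of tokens committed in one round: the j-th token commits only if the first j − 1 drafts were all accepted.

```latex
\begin{aligned}
\Pr[N \ge j] &= \alpha^{\,j-1}, \qquad j = 1, \dots, k+1, \\
\mathbb{E}[N] &= \sum_{j=1}^{k+1} \Pr[N \ge j]
              = \sum_{i=0}^{k} \alpha^{i}
              = \frac{1-\alpha^{k+1}}{1-\alpha}, \qquad 0 \le \alpha < 1, \\
\mathbb{E}[N]_{k+1} - \mathbb{E}[N]_{k} &= \alpha^{k+1}.
\end{aligned}
```

For α = 0.7 and k = 4 the sum is 1 + 0.7 + 0.49 + 0.343 + 0.2401 ≈ 2.773 tokens per target forward, and the fifth draft slot would add only 0.7^5 ≈ 0.168 more.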

The block below validates the block-efficiency formula against a Monte-Carlo simulation of the accept/reject chain, checks the boundaries (α=0 gives just the bonus token, α=1 gives k+1), and quantifies the diminishing returns that make blindly raising k wasteful.

# Runnable on system python3 (numpy). Why acceptance rate governs speedup: Leviathan's block
# efficiency E[tokens per target forward] = (1 - a^(g+1)) / (1 - a), for per-token acceptance a
# and g draft tokens, including the +1 bonus/resampled token that always commits.
import numpy as np


def expected_tokens_per_step(a, g):
    if a >= 1.0:
        return float(g + 1)                     # limit of the closed form as a -> 1
    return (1.0 - a ** (g + 1)) / (1.0 - a)


def simulate(a, g, n, rng):
    """Monte Carlo: count leading accepted draft tokens (i.i.d. Bernoulli a), capped at g, +1."""
    accepts = rng.random((n, g)) < a
    leading = np.cumprod(accepts, axis=1).sum(axis=1)   # number of leading Trues per row
    return float((leading + 1).mean())


rng = np.random.default_rng(7)
# 1. Closed form matches Monte Carlo across a range of (acceptance, draft depth).
for a, g in [(0.7, 4), (0.5, 3), (0.9, 6), (0.3, 8)]:
    mc = simulate(a, g, 300_000, rng)
    cf = expected_tokens_per_step(a, g)
    assert abs(mc - cf) < 0.02, (a, g, mc, cf)
# 2. Boundaries: a=0 -> only the bonus token (1.0); a=1 -> every draft lands (g+1).
assert expected_tokens_per_step(0.0, 5) == 1.0
assert expected_tokens_per_step(1.0, 5) == 6.0
# 3. Diminishing returns on draft depth: going 4 -> 8 tokens adds far less at low acceptance
#    than at high acceptance (why blindly raising num_speculative_tokens wastes compute).
gain_lo = expected_tokens_per_step(0.3, 8) - expected_tokens_per_step(0.3, 4)
gain_hi = expected_tokens_per_step(0.9, 8) - expected_tokens_per_step(0.9, 4)
assert gain_lo < gain_hi, (gain_lo, gain_hi)
assert gain_lo < 0.1, gain_lo                   # at a=0.3 the 5th-8th tokens barely help
# 4. Monotonic non-decreasing in g for fixed a (more draft slots never lowers the expectation).
vals = [expected_tokens_per_step(0.6, g) for g in range(0, 12)]
assert all(b >= a for a, b in zip(vals, vals[1:]))

print("V2 block-efficiency OK:",
      f"a=0.7,g=4 -> {expected_tokens_per_step(0.7,4):.3f} tok/step;",
      f"a=0.3 depth gain(4->8)={gain_lo:.3f} vs a=0.9 {gain_hi:.3f}")

Running this prints V2 block-efficiency OK: a=0.7,g=4 -> 2.773 tok/step; a=0.3 depth gain(4->8)=0.003 vs a=0.9 2.031: at α=0.3, growing the draft from 4 to 8 tokens buys 0.003 extra tokens/step (pure wasted verify work), while at α=0.9 the same growth buys ~2.

Failure modes

| Failure mode | Cause | Mitigation |
| --- | --- | --- |
| Net slowdown versus plain decode | Low acceptance: the drafter mispredicts, so rejected drafts are wasted target compute.[1] | Measure acceptance rate; switch drafter family; reduce draft depth k. |
| Benefit collapses under load | At large batch sizes decode is compute-bound; verify work competes with real requests, and model-based drafters add load at peak.[3] | Prefer n-gram/suffix at saturation, or gate speculation above a batch-size threshold. |
| Low acceptance on some traffic | High-entropy, creative, low draft–target overlap.[1] | Measure per traffic mix before committing; n-gram will not help creative text, consider EAGLE or skip. |
| Output distribution drifts (not lossless) | Draft and target sampling settings (temperature, top-p) are not aligned.[2][1] | Re-verify distribution preservation on any sampling change; keep draft and target sampling aligned. |
| Speedup evaporates | Drafter does not share the target's tokenizer/vocabulary, or is not ~4×+ faster.[1] | Enforce shared vocab; require a ~4× speed margin over the target. |
| EAGLE misconfiguration | Draft head launched with tensor parallelism > 1.[3] | Set draft_tensor_parallel_size=1 on the draft even when the target is tensor-parallel. |
| Throughput regression from the n-gram path | SGLang NGRAM is CUDA-only and disables the overlap scheduler and mixed chunked prefill; TensorRT-LLM NGram needs disable_overlap_scheduler=True.[4][5] | Confirm it fits the throughput budget before enabling. |
| Structured output violated | The constraint token mask is applied only during drafting, not during verification.[1] | Apply the grammar/JSON mask on the verify pass too. |
| Deeper drafts waste compute | Raising num_speculative_tokens past what acceptance supports: block efficiency saturates.[2][1] | Tune k to measured acceptance (see How to scale it). |

References

  • Chris Fregly, "AI Systems Performance Engineering" (O'Reilly), Chapter 15: "Speculative Decoding and Parallel Token Generation Techniques" (two-model draft+verify, EAGLE/EAGLE-2/EAGLE-3, self-speculative decoding, Medusa); Chapter 1: Grace CPU co-running draft models in the NVL72.
  • Leviathan, Kalman, Matias, "Fast Inference from Transformers via Speculative Decoding," ICML 2023 (rejection-sampling accept rule min(1, p/q), exact target-distribution preservation): https://arxiv.org/abs/2211.17192. Chen et al., "Accelerating Large Language Model Decoding with Speculative Sampling": https://arxiv.org/abs/2302.01318
  • vLLM, "Speculative Decoding" (methods, speculative_config, method/num_speculative_tokens/prompt_lookup_*): https://docs.vllm.ai/en/latest/features/speculative_decoding/. "EAGLE Draft Models" (draft_tensor_parallel_size=1, eagle/eagle3): https://docs.vllm.ai/en/latest/features/speculative_decoding/eagle/
  • SGLang, "Speculative Decoding" (--speculative-algorithm EAGLE/EAGLE3/NGRAM, --speculative-draft-model-path, --speculative-num-steps, --speculative-eagle-topk, --speculative-num-draft-tokens): https://docs.sglang.ai/advanced_features/speculative_decoding.html
  • TensorRT-LLM, "Speculative Decoding" (draft-target, EAGLE/EAGLE-3, Medusa, NGram, Lookahead, MTP; NGramDecodingConfig, EagleDecodingConfig, ...): https://nvidia.github.io/TensorRT-LLM/latest/features/speculative-decoding.html and LLM API reference: https://nvidia.github.io/TensorRT-LLM/latest/llm-api/reference.html
  • Li et al., "EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test," NeurIPS 2025: https://arxiv.org/abs/2503.01840. SafeAILab EAGLE (1/2/3): https://github.com/SafeAILab/EAGLE
  • Cai et al., "Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads": https://arxiv.org/abs/2401.10774
  • NVIDIA Technical Blog, "An Introduction to Speculative Decoding for Reducing Latency in AI Inference": https://developer.nvidia.com/blog/an-introduction-to-speculative-decoding-for-reducing-latency-in-ai-inference/

Related: Continuous Batching Internals · Inference Serving · Serving OSS Models · KV Cache Management · Constrained Decoding · Disaggregated Inference · Inference Parallelism Strategies · SLO/SLI Catalog · Glossary


  1. Chris Fregly, "AI Systems Performance Engineering" (O'Reilly), Ch. 15: "Speculative Decoding and Parallel Token Generation Techniques." Draft proposes k tokens "speculatively beyond the current context"; target "validates the draft tokens by predicting next-token probabilities for the entire k-token sequence in a single batch," which "increases arithmetic intensity." "In theory, it provides a theoretical k× speedup ... In practice, with overhead and occasional speculative-token rejections, the gain is more like a 2× speedup." Draft "must use the same tokenizer and vocabulary" and be "much faster than the large model — typically by a factor of 4× or more." "Under the standard speculative decoding acceptance procedure, the target model's output distribution is preserved." EAGLE "~3.5× speedup over vanilla decoding for a 4-token draft"; EAGLE-2 "20%–40% faster than EAGLE"; EAGLE-3 "up to 6.5× speedups over non-optimized baseline." Medusa "~2.2–3.6× speedups ... in practice ... about a 2–3× speedup." Self-speculative decoding runs "only half of its layers" as the draft. Ch. 16: speculative decoding "adds a draft model ... typically reserved for extreme cases such as ultralong contexts or erratic latency variances," while "sparsity, batching, and disaggregation, deliver the bulk of benefits in production." 

  2. Leviathan, Kalman, Matias, "Fast Inference from Transformers via Speculative Decoding," ICML 2023, https://arxiv.org/abs/2211.17192; Chen et al., https://arxiv.org/abs/2302.01318. Each draft token is accepted with probability min(1, p(x)/q(x)); on rejection it is resampled from norm(max(0, p−q)), plus a bonus token when all are accepted. The procedure yields output distributed identically to sampling from the target model alone. Expected tokens per target forward for acceptance α and γ draft tokens is (1 − α^(γ+1)) / (1 − α)

  3. vLLM, "Speculative Decoding," https://docs.vllm.ai/en/latest/features/speculative_decoding/ and EAGLE guide, https://docs.vllm.ai/en/latest/features/speculative_decoding/eagle/. Model-based methods (EAGLE/MTP/draft model) give the best latency reduction; n-gram and suffix decoding give modest speedups "without increasing workload during peak traffic." speculative_config keys: method, num_speculative_tokens (required), model (required for draft_model/eagle/eagle3/medusa), prompt_lookup_min/prompt_lookup_max (ngram). EAGLE draft must run with draft_tensor_parallel_size=1

  4. SGLang, "Speculative Decoding," https://docs.sglang.ai/advanced_features/speculative_decoding.html. --speculative-algorithm EAGLE (and EAGLE3, NGRAM); example: --speculative-draft-model-path lmsys/sglang-EAGLE-llama2-chat-7B --speculative-num-steps 3 --speculative-eagle-topk 4 --speculative-num-draft-tokens 16. EAGLE3 recommended for best speed/quality; the NGRAM path is CUDA-only and disables the overlap scheduler and mixed chunked prefill. 

  5. TensorRT-LLM, "Speculative Decoding," https://nvidia.github.io/TensorRT-LLM/latest/features/speculative-decoding.html, and LLM API reference, https://nvidia.github.io/TensorRT-LLM/latest/llm-api/reference.html. Supports Draft-Target, NGram, Medusa, ReDrafter, EAGLE/EAGLE-2/EAGLE-3, Lookahead, MTP. speculative_config accepts typed configs incl. NGramDecodingConfig(max_draft_len=..., max_matching_ngram_size=..., is_public_pool=...), EagleDecodingConfig, Eagle3DecodingConfig, MedusaDecodingConfig, MTPDecodingConfig, LookaheadDecodingConfig; the NGram path uses disable_overlap_scheduler=True

  6. Li et al., "EAGLE-3," NeurIPS 2025, https://arxiv.org/abs/2503.01840; SafeAILab implementation, https://github.com/SafeAILab/EAGLE. EAGLE-3 fuses low/mid/high-layer features and predicts tokens directly; reported up to ~6.5× over an unoptimized baseline and ~20–40% over EAGLE-2. Independent reports (NVIDIA, vendor blogs) cite ~4–6× for 70B-class models; the 6.5× figure is a favorable upper bound, not a typical production number. 

  7. Cai et al., "Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads," https://arxiv.org/abs/2401.10774. Extra decoding heads on the target emit a tree of candidates verified with tree attention; requires training the heads. Reported ~2.2–3.6× (Medusa-1/Medusa-2). 

  8. Chris Fregly, "AI Systems Performance Engineering" (O'Reilly), Ch. 1: "leveraging the Grace CPU in the NVL72 for preprocessing, co-running smaller 'draft' models for high-performance inference algorithms such as speculative decoding." Ch. 2: Grace CPU and Blackwell GPU are cache-coherent over NVLink-C2C at up to ~900 GB/s, sharing one address space.