Speculative decoding

Scope: accelerating the decode stage by proposing several tokens cheaply (small draft model, n-gram/suffix lookup, or EAGLE/Medusa/MTP heads) and verifying them in one target-model forward pass, accepting the longest valid prefix. Same output distribution, fewer target steps. The latency-side complement to continuous batching under inference serving.

What it is

Autoregressive decode is serial: each token depends on the previous one, so a long reply costs hundreds of sequential target-model forwards, each memory-bandwidth-bound at batch size 1.¹ Speculative decoding breaks the serial chain. A cheap drafter proposes k tokens ahead; the expensive target model then scores all k candidate positions in a single batched forward, raising arithmetic intensity (more FLOPs per byte moved) instead of doing k separate bandwidth-bound steps.¹

Verification uses rejection sampling, not string equality. For each drafted token with draft probability q(x) and target probability p(x), accept with probability min(1, p(x)/q(x)); on the first rejection, resample that position from the residual distribution norm(max(0, p−q)) and discard the rest of the draft.² At least one token always commits per step (the resampled or bonus token), so the worst case is one token like plain decode. The key property: the committed sequence is distributed exactly as if sampled from the target alone. Speculative decoding is lossless when sampling settings match, not an approximation.² The book states the same: "the target model's output distribution is preserved ... the final samples match the large model's distribution when sampling settings are aligned."¹
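The losslessness claim can be checked by hand. The short derivation below uses only the p and q defined above; β is shorthand introduced here for the overall acceptance probability, and the derivation assumes β < 1 so that the residual is well defined.

```latex
\begin{align*}
\beta &= \sum_{x} q(x)\,\min\!\Bigl(1, \tfrac{p(x)}{q(x)}\Bigr) = \sum_{x} \min\bigl(p(x), q(x)\bigr), \\
\sum_{x} \max\bigl(0,\, p(x) - q(x)\bigr) &= \sum_{x} \Bigl(p(x) - \min\bigl(p(x), q(x)\bigr)\Bigr) = 1 - \beta, \\
\Pr[\text{commit } x] &= \underbrace{\min\bigl(p(x), q(x)\bigr)}_{\text{drafted and accepted}}
  + \underbrace{(1 - \beta)\,\frac{\max\bigl(0,\, p(x) - q(x)\bigr)}{1 - \beta}}_{\text{rejected, then resampled}} \\
  &= \min\bigl(p(x), q(x)\bigr) + \max\bigl(0,\, p(x) - q(x)\bigr) = p(x).
\end{align*}
```

The drafter q cancels out of the final line, which is why a poor drafter costs speed but never changes the output distribution.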

The theoretical ceiling is roughly k× fewer target steps (k+1 tokens per forward when every draft lands and the bonus token is counted); real speedup is governed by acceptance (how many drafted tokens survive each target forward) and by the drafter's cost. With overhead and rejections, the practical gain is typically ~1.5–2.5× for a two-model setup.¹³
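How acceptance and drafter cost trade off can be sketched with the wall-clock improvement factor from Leviathan et al., (1 − α^(k+1)) / ((1 − α)(kc + 1)), where c is the cost of one drafter step relative to one target forward. This is a minimal sketch under stated assumptions: acceptance is i.i.d. per position, the verify forward costs the same as a plain decode step (the bandwidth-bound regime), and the function name `expected_speedup` and the sample values are illustrative, not measurements.

```python
def expected_speedup(alpha: float, k: int, c: float) -> float:
    """Expected wall-clock speedup over plain decode.

    alpha: per-token acceptance probability (assumed i.i.d. across positions)
    k:     draft tokens proposed per round
    c:     cost of one drafter step relative to one target forward (0.25 = 4x faster)
    """
    if alpha >= 1.0:
        tokens_per_round = float(k + 1)
    else:
        tokens_per_round = (1.0 - alpha ** (k + 1)) / (1.0 - alpha)
    cost_per_round = k * c + 1.0      # k drafter steps plus one target verify forward
    return tokens_per_round / cost_per_round


fast = expected_speedup(alpha=0.8, k=4, c=0.05)   # drafter 20x cheaper than the target
rule = expected_speedup(alpha=0.8, k=4, c=0.25)   # drafter exactly 4x cheaper
slow = expected_speedup(alpha=0.8, k=4, c=0.60)   # drafter too expensive to pay off
print(f"c=0.05: {fast:.2f}x, c=0.25: {rule:.2f}x, c=0.60: {slow:.2f}x")
# prints: c=0.05: 2.80x, c=0.25: 1.68x, c=0.60: 0.99x
```

Under these assumptions the same acceptance rate yields anything from a 2.8× win to a net loss depending only on drafter cost, which is the arithmetic behind requiring a much faster drafter.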

Why use it

Decode latency dominates interactive serving and is hard to attack any other way: the target's weights still stream from HBM every step regardless of batch size. Speculative decoding is the main lever that cuts per-request TPOT (time per output token) without changing the model's outputs; it is orthogonal to batching, which raises throughput but not single-stream latency.¹ It shifts decode from bandwidth-bound toward compute-bound by verifying many tokens per HBM weight read, exactly where modern GPUs have spare FLOPs.¹

It is not free. The book is explicit that speculative decoding "adds a draft model" and Medusa "adds multihead parallel decoding," and these are "typically reserved for extreme cases such as ultralong contexts or erratic latency variances," while "lighter-weight methods, including sparsity, batching, and disaggregation, deliver the bulk of benefits in production."¹ A low acceptance rate erases the benefit: when the drafter mispredicts, rejected drafts are wasted target compute, so a bad drafter can be slower than plain decode under load.¹

When to use it (and when not)

Use it when:

Single-stream latency is the SLO and you have GPU compute headroom at low-to-moderate batch sizes (the regime where decode is bandwidth-bound).¹
Output is partly predictable: code, structured/JSON, long quoting/RAG, agentic boilerplate → n-gram/suffix is almost pure upside.³
A high-fidelity, much-faster drafter exists. The book's rule: the draft must share the target's tokenizer/vocabulary and be ~4× or more faster, or the speedup evaporates.¹

Avoid or reconsider when:

The server already runs at large batch sizes / high throughput saturation: decode is compute-bound, spare FLOPs are gone, and speculation's extra verify work competes with real requests. Model-based drafters (EAGLE/Medusa) add load at peak; n-gram/suffix is the safer choice there.³
Acceptance is low for your traffic (high-entropy, creative, low draft–target overlap). Measure before committing.¹
The operational cost (a second model to serve, or retraining heads for Medusa/EAGLE) outweighs the latency win.¹

Architecture

Two components and a verify step. The drafter proposes k tokens cheaply; the target scores all k positions in one batched forward; the verifier applies the min(1, p/q) accept rule position by position, commits the longest accepted prefix, then either resamples the first rejected position from norm(max(0, p−q)) or, if all k land, appends a bonus token. The drafter is the only interchangeable part; the target and the verify math are fixed.
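The position-by-position verify step described above can be sketched for a full k-token draft. This is an illustrative sketch, not any engine's implementation: `verify_draft`, `q_rows`, and `p_rows` are names invented here, and the distributions are passed in as plain NumPy arrays rather than produced by real models.

```python
import numpy as np


def verify_draft(draft_tokens, q_rows, p_rows, rng):
    """One speculative round: return the tokens to commit.

    draft_tokens: k proposed token ids
    q_rows:       (k, V) drafter distributions the proposals were sampled from
    p_rows:       (k + 1, V) target distributions from the single verify forward;
                  row i conditions on draft_tokens[:i], the last row feeds the bonus token
    """
    committed = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_rows[i], q_rows[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):     # accept rule
            committed.append(int(tok))
            continue
        residual = np.maximum(0.0, p - q)                # first rejection: resample here
        committed.append(int(rng.choice(len(p), p=residual / residual.sum())))
        return committed                                 # discard the rest of the draft
    bonus = rng.choice(p_rows.shape[1], p=p_rows[-1])    # all k accepted: bonus token
    committed.append(int(bonus))
    return committed


rng = np.random.default_rng(0)
p_rows = np.array([[0.7, 0.2, 0.1], [0.0, 0.5, 0.5], [0.1, 0.1, 0.8]])
all_ok = verify_draft([0, 1], p_rows[:2], p_rows, rng)   # drafter == target: 2 drafts + bonus
q_bad = np.array([[0.7, 0.2, 0.1], [1.0, 0.0, 0.0]])      # position 1 drafts a token with p = 0
cut = verify_draft([0, 0], q_bad, p_rows, rng)           # accept position 0, reject position 1
print(len(all_ok), cut[:1], cut[1] in (1, 2))
# prints: 3 [0] True
```

Every round commits between 1 and k+1 tokens, and the rejected position is always replaced by a sample from the residual rather than left empty.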

Drafter families, cheapest to richest:

n-gram / prompt-lookup / suffix: no model at all. Propose continuations by matching the recent suffix against the prompt and prior output. Near-zero cost, wins on repetitive/structured text (code, RAG, summarization with quoting); modest speedup and no extra GPU work at peak load.³
Separate draft model: a small, fast LLM (distilled or a smaller sibling) sharing the target's tokenizer and vocabulary. The classic Leviathan/Chen scheme.²¹
EAGLE / EAGLE-2 / EAGLE-3: a lightweight head that drafts at the feature level from the target's own hidden states rather than the token level, raising acceptance. EAGLE reports up to ~3.5× over vanilla for a 4-token draft; EAGLE-2 adds a context-aware dynamic draft tree; EAGLE-3 predicts tokens more directly and fuses low/mid/high layer features, reporting up to ~6.5× over an unoptimized baseline.¹⁶ (Independent sources put EAGLE-3 at roughly 4–6× for 70B-class models, so treat the 6.5× as a favorable upper bound, not a typical figure.⁶)
Medusa: adds extra decoding heads to the target so one forward emits a tree of candidates, verified with tree attention. Published ~2.2–3.6×, practically ~2–3×, but requires training the heads.¹⁷
MTP (multi-token prediction): native multi-token heads built into the model (e.g. DeepSeek-V3/R1); the drafter is the model itself.⁵

flowchart LR
    subgraph draft["Drafter (cheap)"]
        D["propose k tokens: t1..tk"]
    end
    subgraph target["Target (one batched forward)"]
        V["score all k positions in parallel"]
    end
    D --> V
    V --> A{"accept t_i with min(1, p/q)?"}
    A -->|"all k accepted"| B["commit k + 1 bonus token"]
    A -->|"reject at i"| R["commit t1..t_{i-1} + resample from max(0, p-q)"]
    B --> N["next speculative round"]
    R --> N

The accept rule is the load-bearing math: it is what makes speculation lossless rather than an approximation. The block below makes the guarantee concrete. It runs the single-draft accept/reject/residual step many times and asserts that the committed tokens match the target p to within Monte-Carlo tolerance, that the empirical acceptance rate equals Σ min(p, q) = 1 − TV(p, q), and (adversarial case) that when the drafter puts mass on a token the target assigns probability zero, that token is always rejected and never committed.

# Runnable on system python3 (numpy). Core math of speculative sampling. The rejection rule
# min(1, p/q) with residual resampling norm(max(0, p-q)) makes the committed token
# distributed EXACTLY as the target p (losslessness), independent of the drafter q.
# The asserts below check that property empirically; they are not a proof.
import numpy as np


def residual(p, q):
    """Normalized positive part of (p - q); the distribution to resample from on rejection."""
    r = np.maximum(0.0, p - q)
    s = r.sum()
    return r / s if s > 0 else p.copy()  # p==q -> residual never used (accept always)


def spec_step_vec(p, q, n, rng):
    """n independent single-draft speculative steps. Returns (committed_tokens, accepted_mask)."""
    V = len(p)
    x = rng.choice(V, size=n, p=q)                      # drafter proposes from q
    accept = rng.random(n) < np.minimum(1.0, p[x] / q[x])
    commits = x.copy()
    n_rej = int((~accept).sum())
    if n_rej:
        commits[~accept] = rng.choice(V, size=n_rej, p=residual(p, q))  # resample residual
    return commits, accept


rng = np.random.default_rng(0)
V, N = 8, 400_000
p = rng.random(V) + 0.05; p /= p.sum()
q = rng.random(V) + 0.05; q /= q.sum()

commits, accept = spec_step_vec(p, q, N, rng)
emp = np.bincount(commits, minlength=V) / N

# 1. Losslessness / equivalence to slow reference: committed tokens match sampling p DIRECTLY.
assert np.max(np.abs(emp - p)) < 0.01, np.max(np.abs(emp - p))
# 2. Acceptance identity: E[accept] == sum min(p,q) == 1 - TV(p,q). Quantitative reference check.
alpha_hat = accept.mean()
alpha_true = np.minimum(p, q).sum()
tv = 0.5 * np.abs(p - q).sum()
assert abs(alpha_hat - alpha_true) < 0.01, (alpha_hat, alpha_true)
assert abs(alpha_true - (1.0 - tv)) < 1e-12
# 3. Residual is a valid probability distribution.
r = residual(p, q)
assert np.all(r >= 0) and abs(r.sum() - 1.0) < 1e-12

# 4. Edge: perfect drafter q==p -> acceptance is EXACTLY 1.0 (never a rejection).
c2, a2 = spec_step_vec(p, p.copy(), 50_000, rng)
assert a2.all(), "q==p must accept every draft"
assert abs(np.minimum(p, p).sum() - 1.0) < 1e-12

# 5. Adversarial: drafter puts mass where target is ZERO (hallucinated token). Those drafts are
#    always rejected, yet the committed distribution still equals p and NEVER emits the p==0 token.
p3 = np.array([0.0, 0.4, 0.6]); q3 = np.array([0.5, 0.25, 0.25])
c3, a3 = spec_step_vec(p3, q3, 300_000, rng)
emp3 = np.bincount(c3, minlength=3) / c3.size
assert emp3[0] == 0.0, "token with target prob 0 must never be committed"
assert np.max(np.abs(emp3 - p3)) < 0.01, np.max(np.abs(emp3 - p3))
# drafts of token 0 (prob q=0.5) are guaranteed rejected -> acceptance strictly below 1.
assert a3.mean() < 0.9

print("V1 spec-sampling OK:",
      f"maxdev={np.max(np.abs(emp - p)):.4f}, alpha_hat={alpha_hat:.3f}, alpha_true={alpha_true:.3f},",
      f"1-TV={1 - tv:.3f}, zero-mass token committed={emp3[0]:.4f}")

Running this prints V1 spec-sampling OK: maxdev=0.0009, alpha_hat=0.649, alpha_true=0.649, 1-TV=0.649, zero-mass token committed=0.0000, confirming the committed stream matches the target distribution to within Monte-Carlo noise and the acceptance rate equals 1 − TV(p, q).

How to use it

vLLM: enable via speculative_config; the same keys work on vllm serve (--speculative-config '<json>') and the LLM(...) constructor.³

# Reference template (needs vllm installed + a GPU). Not executed here.
from vllm import LLM

# n-gram / prompt-lookup: no draft model, propose from the context
LLM(
    model="<target-model>",
    speculative_config={
        "method": "ngram",
        "num_speculative_tokens": 4,
        "prompt_lookup_min": 2,
        "prompt_lookup_max": 5,
    },
)

# separate draft model
LLM(
    model="<target-model>",
    speculative_config={
        "method": "draft_model",
        "model": "<draft-model>",   # must share the target's tokenizer/vocab
        "num_speculative_tokens": 5,
    },
)

# EAGLE-3 head (run the draft without TP: draft_tensor_parallel_size = 1)
LLM(
    model="<target-model>",
    speculative_config={
        "method": "eagle3",
        "model": "<eagle3-head-repo-or-path>",
        "num_speculative_tokens": 5,
        "draft_tensor_parallel_size": 1,   # keep the draft head unsharded
    },
)

EAGLE draft heads must run with draft_tensor_parallel_size=1 even when the target uses tensor parallelism; set "method": "eagle" for EAGLE-1/2 and "eagle3" for EAGLE-3.³
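For the `vllm serve` form, the same keys have to survive shell quoting as a JSON string. The sketch below builds that command line with the standard library only; the model names are placeholders, the tensor-parallel size of 4 is an arbitrary illustration, and nothing here launches a server.

```python
import json
import shlex

# Keys are the ones named in the text; the model names are placeholders.
eagle3_config = {
    "method": "eagle3",
    "model": "<eagle3-head-repo-or-path>",
    "num_speculative_tokens": 5,
    "draft_tensor_parallel_size": 1,   # draft head stays unsharded
}

# Same dict, serialized and shell-quoted for the CLI form of the option.
flag = "--speculative-config " + shlex.quote(json.dumps(eagle3_config))
command = f"vllm serve {shlex.quote('<target-model>')} --tensor-parallel-size 4 {flag}"
print(command)
```

Generating the flag from one dict keeps the constructor and CLI configurations from drifting apart.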

SGLang: pass the algorithm and draft path on the launch command.⁴

python3 -m sglang.launch_server \
  --model meta-llama/Llama-2-7b-chat-hf \
  --speculative-algorithm EAGLE \
  --speculative-draft-model-path lmsys/sglang-EAGLE-llama2-chat-7B \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 4 \
  --speculative-num-draft-tokens 16

--speculative-algorithm accepts EAGLE, EAGLE3, and NGRAM (EAGLE3 recommended for best speed/quality). The NGRAM path is CUDA-only and disables the overlap scheduler and mixed chunked prefill, so verify it fits your throughput budget before enabling.⁴

TensorRT-LLM: the LLM API takes a typed speculative_config object (DraftTargetDecodingConfig, EagleDecodingConfig, Eagle3DecodingConfig, MedusaDecodingConfig, MTPDecodingConfig, NGramDecodingConfig, LookaheadDecodingConfig, ...):⁵

# Reference template (needs tensorrt_llm installed + a GPU). Not executed here.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import NGramDecodingConfig

spec = NGramDecodingConfig(
    max_draft_len=3,
    max_matching_ngram_size=4,
    is_public_pool=True,
)
llm = LLM(
    "/path/to/target_model",
    speculative_config=spec,
    disable_overlap_scheduler=True,  # required for the NGram path
)

The n-gram / prompt-lookup drafter above (vLLM method="ngram", TensorRT-LLM NGramDecodingConfig) uses no model at all: it matches the recent suffix against earlier text and proposes the tokens that followed. The block below implements a minimal version of that matcher (an illustration of the idea, not either engine's code), asserts it reproduces the continuation of a repeating phrase, shows why a longer suffix disambiguates where a bare last-token match is wrong, and (adversarial cases) returns an empty proposal rather than fabricating tokens when nothing matches or the suffix only occurs at the end.

# Runnable on system python3 (standard library). Core math of the n-gram / prompt-lookup
# drafter: match the LONGEST recent suffix against an earlier occurrence in the running context
# and propose the tokens that followed it. No fabrication when there is no match.
def ngram_propose(tokens, max_ngram, max_draft):
    n = len(tokens)
    for size in range(min(max_ngram, n - 1), 0, -1):        # prefer the longest suffix match
        suffix = tokens[n - size:]
        for start in range(n - size - 1, -1, -1):           # most recent earlier occurrence first
            if tokens[start:start + size] == suffix:
                return tokens[start + size: start + size + max_draft]
    return []


def last_token_propose(tokens):
    """Naive 1-gram baseline: follow the most recent earlier copy of just the last token."""
    n = len(tokens)
    for start in range(n - 2, -1, -1):
        if tokens[start] == tokens[-1]:
            return tokens[start + 1: start + 2]
    return []


# 1. Repetitive sequence: propose the exact continuation of the repeating phrase (ground truth).
rep = [1, 2, 3, 1, 2, 3, 1, 2]
assert ngram_propose(rep, max_ngram=3, max_draft=3) == [3, 1, 2], ngram_propose(rep, 3, 3)

# 2. Longest-match disambiguation: the bare last token is ambiguous, the 2-gram is not.
#    "1 2" earlier is followed by 3; the naive last-token match would wrongly pick 5.
seq = [1, 2, 3, 4, 2, 5, 6, 1, 2]
assert ngram_propose(seq, max_ngram=3, max_draft=2) == [3, 4], ngram_propose(seq, 3, 2)
assert last_token_propose(seq) == [5]                        # proof the disambiguation matters

# 3. Adversarial: no earlier match -> empty proposal (never fabricate tokens).
assert ngram_propose([1, 2, 3, 4, 5], max_ngram=3, max_draft=4) == []

# 4. Adversarial: the suffix occurs ONLY as the trailing tokens (self-match) -> empty, not a
#    look-past-the-end crash. The search excludes the trailing suffix itself.
assert ngram_propose([9, 8, 7], max_ngram=2, max_draft=2) == []

# 5. Draft length is capped at max_draft, and clipped at the end of the context.
long_rep = [4, 5, 6, 7, 4, 5, 6, 7, 4, 5]
out = ngram_propose(long_rep, max_ngram=2, max_draft=3)
assert out == [6, 7, 4] and len(out) <= 3, out                # "4 5" -> 6,7,4

print("V3 ngram-drafter OK:",
      f"rep->{ngram_propose(rep,3,3)}, disambig 2-gram->{ngram_propose(seq,3,2)} vs 1-gram->{last_token_propose(seq)}")

Running this prints V3 ngram-drafter OK: rep->[3, 1, 2], disambig 2-gram->[3, 4] vs 1-gram->[5]. The max_draft cap here is the same knob as vLLM's num_speculative_tokens and TensorRT-LLM's max_draft_len; max_ngram is prompt_lookup_max / max_matching_ngram_size.

How to integrate it

Speculative decoding composes with continuous batching, KV-cache management, and constrained decoding; under constrained/structured output, the token mask must be applied during verification too, not only during drafting, or the accepted prefix can violate the grammar.¹ The drafter must share the target's tokenizer and vocabulary: the accept rule compares p(x) and q(x) over the same token ids, so a mismatched vocabulary makes the comparison meaningless.¹ With EAGLE the head runs alongside a tensor-parallel target but must itself use draft_tensor_parallel_size=1.³
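Applying the constraint during verification can be sketched by masking and renormalizing the target distribution before the accept rule runs. This is one possible arrangement under stated assumptions: `masked_target` and `verify_one` are hypothetical helpers, the grammar is reduced to a boolean `allowed` vector for a single position, and the drafter is deliberately left unmasked to show the failure case being caught.

```python
import numpy as np


def masked_target(p, allowed):
    """Zero out tokens the grammar forbids at this position and renormalize."""
    masked = np.where(allowed, p, 0.0)
    total = masked.sum()
    if total == 0.0:
        raise ValueError("grammar allows no token with nonzero target probability")
    return masked / total


def verify_one(tok, p, q, allowed, rng):
    """Accept/reject one drafted token against the MASKED target distribution."""
    p_m = masked_target(p, allowed)
    if rng.random() < min(1.0, p_m[tok] / q[tok]):
        return int(tok), True
    residual = np.maximum(0.0, p_m - q)          # forbidden tokens have p_m = 0, so residual = 0
    return int(rng.choice(len(p), p=residual / residual.sum())), False


rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.6, 0.3, 0.1])                 # drafter was NOT grammar-masked
allowed = np.array([False, True, True])       # grammar forbids token 0 here
results = [verify_one(0, p, q, allowed, rng) for _ in range(1000)]
print(all(tok != 0 and not ok for tok, ok in results))
# prints: True
```

Because the forbidden token has masked target probability zero, it is rejected every time and can never be resampled, so the committed prefix stays inside the grammar even when the drafter ignores it.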

How to run it in production

Co-running the drafter on the Grace CPU. On Grace-Blackwell (GB200/GB300) the Grace CPU and Blackwell GPU share a cache-coherent address space over NVLink-C2C (~900 GB/s), so the book calls out "leveraging the Grace CPU in the NVL72 for ... co-running smaller 'draft' models for high-performance inference algorithms such as speculative decoding."⁸ This keeps the small drafter off the critical GPU SMs (the GPU spends its cycles on the high-arithmetic-intensity verify forward while the CPU produces the next proposal), and the coherent fabric makes handing draft logits/tokens to the GPU cheap. Treat it as a placement option (drafter on CPU, target on GPU), and profile: a CPU drafter only helps if it still clears the ~4×-faster-than-target bar and keeps acceptance high.¹⁸ This page does not claim hardware-measured numbers; validate acceptance rate and end-to-end latency on your own GB200/GB300 and traffic.

Instrument acceptance (the per-token acceptance rate and the mean tokens committed per verify step) and per-token latency as first-class metrics; they, not the headline k×, predict the real win, and they drift with traffic mix.¹³
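A minimal sketch of such instrumentation follows. `SpecStats` is a hypothetical helper, not an API of any serving engine; it assumes the engine can report, per verify step, how many tokens were drafted, how many were accepted, and how long the round took. The sample rounds are made-up numbers.

```python
from dataclasses import dataclass


@dataclass
class SpecStats:
    """Running counters for speculative decoding (hypothetical helper, not an engine API)."""
    drafted: int = 0      # draft tokens proposed
    accepted: int = 0     # draft tokens that passed verification
    steps: int = 0        # target verify forwards
    seconds: float = 0.0  # wall-clock time spent in those rounds

    def record(self, drafted: int, accepted: int, seconds: float) -> None:
        self.drafted += drafted
        self.accepted += accepted
        self.steps += 1
        self.seconds += seconds

    @property
    def committed(self) -> int:
        return self.accepted + self.steps   # each round also commits one resampled/bonus token

    @property
    def acceptance_rate(self) -> float:
        return self.accepted / self.drafted if self.drafted else 0.0

    @property
    def tokens_per_step(self) -> float:
        return self.committed / self.steps if self.steps else 0.0

    @property
    def ms_per_token(self) -> float:
        return 1000.0 * self.seconds / self.committed if self.committed else 0.0


stats = SpecStats()
for drafted, accepted, seconds in [(4, 4, 0.030), (4, 1, 0.030), (4, 3, 0.030)]:
    stats.record(drafted, accepted, seconds)
print(f"{stats.acceptance_rate:.2f} {stats.tokens_per_step:.2f} {stats.ms_per_token:.1f}")
# prints: 0.67 3.67 8.2
```

Tracking these per traffic class (for example code versus open-ended chat) makes it visible when a drafter that pays off on one mix is a net loss on another.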

How to maintain it

Re-verify losslessness whenever sampling settings (temperature, top-p) change: the distribution-preservation guarantee holds only when draft and target sampling are aligned.²¹ When acceptance sags for your workload, switch drafter family (n-gram for repetitive text, EAGLE for general chat) rather than raising num_speculative_tokens blindly; deeper drafts waste more compute on rejection.

How to scale it

The scaling knob is draft depth k (num_speculative_tokens / max_draft_len), and it is governed by the acceptance rate, not chosen freely. Leviathan's block efficiency gives the expected tokens committed per target forward as (1 − α^(k+1)) / (1 − α) for per-token acceptance α.² The consequence: going from k to k+1 draft slots adds exactly α^(k+1) expected tokens, which decays geometrically in k, so at low acceptance a deeper draft mostly adds rejected (wasted) verify compute, while at high acceptance the same extra depth still pays. Tune k to the measured α for your traffic. Across the batch-size axis, the win shrinks as the server saturates: at large batch sizes decode turns compute-bound and speculation's verify work competes with real requests (see When to use it), which is why n-gram/suffix (no peak-time model load) scales more gracefully under saturation than EAGLE/Medusa.³

The block below validates the block-efficiency formula against a Monte-Carlo simulation of the accept/reject chain, checks the boundaries (α=0 gives just the bonus token, α=1 gives k+1), and quantifies the diminishing returns that make blindly raising k wasteful.

# Runnable on system python3 (numpy). Why acceptance rate governs speedup: Leviathan's block
# efficiency E[tokens per target forward] = (1 - a^(g+1)) / (1 - a), for per-token acceptance a
# and g draft tokens, including the +1 bonus/resampled token that always commits.
import numpy as np


def expected_tokens_per_step(a, g):
    if a >= 1.0:
        return float(g + 1)                     # limit of the closed form as a -> 1
    return (1.0 - a ** (g + 1)) / (1.0 - a)


def simulate(a, g, n, rng):
    """Monte Carlo: count leading accepted draft tokens (i.i.d. Bernoulli a), capped at g, +1."""
    accepts = rng.random((n, g)) < a
    leading = np.cumprod(accepts, axis=1).sum(axis=1)   # number of leading Trues per row
    return float((leading + 1).mean())


rng = np.random.default_rng(7)
# 1. Closed form matches Monte Carlo across a range of (acceptance, draft depth).
for a, g in [(0.7, 4), (0.5, 3), (0.9, 6), (0.3, 8)]:
    mc = simulate(a, g, 300_000, rng)
    cf = expected_tokens_per_step(a, g)
    assert abs(mc - cf) < 0.02, (a, g, mc, cf)
# 2. Boundaries: a=0 -> only the bonus token (1.0); a=1 -> every draft lands (g+1).
assert expected_tokens_per_step(0.0, 5) == 1.0
assert expected_tokens_per_step(1.0, 5) == 6.0
# 3. Diminishing returns on draft depth: going 4 -> 8 tokens adds far less at low acceptance
#    than at high acceptance (why blindly raising num_speculative_tokens wastes compute).
gain_lo = expected_tokens_per_step(0.3, 8) - expected_tokens_per_step(0.3, 4)
gain_hi = expected_tokens_per_step(0.9, 8) - expected_tokens_per_step(0.9, 4)
assert gain_lo < gain_hi, (gain_lo, gain_hi)
assert gain_lo < 0.1, gain_lo                   # at a=0.3 the 5th-8th tokens barely help
# 4. Monotonic non-decreasing in g for fixed a (more draft slots never lowers the expectation).
vals = [expected_tokens_per_step(0.6, g) for g in range(0, 12)]
assert all(b >= a for a, b in zip(vals, vals[1:]))

print("V2 block-efficiency OK:",
      f"a=0.7,g=4 -> {expected_tokens_per_step(0.7,4):.3f} tok/step;",
      f"a=0.3 depth gain(4->8)={gain_lo:.3f} vs a=0.9 {gain_hi:.3f}")

Running this prints V2 block-efficiency OK: a=0.7,g=4 -> 2.773 tok/step; a=0.3 depth gain(4->8)=0.003 vs a=0.9 2.031: at α=0.3, growing the draft from 4 to 8 tokens buys 0.003 extra tokens/step (pure wasted verify work), while at α=0.9 the same growth buys ~2.
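Tuning k to a measured α can be sketched as a small search over draft depths. The sketch assumes the Leviathan wall-clock model, in which a round costs k drafter steps plus one target forward and c is the drafter's per-step cost relative to the target; `best_draft_depth`, the cap `max_k`, and the value c = 0.1 are illustrative choices, not recommendations.

```python
def best_draft_depth(alpha: float, c: float, max_k: int = 16):
    """Return (k, speedup) maximizing expected wall-clock speedup; k = 0 means no speculation."""
    best_k, best_speedup = 0, 1.0
    for k in range(1, max_k + 1):
        if alpha >= 1.0:
            tokens = float(k + 1)
        else:
            tokens = (1.0 - alpha ** (k + 1)) / (1.0 - alpha)   # block efficiency
        speedup = tokens / (k * c + 1.0)     # k drafter steps + one verify forward per round
        if speedup > best_speedup:
            best_k, best_speedup = k, speedup
    return best_k, best_speedup


for alpha in (0.3, 0.6, 0.9):
    k, s = best_draft_depth(alpha, c=0.1)
    print(f"alpha={alpha}: k={k}, speedup={s:.2f}x")
```

Under this model the best depth grows with acceptance (k=1 at α=0.3, k=3 at α=0.6 for c=0.1), and a drafter as slow as the target returns k=0, meaning speculation should be turned off.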

Failure modes

| Failure mode | Cause | Mitigation |
| --- | --- | --- |
| Net slowdown versus plain decode | Low acceptance: the drafter mispredicts, so rejected drafts are wasted target compute.¹ | Measure acceptance rate; switch drafter family; reduce draft depth `k`. |
| Benefit collapses under load | At large batch sizes decode is compute-bound; verify work competes with real requests, and model-based drafters add load at peak.³ | Prefer n-gram/suffix at saturation, or gate speculation above a batch-size threshold. |
| Low acceptance on some traffic | High-entropy, creative, low draft–target overlap.¹ | Measure per traffic mix before committing; n-gram will not help creative text, consider EAGLE or skip. |
| Output distribution drifts (not lossless) | Draft and target sampling settings (temperature, top-p) are not aligned.²¹ | Re-verify distribution preservation on any sampling change; keep draft and target sampling aligned. |
| Speedup evaporates | Drafter does not share the target's tokenizer/vocabulary, or is not `~4×`+ faster.¹ | Enforce shared vocab; require a `~4×` speed margin over the target. |
| EAGLE misconfiguration | Draft head launched with tensor parallelism > 1.³ | Set `draft_tensor_parallel_size=1` on the draft even when the target is tensor-parallel. |
| Throughput regression from the n-gram path | SGLang `NGRAM` is CUDA-only and disables the overlap scheduler and mixed chunked prefill; TensorRT-LLM NGram needs `disable_overlap_scheduler=True`.⁴⁵ | Confirm it fits the throughput budget before enabling. |
| Structured output violated | The constraint token mask is applied only during drafting, not during verification.¹ | Apply the grammar/JSON mask on the verify pass too. |
| Deeper drafts waste compute | Raising `num_speculative_tokens` past what acceptance supports: block efficiency saturates.²¹ | Tune `k` to measured acceptance (see How to scale it). |

References

Chris Fregly, "AI Systems Performance Engineering" (O'Reilly), Chapter 15: "Speculative Decoding and Parallel Token Generation Techniques" (two-model draft+verify, EAGLE/EAGLE-2/EAGLE-3, self-speculative decoding, Medusa); Chapter 1: Grace CPU co-running draft models in the NVL72.
Leviathan, Kalman, Matias, "Fast Inference from Transformers via Speculative Decoding," ICML 2023 (rejection-sampling accept rule min(1, p/q), exact target-distribution preservation): https://arxiv.org/abs/2211.17192. Chen et al., "Accelerating Large Language Model Decoding with Speculative Sampling": https://arxiv.org/abs/2302.01318
vLLM, "Speculative Decoding" (methods, speculative_config, method/num_speculative_tokens/prompt_lookup_*): https://docs.vllm.ai/en/latest/features/speculative_decoding/. "EAGLE Draft Models" (draft_tensor_parallel_size=1, eagle/eagle3): https://docs.vllm.ai/en/latest/features/speculative_decoding/eagle/
SGLang, "Speculative Decoding" (--speculative-algorithm EAGLE/EAGLE3/NGRAM, --speculative-draft-model-path, --speculative-num-steps, --speculative-eagle-topk, --speculative-num-draft-tokens): https://docs.sglang.ai/advanced_features/speculative_decoding.html
TensorRT-LLM, "Speculative Decoding" (draft-target, EAGLE/EAGLE-3, Medusa, NGram, Lookahead, MTP; NGramDecodingConfig, EagleDecodingConfig, ...): https://nvidia.github.io/TensorRT-LLM/latest/features/speculative-decoding.html and LLM API reference: https://nvidia.github.io/TensorRT-LLM/latest/llm-api/reference.html
Li et al., "EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test," NeurIPS 2025: https://arxiv.org/abs/2503.01840. SafeAILab EAGLE (1/2/3): https://github.com/SafeAILab/EAGLE
Cai et al., "Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads": https://arxiv.org/abs/2401.10774
NVIDIA Technical Blog, "An Introduction to Speculative Decoding for Reducing Latency in AI Inference": https://developer.nvidia.com/blog/an-introduction-to-speculative-decoding-for-reducing-latency-in-ai-inference/

¹ Chris Fregly, "AI Systems Performance Engineering" (O'Reilly), Ch. 15: "Speculative Decoding and Parallel Token Generation Techniques." Draft proposes k tokens "speculatively beyond the current context"; target "validates the draft tokens by predicting next-token probabilities for the entire k-token sequence in a single batch," which "increases arithmetic intensity." "In theory, it provides a theoretical k× speedup ... In practice, with overhead and occasional speculative-token rejections, the gain is more like a 2× speedup." Draft "must use the same tokenizer and vocabulary" and be "much faster than the large model — typically by a factor of 4× or more." "Under the standard speculative decoding acceptance procedure, the target model's output distribution is preserved." EAGLE "~3.5× speedup over vanilla decoding for a 4-token draft"; EAGLE-2 "20%–40% faster than EAGLE"; EAGLE-3 "up to 6.5× speedups over non-optimized baseline." Medusa "~2.2–3.6× speedups ... in practice ... about a 2–3× speedup." Self-speculative decoding runs "only half of its layers" as the draft. Ch. 16: speculative decoding "adds a draft model ... typically reserved for extreme cases such as ultralong contexts or erratic latency variances," while "sparsity, batching, and disaggregation, deliver the bulk of benefits in production."
² Leviathan, Kalman, Matias, "Fast Inference from Transformers via Speculative Decoding," ICML 2023, https://arxiv.org/abs/2211.17192; Chen et al., https://arxiv.org/abs/2302.01318. Each draft token is accepted with probability min(1, p(x)/q(x)); on rejection it is resampled from norm(max(0, p−q)), plus a bonus token when all are accepted. The procedure yields output distributed identically to sampling from the target model alone. Expected tokens per target forward for acceptance α and γ draft tokens is (1 − α^(γ+1)) / (1 − α).
³ vLLM, "Speculative Decoding," https://docs.vllm.ai/en/latest/features/speculative_decoding/ and EAGLE guide, https://docs.vllm.ai/en/latest/features/speculative_decoding/eagle/. Model-based methods (EAGLE/MTP/draft model) give the best latency reduction; n-gram and suffix decoding give modest speedups "without increasing workload during peak traffic." speculative_config keys: method, num_speculative_tokens (required), model (required for draft_model/eagle/eagle3/medusa), prompt_lookup_min/prompt_lookup_max (ngram). EAGLE draft must run with draft_tensor_parallel_size=1.
⁴ SGLang, "Speculative Decoding," https://docs.sglang.ai/advanced_features/speculative_decoding.html. --speculative-algorithm EAGLE (and EAGLE3, NGRAM); example: --speculative-draft-model-path lmsys/sglang-EAGLE-llama2-chat-7B --speculative-num-steps 3 --speculative-eagle-topk 4 --speculative-num-draft-tokens 16. EAGLE3 recommended for best speed/quality; the NGRAM path is CUDA-only and disables the overlap scheduler and mixed chunked prefill.
⁵ TensorRT-LLM, "Speculative Decoding," https://nvidia.github.io/TensorRT-LLM/latest/features/speculative-decoding.html, and LLM API reference, https://nvidia.github.io/TensorRT-LLM/latest/llm-api/reference.html. Supports Draft-Target, NGram, Medusa, ReDrafter, EAGLE/EAGLE-2/EAGLE-3, Lookahead, MTP. speculative_config accepts typed configs incl. NGramDecodingConfig(max_draft_len=..., max_matching_ngram_size=..., is_public_pool=...), EagleDecodingConfig, Eagle3DecodingConfig, MedusaDecodingConfig, MTPDecodingConfig, LookaheadDecodingConfig; the NGram path uses disable_overlap_scheduler=True.
⁶ Li et al., "EAGLE-3," NeurIPS 2025, https://arxiv.org/abs/2503.01840; SafeAILab implementation, https://github.com/SafeAILab/EAGLE. EAGLE-3 fuses low/mid/high-layer features and predicts tokens directly; reported up to ~6.5× over an unoptimized baseline and ~20–40% over EAGLE-2. Independent reports (NVIDIA, vendor blogs) cite ~4–6× for 70B-class models; the 6.5× figure is a favorable upper bound, not a typical production number.
⁷ Cai et al., "Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads," https://arxiv.org/abs/2401.10774. Extra decoding heads on the target emit a tree of candidates verified with tree attention; requires training the heads. Reported ~2.2–3.6× (Medusa-1/Medusa-2).
⁸ Chris Fregly, "AI Systems Performance Engineering" (O'Reilly), Ch. 1: "leveraging the Grace CPU in the NVL72 for preprocessing, co-running smaller 'draft' models for high-performance inference algorithms such as speculative decoding." Ch. 2: Grace CPU and Blackwell GPU are cache-coherent over NVLink-C2C at up to ~900 GB/s, sharing one address space.