Evaluating speculative decoding: SPEED-Bench and the data-dependence problem
Scope: measuring speculative decoding (SD) instead of implementing it, through SPEED-Bench (arXiv 2604.09557), the NVIDIA benchmark suite built on the observation that SD performance is data-dependent in a way deterministic system optimizations are not. This page covers why acceptance rates move with text domain and entropy, what existing benchmarks miss, the two SPEED-Bench data splits and the unified measurement framework, the standardized metrics (conditional acceptance rate, acceptance length, throughput, User TPS), and the empirical findings that change deployment decisions: batch-size-dependent optimal draft lengths, synthetic-input inflation, vocabulary-pruning side effects, and training-ISL limits. The SD methods themselves (draft models, EAGLE, Medusa, MTP, n-gram) are covered in speculative decoding; general benchmark anatomy in LLM benchmarks; the CI gate that consumes results in the evaluation harness.
All benchmark numbers on this page are from the SPEED-Bench paper (v2, May 2026), measured on B200-class hardware with production engines; they were not reproduced here. Harness invocations are reference templates; the numpy example is executed and asserted, and implements the paper's Eq. 1 (acceptance length) and Eq. 3 (speedup proxy) plus a first-order latency model.
```mermaid
flowchart TB
    subgraph DATA["SPEED-Bench dataset (HF: nvidia/SPEED-Bench)"]
        Q["Qualitative split: 880 samples, 11 categories,<br/>diversity-maximized from 18 sources"]
        T["Throughput split: ISL buckets 1k-32k,<br/>3 entropy tiers, 512 samples each"]
    end
    Q --> FW["Unified measurement framework<br/>(thin async client, external tokenization)"]
    T --> FW
    METH["SD methods: N-Gram, Vanilla SD,<br/>EAGLE3, native MTP"] --> FW
    ENG["Engines: vLLM, SGLang, TensorRT-LLM,<br/>SpecBench models"] --> FW
    FW --> OUT["Standard metrics: conditional AR, AL,<br/>output TPS, User TPS, TTFT"]
    OUT --> PARETO["Throughput-latency Pareto curves<br/>per BS, DL, ISL"]
```
What it is
SPEED-Bench (Speculative Evaluation Dataset) is a benchmark suite and measurement framework for speculative decoding, built by the TensorRT-LLM group at NVIDIA (the paper discloses the conflict of interest).[1] Its premise: unlike deterministic system optimizations, SD performance depends on the input distribution. A drafter that reaches high acceptance on coding prompts can approach uselessness on roleplay text, so a benchmark that undersamples the semantic space measures the benchmark, not the method.
The suite has three parts:

- The Qualitative split measures speculation quality: 880 samples in 11 categories (Coding, Humanities, Math, Multilingual, QA, RAG, Reasoning, Roleplay, STEM, Summarization, Writing), curated from 18 public sources by a selection algorithm that embeds candidate prompts with text-embedding-3-large and greedily minimizes total pairwise cosine similarity with local swap refinement. This cuts average intra-category similarity by 40% versus SpecBench (83% in Multilingual); about 20% of samples are multi-turn (two to five turns), and samples average roughly 650 generated tokens (GPT-4).[2]
- The Throughput split measures system efficiency: real prompts aggregated into fixed input-sequence-length buckets (1k, 2k, 8k, 16k, 32k tokens under the o200k_base tokenizer) across three domain-entropy tiers (Low, Mixed, High), 512 samples per tier per bucket, enough volume to draw stable throughput-latency Pareto curves at batch sizes into the hundreds.
- The measurement framework is a thin asynchronous client that normalizes external factors (tokenization, chat templates, BOS handling) so vLLM, SGLang, TensorRT-LLM, and SpecBench-hosted models all process identical token sequences, while deliberately preserving engine-internal differences (kernels, scheduling, continuous batching), because deployment viability is the thing being measured.
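The Qualitative split's selection step (greedy minimization of total pairwise cosine similarity, then local swap refinement) can be sketched as below. This is a minimal illustration, not the paper's implementation: random vectors stand in for text-embedding-3-large embeddings, and the function names, the seeding rule, and the swap schedule are all assumptions made for the example.

```python
import numpy as np

def cosine_sim_matrix(emb: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between row vectors."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return unit @ unit.T

def total_pairwise_sim(sim: np.ndarray, idx: list) -> float:
    """Sum of similarities over all unordered pairs of selected items."""
    sub = sim[np.ix_(idx, idx)]
    return float((sub.sum() - np.trace(sub)) / 2.0)

def select_diverse(sim: np.ndarray, k: int, max_passes: int = 10) -> list:
    """Greedy selection followed by single-item swap refinement (hypothetical helper)."""
    # Seed choice is an assumption: start from the globally least-similar item.
    chosen = [int(np.argmin(sim.sum(axis=1)))]
    # Greedy phase: add the candidate with the smallest summed similarity to the set.
    while len(chosen) < k:
        cost = sim[:, chosen].sum(axis=1)
        cost[chosen] = np.inf
        chosen.append(int(np.argmin(cost)))
    # Refinement phase: replace a member whenever an outsider lowers the total.
    for _ in range(max_passes):
        improved = False
        for pos in range(k):
            rest = chosen[:pos] + chosen[pos + 1:]
            cost = sim[:, rest].sum(axis=1)
            cost[rest] = np.inf
            best = int(np.argmin(cost))
            if cost[best] < cost[chosen[pos]] - 1e-12:
                chosen[pos] = best
                improved = True
        if not improved:
            break
    return chosen

rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 16))        # stand-in embeddings, not real prompt vectors
sim = cosine_sim_matrix(emb)
picked = select_diverse(sim, k=20)
random_mean = float(np.mean([
    total_pairwise_sim(sim, list(rng.choice(200, 20, replace=False)))
    for _ in range(50)
]))
print(f"selected total similarity: {total_pairwise_sim(sim, picked):.3f}")
print(f"random-subset mean:        {random_mean:.3f}")
```

Each swap strictly lowers the total pairwise similarity, so the refinement terminates; it reaches a local optimum only, which is why the greedy pass is used to start from a reasonable set.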
Why use it
- Cross-method comparisons on common ground. Published SD methods validate on inconsistent datasets; MT-Bench-derived suites carry as few as 10 samples per category, and SpecBench's multilingual set is entirely German-to-English translation prompts (about 15% of its data). Small, homogeneous categories create statistical noise in which methods look interchangeable; on SPEED-Bench's larger, diverse splits the expected gaps reappear, such as external drafters beating EAGLE3 at long draft lengths.[4]
- It measures the regime you deploy in. Research evaluations dominated by batch size 1 on HuggingFace-level runtimes miss that production serving batches requests, shifting decode toward compute-bound and often shrinking or inverting SD gains. SPEED-Bench reports at realistic concurrency on production engines: at batch size 32 with draft length 3, N-Gram speculation is a net slowdown (0.88x on Llama 3.3 70B, 0.29x on GPT-OSS 120B), and Vanilla SD earns 1.60x where EAGLE3 earns 1.90x despite identical mean acceptance length (2.44), because the external draft model costs real time.[3]
- It exposes side effects other benchmarks mask. EAGLE3's vocabulary pruning (typically to 32k tokens) is nearly free on Math and Coding but drops accuracy sharply on Multilingual (where about 22% of target tokens fall outside the pruned vocabulary), RAG, and Summarization; only a broad-coverage suite surfaces that trade.[5]
- It quantifies the synthetic-input trap. Random-token load generation, standard for autoregressive throughput testing, overestimates SD throughput by an average of 23%: noise prompts trigger either trivial responses (inflated acceptance) or topic latching (deflated acceptance), and on MoE targets they collapse expert routing enough to distort even the no-SD baseline.[6]
When to use it (and when not)
- Use it before enabling SD in production: measure acceptance length on the categories closest to your traffic and read speedups at your actual batch-size and ISL regime, not the paper-default batch size 1.
- Use the Throughput split's proxy protocol when your domain is niche: measure step latencies once on a representative bucket, measure acceptance length on a small set of in-domain prompts, and combine them analytically (Eq. 3 below) instead of building a large custom benchmark.[8]
- Use it for drafter regression testing: the training-ISL experiments show public EAGLE3 checkpoints for GPT-OSS 120B degrading badly beyond their training context; the fixed ISL buckets catch that class of failure, and inference-time YaRN scaling recovered much of the loss for drafters trained at 2k or more.[9]
- Do not use it as a capability benchmark. It measures speculation efficiency, not answer quality; a drafter change never needs an accuracy eval to pass SPEED-Bench, which is why it complements rather than replaces the eval gate.
- Do not extrapolate across serving regimes. The paper's own findings forbid it: optimal draft length flips from long (memory-bound, low batch) to 1 (compute-bound, high batch), and MoE targets keep benefiting from SD at batch sizes where dense models stop.[7]
Architecture
The metrics are defined once and reused everywhere. The conditional acceptance rate AR_i is the probability that draft token i is accepted given that all draft tokens before it were accepted. The acceptance length for draft length γ is the expected number of tokens emitted per verification step, including the free token the verifier itself produces (paper Eq. 1):

$$
\mathrm{AL}_{\gamma} = 1 + \sum_{i=1}^{\gamma} \prod_{j=1}^{i} \mathrm{AR}_j \tag{1}
$$
System efficiency is output TPS (total tokens per second across concurrent requests) and User TPS (per-request tokens per second, the latency proxy), with TTFT recorded for completeness. The framework computes acceptance rates from streaming response chunks (a multi-token chunk is a successful speculation step) and timestamps every chunk for step latency. Speedup decomposes into a system part (the step-latency ratio) and a data part (the acceptance length), as in paper Eq. 3:

$$
S = \frac{t_{ar}}{t_{sd}} \cdot \mathrm{AL} \tag{3}
$$
where t_ar and t_sd are per-step latencies of autoregressive and speculative decoding. The paper validates this proxy: step latencies measured on the Mixed tier combined with per-domain acceptance lengths closely match directly measured end-to-end speedups across 1k/2k/8k buckets (Llama 3.3 70B, batch 16, draft length 3).[8] Known limitation: the asyncio client runs into Python's GIL above batch size 256, which the authors flag as a current fidelity boundary.
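The chunk-based accounting can be made concrete with a small sketch. It assumes one streamed chunk per verification step and a hypothetical record format of `(arrival_time_seconds, tokens_in_chunk)`; real engines may coalesce or split chunks, and the decode-phase rate below is only one plausible way to read per-request TPS. The point it illustrates is that the mean chunk size and Eq. 1 applied to empirical conditional acceptance rates give the same acceptance length.

```python
import numpy as np

def metrics_from_chunks(chunks, draft_len: int) -> dict:
    """Per-request SD metrics from (arrival_time_s, n_tokens) chunk records.

    Assumes every chunk is one verification step that emits 1 + accepted
    draft tokens (hypothetical record format for this sketch).
    """
    times = np.array([t for t, _ in chunks], dtype=float)
    sizes = np.array([n for _, n in chunks], dtype=int)
    assert np.all((sizes >= 1) & (sizes <= draft_len + 1))
    accepted = sizes - 1  # draft tokens accepted in each step

    # Empirical conditional AR_i = P(draft i accepted | drafts 1..i-1 accepted).
    cond_ars = []
    for i in range(1, draft_len + 1):
        reached = np.count_nonzero(accepted >= i - 1)
        passed = np.count_nonzero(accepted >= i)
        cond_ars.append(passed / reached if reached else 0.0)
    cond_ars = np.array(cond_ars)

    return {
        "cond_ars": cond_ars,
        "al_direct": float(sizes.mean()),                     # tokens per step
        "al_eq1": float(1.0 + np.cumprod(cond_ars).sum()),    # Eq. 1
        "step_latency_s": float(np.diff(times).mean()),
        # Decode-phase rate: tokens after the first chunk over elapsed time.
        "decode_tps": float(sizes[1:].sum() / (times[-1] - times[0])),
    }

# Five verification steps with draft length 3 (illustrative numbers).
chunks = [(0.00, 1), (0.03, 4), (0.06, 2), (0.09, 1), (0.12, 3)]
m = metrics_from_chunks(chunks, draft_len=3)
print(m["cond_ars"], m["al_direct"], m["al_eq1"], m["step_latency_s"])
```

The two acceptance-length values agree because the running product of conditional rates is the fraction of steps that accepted at least i drafts, and summing those fractions gives the mean number of accepted drafts per step.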
The evaluated method matrix spans N-Gram, Vanilla SD (Llama 3.2 1B drafting for Llama 3.3 70B; Qwen3 0.6B for Qwen3 235B), EAGLE3 open checkpoints, and native MTP heads (Qwen3-Next; DeepSeek R1 has a native MTP head but is evaluated here only under Vanilla SD and EAGLE3, not native MTP), using draft chains rather than tree verification because chains remain the production standard at batch sizes above 1.[1]
How to use it
Reference template (unexecuted; the dataset is public, the framework ships with the paper's supplementary material; verify current entry points on the dataset card as of 2026-07):
```python
# Reference template: pull the two SPEED-Bench splits from Hugging Face.
from datasets import load_dataset

qual = load_dataset("nvidia/SPEED-Bench", "qualitative")  # 880 samples, 11 categories
tput = load_dataset("nvidia/SPEED-Bench", "throughput")   # ISL buckets 1k-32k, 3 tiers
# categories carry subcategory, multiturn, and difficulty metadata for slicing
```
The executed example below implements the arithmetic the benchmark standardizes, and demonstrates the page's core claim in miniature: two workloads that differ only in acceptance rate have different optimal draft lengths, so a single-workload benchmark cannot pick the right configuration for mixed traffic.
```python
# sd_eval_model.py - validated: the acceptance-length and speedup arithmetic that
# makes speculative-decoding evaluation data-dependent. Analytical model (paper
# Eq. 1 and Eq. 3 plus a first-order step-latency model), not a benchmark rerun.
import numpy as np


def acceptance_length(cond_ars: np.ndarray) -> float:
    """Paper Eq. 1: AL = 1 + sum_i prod_{j<=i} AR_j over the draft chain."""
    assert np.all((cond_ars >= 0.0) & (cond_ars <= 1.0))
    return float(1.0 + np.cumprod(cond_ars).sum())


def al_constant(a: float, dl: int) -> float:
    """Closed form of Eq. 1 for a constant conditional AR: (1 - a^(DL+1)) / (1 - a)."""
    assert 0.0 <= a <= 1.0 and dl >= 1
    if a == 1.0:
        return float(dl + 1)
    return float((1.0 - a ** (dl + 1)) / (1.0 - a))


def speedup(a: float, dl: int, c_draft: float, c_verify: float) -> float:
    """Paper Eq. 3, S = (t_ar * AL) / t_sd, with a first-order step model:
    t_sd = (c_draft * DL + c_verify) * t_ar. c_draft is drafting cost per token
    relative to one autoregressive target step; c_verify >= 1 is the verification
    step cost, above 1 once larger batches push verification compute-bound."""
    assert c_draft >= 0.0 and c_verify >= 1.0
    return al_constant(a, dl) / (c_draft * dl + c_verify)


# 1) Boundary behavior of the acceptance length.
assert al_constant(0.0, 5) == 1.0  # nothing accepted: 1 token/step
assert al_constant(1.0, 5) == 6.0  # everything accepted: DL+1
grid = [al_constant(a, 3) for a in np.linspace(0.0, 1.0, 21)]
assert all(b >= x for x, b in zip(grid, grid[1:])), "AL must be monotone in AR"

# 2) Eq. 1 with per-position conditional ARs reduces to the closed form when constant.
rng = np.random.default_rng(0)
for dl in (1, 3, 7):
    a = float(rng.uniform(0.2, 0.95))
    assert np.isclose(acceptance_length(np.full(dl, a)), al_constant(a, dl))
varying = np.array([0.9, 0.7, 0.4])  # position-dependent ARs
assert np.isclose(acceptance_length(varying), 1.0 + 0.9 + 0.9 * 0.7 + 0.9 * 0.7 * 0.4)

# 3) SD can slow decoding down: low acceptance plus compute-bound verification.
slow = speedup(a=0.15, dl=3, c_draft=0.05, c_verify=1.35)
assert slow < 1.0, f"expected a net slowdown, got {slow:.3f}"


# 4) Data dependence: the speedup-optimal draft length shifts with the workload's AR.
def best_dl(a: float, c_draft: float = 0.12, c_verify: float = 1.1) -> int:
    dls = range(1, 16)
    return max(dls, key=lambda dl: speedup(a, dl, c_draft, c_verify))


low_entropy, high_entropy = 0.9, 0.4  # coding-like vs roleplay-like
assert best_dl(low_entropy) > best_dl(high_entropy), "optimal DL must shift with AR"
assert 1 < best_dl(low_entropy) < 15, "optimum must be interior, not a grid artifact"

print(f"AL(a=0.9, DL=3) = {al_constant(0.9, 3):.3f} AL(a=0.4, DL=3) = {al_constant(0.4, 3):.3f}")
print(f"slowdown case: S = {slow:.3f}")
print(f"optimal DL at AR=0.9: {best_dl(0.9)} at AR=0.4: {best_dl(0.4)}")
print("all assertions passed")
```
Output of the run: AL(a=0.9, DL=3) = 3.439, AL(a=0.4, DL=3) = 1.624, slowdown case: S = 0.784, optimal DL at AR=0.9: 9 at AR=0.4: 2, all assertions passed. The same draft length that is optimal for a 0.9-acceptance workload leaves a 0.4-acceptance workload behind its own optimum, and with low acceptance plus a compute-bound verification penalty the model reproduces the net slowdowns the paper measures for N-Gram at batch 32.
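The executed model holds the verification cost fixed, so it shows the dependence on acceptance rate but not on batch size. A roofline-style variant, sketched below, illustrates the batch-size effect the paper reports: the optimal draft length shrinking toward 1 as serving becomes compute-bound. The step-time model and every constant in it (`flop_ratio`, `c_draft`, the 0.8 acceptance rate) are assumptions chosen for illustration, not values from the paper.

```python
def al_constant(a: float, dl: int) -> float:
    """Acceptance length for a constant conditional acceptance rate (Eq. 1)."""
    return float(dl + 1) if a == 1.0 else (1.0 - a ** (dl + 1)) / (1.0 - a)

def step_time(tokens_per_request: int, batch: int, flop_ratio: float) -> float:
    """Target step time in units of one memory-bound step (assumed roofline):
    flat while weight loading dominates, linear in tokens once compute-bound."""
    return max(1.0, batch * tokens_per_request * flop_ratio)

def speedup_at_batch(a: float, dl: int, batch: int,
                     c_draft: float = 0.05, flop_ratio: float = 1 / 64) -> float:
    """Eq. 3 with t_ar and t_sd taken from the assumed roofline model."""
    t_ar = step_time(1, batch, flop_ratio)
    # Verification scores DL drafts plus one bonus position per request.
    t_sd = c_draft * dl + step_time(dl + 1, batch, flop_ratio)
    return al_constant(a, dl) * t_ar / t_sd

def best_dl_at_batch(a: float, batch: int) -> int:
    return max(range(1, 16), key=lambda dl: speedup_at_batch(a, dl, batch))

for batch in (1, 16, 128):
    dl = best_dl_at_batch(0.8, batch)
    print(f"batch {batch:>3}: best DL = {dl}, S = {speedup_at_batch(0.8, dl, batch):.2f}")
```

Under these assumptions the best draft length falls as batch size grows, and at batch 128 even draft length 1 projects below 1.0x. Real crossover points depend on the engine, the hardware, and whether the target is dense or MoE, so they have to be measured rather than read off a model like this.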
How to develop with it
- Slice before averaging. The Qualitative split's value is its metadata (category, subcategory, multi-turn flag, difficulty); a mean acceptance length across 11 categories hides exactly the variance the suite exists to expose. The paper's own Table 1 spans mean ALs from 1.31 (N-Gram on GPT-OSS 120B) to 2.81 (Qwen3-Next MTP) with per-category spreads from roughly 1.15 to 3.34.[3]
- Prefer real prompts to synthetic load, always. Even for the autoregressive baseline on MoE targets, random tokens collapse expert routing and distort step latency; the Throughput split exists so that load tests carry real semantics.[6]
- Test the drafter beyond its training ISL. Acceptance degrades rapidly once inference ISL exceeds training ISL; sweep the 1k-32k buckets before shipping a drafter for long-context traffic, and check whether RoPE/YaRN scaling recovers it.[9]
- Mind temperature. At temperature 1, mean acceptance lengths and speedups drop across every method in Table 1 (Vanilla SD on Llama 3.3 70B falls from 1.60x to 1.15x); benchmark at your serving temperature, not only greedy decoding.[3]
How to maintain it
- Pin the dataset revision and tokenizer. ISL buckets are defined under o200k_base; recomputing buckets under a different tokenizer silently shifts every regime boundary. Treat the HF dataset revision like a model version.
- Re-baseline on engine upgrades. Engine internals are deliberately inside the measurement boundary: TensorRT-LLM's fused draft-verify CUDA graph gives it higher peak throughput, while vLLM trades peak for drafting flexibility, so an engine upgrade is a measurement change even with the model fixed.[10]
- Track drafter and target versions together. An EAGLE3 head is trained against one target checkpoint; a target upgrade invalidates prior acceptance measurements even when the API surface is unchanged.
How to run it in production
The deployment question SD raises is a gating one, and the benchmark supplies the gate structure: measure acceptance length per traffic category on the Qualitative split, measure step latencies at production batch size and ISL on the Throughput split, and only enable SD (with the batch-size-appropriate draft length) where the projected speedup from Eq. 3 clears 1 with margin on your real mix. Re-run the projection when traffic composition shifts (a new multilingual tenant is exactly the case where a pruned-vocabulary drafter quietly loses its gains[5]), and keep SD behind a per-route flag so a regression in one domain does not tax every request. Speedups feed inference SLOs through User TPS, and capacity math through output TPS (goodput); a drafter that raises throughput but lowers User TPS at high concurrency is a policy decision, not a free win.
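The gate described above reduces to a few lines of arithmetic per route. The sketch below is illustrative only: the route names, traffic shares, acceptance lengths, step latencies, and the 10% margin are hypothetical placeholders, and `Route`, `projected_speedup`, and `gate` are names invented for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Route:
    name: str
    traffic_share: float       # fraction of requests on this route
    acceptance_length: float   # measured on in-domain prompts

def projected_speedup(al: float, t_ar_ms: float, t_sd_ms: float) -> float:
    """Eq. 3: S = (t_ar * AL) / t_sd, with step latencies measured once at
    the production batch size and ISL."""
    return t_ar_ms * al / t_sd_ms

def gate(routes, t_ar_ms: float, t_sd_ms: float, margin: float = 0.10) -> dict:
    """Enable SD per route only where the projection clears 1 + margin."""
    return {
        r.name: projected_speedup(r.acceptance_length, t_ar_ms, t_sd_ms) >= 1.0 + margin
        for r in routes
    }

# Hypothetical measurements, not figures from the paper.
routes = [
    Route("coding", 0.40, 2.9),
    Route("chat", 0.30, 1.9),
    Route("roleplay", 0.20, 1.5),
    Route("multilingual", 0.10, 1.7),
]
T_AR_MS, T_SD_MS = 20.0, 32.0

decisions = gate(routes, T_AR_MS, T_SD_MS)
mix_al = sum(r.traffic_share * r.acceptance_length for r in routes)
mix_speedup = projected_speedup(mix_al, T_AR_MS, T_SD_MS)
print(decisions)
print(f"traffic-weighted projection: {mix_speedup:.3f}x")
```

With these placeholder numbers the traffic-weighted projection clears the gate while two of the four routes do not, which is the case the per-route flag exists for.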
Failure modes
- Benchmarking one domain, deploying on another. Acceptance is domain-dependent (Coding versus Roleplay spans roughly a full accepted token in Table 1); a coding-tuned rollout hitting chat traffic underperforms its benchmark silently.
- Acceptance length as a wall-clock proxy. Identical mean AL (2.44) produced 1.60x versus 1.90x for Vanilla SD versus EAGLE3 on the same target because external drafting costs latency AL does not see; never promote a drafter on AL alone.[3]
- Batch-regime mismatch. A draft length tuned at batch 1 is wrong at batch 128; the memory-bound to compute-bound transition moves the optimum toward draft length 1, and past the crossover SD can lose to plain decoding.[7]
- Synthetic-load throughput claims. Random-token benchmarks overstate SD throughput by about 23% on average and misstate MoE baselines; treat any SD number measured on noise as an upper bound.[6]
- Vendor numbers on cherry-picked workloads. A method validated on 10-sample categories or translation-only multilingual sets inherits those artifacts; check what the evaluation data actually contained before comparing headline speedups.[4]
- Draft/target version skew. Publicly available EAGLE3 checkpoints for the same target differed materially in long-ISL behavior; provenance and training configuration of the drafter are part of the benchmark result.[9]
References
- Abramovich et al., SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding (arXiv 2604.09557): https://arxiv.org/abs/2604.09557
- Hugging Face paper page: https://huggingface.co/papers/2604.09557
- SPEED-Bench dataset (Hugging Face): https://huggingface.co/datasets/nvidia/SPEED-Bench
- Xia et al., SpecBench (Unlocking Efficiency in Large Language Model Inference, arXiv 2401.07851): https://arxiv.org/abs/2401.07851
- Leviathan et al., Fast Inference from Transformers via Speculative Decoding (arXiv 2211.17192): https://arxiv.org/abs/2211.17192
- Chen et al., Accelerating Large Language Model Decoding with Speculative Sampling (arXiv 2302.01318): https://arxiv.org/abs/2302.01318
- Li et al., EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty (arXiv 2401.15077): https://arxiv.org/abs/2401.15077
- Cai et al., Medusa: Simple LLM Inference Acceleration with Multiple Decoding Heads (arXiv 2401.10774): https://arxiv.org/abs/2401.10774
Related: Speculative decoding · LLM benchmarks · LLM evaluation harness · Continuous batching internals · Inference serving · SLOs: inference serving · Goodput · Constrained decoding
1. SPEED-Bench (arXiv 2604.09557v2, May 2026), NVIDIA: Qualitative split (880 samples, 11 categories, 18 sources), Throughput split (ISL buckets 1k/2k/8k/16k/32k under o200k_base, Low/Mixed/High entropy, 512 samples per tier per bucket), thin async measurement framework over vLLM, SGLang, TensorRT-LLM, and SpecBench; methods N-Gram, Vanilla SD, EAGLE3, native MTP on Llama 3.3 70B, GPT-OSS 120B, Qwen3 235B, Qwen3-Next, DeepSeek R1; draft chains, single B200 except eight for DeepSeek/Qwen inference and GPT-OSS EAGLE3 training.
2. Selection algorithm: text-embedding-3-large embeddings, greedy minimization of total pairwise cosine similarity with local swap refinement; 40% average similarity reduction versus SpecBench (83% Multilingual); about 20% multi-turn (2-5 turns); difficulty metadata for Coding/Humanities/Math/STEM at about 80% hard; mean about 650 generated tokens (GPT-4).
3. Paper Table 1 (Qualitative split, batch size 32, draft length 3, temperature 0): mean AL / mean speedup: Llama 3.3 70B N-Gram 1.41 / 0.88x, Vanilla 2.44 / 1.60x, EAGLE3 2.44 / 1.90x; GPT-OSS 120B N-Gram 1.31 / 0.29x, EAGLE3 2.25 / 1.34x, MTP 2.55 / 1.45x; DeepSeek R1 Vanilla 2.43 / 1.17x, EAGLE3 2.22 / 1.33x; Qwen3 235B EAGLE3 2.36 / 1.06x (SGLang); Qwen3-Next MTP 2.81 / 1.20x (SGLang). At temperature 1 all means drop (e.g. Llama Vanilla 1.98 / 1.15x). Low-entropy domains (Coding, Math) exceed high-entropy (Roleplay) throughout.
4. SpecBench sources most categories from MT-Bench: as few as 10 samples per category, two turns max, mean ISL under 100 tokens; multilingual is WMT14 DE-EN translation prompts, about 15% of the dataset. On SPEED-Bench, external drafters show their expected advantage over EAGLE3 at draft length 7, which SpecBench's small categories obscure.
5. EAGLE3 vocabulary pruning (typically 32k tokens): about 22% of Multilingual target tokens fall outside the pruned vocabulary; large AL degradation in Multilingual, RAG, Summarization; negligible in Math and Coding (GPT-OSS 120B, draft length 3).
6. Random-token inputs overestimate SD throughput by 23% on average versus the Throughput split (GPT-OSS 120B, EAGLE3, TensorRT-LLM, 8k ISL, batch 1-128); failure modes are trivial responses (inflated AR) and topic latching (deflated AR); on MoE targets random inputs also collapse expert routing, distorting even no-SD step latency.
7. Optimal draft length is concurrency-dependent: longer drafts win in the memory-bound low-batch regime; near compute-bound high batch, draft length 1 wins and N-Gram turns into a net slowdown. MoE targets stay SD-friendly at high batch because sparse activation delays the compute-bound transition.
8. Speedup proxy S = (t_ar x AL) / t_sd (Eq. 3): step latencies are system-determined (engine, ISL, batch, hardware), AL is domain-determined; projections from Mixed-tier latencies plus per-domain ALs match measured end-to-end speedups across 1k/2k/8k buckets (Llama 3.3 70B, batch 16, draft length 3, EAGLE3 and Vanilla SD).
9. Both public EAGLE3 checkpoints for GPT-OSS 120B degrade at high ISLs; custom drafters trained at 1k/2k/4k degrade once inference ISL exceeds training ISL; inference-time YaRN scaling recovers substantial accuracy for drafters trained at 2k and above.
10. TensorRT-LLM reaches higher peak throughput via a unified CUDA graph over the draft-verification loop; vLLM's multi-engine design carries host-communication overhead but more flexibility for dynamic drafting.