LLM benchmarks: anatomy, metrics, and saturation

Scope: what a capability benchmark actually measures and how to read it without being misled. This page covers the anatomy of a benchmark (task, format, grader, metric), why and when to use public benchmarks, how to score them (with a runnable pass@k estimator and a harness invocation), how to choose and build one, and how to evaluate cheaply as models saturate. It is the capability-measurement companion to the evaluation harness that runs these tests in CI, the anti-gaming page on contamination, and agent evaluation for interactive tasks.

Benchmark scores quoted here illustrate published results; verify current numbers and dataset versions before citing them. The Python example runs as written, checks itself with assertions, and needs only numpy; the harness command is a reference template.

flowchart LR
  TASK["Task item"] --> FMT["Format: multiple-choice / open / pairwise"]
  FMT --> RUN["Model under test"]
  RUN --> GRADE["Grader: string-match / verifier / LLM-judge"]
  GRADE --> METRIC["Metric: accuracy, pass@k, Elo, calibration"]
  METRIC --> READ["Read with care: saturation, contamination, discrimination"]

What it is

A benchmark is a dataset of task items plus a scoring rule. Three choices decide what it really measures:

Format. Multiple-choice is trivial to grade by string match but rewards elimination and leaks signal through option patterns. Open-ended generation is realistic but needs a verifier or a judge. Pairwise comparison ranks two answers, usually with an LLM judge, and is the basis of preference leaderboards.
Grader. Deterministic checkers (exact match, a math verifier, unit tests) are reproducible but need a ground truth and are brittle to formatting. Model-based graders (LLM-as-judge) scale to open-ended answers but must be calibrated against humans and can be gamed (anti-gaming).
Metric. How item scores aggregate into a number, and how comparable that number is across models and versions.
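The brittleness of deterministic graders to formatting can be made concrete with a small sketch. The normalization rules below (lowercasing, stripping punctuation and English articles) are illustrative assumptions, not the rules of any particular benchmark, and the function names are hypothetical.

```python
# Sketch of a deterministic string-match grader; the normalization rules are
# illustrative assumptions, not taken from any specific benchmark.
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str, normalized: bool = True) -> bool:
    """Compare a model answer to the ground truth, optionally after normalizing."""
    if normalized:
        return normalize(prediction) == normalize(reference)
    return prediction == reference

def accuracy(predictions, references, normalized=True):
    """Fraction of items graded correct."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    hits = sum(exact_match(p, r, normalized) for p, r in zip(predictions, references))
    return hits / len(references)

preds = ["The Eiffel Tower.", "paris", "42 "]
refs = ["Eiffel Tower", "Paris", "42"]
print("strict:", accuracy(preds, refs, normalized=False))
print("normalized:", accuracy(preds, refs, normalized=True))
```

The same three answers score very differently under the two graders, which is why the grader belongs in the benchmark definition alongside the dataset. Normalization is itself a choice: stripping punctuation, for instance, would also collapse "3.14" and "314".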

Why use it

Calibrate where a model stands. Public benchmarks place a model against known references on a shared, documented task, which no private vibe-check can do.
Catch regressions. A fixed benchmark run in CI turns "did this change hurt reasoning" into a number that gates a merge (eval harness).
Compare candidates. A shared metric (accuracy, pass@k, Elo) lets you rank base models, checkpoints, or quantizations on the axis you care about.

When to use it (and when not)

Use a public benchmark to calibrate against the field and to track a capability over time, matching the format to the use case (generative benchmark for a generative product, not multiple-choice).
Prefer pass@k over pass@1 when coverage matters (can the model ever solve it, for example with sampling or best-of-n); prefer pass@1 when reliability on the first try is what ships.
Do not treat a public benchmark as your eval. A saturated or contaminated benchmark stops discriminating; build your own held-out set from real failures for anything you deploy (agent evaluation).
Do not compare across incompatible setups. Different prompts, few-shot counts, or graders make two "MMLU" numbers non-comparable; pin the harness and its config.
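One way to enforce the last point is to fingerprint the evaluation setup and refuse to compare scores whose fingerprints differ. This is a minimal sketch: `EvalConfig`, its fields, and the placeholder values are hypothetical and not part of any harness.

```python
# Sketch: fingerprint an eval setup so only like-for-like scores are compared.
# EvalConfig and its fields are hypothetical; the values are placeholders.
import hashlib
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class EvalConfig:
    task: str
    task_revision: str
    harness_version: str
    num_fewshot: int
    prompt_template: str
    grader: str

    def fingerprint(self) -> str:
        """Short, stable hash over every field that can change the score."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]

def comparable(a: EvalConfig, b: EvalConfig) -> bool:
    """Two scores are comparable only if they came from the same setup."""
    return a.fingerprint() == b.fingerprint()

zero_shot = EvalConfig("mmlu", "rev-A", "0.0.0", 0, "Q: {question}\nA:", "exact_match")
five_shot = EvalConfig("mmlu", "rev-A", "0.0.0", 5, "Q: {question}\nA:", "exact_match")
print(comparable(zero_shot, zero_shot), comparable(zero_shot, five_shot))
```

Storing the fingerprint next to every reported number makes an accidental zero-shot versus five-shot comparison visible at a glance.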

Architecture

The read pipeline is task, then format, then grader, then metric, then interpretation. Each item runs through the model, a grader scores the output (deterministic, model-based, or human), and scores aggregate into a metric that must then be read against saturation, contamination, and discrimination. The grader and metric choices are where most benchmark errors enter, which is why they are explicit stages above rather than an afterthought.

How to use it

The single most useful metric to implement correctly is pass@k: the probability that at least one of k samples is correct. Estimating it by drawing exactly k samples and checking them is high-variance. The unbiased estimator from the HumanEval paper (Chen et al.) instead draws n >= k samples per problem, counts the c that are correct, computes pass@k = 1 - C(n-c, k) / C(n, k), and averages that value over problems. This runnable version is executed and asserted (boundaries, pass@1 = c/n, monotonic in k):

# passk.py: unbiased pass@k estimator (HumanEval paper); numpy only.
import numpy as np

def pass_at_k(n, c, k):
    """Estimate P(>=1 of k samples correct) from n samples with c correct."""
    if n - c < k:                                   # fewer than k wrong -> a correct one is certain
        return 1.0
    # Product form of 1 - C(n-c, k) / C(n, k); avoids large binomials.
    # float() returns a plain Python float so the printed dict stays readable.
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

assert pass_at_k(10, 0, 5) == 0.0                   # no correct sample -> 0
assert pass_at_k(10, 3, 20) == 1.0                  # k > n-c -> certain
assert abs(pass_at_k(10, 1, 1) - 0.1) < 1e-12       # pass@1 == c/n
vals = [pass_at_k(10, 2, k) for k in (1, 2, 5, 10)]
assert all(a <= b + 1e-12 for a, b in zip(vals, vals[1:]))   # monotonic in k
print({k: round(v, 3) for k, v in zip((1, 2, 5, 10), vals)}) # -> {1: 0.2, 2: 0.378, 5: 0.778, 10: 1.0}
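A benchmark-level pass@k is the mean of the per-problem estimates. The sketch below uses the binomial form directly via `math.comb` as a cross-check on the product form. The per-problem counts are made-up illustration data, and unlike the version above this variant rejects k > n rather than returning 1.0.

```python
# Sketch: benchmark-level pass@k as the mean of per-problem estimates.
# The (n, c) counts are made-up illustration data.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased per-problem estimate: 1 - C(n-c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate becomes exactly 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(counts, k: int) -> float:
    """Average the per-problem estimates over all problems."""
    return sum(pass_at_k(n, c, k) for n, c in counts) / len(counts)

# One (n samples, c correct) pair per problem.
counts = [(10, 0), (10, 2), (10, 10)]
for k in (1, 5):
    print(k, round(benchmark_pass_at_k(counts, k), 3))
```

Averaging per-problem estimates, rather than pooling all samples into one (n, c) pair, keeps one easy problem with many correct samples from masking problems the model never solves.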

In practice you rarely hand-roll a whole benchmark run; a harness does it reproducibly. A widely used one is EleutherAI's lm-evaluation-harness, invoked here against a Hugging Face model with an explicit task list and few-shot count:

# Reference template: pin the harness version and task revisions for comparable numbers.
lm_eval --model hf --model_args pretrained=Qwen/Qwen3-8B \
  --tasks mmlu,gpqa,ifeval --num_fewshot 0 --batch_size auto

How to develop with it

Choosing a benchmark means matching it to the capability and checking headroom:

Benchmark	Format	Measures	Note
MMLU / MMLU-Pro	multiple-choice	broad knowledge	original largely saturated; Pro expands to 10 answer options and adds reasoning-heavy items
GPQA	multiple-choice	expert reasoning	"Google-proof"; domain experts 65% (74% after discounting clear mistakes) vs skilled non-experts 34%
BIG-Bench Hard	mixed	diverse hard tasks	largely saturated; refresh or replace
IFEval	generative	instruction following	programmatically verifiable constraints, no judge
AlpacaEval / Arena	pairwise	open-ended quality	LLM judge (length-controlled) or human Elo
HumanEval / math	generative	code / math	graded by tests or a verifier; pass@k

Building one that stays useful takes four steps. Start from a domain taxonomy so coverage is diverse and per-category scores localize failures. Prefer realistic generative tasks over multiple-choice when the use case is generative. Source expert-curated or competition items and audit them for label errors: an estimated 6.49% of MMLU questions contain errors, and fixing labels shifts rankings. Finally, filter for difficulty and discrimination with a model in the loop, and have humans verify the result.
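The difficulty and discrimination filter can be sketched with classical item statistics computed from a models-by-items correctness matrix. This is one simple approach under stated assumptions, not a prescribed method: the matrix, the thresholds, and the function names are all made up for illustration.

```python
# Sketch: classical item statistics from a models-by-items correctness matrix.
# The matrix and thresholds are made-up illustration values.
import numpy as np

def item_stats(correct):
    """correct[m, i] is 1 if model m answered item i correctly, else 0."""
    correct = np.asarray(correct, dtype=float)
    solved = correct.mean(axis=0)                   # fraction of models solving each item
    totals = correct.sum(axis=1)                    # each model's total score
    discrimination = np.zeros(correct.shape[1])
    for i in range(correct.shape[1]):
        rest = totals - correct[:, i]               # score on all other items
        if correct[:, i].std() > 0 and rest.std() > 0:
            discrimination[i] = np.corrcoef(correct[:, i], rest)[0, 1]
    return solved, discrimination

def keep_items(correct, min_disc=0.2, max_solved=0.95):
    """Indices of items that still separate models and are not saturated."""
    solved, discrimination = item_stats(correct)
    return np.flatnonzero((discrimination >= min_disc) & (solved <= max_solved))

# Rows are models ordered weak to strong; columns are items.
correct = [
    [1, 0, 0, 0, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 0, 1, 1],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0],
]
solved, disc = item_stats(correct)
print("solved:", solved.round(2))
print("discrimination:", disc.round(2))
print("kept:", keep_items(correct))
```

Item 0 is solved by every model and carries no signal. Item 4 correlates negatively with the rest of the test, which is often a hint of a label error worth a human look rather than grounds for silent removal.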

How to maintain it

Benchmarks decay, so maintenance is continuous. Refresh or replace items as top models cluster near the ceiling (saturation), re-audit labels periodically, since a few percent of wrong answers can reshuffle a leaderboard, and defend against contamination with freshly collected items, canary strings, and private test splits (training-data curation, anti-gaming). If you evaluate on an Item Response Theory subset to cut cost (next section), refit it as capability moves, because a subset fit to one model population can mis-rank a model far outside it.

How to run it in production

Wire the benchmark into the CI eval gate so a regression blocks a merge, pinning the harness version and task revisions so numbers stay comparable across runs (eval harness). Keep a private hold-out that never enters training, track the per-domain taxonomy breakdown rather than a single composite, and report accuracy with a confidence interval. To run large suites cheaply, evaluate on a core subset chosen with Item Response Theory (IRT), which models the probability that a model answers an item correctly as a function of model ability, item difficulty, and item discrimination, and so identifies the most informative items. Approaches like tinyBenchmarks approximate full-benchmark results from a subset of about 100 items, cutting cost by roughly two orders of magnitude; run the full suite periodically as a check.
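The IRT idea can be sketched with the two-parameter logistic (2PL) model, where an item's Fisher information at a given ability indicates how much it helps to separate models near that ability. This is a generic textbook illustration, not the selection procedure tinyBenchmarks uses; the item parameters are made up, and in practice they would be fit to real response data.

```python
# Sketch of the 2PL IRT model and information-based item selection.
# Item parameters are made up; real ones are fit to observed responses.
import numpy as np

def p_correct(theta, a, b):
    """P(correct) for ability theta, item discrimination a, item difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information an item carries about ability theta under 2PL."""
    p = p_correct(theta, a, b)
    return a ** 2 * p * (1.0 - p)

def most_informative(theta, a, b, n_items):
    """Indices of the n_items with the highest information at theta."""
    info = item_information(theta, np.asarray(a), np.asarray(b))
    return np.argsort(-info, kind="stable")[:n_items]

a = np.array([0.5, 1.5, 1.5, 2.0])    # discrimination per item
b = np.array([0.0, -3.0, 0.2, 0.1])   # difficulty per item
print(most_informative(theta=0.0, a=a, b=b, n_items=2))  # -> [3 2]
```

Item 1 is highly discriminating but far too easy for a model at ability 0, so it carries little information there. That dependence on ability is why a subset fit to one model population can mis-rank a model far outside it.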

Failure modes

Saturated benchmark. Everyone scores 90%+ and the metric no longer ranks; move to a harder or refreshed set.
Contaminated test set. Memorized items inflate scores; validate against freshly collected items.
Label noise. A few percent of wrong answers reshuffles the leaderboard; audit the source data.
Judge bias. An LLM judge can favor longer answers, a particular answer position, or its own outputs; use length control, randomized answer order, and a stronger judge.
Single-number tunnel vision. A composite hides per-domain regressions; keep the taxonomy breakdown.
Overfitting to the eval. Tuning on the benchmark makes the number rise and the capability stall; keep a private hold-out.
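Several of the failure modes above come down to whether a score gap is larger than sampling noise. A minimal sketch of a Wilson score interval for accuracy follows. The counts are made-up illustration data, and z = 1.96 assumes a roughly 95% interval.

```python
# Sketch: Wilson score interval for benchmark accuracy (z = 1.96 for ~95%).
# The counts below are made-up illustration data.
from math import sqrt

def wilson_interval(correct: int, total: int, z: float = 1.96):
    """Return (low, high) bounds on the true accuracy."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need total > 0 and 0 <= correct <= total")
    p = correct / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half

# Two models scored on the same 200-item benchmark.
for name, correct in (("model_a", 184), ("model_b", 188)):
    low, high = wilson_interval(correct, 200)
    print(f"{name}: {correct / 200:.3f} [{low:.3f}, {high:.3f}]")
```

With 200 items, a two-point gap near the ceiling sits well inside both intervals, so the benchmark cannot rank these two models. Overlapping intervals are only a rough screen; when both models answer the same items, a paired comparison on per-item outcomes is more sensitive.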

References

Cameron R. Wolfe, "The Anatomy of an LLM Benchmark": https://cameronrwolfe.substack.com/p/llm-bench
Hendrycks et al., Measuring Massive Multitask Language Understanding (MMLU): https://arxiv.org/abs/2009.03300
Wang et al., MMLU-Pro: https://arxiv.org/abs/2406.01574
Rein et al., GPQA: A Graduate-Level Google-Proof Q&A Benchmark: https://arxiv.org/abs/2311.12022
Srivastava et al., BIG-Bench: https://arxiv.org/abs/2206.04615
Zhou et al., IFEval: Instruction-Following Evaluation for LLMs: https://arxiv.org/abs/2311.07911
Chen et al., Evaluating Large Language Models Trained on Code (HumanEval, pass@k): https://arxiv.org/abs/2107.03374
EleutherAI lm-evaluation-harness: https://github.com/EleutherAI/lm-evaluation-harness