LLM evaluation harness & the eval gate
Scope: measuring a post-trained model's quality reproducibly. This covers the benchmark suites, the harness that runs them (lm-evaluation-harness, lighteval), the contamination controls that keep scores honest, and the eval gate that decides whether a checkpoint is promoted. It is the measurement layer that arbitrates every post-training decision, and the reason "it trained" is never sufficient to ship.
Shell and API blocks are reference templates: pin versions and validate before production use. Each core algorithm this page teaches (the eval gate, pass@k, the bias-corrected judge win-rate) is a runnable, self-asserting block with no ML dependencies: the eval gate uses only the standard library, and the other two need only numpy.
What it is
An evaluation harness runs a fixed set of benchmarks against a model and emits comparable scores. Two are standard. lm-evaluation-harness (EleutherAI) is the backend of Hugging Face's Open LLM Leaderboard, with 60+ tasks and vLLM and OpenAI-compatible backends. lighteval (Hugging Face) is multi-backend (Accelerate, vLLM, SGLang, endpoints), with 1000+ tasks and custom metrics.[1][2] HELM (Stanford) and Inspect (UK AISI) are broader alternatives.
Benchmarks group by capability: knowledge and reasoning (MMLU, GPQA, ARC, BBH, HellaSwag), maths (GSM8K, MATH), code (HumanEval, MBPP), instruction-following (IFEval), and open-ended quality judged by an LLM or humans (MT-Bench, Arena-Hard, AlpacaEval). The eval gate is the CI check that a checkpoint must pass (quality up, no regressions, no safety failures) before it is promoted to a registry or serving.
Why use it
- "It trained" is not "it improved." Training loss can fall, or RL reward can rise, while real quality falls (reward hacking, entropy collapse); only a held-out eval tells you which happened.
- Reproducibility. A pinned harness (prompt format, few-shot count, metric, version) makes runs comparable across time and teams; ad-hoc eval scripts are not comparable and quietly drift.
- The gate arbitrates promotion. Gating on evals rather than vibes is the core MLOps discipline (SRE/MLOps practices); it catches regressions before production, not after.
- Honest scores need decontamination. A benchmark that leaked into training data reports fiction (data curation, evaluation integrity).
When to use it (and when not)
- Always, as a gate before promotion, and continuously during post-training to catch collapse and regression early.
- Benchmark evals measure capability; LLM-as-judge measures open-ended quality; keep human eval for the hardest safety and quality slices no automation captures.
- Do not over-index on one benchmark: optimizing a single number invites Goodharting (evaluation integrity).
- Do not trust a benchmark you did not decontaminate against; treat leaked scores as unknown, not high.
Architecture
flowchart LR
CKPT["Candidate checkpoint"] --> HARNESS["Harness: tasks x few-shot x metric (pinned)"]
HARNESS --> SCORES["Scores + per-task breakdown"]
DECON["Decontamination check"] -.->|"trust the score?"| SCORES
SCORES --> GATE{"Eval gate: pass thresholds + no regression?"}
GATE -->|"pass"| REG["Registry -> deploy"]
GATE -->|"fail"| TRAIN["Back to training / data"]
How to use it
lm-evaluation-harness runs a model against named tasks; back it with vLLM for throughput:
# pip install lm-eval (reference template) -- reproducible benchmark run on a vLLM backend
lm-eval --model vllm \
  --model_args pretrained=./merged-model,tensor_parallel_size=4,dtype=bfloat16 \
  --tasks mmlu,gsm8k,ifeval \
  --num_fewshot 5 \
  --batch_size auto \
  --output_path ./eval_results
lighteval covers the same ground with more backends and custom tasks:
# pip install lighteval (reference template) -- evaluate on a vLLM backend
lighteval vllm "pretrained=./merged-model,dtype=bfloat16" "leaderboard|mmlu|5|0"
Both write per-task scores you can threshold in CI. Pin the harness version and the task revisions; scores are only comparable within the same versions.
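One way to make the pinning rule enforceable is to store a manifest with every result and refuse to compare results whose manifests differ. The sketch below is a minimal illustration and not part of either harness: `RunManifest`, its fields, and `comparable` are hypothetical names, and the version and revision strings are placeholders.

```python
# Hypothetical sketch: record the settings that move scores and refuse to
# compare runs that differ in any of them. Not an API of lm-eval or lighteval.
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunManifest:
    harness: str          # e.g. "lm-eval"
    harness_version: str  # placeholder; record the version you actually installed
    task: str
    task_revision: str    # placeholder for the pinned task/dataset revision
    num_fewshot: int
    prompt_format: str
    metric: str

    def fingerprint(self) -> str:
        """Stable hash of every pinned setting (sort_keys fixes the key order)."""
        blob = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(blob).hexdigest()[:16]


def comparable(a: RunManifest, b: RunManifest) -> bool:
    """Two scores are comparable only if every pinned setting matches."""
    return a.fingerprint() == b.fingerprint()


base = RunManifest("lm-eval", "X.Y.Z", "gsm8k", "rev-a", 5, "chat-template", "exact_match")
same = RunManifest("lm-eval", "X.Y.Z", "gsm8k", "rev-a", 5, "chat-template", "exact_match")
zero_shot = RunManifest("lm-eval", "X.Y.Z", "gsm8k", "rev-a", 0, "chat-template", "exact_match")

assert comparable(base, same)           # identical settings -> comparable
assert not comparable(base, zero_shot)  # few-shot count changed -> not comparable
print("manifest fingerprint:", base.fingerprint())
```

Storing the fingerprint next to each score lets a CI job reject a comparison between a candidate and an incumbent that were evaluated under different settings.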
For reasoning and code, report pass@k, not just pass@1: the capability-versus-sharpening question turns on it (RLVR), and coverage under sampling is a different quantity from first-try reliability. The naive estimate (sample k, check for a hit) is high-variance; HumanEval's unbiased estimator samples n >= k, counts c correct, and computes pass@k = 1 - C(n-c, k) / C(n, k). The companion benchmarks page owns the canonical version; the block below is the same core math, validated here against a brute-force reference (every k-subset enumerated) plus boundaries and the pass@1 = c/n identity:
# pass@k core: unbiased HumanEval estimator P(>=1 of k of n samples correct); numpy-only.
# pass@k = 1 - C(n-c, k)/C(n, k), computed as a stable product to avoid huge binomials.
import numpy as np

def pass_at_k(n, c, k):
    """Estimate P(>=1 of k samples correct) from n samples with c correct."""
    assert 0 <= c <= n and 1 <= k <= n
    if n - c < k:  # fewer than k wrong -> a correct draw is certain
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# slow but obviously-correct reference: enumerate every k-subset, fraction with >=1 correct
from itertools import combinations

def pass_at_k_ref(n, c, k):
    idx = range(n)  # first c indices are the correct samples
    subsets = list(combinations(idx, k))
    hit = sum(1 for s in subsets if any(i < c for i in s))
    return hit / len(subsets)

# equivalence to the slow reference across a grid (the anti-bug check)
for n in range(1, 9):
    for c in range(0, n + 1):
        for k in range(1, n + 1):
            assert abs(pass_at_k(n, c, k) - pass_at_k_ref(n, c, k)) < 1e-9, (n, c, k)

# boundaries and identities
assert pass_at_k(10, 0, 5) == 0.0  # no correct sample -> impossible
assert pass_at_k(10, 3, 8) == 1.0  # k=8 > n-c=7 -> a correct draw is certain
assert abs(pass_at_k(10, 1, 1) - 0.1) < 1e-12  # pass@1 == c/n

# monotonic non-decreasing in k: more attempts never lowers coverage
vals = [pass_at_k(10, 2, k) for k in (1, 2, 5, 10)]
assert all(a <= b + 1e-12 for a, b in zip(vals, vals[1:])), vals
print("pass@k asserts passed:", {k: round(v, 3) for k, v in zip((1, 2, 5, 10), vals)})
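The "stable product" form used in `pass_at_k` follows from expanding both binomial coefficients. A short derivation, using the same n, c, and k as the code and assuming n - c >= k:

```latex
\begin{aligned}
\frac{\binom{n-c}{k}}{\binom{n}{k}}
  &= \frac{(n-c)!\,(n-k)!}{(n-c-k)!\,n!}
   = \prod_{i=n-c+1}^{n} \frac{i-k}{i}
   = \prod_{i=n-c+1}^{n} \left(1 - \frac{k}{i}\right), \\
\text{pass@}k &= 1 - \prod_{i=n-c+1}^{n} \left(1 - \frac{k}{i}\right), \\
\text{pass@}1 &= 1 - \prod_{i=n-c+1}^{n} \frac{i-1}{i}
   = 1 - \frac{n-c}{n} = \frac{c}{n}.
\end{aligned}
```

Every factor lies strictly between 0 and 1, so the product cannot overflow, whereas the binomials themselves grow very large for big n. When c = 0 the product is empty and equals 1, giving pass@k = 0; when n - c < k the numerator binomial is zero, which is the early return of 1.0 in the code.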
How to integrate it
The harness is a stage, not a script. Wire its four planes into the pipeline (data -> train -> eval -> register -> deploy):
- Task plane. Match tasks to the capability you are training, and keep a held-out gate set the model never sees in training or few-shot prompts. Pin the prompt format, num_fewshot, metric, and harness/task version; all of them move scores, so record them with every result or the numbers are not comparable.
- Decontamination plane. Run the contamination check against every reported benchmark before trusting a score. Before believing a jump, confirm the benchmark's items are not in the training set (data curation); an unexplained spike is usually contamination, not capability. Treat a leaked score as unknown, not high.
- Judge plane (open-ended quality). For chat quality, score with an LLM judge (MT-Bench / Arena-Hard style) and report win-rate against a fixed baseline, alongside the capability benchmarks. Judges carry position bias, verbosity bias, and self-preference, so fix the rubric, randomize order, and calibrate against human labels (reward design). The concrete correction is to judge each pair in both orders and average, which cancels a constant position bias:
# LLM-judge win-rate with position-bias correction; numpy-only.
# Each pair is judged in BOTH orders. Score the candidate: win=1, tie=0.5, loss=0,
# averaged over the two orderings so a constant "prefer-first-slot" bias cancels out.
import numpy as np

def corrected_winrate(fwd, rev):
    """fwd[i]: candidate score when shown FIRST (1/0.5/0);
    rev[i]: candidate score when shown SECOND (same pairs, order swapped)."""
    fwd, rev = np.asarray(fwd, float), np.asarray(rev, float)
    assert fwd.shape == rev.shape
    assert np.isin(fwd, [0.0, 0.5, 1.0]).all() and np.isin(rev, [0.0, 0.5, 1.0]).all()
    return float(np.mean((fwd + rev) / 2.0))  # average the two slot assignments

# adversarial: a judge that ALWAYS picks whatever is in the first slot.
# When candidate is first it "wins" (fwd=1); when candidate is second it "loses" (rev=0).
n = 8
fwd_biased = np.ones(n)   # candidate shown first -> biased judge says candidate wins
rev_biased = np.zeros(n)  # candidate shown second -> biased judge says candidate loses
assert corrected_winrate(fwd_biased, rev_biased) == 0.5  # pure position bias cancels to a tie

# naive single-order win-rate would have reported a fake 100% for the same biased judge
naive = float(np.mean(fwd_biased))
assert naive == 1.0 and naive != corrected_winrate(fwd_biased, rev_biased)

# genuinely-better candidate: wins in both slots -> correction preserves the real signal
assert corrected_winrate(np.ones(n), np.ones(n)) == 1.0  # dominates regardless of position

# order-invariance: swapping which array is "forward" cannot change the averaged result
rng = np.random.default_rng(0)
a = rng.choice([0.0, 0.5, 1.0], n)
b = rng.choice([0.0, 0.5, 1.0], n)
assert corrected_winrate(a, b) == corrected_winrate(b, a)

# ties everywhere -> 0.5 exactly (boundary)
assert corrected_winrate(np.full(n, 0.5), np.full(n, 0.5)) == 0.5
print("judge win-rate asserts passed: biased->%.2f better->%.2f" % (
    corrected_winrate(fwd_biased, rev_biased), corrected_winrate(np.ones(n), np.ones(n))))
- Lineage plane. Emit per-task scores plus the pinned versions to the experiment tracker and model registry so a promoted checkpoint is reproducible and its gate decision is auditable (SRE/MLOps practices).
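The decontamination plane can start as a plain n-gram overlap test between benchmark items and the training corpus. The sketch below is one simple heuristic under stated assumptions, not the check any particular harness ships: `ngrams`, `contaminated_items`, the whitespace tokenizer, and the default window of 8 tokens are illustrative choices, and the texts are invented for the example.

```python
# Hypothetical n-gram overlap check: flag benchmark items that share any
# n-token window with the training corpus. Tokenizer and n are assumptions.
def ngrams(text, n):
    """Set of lowercase whitespace-token n-grams in text (empty if too short)."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}


def contaminated_items(benchmark_items, training_docs, n=8):
    """Return indices of benchmark items sharing at least one n-gram with training data."""
    seen = set()
    for doc in training_docs:
        seen |= ngrams(doc, n)
    return [i for i, item in enumerate(benchmark_items) if ngrams(item, n) & seen]


train = [
    "scraped page: a farmer has 17 sheep and buys 5 more every week for 6 weeks so how many",
    "an unrelated document about gradient clipping and learning rate warmup schedules in practice",
]
bench = [
    "A farmer has 17 sheep and buys 5 more every week for 6 weeks. How many sheep are there?",
    "A train leaves the station at noon travelling at sixty kilometres per hour toward the coast.",
]

flagged = contaminated_items(bench, train)
assert flagged == [0]  # item 0 overlaps the scraped page; item 1 is clean
print("contaminated benchmark items:", flagged)
```

A flagged item means the score for that benchmark should be treated as unknown until the overlap is resolved. Larger corpora would need text normalization and an index rather than an in-memory set.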
How to run it in production
The gate is the production artefact: a CI check that blocks promotion when the candidate misses a floor or regresses against the incumbent. Gate on a suite plus the held-out set, never a single number, and never on training reward. This is the runnable core, self-asserting on the happy path, the exact-floor boundary, a missing task, and an above-floor-but-regressed candidate:
# eval_gate.py -- promote only if every gated task clears its floor AND does not regress.
# Pure stdlib; runnable. In CI you would json.load the harness results instead of the inline dict.
FLOORS = {"mmlu": 0.62, "gsm8k": 0.70, "ifeval": 0.55}  # promotion thresholds

def gate(scores, floors=FLOORS, incumbent=None, eps=0.0):
    """Return (ok, fails). A task fails if it is missing, below its floor,
    or (when an incumbent is given) regresses by more than eps."""
    fails = {}
    for task, floor in floors.items():
        s = scores.get(task)
        if s is None:  # missing task -> hard fail, never silently skip
            fails[task] = "missing"
        elif s < floor:
            fails[task] = f"{s:.3f} < floor {floor}"
        elif incumbent is not None and s < incumbent.get(task, 0.0) - eps:
            fails[task] = f"{s:.3f} regresses vs {incumbent[task]:.3f}"
    return (not fails), fails

# happy path: clears every floor -> promote
ok, fails = gate({"mmlu": 0.66, "gsm8k": 0.74, "ifeval": 0.58})
assert ok and fails == {}, fails

# boundary: exactly at the floor is a pass (>= semantics), just below is a fail
assert gate({"mmlu": 0.62, "gsm8k": 0.70, "ifeval": 0.55})[0] is True
ok, fails = gate({"mmlu": 0.6199, "gsm8k": 0.70, "ifeval": 0.55})
assert ok is False and set(fails) == {"mmlu"}, fails

# adversarial 1: a missing task must fail, not pass by absence
ok, fails = gate({"mmlu": 0.66, "gsm8k": 0.74})  # ifeval absent
assert ok is False and fails["ifeval"] == "missing", fails

# adversarial 2: above the floor but regressed vs incumbent -> block the promotion
inc = {"mmlu": 0.70, "gsm8k": 0.74, "ifeval": 0.58}
ok, fails = gate({"mmlu": 0.66, "gsm8k": 0.74, "ifeval": 0.58}, incumbent=inc)
assert ok is False and "regresses" in fails["mmlu"], fails  # 0.66 > floor 0.62 yet < incumbent 0.70
print("eval_gate asserts passed:", gate({"mmlu": 0.66, "gsm8k": 0.74, "ifeval": 0.58}))
In CI, json.load the harness output (./eval_results/results.json) into scores, load the incumbent's stored scores from the registry, and sys.exit(1) when gate(...) returns a non-empty fails so the pipeline blocks the promotion. Because RL reward can rise, or training loss can fall, while quality regresses, the gate must read held-out eval scores, not the training signal (RLVR).
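The CI wiring described here might look like the following sketch. It makes several assumptions that are not taken from either harness: `load_scores`, `ci_gate`, and the file names are hypothetical, and both files are assumed to hold a flat `{task: score}` JSON object, so a real pipeline would first extract one metric per task from the harness's own results schema. The gate logic is a condensed stand-in for the `gate` function in eval_gate.py.

```python
# Hypothetical CI entry point: load scores, run the gate, return an exit code.
# Assumes flat {task: score} JSON files; adapt the loader to your harness output.
import json
import os
import tempfile

FLOORS = {"mmlu": 0.62, "gsm8k": 0.70, "ifeval": 0.55}


def load_scores(path):
    """Read a flat {task: score} JSON object (an assumed, simplified layout)."""
    with open(path, encoding="utf-8") as fh:
        return {task: float(score) for task, score in json.load(fh).items()}


def ci_gate(candidate_path, incumbent_path, floors=FLOORS, eps=0.0):
    """Return 0 to allow promotion, 1 to block it (missing, below floor, or regressed)."""
    cand, inc = load_scores(candidate_path), load_scores(incumbent_path)
    fails = {}
    for task, floor in floors.items():
        s = cand.get(task)
        if s is None:
            fails[task] = "missing"
        elif s < floor:
            fails[task] = f"{s:.3f} < floor {floor}"
        elif s < inc.get(task, 0.0) - eps:
            fails[task] = f"{s:.3f} regresses vs {inc[task]:.3f}"
    for task, why in fails.items():
        print(f"GATE FAIL {task}: {why}")
    return 1 if fails else 0


# Demo with temporary files standing in for harness output and registry scores.
with tempfile.TemporaryDirectory() as tmp:
    cand_path = os.path.join(tmp, "candidate.json")
    inc_path = os.path.join(tmp, "incumbent.json")
    with open(inc_path, "w", encoding="utf-8") as fh:
        json.dump({"mmlu": 0.64, "gsm8k": 0.72, "ifeval": 0.56}, fh)
    with open(cand_path, "w", encoding="utf-8") as fh:
        json.dump({"mmlu": 0.66, "gsm8k": 0.74, "ifeval": 0.58}, fh)
    exit_ok = ci_gate(cand_path, inc_path)       # improves everywhere -> 0
    with open(cand_path, "w", encoding="utf-8") as fh:
        json.dump({"mmlu": 0.66, "gsm8k": 0.71}, fh)
    exit_blocked = ci_gate(cand_path, inc_path)  # gsm8k regressed, ifeval missing -> 1

assert exit_ok == 0 and exit_blocked == 1
# In a real CI job: sys.exit(ci_gate(candidate_path, incumbent_path))
```

Returning the exit code from a function, rather than calling sys.exit inside it, keeps the gate testable in the same way as the self-asserting blocks on this page.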
How to maintain it
- Pin and version everything. Prompt format, few-shot count, metric, and harness/task revision are all part of the result; treat a harness or task-set upgrade as a controlled change and re-baseline the incumbent's scores under the new versions before comparing.
- Re-decontaminate as data moves. New training data can leak a previously clean benchmark; rerun the contamination check whenever the corpus changes, not once.
- Recalibrate the judge. Position, verbosity, and self-preference biases drift with judge-model versions; periodically re-check the corrected win-rate against fresh human labels and refresh the rubric.
- Keep the gate set unseen. Rotate or expand the held-out gate set if there is any risk it entered training or few-shot prompts; an eval everyone can see is an eval you overfit.
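Recalibrating the judge amounts to measuring how often its verdicts match human verdicts on the same pairs. A minimal sketch, assuming verdicts are encoded as "win", "tie", or "loss" for the candidate; `agreement` and `cohens_kappa` are illustrative helper names, and the labels are invented example data. Cohen's kappa discounts the agreement expected by chance from each rater's label frequencies.

```python
# Illustrative judge-vs-human calibration check; labels are invented example data.
from collections import Counter


def agreement(judge, human):
    """Fraction of pairs where the judge and the human give the same verdict."""
    assert len(judge) == len(human) and judge
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge, human):
    """Chance-corrected agreement: (p_o - p_e) / (1 - p_e)."""
    n = len(judge)
    p_o = agreement(judge, human)
    jc, hc = Counter(judge), Counter(human)
    p_e = sum(jc[label] * hc[label] for label in set(jc) | set(hc)) / (n * n)
    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1.0 - p_e)


human = ["win", "win", "loss", "tie", "loss", "win", "loss", "tie"]
judge = ["win", "win", "loss", "win", "loss", "win", "win", "tie"]

assert agreement(human, human) == 1.0 and cohens_kappa(human, human) == 1.0
assert agreement(judge, human) == 0.75          # 6 of 8 verdicts match
assert 0.0 < cohens_kappa(judge, human) < 1.0   # better than chance, short of perfect
print("agreement=%.2f kappa=%.2f" % (agreement(judge, human), cohens_kappa(judge, human)))
```

What level of agreement is acceptable is a project decision; the point is to track the same number across judge-model versions so that drift shows up.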
How to scale it
Evaluation is a batch-inference workload: run it on vLLM, parallelize across tasks and GPUs, and make it a stage in the pipeline (data -> train -> eval -> register -> deploy) rather than a manual step (SRE/MLOps practices). Continuous eval in CI catches regressions per checkpoint; a nightly full-suite run tracks drift. Cache generations per (model, task, version) so re-scoring with a new metric does not re-run inference. For pass@k on reasoning and code, generating n samples per item is the cost driver, so size the sampling budget to the smallest n >= k that keeps the estimator's variance acceptable.
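The caching idea can be sketched as a store keyed by (model, task, version), so a new metric is applied to stored generations instead of triggering inference again. Everything here is hypothetical: `GenerationCache`, `fake_generate`, and the two toy metrics are stand-ins, and a production cache would live on disk or object storage rather than in a dict.

```python
# Hypothetical generation cache: inference runs once per (model, task, version);
# re-scoring with a different metric reuses the stored generations.
class GenerationCache:
    def __init__(self, generate_fn):
        self._generate = generate_fn   # the expensive inference call
        self._store = {}               # (model, task, version) -> list of generations
        self.inference_calls = 0

    def generations(self, model, task, version, prompts):
        key = (model, task, version)
        if key not in self._store:     # cache miss -> run inference exactly once
            self.inference_calls += 1
            self._store[key] = [self._generate(p) for p in prompts]
        return self._store[key]

    def score(self, model, task, version, prompts, references, metric):
        gens = self.generations(model, task, version, prompts)
        return sum(metric(g, r) for g, r in zip(gens, references)) / len(references)


def fake_generate(prompt):             # stand-in for a model call
    return {"2+2=": "4", "3+3=": "six"}[prompt]


def exact_match(gen, ref):
    return float(gen == ref)


def nonempty(gen, ref):
    return float(len(gen) > 0)


cache = GenerationCache(fake_generate)
prompts, refs = ["2+2=", "3+3="], ["4", "6"]
em = cache.score("ckpt-1", "toy-math", "v1", prompts, refs, exact_match)
ne = cache.score("ckpt-1", "toy-math", "v1", prompts, refs, nonempty)  # new metric, no inference
assert (em, ne) == (0.5, 1.0) and cache.inference_calls == 1
cache.score("ckpt-1", "toy-math", "v2", prompts, refs, exact_match)    # version bump -> miss
assert cache.inference_calls == 2
```

If generations are sampled rather than greedy, the sampling settings and seed arguably belong in the key as well, since they change what is stored.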
Failure modes
- Benchmark contamination. Train/eval overlap inflates scores and breaks the gate; decontaminate every reported benchmark (evaluation integrity).
- Goodharting one metric. Optimizing a single benchmark degrades untested capabilities; gate on a suite plus a held-out set.
- Incomparable runs. Changing prompt format, few-shot count, or harness version silently moves scores; pin and record them.
- LLM-judge bias. Position, verbosity, and self-preference biases skew judge scores. Judge each pair in both orders to cancel position bias (see the corrected win-rate above), which turns a fake 100% into an uninformative 50% tie for a purely position-biased judge, and calibrate against human labels for the biases that order-swapping does not remove.
- Gating on training reward. Promoting on rising RL reward or falling loss instead of held-out eval ships reward-hacked regressions (RLVR).
- No held-out gate set. If every eval is visible during development, you overfit the eval; reserve an unseen gate.
- Silent missing task. A crashed or renamed task that returns no score must fail the gate, not pass by absence; the gate above treats a missing task as a hard fail.
References
- lm-evaluation-harness (EleutherAI; Open LLM Leaderboard backend): https://github.com/EleutherAI/lm-evaluation-harness
- lighteval (Hugging Face; multi-backend eval): https://github.com/huggingface/lighteval
- HELM (Stanford CRFM, holistic evaluation): https://crfm.stanford.edu/helm/
- Inspect (UK AI Safety Institute, evals framework): https://inspect.aisi.org.uk/
- Chen et al., Evaluating Large Language Models Trained on Code (HumanEval, unbiased pass@k estimator): https://arxiv.org/abs/2107.03374
Related: LLM benchmarks · Evaluation integrity · Training-data curation · Synthetic data generation · Fine-tuning and post-training · RLVR · Reward design · Model merging · SRE/MLOps practices · Evaluating agents · Glossary
[1] EleutherAI lm-evaluation-harness, 60+ academic benchmarks, the backend of Hugging Face's Open LLM Leaderboard; supports HF, vLLM (--model vllm), and OpenAI-compatible backends. https://github.com/EleutherAI/lm-evaluation-harness
[2] Hugging Face lighteval, all-in-one evaluation across Accelerate/vLLM/SGLang/endpoints backends with 1000+ tasks and custom metrics. https://github.com/huggingface/lighteval