LLM evaluation harness & the eval gate

Scope: measuring a post-trained model's quality reproducibly. This covers the benchmark suites, the harness that runs them (lm-evaluation-harness, lighteval), the contamination controls that keep scores honest, and the eval gate that decides whether a checkpoint is promoted. It is the measurement layer that arbitrates every post-training decision, and the reason "it trained" is never sufficient to ship.

Shell and API blocks are reference templates: pin versions and validate before production use. Each core algorithm this page teaches (the eval gate, pass@k, the bias-corrected judge win-rate) is a runnable, self-asserting block with no ML dependencies: the eval gate is pure stdlib, and the other two need only numpy.

What it is

An evaluation harness runs a fixed set of benchmarks against a model and emits comparable scores. Two are standard. lm-evaluation-harness (EleutherAI) is the backend of Hugging Face's Open LLM Leaderboard, with 60+ academic benchmarks and Hugging Face, vLLM, and OpenAI-compatible backends.¹ lighteval (Hugging Face) is multi-backend (Accelerate, vLLM, SGLang, endpoints), with 1000+ tasks and custom metrics.² HELM (Stanford) and Inspect (UK AISI) are broader alternatives.

Benchmarks group by capability: knowledge and reasoning (MMLU, GPQA, ARC, BBH, HellaSwag), maths (GSM8K, MATH), code (HumanEval, MBPP), instruction-following (IFEval), and open-ended quality judged by an LLM or humans (MT-Bench, Arena-Hard, AlpacaEval). The eval gate is the CI check that a checkpoint must pass (quality up, no regressions, no safety failures) before it is promoted to a registry or serving.

Why use it

"It trained" is not "it improved." Training loss can fall or RL reward can rise while real quality falls (reward hacking, entropy collapse); only a held-out eval tells you which happened.
Reproducibility. A pinned harness (prompt format, few-shot count, metric, version) makes runs comparable across time and teams; ad-hoc eval scripts are not comparable and quietly drift.
The gate arbitrates promotion. Gating on evals rather than vibes is the core MLOps discipline (SRE/MLOps practices); it catches regressions before production, not after.
Honest scores need decontamination. A benchmark that leaked into training data reports fiction (data curation, evaluation integrity).

When to use it (and when not)

Always, as a gate before promotion, and continuously during post-training to catch collapse and regression early.
Benchmark evals measure capability; LLM-as-judge measures open-ended quality; keep human eval for the hardest safety and quality slices no automation captures.
Do not over-index on one benchmark: optimizing a single number invites Goodharting (evaluation integrity).
Do not trust a benchmark you did not decontaminate against; treat leaked scores as unknown, not high.

Architecture

flowchart LR
  CKPT["Candidate checkpoint"] --> HARNESS["Harness: tasks x few-shot x metric (pinned)"]
  HARNESS --> SCORES["Scores + per-task breakdown"]
  DECON["Decontamination check"] -.->|"trust the score?"| SCORES
  SCORES --> GATE{"Eval gate: pass thresholds + no regression?"}
  GATE -->|"pass"| REG["Registry -> deploy"]
  GATE -->|"fail"| TRAIN["Back to training / data"]

How to use it

lm-evaluation-harness runs a model against named tasks; back it with vLLM for throughput:

# pip install lm-eval  (reference template) -- reproducible benchmark run on a vLLM backend
lm-eval --model vllm \
  --model_args pretrained=./merged-model,tensor_parallel_size=4,dtype=bfloat16 \
  --tasks mmlu,gsm8k,ifeval \
  --num_fewshot 5 \
  --batch_size auto \
  --output_path ./eval_results

lighteval covers the same ground with more backends and custom tasks:

# pip install lighteval  (reference template) -- evaluate on a vLLM backend
lighteval vllm "pretrained=./merged-model,dtype=bfloat16" "leaderboard|mmlu|5|0"

Both write per-task scores you can threshold in CI. Pin the harness version and the task revisions; scores are only comparable within the same versions.

For reasoning and code, report pass@k, not just pass@1: the capability-versus-sharpening question turns on it (RLVR), and coverage under sampling is a different quantity from first-try reliability. The naive estimate (sample k, check for a hit) is high-variance; HumanEval's unbiased estimator samples n >= k, counts c correct, and computes pass@k = 1 - C(n-c, k) / C(n, k). The companion benchmarks page owns the canonical version; the block below is the same core math, validated here against a brute-force reference (every k-subset enumerated) plus boundaries and the pass@1 = c/n identity:

# pass@k core: unbiased HumanEval estimator P(>=1 of k of n samples correct); numpy-only.
# pass@k = 1 - C(n-c, k)/C(n, k), computed as a stable product to avoid huge binomials.
import numpy as np

def pass_at_k(n, c, k):
    """Estimate P(>=1 of k samples correct) from n samples with c correct."""
    assert 0 <= c <= n and 1 <= k <= n
    if n - c < k:                                   # fewer than k wrong -> a correct draw is certain
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# slow but obviously-correct reference: enumerate every k-subset, fraction with >=1 correct
from itertools import combinations
def pass_at_k_ref(n, c, k):
    idx = range(n)                                  # first c indices are the correct samples
    subsets = list(combinations(idx, k))
    hit = sum(1 for s in subsets if any(i < c for i in s))
    return hit / len(subsets)

# equivalence to the slow reference across a grid (the anti-bug check)
for n in range(1, 9):
    for c in range(0, n + 1):
        for k in range(1, n + 1):
            assert abs(pass_at_k(n, c, k) - pass_at_k_ref(n, c, k)) < 1e-9, (n, c, k)

# boundaries and identities
assert pass_at_k(10, 0, 5) == 0.0                   # no correct sample -> impossible
assert pass_at_k(10, 3, 8) == 1.0                   # k=8 > n-c=7 -> a correct draw is certain
assert abs(pass_at_k(10, 1, 1) - 0.1) < 1e-12       # pass@1 == c/n

# monotonic non-decreasing in k: more attempts never lowers coverage
vals = [pass_at_k(10, 2, k) for k in (1, 2, 5, 10)]
assert all(a <= b + 1e-12 for a, b in zip(vals, vals[1:])), vals

print("pass@k asserts passed:", {k: round(v, 3) for k, v in zip((1, 2, 5, 10), vals)})
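
The estimator above scores a single problem. A benchmark-level pass@k, as reported for HumanEval, is the mean of the per-problem estimates. The following is a minimal sketch of that aggregation, not the canonical implementation: `benchmark_pass_at_k` is an illustrative name, and the `(n, c)` counts are made-up data.

```python
# Benchmark-level pass@k: average the per-problem unbiased estimates; numpy-only.
# The (n, c) counts below are made-up illustration data, not real results.
import numpy as np

def pass_at_k(n, c, k):
    """Per-problem unbiased estimate of P(>=1 of k samples correct)."""
    assert 0 <= c <= n and 1 <= k <= n
    if n - c < k:                       # fewer than k wrong samples -> a hit is certain
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

def benchmark_pass_at_k(counts, k):
    """counts: list of (n, c), one per problem; returns the mean per-problem pass@k."""
    if not counts:
        raise ValueError("no problems to score")
    return float(np.mean([pass_at_k(n, c, k) for n, c in counts]))

# hypothetical problems: never solved, rarely solved, always solved
counts = [(10, 0), (10, 1), (10, 10)]
p1 = benchmark_pass_at_k(counts, 1)
p5 = benchmark_pass_at_k(counts, 5)
print(f"pass@1={p1:.3f}  pass@5={p5:.3f}")
```

Averaging per problem, rather than pooling all samples into one `(n, c)`, keeps a few easy problems from masking the ones the model never solves.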

How to integrate it

The harness is a stage, not a script. Wire its four planes into the pipeline (data -> train -> eval -> register -> deploy):

Task plane. Match tasks to the capability you are training, and keep a held-out gate set the model never sees in training or few-shot prompts. Pin the prompt format, num_fewshot, metric, and harness/task version; all of them move scores, so record them with every result or the numbers are not comparable.
Decontamination plane. Run the contamination check against every reported benchmark before trusting its score: confirm the benchmark's items are not in the training set (data curation). An unexplained spike is usually contamination, not capability. Treat a leaked score as unknown, not high.
Judge plane (open-ended quality). For chat quality, score with an LLM judge (MT-Bench / Arena-Hard style) and report win-rate against a fixed baseline, alongside the capability benchmarks. Judges carry position bias, verbosity bias, and self-preference, so fix the rubric, randomize order, and calibrate against human labels (reward design). The concrete correction is to judge each pair in both orders and average, which cancels a constant position bias:

# LLM-judge win-rate with position-bias correction; numpy-only.
# Each pair is judged in BOTH orders. Score the candidate: win=1, tie=0.5, loss=0,
# averaged over the two orderings so a constant "prefer-first-slot" bias cancels out.
import numpy as np

def corrected_winrate(fwd, rev):
    """fwd[i]: candidate score when shown FIRST (1/0.5/0);
    rev[i]: candidate score when shown SECOND (same pairs, order swapped)."""
    fwd, rev = np.asarray(fwd, float), np.asarray(rev, float)
    assert fwd.shape == rev.shape
    assert np.isin(fwd, [0.0, 0.5, 1.0]).all() and np.isin(rev, [0.0, 0.5, 1.0]).all()
    return float(np.mean((fwd + rev) / 2.0))            # average the two slot assignments

# adversarial: a judge that ALWAYS picks whatever is in the first slot.
# When candidate is first it "wins" (fwd=1); when candidate is second it "loses" (rev=0).
n = 8
fwd_biased = np.ones(n)          # candidate shown first -> biased judge says candidate wins
rev_biased = np.zeros(n)         # candidate shown second -> biased judge says candidate loses
assert corrected_winrate(fwd_biased, rev_biased) == 0.5     # pure position bias cancels to a tie

# naive single-order win-rate would have reported a fake 100% for the same biased judge
naive = float(np.mean(fwd_biased))
assert naive == 1.0 and naive != corrected_winrate(fwd_biased, rev_biased)

# genuinely-better candidate: wins in both slots -> correction preserves the real signal
assert corrected_winrate(np.ones(n), np.ones(n)) == 1.0     # dominates regardless of position
# order-invariance: swapping which array is "forward" cannot change the averaged result
rng = np.random.default_rng(0)
a = rng.choice([0.0, 0.5, 1.0], n); b = rng.choice([0.0, 0.5, 1.0], n)
assert corrected_winrate(a, b) == corrected_winrate(b, a)
# ties everywhere -> 0.5 exactly (boundary)
assert corrected_winrate(np.full(n, 0.5), np.full(n, 0.5)) == 0.5

print("judge win-rate asserts passed: biased->%.2f  better->%.2f" % (
    corrected_winrate(fwd_biased, rev_biased), corrected_winrate(np.ones(n), np.ones(n))))

Lineage plane. Emit per-task scores plus the pinned versions to the experiment tracker and model registry so a promoted checkpoint is reproducible and its gate decision is auditable (SRE/MLOps practices).
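
One common way to implement the decontamination plane's check is word n-gram overlap between benchmark items and training documents. The sketch below is illustrative only. The tokenization (lowercased alphanumeric words), the n-gram size `n=8`, and the names `ngrams` and `contaminated_items` are assumptions, not a method this page prescribes. A real pipeline would tune them and index the corpus rather than hold it in memory.

```python
# Sketch of an n-gram overlap contamination check; stdlib only.
# ASSUMPTIONS: lowercased alphanumeric word tokens and n=8 are illustrative choices.
import re

def ngrams(text, n):
    """Set of word n-grams from lowercased text with punctuation dropped."""
    toks = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contaminated_items(benchmark_items, training_docs, n=8):
    """Indices of benchmark items that share at least one n-gram with the training data."""
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    return [i for i, item in enumerate(benchmark_items)
            if ngrams(item, n) & train_grams]

# made-up data: the first benchmark item appears verbatim inside a training document
train = ["Forum dump: if a train travels 60 miles in 2 hours what is its average speed? ans 30"]
bench = [
    "If a train travels 60 miles in 2 hours, what is its average speed?",
    "A tank holds 40 litres and drains 5 litres per minute; when is it empty?",
]
leaked = contaminated_items(bench, train)
print("contaminated item indices:", leaked)
```

A flagged benchmark's score is then treated as unknown until it is recomputed on the unflagged items only.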

How to run it in production

The gate is the production artefact: a CI check that blocks promotion when the candidate misses a floor or regresses against the incumbent. Gate on a suite plus the held-out set, never a single number, and never on training reward. This is the runnable core, self-asserting on the happy path, the exact-floor boundary, a missing task, and an above-floor-but-regressed candidate:

# eval_gate.py -- promote only if every gated task clears its floor AND does not regress.
# Pure stdlib; runnable. In CI you would json.load the harness results instead of the inline dict.
FLOORS = {"mmlu": 0.62, "gsm8k": 0.70, "ifeval": 0.55}   # promotion thresholds

def gate(scores, floors=FLOORS, incumbent=None, eps=0.0):
    """Return (ok, fails). A task fails if it is missing, below its floor,
    or (when an incumbent is given) regresses by more than eps."""
    fails = {}
    for task, floor in floors.items():
        s = scores.get(task)
        if s is None:                                    # missing task -> hard fail, never silently skip
            fails[task] = "missing"
        elif s < floor:
            fails[task] = f"{s:.3f} < floor {floor}"
        elif incumbent is not None and s < incumbent.get(task, 0.0) - eps:
            fails[task] = f"{s:.3f} regresses vs {incumbent[task]:.3f}"
    return (not fails), fails

# happy path: clears every floor -> promote
ok, fails = gate({"mmlu": 0.66, "gsm8k": 0.74, "ifeval": 0.58})
assert ok and fails == {}, fails

# boundary: exactly at the floor is a pass (>= semantics), just below is a fail
assert gate({"mmlu": 0.62, "gsm8k": 0.70, "ifeval": 0.55})[0] is True
ok, fails = gate({"mmlu": 0.6199, "gsm8k": 0.70, "ifeval": 0.55})
assert ok is False and set(fails) == {"mmlu"}, fails

# adversarial 1: a missing task must fail, not pass by absence
ok, fails = gate({"mmlu": 0.66, "gsm8k": 0.74})              # ifeval absent
assert ok is False and fails["ifeval"] == "missing", fails

# adversarial 2: above the floor but regressed vs incumbent -> block the promotion
inc = {"mmlu": 0.70, "gsm8k": 0.74, "ifeval": 0.58}
ok, fails = gate({"mmlu": 0.66, "gsm8k": 0.74, "ifeval": 0.58}, incumbent=inc)
assert ok is False and "regresses" in fails["mmlu"], fails       # 0.66 > floor 0.62 yet < incumbent 0.70

print("eval_gate asserts passed:", gate({"mmlu": 0.66, "gsm8k": 0.74, "ifeval": 0.58}))

In CI, json.load the harness results file written under ./eval_results (the exact file name and layout depend on the pinned harness version) into scores, load the incumbent's stored scores from the registry, and sys.exit(1) when gate(...) returns a non-empty fails so the pipeline blocks the promotion. Because RL reward can rise, or training loss can fall, while quality regresses, the gate must read held-out eval scores, not the training signal (RLVR).
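
The CI wiring described above might look like the sketch below. It restates the gate compactly so the script stands alone. The results-file layout (`{"results": {task: {metric_key: value}}}`), the `METRIC_KEYS` mapping, and the `extract_scores` and `main` helpers are all assumptions made for illustration. Check the actual JSON emitted by your pinned harness version before relying on any of them.

```python
# ci_gate.py -- sketch of a CI wrapper around the eval gate; stdlib only.
# ASSUMPTIONS: the results file looks like {"results": {task: {metric_key: value}}}
# and METRIC_KEYS names the metric to gate on per task. Both are illustrative,
# not a documented schema; verify them against your pinned harness output.
import json
import sys

FLOORS = {"mmlu": 0.62, "gsm8k": 0.70, "ifeval": 0.55}
METRIC_KEYS = {"mmlu": "acc", "gsm8k": "exact_match",
               "ifeval": "prompt_level_strict_acc"}      # hypothetical key names

def extract_scores(results_doc, metric_keys=METRIC_KEYS):
    """Flatten the harness document to {task: score}; absent tasks/metrics are left out."""
    out = {}
    for task, key in metric_keys.items():
        value = results_doc.get("results", {}).get(task, {}).get(key)
        if isinstance(value, (int, float)):
            out[task] = float(value)
    return out

def gate(scores, floors=FLOORS, incumbent=None, eps=0.0):
    """Same semantics as the gate above: missing, below floor, or regressed -> fail."""
    fails = {}
    for task, floor in floors.items():
        s = scores.get(task)
        if s is None:
            fails[task] = "missing"
        elif s < floor:
            fails[task] = f"{s:.3f} < floor {floor}"
        elif incumbent is not None and s < incumbent.get(task, 0.0) - eps:
            fails[task] = f"{s:.3f} regresses vs {incumbent.get(task, 0.0):.3f}"
    return (not fails), fails

def main(argv):
    """argv: [candidate_results.json, optional incumbent_results.json] -> exit code."""
    with open(argv[0]) as f:
        scores = extract_scores(json.load(f))
    incumbent = None
    if len(argv) > 1:
        with open(argv[1]) as f:
            incumbent = extract_scores(json.load(f))
    ok, fails = gate(scores, incumbent=incumbent)
    for task, reason in fails.items():
        print(f"GATE FAIL {task}: {reason}", file=sys.stderr)
    return 0 if ok else 1                # non-zero exit blocks the promotion step

if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1:]))
```

A task whose metric key is absent is dropped by `extract_scores` and therefore reported as missing by the gate, so a renamed metric fails loudly instead of passing by absence.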

How to maintain it

Pin and version everything. Prompt format, few-shot count, metric, and harness/task revision are all part of the result; treat a harness or task-set upgrade as a controlled change and re-baseline the incumbent's scores under the new versions before comparing.
Re-decontaminate as data moves. New training data can leak a previously clean benchmark; rerun the contamination check whenever the corpus changes, not once.
Recalibrate the judge. Position, verbosity, and self-preference bias drift with judge-model versions; periodically re-check the corrected win-rate against fresh human labels and refresh the rubric.
Keep the gate set unseen. Rotate or expand the held-out gate set if there is any risk it entered training or few-shot prompts; an eval everyone can see is an eval you overfit.
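
One way to enforce "pin and version everything" is to fingerprint the eval configuration and refuse to compare scores whose fingerprints differ. The sketch below assumes a plain dict of settings. The field names and version strings are hypothetical placeholders, and `config_fingerprint` and `comparable` are illustrative helpers rather than part of any harness.

```python
# Sketch: fingerprint the pinned eval configuration so scores are compared only
# when produced under identical settings; stdlib only.
# Field names and version strings below are hypothetical placeholders.
import hashlib
import json

def config_fingerprint(cfg):
    """Stable short hash of the eval settings (key order does not matter)."""
    canonical = json.dumps(cfg, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

def comparable(run_a, run_b):
    """Two runs are comparable only if their config fingerprints match."""
    return config_fingerprint(run_a["config"]) == config_fingerprint(run_b["config"])

base_cfg = {"harness": "example-harness==X.Y.Z", "task": "gsm8k", "task_revision": "rev-a",
            "num_fewshot": 5, "prompt_format": "chat-template-v1", "metric": "exact_match"}

incumbent = {"config": base_cfg, "score": 0.74}
# same settings written in a different key order -> still comparable
candidate = {"config": dict(reversed(list(base_cfg.items()))), "score": 0.76}
# few-shot count changed -> not comparable until the incumbent is re-baselined
drifted = {"config": {**base_cfg, "num_fewshot": 8}, "score": 0.79}

print("candidate comparable:", comparable(incumbent, candidate))
print("drifted comparable:", comparable(incumbent, drifted))
```

Storing the fingerprint next to each score in the registry lets the gate skip the regression comparison, or demand a re-baseline, when the incumbent was scored under different settings.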

How to scale it

Evaluation is a batch-inference workload: run it on vLLM, parallelize across tasks and GPUs, and make it a stage in the pipeline (data -> train -> eval -> register -> deploy) rather than a manual step (SRE/MLOps practices). Continuous eval in CI catches regressions per checkpoint; a nightly full-suite run tracks drift. Cache generations per (model, task, version) so re-scoring with a new metric does not re-run inference. For pass@k on reasoning and code, generating n samples per item is the cost driver, so size the sampling budget to the smallest n >= k that keeps the estimator's variance acceptable.
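
To make the sampling-budget trade-off concrete, the sketch below computes the exact mean and standard deviation of the per-problem pass@k estimator. It assumes each sample is independently correct with probability `p`, so that c follows Binomial(n, p). That independence model, the guessed `p`, the tolerance `max_std`, and the helper names are all illustrative assumptions. Real per-problem success rates vary widely, so treat the output as a planning aid.

```python
# Sketch: how the pass@k estimator's spread shrinks as n grows; stdlib only.
# ASSUMPTION: samples are i.i.d. correct with probability p, so c ~ Binomial(n, p).
from math import comb

def pass_at_k(n, c, k):
    """Unbiased per-problem estimate 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def estimator_mean_std(n, k, p):
    """Exact mean and standard deviation of pass_at_k(n, c, k) over c ~ Binomial(n, p)."""
    mean = second = 0.0
    for c in range(n + 1):
        weight = comb(n, c) * p**c * (1 - p)**(n - c)   # P(c correct out of n)
        est = pass_at_k(n, c, k)
        mean += weight * est
        second += weight * est * est
    var = max(second - mean * mean, 0.0)                # clamp tiny negative rounding error
    return mean, var ** 0.5

def smallest_n(k, p, max_std, n_max=200):
    """Smallest n in [k, n_max] whose estimator std is <= max_std, else None."""
    for n in range(k, n_max + 1):
        if estimator_mean_std(n, k, p)[1] <= max_std:
            return n
    return None

k, p = 10, 0.1                                          # hypothetical target k and success rate
for n in (10, 20, 50, 100):
    mean, std = estimator_mean_std(n, k, p)
    print(f"n={n:3d}  mean={mean:.3f}  std={std:.3f}")
print("smallest n with std <= 0.2:", smallest_n(k, p, 0.2))
```

The mean stays at 1 - (1 - p)^k for every n >= k, which is the unbiasedness property; only the spread changes with n. The benchmark-level score averages over many problems, so its spread is smaller still.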

Failure modes

Benchmark contamination. Train/eval overlap inflates scores and breaks the gate; decontaminate every reported benchmark (evaluation integrity).
Goodharting one metric. Optimizing a single benchmark degrades untested capabilities; gate on a suite plus a held-out set.
Incomparable runs. Changing prompt format, few-shot count, or harness version silently moves scores; pin and record them.
LLM-judge bias. Position, verbosity, and self-preference biases skew judge scores. Judging each pair in both orders cancels a constant position bias (see the corrected win-rate above), turning a fake 100% into the true 50% for a purely position-biased judge; verbosity and self-preference still need a fixed rubric and calibration against human labels.
Gating on training reward. Promoting on rising RL reward or falling loss instead of held-out eval ships reward-hacked regressions (RLVR).
No held-out gate set. If every eval is visible during development, you overfit the eval; reserve an unseen gate.
Silent missing task. A crashed or renamed task that returns no score must fail the gate, not pass by absence; the gate above treats a missing task as a hard fail.

References

lm-evaluation-harness (EleutherAI; Open LLM Leaderboard backend): https://github.com/EleutherAI/lm-evaluation-harness
lighteval (Hugging Face; multi-backend eval): https://github.com/huggingface/lighteval
HELM (Stanford CRFM, holistic evaluation): https://crfm.stanford.edu/helm/
Inspect (UK AI Safety Institute, evals framework): https://inspect.aisi.org.uk/
Chen et al., Evaluating Large Language Models Trained on Code (HumanEval, unbiased pass@k estimator): https://arxiv.org/abs/2107.03374

¹ EleutherAI lm-evaluation-harness: 60+ academic benchmarks; the backend of Hugging Face's Open LLM Leaderboard; supports HF, vLLM (--model vllm), and OpenAI-compatible backends. https://github.com/EleutherAI/lm-evaluation-harness
² Hugging Face lighteval: all-in-one evaluation across Accelerate/vLLM/SGLang/endpoints backends, with 1000+ tasks and custom metrics. https://github.com/huggingface/lighteval