Skip to content
Markdown

Evaluation integrity and anti-gaming

Scope: how to keep the evaluation honest when an autonomous optimizer can influence what it is scored on. It covers the structural boundary between "what is correct" (data, answer extraction, scorer) and "how to get there" (training code), why any nontrivial proxy is gameable, how to enforce the boundary with isolation rather than a static check, and how to defend a long campaign against held-out leakage. The evaluator-side companion to autonomous experimentation loops; it extends the reward-hacking treatment in reward design from designing the signal to protecting the evaluator from a self-modifying optimizer, and it depends on the fixed-substrate rule in self-improving harnesses and governing self-modifying agents.

Reference templates; pin versions and validate before production use.

flowchart LR
  OPT["Optimizer<br/>(proposes params / code diffs)"] -->|"may edit"| MUT["Mutable: train.py<br/>(algorithm, optimizer, generation)"]
  OPT -.->|"must NOT reach"| FROZ["Frozen — what is correct:<br/>data, gold answers, scorer"]
  MUT --> RUN["Trial (sandboxed)"]
  FROZ --> GATE["Sandbox gate:<br/>trial gets inputs, not gold/scorer"]
  RUN --> GATE
  GATE --> METRIC["Metric computed outside the trial"]
  METRIC --> HELD["Held-out eval the loop never sees"]
  MUT -.->|"leak path — must be blocked"| FROZ

Overview

An autonomous loop that can edit its own training code has, by construction, a conflict of interest: it is graded on a number and it can change the program that produces that number. If the optimizer can reach the evaluator (edit the scorer, read the gold answers, or fold the test set into training), it will eventually maximize the measure while the thing the measure stood for falls. The defense is structural, not a matter of nicer reward shaping: freeze what defines correctness, isolate the trial from it, and hold out an evaluation the loop can never touch. Reward design covers how to build a signal that resists gaming; this page covers the boundary that keeps an optimizer from redefining or leaking the signal in the first place.

Core knowledge

The frozen–mutable boundary

Split the experiment into two files connected only by the filesystem, never by imports:

  • Frozen: "what is correct." Data loading, answer extraction, reward/score computation, the evaluation protocol. The optimizer can never modify this. It is the trust boundary: evaluation integrity is guaranteed by freezing it.
  • Mutable: "how to get there." The training algorithm, optimizer, reward function internals, generation strategy. The optimizer may evolve this freely.

This is exactly FunSearch's design: the evolved Python function body is spliced into a frozen template and executed by a frozen evaluator in a sandbox, with the mutable region fenced by explicit markers and the template preserved across every iteration.1 The boundary is what makes a code-mutating loop safe to run at all: the optimizer evolves the "how" but can never redefine the "what."

Why any nontrivial proxy is gameable

The metric is always a proxy for the goal, and optimizing a proxy hard enough diverges from the goal. That is Goodhart's law: when a measure becomes a target, it stops being a good measure. This is not a bug you can design away:

  • Amodei et al. name "avoiding reward hacking" as one of five concrete AI-safety problems: an agent exploits the gap between the specified objective and the intended one.2
  • Krakovna et al. call it specification gaming: behavior that satisfies the literal specification without achieving the intended outcome.3
  • Skalse et al. formalize it and prove the trap: a proxy is "unhackable" only if increasing proxy return can never decrease true return, and for the set of all stochastic policies two reward functions are unhackable only if one of them is constant.4 A useful, nontrivial proxy is therefore essentially always hackable: you cannot patch the loophole away, so you must bound optimization pressure and protect the evaluator instead.
  • Pan et al. show the risk grows with capability: more capable agents exploit misspecified rewards more, sometimes via a phase transition where true reward drops sharply past a capability threshold.5 A stronger LLM proposer makes evaluation integrity more load-bearing, not less.
  • Gao et al. quantify it for learned reward models: as you optimize against the proxy, gold-standard reward first rises then falls, bounded by the KL budget you spend.6

The practical reading: the more powerful the optimizer, the harder it will lean on any crack in the evaluation. Design assuming it will find the crack.

Enforcement: a static check is not a sandbox

Freezing the scorer file is necessary but not sufficient. A common weak enforcement is a diff-validation check: reject any proposed code change that names the frozen file. That stops the obvious edit but not the runtime leak. A mutable train.py can, at runtime, open the frozen dataset, parse out the gold answers, and return a trivially perfect score, or import the scorer and short-circuit it. None of that "touches" the frozen file at diff time, so a static contract passes it through.

The strong enforcement is isolation at execution time, the FunSearch model: run each trial in a sandbox that exposes only the training inputs and withholds read access to gold labels and evaluator internals; score outside the sandbox on data the trial never saw. Two levels, weakest to strongest:

  • Static contract (weak). The diff is rejected if it references frozen paths. Cheap; catches direct edits; blind to runtime access.
  • Sandboxed execution (strong). The trial cannot read the gold/scorer even if its code tries; the metric is computed by the frozen evaluator on held-out inputs. This is the contract FunSearch relies on.1

Untrusted model-written code needs a real sandbox for safety anyway (agentic RL makes the same point about tool/code execution); reuse that boundary to also fence the evaluator.

Contamination and held-out integrity

Evaluation integrity fails silently at campaign scale even without malice. If the held-out set leaks into training (the proposer tunes directly to the eval, or eval examples get folded back into the training pool over iterations), scores inflate while true quality stalls. This is the loop-scale analog of test-set contamination in pretraining, a measurable and now-provable problem: MLE-bench explicitly studies pre-training contamination on its Kaggle tasks,7 and Oren et al. give a black-box statistical test that detects whether a benchmark was seen, using the fact that a clean model finds all orderings of an exchangeable benchmark equally likely.8 Defenses:

  • Keep a held-out split the loop can never read: not the scorer, not the proposer, not the training code. Report on it, tune on nothing derived from it.
  • Rotate or refresh the held-out set periodically on long campaigns so repeated exposure cannot accumulate.
  • Watch for the divergence signature: training/proxy reward climbing while the held-out metric is flat or falling, the fingerprint of both reward hacking and contamination.

Never let the optimizer edit its own metric or guardrails

The load-bearing invariant for any self-modifying loop: the optimizer must never be able to edit its own success metric, its evaluator, or its guardrails. A searched or self-editing system is only measurable against a fixed substrate: a pinned model snapshot, a content-hashed eval set, a pinned sandbox image, and seeded randomness; otherwise a reported "gain" is drift, not progress (self-improving harnesses, governing self-modifying agents, evaluating agents). The frozen–mutable boundary is the concrete mechanism that enforces this invariant for a training loop.

Don't-miss checklist

  • Split every experiment into a frozen "what is correct" and a mutable "how"; connect them only by the filesystem, never by imports.
  • Enforce the boundary with sandbox isolation, not just a diff-time path check; assume the trial will try to read the gold at runtime.
  • Compute the metric outside the trial's reach, on inputs the trial never saw.
  • Hold out an evaluation the optimizer, proposer, and training code can never read; rotate it on long runs.
  • Keep a KL/optimization-pressure budget; a nontrivial proxy is always hackable, so bound how hard it is pushed.
  • Alert on "proxy up, held-out down", the shared signature of reward hacking and contamination.
  • Pin the substrate (model snapshot, data hash, sandbox image, seeds) so measured gains are real, not drift.
  • Never let a self-modifying optimizer touch its own metric, evaluator, or guardrails.

Failure modes

  • Scorer edit. The optimizer rewrites the reward/scorer and awards itself a perfect score. Freeze it.
  • Runtime gold leak. Mutable code opens the frozen dataset and echoes the answers, passing a static contract. Sandbox it.
  • Held-out leakage. The eval set drifts into training over iterations; scores inflate. Hold out and rotate.
  • Proxy over-optimization. True quality falls while the learned reward climbs past the KL budget.6
  • Capability phase transition. A stronger proposer suddenly exploits a misspecification that a weaker one never found.5
  • Moving substrate. A model or data change is misread as an optimizer improvement; the loop chases noise (self-improving harnesses).
  • Metric monoculture. A single scalar the loop can game; no held-out or adversarial cross-check to catch the divergence.

Open questions & validation

  • No general test certifies a proxy is unhackable; the impossibility result says a nontrivial one essentially never is.4 Treat evaluation integrity as bounded risk to manage, not a property to prove.
  • Detecting contamination is an active area; black-box tests exist but so do evasion methods, so held-out hygiene (never exposing the set) beats after-the-fact detection.8
  • How much optimization pressure a given proxy tolerates before it breaks is empirical; measure the proxy-vs-gold divergence on your own task rather than assuming a safe budget (reward design).

References

  • Amodei et al., "Concrete Problems in AI Safety," arXiv:1606.06565 (avoiding reward hacking): https://arxiv.org/abs/1606.06565
  • Krakovna et al. (DeepMind), "Specification gaming: the flip side of AI ingenuity" (2020): https://deepmind.google/blog/specification-gaming-the-flip-side-of-ai-ingenuity/
  • Skalse et al., "Defining and Characterizing Reward Hacking," NeurIPS 2022, arXiv:2209.13085: https://arxiv.org/abs/2209.13085
  • Pan, Bhatia, Steinhardt, "The Effects of Reward Misspecification," ICLR 2022, arXiv:2201.03544: https://arxiv.org/abs/2201.03544
  • Gao, Schulman, Hilton, "Scaling Laws for Reward Model Overoptimization," ICML 2023, arXiv:2210.10760: https://arxiv.org/abs/2210.10760
  • Romera-Paredes et al., FunSearch (frozen template + evaluator, sandboxed execution), Nature 625 (2024): https://www.nature.com/articles/s41586-023-06924-6 · code: https://github.com/google-deepmind/funsearch
  • Chan et al. (OpenAI), "MLE-bench" (studies pre-training contamination), arXiv:2410.07095: https://arxiv.org/abs/2410.07095
  • Oren et al., "Proving Test Set Contamination in Black Box Language Models," arXiv:2310.17623: https://arxiv.org/abs/2310.17623

Related: LLM benchmarks · LLM eval harness · Autonomous experimentation loops · Reward design for RL · Learning-curve extrapolation & early stopping · Self-improving harnesses · Governing self-modifying agents · Evaluating agents · Agentic and tool-use RL · Glossary


  1. Romera-Paredes et al., FunSearch splices the LLM-evolved function body into a frozen template and scores it with a frozen systematic evaluator inside a sandbox; the template is preserved across iterations and the mutable region is fenced by markers. Nature 625 (2024). 

  2. Amodei et al. list "avoiding reward hacking" among five concrete AI-safety problems arising from the wrong objective function — the agent exploits the gap between specified and intended objectives. arXiv:1606.06565. 

  3. Krakovna et al. define specification gaming as behavior that satisfies the literal specification of an objective without achieving the intended outcome (DeepMind, 2020). 

  4. Skalse et al. define a proxy as unhackable iff increasing expected proxy return can never decrease expected true return, and prove that over all stochastic policies two reward functions are unhackable only if one is constant — so any nontrivial proxy is hackable. NeurIPS 2022, arXiv:2209.13085. 

  5. Pan, Bhatia, Steinhardt find more capable agents (more capacity, finer action space, longer training) exploit misspecified rewards more, with phase transitions where true reward drops sharply past a capability threshold. ICLR 2022, arXiv:2201.03544. 

  6. Gao, Schulman, Hilton show optimizing a learned reward model overshoots true reward per Goodhart's law, following measurable functional forms (best-of-n quadratic, RL logarithmic) bounded by the KL budget. ICML 2023, arXiv:2210.10760. 

  7. Chan et al., MLE-bench investigates the impact of pre-training contamination on its 75 Kaggle ML-engineering tasks alongside agent resource-scaling. arXiv:2410.07095. 

  8. Oren et al. prove test-set contamination in black-box models without pretraining-data or weight access: a clean model finds all orderings of an exchangeable benchmark equally likely, so a canonically-ordered benchmark scoring far higher than shuffled flags contamination. arXiv:2310.17623.