
Automated harness optimization

Scope: the operational method behind automated harness search, told through one deep worked case: the Meta-Harness loop run against Harvey's Legal Agent Benchmark (LAB) with a frozen deepseek-ai/DeepSeek-V4-Pro. This page covers the loop protocol (copy-and-adapt, one mechanism per iteration), the promotion rule (a blended score, trial averaging, and a noise margin), split hygiene, the reliability engineering an LLM optimizer needs, and what transferred across model families. The survey of harness self-optimization as a field lives in self-improving harnesses; the evaluator-protection rules the loop depends on live in evaluation integrity and anti-gaming.

The loop snippets and the pending_eval.json example are reference material quoted from the source study, not executed here. The numpy example is executed and asserted; it simulates the acceptance discipline (noise, margins, cost weights) and does not rerun the study.

flowchart TB
  HIST["Run history digest<br/>(frontier lineage + rejected orphans)"] --> PROP["LLM proposer<br/>(copies best harness, adds ONE mechanism)"]
  PROP --> CAND["Candidate harness + pending_eval.json<br/>(hypothesis, fix tasks, regression tasks)"]
  CAND --> EVAL["Dev evaluation<br/>(24 tasks x 3 trials, LLM judge)"]
  EVAL --> GATE{"blended score >= incumbent + min_delta?"}
  GATE -->|"yes"| FRONT["New best harness (frontier)"]
  GATE -->|"no"| ORPH["Orphan (kept in history, restackable)"]
  FRONT --> HIST
  ORPH --> HIST
  LEAK["Leakage audit: reject any iteration<br/>that touched the held-out test split"] -.-> EVAL

What it is

Automated harness optimization holds a model's weights fixed and lets a search loop rewrite only the code around it: the runtime wrapper that selects context, executes tool calls, checks output, and decides when a run is finished. The premise is the mismanaged geniuses hypothesis: capable models score badly when a brittle scaffold mismanages them, so a weak benchmark number can measure the scaffold as much as the model.¹ Meta-Harness turns the hand-tuning loop (inspect failures, adjust prompts, add checks, re-run) into an automated one by giving a coding agent the prior harness code, execution traces, and scores, and asking it for the next edit.²

The case study on this page applies that loop to LAB, Harvey's benchmark of realistic legal matters graded criterion-by-criterion by an LLM judge that only reads files at the top level of output/ under the exact requested filename.³ The proposer was a Claude (Opus 4.8) session; the optimized model, deepseek-ai/DeepSeek-V4-Pro, was never fine-tuned. Over 22 evaluated iterations the dev-set pooled criterion pass rate climbed from 63.1% to 83.3%, and the final harness, run untouched on a held-out 100-task test split, moved from 63.4% to 80.1% pooled and from 0% to 5.0% all-pass, placing the open model between Sonnet 4.6 and Opus 4.6 on LAB's headline metric.⁴

Why use it

The harness can be most of the gap. With the same model, judge, and tasks, five wrappers spanned 3.5% to 80.1% pooled: mini-swe-agent 3.5%, Goose 23.2%, Pi 45.4%, the stock LAB harness 63.4%, the optimized harness 80.1%.⁵ The 16.7-point held-out gain over the stock harness cost roughly twenty evaluated candidates and no training run.
Operational failures are searchable. The largest wins were deterministic code, not prompts: the judge cannot find a deliverable that is misplaced, misnamed, chunked, or left in a scratch folder, and a post-solve landing step turns that zero into a scored result. Five of the top six harnesses on the final frontier were code mechanisms.
Gains compound. Copy-and-adapt means every accepted mechanism is inherited by all later candidates, so single fixes stack: deliverable_landing_gate alone moved dev pooled from 72.8% to 79.2%.
Code mechanisms travel. The V4-Pro-tuned harness, run untouched, lifted DeepSeek V4 Flash by 14.4 points. Robustness fixes such as toolcall_json_repair also helped a different family (Nemotron-3 Ultra), although the net gain there was only +0.4 points because losses from model-specific prompt playbooks cancelled the repair gains.

When to use it (and when not)

Use it when a dense, scriptable metric exists. The loop optimizes the pooled criterion pass rate because it moves smoothly; LAB's sparse all-pass headline enters only as a bonus term. A verifier dense enough for search to climb is the real prerequisite.
Use it when failures look operational. Missing deliverables, malformed tool calls, repetition loops, wrong-file output: these are exactly what deterministic scaffold code can guarantee away.
Do not expect it to buy substance. The searched harness plateaued around 83% pooled on dev; a tax-analysis task kept missing roughly 30 of 107 criteria on every trial (quantitative depth, citation precision). Past the operational floor, the base model carries the substantive work.
Do not copy the small dev set when scoring is cheap. The study kept 24 dev tasks only because each evaluation costs real money (LLM judge, long rollouts). With a cheap deterministic scorer, use a much larger dev set.
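
To make the dense-versus-sparse distinction concrete, here is a minimal sketch of the two metrics. It assumes "pooled" means criteria passed over criteria graded, summed across rollouts; the Rollout record and the numbers are illustrative and are not taken from the study.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rollout:
    task: str
    passed: int   # rubric criteria the judge marked as met
    total: int    # rubric criteria graded for this task

def pooled_rate(rollouts):
    """Criteria passed over criteria graded, pooled across all rollouts."""
    return sum(r.passed for r in rollouts) / sum(r.total for r in rollouts)

def all_pass_rate(rollouts):
    """Share of rollouts in which every criterion passed."""
    return sum(r.passed == r.total for r in rollouts) / len(rollouts)

# Illustrative numbers: every task improves, none reaches a full pass.
before = [Rollout("a", 40, 50), Rollout("b", 10, 20), Rollout("c", 27, 30)]
after = [Rollout("a", 44, 50), Rollout("b", 14, 20), Rollout("c", 28, 30)]

print(pooled_rate(before), pooled_rate(after))      # dense metric moves
print(all_pass_rate(before), all_pass_rate(after))  # sparse metric does not
```

In this toy case the pooled rate registers the improvement while the all-pass rate stays at zero, which is the property that lets a search loop climb the dense metric.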

Architecture

Two roles split cleanly. The proposer (an LLM session) reads a history digest, forms a hypothesis, copies the current best harness, and adds exactly one mechanism. The scaffold (plain Python, meta_harness.py) runs the experiment and is strict about what counts: evaluate_harness() benchmarks the candidate on all 24 dev tasks at 3 trials; update_frontier() decides promotion; _touched_test() rejects any iteration that read the held-out split; _history_digest() feeds back the frontier lineage plus rejected orphans so a mechanism that missed once can be stacked later; loop_state.json keeps one contiguous iteration counter across relaunches. The behavioral contract for the proposer lives in a SKILL.md: copy-and-adapt, one mechanism at a time, no task-specific hints, never read the test split.
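
The study's _touched_test() is not reproduced on this page. As a rough sketch of what a mechanical leakage check could look like, the function below assumes each iteration records the file paths it accessed and that the held-out split lives under a single directory; the function name, the data/test root, and the access log are all hypothetical.

```python
from pathlib import PurePosixPath

def touched_test(accessed_paths, test_root="data/test"):
    """Return the accessed paths that fall inside the held-out split.

    Hypothetical sketch: assumes paths are already normalized (no '..')
    and that the test split sits under one directory, test_root.
    """
    root = PurePosixPath(test_root)
    hits = []
    for raw in accessed_paths:
        path = PurePosixPath(raw)
        # Match the root itself or anything nested below it, but not
        # sibling directories that merely share a name prefix.
        if path == root or root in path.parents:
            hits.append(raw)
    return hits

log = ["data/dev/task-01/brief.md", "data/test/task-77/brief.md"]
violations = touched_test(log)
if violations:
    print("reject iteration, held-out split was read:", violations)
```

A check of this kind is only as good as the access log behind it, so the logging layer would need to sit below anything the proposer can edit.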

Candidates are flat copies, not subclasses: after twenty-odd iterations an inheritance chain would be unreadable, and a flat copy keeps every accepted mechanism visible in one file. Each candidate ships a machine-readable claim of what it fixes:

{
  "iteration": 5,
  "candidates": [{
    "name": "deliverable_reassembly_gate",
    "hypothesis": "Reassembling a chunked-write-clobbered deliverable from the model's own write history raises the pooled criterion pass rate ...",
    "changes": "Copied the highest not-promoted orphan ... added ONE deterministic post-solve mechanism, _land_clobbered_deliverables ...",
    "fix_tasks": ["funds-asset-management/draft-compliance-manual", "..."],
    "regression_tasks": ["environmental-esg/draft-markup-of-administrative-settlement-agreement", "..."],
    "observed": "Causal proof (deterministic judge re-score of the shipped gate output, no rollout): ... Live A/B: fix slice 0.801 -> 0.914, regression slice flat (0.810 -> 0.805)."
  }]
}
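
A scaffold could reject a malformed claim before spending any evaluation budget. The sketch below checks only the fields visible in the example above; the required-field list, the disjoint-slices rule, and the function name are assumptions made for illustration, not the study's validation logic.

```python
import json

# Fields taken from the example above; "observed" is left optional here on the
# assumption that it is filled in after evaluation.
REQUIRED = ("name", "hypothesis", "changes", "fix_tasks", "regression_tasks")

def claim_problems(raw: str) -> list[str]:
    """Return human-readable problems found in a pending_eval.json payload."""
    doc = json.loads(raw)
    problems = []
    if not isinstance(doc.get("iteration"), int):
        problems.append("iteration must be an integer")
    candidates = doc.get("candidates") or []
    if not candidates:
        problems.append("no candidates")
    for i, cand in enumerate(candidates):
        for key in REQUIRED:
            if not cand.get(key):
                problems.append(f"candidate {i}: missing {key}")
        # Assumed rule: a task cannot be both a fix and a regression probe.
        overlap = set(cand.get("fix_tasks", [])) & set(cand.get("regression_tasks", []))
        if overlap:
            problems.append(f"candidate {i}: tasks in both slices: {sorted(overlap)}")
    return problems

example = json.dumps({
    "iteration": 5,
    "candidates": [{
        "name": "example_gate",
        "hypothesis": "placeholder hypothesis",
        "changes": "placeholder change note",
        "fix_tasks": ["task-a", "task-b"],
        "regression_tasks": ["task-b", "task-c"],
    }],
})
print(claim_problems(example))
```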

How to use it: the promotion rule

The promotion rule is the search objective, for better or worse. LAB's all-pass rate is too sparse to optimize (on 24 dev tasks at 3 trials, one extra all-passing rollout is about 1.4 points, so optimizing it directly chases luck). The study optimizes the dense pooled rate and folds all-pass in as a bonus, with a small cost term:

score = pooled_criterion_rate + 0.5 * all_pass_rate - 0.005 * tokens_per_million

Each weight has a reason. The 0.5 on all-pass means one lucky all-pass run cannot clear the margin by itself (one extra all-passing rollout adds about 1.4 points of all-pass rate, so about 0.7 points of blended score), but a pooled gain plus an all-pass gain can. The cost term (0.005 per million tokens, which is about half a point when the rates are written as fractions) stops a much more expensive harness from winning on a marginal score gain. A candidate is promoted only if its blended score beats the incumbent by min_delta, set to one point. That margin sits just above the noise left after averaging three trials: single passes carry about 1.4 points of standard deviation (the study's three full passes landed at 72.9%, 76.2%, and 74.9%), and averaging three trials divides that by √3, to roughly 0.8. A stricter 3-point margin was tried and rejected because it blocked genuine stacked gains of 1.5 to 1.8 points.
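
The noise figures in this section can be checked with a few lines of arithmetic. The sketch uses only numbers quoted on this page; the choice of the population standard deviation is an assumption, made because it reproduces the quoted 1.4 points (the n-1 sample form would give about 1.7).

```python
import math
import statistics

passes = [72.9, 76.2, 74.9]            # three full dev passes, pooled %, from the text

# Population standard deviation of a single pass (assumed form, see lead-in).
single_pass_std = statistics.pstdev(passes)

# Averaging three independent trials shrinks the noise by sqrt(3).
mean_of_three_std = single_pass_std / math.sqrt(3)

# Granularity of the sparse metric: 24 dev tasks x 3 trials = 72 rollouts.
rollouts = 24 * 3
one_all_pass = 100 / rollouts           # points of all-pass rate per lucky rollout
bonus_from_one_all_pass = 0.5 * one_all_pass

print(f"single-pass std:          {single_pass_std:.2f} points")
print(f"std of a 3-trial mean:    {mean_of_three_std:.2f} points")
print(f"one all-pass rollout:     {one_all_pass:.2f} points")
print(f"...after the 0.5 weight:  {bonus_from_one_all_pass:.2f} points")
```

With the 0.5 weight, a single lucky all-pass rollout contributes less than the one-point min_delta, whereas optimizing all-pass directly would let it exceed the margin.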

How to develop with it: validating a mechanism

The proposer's protocol demands evidence per candidate: either a causal replay (re-run a deterministic fix over old transcripts and re-score, which costs zero model tokens) or a pooled comparison over at least 5 fix tasks and 5 regression tasks. Single-trial, single-task swings are noise by construction. The accepted mechanism families from the run, in descending impact: deliverable landing and delivery (code), matter fidelity (a runtime check that detects a draft about the wrong dispute and forces a redraft), loop robustness (toolcall_json_repair, repetition breaking), and prompt playbooks (work-type methodology in the system prompt; two entered the frontier early, later prompt-only candidates never beat the code-heavy frontier).
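
The pooled slice comparison can be sketched as follows. This is an illustration under assumptions: results are represented as a mapping from task name to (criteria passed, criteria graded), the function names are hypothetical, and the data is made up; only the minimum of 5 tasks per slice comes from the protocol described above.

```python
def pooled(results, tasks):
    """Pooled pass rate over a slice; results maps task -> (passed, total)."""
    passed = sum(results[t][0] for t in tasks)
    total = sum(results[t][1] for t in tasks)
    return passed / total

def slice_evidence(baseline, candidate, fix_tasks, regression_tasks, min_tasks=5):
    """Compare candidate and baseline on the fix and regression slices."""
    if min(len(fix_tasks), len(regression_tasks)) < min_tasks:
        raise ValueError(f"need at least {min_tasks} tasks in each slice")
    return {
        "fix_delta": pooled(candidate, fix_tasks) - pooled(baseline, fix_tasks),
        "regression_delta": pooled(candidate, regression_tasks)
                            - pooled(baseline, regression_tasks),
    }

# Illustrative data: the mechanism helps the fix slice, leaves the rest alone.
fix = [f"fix-{i}" for i in range(5)]
reg = [f"reg-{i}" for i in range(5)]
baseline = {t: (16, 20) for t in fix + reg}
candidate = {**{t: (18, 20) for t in fix}, **{t: (16, 20) for t in reg}}

evidence = slice_evidence(baseline, candidate, fix, reg)
print(evidence)
```

Pooling over the whole slice, rather than averaging per-task deltas, keeps a single small-rubric task from dominating the comparison.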

The statistics that make the promotion gate trustworthy are small enough to validate directly. This block is executed and asserted: trial averaging plus a margin suppress false promotions without blocking genuine stacked gains, the cost term rejects an expensive marginal win, and comparing a fresh challenger against a stale incumbent score (computed under an old cost weight) flips a real decision, which is the study's promotion-rule bug:

# promotion_gate.py — validated: trial averaging plus a promotion margin suppress
# false promotions; the cost term and stale-weight recomputation change real
# decisions at the boundary. Simulation of the acceptance discipline, not a rerun
# of the study. numpy only.
import numpy as np

TRIAL_STD = 1.4          # single-pass noise, points (source: appendix, 72.9/76.2/74.9)
LAMBDA = 0.5             # cost weight, points per million tokens ("about half a point")

def blended(pooled_pts: float, all_pass_pts: float, tokens_m: float,
            lam: float = LAMBDA) -> float:
    return pooled_pts + 0.5 * all_pass_pts - lam * tokens_m

def promoted(challenger: float, incumbent: float, min_delta: float) -> bool:
    return challenger >= incumbent + min_delta

def promotion_rate(true_gain: float, trials: int, min_delta: float,
                   rng: np.random.Generator, n: int = 200_000) -> float:
    means = true_gain + rng.normal(0.0, TRIAL_STD, (n, trials)).mean(axis=1)
    return float(np.mean(means >= min_delta))

rng = np.random.default_rng(0)

# (a) a no-better candidate: 3 trials + 1-point margin vs 1 trial + no margin
false_gated = promotion_rate(0.0, trials=3, min_delta=1.0, rng=rng)
false_naive = promotion_rate(0.0, trials=1, min_delta=0.0, rng=rng)
assert false_naive > 0.45                      # coin flip: noise alone promotes
assert false_gated < 0.15                      # gate suppresses it
assert false_gated < false_naive / 3
# and the gate keeps power: a genuine +1.7-point stack still lands, a 3-point
# margin (rejected in the study) would block it
assert promotion_rate(1.7, 3, 1.0, rng) > 0.70
assert promotion_rate(1.7, 3, 3.0, rng) < 0.15

# (b) the cost term: +1.5 pooled clears the margin, but +4M tokens/run costs
# 2.0 points, so the expensive candidate must lose
inc = blended(79.2, 4.2, 1.0)
expensive = blended(80.7, 4.2, 5.0)
assert 80.7 - 79.2 >= 1.0 and not promoted(expensive, inc, 1.0)

# (c) stale-weight bug: incumbent scored under an old lambda, challenger under
# the current one; the challenger is worse on pooled AND tokens yet wins
stale_inc = blended(80.0, 0.0, 3.0, lam=1.0)   # stored score, old weight
fresh_inc = blended(80.0, 0.0, 3.0, lam=0.1)   # recomputed, current weight
challenger = blended(78.5, 0.0, 4.0, lam=0.1)
assert promoted(challenger, stale_inc, 1.0)        # bug: worse harness promoted
assert not promoted(challenger, fresh_inc, 1.0)    # fix: recompute the incumbent

print(f"false promotion, 3 trials + 1pt margin: {false_gated:.3f}")
print(f"false promotion, 1 trial + no margin:   {false_naive:.3f}")
print("cost-term boundary and stale-weight flip: OK")

Output: "false promotion, 3 trials + 1pt margin: 0.109" and "false promotion, 1 trial + no margin: 0.501", then the boundary confirmation line. The gate turns a coin flip into roughly one false promotion in ten while keeping over 70% power on real 1.7-point gains.

How to maintain it: an LLM optimizer is a long-running job

The study's own summary is that the loop worked before it was trustworthy: it lost iterations to crashes, once promoted a worse harness, and measured one model unfairly. The hardening that made the headline number hold up:

Recompute derived numbers. The promotion bug above was a stale stored score computed under an earlier cost weight; update_frontier() now recomputes the incumbent under current weights on every comparison. Copy-and-adapt absorbed the damage, since the mistakenly promoted mechanism was carried forward and still sits in the final frontier.
Treat provider reliability as a first-class variable. Long rollouts (up to a couple of million tokens) hit stream drops, protocol errors, and rate limits. Nemotron first scored about 1 out of 100 because the adapter did not retry a bare first-turn server error; a one-line retry fix turned "this model cannot do legal work" into a fair measurement. Transient provider failures were re-run before any test number was reported.
Auto-resume the proposer, kill its orphans. Proposer sessions run one to three hours; they auto-resume on session-limit resets, timeouts, and transient server errors, stop cleanly on an expired login, and a process-group kill prevents orphaned rollout subprocesses.
Audit for leakage every iteration. _touched_test() rejects any iteration that read the held-out split; this is the frozen-vs-mutable boundary enforced mechanically rather than by convention.
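
The retry point is easy to sketch. The wrapper below is a generic exponential-backoff pattern, not the study's adapter code; TransientProviderError, call_with_retry, and the attempt and delay defaults are hypothetical names and values chosen for illustration.

```python
import time

class TransientProviderError(Exception):
    """Hypothetical marker for retryable failures (stream drop, 5xx, rate limit)."""

def call_with_retry(call, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Run call(); on a transient error wait base_delay * 2**attempt, then retry."""
    for attempt in range(attempts):
        try:
            return call()
        except TransientProviderError:
            if attempt == attempts - 1:
                raise  # out of retries: report as an infrastructure failure
            sleep(base_delay * 2 ** attempt)

# Usage: a first-turn call that fails twice, then succeeds.
outcomes = iter([TransientProviderError("502"), TransientProviderError("reset"), "ok"])

def first_turn():
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome

waits = []                                   # record delays instead of sleeping
result = call_with_retry(first_turn, sleep=waits.append)
print(result, waits)
```

Re-raising after the last attempt matters: an exhausted retry should be logged as a provider failure and re-run, not scored as a model failure.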

Running it in production: cost engineering

Evaluation, not proposal, dominates spend. A 100-task test run costs roughly $120 to $160 all-in: $90 to $100 of agent rollouts (mean about 1.8M tokens per task) plus $30 to $60 of judge calls, about 6,300 Sonnet calls at one per rubric criterion. Prompt caching reuses the per-task deliverable prefix across criteria and cuts judge input cost by a factor of 5 to 8. A 24-task dev evaluation (72 rollouts, about 4,500 judge calls) lands in the same order of magnitude, which is why dev stays small and why the 22 evaluated iterations dominate the budget. The study's stated next step is a cheaper judge calibrated against the reference one, which would allow a dev set an order of magnitude larger; its priced example puts DeepSeek V4 Flash at about 21x cheaper than Sonnet 4.6 on input and 54x on output.
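
The budget figures above follow from simple scaling. The sketch below uses only the numbers quoted in this section and assumes that dev tasks have the same average rubric size and per-rollout cost as test tasks; the dev cost range it prints is therefore an estimate, not a figure reported by the study.

```python
TEST_TASKS = 100
JUDGE_CALLS_TEST = 6300                  # one call per rubric criterion
ROLLOUT_COST_TEST = (90, 100)            # USD, low and high
JUDGE_COST_TEST = (30, 60)               # USD, low and high

calls_per_rollout = JUDGE_CALLS_TEST / TEST_TASKS        # average rubric size

dev_rollouts = 24 * 3                                    # tasks x trials
dev_judge_calls = dev_rollouts * calls_per_rollout

# Assumed: cost per rollout (agent + judge) is the same on dev as on test.
per_rollout_cost = [(r + j) / TEST_TASKS
                    for r, j in zip(ROLLOUT_COST_TEST, JUDGE_COST_TEST)]
dev_cost = [dev_rollouts * c for c in per_rollout_cost]

print(f"criteria per task (avg): {calls_per_rollout:.0f}")
print(f"dev judge calls:         {dev_judge_calls:.0f}")
print(f"dev evaluation estimate: ${dev_cost[0]:.0f} to ${dev_cost[1]:.0f}")
```

Under these assumptions a single dev evaluation costs close to a full test run, which is consistent with the page's point that the evaluated iterations, not the proposer, dominate the budget.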

Transfer is the production question: a harness tuned on one model is usually deployed on several. The measured pattern is that robustness and code mechanisms transfer across families, while prompt playbooks are model-specific and can backfire. Within the tuned family the gain carried over almost fully (V4 Flash +14.4 points); across families the net effect was roughly flat (+0.4 on Nemotron-3 Ultra, where tool-call repair gains and playbook losses cancelled). Ship the code mechanisms broadly; gate prompt playbooks per model family.

Failure modes

Optimizing the sparse headline metric. One lucky all-passing rollout on a small dev set is worth more than the noise margin; the search then climbs luck. Optimize the dense rate and demote the sparse one to a bonus term.
Dev-set overfitting. A 24-task dev set can only surface failure modes that appear in 24 tasks, and wins may not generalize; the held-out split is the check, and the proposer must never read it.
Stale derived scores at the promotion boundary. Comparing a fresh challenger to a stored incumbent score computed under old weights promoted a strictly worse harness; recompute both sides under current weights.
Infrastructure failure read as model failure. Missing retries scored a model at about 1/100; timeouts were harness behavior (Goose timed out on 73/100 tasks, Pi on 25, mini-swe-agent on 13) and stayed in the aggregate as failures. Separate the two before believing a number.
Prompt mechanisms that backfire cross-model. Playbooks tuned to one model hurt tasks another model could already finish; code mechanisms did not show this failure.
Expecting scaffold search to fix substance. The loop's ceiling on LAB was operational: past deliverable landing and matter fidelity, remaining misses were quantitative depth and citation precision, which no wrapper trick recovered.

References

Niklaus, Don't Train the Model, Evolve the Harness (the case study; interactive figures and run data): https://huggingface.co/spaces/joelniklaus/harness-optimization
LAB harness and loop code (GitHub): https://github.com/JoelNiklaus/harness-optimization
Run artifacts (dev and test results, proposer traces): https://huggingface.co/buckets/joelniklaus/LAB-results
Meta-Harness: End-to-End Optimization of Model Harnesses (arXiv 2603.28052): https://arxiv.org/abs/2603.28052
Zhang, Li, Khattab, The Mismanaged Geniuses Hypothesis: https://alexzhang13.github.io/blog/2026/mgh/
Harvey, Introducing Harvey's Legal Agent Benchmark: https://www.harvey.ai/blog/introducing-harveys-legal-agent-benchmark
Novikov et al., AlphaEvolve: A coding agent for scientific and algorithmic discovery (arXiv 2506.13131): https://arxiv.org/abs/2506.13131
Lange et al., ShinkaEvolve: Towards Open-Ended and Sample-Efficient Program Evolution (arXiv 2509.19349): https://arxiv.org/abs/2509.19349
Zhang et al., Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents (arXiv 2505.22954): https://arxiv.org/abs/2505.22954
Agrawal et al., GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning (arXiv 2507.19457): https://arxiv.org/abs/2507.19457

¹ Zhang, Li, Khattab: capable models held back by brittle hand-built scaffolds; a weak score can measure the scaffold as much as the model. https://alexzhang13.github.io/blog/2026/mgh/
² Meta-Harness (arXiv 2603.28052): automated search over the harness around a fixed model, with an LLM proposer reading prior harness code, execution traces, and scores.
³ Harvey's Legal Agent Benchmark: closed document workspaces, concrete deliverables, an LLM judge grading against per-task rubrics; even frontier models complete fewer than 10% of tasks end to end. https://www.harvey.ai/blog/introducing-harveys-legal-agent-benchmark
⁴ Niklaus, harness-optimization: frozen DeepSeek-V4-Pro, Claude (Opus 4.8) proposer, 24-task dev set at 3 trials, 100-task held-out split over the 1,251 public LAB tasks pinned to a fixed commit and seed; dev pooled 63.1% to 83.3% (all-pass 4.2%), held-out test 63.4% to 80.1% pooled and 0% to 5.0% all-pass; 22 evaluated iterations (candidates: 4 prompt-only, 14 code-only, 1 mixed).
⁵ Same model, judge, and tasks across five harnesses. Delivery explained much of the gap: Pi produced an expected deliverable file type on 75/100 tasks (scoring 56.8% there, 11.6% elsewhere); Goose managed 26/100 and mini-swe-agent 7/100.