Self-improving harnesses¶
Scope: treating the harness as something to optimize rather than hand-tune. It covers searching and ablating harness components, harnesses that edit themselves, optimizing prompts and context instead of weights, and lifting agent control flow out of the prompt into an explicit program graph. This is the dynamic counterpart to the static harness architecture; it depends on fixed-substrate evaluation and on the controls in governing self-modifying agents.
Code and configs here are reference templates; pin versions and validate before relying on them.
```mermaid
flowchart TB
  subgraph OPT["Harness optimization loop"]
    PROP["Propose harness change<br/>(prompt, tool set, component)"] --> EVAL["Evaluate on a fixed substrate"]
    EVAL --> KEEP{"Improves?"}
    KEEP -->|"yes"| ADOPT["Adopt"]
    KEEP -->|"no"| ROLL["Roll back"]
    ADOPT --> PROP
    ROLL --> PROP
  end
  FLAT["Implicit control flow in the prompt"] --> LIFT["Lift into a program"]
  LIFT --> DAG["Explicit graph (DAG of LLM calls)"]
```
Overview¶
The static view treats the harness as something an engineer writes once and tunes by hand (harness architecture). The frontier treats the harness as an object to optimize: search over its components, evolve its prompts, let it edit itself, or rewrite its control flow as an explicit program. The shared premise is that the harness, not only the model weights, is a lever on agent performance, so the harness deserves the same optimization machinery the model gets.[1] A recent survey organizes this whole "code as agent harness" space into three layers: the harness interface that connects the agent to reasoning and environment, the harness mechanisms for planning and adaptive control, and multi-agent scaling over shared code artifacts.[5]
Core knowledge¶
The harness is often the bottleneck¶
When an agent fails, the cause is frequently the harness rather than the model. AutoHarness synthesizes a code harness automatically and reports that a large share of agent losses were illegal or invalid actions the harness failed to constrain, which reframes the harness as a policy over the action space rather than passive plumbing.[1] In its evaluation a generator model iteratively refines a protective code wrapper from environment feedback until it eliminates illegal moves across 145 TextArena games, after which the wrapped smaller model outperforms a larger unwrapped one at lower cost.[1] Constraining what the model is allowed to emit (the tool-mode idea in harness architecture) is the manual version of the same insight.
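The manual version of that insight can be sketched as a wrapper that never forwards an illegal action. This is a hand-written illustration, not the wrapper AutoHarness synthesizes; `constrain`, `toy_policy`, and `toy_legal` are hypothetical names, and the policy stands in for a model call.

```python
from typing import Callable, Iterable, List


def constrain(policy: Callable[[str], str],
              legal_actions: Callable[[str], Iterable[str]],
              max_retries: int = 3) -> Callable[[str], str]:
    """Wrap a policy so it can only return actions that are legal in the given state."""
    def wrapped(state: str) -> str:
        legal = list(legal_actions(state))
        if not legal:
            raise ValueError("no legal action in this state")
        prompt = state
        for _ in range(max_retries):
            action = policy(prompt)
            if action in legal:
                return action
            # Feed the rejection back to the policy instead of forwarding the illegal move.
            prompt = f"{state}\nIllegal action {action!r}. Choose one of: {legal}"
        return legal[0]  # deterministic fallback: still a legal move
    return wrapped


def toy_policy(prompt: str) -> str:
    # Hypothetical stand-in for a model: emits an illegal move until it is corrected.
    return "left" if "Illegal action" in prompt else "fly"


def toy_legal(state: str) -> List[str]:
    # Hypothetical environment rule: only two moves are ever legal.
    return ["left", "right"]


safe_policy = constrain(toy_policy, toy_legal)
print(safe_policy("maze: wall ahead"))
```

The retry budget and the fallback to the first legal action are design choices made for the sketch; a real harness might instead abstain or escalate when the policy keeps emitting invalid actions.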
Searching and ablating the harness¶
If the harness is a lever, it can be searched. Meta-Harness performs end-to-end optimization of a model's harness with an agentic proposer that searches over the harness code itself, reading its source, its scores, and its execution traces, and reports gains on text classification, retrieval-augmented math reasoning, and agentic coding.[2] NLAH ablates individual harness components to find which ones carry their weight.[3] The governing rule from that ablation work is covered in harness architecture: a component earns its place only if it introduces an independent signal, and an ablation is valid only against a fixed substrate (see the fixed-substrate requirement below).
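A leave-one-out ablation on a pinned evaluation can be sketched as follows. This is a toy under stated assumptions, not the NLAH procedure: `evaluate` is a hypothetical scoring model in which `retriever` and `verifier` add skill and `self_critique` adds none, and the seed plays the role of the fixed substrate.

```python
import random


def evaluate(components, seed=0, n_tasks=200):
    """Hypothetical pinned eval: the same seed yields the same tasks, so scores are comparable."""
    rng = random.Random(seed)
    passed = 0
    for _ in range(n_tasks):
        difficulty = rng.random()
        skill = 0.3                      # assumed baseline capability
        if "retriever" in components:
            skill += 0.3                 # assumed independent signal
        if "verifier" in components:
            skill += 0.2                 # assumed independent signal
        # "self_critique" is modelled as adding nothing the doer did not already know
        passed += skill > difficulty
    return passed / n_tasks


def ablate(components, seed=0):
    """Score drop when each component is removed, measured on the same pinned tasks."""
    full = evaluate(set(components), seed)
    return {c: full - evaluate(set(components) - {c}, seed) for c in components}


deltas = ablate(["retriever", "verifier", "self_critique"])
keep = [name for name, lift in deltas.items() if lift > 0]
print(deltas)
print("keep:", keep)
```

Because every run draws the same task difficulties, a component with no effect shows a drop of exactly zero. On a substrate that moved between runs, that zero would be buried in noise.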
Harnesses that improve themselves¶
The strongest form closes the loop: the harness proposes and applies its own changes. Self-Harness runs a three-stage loop: identify a model's specific failure patterns, propose targeted harness modifications, and validate each change by regression test before adopting it. It reports lifting one model's pass rate from 40.5% to 61.9%.[4] This raises the question the static view never has to answer, which is whether the optimizer is safe; that question is the subject of governing self-modifying agents, not this page.
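The three stages can be sketched on a toy task set. Everything here is hypothetical and is not the Self-Harness implementation: `FIXES` maps an assumed failure pattern to an assumed harness fix, and `passes` is a made-up outcome model in which an over-strict `tool_allowlist` breaks tasks that were previously passing.

```python
from collections import Counter

# Hypothetical mapping: failure pattern -> harness fix that addresses it.
FIXES = {"bad_json": "json_repair", "timeout": "retry", "wrong_tool": "tool_allowlist"}

# Each task names the failure it triggers; None means the task has no failure trigger.
TASKS = ["timeout", "bad_json", "bad_json", None, "timeout", "bad_json", None, "wrong_tool"]


def passes(task, harness):
    """Toy outcome model: a task passes if it has no trigger or the matching fix is installed."""
    if task is None:
        # Modelled side effect: an over-strict allowlist breaks otherwise clean tasks.
        return "tool_allowlist" not in harness
    return FIXES[task] in harness


def self_improve(tasks, max_rounds=10):
    harness, log = frozenset(), []
    for _ in range(max_rounds):
        before = [passes(t, harness) for t in tasks]
        # Stage 1: identify failure patterns, most frequent first.
        patterns = Counter(t for t, ok in zip(tasks, before) if not ok and t is not None)
        adopted = False
        for pattern, _count in patterns.most_common():
            candidate = harness | {FIXES[pattern]}            # Stage 2: targeted modification
            after = [passes(t, candidate) for t in tasks]
            # Stage 3: regression test; nothing that passed before may fail now.
            regressed = any(b and not a for b, a in zip(before, after))
            log.append((FIXES[pattern], "rejected" if regressed else "adopted"))
            if not regressed:
                harness, adopted = candidate, True
                break
        if not adopted:
            break  # no remaining proposal survives validation
    return harness, log


harness, log = self_improve(TASKS)
print(sorted(harness), log)
```

In this toy the loop adopts the two fixes that help and rejects the third because it regresses tasks that were passing, which is the behavior the regression gate is meant to guarantee.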
Optimizing prompts and context, not weights¶
Much of the gain lives in the prompt and the context, which can be optimized directly without touching the weights. GEPA evolves prompts reflectively, using the agent's own traces to propose edits,[7] and ACE treats context engineering itself as the thing to optimize.[8] Both are cheaper than fine-tuning and operate on the part of the system the harness already owns.
Control flow as a program¶
A parallel move lifts control flow out of the implicit prompt transcript into an explicit program. Instead of hoping the model re-derives the plan each turn, the structure is expressed as code: a scheduler-theoretic framework lifts agent control flow into an explicit static graph,[9] and Autellix treats agentic programs as a DAG of LLM calls to be scheduled.[10] LLM-as-Code takes the paradigm to its conclusion: the program governs all control flow and the model is invoked only for the reasoning and generation steps, keeping deterministic looping and branching in code where they are reliable, which its authors report stabilizes long computer-use sequences.[6] The program owns the control flow; the context becomes a call graph rather than a flat transcript. This is one instance of externalization, the broader pattern of moving state, control, and memory out of the model and into inspectable structure.[11]
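A minimal sketch of control flow as an explicit graph, assuming each node is a prompt template plus the names of the nodes it depends on. It uses the standard-library `graphlib.TopologicalSorter` (Python 3.9+) for ordering; `run_graph`, the node names, and `stub_model` are hypothetical, and none of this reproduces the scheduling in Autellix or the cited frameworks.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, Tuple

Node = Tuple[Tuple[str, ...], str]  # (names of dependencies, prompt template)


def run_graph(nodes: Dict[str, Node],
              model: Callable[[str], str],
              inputs: Dict[str, str]) -> Dict[str, str]:
    """Run every node once, in dependency order; the program decides the order, not the model."""
    order = TopologicalSorter({name: deps for name, (deps, _) in nodes.items()})
    results = dict(inputs)
    for name in order.static_order():        # raises graphlib.CycleError on a cyclic graph
        if name in results:                   # external input, nothing to call
            continue
        deps, template = nodes[name]
        prompt = template.format(**{d: results[d] for d in deps})
        results[name] = model(prompt)         # the model is only a callee
    return results


nodes = {
    "plan":   (("task",), "Plan steps for: {task}"),
    "draft":  (("plan",), "Write code following: {plan}"),
    "tests":  (("plan",), "Write tests for: {plan}"),
    "review": (("draft", "tests"), "Review {draft} against {tests}"),
}

calls = []


def stub_model(prompt: str) -> str:
    # Hypothetical stand-in for an LLM call; records the prompt it was given.
    calls.append(prompt)
    return f"<output {len(calls)}>"


out = run_graph(nodes, stub_model, {"task": "parse CSV"})
print(len(calls), "model calls;", "review saw:", calls[-1])
```

Because `draft` and `tests` depend only on `plan`, a scheduler would be free to run them concurrently; this sketch runs them sequentially to stay short.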
The fixed-substrate requirement¶
Every claim of harness improvement is a measurement, and the measurement is only as trustworthy as its substrate. A searched or self-editing harness must be evaluated against a pinned model snapshot, a pinned sandbox image, a content-hashed eval set, and seeded randomness; otherwise the reported gain is drift, not progress (evaluation, harness architecture). A self-editing harness additionally needs the controls in governing self-modifying agents, the most important of which is that the optimizer must never be able to edit its own guardrails or success metric.
The discipline is small enough to state as code. This runnable block models the propose-evaluate-keep loop and asserts the properties that make it trustworthy: on a fixed substrate the loop is monotone (it never adopts a regression) and ends with a measured improvement, while a run optimized on a moving substrate scores lower on the same pinned eval because its accepted "gains" do not transfer:
```python
# harness_opt.py (validated): keep a harness change only if it beats the current one on a FIXED
# substrate; a moving substrate makes the gain fail to transfer. numpy only.
import numpy as np


def score(harness, substrate_seed):  # pinned tasks + seeded order == fixed substrate
    r = np.random.default_rng(substrate_seed)
    tasks = r.random(300)
    picks = r.integers(0, len(harness), 300)
    return float((harness[picks] > tasks).mean())  # pass a task if the picked capability beats it


def optimize(base, eval_seed, rounds=80, seed=0):
    rng = np.random.default_rng(seed)
    cur = base.copy()
    cur_s = score(cur, eval_seed())
    hist = [cur_s]
    for _ in range(rounds):
        cand = cur.copy()
        i = rng.integers(len(cand))
        cand[i] = np.clip(cand[i] + rng.normal(0, 0.15), 0, 1)  # propose a harness tweak
        if score(cand, eval_seed()) > cur_s:  # adopt ONLY on measured improvement
            cur = cand
        cur_s = score(cur, eval_seed())
        hist.append(cur_s)
    return cur, hist


base = np.random.default_rng(1).random(24)
FIXED = 42
cur_fixed, hist = optimize(base, lambda: FIXED)  # optimize on a pinned substrate
assert all(prev <= nxt + 1e-9 for prev, nxt in zip(hist, hist[1:]))  # monotone: never regress on a fixed substrate
assert score(cur_fixed, FIXED) > score(base, FIXED)  # a real, measured improvement

m = np.random.default_rng(7)  # adversarial: a MOVING substrate each step
cur_moving, _ = optimize(base, lambda: int(m.integers(1_000_000)))
assert score(cur_fixed, FIXED) > score(cur_moving, FIXED)  # moving-substrate gains do not transfer
print(f"held-out: base={score(base,FIXED):.3f} fixed={score(cur_fixed,FIXED):.3f} moving={score(cur_moving,FIXED):.3f}")
```
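The four pinned inputs named in this section (model snapshot, sandbox image, content-hashed eval set, seed) can be folded into a single substrate identifier, so that two scores are compared only when their identifiers match. A minimal sketch using only `hashlib` and `json`; the function names, the model label, and the digest string are hypothetical placeholders.

```python
import hashlib
import json


def eval_set_hash(examples):
    """Content hash of the eval set; any edited, added, or reordered example changes it."""
    canon = json.dumps(examples, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canon.encode("utf-8")).hexdigest()


def substrate_id(model_snapshot, sandbox_digest, examples, seed):
    """Return a short identifier plus the manifest it was derived from."""
    manifest = {
        "model": model_snapshot,          # pinned model snapshot label
        "sandbox": sandbox_digest,        # pinned sandbox image digest
        "eval_sha256": eval_set_hash(examples),
        "seed": seed,                     # seeded randomness
    }
    blob = json.dumps(manifest, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:16], manifest


def comparable(run_a, run_b):
    """Two scored runs may be compared only if they share a substrate."""
    return run_a["substrate"] == run_b["substrate"]


examples = [{"id": 1, "input": "2+2", "expected": "4"}]
sid, manifest = substrate_id("example-model-snapshot", "sha256:example-digest", examples, seed=42)
baseline = {"substrate": sid, "score": 0.61}

edited = examples + [{"id": 2, "input": "3+3", "expected": "6"}]
sid2, _ = substrate_id("example-model-snapshot", "sha256:example-digest", edited, seed=42)
candidate = {"substrate": sid2, "score": 0.70}

print(comparable(baseline, candidate))  # the eval set changed, so the gain is not attributable
```

Storing the identifier next to every score makes it possible to refuse a comparison mechanically instead of relying on reviewers to notice that the model or data moved.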
Don't-miss checklist¶
- Treat the harness (prompts, tools, components, control flow) as optimizable, not fixed.
- Optimize prompts and context first; it is cheaper than touching weights.
- Keep or drop each searched component by whether it adds an independent signal.
- Evaluate every harness change against a fixed substrate, or the gain is unmeasurable.
- Lift control flow into an explicit program when the plan must be reliable across turns.
- Put a self-editing harness under self-modifying-agent governance; never let it edit its own guardrails.
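The last item on the checklist can be enforced mechanically by reviewing the paths a proposed self-edit touches before it is applied. A minimal sketch, assuming a repository layout in which guardrails and evaluation code live under `guardrails/` and `eval/` and the success metric is `metrics.py`; those names and the functions below are hypothetical.

```python
import posixpath

# Hypothetical layout: locations the optimizer must never be able to modify.
PROTECTED_DIRS = ("guardrails", "eval")
PROTECTED_FILES = ("metrics.py",)


def touches_protected(path: str) -> bool:
    # Normalize first so "prompts/../guardrails/x" cannot dodge the check.
    norm = posixpath.normpath(path)
    parts = norm.split("/")
    escapes_repo = norm.startswith("/") or parts[0] == ".."
    return escapes_repo or parts[0] in PROTECTED_DIRS or norm in PROTECTED_FILES


def review_patch(changed_paths):
    """Allow a self-edit only if none of the paths it changes is protected."""
    blocked = sorted(p for p in changed_paths if touches_protected(p))
    return {"allowed": not blocked, "blocked": blocked}


print(review_patch(["prompts/system.txt", "tools/search.py"]))
print(review_patch(["prompts/system.txt", "prompts/../guardrails/policy.yaml"]))
```

A deny-list like this is only as complete as its entries; an allow-list of editable paths, enforced outside the process the optimizer controls, would likely be the stricter design.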
Failure modes¶
- Moving substrate. A model or data change is read as a harness improvement; the optimization chases noise.
- Same-signal components. A searched component recycles the doer model's own judgment and adds cost without lift.
- Self-improvement without guardrails. The optimizer edits its own success metric or policy and the loop diverges.
- Over-rigid graphs. Control flow lifted into a static DAG loses the flexibility a loop needs for genuinely open-ended tasks.
- Prompt-evolution overfitting. Reflective prompt edits overfit the evaluation traces and regress on held-out tasks.
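The overfitting failure mode is commonly guarded with a held-out gate: a change must improve the set the optimizer sees and must not regress a pinned set it never sees. A minimal sketch; `Scores`, `gate`, the tolerance, and the example numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    tuning: float    # pass rate on the set the optimizer sees
    heldout: float   # pass rate on a pinned set it never sees


def gate(current: Scores, candidate: Scores, tolerance: float = 0.01) -> str:
    """Adopt only when the tuning gain is real and held-out performance does not drop."""
    if candidate.tuning <= current.tuning:
        return "reject: no gain on the tuning set"
    if candidate.heldout < current.heldout - tolerance:
        return "reject: gain does not transfer to held-out tasks"
    return "adopt"


current = Scores(tuning=0.62, heldout=0.60)
print(gate(current, Scores(tuning=0.71, heldout=0.52)))  # large tuning gain, held-out regression
print(gate(current, Scores(tuning=0.66, heldout=0.61)))  # smaller gain that transfers
```

The tolerance absorbs measurement noise on the held-out set; it is an assumed value and would need to be calibrated against the run-to-run variance of the actual evaluation.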
Open questions & validation¶
- Which harness components are worth searching is task-dependent; validate on the target workload, not a generic benchmark.
- How far control flow should be lifted into a static graph versus left to the loop is unsettled and task-specific.
- Whether self-improving harnesses converge or drift over long horizons is an open question; gate them on held-out evaluation.
References¶
- Lee, Khattab, Finn et al., Meta-Harness (end-to-end harness optimization): https://arxiv.org/abs/2603.28052
- AutoHarness (synthesizing a code harness): https://arxiv.org/abs/2603.03329
- Self-Harness (harnesses that improve themselves): https://arxiv.org/abs/2606.09498
- LLM-as-Code (the program governs control flow; the model is a callee): https://arxiv.org/abs/2606.15874
- Code as Agent Harness (survey of code-as-harness): https://arxiv.org/abs/2605.18747
- NLAH (harness-component ablation): https://arxiv.org/abs/2603.25723
- GEPA: Reflective Prompt Evolution: https://arxiv.org/abs/2507.19457
- ACE: Agentic Context Engineering: https://arxiv.org/abs/2510.04618
- From Agent Loops to Structured Graphs: https://arxiv.org/abs/2604.11378
- Autellix (agentic programs as DAGs of LLM calls): https://arxiv.org/abs/2502.13965
- Externalization in LLM agents (survey): https://arxiv.org/abs/2604.08224
Related: Harness architecture · Governing self-modifying agents · Evaluating agents · Orchestration & control plane · The agent loop · Autonomous experimentation loops · Evaluation integrity & anti-gaming · Agentic systems
1. AutoHarness synthesizes a code harness automatically and finds that a large fraction of agent losses are illegal or invalid actions the harness failed to constrain, recasting the harness as a policy over the action space.
2. Meta-Harness performs end-to-end optimization of a model's harness rather than the weights.
3. NLAH ablates harness components to identify which carry their weight; valid ablation requires a fixed substrate.
4. Self-Harness (arXiv 2606.09498): a three-stage self-improvement loop that identifies a model's failure patterns, proposes targeted harness modifications, and validates each by regression test before adopting it; reported lifting a model's pass rate from 40.5% to 61.9%.
5. Code as Agent Harness (arXiv 2605.18747), survey: organizes code-as-harness into the harness interface (to reasoning and environment), harness mechanisms (planning and adaptive control), and multi-agent scaling over shared code artifacts.
6. LLM-as-Code (arXiv 2606.15874): inverts the usual arrangement so the program governs all control flow and the model is invoked only for reasoning and generation; keeping deterministic looping and branching in code stabilizes long computer-use sequences.
7. GEPA evolves prompts reflectively from the agent's own execution traces.
8. ACE treats context engineering as the optimization target, tuning what enters the context rather than the weights.
9. A scheduler-theoretic framework lifts agent control flow from the implicit prompt context into an explicit static graph.
10. Autellix treats agentic programs as a DAG of LLM calls to be scheduled.
11. A survey of externalization frames agents as moving state, control, and memory out of the model into inspectable structure.