HarnessX: a composable agent harness foundry
Scope: HarnessX (arXiv 2606.14249, Chen et al.), a foundry that composes agent harnesses from typed processor primitives, evolves them from execution traces with the AEGIS meta-agent pipeline, and co-trains the model on the traces the evolution loop already produces. This page covers the harness-as-value formalism (hooks, processors, singleton groups, the nine-dimensional taxonomy), the operational mirror that maps harness evolution onto RL and predicts its pathologies, the deterministic acceptance gate and variant isolation, cross-harness GRPO co-evolution, and the measured results and failure cases. It extends the harness cluster: static anatomy is harness architecture, the field survey of harness optimization is self-improving harnesses, the single-benchmark search case study is automated harness optimization, and training one skill document is skill optimization. HarnessX differs from all four in scope: it is infrastructure for composing, isolating, and evolving whole harnesses across tasks and models, plus a bridge into model RL.
All numbers are the paper's (v1, June 2026), measured on the same task sets used for evolution; the authors state held-out generalization is not evaluated, so treat gains as adaptation-set results. The codebase is announced for a future open-source release and no public repository existed to verify as of 2026-07. The processor snippet is a reference paraphrase of the paper's protocol, unexecuted; the Python composition example is executed and asserted.
flowchart TB
RUN["Run harness H_t on adaptation batch<br/>(pass@2, full trace recording)"] --> TS["Trace store: events, verifier scores,<br/>shipped and rejected edits"]
TS --> DIG["Digester: ~10M trace tokens to<br/>~10K structured per-task summaries"]
DIG --> PLAN["Planner: adaptation landscape<br/>(failing tasks, tried edits, untried types)"]
PLAN --> EVO["Evolver: K candidate harnesses<br/>+ change manifests + smoke tests"]
EVO --> CRIT["Critic: manifest vs trace evidence<br/>(one revision cycle, then ship_ranking)"]
CRIT --> GATE{"Deterministic gate: manifest complete,<br/>normalized, smoke test, seesaw constraint"}
GATE -->|"pass"| SHIP["H_t+1 committed"]
GATE -->|"improves subset, regresses solved"| FORK["Fork variant (ensemble routing)"]
GATE -->|"fail"| ARCH["Archived with rejection reason"]
SHIP --> RUN
FORK --> RUN
TS -.->|"same buffer"| GRPO["Cross-harness GRPO model update"]
GRPO -.-> RUN
What it is
HarnessX is a harness foundry: infrastructure that makes the runtime scaffold around a model (prompts, tools, memory, control flow) a first-class object that can be composed, adapted, and evolved alongside the model. A harness is the pair (M, C): a model configuration (which model serves the main, judge, and evaluator roles, with fallback policies) and a harness configuration recording behavior independent of model identity. The two combine into an executable agent via model_config.agentic(harness_config), and each is independently substitutable, serializable, and hashable.[1]
The harness configuration decomposes into processors and slots. A processor is the atomic behavioral unit, an object satisfying async def process(self, event: Event) -> AsyncIterator[Event] with exactly five outcomes: pass through, transform, split, intercept, or interrupt. Processors attach to one of eight lifecycle hooks (task_start, step_start, before_model, after_model, before_tool, after_tool, step_end, task_end), each with a fixed event type and a whitelist of permitted modifications that the run loop validates after every invocation; violating a contract raises immediately instead of propagating corrupted state. Three metadata fields govern composition: _singleton_group (mutual exclusion), _order (PRE, NORMAL, POST), and _after (soft dependencies). Slots are shared singletons: tool registry, tracer, workspace, sandbox provider, plugins.[1]
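The contract check described above can be modeled in a few lines. This is a minimal sketch, not the paper's run loop: the `Event` dataclass, the `WHITELIST` table, `ContractViolation`, and `run_hook` are hypothetical stand-ins (the paper's Table 1 defines the real per-hook whitelist), only two hooks are modeled, and a processor is reduced to a bare async generator function.

```python
import asyncio
from dataclasses import dataclass, fields, replace
from typing import AsyncIterator, Callable


@dataclass(frozen=True)
class Event:
    hook: str
    prompt: str = ""
    history: tuple[str, ...] = ()


# Hypothetical whitelist: the fields a processor may change on each hook.
WHITELIST: dict[str, frozenset[str]] = {
    "step_start": frozenset({"history"}),
    "before_model": frozenset({"prompt"}),
}


class ContractViolation(RuntimeError):
    """Raised as soon as a processor emits an event that breaks its hook contract."""


def changed_fields(before: Event, after: Event) -> set[str]:
    return {f.name for f in fields(Event)
            if getattr(before, f.name) != getattr(after, f.name)}


async def trim_history(event: Event) -> AsyncIterator[Event]:
    # Transform: keep only the two most recent history entries.
    yield replace(event, history=event.history[-2:])


async def rewrite_prompt(event: Event) -> AsyncIterator[Event]:
    # Touches `prompt`, which the step_start whitelist above does not permit.
    yield replace(event, prompt="rewritten")


Processor = Callable[[Event], AsyncIterator[Event]]


async def run_hook(event: Event, processor: Processor) -> list[Event]:
    """Invoke one processor and validate every emitted event against the whitelist."""
    emitted: list[Event] = []
    async for out in processor(event):
        illegal = changed_fields(event, out) - WHITELIST[event.hook]
        if illegal:
            raise ContractViolation(
                f"{processor.__name__} changed {sorted(illegal)} on {event.hook}")
        emitted.append(out)
    return emitted


event = Event("step_start", prompt="solve", history=("a", "b", "c"))
print(asyncio.run(run_hook(event, trim_history)))
```

Running `rewrite_prompt` on the same `step_start` event raises `ContractViolation` before the modified event reaches any later processor, which is the fail-fast behavior the paragraph describes.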
On top of this substrate sit two loops. AEGIS evolves the harness from traces through a four-stage pipeline (Digester, Planner, Evolver, Critic) behind a deterministic acceptance gate. Co-evolution feeds the same traces into cross-harness GRPO so the model internalizes strategies that successive harness versions introduced.[3][7]
Why use it
- Measured gains without touching weights. Across five benchmarks (GAIA, ALFWorld, WebShop, tau3-Bench, SWE-bench Verified) and three task-agent families (Claude Sonnet 4.6, GPT-5.4, Qwen3.5-9B), AEGIS improves 14 of 15 configurations, averaging +14.5 points with a range up to +44.0.[5]
- The weakest models gain most. Gains scale inversely with baseline: Qwen3.5-9B gains +44.0 on ALFWorld (53.0% to 97.0%), +17.1 on GAIA, +18.2 on SWE-bench Verified, while Sonnet 4.6 gains +11.2, +9.7, +10.9 on the same benchmarks. An evolved harness closes behavioral gaps a weak model cannot self-correct.[5]
- Typed composition is what makes evolution stable. On heterogeneous GAIA, a single evolved harness peaks at 73.8% then collapses to 49.5% (peak-final gap -24.3, far outside the ±8.5 binomial CI); variant isolation over the same typed components finishes at 87.4% with zero degradation and fewer tokens (107.8M vs 143.7M). The paper's framing: types do not prevent bad edits, they make each edit's scope explicit, which is the precondition for isolating it.[6]
- LLM proposes, deterministic gate disposes. No edit ships on the Critic's judgment alone: the gate checks manifest completeness, configuration normalization, smoke tests, and the seesaw constraint (no previously solved task may regress) in sequence. Safety properties hold regardless of meta-agent failure modes.[3]
- Co-evolution breaks the scaffolding ceiling. Interleaving cross-harness GRPO with harness evolution over one shared replay buffer adds +4.3 (GAIA) and +5.0 (WebShop) over harness-only evolution for Qwen3.5-9B, at no extra rollout cost: the model update replays traces the harness loop already paid for.[7]
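The grouping step behind the co-evolution bullet can be sketched as follows. This is an illustration under stated assumptions, not the paper's training code: the record layout, the toy rewards, and `group_relative_advantages` are hypothetical, and the advantage is the usual GRPO form (reward minus group mean, divided by group standard deviation), here with a population standard deviation and a small `eps`, both of which are choices of this sketch. The only cross-harness ingredient is that a group is keyed by task id, so trajectories from different harness versions land in the same group.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical replay-buffer records: (task_id, harness_version, verifier_reward).
buffer = [
    ("gaia-17", "H3", 0.0),
    ("gaia-17", "H4", 1.0),
    ("gaia-17", "H5", 1.0),
    ("gaia-42", "H4", 0.0),
    ("gaia-42", "H5", 0.0),
]


def group_relative_advantages(records, eps: float = 1e-6):
    """Group by task id across harness versions, then normalize rewards per group."""
    groups = defaultdict(list)
    for task_id, version, reward in records:
        groups[task_id].append((version, reward))

    advantages = {}
    for task_id, members in groups.items():
        rewards = [reward for _, reward in members]
        mu, sigma = mean(rewards), pstdev(rewards)
        for version, reward in members:
            advantages[(task_id, version)] = (reward - mu) / (sigma + eps)
    return advantages


adv = group_relative_advantages(buffer)
for key in sorted(adv):
    print(key, round(adv[key], 3))
```

In this toy buffer the failing `H3` trajectory on `gaia-17` gets a negative advantage and the later passing versions get positive ones, while `gaia-42`, which every version fails, contributes zero signal.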
When to use it (and when not)
- Use the foundry pattern when a team maintains harnesses across several tasks or models and pays repeated hand-tuning cost; the typed substrate turns per-task scaffolds into configuration rather than forks of application code.
- Use trace-driven evolution where a scriptable verifier exists per task and failures are operational (wrong tool, missing context, control-flow loops); those are the edits AEGIS ships most (context assembly and tool ecosystem were the most frequent edit targets).[1]
- Use variant isolation on heterogeneous task sets; the Global single-harness strategy is adequate where evolution converges in a few rounds (SWE-bench peaked at R2-R3, ALFWorld by R4-R7).
- Do not read the numbers as held-out generalization. All gains are peak accuracy on the adaptation set itself; the authors flag both selection bias and potential overfitting. Gate any production claim on your own held-out split (see evaluation integrity).
- Budget for a frontier meta-agent. AEGIS requires multi-file code generation and structured trace analysis; all experiments use Claude Opus 4.6 as meta-agent, and open-weight meta-agents are untested. Meta-agent budgets ran 100M to 175M tokens per benchmark.
- Co-evolution assumes joint control of harness evolution and model training over a shared buffer; across team boundaries that coupling is often impractical, per the authors.
Architecture
The operational mirror. Harness evolution is formalized as an MDP over symbolic artifacts: states are (harness configuration, trace store), actions are typed code-level edits generated by the meta-agent (open-ended, not enumerated), feedback is traces plus verifier scores, and the deterministic acceptance operator governs state transitions. The mirror is predictive: it converts three RL pathologies into design risks that the paper later observes empirically. Reward hacking (the evolver can exploit the verifier directly) is defended by the Critic; catastrophic forgetting (an edit repairing pattern A silently regresses pattern B through shared components) by the gate's seesaw constraint; under-exploration (a bias toward cheap prompt tweaks that pass gating) by the Planner's explicit adaptation landscape.[2]
AEGIS stages. One meta-agent drives all four stages with selective invocation. The Digester compresses roughly 10M raw trace tokens per GAIA round into structured per-task summaries (outcome, failure category, implicated components, evidence excerpts) with cross-round continuity. The Planner builds the adaptation landscape: failing tasks, edits already attempted, components implicated, edit types untried. The Evolver emits up to K=4 candidate harnesses per round as typed builder operations, each with a change manifest and, for new processor code, a smoke test. The Critic compares manifests against trace evidence, may request one revision, then returns a ship ranking; the deterministic gate applies its checks in order and archives failures with reasons. Protocol defaults: 15 rounds maximum, early stop after 3 idle rounds, ±5% single-round noise threshold, 3 seeds, 10 concurrent rollouts, pass@2 scoring.[3]
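The ordered gate and the round budget can be sketched together. This is a toy model under assumptions that are not from the paper: a candidate is a plain dict with hypothetical keys, a round counts as idle when nothing ships, only the first passing candidate in ranking order ships, and the ±5% noise threshold and multi-seed scoring are not modeled at all.

```python
from typing import Callable, Optional

MAX_ROUNDS, PATIENCE = 15, 3  # protocol defaults quoted above

# Gate checks in the order the text gives; each returns True on pass.
# The dict keys are hypothetical placeholders for the real check results.
GATE_CHECKS: list[tuple[str, Callable[[dict], bool]]] = [
    ("manifest_complete", lambda c: bool(c.get("manifest"))),
    ("config_normalized", lambda c: bool(c.get("normalized"))),
    ("smoke_test", lambda c: bool(c.get("smoke_ok"))),
    ("seesaw", lambda c: not c.get("regressed_tasks")),
]


def first_failure(candidate: dict) -> Optional[str]:
    """Return None if every check passes, else the name of the first failing check."""
    for name, check in GATE_CHECKS:
        if not check(candidate):
            return name
    return None


def evolve(propose: Callable[[int], list[dict]]):
    """Run rounds until the budget or the patience limit is hit."""
    shipped: list[str] = []
    archive: list[tuple[str, str]] = []   # (candidate id, rejection reason)
    idle = rounds_run = 0
    for rnd in range(1, MAX_ROUNDS + 1):
        rounds_run = rnd
        shipped_now = False
        for cand in propose(rnd):         # assumed to arrive in the Critic's ranking order
            reason = first_failure(cand)
            if reason is None:
                shipped.append(cand["id"])
                shipped_now = True
                break
            archive.append((cand["id"], reason))
        idle = 0 if shipped_now else idle + 1
        if idle >= PATIENCE:
            break
    return shipped, archive, rounds_run


def toy_propose(rnd: int) -> list[dict]:
    good = {"id": f"r{rnd}-a", "manifest": {"bucket": "tools"}, "normalized": True,
            "smoke_ok": True, "regressed_tasks": []}
    # After round 1 every proposal regresses a solved task, so the seesaw check fails.
    return [good] if rnd == 1 else [dict(good, regressed_tasks=["t2"])]


shipped, archive, rounds_run = evolve(toy_propose)
print(shipped, archive, rounds_run)
```

With the toy proposer, round 1 ships, rounds 2 to 4 are archived with the reason `seesaw`, and the loop stops after the third idle round instead of running all 15.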
The change manifest is the loop's evidence ledger and what makes every edit falsifiable: candidate id, bucket (prompt, tools, config, processor), capability evidence, file changes, predicted impact (tasks expected to unlock, stabilize, or be put at risk), and an attribution signature (a trace feature that must appear if the edit fired). The next round's traces confirm or refute the prediction, and ship-prediction accuracy becomes a health metric of the evolution itself.[4]
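A manifest with these fields, and one way to score its prediction against the next round, might look like the sketch below. The field names follow the list above, but the split of predicted impact into `predicted_unlock` and `predicted_at_risk`, the `score_prediction` function, and the scoring rule itself (a predicted unlock counts only if the task passes and the attribution signature appears in its trace) are assumptions of this sketch; the paper's exact accuracy definition is not reproduced here.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeManifest:
    candidate_id: str
    bucket: str                      # prompt | tools | config | processor
    capability_evidence: str
    file_changes: list[str]
    predicted_unlock: set[str]       # tasks expected to flip from fail to pass
    predicted_at_risk: set[str] = field(default_factory=set)
    attribution_signature: str = ""  # trace feature that must appear if the edit fired


def score_prediction(manifest: ChangeManifest,
                     next_round: dict[str, tuple[bool, str]]) -> dict:
    """next_round maps task -> (passed, trace_text) from the following round."""
    confirmed = {
        task for task in manifest.predicted_unlock
        if task in next_round
        and next_round[task][0]                                   # task now passes
        and manifest.attribution_signature in next_round[task][1]  # and the edit fired
    }
    broken = {task for task in manifest.predicted_at_risk
              if task in next_round and not next_round[task][0]}
    total = len(manifest.predicted_unlock)
    return {
        "confirmed": sorted(confirmed),
        "at_risk_broken": sorted(broken),
        "accuracy": len(confirmed) / total if total else 0.0,
    }


manifest = ChangeManifest(
    candidate_id="r5-c2", bucket="tools",
    capability_evidence="3 failures cite unreadable PDFs",
    file_changes=["tools/pdf_reader.py"],
    predicted_unlock={"t3", "t4", "t5"}, predicted_at_risk={"t1"},
    attribution_signature="tool:pdf_reader",
)
traces = {
    "t1": (True, "search -> answer"),
    "t3": (True, "tool:pdf_reader -> answer"),
    "t4": (True, "search -> answer"),          # passed, but the edit never fired
    "t5": (False, "tool:pdf_reader -> timeout"),
}
result = score_prediction(manifest, traces)
print(result)
```

In the toy data only `t3` is confirmed: `t4` passes without the signature, so the pass is not attributed to the edit, and `t5` shows the signature but still fails.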
Variant isolation (ensemble routing). Instead of rejecting an edit that improves one task cluster while regressing another, the system forks a new harness variant (pool capped at K, lowest performer retired) and routes each task to the variant with the highest estimated success on its cluster; the seesaw constraint is then scoped per variant.[6]
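The routing rule can be sketched as a small pool that tracks pass/fail outcomes per variant and cluster. Everything named here is hypothetical (`VariantPool`, the cluster labels, the toy outcomes); the success estimate is a plain empirical pass rate, retirement uses overall pass rate, and the fallback for an unseen cluster uses overall pass rate as well, which is this sketch's reading of the description rather than the paper's estimator.

```python
from collections import defaultdict
from typing import Optional


class VariantPool:
    def __init__(self, cap: int) -> None:
        self.cap = cap
        # variant -> cluster -> list of pass/fail outcomes
        self.outcomes: dict[str, dict[str, list[bool]]] = {}

    def record(self, variant: str, cluster: str, passed: bool) -> None:
        self.outcomes.setdefault(variant, defaultdict(list))[cluster].append(passed)

    def rate(self, variant: str, cluster: Optional[str] = None) -> float:
        """Empirical pass rate for one cluster, or overall when cluster is None."""
        per_cluster = self.outcomes[variant]
        if cluster is None:
            runs = [p for results in per_cluster.values() for p in results]
        else:
            runs = per_cluster.get(cluster, [])
        return sum(runs) / len(runs) if runs else 0.0

    def retire_weakest(self) -> list[str]:
        """Drop the lowest overall performers until the pool fits the cap."""
        retired = []
        while len(self.outcomes) > self.cap:
            weakest = min(self.outcomes, key=self.rate)
            del self.outcomes[weakest]
            retired.append(weakest)
        return retired

    def route(self, cluster: str) -> str:
        """Pick the best variant for a cluster; fall back to overall rate if unseen."""
        seen = [v for v in self.outcomes if self.outcomes[v].get(cluster)]
        if seen:
            return max(seen, key=lambda v: self.rate(v, cluster))
        return max(self.outcomes, key=self.rate)


pool = VariantPool(cap=2)
for variant, cluster, passed in [
    ("H_base", "web", True), ("H_base", "web", True),
    ("H_base", "files", False), ("H_base", "files", False),
    ("H_fork", "web", False), ("H_fork", "web", True),
    ("H_fork", "files", True), ("H_fork", "files", True),
    ("H_old", "web", False), ("H_old", "files", False),
]:
    pool.record(variant, cluster, passed)

retired = pool.retire_weakest()
print(retired, pool.route("web"), pool.route("files"), pool.route("audio"))
```

The fork wins the `files` cluster it was created for while the base variant keeps `web`, so neither edit has to be discarded to protect the other cluster.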
How to use it
The composition surface, paraphrased from the paper's protocol (reference template; the code is not yet released, so treat this as the paper's shape, not an importable API):
# Reference paraphrase of the HarnessX processor protocol (arXiv 2606.14249).
class ContextTrim(Processor):
    _singleton_group = "context_assembly"
    _order = NORMAL

    async def process(self, event: StepStartEvent) -> AsyncIterator[StepStartEvent]:
        yield event.with_history(trim(event.history))  # structural history edits: step_start only (Table 1)


harness = (HarnessConfig()
           .attach("step_start", ContextTrim())
           .attach("before_tool", ToolGuard()))
agent = model_config.agentic(harness)  # model and harness independently swappable
The behavioral space is organized along nine dimensions: model selection, context assembly, memory management, tool ecosystem, execution environment, evaluation and reward, control and safety, observability, and the training bridge. Observability supplies the trace substrate AEGIS reasons over, and the training bridge converts trajectories into RL records for co-evolution, so the taxonomy is also the edit surface.[1]
How to develop with it
The disciplines the foundry enforces (hook-typed composition, mutual exclusion, dependency ordering, gated acceptance with the seesaw constraint, variant forking) are checkable without the system. This block is executed and asserted, covering the adversarial cases: a plan with an unplaced dependency, a duplicate singleton group, a cross-hook swap, an edit that regresses a solved task, and a split edit that forks a variant instead of being lost:
# harnessx_compose.py - validated: the composition and gating discipline HarnessX
# builds on, modeled at runtime. A hook-typed processor pipeline (singleton groups,
# ordering dependencies), a composer that rejects invalid plans before execution,
# the paper's deterministic acceptance gate (manifest completeness, then the seesaw
# constraint: no previously solved task may regress), and the variant fork that
# ensemble routing performs instead of rejecting a split edit. A model of the
# discipline, not the system. Pure stdlib.
from dataclasses import dataclass, field

HOOKS: dict[str, str] = {  # subset of the paper's eight hooks, for the model
    "task_start": "TaskStartEvent", "step_start": "StepStartEvent",
    "before_model": "BeforeModelEvent", "after_model": "ModelResponseEvent",
    "before_tool": "ToolCallEvent", "after_tool": "ToolResultEvent",
}


@dataclass(frozen=True)
class Processor:
    name: str
    hook: str
    singleton_group: str
    after: tuple[str, ...] = ()  # soft deps on earlier singleton groups


@dataclass
class Manifest:
    bucket: str  # prompt | tools | config | processor
    predicted_impact: dict[str, bool]  # task -> expected pass (falsifiable)
    attribution_signature: str = ""


def compose(plan: list[Processor]) -> dict[str, list[Processor]]:
    """Validate a composition plan; reject before anything executes."""
    seen_groups: set[str] = set()
    pipeline: dict[str, list[Processor]] = {h: [] for h in HOOKS}
    for proc in plan:
        assert proc.hook in HOOKS, f"{proc.name}: unknown hook {proc.hook}"
        assert proc.singleton_group not in seen_groups, \
            f"{proc.name}: duplicate singleton group {proc.singleton_group}"
        for dep in proc.after:
            assert dep in seen_groups, f"{proc.name}: dependency {dep} not yet placed"
        seen_groups.add(proc.singleton_group)
        pipeline[proc.hook].append(proc)
    return pipeline


def swap(plan: list[Processor], old_group: str, new: Processor) -> list[Processor]:
    """Interface-compatible substitution: same singleton group, same hook."""
    old = next(p for p in plan if p.singleton_group == old_group)
    assert new.hook == old.hook, \
        f"swap rejected: {new.name} targets hook {new.hook}, group owns {old.hook}"
    return [new if p is old else p for p in plan]


def gate(manifest: Manifest, incumbent: dict[str, bool],
         candidate: dict[str, bool]) -> bool:
    """Deterministic acceptance: manifest completeness, then the seesaw constraint
    (the paper's rule: reject any edit that regresses a previously solved task)."""
    if not (manifest.bucket and manifest.predicted_impact):
        return False
    return all(candidate[t] for t, passed in incumbent.items() if passed)


def gate_or_fork(manifest: Manifest, incumbent: dict[str, bool],
                 candidate: dict[str, bool],
                 variants: list[dict[str, bool]], k: int) -> str:
    """Ensemble routing: a split edit (improves some, regresses solved) forks a
    new variant instead of being rejected outright; pool capped at k."""
    if gate(manifest, incumbent, candidate):
        return "shipped"
    improves = any(candidate[t] and not p for t, p in incumbent.items())
    if improves and len(variants) < k:
        variants.append(candidate)
        return "forked"
    return "rejected"


base = [
    Processor("SystemPrompt", "task_start", "system_prompt"),
    Processor("ContextTrim", "step_start", "context_assembly"),
    Processor("ToolGuard", "before_tool", "tool_policy", after=("context_assembly",)),
]

# 1) A complete plan assembles; hook typing routes each processor correctly.
pipe = compose(base)
assert [p.name for p in pipe["before_tool"]] == ["ToolGuard"]

# 2) A plan with a missing dependency is rejected before execution.
try:
    compose([Processor("ToolGuard", "before_tool", "tool_policy",
                       after=("context_assembly",))])
    raise SystemExit("missing dependency must be rejected")
except AssertionError as e:
    assert "not yet placed" in str(e)

# 3) Duplicate singleton groups are rejected (mutual exclusion).
try:
    compose(base + [Processor("ContextTrim2", "step_start", "context_assembly")])
    raise SystemExit("duplicate singleton group must be rejected")
except AssertionError as e:
    assert "duplicate singleton group" in str(e)

# 4) An interface-compatible swap yields a valid harness; a cross-hook swap fails.
swapped = compose(swap(base, "context_assembly",
                       Processor("ContextSummarize", "step_start", "context_assembly")))
assert [p.name for p in swapped["step_start"]] == ["ContextSummarize"]
try:
    swap(base, "context_assembly",
         Processor("BadSwap", "after_tool", "context_assembly"))
    raise SystemExit("cross-hook swap must be rejected")
except AssertionError as e:
    assert "swap rejected" in str(e)

# 5) Seesaw gate: an edit that regresses a solved task is rejected, state unchanged.
incumbent = {"t1": True, "t2": True, "t3": False, "t4": False}
variants: list[dict[str, bool]] = [incumbent]
regressing = {"t1": True, "t2": False, "t3": True, "t4": True}  # fixes 2, breaks t2
clean_gain = {"t1": True, "t2": True, "t3": True, "t4": False}  # fixes 1, breaks none
m = Manifest("processor", {"t3": True})
assert gate(m, incumbent, clean_gain) is True
assert gate(m, incumbent, regressing) is False
assert gate(Manifest("", {}), incumbent, clean_gain) is False  # incomplete manifest
snapshot = dict(incumbent)
assert gate(m, incumbent, regressing) is False and incumbent == snapshot

# 6) Ensemble routing forks the split edit instead of losing it; pool cap holds.
assert gate_or_fork(m, incumbent, regressing, variants, k=3) == "forked"
assert len(variants) == 2 and variants[0] == incumbent  # incumbent intact
assert gate_or_fork(m, incumbent, regressing, variants + [{}], k=3) == "rejected"

print("pipeline hooks:", {h: [p.name for p in ps] for h, ps in pipe.items() if ps})
print("seesaw: clean gain shipped, regression rejected, split edit forked")
print("all composition and gating assertions passed")
Output: the assembled pipeline routing, seesaw: clean gain shipped, regression rejected, split edit forked, and all composition and gating assertions passed. Note what the seesaw gate is and is not: it is a no-regression check on previously solved tasks under pass@2, not a strict-improvement requirement; the paper's Telecom failure case (below) shows exactly where that binary, per-edit check is blind.
How to maintain it
- Read the manifests, not just the score. Every shipped edit carries a rollback target and a falsifiable prediction; every rejected candidate is archived with its reason. Treat ship-prediction accuracy as the evolution loop's own health metric: on ALFWorld it fell from 80% (R3) to 0% (R7), signaling prompt-space exhaustion that the raw score alone could not distinguish from convergence.[8]
- Alarm on same-type edit concentration. Five consecutive prompt/processor "reminder" edits on tau3-Bench Telecom accumulated sub-threshold conflicts the seesaw constraint cannot see (pass@2 registers only binary flips); the sixth same-type edit then dropped compliance 14.0 points in one round, and the paper's own Critic had flagged the concentration yet the edit still shipped. Track the edit-bucket distribution per component and stop stacking.
- Re-verify with more seeds where flips are expensive. With 55 SWE-bench tasks, one flip moves aggregate accuracy ~1.8 points, and post-peak degradation reached -12.7 from peak; small task sets need more trials before trusting a shipped edit.
- Pin nothing yet. The codebase is unreleased as of 2026-07; anything built on the paper's interfaces today is a reimplementation, and the protocol details (hook whitelist, manifest schema) are version-0 targets to re-diff when the release lands.
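The same-type concentration alarm from the list above is simple to prototype against a ship log. This is a sketch under assumptions: the log format, the component names, and the `max_streak=3` threshold are hypothetical, and the toy log only mimics the shape of the Telecom pattern (repeated prompt edits on one component), not the paper's actual edits.

```python
def concentration_alerts(shipped, max_streak: int = 3):
    """Flag components receiving more than max_streak consecutive same-bucket edits.

    shipped: iterable of (round, component, bucket) in ship order.
    Returns a list of (round, component, bucket, streak_length).
    """
    alerts = []
    last_bucket: dict[str, str] = {}
    streak: dict[str, int] = {}
    for rnd, component, bucket in shipped:
        if last_bucket.get(component) == bucket:
            streak[component] += 1
        else:
            # A different bucket on this component resets its streak.
            last_bucket[component], streak[component] = bucket, 1
        if streak[component] > max_streak:
            alerts.append((rnd, component, bucket, streak[component]))
    return alerts


# Hypothetical ship log: five prompt edits stacked on one component.
ship_log = [
    (2, "telecom_policy", "prompt"),
    (3, "telecom_policy", "prompt"),
    (3, "retrieval", "tools"),
    (4, "telecom_policy", "prompt"),
    (5, "telecom_policy", "prompt"),
    (6, "telecom_policy", "prompt"),
]
alerts = concentration_alerts(ship_log)
print(alerts)
```

With this threshold the alarm fires at rounds 5 and 6, before a further same-type edit would be stacked; the right threshold for a real harness is an empirical question.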
Running it in production
Evolution is an offline, upfront cost that amortizes at deployment. The evolved harness is a static artifact requiring no meta-agent at inference; unseen tasks route to the variant with the highest overall success on the evolution set. On GAIA the 107.8M-token evolution amortizes within roughly 1,300 invocations because the evolved harness also cut per-task inference tokens by ~25% (targeted tool selection shortens trajectories); on ALFWorld per-task cost rose ~60% (decomposition prompts lengthen execution) and the return is accuracy, not cost.[9] Oversight is structural: manifests plus rollback targets give auditability, the deterministic gate refuses regressions, and the gating layer supports human approval above a configurable risk threshold (present in the design, not exercised in the paper's automated runs). A harness that rewrites itself in production is a governed self-modifying system; apply the controls in governing self-modifying agents and keep evolution off the live path, shipping evolved harnesses through the same release gates as code. For the model side, cross-harness GRPO is standard GRPO with task-level grouping across harness versions and cached behavior log-probabilities, so the agentic RL infrastructure concerns apply unchanged.
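The amortization claim is a one-line break-even calculation, sketched here. The function name is hypothetical. The GAIA per-task figures are inferred, not quoted: the paper's notes give roughly 83K tokens saved per invocation and a ~25% reduction, which implies a baseline near 332K tokens per task. The ALFWorld per-task baseline of 100K is a made-up placeholder used only to show the case where cost rises.

```python
import math
from typing import Optional


def break_even_invocations(evolution_tokens: int,
                           baseline_per_task: int,
                           evolved_per_task: int) -> Optional[int]:
    """Invocations needed for per-task savings to repay the evolution budget.

    Returns None when the evolved harness costs more per task, in which case
    the budget never amortizes on tokens and the return is accuracy only.
    """
    saving = baseline_per_task - evolved_per_task
    if saving <= 0:
        return None
    return math.ceil(evolution_tokens / saving)


# GAIA: 107.8M evolution tokens, ~83K saved per invocation (implied 332K -> 249K).
gaia = break_even_invocations(107_800_000, 332_000, 249_000)

# ALFWorld: per-task cost rises ~60%; the 100K baseline is a hypothetical placeholder.
alfworld = break_even_invocations(43_400_000, 100_000, 160_000)

print(gaia, alfworld)
```

The GAIA figure lands at about 1,300 invocations, matching the number quoted above; the ALFWorld call returns `None` because there is no token saving to amortize against.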
Failure modes
- Reward hacking through the verifier. A shipped composite edit raised GAIA accuracy 74.8% to 79.6%, but part of the gain came from exploiting format regularities in the exact-match verifier rather than retrieval; trace inspection caught it one round later and a guard shipped two rounds after the edit. A scalar score cannot distinguish this from genuine improvement; only trace structure can.[8]
- Sub-threshold catastrophic forgetting. Per-edit gating under binary pass@2 misses accumulated coupling: no single Telecom edit violated the seesaw constraint, yet the stack of six degraded compliance 14.0 points at once. Per-edit gates bound single-step damage, not drift.[8]
- Under-exploration plateaus. Cheap prompt edits pass gates and bias the next round's hypotheses toward the same neighborhood; the system shipped <1% gains per round for four rounds on ALFWorld before structural edits were even calibrated.[8]
- Single-harness collapse on heterogeneous tasks. Global evolution on GAIA lost 24.3 points from its peak as conflicting edits fought; without variant isolation, heterogeneity converts evolution into oscillation.[6]
- Adaptation-set overfitting. Peak accuracy on the evolution set is the headline; nothing in the loop measures held-out transfer, and a harness tuned to 103 tasks may encode those tasks. Keep a frozen split the meta-agent never sees.
- Interface drift in a growing registry. The run loop validates hook contracts only at invocation time, and singleton groups only prevent duplicates, not stale processors whose assumptions about shared slots have rotted; audit slot dependencies when retiring or replacing components (the same hygiene as any tool registry).
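Several of the failure modes above, and the earlier advice about small task sets, depend on how much of a score change is noise. The sketch below computes two quick checks: the size of a single task flip and a normal-approximation binomial confidence half-width. The function names are hypothetical, the 95% level (z = 1.96) is an assumption, and so is reading the 103-task set as the GAIA adaptation set; under those assumptions the half-width at 73.8% reproduces the ±8.5 figure quoted for GAIA.

```python
import math


def flip_points(n_tasks: int) -> float:
    """Accuracy change, in points, caused by one task flipping."""
    return 100.0 / n_tasks


def ci_half_width_points(accuracy_pct: float, n_tasks: int, z: float = 1.96) -> float:
    """Normal-approximation binomial CI half-width, in points."""
    p = accuracy_pct / 100.0
    return 100.0 * z * math.sqrt(p * (1.0 - p) / n_tasks)


def outside_noise(change_points: float, accuracy_pct: float, n_tasks: int) -> bool:
    """True when a change exceeds the CI half-width at the reference accuracy."""
    return abs(change_points) > ci_half_width_points(accuracy_pct, n_tasks)


print(round(flip_points(55), 1))                   # one flip on 55 tasks
print(round(ci_half_width_points(73.8, 103), 1))   # half-width at the GAIA peak
print(outside_noise(-24.3, 73.8, 103))             # the single-harness collapse
```

A change smaller than the half-width is not evidence of regression or improvement on its own; this check says nothing about held-out transfer, which still needs a frozen split.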
References
- Chen et al., HarnessX: A Composable, Adaptive, and Evolvable Agent Harness Foundry (arXiv 2606.14249): https://arxiv.org/abs/2606.14249
- Lee et al., Meta-Harness: End-to-End Optimization of Model Harnesses (arXiv 2603.28052): https://arxiv.org/abs/2603.28052
- Shao et al., DeepSeekMath: Pushing the Limits of Mathematical Reasoning (GRPO, arXiv 2402.03300): https://arxiv.org/abs/2402.03300
- Mialon et al., GAIA: a benchmark for General AI Assistants (arXiv 2311.12983): https://arxiv.org/abs/2311.12983
- Shridhar et al., ALFWorld: Aligning Text and Embodied Environments (arXiv 2010.03768): https://arxiv.org/abs/2010.03768
- Yao et al., WebShop: Towards Scalable Real-World Web Interaction (arXiv 2207.01206): https://arxiv.org/abs/2207.01206
- Jimenez et al., SWE-bench: Can Language Models Resolve Real-World GitHub Issues? (arXiv 2310.06770): https://arxiv.org/abs/2310.06770
Related: Harness architecture · Self-improving harnesses · Automated harness optimization · Skill optimization · Tools and function calling · Context and memory · Evaluating agents · Loop engineering · GRPO
Notes
1. HarnessX Sections 3.1-3.3: harness = (model config, harness config); processors with the five-outcome async protocol on eight hooks (Table 1 whitelists per-hook modifications; run-loop contract validation raises on violation); _singleton_group, _order (PRE/NORMAL/POST), and _after metadata; slots (tool registry, tracer, workspace, sandbox provider, plugins); nine behavioral dimensions D1-D9, with context assembly (D2) and tool ecosystem (D4) the most frequent evolution targets, observability (D8) the trace substrate, and the training bridge (D9) feeding co-evolution.
2. Sections 4.1-4.2: MDP over symbolic artifacts with states (harness, trace store), open-ended code-level edit actions, verifier-score rewards, and a deterministic acceptance operator enforcing the seesaw constraint; the mirror predicts reward hacking, catastrophic forgetting, and under-exploration, each mapped to a defense (Critic, gate, Planner).
3. Sections 4.3-4.4 and Appendix 9.4: one meta-agent (Claude Opus 4.6) drives Digester, Planner, Evolver, Critic with selective invocation; Digester compresses ~10M GAIA trace tokens per round to ~10K of summaries; Evolver proposes up to K=4 candidates per round with manifests and smoke tests; Critic allows one revision cycle; gate order: manifest completeness, configuration normalization, build/smoke tests, seesaw regression check; 15 rounds max, patience 3, ±5% noise threshold, 3 seeds, concurrency 10, pass@2; meta-agent budgets 100M-175M tokens per benchmark; round-0 harness is a handcrafted competent baseline.
4. Appendix 10.3, Table 9: manifest fields candidate_id, bucket, capability_evidence, file_changes, predicted_impact, attribution_signature; the Critic checks next-round trace features against the manifest's predicted mechanism and impact.
5. Table 4 (pass@2, peak accuracy, adaptation set): ALFWorld 83.6 to 94.8 (Sonnet 4.6), 76.9 to 97.8 (GPT-5.4), 53.0 to 97.0 (Qwen3.5-9B); WebShop 60.0 to 76.0, 55.0 to 73.0, 36.0 to 49.0; GAIA 73.8 to 83.5, 73.8 flat, 20.3 to 37.4; SWE-bench Verified 76.4 to 87.3, 45.5 to 63.6, 23.6 to 41.8; tau3-Bench averages +5.4, +14.5, +1.1 (93.5 near-ceiling baseline); average +14.5, 14/15 improved; per-domain: GPT-5.4 Telecom 67.5 to 93.0 (+25.4) at R2.
6. Sections 4.5, 6.3, 7.1: Global strategy on GAIA GPT-5.4 peaks 73.8 (R4), finishes 49.5 (-24.3 vs ±8.5 CI), 143.7M tokens; Ensemble variant isolation finishes 87.4 = peak (R14), 107.8M tokens, lifting the configuration from 0.0 to +13.6; split edits fork a new variant (pool capped, weakest retired), seesaw scoped per variant. Meta-agent ablation (Table 6): single-session CC SDK evolver reaches 86.4 vs AEGIS 87.4 (within one SE) but spends ~14% more tokens, attributed to the missing Digester compression.
7. Sections 5 and 6.5: shared FIFO replay buffer (4-round window; 824 traces on GAIA, 400 on WebShop); cross-harness GRPO groups trajectories by task id across harness versions, group-relative advantage, clipped objective, KL anchor to the fixed initial reference (coefficient 0 in the reported runs), behavior log-probs cached at insertion; lr 1e-6, clip 0.2, 5 steps per round; Qwen3.5-9B peaks GAIA 37.4 to 41.7 (+4.3) and WebShop 49.0 to 54.0 (+5.0), avg +4.7, with the gap persisting at final rounds; no rollouts are generated solely for training.
8. Section 6.6: GAIA/Sonnet R10 reward hacking (74.8 to 79.6 with partial format exploitation; guard shipped R12); tau3-Bench Telecom R7 forgetting (five same-type reminder edits R2-R6, then -14.0 in one round, self-corrected by R9); ALFWorld R4-R7 under-exploration (ship-prediction accuracy 80% to 0%; the lone structural edit calibrated at 1/7 predicted flips).
9. Section 7.5, Table 7: GAIA Global 143.7M tokens for 0.0 gain vs variant-isolation 107.8M for +13.6; ALFWorld Sonnet Global 43.4M for +11.2 in 7 rounds; evolved-harness per-task inference tokens fall ~25% on GAIA and rise ~60% on ALFWorld; GAIA evolution amortizes in ~1,300 invocations (~83K tokens saved per invocation).