Markdown

Agent harness architecture¶

Scope: the runtime that wraps a foundation model and turns its token output into reliable multi-step action. The harness owns context management, tool selection, error recovery, state, and external memory; the model only proposes. Past a model-capability floor, this layer drives more of an agent's reliability than the weights do. The foundations it drives are the agent loop, tools, context and memory, and planning; the control it needs is the orchestration control plane.

Code and configs here are reference templates; pin versions and validate before relying on them.

flowchart TB
  MODEL["Foundation model<br/>(proposes: text + tool calls)"]
  subgraph HARNESS["Harness (decides and executes)"]
    CTX["Context manager<br/>(budget, compaction, recitation)"]
    REG["Tool registry + dispatch"]
    ERR["Error recovery<br/>(retries, replan)"]
    STATE["Durable state<br/>(filesystem / store)"]
    MEM["External memory"]
  end
  MODEL -->|"tool call"| REG
  REG -->|"observation"| CTX
  CTX -->|"assembled prompt"| MODEL
  REG --> ERR
  ERR -->|"replan signal"| MODEL
  REG --> STATE
  STATE --> MEM
  MEM --> CTX

Overview¶

An agent is a model plus the program that runs it in a loop. The model is a stateless next-token predictor; everything that makes it behave like an agent (remembering earlier steps, calling a tool and reading the result, recovering from a failed action, staying inside a budget) lives in the surrounding program. That program is the harness.

The load-bearing claim of harness engineering: once the underlying model clears a capability floor for a task, the harness explains more of the performance variance than the model choice. The same model swung between failure and success by its harness is the common experience of teams shipping long-horizon, tool-using agents; the qualification is that below the floor no harness rescues weak reasoning, and on pure single-shot reasoning the harness barely matters.¹ The implication for the agent loop is that engineering effort moves from prompt wording to the loop, the tools, and the state around it. Increasingly the harness is written and optimized as code rather than prose, a shift surveyed under the banner of code as the agent harness and pushed furthest by designs where the program, not the model, owns control flow (self-improving harnesses).⁵

Core knowledge¶

What a harness controls¶

A harness is defined by five responsibilities. Get these right and a mid-tier model runs reliably; get them wrong and a frontier model still loops, forgets, or burns the budget.¹

Context management. What goes into each model call, in what order, and what gets evicted when the window fills. The dominant cost and the dominant failure source.
Tool selection. Which tools exist, how they are described, and how the harness constrains which can be called when.
Error recovery. What happens when a tool call fails, times out, or returns garbage: retry, replan, or abort.
State management. How progress survives across turns and restarts so the agent does not redo or forget work.
External memory. Durable storage (files, a store) the agent reads and writes beyond the context window.

Convergent designs¶

Production harnesses built independently have converged on the same moves, which is the strongest evidence the responsibilities above are real rather than stylistic.

Minimal, sharp tool sets. A small set of general tools (read, write, edit, run a command, search) plus an extension mechanism beats a large catalogue of narrow ones. Collapsing an over-broad tool surface to a couple of primitives has repeatedly raised task accuracy while cutting tokens and latency, because every extra tool is another decision the model can get wrong.¹
The filesystem as state and memory. Treating a working directory as the agent's externalised memory, writing intermediate artifacts to files and keeping restoration paths (URLs, file paths) in context rather than full contents, gives unbounded, inspectable state that survives compaction.²
Context as the optimisation target. Stable prompt prefixes and append-only context keep the KV cache warm, which is the difference between cheap and expensive turns at the high input:output ratios agents run; compaction preserves pointers, and summarization is the last resort, not the first.²
A plan kept in view. Re-stating the task and the current to-do list at the end of the context counters the lost-in-the-middle effect, where models attend least to the middle of a long prompt.³

Constraining the model with tool modes¶

The harness does not have to accept whatever the model emits. Decoding can be masked so that, per turn, tool calls are auto (model may call any tool or none), required (must call some tool), or specified (must call a named subset). Constraining the action space at the decoder removes a class of malformed or out-of-policy calls before they happen, which is cheaper and more reliable than catching them after.² This is the foundations-side complement to the policy gate in the control plane.

Harness components are not free¶

Adding a harness component can hurt. The governing rule from harness-ablation work: a component helps when it introduces a new, independent signal and regresses when it merely recycles the doer model's own signal. A verifier or candidate-selector that is the same model grading itself inherits the doer's blind spots and adds cost without lift; file-backed state and self-improvement loops that bring in outside information tend to help.⁴ The corollary is an audit discipline: ablate each component with the substrate held fixed (pinned model snapshot, pinned sandbox image, content-hashed eval set, seeded RNG, isolated egress), because otherwise a moving substrate masks which change actually moved the metric.⁴

Don't-miss checklist¶

Budget context deliberately: stable prefix, append-only history, compaction that keeps pointers, summarization last.
Keep the tool set small and general; justify every additional tool against the decision cost it adds.
Externalise state to the filesystem or a store; never rely on the context window as durable memory.
Make every tool failure a structured, replannable signal back to the model, not a silent drop.
Ablate harness components against a fixed substrate; delete any that recycle the model's own signal.
Keep a re-stated plan or to-do list in view to fight lost-in-the-middle.

Failure modes¶

Context rot. The window fills with stale tool output; the model loses the thread and repeats or contradicts earlier steps. Compaction and recitation are the fix.
Tool sprawl. Too many narrow tools; the model picks the wrong one or stalls choosing. Symptoms are low accuracy with high token use.
Cache-hostile prompts. Re-ordering or rewriting the prefix each turn evicts the KV cache and multiplies cost at agentic input:output ratios.
Silent error swallowing. A failed tool call returns nothing useful; the model proceeds on a false premise. Surface the error as an observation.
Same-model verification. A self-grading verifier recycles the doer's mistakes, adding latency and cost without catching anything.
State in the prompt. Progress kept only in context is lost on compaction or restart; the agent redoes or abandons work.

Open questions & validation¶

Where the capability floor sits for a given task class, below which harness work does not pay off.
Which harness components survive a fixed-substrate ablation on the target workload, since the answer is task-dependent.
How to measure context quality directly rather than inferring it from end-task success.

References¶

Anthropic, Building effective agents: https://www.anthropic.com/research/building-effective-agents
Manus, Context engineering for AI agents: https://manus.im/blog/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus
Liu et al., Lost in the Middle: How Language Models Use Long Contexts: https://arxiv.org/abs/2307.03172
Yao et al., ReAct: Synergizing Reasoning and Acting in Language Models: https://arxiv.org/abs/2210.03629
SWE-bench (long-horizon coding agent benchmark): https://arxiv.org/abs/2310.06770
OSWorld (computer-use agent benchmark): https://arxiv.org/abs/2404.07972
NLAH (harness-component ablation against a fixed substrate): https://arxiv.org/abs/2603.25723
APEX-Agents (long-horizon agent benchmark): https://arxiv.org/abs/2601.14242
Model Context Protocol: https://modelcontextprotocol.io/
Code as Agent Harness (survey of code-as-harness): https://arxiv.org/abs/2605.18747

The harness owns context management, tool selection, error recovery, state, and external memory; once a model clears a per-task capability floor, this layer explains more reliability variance than the model choice, most strongly for long-horizon tool-using tasks and least for single-shot reasoning. ↩↩↩
Manus, Context engineering: filesystem as externalised memory, stable KV-cache-friendly prefixes with append-only context, compaction that preserves restoration paths before summarization, and logit-masked tool modes (auto / required / specified). ↩↩↩
Liu et al. show models attend least to the middle of a long context, motivating end-of-context recitation of the task and plan. ↩
Harness-ablation finding: components that add an independent signal help; components that recycle the doer model's own signal (e.g. same-model verifiers and selectors) regress. Valid ablation requires a fixed substrate: pinned model snapshot and sandbox image, content-hashed eval set, seeded RNG, isolated egress. ↩↩
Code as Agent Harness (arXiv 2605.18747), survey: the harness is increasingly code, organized into the harness interface (to reasoning and environment), harness mechanisms (planning and adaptive control), and multi-agent scaling over shared code artifacts. The optimization of that code is covered in self-improving harnesses. ↩