Context and memory¶
Scope: deciding what an agent sees on each model call and what it remembers across calls. The context window is a working set, not storage, and managing it well is the difference between a coherent agent and one that forgets, repeats, or burns tokens. This page covers context engineering, separating stored history from the presented prompt, reducing context without losing the thread, and short- and long-term memory. It is the foundations side of the harness context responsibility and feeds the agent loop.
Code here is illustrative reference; validate against your own stack before relying on it.
flowchart TB
EVENTS["Stored history (immutable events)"] --> PROJ["Project per call"]
PROJ --> MEASURE["Measure tokens"]
MEASURE -->|"under budget"| SEND["Send (cache-friendly)"]
MEASURE -->|"over budget"| COMPACT["Compaction (deterministic: stub re-fetchable results)"]
COMPACT -->|"still over"| SUMM["Summarization (lossy, last resort)"]
COMPACT --> SEND
SUMM --> SEND
LTM["Long-term memory (vector store)"] -->|"retrieve relevant"| PROJ
Overview¶
A model can only be as smart as the context it is given, so the discipline of an agent is largely context engineering: providing the information the model needs at the right time and in the right form.1 Bigger context is not better. Models attend least to the middle of a long prompt (the lost-in-the-middle effect), and quality degrades as the window fills (often called context rot), so the optimal working context is much smaller than the maximum window.2 The job is to keep the working set small, relevant, and ordered.
Core knowledge¶
Five moves¶
Context engineering is a small set of operations on the working set: generate the right prompt, retrieve external knowledge into it, write information out to durable storage, reduce what is present when it grows, and isolate sub-tasks into their own context. Every technique below is one of these moves applied at a particular point in the loop.1
Separate storage from presentation¶
Keep the run's history as an immutable, append-only record of events, and treat the prompt sent to the model as a re-computable projection of that record, optimised per call. Editing the projection (dropping, reordering, compacting) never corrupts the ground truth, so you can experiment with what to present without losing what happened. This separation is where context engineering happens.3 It also keeps the observability trace faithful, because the raw events survive whatever the projection does.
Reduce hierarchically, cheapest first¶
When the projection exceeds the budget, reduce in order of increasing loss:4
- Measure the token count; if it is under threshold, do nothing, which also preserves the stable prefix the KV cache depends on.
- Compact deterministically: replace content the agent can fetch again (an already-read file, prior search results, large tool arguments) with a short reference stub, so nothing is truly lost.
- Summarize semantically only as a last resort: an LLM rewrites a span into a shorter form. This is lossy, so it is the final step, not the first.
Doing this in order keeps cost down and avoids discarding the exact tokens a later step or evaluation needs. Vector-search compression of a large tool result can be dramatic; replacing a full web page with the few relevant chunks has cut a result from tens of thousands of tokens to a few hundred.4
Sessions: memory across calls¶
A session persists the event history across separate runs keyed by a session id, which is what makes an agent multi-turn rather than amnesiac. The same mechanism enables asynchronous human-in-the-loop: the agent can pause on a sensitive tool, return the pending call, and resume later when a human approves, because the session holds the state in between.5
Long-term memory¶
Beyond a single session, durable memory is a write-and-retrieve loop: the agent extracts a structured memory from its history, checks it against existing memories to avoid duplicates, and stores it in a vector store; later it retrieves relevant memories either through an explicit recall tool or by automatic injection before each call. This is the operating-system view of memory, where a small fast context is backed by a larger store the agent pages in and out.6 Retrieval-augmented generation is the same idea applied to external knowledge: fetch the relevant passages into the context rather than hoping they are in the weights.7
Don't-miss checklist¶
- Treat the context window as a working set; keep it small and relevant, not maximal.
- Store history immutably; send a per-call projection you can compact freely.
- Reduce in order: measure, then compact deterministically, then summarize only if forced.
- Preserve a stable prefix so the KV cache stays warm (harness architecture).
- Use sessions for multi-turn state and human-in-the-loop pause and resume.
- Page durable knowledge in and out of a vector store rather than stuffing it into the prompt.
Failure modes¶
- Window stuffing. The prompt is maximised, not curated; the model loses the middle and degrades.
- Mutating the record. Compaction edits the ground truth, so the trace and any replay are corrupted.
- Summarize-first. Lossy summarization is applied before deterministic compaction, discarding recoverable detail.
- Cache-hostile edits. Rewriting the prefix each turn evicts the KV cache and multiplies cost.
- No long-term memory. The agent re-derives the same facts every session.
Open questions & validation¶
- A direct measure of context quality is missing; validate reduction against end-task success and trace fidelity.
- Retrieval quality bounds memory usefulness; measure recall of the store on real queries.
- When automatic memory injection helps versus distracts is task-dependent; A/B it.
References¶
- Liu et al., Lost in the Middle: How Language Models Use Long Contexts: https://arxiv.org/abs/2307.03172
- Packer et al., MemGPT: Towards LLMs as Operating Systems: https://arxiv.org/abs/2310.08560
- Lewis et al., Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks: https://arxiv.org/abs/2005.11401
- Manus, Context engineering for AI agents: https://manus.im/blog/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus
- Song & Hur, Build an AI Agent (From Scratch), Manning Publications (MEAP), 2026.
Related: Harness architecture · The agent loop · Tools & function calling · Planning & reasoning · Agent observability · Agentic systems
-
Song & Hur (Ch 1): an LLM can only be as smart as the context it is given; context engineering provides the right information at the right time in the right form, via five moves (generation, retrieval, write, reduce, isolate). ↩↩
-
Liu et al. show models attend least to the middle of long contexts; combined with context rot, the optimal working set is far smaller than the maximum window. ↩
-
Song & Hur (Ch 6): keep events as immutable ground truth and the prompt as a re-computable projection; this separation is where context engineering happens. ↩
-
Song & Hur (Ch 6): reduce hierarchically, measuring tokens (preserve the cache prefix), compacting deterministically by stubbing re-fetchable content, then summarizing semantically as a lossy last resort. ↩↩
-
Song & Hur, Build an AI Agent (From Scratch), Manning MEAP, 2026. Sessions persist events across runs and enable asynchronous human-in-the-loop; long-term memory extracts, deduplicates, stores, and retrieves structured memories from a vector store. ↩
-
Packer et al., MemGPT: an operating-system view of memory where a small context is backed by a larger store paged in and out. ↩
-
Lewis et al., Retrieval-Augmented Generation: retrieve relevant passages into the context instead of relying on parametric memory. ↩