Agentic loop economics¶

Scope: the cost and latency of running a model in a loop, dominated by whether the large, mostly-static prefix the agent loop re-sends every turn hits the model's KV cache. This page covers prefill-versus-decode economics, the prompt-stability hierarchy that keeps a cache warm, and the harness and context choices that quietly destroy it.

Code and configs here are reference templates; pin versions and validate before relying on them.

flowchart LR
  TURN["Each turn re-sends the prefix<br/>(system + tools + history)"] --> CHECK{"Prefix byte-identical<br/>to a cached one?"}
  CHECK -->|"hit"| SKIP["Reuse KV cache, skip prefill<br/>(cheap, low TTFT)"]
  CHECK -->|"miss"| RECOMPUTE["Recompute prefill: O(N^2)<br/>(expensive, high TTFT)"]
  SKIP --> DECODE["Decode new tokens<br/>(memory-bandwidth bound)"]
  RECOMPUTE --> DECODE

Overview¶

An agent is a loop, and a loop re-sends almost the same prompt over and over: the system prompt, the tool definitions, and a growing history, with only a few new tokens appended each turn. The single largest lever on what that loop costs and how fast it responds is whether that repeated prefix is recomputed or reused. Treating the prefix as a cache key, rather than as text the model happens to re-read, is the difference between a cheap agent and an expensive one.

Core knowledge¶

Prefill versus decode¶

A model call has two phases with different cost shapes. Prefill processes the input prompt and is compute-bound; its attention work grows with the square of the prompt length, and it sets the time-to-first-token. Decode generates the output one token at a time and is memory-bandwidth-bound; it sets the tokens-per-second. For an agent with a short answer and a huge prompt, prefill dominates both the bill and the latency, which is why the prompt, not the completion, is the thing to optimise.

The KV cache and byte-identical reuse¶

The key and value vectors a model computes for a position during prefill depend only on the tokens at and before that position, their positions, and the frozen weights. They do not depend on anything later in the sequence. So if two requests share an identical leading span, the K/V for that span is identical and can be reused: a cache hit skips prefill for the shared prefix entirely. The catch is that the match must be exact from position zero, and byte-for-byte identical text is the practical way to guarantee identical tokens. A single changed character near the start (a new timestamp, a reordered tool, a trimmed space) changes the tokens at that point, and because every later position attends to everything before it, the cache is invalid from there on.
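A minimal sketch of the reuse rule described above: only the leading span on which two requests agree is reusable. The one-token-per-byte `toy_tokenize` is a stand-in assumption, not a real tokenizer, and the two headers are invented for illustration.

```python
from typing import Sequence


def shared_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading items on which a and b agree."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def toy_tokenize(text: str) -> list[int]:
    # Assumption: one token per UTF-8 byte, purely for illustration.
    return list(text.encode("utf-8"))


# Two headers that differ only in an embedded timestamp near the start.
header_a = "You are a helpful agent. Time: 09:00. Tools: search, read."
header_b = "You are a helpful agent. Time: 09:01. Tools: search, read."

tokens_a = toy_tokenize(header_a)
tokens_b = toy_tokenize(header_b)
reusable = shared_prefix_len(tokens_a, tokens_b)
print(f"{reusable} of {len(tokens_b)} tokens are reusable")
```

Everything after the changed digit is recomputed, even though the text after it is identical in both headers.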

Why the loop makes caching the dominant lever¶

Because an agent re-sends a large, near-static prefix on every turn, prefix caching is the dominant cost and latency control for agentic workloads specifically. The evaluation in "Don't Break the Cache" measures this on real agent runs: across providers, over 500 long-horizon agent sessions with 10K-token system prompts, prompt caching cut API cost by 41 to 80 percent and time-to-first-token by 13 to 31 percent.¹ The arithmetic shows why it compounds: a 10K-token header re-sent over roughly 40 turns across roughly 50K sessions a day is about 20 billion input tokens a day; caching that header can cut the input bill by close to an order of magnitude (illustrative arithmetic, not a vendor quote).
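The arithmetic above can be checked in a few lines. The header size, turn count, and session count come from the text; the cached-read fraction of 0.10 is a placeholder assumption, not any provider's price, and the sketch ignores any cache-write premium and assumes only the first turn of each session misses.

```python
HEADER_TOKENS = 10_000
TURNS_PER_SESSION = 40
SESSIONS_PER_DAY = 50_000
CACHED_READ_FRACTION = 0.10  # assumption: placeholder, not a vendor quote

header_tokens_per_day = HEADER_TOKENS * TURNS_PER_SESSION * SESSIONS_PER_DAY


def header_cost_units(cached_read_fraction: float) -> float:
    """Daily header cost in 'uncached input token' units.

    Assumes the first turn of each session is a miss and every
    later turn in that session is a hit.
    """
    cold = HEADER_TOKENS * SESSIONS_PER_DAY
    warm = HEADER_TOKENS * (TURNS_PER_SESSION - 1) * SESSIONS_PER_DAY
    return cold + warm * cached_read_fraction


uncached = float(header_tokens_per_day)
cached = header_cost_units(CACHED_READ_FRACTION)
print(f"header tokens per day: {header_tokens_per_day:,}")
print(f"header cost reduction: {uncached / cached:.1f}x")
```

Under these assumptions the reduction applies to the header portion of the input bill only; growing history and output tokens are priced separately.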

The prompt-stability hierarchy¶

Order the prompt so the stable parts come first and the volatile parts come last, then never rewrite the stable parts. A useful four-layer hierarchy, most stable first:

L1: system prompt and tool definitions. Changes rarely; should be byte-identical across a whole deployment.
L2: project or task context. Stable within a task.
L3: session. Stable within a conversation.
L4: per-turn messages. The only part that should change each turn.

Keep L1 to L3 byte-identical within their scope (deployment, task, and conversation respectively) and append new content at the end, so each turn extends a cached prefix instead of invalidating it.
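One way to enforce this ordering is to assemble the prompt from the layers in a fixed order with deterministic serialization. This is a sketch under stated assumptions: `PromptLayers`, `canonical_json`, and the tool dictionaries with a `name` key are hypothetical names for illustration, not any framework's API.

```python
import json
from dataclasses import dataclass, field


def canonical_json(obj) -> str:
    # Sorted keys and fixed separators so equal data always yields equal bytes.
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


@dataclass
class PromptLayers:
    system_prompt: str                  # L1
    tools: list[dict]                   # L1 (assumed to carry a "name" key)
    task_context: str                   # L2
    session_history: list[str] = field(default_factory=list)  # L3

    def stable_prefix(self) -> str:
        # Sort tools by name so registration order cannot change the bytes.
        tools_sorted = sorted(self.tools, key=lambda t: t["name"])
        parts = [self.system_prompt, canonical_json(tools_sorted), self.task_context]
        parts.extend(self.session_history)
        return "\n".join(parts)

    def render(self, turn_message: str) -> str:
        # L4 is always appended last.
        return self.stable_prefix() + "\n" + turn_message

    def commit(self, turn_message: str) -> None:
        # History only ever grows at the end; earlier entries are never rewritten.
        self.session_history.append(turn_message)


TOOLS = [
    {"name": "search", "description": "Search the web"},
    {"name": "read", "description": "Read a file"},
]
layers = PromptLayers("You are an agent.", TOOLS, "Task: summarise the repo")
first = layers.render("user: list files")
layers.commit("user: list files")
second = layers.render("user: open README")
print(second.startswith(first))
```

The property worth testing in a real harness is the one printed here: each turn's prompt starts with the previous turn's prompt in full.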

Engine mechanisms¶

Inference engines implement prefix reuse so callers get it without managing cache keys by hand. vLLM Automatic Prefix Caching hashes the prompt at block granularity (with PagedAttention managing the blocks) and reuses any matching leading blocks. SGLang RadixAttention stores shared prefixes once in a radix tree, which also helps branching or forked agent conversations that share a common head. Separately, DeepSeek Multi-head Latent Attention compresses the KV cache into a smaller latent representation,² which lowers the memory cost of holding cached prefixes; treat it as one approach to KV size, not a universal speedup.
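To make block-granularity hashing concrete, here is a toy model of the idea: each full block's hash covers its own tokens plus the hash of the block before it, so a block only matches when the entire prefix up to it matches. This is a simplified illustration, not vLLM's implementation; `ToyPrefixCache` and the block size of 4 are assumptions chosen for readability.

```python
import hashlib
from typing import Sequence

BLOCK_SIZE = 4  # assumption: tiny block size for illustration


def block_hashes(tokens: Sequence[int]) -> list[str]:
    """Chained hashes for each full block; a partial trailing block is skipped."""
    hashes: list[str] = []
    parent = ""
    for i in range(len(tokens) // BLOCK_SIZE):
        block = tokens[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
        payload = parent + "|" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        hashes.append(parent)
    return hashes


class ToyPrefixCache:
    def __init__(self) -> None:
        self._blocks: set[str] = set()

    def lookup_and_insert(self, tokens: Sequence[int]) -> int:
        """Return how many leading tokens hit the cache, then cache all full blocks."""
        hashes = block_hashes(tokens)
        hit_blocks = 0
        for h in hashes:
            if h not in self._blocks:
                break
            hit_blocks += 1
        self._blocks.update(hashes)
        return hit_blocks * BLOCK_SIZE


cache = ToyPrefixCache()
turn1 = list(range(10))
turn2 = turn1 + [100, 101, 102]   # appended content
mutated = [99] + turn1[1:]        # first token changed

cold_hits = cache.lookup_and_insert(turn1)
warm_hits = cache.lookup_and_insert(turn2)
busted_hits = cache.lookup_and_insert(mutated)
print(cold_hits, warm_hits, busted_hits)
```

Appending extends the cached chain, while changing the first token changes every chained hash and leaves nothing to reuse.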

Vendor caching contracts¶

The hosted APIs (Anthropic, OpenAI, Google, DeepSeek) all bill cached prefix reads at a steep discount relative to uncached input tokens, and several reduce latency on a hit as well. The exact discounts, minimum cacheable lengths, and cache lifetimes differ by provider and change often, and only some providers charge a premium for writing or storing a cache entry. As of 2026, verify the current terms on the provider's pricing page rather than hard-coding a percentage, and design for the general shape (cheap cached reads, possibly a small write or storage premium) rather than a specific number.
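Designing for the general shape rather than a number can be done by keeping the rates as parameters. The sketch below expresses cost relative to sending the prefix uncached every time; `read_multiplier` and `write_multiplier` are hypothetical parameter names, the values in the example are placeholders rather than any provider's pricing, and the model assumes the cache entry stays alive between uses.

```python
import math


def caching_cost_ratio(uses: int, read_multiplier: float, write_multiplier: float) -> float:
    """Cost of sending a prefix `uses` times with caching, relative to no caching.

    The first use pays the write multiplier; each later use pays the read multiplier.
    """
    if uses < 1:
        raise ValueError("uses must be at least 1")
    cached_cost = write_multiplier + (uses - 1) * read_multiplier
    return cached_cost / uses


def break_even_uses(read_multiplier: float, write_multiplier: float) -> int:
    """Smallest number of uses at which caching is strictly cheaper than no caching."""
    if not 0.0 <= read_multiplier < 1.0:
        raise ValueError("read_multiplier must be in [0, 1)")
    # write + (n - 1) * read < n   <=>   n > (write - read) / (1 - read)
    threshold = (write_multiplier - read_multiplier) / (1.0 - read_multiplier)
    return max(1, math.floor(threshold) + 1)


# Placeholder rates: cached reads at 10% of base, writes at 125% of base.
ratio_40_turns = caching_cost_ratio(40, read_multiplier=0.10, write_multiplier=1.25)
first_paying_use = break_even_uses(read_multiplier=0.10, write_multiplier=1.25)
print(f"relative cost over 40 turns: {ratio_40_turns:.3f}")
print(f"caching pays off from use number {first_paying_use}")
```

Swapping in a provider's current rates, and a zero write premium where none applies, keeps the same calculation valid as terms change.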

Don't-miss checklist¶

Put the most stable content (system prompt, tool definitions) first; append volatile content last.
Keep the stable layers byte-identical across turns; a one-character change busts the cache.
Prefer appending, or truncation that leaves the leading span intact, over summarization mid-session, since summarizing rewrites the prefix.
Do not switch models mid-session on a path you want kept warm; the new model starts cold.
Treat tool-set changes, including an MCP server reconnect, as cache-busting; stabilise tool definitions.
Measure cache hit rate and time-to-first-token, not just token counts.
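The last item in the checklist can be sketched as a small aggregation over per-turn records. `TurnRecord` and its field names are hypothetical; providers report cached and uncached token counts under different names and with different conventions, so mapping a real usage payload into this shape is left as an assumption.

```python
from dataclasses import dataclass
from statistics import median
from typing import Sequence


@dataclass(frozen=True)
class TurnRecord:
    input_tokens: int      # total prompt tokens sent this turn (assumed to include cached ones)
    cached_tokens: int     # how many of those were served from cache
    ttft_seconds: float    # measured time-to-first-token


def cache_hit_rate(turns: Sequence[TurnRecord]) -> float:
    """Token-weighted hit rate: cached prompt tokens over all prompt tokens."""
    total = sum(t.input_tokens for t in turns)
    if total == 0:
        return 0.0
    return sum(t.cached_tokens for t in turns) / total


def median_ttft(turns: Sequence[TurnRecord]) -> float:
    return median(t.ttft_seconds for t in turns)


# Invented sample: a cold first turn followed by two warm turns.
turns = [
    TurnRecord(input_tokens=10_000, cached_tokens=0, ttft_seconds=2.0),
    TurnRecord(input_tokens=10_500, cached_tokens=10_000, ttft_seconds=0.6),
    TurnRecord(input_tokens=11_000, cached_tokens=10_500, ttft_seconds=0.7),
]
print(f"hit rate: {cache_hit_rate(turns):.2%}, median TTFT: {median_ttft(turns):.2f}s")
```

Weighting by tokens rather than by turn keeps a single large cold prefill from being hidden behind many small warm turns.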

Failure modes¶

Tool-set mutation mid-session. Adding or removing a tool rewrites the definitions block, so the next turn pays a cold prefill for everything from the definitions onward, and each further mutation pays it again.
Mid-session model switch. The new model has no cache for the old prefix; the next turn recomputes from scratch.
Summarize-instead-of-append. Replacing history with a summary rewrites the prefix and discards the cache; truncation that preserves the leading span does not.
MCP crash-reconnect. A reconnect that re-emits tool definitions in a different order or form changes the prefix and loses the cache.
Hidden non-determinism. A timestamp, a reordered JSON key, or trailing whitespace in the prefix causes silent misses that look like the cache simply not working.
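Silent misses of this kind are easier to find by comparing consecutive prompts directly. The following is a minimal debugging sketch, assuming the harness can log the full prompt string for each turn; `first_divergence` and `describe_divergence` are hypothetical helper names.

```python
from typing import Optional


def first_divergence(prev: str, curr: str) -> Optional[int]:
    """Byte offset where curr stops extending prev, or None if curr extends prev."""
    a = prev.encode("utf-8")
    b = curr.encode("utf-8")
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(b) < len(a):
        return len(b)  # curr is a strict truncation of prev
    return None


def describe_divergence(prev: str, curr: str, context: int = 20) -> str:
    """Human-readable report showing the bytes around the first mismatch."""
    offset = first_divergence(prev, curr)
    if offset is None:
        return "ok: current prompt extends the previous one"
    a = prev.encode("utf-8")
    b = curr.encode("utf-8")
    lo = max(0, offset - context)
    hi = offset + context
    return f"diverges at byte {offset}: {a[lo:hi]!r} -> {b[lo:hi]!r}"


# A reordered JSON key is enough to break the prefix.
print(describe_divergence('{"a":1,"b":2}', '{"b":2,"a":1}'))
```

Running this check between turns in a test harness turns an invisible cost regression into a failing assertion with the offending bytes in the message.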

Open questions & validation¶

Cache hit rate is the metric that predicts cost; measure it on real sessions rather than inferring it from token counts.
Provider caching terms move quickly; re-check pricing, minimum lengths, and lifetimes before relying on any figure.
Cross-session and on-disk caching trade off storage against reuse; validate them against the actual prefix-reuse pattern of the workload.

References¶

Lumer et al., Don't Break the Cache: An Evaluation of Prompt Caching for Long-Horizon Agentic Tasks: https://arxiv.org/abs/2601.06007
DeepSeek-V2 (Multi-head Latent Attention, KV-cache compression): https://arxiv.org/abs/2405.04434
vLLM Automatic Prefix Caching: https://docs.vllm.ai/en/latest/
Anthropic prompt caching: https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching
OpenAI prompt caching: https://platform.openai.com/docs/guides/prompt-caching

¹ Lumer et al., Don't Break the Cache, report that prompt caching cut API cost 41 to 80 percent and time-to-first-token 13 to 31 percent across providers over more than 500 long-horizon agent sessions with 10K-token system prompts.
² DeepSeek Multi-head Latent Attention compresses the per-token KV cache into a latent vector, reducing the memory needed to hold cached context; the cache-size reduction is the claim, not a fixed end-to-end speedup.