Prompt caching: the provider API contract

Scope: prompt caching as a billing and latency contract on hosted LLM APIs, covering Anthropic cache_control breakpoints and TTLs, OpenAI automatic caching and prompt_cache_key, Gemini implicit caching, the write-premium versus 0.1x-read economics, the invalidation hierarchy, and the harness discipline that keeps a cache warm. The engine mechanism underneath (PagedAttention block reuse, RadixAttention) is covered in KV cache management; why caching dominates agent cost is argued in agentic loop economics. This page is the operational contract you code against.

Code blocks below come in two kinds. The Python API snippets are reference templates on the real vendor SDKs (not executed here: no API key, and the calls bill money); pin SDK versions and validate against the provider's current docs before rollout. The plain-python blocks are runnable, self-checking validations of the core math and mechanism the page teaches (break-even arithmetic, TTL-refresh accounting, prefix hashing with tiered invalidation); each runs on a stock python3 and asserts its result, including adversarial cases (a cache write that never pays for itself, a prefix below the model minimum, a timestamp in the system prompt, non-deterministic JSON, the 20-block lookback wall).

What it is

Prompt caching lets a hosted API reuse the attention KV state of a request's leading span instead of recomputing prefill for it. The provider hashes the exact rendered bytes of the prompt prefix; a later request whose prefix matches byte-for-byte skips prefill for that span, bills it at a deep discount, and cuts time-to-first-token. It is the managed-API surface of the same prefix reuse that vLLM and SGLang implement engine-side (KV cache management), with two additions only a vendor can impose: a price schedule and a lifetime.

The three major providers expose it differently:

  • Anthropic caches explicitly. You mark up to 4 breakpoints per request with cache_control: {"type": "ephemeral"} on content blocks; the prefix up to each breakpoint is cached. Default TTL is 5 minutes, refreshed at no cost every time the entry is used; "ttl": "1h" buys a 1-hour lifetime at a higher write price. The prompt renders in a fixed order, tools then system then messages, and that order defines what a prefix is. [1]
  • OpenAI caches automatically for prompts of 1,024 tokens or longer, with no code change and no extra fee. Cached prefixes persist through roughly 5 to 10 minutes of inactivity (up to an hour); some models offer extended retention up to 24 hours. The optional prompt_cache_key parameter steers requests that share a prefix onto the same cache shard. [2]
  • Gemini enables implicit caching by default on Gemini 2.5 and newer models, with per-model minimum prefix sizes (2,048 tokens on the 2.5 generation, 4,096 on newer ones) and automatic pass-through of the savings; a separate explicit-caching API (cachedContents, billed as storage per token-hour) exists on the standard API but is not carried into every newer surface, so verify support before designing around it. [3]

In every case the contract is the same shape: a discounted read price for the matched prefix, a bounded lifetime, and an exact-bytes match requirement that makes prompt construction, not model choice, the thing you engineer.

Why use it

An agent loop re-sends a large, near-static prefix (system prompt, tool definitions, growing history) on every turn, so prefill of repeated tokens dominates both the bill and the latency. Measured across providers on 500+ long-horizon agent sessions with 10K-token system prompts, prompt caching cut API cost by 41 to 80 percent and time-to-first-token by 13 to 31 percent. [4] The mechanism-level argument is in agentic loop economics; what this page adds is the exact arithmetic of the price schedule.

On Anthropic's schedule a cache read bills 0.1x the base input price, a 5-minute-TTL write bills 1.25x, and a 1-hour-TTL write bills 2x (for Claude Opus 4.8 at $5/MTok base: $0.50, $6.25, and $10 per MTok respectively). [1] Three consequences fall straight out of those multipliers, and the block below asserts all of them:

  • Break-even is fast but not free. With the 5-minute TTL the second request already wins (1.25 + 0.1 = 1.35x versus 2x uncached); with the 1-hour TTL the second request still loses (2.1x versus 2x) and the third is the first win.
  • A write that is never read back is a pure loss, 1.25x the price of just not caching. Caching near-unique prompts costs money.
  • TTL choice is a traffic-shape decision. Requests 7 minutes apart miss a 5-minute TTL every time (every request pays the write premium: caching loses), while the 1-hour TTL takes one write and then reads. The refresh-on-use rule means any inter-arrival gap under the TTL keeps the entry alive indefinitely.
# Runnable on system python3 (stdlib). Core algorithm: prompt-cache billing. A cache
# read bills 0.1x the base input price, a cache write bills 1.25x (5-minute TTL) or
# 2x (1-hour TTL), and the TTL refreshes free on every use. The block reproduces the
# vendor break-even points, shows when each TTL wins, and asserts the two silent
# money-losers: a write that is never read back, and a prefix below the model minimum.

READ, WRITE_5M, WRITE_1H = 0.10, 1.25, 2.00
TTL_S = {"5m": 5 * 60, "1h": 60 * 60}


def loop_cost(times_s: list[float], prefix_tokens: int, ttl: str,
              min_cacheable: int = 1024) -> tuple[float, int]:
    """Billed cost (units of base-price x tokens) of re-sending one cached prefix at
    the given request times, with refresh-on-use TTL. Returns (cost, cache_writes)."""
    assert prefix_tokens > 0 and ttl in TTL_S, "positive prefix and a known TTL required"
    write = WRITE_5M if ttl == "5m" else WRITE_1H
    cost, writes, expires = 0.0, 0, float("-inf")
    for t in sorted(times_s):
        if prefix_tokens < min_cacheable:      # below the model minimum: silently uncached
            cost += float(prefix_tokens)
            continue
        if t < expires:
            cost += READ * prefix_tokens       # hit: 0.1x, and the TTL refreshes free
        else:
            cost += write * prefix_tokens      # miss: write premium at the breakpoint
            writes += 1
        expires = t + TTL_S[ttl]
    return cost, writes


P = 10_000                                      # a 10k-token agent header

# 1. The published dollar figures. Claude Opus 4.8 base input is $5/MTok, so the docs'
#    table ($6.25 write-5m, $10 write-1h, $0.50 read per MTok) is exactly the multipliers.
assert (WRITE_5M * 5, WRITE_1H * 5, READ * 5) == (6.25, 10.0, 0.5)

# 2. Break-even. 5-minute TTL: request 2 already wins (1.25 + 0.1 = 1.35 < 2 uncached).
#    1-hour TTL: request 2 still loses (2.1 > 2); request 3 is the first win (2.2 < 3).
two_5m, _ = loop_cost([0, 60], P, "5m")
assert two_5m == 1.35 * P and two_5m < 2 * P
two_1h, _ = loop_cost([0, 60], P, "1h")
three_1h, _ = loop_cost([0, 60, 120], P, "1h")
assert two_1h == 2.1 * P and two_1h > 2 * P
assert three_1h == 2.2 * P and three_1h < 3 * P

# 3. Adversarial: a write with zero later reads costs 1.25x the uncached price. Caching
#    a prefix that never repeats is a pure loss, not a no-op.
one, writes = loop_cost([0], P, "5m")
assert one == 1.25 * P > P and writes == 1

# 4. The agent-loop regime the feature exists for: 40 turns, 30 s apart. Every turn
#    lands inside the refreshed 5-minute TTL, so one write serves 39 reads and the
#    prefix bill drops ~7.8x versus re-sending it uncached each turn.
session, writes = loop_cost([i * 30 for i in range(40)], P, "5m")
assert writes == 1 and session == (1.25 + 39 * 0.10) * P
assert 40 * P / session > 7.7

# 5. TTL choice is a traffic-shape decision. Requests 7 minutes apart miss a 5-minute
#    TTL every time (five 1.25x writes: caching LOSES money), while a 1-hour TTL
#    takes one write and four reads and beats not caching at all.
gap7 = [i * 7 * 60 for i in range(5)]
cost_5m, w5 = loop_cost(gap7, P, "5m")
cost_1h, w1 = loop_cost(gap7, P, "1h")
assert w5 == 5 and cost_5m == 6.25 * P > 5 * P    # 5m TTL: worse than uncached
assert w1 == 1 and cost_1h == 2.4 * P < 5 * P     # 1h TTL: 2x+ cheaper than uncached
assert cost_1h < cost_5m

# 6. Adversarial: an 800-token prefix is under the (model-dependent) minimum, so the
#    marker is silently ignored: zero writes, and the bill equals plain uncached input.
small, writes = loop_cost([0, 60, 120], 800, "5m")
assert writes == 0 and small == 3 * 800

# 7. Adversarial: out-of-order timestamps must not corrupt the TTL accounting (the
#    function sorts), and a non-positive prefix must raise, not bill zero.
shuffled, w = loop_cost([120, 0, 60], P, "5m")
assert (shuffled, w) == loop_cost([0, 60, 120], P, "5m")
raised = False
try:
    loop_cost([0], 0, "5m")
except AssertionError:
    raised = True
assert raised, "zero-token prefix must raise"

print("econ OK:", f"session={session/P:.2f}x vs 40x uncached;",
      f"7min-gap: 5m={cost_5m/P:.2f}x 1h={cost_1h/P:.2f}x uncached=5x")
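The break-even points asserted above also follow from one inequality: a cached prefix sent n times costs w + (n - 1) r in units of the base price, against n uncached, so caching first wins at the smallest integer n > (w - r) / (1 - r). The sketch below is a minimal illustration under the multipliers quoted on this page; first_winning_request is a hypothetical helper name, not part of any SDK.

```python
import math
from fractions import Fraction


def first_winning_request(write: Fraction, read: Fraction) -> int:
    """Smallest n for which one write plus n - 1 reads costs less than n uncached sends.

    Solves write + (n - 1) * read < n for integer n, i.e. n > (write - read) / (1 - read).
    """
    if not (0 <= read < 1 <= write):
        raise ValueError("expected 0 <= read < 1 <= write")
    threshold = (write - read) / (1 - read)
    return math.floor(threshold) + 1      # strict inequality: an exact tie is not a win


READ = Fraction(1, 10)                                # 0.1x read multiplier
n_5m = first_winning_request(Fraction(5, 4), READ)    # 1.25x write (5-minute TTL)
n_1h = first_winning_request(Fraction(2), READ)       # 2x write (1-hour TTL)
print(n_5m, n_1h)
```

Exact fractions keep the boundary free of float rounding. With these multipliers the thresholds are 23/18 and 19/9, so the first winning request is the second on the 5-minute TTL and the third on the 1-hour TTL.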

When to use it (and when not)

  • Cache whenever a prefix of at least the model minimum repeats within the TTL: agent loops, multi-turn chat, RAG over a shared document set, few-shot templates, batch jobs asking many questions of one context. On OpenAI and Gemini this is automatic; on Anthropic it is one marker on the last stable block.
  • Do not cache near-unique prompts. Every request with a distinct head pays the 1.25x write premium and reads nothing back. Leave cache_control off and pay no premium.
  • Mind the minimum. Prompts below the model's minimum cacheable length are processed uncached with no error; the only signal is cache_creation_input_tokens: 0. The minimum is model-dependent (between 512 and 4,096 tokens across current Claude models as of July 2026, 1,024 on OpenAI, 2,048 to 4,096 on Gemini); check the provider's current table rather than hard-coding a number. [1][2][3]
  • Pick the TTL from the inter-arrival gap. Default 5-minute TTL when turns arrive faster than every 5 minutes (refresh-on-use keeps one write alive for a whole session). Pay the 2x 1-hour write only when real gaps exceed the short TTL, as in bursty ticket queues or slow human-in-the-loop review, and only when at least two later requests will read it.
  • Self-hosted serving does not need this page's price schedule. On your own vLLM or SGLang cluster prefix reuse is an engine feature with no billing contract; you optimize hit rate and TTFT instead (KV cache management), and the prompt-structure discipline below applies unchanged.
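The rules of thumb in this list can be folded into one routing decision. The sketch below is an illustration only: choose_cache_policy is a hypothetical helper, it summarizes traffic as a single typical gap (real traffic has a distribution), and the thresholds are the 5-minute and 1-hour TTLs and the break-even counts quoted on this page.

```python
def choose_cache_policy(prefix_tokens: int, min_cacheable: int,
                        typical_gap_s: float, expected_requests: int) -> str:
    """Return "none", "5m", or "1h" for one route, following the list above."""
    if prefix_tokens < min_cacheable:
        return "none"      # below the model minimum: the marker would be ignored
    if expected_requests < 2:
        return "none"      # a write with no read back is a pure loss
    if typical_gap_s < 5 * 60:
        return "5m"        # refresh-on-use keeps one write alive for the session
    if typical_gap_s < 60 * 60 and expected_requests >= 3:
        return "1h"        # the 2x write needs at least two later reads to pay off
    return "none"          # gaps beyond an hour, or too few reads for the 1h premium


print(choose_cache_policy(10_000, 1024, 30, 40))     # tight agent loop
print(choose_cache_policy(10_000, 1024, 7 * 60, 5))  # gappy queue
```

The minimum is passed in rather than hard-coded because it is model-dependent.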

Architecture

The provider renders the request into one token stream in a fixed order (tools, then system, then messages), hashes the bytes up to each breakpoint, and looks the hash up in a workspace-scoped cache. Caches are never shared across organizations, and a cache entry only becomes usable after the response that writes it begins. [1]

Render order, breakpoints, and the billing split

flowchart TD
  REQ["Request"] --> RENDER["Render in fixed order:<br/>tools, then system, then messages"]
  RENDER --> BP["Breakpoint: cache_control<br/>(max 4 per request)"]
  BP --> LOOKUP{"Exact prefix bytes<br/>cached and inside TTL?"}
  LOOKUP -->|"hit"| HIT["Cached span bills 0.1x<br/>TTL refreshes free"]
  LOOKUP -->|"miss"| WRITE["Write at breakpoint:<br/>1.25x (5m) or 2x (1h)"]
  HIT --> SUFFIX["Suffix after the breakpoint<br/>bills at 1x"]
  WRITE --> SUFFIX

Invalidation is hierarchical, not all-or-nothing. A change invalidates its own tier and everything rendered after it, so the damage depends on where the change sits [1]:

| Change | Tools cache | System cache | Messages cache |
| --- | --- | --- | --- |
| Tool definitions (add, remove, reorder) or model switch | lost | lost | lost |
| System prompt content, web-search or citations toggle | kept | lost | lost |
| tool_choice, images, thinking toggles, message content | kept | kept | lost |
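For a harness audit the same table can be carried as data. The sketch below transcribes the three rows into a lookup; the change labels (for example "tool_definitions") are hypothetical names chosen for illustration, not API fields.

```python
TIERS = ("tools", "system", "messages")      # the fixed render order

# First tier invalidated by each kind of change, transcribed from the table above.
FIRST_TIER_LOST = {
    "tool_definitions": "tools",
    "model_switch": "tools",
    "system_prompt": "system",
    "web_search_toggle": "system",
    "citations_toggle": "system",
    "tool_choice": "messages",
    "images": "messages",
    "thinking_toggle": "messages",
    "message_content": "messages",
}


def tiers_lost(change: str) -> tuple[str, ...]:
    """The changed tier and every tier rendered after it."""
    first = FIRST_TIER_LOST[change]          # KeyError on an unclassified change is deliberate
    return TIERS[TIERS.index(first):]


def tiers_kept(change: str) -> tuple[str, ...]:
    """Tiers rendered before the change, which still hit."""
    return TIERS[:TIERS.index(FIRST_TIER_LOST[change])]


print(tiers_lost("system_prompt"), tiers_kept("system_prompt"))
```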

Two further mechanics matter in agent loops. First, the exact-bytes rule means invalidators are usually silent and self-inflicted: a timestamp interpolated into the system prompt, JSON serialized without a stable key order, a trailing space. Second, a breakpoint walks back at most 20 content blocks to find the previous cache entry, so a single turn that appends more than 20 blocks (common with parallel tool calls) strands the prior entry even though the prefix is untouched; an intermediate breakpoint inside the long turn restores the chain. [1] The block below makes all of this concrete and asserts each case.

# Runnable on system python3 (stdlib). Core mechanism: the cache key is a hash of the
# exact rendered bytes in the fixed render order tools -> system -> messages, and
# invalidation is hierarchical: a tool change invalidates everything, a system change
# spares the tools tier, a message change spares tools and system. The block also
# proves the two classic silent invalidators (a timestamp, non-deterministic JSON) and
# the 20-block lookback wall in long agentic turns.
import hashlib
import json


def render(tools: list[dict], system: list[str], messages: list[str]) -> list[tuple[str, bytes]]:
    """The rendered request as (tier, bytes) blocks, in the API's fixed render order."""
    blocks = [("tools", json.dumps(t, sort_keys=True).encode()) for t in tools]
    blocks += [("system", s.encode()) for s in system]
    blocks += [("messages", m.encode()) for m in messages]
    return blocks


def prefix_hashes(blocks: list[tuple[str, bytes]]) -> list[bytes]:
    """Cumulative hash after each block: the cache key for the prefix ending there."""
    h, out = hashlib.sha256(), []
    for _tier, data in blocks:
        h.update(data)
        out.append(h.copy().digest())
    return out


def hit_blocks(cache: set[bytes], blocks: list[tuple[str, bytes]]) -> int:
    """Longest cached prefix in blocks; the first divergent byte ends reuse."""
    n = 0
    for hsh in prefix_hashes(blocks):
        if hsh not in cache:
            break
        n += 1
    return n


TOOLS = [{"name": "bash", "input_schema": {"type": "object"}},
         {"name": "edit", "input_schema": {"type": "object"}}]
SYSTEM = ["You are a build agent.", "Project context: repo layout, conventions."]
TURNS = ["user: fix the failing test", "assistant: running pytest", "tool: 1 failed"]

base = render(TOOLS, SYSTEM, TURNS)
cache = set(prefix_hashes(base))                 # request 1 writes at its breakpoint

# 1. An identical request hits every block; total = 2 tools + 2 system + 3 messages.
assert hit_blocks(cache, base) == 7

# 2. Append-only growth: the next turn extends the prefix, so all 7 prior blocks still
#    hit and only the 2 new blocks miss. This is why agent history must be append-only.
grown = render(TOOLS, SYSTEM, TURNS + ["assistant: patching", "tool: 0 failed"])
assert hit_blocks(cache, grown) == 7 and len(grown) == 9

# 3. Hierarchy: a timestamp interpolated into the system prompt kills the system and
#    messages tiers, but the tools tier (rendered first) still hits.
stamped = render(TOOLS, ["You are a build agent. Now: 2026-07-03T10:00Z", SYSTEM[1]], TURNS)
assert hit_blocks(cache, stamped) == 2
assert all(t == "tools" for t, _ in stamped[:2])

# 4. Hierarchy, top tier: reordering the tool list changes byte 0, so nothing hits.
assert hit_blocks(cache, render(list(reversed(TOOLS)), SYSTEM, TURNS)) == 0

# 5. Silent invalidator: dicts serialized without sort_keys leak construction order.
#    The same logical tool built key-in-different-order misses; sort_keys makes the
#    bytes identical again (render() uses sort_keys, so the cache survives).
a = {"name": "bash", "input_schema": {"type": "object"}}
b = {"input_schema": {"type": "object"}, "name": "bash"}
assert json.dumps(a) != json.dumps(b)                          # order-sensitive bytes
assert json.dumps(a, sort_keys=True) == json.dumps(b, sort_keys=True)
assert hit_blocks(cache, render([b, TOOLS[1]], SYSTEM, TURNS)) == 7

# 6. Adversarial: one trailing space on the first system block must not partially hit;
#    that block and everything after it miss (tools still hit).
assert hit_blocks(cache, render(TOOLS, [SYSTEM[0] + " ", SYSTEM[1]], TURNS)) == 2

# 7. The 20-block lookback: a new breakpoint walks back at most 20 positions to find a
#    prior cache entry. A single agentic turn that appends 25 tool_use/tool_result
#    blocks strands the old entry; an intermediate breakpoint inside the turn fixes it.
def lookback_finds_entry(stored: set[bytes], blocks, window: int = 20) -> bool:
    hashes = prefix_hashes(blocks)
    return any(h in stored for h in hashes[-window:])

burst = render(TOOLS, SYSTEM, TURNS + [f"tool: chunk {i}" for i in range(25)])
assert hit_blocks(cache, burst) == 7                            # prefix itself is fine
assert not lookback_finds_entry(cache, burst)                   # but the walk-back fails
mid = render(TOOLS, SYSTEM, TURNS + [f"tool: chunk {i}" for i in range(15)])
cache_mid = cache | {prefix_hashes(mid)[-1]}                    # marker 15 blocks in
assert lookback_finds_entry(cache_mid, burst)                   # back within 20 blocks

print("prefix OK:", "append-only=7/9 hits; timestamp=2; tool-reorder=0;",
      "sorted-json survives; 25-block turn strands the entry, mid-marker recovers")

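Case 7 suggests a small planning step for long turns: decide where the markers go before sending. The sketch below is one way to do that under stated assumptions: plan_breakpoints is a hypothetical helper, the stride of 15 and window of 20 are the figures used on this page, and the budget of 4 is the per-request breakpoint limit, which markers on tools or system also draw from (pass a smaller budget if those are in use). It follows the toy model above in treating a turn of 20 or more appended blocks as out of reach of a single marker.

```python
def plan_breakpoints(appended_blocks: int, stride: int = 15,
                     window: int = 20, budget: int = 4) -> list[int]:
    """Marker offsets, counted in blocks past the previous cache entry.

    The last offset is always the end of the turn; intermediate offsets keep
    every hop shorter than the lookback window.
    """
    if appended_blocks <= 0:
        raise ValueError("a turn must append at least one block")
    if not 0 < stride < window:
        raise ValueError("stride must be positive and shorter than the window")
    if appended_blocks < window:
        return [appended_blocks]             # one marker still reaches the old entry
    offsets = list(range(stride, appended_blocks, stride))
    offsets.append(appended_blocks)          # always mark the end of the turn
    if len(offsets) > budget:
        raise ValueError("turn needs more markers than the budget allows; split it")
    return offsets


print(plan_breakpoints(10), plan_breakpoints(25), plan_breakpoints(60))
```

Raising when the budget is exceeded is a deliberate choice in this sketch: a silently stranded entry is harder to notice than a failed assembly step.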
How to use it

On Anthropic, put the marker on the last block that stays byte-identical across requests, not simply on the last block. A marker on the final system block caches tools and system together (tools render first); in multi-turn use, a marker on the most recent turn's last content block extends the chain each request. The snippet is a reference template on the real SDK; it is not executed here.

# Reference template (needs the anthropic SDK and an API key). Not executed here.
import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=1024,
    tools=STABLE_TOOL_LIST,                       # deterministic order, frozen per deploy
    system=[{
        "type": "text",
        "text": FROZEN_SYSTEM_PROMPT,             # no timestamps, no per-user interpolation
        "cache_control": {"type": "ephemeral"},   # add "ttl": "1h" only for gappy traffic
    }],
    messages=history + [{"role": "user", "content": user_turn}],
)

# The three usage fields partition the prompt; their sum is the true prompt size.
u = response.usage
total = u.input_tokens + u.cache_creation_input_tokens + u.cache_read_input_tokens
hit_rate = u.cache_read_input_tokens / max(1, total)
  • cache_read_input_tokens is the span served at 0.1x, cache_creation_input_tokens the span written at the premium, input_tokens only the uncached remainder. A long-running agent showing input_tokens: 4000 on a 200k-token prompt is the feature working; always reason about the sum. [1]
  • On OpenAI there is nothing to mark: keep the shared prefix at least 1,024 tokens, put variable content last, and read usage.prompt_tokens_details.cached_tokens. Set prompt_cache_key per prefix family (per agent, per template) so load balancing does not scatter one prefix across cache shards. [2]
  • On Gemini, implicit caching applies the discount automatically above the per-model minimum; cached counts appear in usage_metadata. [3]
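Because each provider reports the cached span differently, a harness usually normalizes the counters before computing a hit rate. The sketch below works on plain dicts rather than SDK objects and rests on assumptions that should be checked against current docs: that OpenAI's prompt_tokens already includes the cached tokens, and that Gemini's usage_metadata exposes prompt_token_count and cached_content_token_count (the page itself only states that cached counts appear in usage_metadata). cached_fraction is a hypothetical helper name.

```python
def cached_fraction(provider: str, usage: dict) -> float:
    """Share of the prompt served from cache, normalized across providers."""
    if provider == "anthropic":
        # The three fields partition the prompt, so the total is their sum.
        read = usage.get("cache_read_input_tokens") or 0
        total = ((usage.get("input_tokens") or 0)
                 + (usage.get("cache_creation_input_tokens") or 0) + read)
    elif provider == "openai":
        # Assumption: prompt_tokens counts the whole prompt, cached span included.
        total = usage.get("prompt_tokens") or 0
        read = (usage.get("prompt_tokens_details") or {}).get("cached_tokens") or 0
    elif provider == "gemini":
        # Assumption: these usage_metadata field names; verify against the SDK in use.
        total = usage.get("prompt_token_count") or 0
        read = usage.get("cached_content_token_count") or 0
    else:
        raise ValueError(f"unknown provider: {provider}")
    return read / total if total else 0.0


agent_turn = {"input_tokens": 4000, "cache_creation_input_tokens": 0,
              "cache_read_input_tokens": 196_000}
print(cached_fraction("anthropic", agent_turn))
```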

How to develop with it

Design the prompt-assembly path around the stability hierarchy from agentic loop economics: most stable first (system prompt, tool definitions), per-task context next, per-turn content last, and nothing volatile ahead of the last breakpoint.

  • Freeze the system prompt per deploy. Inject dates, user names, modes, and feature flags as late message content, never as system-prompt interpolation. Every f-string in the system prompt is a per-request cache key.
  • Serialize tools deterministically. Sort tool lists by name and serialize schemas with a stable key order; assert in CI that two builds of the tool block are byte-identical (case 5 in the prefix-hashing block above is exactly this test).
  • Append, never rewrite. History edits, mid-session summarization, and re-rendered tool definitions all rewrite the prefix. When context must shrink, prefer dropping whole trailing spans or use the provider's context-management features, and expect one cold write when you do (context and memory).
  • Forks must copy the parent's prefix verbatim. Sub-agents, summarizers, and evaluators that rebuild system, tools, or model with any difference miss the parent's cache entirely; copy those three unchanged and append the fork-specific content at the end (harness architecture).
  • Long tool-heavy turns need intermediate breakpoints. A turn appending more than 20 content blocks breaks the lookback chain (case 7 in the prefix-hashing block above); place a marker roughly every 15 blocks, within the limit of 4 breakpoints per request.
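The first two rules in this list can be enforced at the assembly step. The sketch below is a minimal illustration, not the API's request schema: assemble and tools_block are hypothetical helpers, and the returned dict is a stand-in shape chosen so the stable parts can be compared byte for byte.

```python
import json


def tools_block(tools: list[dict]) -> bytes:
    """Serialize a tool list to bytes that do not depend on registration or key order."""
    ordered = sorted(tools, key=lambda t: t["name"])
    return json.dumps(ordered, sort_keys=True, separators=(",", ":")).encode("utf-8")


def assemble(frozen_system: str, tools: list[dict], history: list[str],
             volatile: dict[str, str], user_turn: str) -> dict:
    """Stable parts first; volatile values appear only in the last message."""
    context = "; ".join(f"{k}={v}" for k, v in sorted(volatile.items()))
    return {
        "tools": tools_block(tools),
        "system": frozen_system,
        "messages": [*history, f"[context: {context}] {user_turn}"],
    }


SYSTEM = "You are a build agent."
bash = {"name": "bash", "input_schema": {"type": "object"}}
edit = {"input_schema": {"type": "object"}, "name": "edit"}   # keys in another order

first = assemble(SYSTEM, [bash, edit], [], {"now": "2026-07-03T10:00Z"}, "fix the test")
second = assemble(SYSTEM, [edit, bash], [], {"now": "2026-07-03T10:05Z"}, "fix the test")

# CI-style check: everything ahead of the messages is byte-identical across builds.
assert first["tools"] == second["tools"]
assert first["system"] == second["system"]
assert first["messages"] != second["messages"]    # only the tail carries the timestamp
```

Once a turn is sent, its message (timestamp included) joins the history verbatim, so later turns still extend the prefix rather than rewrite it.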

How to run it in production

  • Track hit rate as a first-class SLI. Emit cache_read / (cache_read + cache_creation + input) per route; the OpenTelemetry GenAI conventions carry gen_ai.usage.cache_read.input_tokens and gen_ai.usage.cache_creation.input_tokens for exactly this (GenAI observability). A hit rate near zero on prefix-heavy traffic is a rendering bug, not a tuning problem: diff the exact bytes of two consecutive requests.
  • Expect a cold window on every deploy. A prompt version bump or model switch rewrites the prefix, so every active session pays one write. Roll prompt changes with the same care as a schema migration and annotate the hit-rate dashboard with deploys.
  • Serialize the first request of a fan-out. A cache entry becomes readable only after the response writing it begins, so N parallel requests with the same new prefix all miss and none gets the read discount. Send one, wait for the first streamed token, then release the remaining N - 1 requests. [1]
  • Pre-warm only when it pays. Anthropic accepts max_tokens: 0 as a prefill-only request that writes the cache and returns no output; use it at worker boot or before a scheduled traffic window when first-token latency is user-visible. Continuous traffic keeps itself warm; an interval re-warm under those conditions is a pure extra write. The pre-warm form is rejected with streaming, extended thinking, structured outputs, forced tool choice, and batch requests. [1]
  • Respect scoping. Caches are isolated per workspace or organization and never shared across organizations, so multi-tenant platforms get no cross-tenant reuse; per-tenant system prompts each warm their own entry. [1]
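The fan-out rule is easy to see in a simulation. The sketch below uses a toy FakeProvider (hypothetical, no network calls) in which an entry becomes readable only after its writer has yielded once, standing in for "after the response begins". As a simplification it waits for the whole first response rather than the first streamed token.

```python
import asyncio


class FakeProvider:
    """Toy stand-in for a hosted API with a single shared prefix cache."""

    def __init__(self) -> None:
        self.warm: set[str] = set()
        self.writes = 0
        self.reads = 0

    async def call(self, prefix: str) -> bool:
        hit = prefix in self.warm
        if hit:
            self.reads += 1
        else:
            self.writes += 1
        await asyncio.sleep(0)      # yield: other in-flight requests run before the entry lands
        self.warm.add(prefix)
        return hit


async def naive_fan_out(api: FakeProvider, prefix: str, n: int) -> list[bool]:
    return list(await asyncio.gather(*(api.call(prefix) for _ in range(n))))


async def warmed_fan_out(api: FakeProvider, prefix: str, n: int) -> list[bool]:
    first = await api.call(prefix)                       # one request writes the entry
    rest = await asyncio.gather(*(api.call(prefix) for _ in range(n - 1)))
    return [first, *rest]


async def main() -> tuple[int, int, int]:
    naive, warmed = FakeProvider(), FakeProvider()
    await naive_fan_out(naive, "shared-prefix", 8)
    await warmed_fan_out(warmed, "shared-prefix", 8)
    return naive.writes, warmed.writes, warmed.reads


naive_writes, warmed_writes, warmed_reads = asyncio.run(main())
print(naive_writes, warmed_writes, warmed_reads)
```

In the naive run all eight requests check the cache before any entry lands, so all eight write; serializing the first leaves one write and seven reads.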

How to maintain it

  • Re-verify the numbers on a schedule. Multipliers, minimum cacheable lengths, and retention windows drift per model generation; the per-model minimum table has already changed across 2026 snapshots. Treat any hard-coded constant from this page's economics as a value to re-check against the provider docs at review time. [1][2][3]
  • Gate harness changes on the invalidator audit. Any PR touching prompt assembly, tool registration, or MCP integration should answer: does this change bytes ahead of a breakpoint? A reconnect that re-emits tools in a new order is a cache-busting change even though no prompt text changed.
  • Re-baseline after model migrations. Caches are model-scoped; budget the migration's first-day bill with every session paying one cold write, and confirm the hit rate recovers to its prior level.
  • Keep a rendered-prompt diff tool at hand. The fastest diagnosis for a silent miss is byte-diffing two consecutive rendered requests; hashes in logs (never raw prompts) make this cheap to automate.

Failure modes

  • Markers set, zero reads. cache_read_input_tokens stays 0 across identical-looking requests: a silent invalidator (timestamp, UUID, unsorted JSON, trailing whitespace) is changing bytes ahead of the breakpoint. Diff the rendered bytes; do not tune TTLs.
  • Prefix below the model minimum. No error is returned; the marker is ignored and cache_creation_input_tokens is 0. Do not pad the prompt: move more stable content ahead of the breakpoint, or accept the prompt as uncacheable.
  • Write premium with no reads. Near-unique prompts marked for caching bill 1.25x for nothing (case 3 in the economics block). Remove the marker on routes without prefix reuse.
  • Wrong TTL for the traffic shape. Gappy traffic on the 5-minute TTL pays the write premium on every request and ends up costlier than not caching (case 5 in the economics block). Move to the 1-hour TTL or drop caching on that route.
  • Fan-out storm on a cold prefix. Parallel workers all miss because the entry is not readable until the first response begins; serialize the first request.
  • Summarize-instead-of-append. Mid-session compaction rewrites the prefix and discards the cache; the next turn is a cold prefill of the whole context (agentic loop economics).
  • Tool-set mutation or MCP reconnect. Tools render at position zero, so an added, removed, or reordered tool invalidates all three tiers (case 4 in the prefix-hashing block); dynamic tool discovery should append definitions, never swap them.
  • Turn longer than the lookback window. More than 20 blocks appended in one turn strand the previous entry (case 7 in the prefix-hashing block); add intermediate breakpoints.
  • Deploy-shaped hit-rate collapse. A prompt bump or model switch turns every active session cold at once; if the platform is latency-sensitive, stagger the rollout and pre-warm the new prefix.
  • Assuming cross-tenant reuse. Identical prompts in different organizations never share a cache; capacity and cost models that assume one warm entry per prompt family must multiply by tenancy.
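Several of these failure modes can be told apart from the usage counters alone. The sketch below is a rough triage heuristic, not a diagnosis: triage is a hypothetical helper, it assumes every request in the window carried a cache marker, and its labels are illustrative. It takes Anthropic-style usage dicts from at least two consecutive requests plus the median gap between them.

```python
def triage(usages: list[dict], median_gap_s: float, ttl_s: int = 300) -> str:
    """Map usage counters from consecutive marked requests to the likeliest failure mode."""
    if len(usages) < 2:
        raise ValueError("need at least two consecutive requests")
    reads = sum(u.get("cache_read_input_tokens", 0) for u in usages)
    writes = sum(u.get("cache_creation_input_tokens", 0) for u in usages)
    if reads > 0:
        return "ok"                                  # the cache is being read
    if writes == 0:
        return "below-minimum"                       # marker ignored, nothing written
    if median_gap_s >= ttl_s:
        return "ttl-too-short"                       # every request lands after expiry
    return "silent-invalidator-or-unique-prompts"    # diff the rendered bytes next


cold_every_time = [{"cache_creation_input_tokens": 10_000, "cache_read_input_tokens": 0}] * 3
print(triage(cold_every_time, median_gap_s=30))
print(triage(cold_every_time, median_gap_s=7 * 60))
```

The last label deliberately covers two causes; only a byte diff of consecutive rendered requests separates a silent invalidator from genuinely unique prompts.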

References

  • Anthropic, Prompt caching: cache_control, 4-breakpoint limit, 5-minute refresh-on-use and 1-hour TTLs, 1.25x/2x write and 0.1x read pricing, per-model minimum cacheable lengths, invalidation hierarchy, 20-block lookback, concurrency and scoping rules, max_tokens: 0 pre-warming.
  • OpenAI, Prompt caching: automatic caching at 1,024+ tokens, 5-to-10-minute retention up to 1 hour, extended retention up to 24 hours, prompt_cache_key, usage.prompt_tokens_details.cached_tokens.
  • Google, Gemini context caching: implicit caching defaults on Gemini 2.5+, per-model minimum token counts, explicit cachedContents caching and its surface support.
  • Lumer et al., Don't Break the Cache: An Evaluation of Prompt Caching for Long-Horizon Agentic Tasks: measured 41 to 80 percent cost and 13 to 31 percent TTFT reductions across providers on 500+ agent sessions.
  • vLLM, Automatic Prefix Caching (design): the engine-side prefix-reuse mechanism that the provider feature parallels.

Related: Agentic loop economics · KV cache management · Harness architecture · Context & memory · GenAI observability (OpenTelemetry) · LLM inference efficiency · Inference serving · Glossary


  1. Anthropic, "Prompt caching" https://platform.claude.com/docs/en/build-with-claude/prompt-caching. cache_control: {"type": "ephemeral"} with optional "ttl": "1h"; max 4 breakpoints; render order tools, system, messages; writes 1.25x (5m) / 2x (1h) and reads 0.1x of base input (Opus 4.8 at $5/MTok: $6.25, $10, $0.50 per MTok); TTL refreshed free on each use; per-model minimum cacheable length (512 to 4,096 tokens across current Claude models as of July 2026), below which prompts are silently uncached; usage split across input_tokens + cache_creation_input_tokens + cache_read_input_tokens; tiered invalidation (tools > system > messages); 20-block lookback per breakpoint; entries readable only after the writing response begins; workspace/organization scoping, never cross-organization; max_tokens: 0 pre-warm rejected with streaming, extended thinking, structured outputs, forced tool choice, and batches. 

  2. OpenAI, "Prompt caching" https://developers.openai.com/api/docs/guides/prompt-caching. Automatic for prompts of 1,024 tokens or longer, no fee; up to 90 percent input-cost and 80 percent latency reduction; retention roughly 5 to 10 minutes of inactivity up to 1 hour, extended retention up to 24 hours on supported models; prompt_cache_key for cache routing; cached span reported in usage.prompt_tokens_details.cached_tokens.

  3. Google, "Gemini API: context caching" https://ai.google.dev/gemini-api/docs/caching. Implicit caching enabled by default on Gemini 2.5 and newer (minimum 2,048 tokens on 2.5 Flash/Pro, 4,096 on newer models) with savings passed through automatically and hits reported in usage_metadata; explicit cachedContents caching is a separate API surface and is not supported everywhere (not on the Interactions API), so verify before depending on it. 

  4. Lumer et al., "Don't Break the Cache: An Evaluation of Prompt Caching for Long-Horizon Agentic Tasks" https://arxiv.org/abs/2601.06007. Across providers, over 500 long-horizon agent sessions with 10K-token system prompts, prompt caching cut API cost by 41 to 80 percent and time-to-first-token by 13 to 31 percent.