Markdown

Agent observability¶

Scope: making an agent's behaviour visible enough to debug, evaluate, and audit. The inference boundary is the easiest place to instrument, because on every model call the harness already holds the full prompt and the full response. The discipline is to capture the whole trajectory, span it with shared conventions, and join every boundary on one request id, without summarizing away the signal you will need later. Built on the harness and the control plane; it feeds evaluation and prompt-injection defense.

Code and configs here are reference templates; pin versions and validate before relying on them.

flowchart LR
  subgraph SRC["What enters the prompt"]
    SYS["System"]
    USR["User"]
    RAG["Retrieved docs"]
    PRIOR["Prior tool results"]
    MEM["Memory"]
  end
  SRC --> CALL["Model call (full prompt + response on the wire)"]
  CALL --> TRACE["Trace points: pre-prompt, post-response, each tool decision"]
  TRACE --> SPAN["OpenTelemetry GenAI spans (invoke_agent, execute_tool)"]
  SPAN --> JOIN["Join on X-Request-Id"]
  JOIN --> AUDIT["Audit + eval + detector verdict"]

Overview¶

Most distributed-tracing problems are about reconstructing what happened from partial signals. The agent inference path is the opposite: the harness sits in the middle of every model call and sees the complete input and output. That makes the inference boundary the highest-signal, lowest-effort layer to instrument. The same vantage point that lets you trace a run also lets a per-request check (a prompt-injection detector) return a verdict before the tool path runs.¹

Core knowledge¶

Capture the whole trajectory, at fixed points¶

An agent run is a trajectory, not a single call. Instrument three points per step: before the model call (the assembled prompt), after the response (the model output and any tool requests), and at each tool-call decision (what was proposed, what the control plane decided, what the tool returned). Record where each part of the prompt came from: system instructions, user input, retrieved documents, prior tool results, and memory. That provenance is what later lets you attribute a bad action to a poisoned input.¹

Use shared span conventions¶

Adopt the OpenTelemetry GenAI semantic conventions so agent traces interoperate with the rest of the platform's telemetry. The agent-level span (invoke_agent) wraps lower-level execute_tool and model-call spans, giving a hierarchy that reads as a trajectory rather than a flat list of chat completions. Standard spans also mean the same dashboards, sampling, and alerting that cover services cover agents.²

Join every boundary on one request id¶

An agent action crosses several boundaries: the inference call, the policy decision, the credential issued for the tool, and any state mutation. Stamp a single X-Request-Id at ingress and propagate it through all of them. The single timeline (chat input, reasoning, detector verdict, policy decision, issued credential, backend audit row) is what turns four separate logs into one explainable event. The one load-bearing change in application code is generating and propagating that id.³

Do not summarize the signal you need¶

Compressing traces for storage or for feeding back into an optimiser is lossy in a way that matters. Harness-evaluation work finds a large drop in downstream performance when raw traces are replaced by model-written summaries, on the order of fifteen points, because the summary discards the exact tokens that explain a failure. Keep raw traces for the work that depends on them (debugging, evaluation, failure attribution) and summarize only for human digests. The recurring leak sites are experiment trackers, truncated spans, trace UIs that elide tool bodies, and chat or paging digests.⁵

Track agent-specific metrics¶

Beyond latency and cost, the metrics that predict agent reliability are trajectory-shaped: turns per trajectory, tool-call validity rate (fraction of calls that parse and satisfy their schema), tool-error rate, retries per task, and budget consumed per task. A rising turn count or tool-error rate is an early warning that a harness change regressed, often before end-task success moves.¹

A reference shape¶

An inline proxy on the inference path can capture both directions of every model call, attach the request id, run a detector ensemble, and emit the span, without changing the agent's code. LLMTrace (llmtrace.io) is one adopted implementation: a drop-in gateway that records the prompt and response, returns a per-request prompt-injection verdict, and joins the inference boundary to the policy and credential boundaries on the shared id, so inference-path security and tracing ship as one gateway rather than a bolt-on.¹

Monitoring is a signal, not a guarantee¶

Instrumentation can be turned against the observer. An adaptive attacker can poison the traces a monitor reads: the agent-as-a-proxy attack has an agent repeat adversarial strings in its reasoning and tool-call traces so a safety monitor misreads a malicious trajectory as safe, and, counterintuitively, observing more of the tool-call path can widen the attack surface rather than narrow it. Treat traces and chain-of-thought as a noisy signal, not proof of safety, and add a perplexity or anomaly check on tool outputs instead of trusting the agent's own narration.⁴

Don't-miss checklist¶

Instrument pre-prompt, post-response, and every tool decision; record prompt-part provenance.
Emit OpenTelemetry GenAI spans (invoke_agent over execute_tool) so agent traces join platform telemetry.
Propagate one X-Request-Id across inference, policy, credential, and state boundaries.
Keep raw traces for debugging and evaluation; never feed only summaries back into an optimiser.
Alert on tool-call validity rate, tool-error rate, turns per trajectory, and budget per task.

Failure modes¶

Chat-completion logging only. Flat per-call logs with no trajectory; you cannot reconstruct why the agent did something.
Lost provenance. The prompt is logged as one blob, so a poisoned retrieved document cannot be distinguished from a legitimate instruction.
Broken join. Boundaries log to separate systems with no shared id; correlating an action to its authorization is manual.
Summary-only retention. Raw traces are dropped; later evaluation and failure attribution run on lossy text and underperform.
Vanity metrics. Dashboards track tokens and latency but not tool-call validity or turn count, so harness regressions go unnoticed.

Open questions & validation¶

A direct measure of context quality, rather than inferring it from end-task success, is still missing.
Sampling strategy for high-volume agents: what to keep at full fidelity versus summarize, validated against debugging needs.
Detector verdicts on the inference path add latency; measure the budget on the target workload (prompt-injection defense).

References¶

OpenTelemetry GenAI semantic conventions: https://opentelemetry.io/docs/specs/semconv/gen-ai/
OWASP Top 10 for LLM Applications (LLM01 Prompt Injection): https://genai.owasp.org/llm-top-10/
LLMTrace (inference-path security gateway): https://llmtrace.io
Isbarov & Kantarcioglu, Bypassing AI control via agent-as-a-proxy attacks: https://arxiv.org/abs/2602.05066
Anthropic, Building effective agents: https://www.anthropic.com/research/building-effective-agents

The inference boundary holds the full prompt and response on every call, making it the highest-signal observability layer; capture the trajectory at pre-prompt, post-response, and each tool decision, record prompt-part provenance, and track trajectory-shaped metrics (turns, tool-call validity, tool-error rate, budget per task). ↩↩↩↩
OpenTelemetry GenAI conventions define agent and tool spans (invoke_agent, execute_tool) so agent traces share the platform's tracing, sampling, and alerting. ↩
Propagating a single X-Request-Id across the inference, policy, credential, and mutation boundaries joins otherwise separate logs into one explainable timeline. ↩
The agent-as-a-proxy attack (Isbarov & Kantarcioglu) has an agent repeat adversarial strings in its reasoning and tool-call traces so a monitor misreads a malicious trajectory as safe; observing more of the tool-call path can increase the attack surface. Treat monitoring as a noisy signal and add anomaly checks on tool outputs. ↩
Harness-evaluation work reports a large downstream-performance drop (around fifteen points) when raw traces are replaced by model-written summaries; keep raw traces for debugging, evaluation, and failure attribution. ↩