
Evaluating agents¶

Scope: measuring whether an agent works well enough to ship, and catching regressions after it does. Because an agent acts, a wrong action has a real cost, so evaluation has to be automated and has to look past the final answer at the components and the trajectory that produced it. This page covers what to evaluate, how to build datasets and rubrics from real failures, how to use a model as judge without being fooled by it, and how to gate deployment. It depends on observability and closes the loop on the harness.

Code and rubrics here are reference templates; validate against your own data before relying on them.

flowchart LR
  OPS["Production traces"] --> ERR["Error analysis (open coding)"]
  ERR --> TAX["Failure taxonomy"]
  TAX --> RUB["PASS/FAIL rubrics"]
  RUB --> MET["Metrics (pass rate)"]
  MET --> TEST["Update test sets"]
  TEST --> GATE["CI eval gate"]
  GATE --> OPS

Core knowledge¶

Evaluate output, components, and trajectory¶

Scoring only the final answer cannot tell a correct result that happened by luck from one produced by sound reasoning. Evaluate three layers: the output (was the answer right), the components (did each tool call and sub-step behave), and the trajectory (was the path coherent and efficient). Trajectory evaluation is what catches an agent that reached the right answer through a fragile or wasteful route that will fail next time.¹ This is also why raw traces must be retained rather than summarized (observability).
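The three layers can be scored separately so that a lucky pass is visible as one. The following is a minimal sketch, not a prescribed implementation: `Step`, `LayeredResult`, `evaluate_run`, the exact-match comparison, and the trajectory heuristic (step budget plus no immediate repeat of the same tool) are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Step:
    tool: str   # name of the tool the agent called
    ok: bool    # whether the call returned a usable result


@dataclass
class LayeredResult:
    output_ok: bool
    components_ok: bool
    trajectory_ok: bool


def evaluate_run(answer: str, expected: str, steps: list[Step], max_steps: int) -> LayeredResult:
    # Output layer: exact match after trimming and lowercasing (an assumption;
    # real tasks usually need a task-specific comparison).
    output_ok = answer.strip().lower() == expected.strip().lower()
    # Component layer: every tool call must have behaved.
    components_ok = all(step.ok for step in steps)
    # Trajectory layer (hypothetical heuristic): stay within the step budget
    # and never call the same tool twice in a row.
    no_immediate_repeat = all(a.tool != b.tool for a, b in zip(steps, steps[1:]))
    trajectory_ok = len(steps) <= max_steps and no_immediate_repeat
    return LayeredResult(output_ok, components_ok, trajectory_ok)


# A run that reaches the right answer after a failed call and a retry.
lucky = evaluate_run(
    answer="42",
    expected=" 42 ",
    steps=[Step("search", False), Step("search", True), Step("calc", True)],
    max_steps=2,
)
```

Here `lucky.output_ok` is true while the other two layers are false, which is exactly the case that output-only scoring would wave through.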

Build the dataset from real failures¶

A useful evaluation set mixes three test types: core use cases, hard edge cases, and adversarial inputs. Sources are public benchmarks, custom cases, and captured production traces. The method that produces good rubrics is error analysis: review a sample of traces (on the order of 50 to 100 per cycle), describe each failure in plain language (open coding), then group those descriptions into a failure taxonomy ranked by frequency and severity. The taxonomy tells you what to measure, instead of guessing at metrics up front.¹
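Once traces have been open-coded, grouping and ranking them is mechanical. The sketch below assumes each reviewed trace has been reduced to a `(trace_id, failure_code, severity)` tuple with severity on a 1 to 3 scale, and it ranks by frequency times worst observed severity. The field names, the scale, and the ranking formula are assumptions for illustration, not part of the cited method.

```python
from collections import Counter

# Hypothetical open-coded notes from one review cycle.
coded = [
    ("t1", "wrong_tool", 2),
    ("t2", "wrong_tool", 2),
    ("t3", "hallucinated_citation", 3),
    ("t4", "wrong_tool", 1),
    ("t5", "timeout", 1),
]


def build_taxonomy(notes):
    """Return (code, count, worst_severity, score) rows, highest score first."""
    counts = Counter(code for _, code, _ in notes)
    worst = {}
    for _, code, severity in notes:
        worst[code] = max(worst.get(code, 0), severity)
    rows = [(code, counts[code], worst[code], counts[code] * worst[code]) for code in counts]
    # Sort by descending score; break ties alphabetically for stable output.
    return sorted(rows, key=lambda row: (-row[3], row[0]))


taxonomy = build_taxonomy(coded)
```

The top rows of `taxonomy` are the candidates for the first rubrics; a rare but severe category can still outrank a frequent trivial one under this scoring.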

Rubrics and metrics¶

Turn the taxonomy into rubrics, and prefer binary PASS/FAIL criteria over graded scales, because a binary judgment is more reproducible and easier to act on (for example: answer relevant, citations reliable, requirements met). Aggregate the rubrics into a metric such as overall pass rate, and track it over time.¹
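Aggregating binary verdicts into a pass rate can be sketched as follows. The three criterion names mirror the examples above; the rule that a case passes only when every criterion passes, and that a missing verdict counts as FAIL, is an assumption you may want to relax.

```python
RUBRIC = ("answer_relevant", "citations_reliable", "requirements_met")


def case_passes(verdicts: dict[str, bool]) -> bool:
    # A case passes only if every criterion passes; a missing verdict is a FAIL.
    return all(verdicts.get(name, False) for name in RUBRIC)


def pass_rate(cases: list[dict[str, bool]]) -> float:
    if not cases:
        raise ValueError("pass rate is undefined for an empty evaluation set")
    return sum(case_passes(case) for case in cases) / len(cases)


def per_criterion_rate(cases: list[dict[str, bool]]) -> dict[str, float]:
    # Per-criterion rates show which rubric is dragging the overall number down.
    return {name: sum(case.get(name, False) for case in cases) / len(cases) for name in RUBRIC}


cases = [
    {"answer_relevant": True, "citations_reliable": True, "requirements_met": True},
    {"answer_relevant": True, "citations_reliable": False, "requirements_met": True},
    {"answer_relevant": True, "citations_reliable": True, "requirements_met": True},
    {"answer_relevant": False, "citations_reliable": True},  # one verdict missing
]
overall = pass_rate(cases)
```

Tracking `overall` per commit gives the time series, and the per-criterion breakdown points at what to fix first.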

LLM-as-judge, used carefully¶

A model can scale grading, either scoring a single output against a rubric or comparing two outputs pairwise. Make it reliable by using a judge at least as strong as the model under test, grounding it strictly in an evidence pack, and forcing structured output. The caveat is real: an LLM judge is itself promptable and can be talked out of its verdict, so do not use a general model as a judge on adversarial inputs without defenses, and prefer a small purpose-trained judge where injection is a risk (prompt-injection defense).¹
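Grounding and structured output can be enforced around the judge call rather than trusted to the prompt alone. In this sketch `call_model` is a hypothetical callable that takes a prompt string and returns the judge's raw text; the JSON shape, the evidence-id scheme, and the fail-closed policy are assumptions. It checks structure and grounding only and is not an injection defense.

```python
import json
from typing import Callable


def build_prompt(criterion: str, evidence: dict[str, str], output: str) -> str:
    pack = "\n".join(f"[{eid}] {text}" for eid, text in evidence.items())
    return (
        "Judge the OUTPUT against one criterion using only the EVIDENCE.\n"
        f"Criterion: {criterion}\n"
        f"EVIDENCE:\n{pack}\n"
        f"OUTPUT:\n{output}\n"
        'Reply with JSON only: {"verdict": "PASS" or "FAIL", "evidence_ids": [...]}'
    )


def parse_verdict(raw: str) -> dict:
    data = json.loads(raw)  # raises ValueError (JSONDecodeError) on non-JSON text
    if not isinstance(data, dict) or data.get("verdict") not in ("PASS", "FAIL"):
        raise ValueError("verdict must be PASS or FAIL")
    ids = data.get("evidence_ids")
    if not isinstance(ids, list) or not all(isinstance(i, str) for i in ids):
        raise ValueError("evidence_ids must be a list of strings")
    return data


def judge(call_model: Callable[[str], str], criterion: str,
          evidence: dict[str, str], output: str) -> str:
    try:
        verdict = parse_verdict(call_model(build_prompt(criterion, evidence, output)))
    except ValueError:
        return "FAIL"  # fail closed on malformed judge output
    cited = verdict["evidence_ids"]
    # A PASS must cite at least one id that actually exists in the evidence pack.
    if verdict["verdict"] == "PASS" and (not cited or any(i not in evidence for i in cited)):
        return "FAIL"
    return verdict["verdict"]
```

Failing closed means a chatty or manipulated judge that drifts from the schema cannot produce a PASS by accident, though a well-formed but wrong PASS still gets through.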

Gate deployment, then keep improving¶

Wire evaluation into delivery as a gate: a pre-merge stage runs a small fast set (tens of cases) on every change, and a staging stage runs larger sets with load and integration tests before release. The standing process is a flywheel: operational data feeds error analysis, which updates the taxonomy, rubrics, and test sets, which feed the gate, which protects the next release. Building an agent is a continuous cycle, not a one-time act.¹ Evaluating a harness change requires holding the substrate fixed (pinned model, sandbox, and data), so that the measured delta reflects the change and not drift (harness architecture).
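A pre-merge gate that also refuses to compare across a changed substrate might look like the sketch below. The function names, the fingerprint fields, and the thresholds (`min_rate=0.90`, `max_drop=0.02`) are hypothetical placeholders to tune against your own data.

```python
import hashlib
import json


def substrate_fingerprint(model_id: str, sandbox_image: str, dataset_version: str) -> str:
    # Hash the pinned substrate so two eval runs can be checked for comparability.
    payload = json.dumps(
        {"model": model_id, "sandbox": sandbox_image, "data": dataset_version},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def gate(candidate_rate: float, baseline_rate: float,
         candidate_fp: str, baseline_fp: str,
         min_rate: float = 0.90, max_drop: float = 0.02) -> tuple[bool, str]:
    if candidate_fp != baseline_fp:
        return False, "substrate changed: delta cannot be attributed to the harness change"
    if candidate_rate < min_rate:
        return False, f"pass rate {candidate_rate:.2f} is below the floor {min_rate:.2f}"
    if baseline_rate - candidate_rate > max_drop:
        return False, f"pass rate dropped by {baseline_rate - candidate_rate:.2f}"
    return True, "ok"


fp = substrate_fingerprint("model-v1", "sandbox:2024-01", "evalset-7")
ok, reason = gate(candidate_rate=0.94, baseline_rate=0.95, candidate_fp=fp, baseline_fp=fp)
```

In a CI job, the boolean would typically drive the process exit code so that a failed gate blocks the merge.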

Benchmarks worth knowing¶

Public benchmarks calibrate where an agent stands and stress different skills: SWE-bench for real-world software issues, OSWorld for computer-use in a real desktop environment, GAIA for general assistant tasks with unambiguous answers, tau-bench for tool-and-user interaction, and AgentBench across multiple environments. Treat them as calibration, not as a substitute for an evaluation set built from your own failures.²³⁴⁵⁶

Attributing multi-agent failures is hard¶

When a multi-agent run fails, finding which step caused it is itself an open problem. With current methods, step-level failure attribution is far less accurate than agent-level attribution, so do not assume you can automatically localise the fault in a long multi-agent trajectory (orchestration).⁷⁸
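If you do try automated attribution, measure it at both granularities against hand-labelled runs before relying on it. The sketch below assumes each label and each prediction is an `(agent_name, step_index)` pair keyed by run id, with `step_index` counted over the whole trajectory; these names and the data layout are illustrative, not taken from the cited studies.

```python
def attribution_accuracy(predictions: dict, ground_truth: dict) -> tuple[float, float]:
    """Return (agent_level_accuracy, step_level_accuracy) over labelled runs."""
    if not ground_truth:
        raise ValueError("need at least one labelled run")
    agent_hits = 0
    step_hits = 0
    for run_id, (true_agent, true_step) in ground_truth.items():
        predicted = predictions.get(run_id)
        if predicted is None:
            continue  # a run with no prediction counts as a miss at both levels
        predicted_agent, predicted_step = predicted
        agent_hits += predicted_agent == true_agent
        step_hits += predicted_step == true_step
    total = len(ground_truth)
    return agent_hits / total, step_hits / total


# Hypothetical hand labels and tool predictions for four failed runs.
truth = {"r1": ("planner", 3), "r2": ("coder", 7), "r3": ("coder", 2), "r4": ("reviewer", 5)}
predicted = {"r1": ("planner", 3), "r2": ("coder", 4), "r3": ("coder", 1)}
agent_acc, step_acc = attribution_accuracy(predicted, truth)
```

In this toy data the tool names the right agent in three of four runs but the right step in only one, which is the shape of gap the paragraph above warns about.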

Don't-miss checklist¶

Evaluate output, components, and trajectory, not just the final answer.
Build rubrics from real failures via open coding into a taxonomy; do not guess metrics first.
Prefer binary PASS/FAIL rubrics; aggregate into a tracked pass rate.
Use a judge at least as strong as the tested model, grounded in evidence, and watch for injectability.
Gate every change on a fast eval set; run larger sets before release.
Hold the substrate fixed when measuring a harness change.

Failure modes¶

Output-only scoring. Luck passes and fragile reasoning ships.
Made-up metrics. Rubrics invented without error analysis miss the real failure modes.
Likert mush. Graded scores are inconsistent across runs and hard to act on.
Injectable judge. An adversarial input flips the LLM judge's verdict.
No gate. Regressions reach production because nothing blocks a bad change.
Moving substrate. A model or data change is read as a harness improvement.

Open questions & validation¶

Trajectory-quality metrics are immature; validate that they predict real-world success.
LLM-judge agreement with humans varies by task; calibrate it before trusting it.
Multi-agent failure attribution is unreliable; verify any automated localisation against ground truth.

References¶

Jimenez et al., SWE-bench: Can Language Models Resolve Real-World GitHub Issues: https://arxiv.org/abs/2310.06770
Xie et al., OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks: https://arxiv.org/abs/2404.07972
Mialon et al., GAIA: a Benchmark for General AI Assistants: https://arxiv.org/abs/2311.12983
Yao et al., tau-bench: A Benchmark for Tool-Agent-User Interaction: https://arxiv.org/abs/2406.12045
Liu et al., AgentBench: Evaluating LLMs as Agents: https://arxiv.org/abs/2308.03688
Zhang et al., AgenTracer: Who Is Inducing Failure in LLM Agentic Systems: https://arxiv.org/abs/2509.03312
OpenTelemetry GenAI semantic conventions: https://opentelemetry.io/docs/specs/semconv/gen-ai/
Song & Hur, Build an AI Agent (From Scratch), Manning Publications (MEAP), 2026.

1. Song & Hur (Ch 10): evaluate output, components, and trajectory to separate luck from sound reasoning; build datasets (benchmark, custom, production) across core, edge, and adversarial cases; derive rubrics by open coding 50-100 traces into a failure taxonomy; prefer binary PASS/FAIL; use a stronger judge grounded in an evidence pack; gate deployment in CI and run the agent quality flywheel continuously.
2. Jimenez et al., SWE-bench evaluates resolution of real GitHub issues.
3. Xie et al., OSWorld evaluates computer-use agents in a real desktop environment.
4. Mialon et al., GAIA evaluates general assistant tasks with unambiguous answers.
5. Yao et al., tau-bench evaluates tool-agent-user interaction.
6. Liu et al., AgentBench evaluates LLMs as agents across multiple environments.
7. Failure-attribution work (the Who&When study) shows step-level attribution scores far below agent-level.
8. Zhang et al., AgenTracer attributes failures in multi-agent systems and documents the difficulty of step-level localisation.