
Evaluating AI agents on cybersecurity tasks

Scope: measuring the security capabilities of AI agents, both offensive (find and reproduce vulnerabilities) and defensive (interpret threat intelligence and write detections), with benchmarks that are grounded in real systems, run in sandboxes, and scored on both final outcomes and the trajectory that produced them. This page covers what a cyber agent evaluation is, why dangerous-capability and detection-engineering evals matter, how they are built and scored (with a runnable scoring example), the frameworks that run them (Inspect Evals, AgentGym), and the operational discipline of running offensive evals safely. It is the security-capability companion to agent evaluation, offensive AI security, the agent security threat model, and sandboxing and isolation.

Security evaluation content for authorized capability assessment, red-teaming, and safety research. Benchmark numbers track 2025-2026 papers and move as models improve; verify against the current leaderboard. The Python example is executed and asserted (pure Python); framework snippets are reference templates.

flowchart LR
  TASK["Real task: a CVE codebase, or a CTI report + telemetry"] --> AGENT["Agent (model + tools) in a sandbox"]
  AGENT -->|"ReAct: reason + tool calls"| ENV["Environment: fuzzer / Kusto+KQL / MITRE ATT&CK"]
  ENV --> OUT["Outcome: PoC reproduces? detection F1?"]
  ENV --> TRAJ["Trajectory: were the right steps taken?"]
  OUT --> SCORE["Composite reward (outcome-weighted + trajectory)"]
  TRAJ --> SCORE
  SCORE --> TRACK["Track capability over models and time (unsaturated)"]

What it is

A cyber agent evaluation puts an agent (a model plus tools, run in a loop) into a sandboxed environment that mirrors a real security workflow and scores what it accomplishes. Two poles define the space, and recent benchmarks anchor each:

Offensive: vulnerability reproduction. CyberGym gives an agent a software project and a text description of a vulnerability, and tasks it with generating a proof-of-concept input that reproduces the bug. It is large-scale and real: 1,507 real-world vulnerabilities across 188 projects, and beyond static scoring it surfaced 34 zero-day vulnerabilities and 18 incomplete patches. Even the best model-agent combinations reach only about a 20% success rate, so it is far from saturated.¹
Defensive: detection engineering. CTI-RealM tasks an agent with reading cyber-threat-intelligence reports and writing detection rules, replicating a security analyst's workflow across Linux, cloud, and Azure Kubernetes Service, with a Kusto/KQL query engine over realistic telemetry, 37 vendor CTI reports, and MITRE ATT&CK mapping. Sixteen frontier models were evaluated; the top (Claude Opus 4.6) reached a 0.637 composite reward, and the benchmark remains unsaturated.²

Both are grounded (real vulnerabilities and real telemetry, not toy CTFs), sandboxed (the agent acts in an isolated container), and scored on both the outcome and the trajectory. They run on Inspect Evals, the UK AI Safety Institute's open evaluation suite, which builds on the Inspect framework (the inspect_ai package) and is the common substrate for these and many other capability tasks.³ AgentGym (14 environments, 89 tasks in a unified ReAct format) is the general-purpose counterpart: a broader agent-environment framework for building and training interactive agents, of which these security evals are a domain-specific specialization.⁴

Why use it

Dangerous-capability tracking is a safety requirement. As agents get better at finding and exploiting vulnerabilities, measuring that trajectory is how safety institutes and labs decide what is safe to release; a real-world-grounded benchmark like CyberGym is a capability tripwire, not just a leaderboard.¹
Static benchmarks miss the dynamic reality. Small, static, outcome-only benchmarks fail to capture the multi-step, tool-using nature of real security work; CyberGym and CTI-RealM measure the full dynamic loop, which is why they separate models that look similar on static tests.¹²
Detection engineering is labor-intensive and automatable. CTI-RealM shows agents can support the analyst workflow (read CTI, query telemetry, map to ATT&CK, write rules), and quantifies exactly how far they get, which is what a security team needs before trusting automation.²
The field is broad and moving. The LLM4Security survey (185 papers across vulnerability detection, malware analysis, and network intrusion detection) documents the shift from single-task LLM use to LLM-based autonomous agents orchestrating multi-step security workflows, which is exactly what these evals measure.⁵

When to use it (and when not)

Use it to assess a model or agent's real security capability before deployment or release, to red-team a defensive pipeline, or to track capability across model generations on an unsaturated benchmark.
Use the trajectory score, not just the outcome, when you care about how an agent reached a result (sound analysis versus a lucky guess), which is what distinguishes a trustworthy detection agent.
Do not run offensive capability evals outside a sandbox. Vulnerability-reproduction and exploit tasks must run in isolated, egress-controlled containers; the capability being measured is precisely what is dangerous if it escapes (sandboxing and isolation).
Do not treat a benchmark score as a safety guarantee. A ~20% success rate is a capability snapshot, not an assurance; contamination, prompt sensitivity, and the gap to real operational use all apply (evaluation integrity).

Architecture

The task pipeline is consistent across offensive and defensive evals. A real artifact (a CVE codebase, or a CTI report plus telemetry) seeds a task. The agent runs in a sandboxed container with a tool set (a fuzzer and compiler for CyberGym; a Kusto/KQL engine, CTI retrieval, and ATT&CK mapping for CTI-RealM) inside a ReAct loop with a message budget. A scorer then combines an outcome signal (whether the PoC reproduces the vulnerability, or the F1 of the detection rule) with a trajectory signal (whether the right intermediate steps were taken). Inspect Evals provides the harness (task definition, sandbox, tool wiring, scoring, and logging), so a new eval is a task spec rather than a bespoke runner.
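
The loop-with-a-budget part of this pipeline can be sketched in plain Python. This is an illustration only, not the Inspect API: `Episode`, `run_episode`, the two toy policies, and the `query` tool are hypothetical names, and only the 70-message default is taken from the CTI-RealM description. A scorer would consume the returned `Episode` (the transcript for trajectory checks, the answer for the outcome).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Message = Tuple[str, str]                                  # (role, content)
Policy = Callable[[List[Message]], Tuple[str, str, str]]   # -> (kind, name_or_answer, arg)

@dataclass
class Episode:
    messages: List[Message] = field(default_factory=list)
    answer: Optional[str] = None                           # final submission, if any

def run_episode(policy: Policy, tools: Dict[str, Callable[[str], str]],
                task: str, max_messages: int = 70) -> Episode:
    """Hypothetical ReAct-style loop: act, observe, repeat until submit or budget runs out."""
    ep = Episode(messages=[("user", task)])
    while len(ep.messages) < max_messages:
        kind, name, arg = policy(ep.messages)
        if kind == "submit":
            ep.messages.append(("assistant", "submit: " + name))
            ep.answer = name
            break
        ep.messages.append(("assistant", f"call {name}({arg!r})"))
        if len(ep.messages) >= max_messages:               # no room left for the observation
            break
        observation = tools[name](arg) if name in tools else f"unknown tool: {name}"
        ep.messages.append(("tool", observation))
    return ep

def stub_policy(messages: List[Message]) -> Tuple[str, str, str]:
    # Toy policy: query once, then submit whatever the tool returned.
    if messages[-1][0] == "tool":
        return ("submit", messages[-1][1], "")
    return ("tool", "query", "failed logins")

def looping_policy(messages: List[Message]) -> Tuple[str, str, str]:
    return ("tool", "query", "again")                      # never submits; the budget must stop it

tools = {"query": lambda q: f"rows for {q}"}
done = run_episode(stub_policy, tools, "write a detection")
stuck = run_episode(looping_policy, tools, "write a detection", max_messages=6)
print(done.answer, len(done.messages), stuck.answer, len(stuck.messages))
```

The budget check sits both at the top of the loop and before the observation is appended, so the transcript never exceeds `max_messages`; a run that hits the cap simply ends with no answer and scores as a failed outcome.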

How to use it

The scoring is the load-bearing design choice, and it is small enough to validate. Offensive evals score a binary success (did the PoC reproduce the bug). If each attempt succeeds independently with probability p, the chance of at least one success in k attempts is success@k = 1 - (1 - p)^k, which is why evaluating at scale surfaces latent capability. Defensive evals like CTI-RealM use a weighted composite of a trajectory reward and an outcome reward. This runnable block asserts both, including the adversarial cases (out-of-range components rejected, outcome dominating a good-looking-but-wrong trajectory):

# cyber_eval_scoring.py — validated: success@k for reproduction tasks, and the CTI-RealM-style
# composite (trajectory + outcome) reward. Pure Python, stdlib only.

def success_at_k(p, k):
    return 1 - (1 - p) ** k                            # >=1 success in k independent attempts

assert success_at_k(0.0, 100) == 0.0                  # no capability -> never succeeds
assert abs(success_at_k(0.2, 1) - 0.2) < 1e-12        # single attempt == per-attempt rate
assert success_at_k(0.2, 5) > success_at_k(0.2, 1)    # monotone: more attempts surface capability
assert abs(success_at_k(0.2, 5) - (1 - 0.8 ** 5)) < 1e-12   # ~0.672 at CyberGym's ~20% top rate

def composite(traj, outcome, w_traj=0.35, w_out=0.65):
    if not (0 <= traj <= 1 and 0 <= outcome <= 1):    # explicit raise: still enforced under `python -O`
        raise ValueError("reward components must be in [0,1]")
    return w_traj * traj + w_out * outcome            # CTI-RealM: 35% trajectory, 65% outcome

assert abs(composite(1, 1) - 1.0) < 1e-12 and composite(0, 0) == 0.0     # bounds
assert composite(1.0, 0.0) == 0.35                    # a perfect trajectory with a WRONG rule still scores low
assert composite(0.8, 0.0) > composite(0.1, 0.0)      # ...but trajectory separates sound vs unsound analysis
raised = False
try:
    composite(1.5, 0.2)                               # adversarial: out-of-range must raise, not fake a score
except ValueError:
    raised = True
assert raised
print(f"success@k(0.2): k=1 {success_at_k(0.2,1):.2f}  k=5 {success_at_k(0.2,5):.3f}  k=20 {success_at_k(0.2,20):.3f}")
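
Inverting the same success@k formula gives a rough attempt budget: the smallest k that reaches a target probability of at least one success. This is a minimal sketch under the same assumptions (independent attempts, fixed per-attempt rate p); `attempts_needed` is a hypothetical helper, not part of any benchmark.

```python
import math

def attempts_needed(p: float, target: float) -> int:
    """Smallest k with 1 - (1 - p)**k >= target, for 0 < p <= 1 and 0 <= target < 1."""
    if not (0 < p <= 1) or not (0 <= target < 1):
        raise ValueError("need 0 < p <= 1 and 0 <= target < 1")
    if target == 0:
        return 0
    if p == 1:
        return 1
    k = math.ceil(math.log(1 - target) / math.log(1 - p))
    # Guard against floating-point rounding at the boundary, in both directions.
    while 1 - (1 - p) ** k < target:
        k += 1
    while k > 1 and 1 - (1 - p) ** (k - 1) >= target:
        k -= 1
    return k

print(attempts_needed(0.2, 0.5), attempts_needed(0.2, 0.9))   # 4 and 11 attempts
```

Real attempts by the same agent on the same task are rarely independent, so the result is better read as an optimistic lower bound on the attempts required than as a prediction.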

Running one is a task invocation on Inspect Evals (reference template; pin the suite and a sandboxed model config):

# Inspect Evals (UK AISI). Run inside an isolated, egress-controlled sandbox for offensive tasks.
pip install inspect_ai inspect_evals
inspect eval inspect_evals/cybergym --model <provider/model> --limit 50
inspect eval inspect_evals/cti_realm --model <provider/model>   # 25/50-sample and no-tools variants exist

How to develop with it

Building or extending a cyber eval means choosing the artifact, the tools, and the reward. Artifacts must be real: CyberGym draws from real vulnerable projects, and CTI-RealM from 37 vendor CTI reports plus MITRE ATT&CK and realistic telemetry, because synthetic tasks do not predict real capability. Tools define the ceiling: CTI-RealM's ablation shows that CTI-specific tools significantly improve agent performance, so the tool set is part of the measurement, not neutral plumbing (see tools and function calling). Reward design decides what you measure: an outcome-only score rewards luck, so pair it with trajectory checkpoints (CTI analysis identification, MITRE technique accuracy by Jaccard similarity, query refinement), as CTI-RealM does at a 35/65 trajectory/outcome split. Keep a ReAct message budget (CTI-RealM caps at 70 messages per task) so runaway loops do not dominate cost. Note also the counterintuitive finding that medium reasoning effort can beat high on these tasks, which is itself worth measuring per model.²
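
Two of the reward components named above are plain set computations. The sketch below shows one plausible way to compute them. CTI-RealM's actual scorer may differ in detail (for example, in how KQL results are matched and how the Sigma-rule judgment is folded in), the technique IDs and row IDs are invented for illustration, and a single checkpoint stands in for the full trajectory score to keep it short.

```python
from typing import Hashable, Set

def jaccard(predicted: Set[Hashable], gold: Set[Hashable]) -> float:
    """|A ∩ B| / |A ∪ B|; defined here as 1.0 when both sets are empty."""
    union = predicted | gold
    return len(predicted & gold) / len(union) if union else 1.0

def f1(returned: Set[Hashable], relevant: Set[Hashable]) -> float:
    """F1 of a detection query: rows it returned vs. rows labelled malicious."""
    tp = len(returned & relevant)
    if tp == 0:
        return 0.0
    precision = tp / len(returned)
    recall = tp / len(relevant)
    return 2 * precision * recall / (precision + recall)

# Illustrative inputs (not from the benchmark).
mitre = jaccard({"T1059", "T1078"}, {"T1059", "T1078", "T1110"})   # 2 of 3 techniques matched
detect = f1({1, 2, 3, 4}, {2, 3, 4, 5})                             # 3 true positives
reward = 0.35 * mitre + 0.65 * detect                               # 35/65 trajectory/outcome split
print(round(mitre, 3), round(detect, 3), round(reward, 3))
```

Scoring MITRE techniques by Jaccard rather than exact match gives partial credit for partial overlap, and it penalizes over-prediction, since every extra technique enlarges the union.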

How to maintain it

Treat these benchmarks as living and currently unsaturated: top scores are ~20% (CyberGym) and 0.637 (CTI-RealM), so the headroom is real. As models improve, refresh tasks and guard against contamination, because a vulnerability or CTI report that leaked into training data measures memorization, not capability (see evaluation integrity and training-data curation). Run a variance analysis across repeated runs (CTI-RealM reports result stability across seeds) so a one-off score is not mistaken for a trend, and re-run when the model pool or the tool set changes, since both move the number. A memory-augmentation study on CTI-RealM found that seeded context closes about 33% of the gap between smaller and larger models, so hold the context and tool substrate fixed when comparing models; otherwise the comparison is confounded.²
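
A minimal sketch of the variance check, assuming each run yields one aggregate score per seed. The scores below are invented for illustration, and `summarize` and `clearly_different` are hypothetical helpers. The comparison treats two models as separated only when their mean ± 2 standard-error intervals do not overlap, which is a deliberately conservative heuristic, not a formal significance test.

```python
import math
import statistics
from typing import Sequence, Tuple

def summarize(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, standard error of the mean) for repeated runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variance")
    mean = statistics.mean(scores)
    sem = statistics.stdev(scores) / math.sqrt(len(scores))   # sample stdev (n - 1)
    return mean, sem

def clearly_different(a: Sequence[float], b: Sequence[float], z: float = 2.0) -> bool:
    """True when the mean +/- z*SEM intervals of the two models do not overlap."""
    mean_a, sem_a = summarize(a)
    mean_b, sem_b = summarize(b)
    return abs(mean_a - mean_b) > z * (sem_a + sem_b)

model_a = [0.61, 0.64, 0.63, 0.62, 0.65]   # invented per-seed composite scores
model_b = [0.60, 0.66, 0.58, 0.64, 0.62]
model_c = [0.41, 0.43, 0.40, 0.44, 0.42]
print(clearly_different(model_a, model_b), clearly_different(model_a, model_c))
```

With these invented numbers, models A and B differ by 0.01 in mean but are not separable across seeds, while A and C are; a single run of each would have hidden that distinction.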

How to run it in production

Two production concerns dominate, both security-critical. First, run offensive evals in a hardened sandbox. Vulnerability-reproduction and exploit-generation tasks execute untrusted, agent-generated code against real vulnerable software, so they belong in isolated, egress-controlled containers with least privilege: the controls described in sandboxing and isolation and in security and multi-tenancy. The capability you are measuring is the capability that is dangerous if it escapes. Second, the evaluation has a runtime-defense mirror image: the threats these evals probe from the agent security threat model (prompt injection from retrieved content, tool misuse, data exfiltration) are the ones production agent-security controls must defend against. Industry runtime-defense approaches frame this as an "agentic edge" zero-trust control plane with policy enforcement points, a prompt-injection classifier, a content-safety classifier, secure model routing, and egress/DLP controls, positioned as enterprise-grade beyond framework guardrails.⁶ The eval measures the offense; the agent security threat model, prompt-injection defense, and agent policy engine are where the defense is engineered.
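
As one concrete reading of "isolated, egress-controlled, least privilege", the sketch below assembles a `docker run` command line from standard Docker CLI flags. It is an assumption about how a harness might launch a task container, not a description of how Inspect Evals or CyberGym does it. The image name and command are placeholders, and `--network none` assumes the model API is called by the harness from outside the container.

```python
import shlex
from typing import List

def sandbox_argv(image: str, command: List[str], memory: str = "2g",
                 pids: int = 256) -> List[str]:
    """Build a locked-down `docker run` invocation (hypothetical helper)."""
    return [
        "docker", "run", "--rm",
        "--network", "none",                      # no network, so no egress from the task container
        "--cap-drop", "ALL",                      # drop every Linux capability
        "--security-opt", "no-new-privileges",    # block privilege gain via setuid binaries
        "--read-only",                            # immutable root filesystem
        "--tmpfs", "/tmp:rw,nosuid,size=256m",    # bounded scratch space only
        "--pids-limit", str(pids),                # contain fork bombs
        "--memory", memory,                       # cap memory use
        "--user", "65534:65534",                  # run as an unprivileged UID/GID, not root
        image, *command,
    ]

# Placeholder image and entry point, for illustration only.
argv = sandbox_argv("example/cyber-task:pinned", ["python", "/task/run_poc.py"])
print(shlex.join(argv))
```

Building the command as an argument list rather than a shell string keeps agent-influenced values (such as a task ID) from being interpreted by a shell on the host.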

Failure modes

Running offensive evals unsandboxed. Agent-generated exploit code against real vulnerabilities outside an isolated container is a live-fire hazard; sandbox with egress control, always.
Outcome-only scoring. A binary "did it work" rewards lucky guesses and hides unsound reasoning; pair outcomes with trajectory checkpoints.
Contaminated tasks. A CVE or CTI report in the training set inflates the score; check for leakage and refresh tasks.
Reading a score as assurance. A ~20% success rate is a capability snapshot, not a safety guarantee and not an upper bound on real-world impact.
Confounded model comparison. Changing tools, context, or seeds between models mixes substrate effects with capability; hold the substrate fixed (the fixed-substrate rule from agent evaluation).
Ignoring the defensive mirror. Measuring offensive capability without hardening the production agent against the same threats leaves the deployed system exposed.
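
The fixed-substrate rule from the list above can be checked mechanically before two scores are compared. This is a minimal sketch, assuming each run records its tools, message budget, seeds, and context settings in a plain dict; the field names and the `substrate_fingerprint` and `comparable` helpers are hypothetical, not part of any framework.

```python
import hashlib
import json
from typing import Any, Dict

def substrate_fingerprint(config: Dict[str, Any]) -> str:
    """Hash everything except the model name, so equal hashes mean an equal substrate."""
    substrate = {k: v for k, v in config.items() if k != "model"}
    canonical = json.dumps(substrate, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def comparable(run_a: Dict[str, Any], run_b: Dict[str, Any]) -> bool:
    """Two runs are comparable only if they differ in nothing but the model."""
    return substrate_fingerprint(run_a) == substrate_fingerprint(run_b)

base = {"model": "model-a", "tools": ["kql", "cti_search"], "max_messages": 70,
        "seeds": [0, 1, 2], "seeded_memory": False}
same_substrate = dict(base, model="model-b")                        # only the model changes
extra_context = dict(base, model="model-b", seeded_memory=True)     # context changes too
print(comparable(base, same_substrate), comparable(base, extra_context))   # True False
```

Storing the fingerprint next to each score makes a confounded comparison visible when results are read, not only when they are produced.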

References

Wang, Shi, He, Cai, Zhang, Song, CyberGym: Evaluating AI Agents' Real-World Cybersecurity Capabilities at Scale (arXiv 2506.02548): https://arxiv.org/abs/2506.02548
CTI-RealM: Benchmark to Evaluate Agent Performance on Security Detection Rule Generation (arXiv 2603.13517): https://arxiv.org/abs/2603.13517
Inspect Evals (UK AI Safety Institute evaluation suite): https://github.com/UKGovernmentBEIS/inspect_evals
Inspect Evals, CTI-RealM eval page: https://ukgovernmentbeis.github.io/inspect_evals/evals/cti_realm/
Xi et al., AgentGym: Evolving LLM-based Agents across Diverse Environments (arXiv 2406.04151): https://arxiv.org/abs/2406.04151
Xu et al., Large Language Models for Cyber Security: A Systematic Literature Review (LLM4Security, arXiv 2405.04760): https://arxiv.org/abs/2405.04760
CyberGym project site: https://www.cybergym.io/
Wedge Networks, WedgeSecure Agent whitepaper (runtime agentic-edge defense): https://www.wedgenetworks.com/assets/White-Papers/WedgeSecure-Agent-Whitepaper-Feb-17-2026.pdf

¹ CyberGym (arXiv 2506.02548), Wang et al.: 1,507 real-world vulnerabilities across 188 projects; the primary task gives an agent a codebase and a text vulnerability description and asks for a proof-of-concept that reproduces the bug; top model-agent combinations reach only ~20% success; beyond static scoring it surfaced 34 zero-day vulnerabilities and 18 historically incomplete patches, making it both a benchmark and a real-world-impact platform.
² CTI-RealM (arXiv 2603.13517): evaluates agents on interpreting CTI and generating detection rules across Linux, cloud, and AKS with a Docker Kusto/KQL engine over telemetry from 12 log sources, 37 vendor CTI reports, and MITRE ATT&CK; ReAct loop capped at 70 messages; composite reward = 35% trajectory checkpoints (CTI analysis, MITRE technique Jaccard, data-source discovery, query refinement) + 65% outcome (KQL F1 plus LLM-judged Sigma rule quality); 16 frontier models, top Claude Opus 4.6 at 0.637, Linux (0.585) > AKS (0.517) > Cloud (0.282), unsaturated; CTI-specific tools help significantly, medium reasoning can beat high, and seeded memory closes ~33% of the small-vs-large gap.
³ Inspect Evals is the UK AI Safety Institute's open evaluation framework; it provides the task, sandbox, tool-wiring, scoring, and logging harness that hosts CyberGym, CTI-RealM, and many other capability evals, so a new eval is a task specification rather than a bespoke runner.
⁴ AgentGym (arXiv 2406.04151), Xi et al.: a framework of 14 interactive environments and 89 tasks in a unified ReAct format spanning web, embodied, tool-use, and programming tasks, for evaluating and training generally-capable LLM agents; the general-purpose sibling that domain evals like the cyber benchmarks specialize.
⁵ LLM4Security (arXiv 2405.04760), Xu et al.: a systematic literature review of 185 papers (from 40K+ collected) on LLMs in cybersecurity, spanning vulnerability detection, malware analysis, and network intrusion detection, and documenting the emerging shift from single-task LLM use to LLM-based autonomous agents orchestrating multi-step security workflows.
⁶ Wedge Networks, WedgeSecure Agent whitepaper (2026): frames production agent security as an "agentic edge" zero-trust control plane with policy enforcement points, a prompt-injection classifier, a content-safety classifier, secure model routing, and network/egress/DLP controls, positioned as enterprise-grade beyond best-effort framework guardrails (e.g. NeMo Guardrails, LangChain).