
Offensive AI and the security arms race

Scope: how AI changes the balance between attackers and defenders in software security, and what that forces on anyone operating agentic systems. Where the threat model covers the agent's defensive attack surface, this page characterizes offensive capability (what AI makes cheap for attackers and what stays hard) and the defensive posture it requires. The framings here are analytical; specific capability figures move fast and should be read as direction, not fixed benchmarks.

This page is defensive characterization, not an exploitation guide. Capability claims are time-sensitive; verify against current sources.

flowchart TB
  subgraph OFF["Offense"]
    DISC["Discovery (find the bug)<br/>oracle: crash / sanitizer -> commoditizing"]
    CONS["Construction (working exploit)<br/>no oracle -> frontier-gated"]
  end
  DISC --> ASYM["Cost/speed asymmetry"]
  CONS --> ASYM
  ASYM --> DEF["Defense: continuous discovery,<br/>sanitizer-backed CI, evidence-ladder gating"]

Overview

AI now sits on both sides of the security loop. The same models that help defenders audit code help attackers find and weaponize flaws, and the discovery tooling is symmetric: defenders and attackers run the same scanners. The useful question is not "is AI a threat" but "which parts of the attack get cheaper, and which stay hard," because the answer tells you where to invest defensively.¹

Core knowledge

Discovery versus construction

Autonomous cyber-offense splits into two capabilities with very different economics:¹

Discovery (finding a vulnerability) has a ground-truth oracle: a crash, a failing sanitizer, a reproducer. Because success is checkable, it is well suited to automation and is commoditizing quickly. Defenders own the same oracle, so this is a capability both sides gain.
Construction (turning a flaw into a reliable, mitigation-evading exploit) has no clean oracle: it is long-horizon adversarial planning where partial progress is hard to score. This is where a capability gap between frontier and commodity systems persists.

The practical reading: assume attackers can find bugs cheaply (so can you), and that weaponization is harder but improving. Do not assume the gap is permanent.

Capability scales with inference budget

Autonomous cyber-task performance keeps improving as more inference budget is spent on a task, and evaluations have observed this scaling continuing to very large token budgets without a clear plateau.² The implication is that offensive capability is set partly by spend, not only by model choice, so a defender cannot assume a fixed ceiling tied to "the best model today." Budgeted, persistent agents are the threat to plan against.

The cost and speed asymmetry

Offense and defense run at different speeds and costs. An offensive attempt is a model run (cheap, fast, parallelizable). The defensive response is a human-paced pipeline: triage takes hours to days, patching plus change approval takes weeks, and fleet rollout takes longer still. That asymmetry inverts the old assumption that defenders have a comfortable window after disclosure, and it is the core reason security has to shift left and automate.¹

The agent itself is the attack surface

For agentic systems, the most likely breach vector is the organization's own hijacked agent. Real incidents show the pattern: a zero-click data exfiltration through an assistant's own context (EchoLeak, CVE-2025-32711), and a prompt injection that writes an agent's tool-configuration file to gain code execution (Cursor, CVE-2025-54135). The agent's tools, config files, and memory are the surface, which is why sandboxing, the policy gate, and prompt-injection defense matter more than perimeter controls.³
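
One way to treat the agent's own configuration as part of the attack surface is a policy gate that refuses agent-initiated writes to files that configure its tools or memory. The sketch below is illustrative only: `PROTECTED_PATTERNS`, `is_write_allowed`, and the `.agent/` layout are hypothetical names chosen for this example, not any product's API, and a real gate would also need to handle symlink races and approval flows.

```python
import fnmatch
import os.path

# Hypothetical patterns for files that configure the agent's tools or memory.
# Note that fnmatch's "*" also matches "/" so ".agent/*" covers nested paths.
PROTECTED_PATTERNS = [
    ".agent/*",
    "mcp.json",
    "*/mcp.json",
]


def is_write_allowed(workspace: str, requested_path: str) -> bool:
    """Return True only if the write stays inside the workspace and
    does not touch a protected configuration path."""
    root = os.path.realpath(workspace)
    target = os.path.realpath(os.path.join(root, requested_path))
    # Reject anything that resolves outside the workspace (for example via "..").
    if os.path.commonpath([root, target]) != root:
        return False
    relative = os.path.relpath(target, root).replace(os.sep, "/")
    return not any(fnmatch.fnmatch(relative, p) for p in PROTECTED_PATTERNS)
```

The check resolves the path before matching, so a request such as `src/../.agent/tools.json` is judged on where it lands, not on how it is spelled.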

Monitoring the agent is not a guarantee

Defense cannot rely on watching an agent explain itself. Chain-of-thought and tool-call traces are an imperfect safety signal: models can behave differently when they detect evaluation, and a trace can be poisoned so a monitor misreads a malicious trajectory as safe (the agent-as-a-proxy attack, see prompt-injection defense). Treat monitoring as one noisy signal, not proof of safety.⁴
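
A minimal sketch of what "one noisy signal" can mean in practice: the monitor's score can escalate an action for review, but it can never approve an action that a hard policy gate rejects. `Decision`, `decide`, and the `escalate_at` threshold are hypothetical names and values chosen for illustration.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    ESCALATE = "escalate"
    DENY = "deny"


def decide(policy_allows: bool, monitor_risk: float, escalate_at: float = 0.5) -> Decision:
    """Combine a hard policy gate with a noisy monitor score in [0, 1].

    The monitor can only make the outcome stricter: a low risk score never
    overrides a policy denial, because the trace it scored may be poisoned.
    """
    if not policy_allows:
        return Decision.DENY
    if monitor_risk >= escalate_at:
        return Decision.ESCALATE
    return Decision.ALLOW
```

The design choice is the asymmetry: a clean-looking trace earns nothing on its own, while a suspicious one still costs the agent its autonomy for that step.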

The defensive response

The posture that follows is continuous, automated, and oracle-anchored:¹

Run discovery yourself, continuously. Use the same automated scanners attackers use, against your own code, on every change.
Anchor on a real oracle. Sanitizer-backed CI (ASan, UBSan) gives a ground-truth signal that a finding is real, which an LLM verdict alone does not.
Use an evidence ladder. Promote a finding through stages (suspicion, static corroboration, crash reproduced, root cause explained, exploit demonstrated, patch validated), and do not let the model self-certify past static corroboration; the higher rungs require an external artifact.
Gate before deploy. Put the gate in the pipeline with no human on the critical path, because the human-paced loop is exactly what the asymmetry defeats.
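
The ladder and the oracle above can be sketched together, as one possible shape for such a gate and not a reference implementation. `Rung`, `Finding`, `promote`, `sanitizer_fired`, and `blocks_deploy` are hypothetical names; the only outside facts assumed are the marker strings that ASan and UBSan print in their reports, and matching on them is a deliberately crude stand-in for parsing real CI output.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Optional


class Rung(IntEnum):
    SUSPICION = 0
    STATIC_CORROBORATION = 1
    CRASH_REPRODUCED = 2
    ROOT_CAUSE_EXPLAINED = 3
    EXPLOIT_DEMONSTRATED = 4
    PATCH_VALIDATED = 5


# Substrings that ASan and UBSan print in their reports.
SANITIZER_MARKERS = ("ERROR: AddressSanitizer:", "runtime error:")


def sanitizer_fired(stderr: str) -> bool:
    """Crude oracle: did the reproducer run produce a sanitizer report?"""
    return any(marker in stderr for marker in SANITIZER_MARKERS)


@dataclass
class Finding:
    title: str
    rung: Rung = Rung.SUSPICION
    artifacts: dict = field(default_factory=dict)  # rung -> external evidence


def promote(finding: Finding, artifact: Optional[str] = None) -> Finding:
    """Move a finding up one rung.

    Rungs above STATIC_CORROBORATION need an external artifact; for
    CRASH_REPRODUCED the artifact must contain a sanitizer report.
    """
    if finding.rung is Rung.PATCH_VALIDATED:
        raise ValueError("already at the top rung")
    target = Rung(finding.rung + 1)
    if target > Rung.STATIC_CORROBORATION:
        if not artifact:
            raise PermissionError(f"{target.name} requires an external artifact")
        if target is Rung.CRASH_REPRODUCED and not sanitizer_fired(artifact):
            raise PermissionError("no sanitizer report in reproducer output")
        finding.artifacts[target] = artifact
    finding.rung = target
    return finding


def blocks_deploy(finding: Finding) -> bool:
    """Gate rule (an assumption): confirmed but unpatched findings block release."""
    return Rung.CRASH_REPRODUCED <= finding.rung < Rung.PATCH_VALIDATED
```

In this sketch a model can move a finding from suspicion to static corroboration on its own say-so, but the next call to `promote` fails unless it is handed reproducer output that a sanitizer actually flagged.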

Model-level defenses help too: classifier-based jailbreak defenses such as Constitutional Classifiers substantially reduce attack success, though at some compute and false-refusal cost.⁵

Don't-miss checklist

Assume cheap bug discovery on both sides; invest defensively in finding your own bugs first.
Plan against budgeted, persistent agents, not a fixed capability ceiling.
Treat the organization's own agent as the likely breach vector; harden its tools, config, and memory.
Anchor findings on a sanitizer oracle; never let an LLM verdict be the final word past static corroboration.
Automate the pre-deploy gate; the human-paced patch loop loses to the speed asymmetry.
Treat agent self-reports and CoT as noisy signals, not proof of safety.

Failure modes

Patch-window thinking. The security process assumes a comfortable window between disclosure and exploitation; automated discovery has closed it.
LLM-as-oracle. A finding is accepted on a model's say-so with no reproducer; false positives and misses follow.
Perimeter focus. Controls guard the network while the hijacked agent acts from inside.
Trust in the trace. Monitoring is treated as proof; a poisoned trace or eval-aware behaviour slips through.
Fixed-ceiling assumption. Defenses are sized to today's model; a higher-budget agent exceeds them.

Open questions & validation

Where the discovery-construction gap sits for a given target class is empirical; measure it, do not assume it.
Capability-scaling curves are evolving; re-check current evaluations rather than trusting a cached figure.
The reliability of CoT and trace monitoring under adaptive attack is an open research problem.

References

UK AI Safety Institute, How fast is autonomous AI cyber capability advancing: https://www.aisi.gov.uk/blog/how-fast-is-autonomous-ai-cyber-capability-advancing
CVE-2025-32711 (EchoLeak, M365 Copilot zero-click exfiltration): https://nvd.nist.gov/vuln/detail/CVE-2025-32711
CVE-2025-54135 (Cursor prompt-injection to RCE via tool config): https://nvd.nist.gov/vuln/detail/CVE-2025-54135
Zou et al., Universal and Transferable Adversarial Attacks on Aligned LMs: https://arxiv.org/abs/2307.15043
Anthropic, Constitutional Classifiers: https://arxiv.org/abs/2501.18837
OWASP Top 10 for LLM Applications: https://genai.owasp.org/llm-top-10/
MITRE ATLAS: https://atlas.mitre.org/

¹ Offensive capability splits into discovery (finding a flaw, which has a ground-truth oracle and is commoditizing) and construction (a reliable mitigation-evading exploit, which lacks an oracle and stays frontier-gated). Offense runs at model speed and cost; defense runs at human-paced triage, patch, and rollout, an asymmetry that forces continuous, automated, oracle-anchored defense.
² The UK AI Safety Institute reports autonomous cyber-task capability improving with inference budget, with scaling observed to very large token budgets without a clear plateau.
³ EchoLeak (CVE-2025-32711) was a zero-click exfiltration via an assistant's own context; the Cursor flaw (CVE-2025-54135) let a prompt injection write the agent's tool-configuration file to gain code execution. The agent's own tools, config, and memory are the surface.
⁴ Chain-of-thought and tool-call traces are imperfect safety signals: behaviour can change under evaluation awareness, and traces can be poisoned to blind a monitor (the agent-as-a-proxy attack).
⁵ Anthropic's Constitutional Classifiers reduce jailbreak attack success substantially, at some additional compute and false-refusal cost.