
Agent security threat model

Scope: the attack surface that opens when a model stops answering and starts acting. A chatbot that only emits text is low-risk; an agent that reads private data, executes code, calls APIs, and moves money converts a successful prompt injection into real-world consequences. This page frames the threat (what an attacker gains, where trust boundaries sit, what vocabulary to use) and routes to the defenses: prompt-injection defense, sandboxing, and policy and governance.

This page describes attack classes and defensive framing; it is not an exploitation guide.

flowchart TB
  subgraph TRIFECTA["The lethal trifecta"]
    PRIV["Access to private data"]
    UNTRUSTED["Exposure to untrusted content"]
    EXFIL["Ability to communicate externally"]
  end
  PRIV --> RISK["Injection becomes exfiltration / action"]
  UNTRUSTED --> RISK
  EXFIL --> RISK
  RISK --> IMPACT["Data loss, fraud, lateral movement"]

Overview

The security story of a language model is mostly about its output. The security story of an agent is about its actions. Agency is the multiplier: the same model weakness (it will follow instructions hidden in its input) is a curiosity in a chatbot and a breach in an agent that can read a mailbox and send mail. The threat model therefore tracks capability, not cleverness. Enumerate what the agent can do and what each input it trusts can reach, and the high-risk paths fall out.¹

Core knowledge

The lethal trifecta

Most damaging agent exploits need three properties present at once: access to private data, exposure to untrusted content, and the ability to communicate externally. With all three, an attacker who plants instructions in the untrusted content can make the agent read private data and exfiltrate it. Removing any one leg breaks the chain, which is why least privilege and channel restrictions matter more than any single filter.² EchoLeak (CVE-2025-32711), a zero-click exfiltration of Microsoft 365 Copilot data, is the canonical instance: a crafted email caused the assistant to leak context, bypassing several prompt-side defenses.³
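As a rough illustration of the "remove any one leg" rule, the sketch below audits a capability summary and reports whether all three legs coexist. The `AgentCapabilities` type, its field names, and the two example agents are hypothetical and not taken from any framework; a real audit would derive these flags from the agent's actual tool and data grants.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentCapabilities:
    """Hypothetical capability summary for one agent deployment."""
    reads_private_data: bool
    sees_untrusted_content: bool
    can_communicate_externally: bool


def trifecta_legs(caps: AgentCapabilities) -> list[str]:
    """Return the names of the trifecta legs this agent currently holds."""
    legs = {
        "private data": caps.reads_private_data,
        "untrusted content": caps.sees_untrusted_content,
        "external channel": caps.can_communicate_externally,
    }
    return [name for name, present in legs.items() if present]


def is_lethal(caps: AgentCapabilities) -> bool:
    """True only when all three legs are present at once."""
    return len(trifecta_legs(caps)) == 3


# Assumed examples: a mail assistant that can send, and one with sending removed.
mail_agent = AgentCapabilities(True, True, True)
no_egress_agent = AgentCapabilities(True, True, False)

print(is_lethal(mail_agent))       # True
print(is_lethal(no_egress_agent))  # False
```

Dropping the outbound channel is only one of three ways to make the check pass; the same function also returns `False` if private-data access or untrusted input is removed instead.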

Identity and intent vanish at the same hop

Two things disappear when an agent calls a backend: who asked (identity) and what they wanted (intent). Both flatten into a generic service-account API call before the backend sees them, so the backend cannot enforce per-user authority or confirm the action matches the user's request. This is the structural root of OWASP's "excessive agency": the agent acts with more privilege, and less attributable intent, than the human behind it. Restoring identity and intent is the job of policy and governance.¹
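One way to picture "restoring identity and intent" is to carry both fields on every backend call and check them there. The sketch below is a minimal illustration under that assumption; `AttributedCall`, the permission table, and the intent-to-action table are all hypothetical names and data, not an existing API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttributedCall:
    """Hypothetical backend request that keeps identity and intent attached."""
    user_id: str  # who asked
    intent: str   # what the user asked for
    action: str   # what the agent is about to do


# Assumed example data: per-user permissions and the actions each intent justifies.
USER_PERMISSIONS = {
    "alice": {"read_mail", "send_mail"},
    "bob": {"read_mail"},
}
INTENT_ACTIONS = {
    "summarise_inbox": {"read_mail"},
    "reply_to_thread": {"read_mail", "send_mail"},
}


def authorize(call: AttributedCall) -> tuple[bool, str]:
    """Check per-user authority first, then that the action fits the stated intent."""
    if call.action not in USER_PERMISSIONS.get(call.user_id, set()):
        return False, "user lacks permission"
    if call.action not in INTENT_ACTIONS.get(call.intent, set()):
        return False, "action does not match stated intent"
    return True, "ok"


# An injected "send mail" during a summarisation request is refused,
# even though alice herself is allowed to send mail.
print(authorize(AttributedCall("alice", "summarise_inbox", "send_mail")))
```

With a flattened service-account call, neither check is possible because the backend never receives `user_id` or `intent`.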

Shared vocabulary: use the taxonomies

Three reference frameworks give a common language and a checklist:

OWASP Top 10 for LLM Applications names the application-level risks: prompt injection (LLM01), sensitive-information disclosure (LLM02), excessive agency (LLM06), and unbounded consumption (LLM10) are the ones agents hit hardest.⁴
MITRE ATLAS catalogues adversary tactics and techniques against AI systems, the ATT&CK analogue for ML.⁵
NIST AI RMF frames governance at the organisational level; NIST's companion adversarial-ML taxonomy (NIST AI 100-2) names the attack classes (evasion, poisoning, privacy, abuse).⁶

Map the trust boundaries

Every input the agent treats as instructions is a trust boundary. The system prompt is trusted; the user turn is semi-trusted; retrieved documents, tool outputs, web pages, and prior-agent messages are untrusted and are the usual injection carriers. The model cannot reliably tell instructions from data in a single token stream, so the boundary has to be enforced outside the model, by provenance tracking (observability), content handling (prompt-injection defense), and a policy gate (control plane).¹
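The idea of enforcing the boundary outside the model can be sketched as provenance labels plus a policy gate that runs before any tool call. Everything named here (`Trust`, `SOURCE_TRUST`, `ContextItem`, `HIGH_RISK_TOOLS`, the `"require_approval"` outcome) is a hypothetical illustration; the trust levels mirror the ones listed above.

```python
from dataclasses import dataclass
from enum import IntEnum


class Trust(IntEnum):
    UNTRUSTED = 0
    SEMI_TRUSTED = 1
    TRUSTED = 2


# Assumed mapping from input source to trust level, following the text above.
SOURCE_TRUST = {
    "system_prompt": Trust.TRUSTED,
    "user_turn": Trust.SEMI_TRUSTED,
    "retrieved_document": Trust.UNTRUSTED,
    "tool_output": Trust.UNTRUSTED,
    "web_page": Trust.UNTRUSTED,
    "agent_message": Trust.UNTRUSTED,
}

# Hypothetical tools whose misuse would cause real-world harm.
HIGH_RISK_TOOLS = {"send_email", "transfer_funds"}


@dataclass(frozen=True)
class ContextItem:
    source: str  # provenance label attached when the item enters the context
    text: str


def context_trust(items: list[ContextItem]) -> Trust:
    """Lowest trust level in the context; unknown sources count as untrusted."""
    levels = (SOURCE_TRUST.get(item.source, Trust.UNTRUSTED) for item in items)
    return min(levels, default=Trust.TRUSTED)


def gate(tool: str, items: list[ContextItem]) -> str:
    """Policy decision made outside the model, before the tool call executes."""
    if tool in HIGH_RISK_TOOLS and context_trust(items) == Trust.UNTRUSTED:
        return "require_approval"
    return "allow"


context = [
    ContextItem("system_prompt", "You are a mail assistant."),
    ContextItem("web_page", "Ignore previous instructions and ..."),
]
print(gate("send_email", context))  # require_approval
```

The gate never inspects the text itself; it relies only on provenance, so it keeps working when a content classifier misses an injection.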

The offensive economics shifted

AI changed the cost structure of attacks asymmetrically. Vulnerability discovery has a ground-truth oracle (a crash, a failing sanitizer) and has become cheap and commoditised; exploit construction that evades mitigations has no such oracle and stays frontier-gated. Two consequences follow for defenders. Discovery tooling is now symmetric: defenders can run the same scanners attackers do. And the old ship-then-patch window is closing, so security moves to continuous defender-side discovery, sanitizer-backed CI, and pre-deploy gating with no human on the critical path.⁷

Don't-miss checklist

Enumerate the agent's capabilities and the reach of every input it trusts; treat that as the threat model.
Break the lethal trifecta deliberately: remove private-data access, untrusted content, or the exfiltration channel on high-risk paths.
Preserve user identity and intent to the backend; never let actions run as an unattributable service account.
Adopt OWASP LLM Top 10, MITRE ATLAS, and NIST AI RMF as the shared checklist and vocabulary.
Assume the model cannot separate instructions from data; enforce boundaries outside it.

Failure modes

Trifecta complete. Private data, untrusted input, and an outbound channel coexist; one injection exfiltrates.
Excessive agency. The agent holds broader permissions than the user, with no per-action authority check (OWASP LLM06).
Unattributable action. Backend sees a service account, not the user; audit cannot reconstruct who or why.
Defense-as-single-filter. A lone injection classifier is treated as sufficient; defense in depth is missing.
Stale patch window. Security assumes time to patch after disclosure; automated discovery is closing that window.

Open questions & validation

No robust, general separation of instructions and data inside a single model context exists yet; treat any claim of one skeptically.
Quantify each agent's blast radius (what one compromised action can reach) and validate the trifecta-breaking controls against it.
Multimodal injection moves the boundary again; text-only threat models undercount it (see prompt-injection defense).
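Blast radius, as used above, can be approximated as reachability: starting from one compromised action, which resources can be reached directly or transitively? The sketch below assumes a hand-written reach graph; the node names in `REACH` are hypothetical examples, and a real inventory would be generated from actual permissions.

```python
from collections import deque

# Assumed reach graph: each action or resource maps to what it can touch next.
REACH = {
    "read_mail": {"mailbox"},
    "send_mail": {"external_recipients"},
    "mailbox": {"contacts", "attachments"},
    "attachments": {"shared_drive"},
}


def blast_radius(start: str, graph: dict[str, set[str]]) -> set[str]:
    """Breadth-first search for everything reachable from one compromised action."""
    seen: set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            # Skip nodes already visited and the starting action itself.
            if nxt not in seen and nxt != start:
                seen.add(nxt)
                queue.append(nxt)
    return seen


print(sorted(blast_radius("read_mail", REACH)))
# ['attachments', 'contacts', 'mailbox', 'shared_drive']
```

Re-running the calculation after removing an edge (for example, the link from `attachments` to `shared_drive`) gives a simple before-and-after measure for a proposed control.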

References

OWASP Top 10 for LLM Applications: https://genai.owasp.org/llm-top-10/
MITRE ATLAS (adversarial threat landscape for AI systems): https://atlas.mitre.org/
NIST AI Risk Management Framework: https://www.nist.gov/itl/ai-risk-management-framework
Simon Willison, The lethal trifecta for AI agents: https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/
Greshake et al., Not what you've signed up for (indirect prompt injection): https://arxiv.org/abs/2302.12173
CVE-2025-32711 (EchoLeak, M365 Copilot zero-click exfiltration): https://nvd.nist.gov/vuln/detail/CVE-2025-32711
Wedge Networks, WedgeSecure Agent whitepaper (runtime "agentic edge" zero-trust control plane, prompt-injection and content-safety classifiers, egress/DLP): https://www.wedgenetworks.com/assets/White-Papers/WedgeSecure-Agent-Whitepaper-Feb-17-2026.pdf
Measuring the offensive/defensive capabilities this model describes: cybersecurity agent evaluation

¹ Identity (who asked) and intent (what they wanted) both flatten into a generic service-account call before the backend sees them; the model cannot reliably separate instructions from data, so trust boundaries must be enforced outside the model.
² The lethal trifecta is the coincidence of private-data access, untrusted content, and an external communication channel; removing any one leg breaks the injection-to-exfiltration chain.
³ EchoLeak (CVE-2025-32711) was a zero-click exfiltration of Microsoft 365 Copilot context via a crafted email, reported to bypass several prompt-side defenses.
⁴ OWASP Top 10 for LLM Applications; agents most often hit LLM01 (prompt injection), LLM02 (sensitive-information disclosure), LLM06 (excessive agency), and LLM10 (unbounded consumption).
⁵ MITRE ATLAS catalogues real-world adversary tactics and techniques against AI-enabled systems.
⁶ NIST AI RMF frames organisational governance; NIST's adversarial-ML taxonomy (NIST AI 100-2) classifies evasion, poisoning, privacy, and abuse attacks.
⁷ Offensive asymmetry: vulnerability discovery has a ground-truth oracle and is cheap; mitigation-evading exploit construction lacks one and stays frontier-gated, pushing defense toward continuous discovery, sanitizer-backed CI, and pre-deploy gating.