Markdown

Prompt-injection defense¶

Scope: defending an agent against prompt injection, where untrusted content in the context is interpreted as instructions. Covers the three forms (direct, indirect, streaming), what detection can and cannot do, the engineering of a real detector ensemble and streaming-time scoring, and why detection is one layer of a defense-in-depth stack rather than a solution. Sits under the threat model; pairs with sandboxing and the policy gate.

This page is defensive; it describes detection and mitigation, not attack construction. Code is reference-only.

flowchart LR
  DELTA["Streaming response deltas"] --> BUF["Proxy buffer"]
  BUF --> CK50["Checkpoint at ~50%"]
  CK50 -->|"score >= 0.5"| ABORT["Abort + discard tail"]
  CK50 -->|"below"| CK100["Checkpoint at 100%"]
  CK100 -->|"score >= 0.5"| ABORT
  CK100 -->|"below"| FWD["Forward to caller"]

Overview¶

A model reads one token stream and cannot reliably tell which parts are instructions and which are data. Prompt injection exploits exactly that: an attacker places instructions where the model will read them as commands. The carrier can be the user turn (direct), a retrieved document or tool output the agent ingests (indirect), or text that arrives during generation (streaming). Indirect injection is the dangerous case for agents, because the poisoned content arrives through a tool the agent was told to trust.¹ Detection helps, but the honest framing is that no classifier closes the gap; injection defense is layered.

Core knowledge¶

Detection by ensemble, with honest limits¶

A practical detector runs several independent checks and votes, because agreement between detectors is a stronger signal than any single model's confidence. A reference proxy combined a regex pre-filter, a fine-tuned DeBERTa classifier, and two trained injection detectors, and measured roughly 88% accuracy and 95% precision but only about 80% recall on a 153-sample injection corpus.² Two results from that measurement are worth carrying forward:

Per-detector quality varies wildly. One trained detector reached about 83% while a popular guard model sat near 42% and flagged almost every security-themed but benign prompt, a high false-positive "silent killer" that quietly blocks legitimate traffic.²
Apparent diversity can be fake. Two of the four detectors produced identical scores on every sample, so the "four-detector" ensemble was effectively three. Verify that ensemble members are actually independent before trusting the vote.²

The lesson is to measure recall and per-detector independence on your own corpus, not to trust a vendor accuracy number.

Score at the right moment, not only at the end¶

Scoring only the completed response returns a correct verdict too late: on a streaming connection the payload has already reached the caller before the abort fires. Buffer the stream in a proxy and score at checkpoints. A two-checkpoint policy (score at roughly the halfway mark and again at the end, aborting and discarding the tail on a crossing) is the engineering sweet spot, lifting true-positive rate well above end-only scoring at a little over one-and-a-half detector calls per response, because injection payloads tend to arrive early.⁴

The aggregate "how much text before a detector flags" number is misleading, because it hides a large divergence by attack class: direct injections are flagged after a few percent of the text, while indirect injections can take most of the message. Tune checkpoints against the attack classes you actually face, not the average.⁴

Architectural defenses beat filters¶

Detection is necessary but not sufficient; the durable wins are structural.

Spotlighting marks untrusted content with delimiters, datamarking, or encoding so the model can tell data from instructions, cutting attack success sharply in the original study.⁵
Training-time defenses (StruQ, SecAlign) teach the model a structured instruction/data separation, raising the bar before any runtime filter runs.⁶
The policy gate handles what detection cannot: even a perfectly detected injection still needs the action to be refused. Excessive agency (OWASP LLM06) is policy-shaped, enforced at the control plane, not by a classifier.
Break the lethal trifecta. Remove private-data access, untrusted content, or the exfiltration channel on the path and the injection has nothing to steal (threat model).

A subtlety: do not defend with an LLM-as-judge that is itself injectable. A small, purpose-trained judge is harder to talk out of its verdict than a general model asked to grade.²

Multimodal moves the boundary¶

Text detectors do not run on images, audio, or screenshots, and fusion happens inside the model's vision or audio encoder before any text detector sees the input. An instruction rendered in tiny font inside a banner image is read by the vision encoder as text but never resolved by a human reviewer, a clean indirect-injection carrier. The defensive shape generalises (a per-modality ensemble, for example an OCR pre-pass feeding the text detectors), but the multimodal detection corpus lags text by roughly two years, so recall on image injection is materially lower. Treat multimodal agents as higher-risk until that gap closes.⁷

Monitor evasion: the agent as a proxy¶

Detection assumes the monitor sees the trajectory honestly, but an adaptive attacker can blind it. In the agent-as-a-proxy attack, the agent is induced to repeat adversarial strings inside its own reasoning and tool-call traces, so a safety monitor reading those traces misclassifies a malicious run as benign, with reported attack success approaching certainty and good cross-model transfer. The unsettling corollary is that observing the tool-call path can widen the attack surface rather than narrow it. The countermeasure is not more trust in the trace: add a perplexity or anomaly check on tool outputs and boundary tokens, and keep the policy gate as the thing that actually refuses the action (observability).³

Don't-miss checklist¶

Treat all retrieved documents and tool outputs as untrusted; indirect injection is the main agent risk.
Measure recall and per-detector independence on your own corpus; do not trust a single accuracy figure.
Score streaming responses at checkpoints (about 50% and 100%), aborting and discarding the tail on a hit.
Add spotlighting or a training-time instruction/data separation; do not rely on detection alone.
Enforce the action refusal at the policy gate; a detected injection still needs to be stopped from acting.
Treat image, audio, and screen inputs as a weaker-covered surface and constrain those agents harder.

Failure modes¶

End-only scoring. The verdict is correct but arrives after the payload was delivered.
Fake ensemble diversity. Two detectors are correlated or identical; the vote adds no robustness.
High-FPR guard. A noisy detector blocks benign security-themed traffic and gets disabled, removing the defense.
Detection treated as sufficient. No policy gate or sandbox behind it, so a missed injection acts freely.
Injectable judge. An LLM-as-judge defense is itself talked out of its verdict by the injection.
Text-only coverage. Image and audio carriers bypass the text detectors entirely.

Open questions & validation¶

Recall, not accuracy, is the binding metric for injection detection; validate it on adversarial, in-distribution data.
Robust multimodal injection detection is immature; quantify image and audio recall before relying on it.
Whether training-time separation (StruQ, SecAlign) holds against adaptive attacks at agent scale is still open.

References¶

Greshake et al., Not what you've signed up for (indirect prompt injection): https://arxiv.org/abs/2302.12173
Hines et al., Defending against indirect prompt injection (spotlighting): https://arxiv.org/abs/2403.14720
Chen et al., StruQ: defending against prompt injection with structured queries: https://arxiv.org/abs/2402.06363
Chen et al., SecAlign: defending against prompt injection with preference optimization: https://arxiv.org/abs/2410.05451
Yi et al., Benchmarking indirect prompt injection (BIPIA): https://arxiv.org/abs/2312.14197
Zhan et al., InjecAgent (indirect injection in tool-using agents): https://arxiv.org/abs/2403.02691
OWASP Top 10 for LLM Applications (LLM01 Prompt Injection): https://genai.owasp.org/llm-top-10/
LLMTrace (detector ensemble and streaming detection, inference-security gateway): https://llmtrace.io
Isbarov & Kantarcioglu, Bypassing AI control via agent-as-a-proxy attacks: https://arxiv.org/abs/2602.05066

Greshake et al. introduced indirect prompt injection, where instructions enter through content the agent retrieves rather than the user turn; the model cannot separate instructions from data in one token stream. ↩
A reference proxy (LLMTrace) measured a regex-plus-three-model ensemble at about 88% accuracy, 95% precision, and 80% recall on a 153-sample corpus; per-detector quality ranged from about 83% to about 42% (the weak one flagging nearly all benign security-themed prompts), and two detectors produced identical scores, making the four-member vote effectively three. A small purpose-trained judge resists injection better than a general LLM-as-judge. ↩↩↩↩
In the agent-as-a-proxy attack (Isbarov & Kantarcioglu), an agent repeats adversarial strings in its reasoning and tool-call traces so a safety monitor misclassifies a malicious trajectory as safe, with high attack success and cross-model transfer; observing the tool-call path can widen the attack surface. Counter with perplexity or anomaly checks on tool outputs, not more trust in the trace. ↩
End-only scoring delivers the verdict after the payload; buffering and scoring at two checkpoints (about 50% and 100%, abort-and-discard-tail on a crossing) raises true-positive rate at modest extra cost. The "characters before a flag" figure diverges by roughly 17x between direct and indirect injection, so tune per attack class. ↩↩
Hines et al. (spotlighting) mark untrusted content with delimiters, datamarking, or encoding so the model distinguishes data from instructions, reducing attack success substantially. ↩
StruQ and SecAlign are training-time defenses that teach a structured separation of instructions and data, raising robustness before any runtime filter. ↩
Text detectors do not run on image, audio, or screen inputs; modality fusion happens inside the encoder before text detection. An instruction in tiny font inside an image is a working indirect-injection carrier, and the multimodal detection corpus lags text by roughly two years. ↩