How To Defend Against Prompt Injection And Other AI Attacks

How to Defend Against Prompt Injection and Other AI Attacks

Prompt Injection Prevention, Security vs Utility, CaMeL vs Undefended

Task completion and provable security coverage, based on the CaMeL research results

Task completion Provable security
Undefended system completed 84 percent of tasks with no provable security. CaMeL completed 77 percent of tasks with provable security at 77 percent. 0% 20% 40% 60% 80% 100% Undefended 84% 0% CaMeL 77% 77%

Source: “Defeating Prompt Injections by Design”, CaMeL, arXiv:2503.18813.

You know that feeling when your agent demo looks perfect, then it reads a “helpful” web page and tries to email your financials to a stranger named Bob, who does not work at your company. That is the core problem. Agentic systems are brittle in the wild. If we want real AI agent security, we need prompt injection prevention that holds up when the inputs are messy, adversarial, and fast changing. This post is a practical field guide, grounded in recent research and hard lessons from live incidents, for anyone asking how to harden AI agents without turning them into museum pieces.

1. The Problem: Prompt Injection Is A System Issue, Not A Slogan

Prompt injection is simple to describe and stubborn to kill. A model ingests untrusted content, it finds an instruction hidden inside, then follows it. Direct prompt injection says “ignore previous instructions.” Indirect prompt injection hides the payload inside a PDF, a web page, a calendar invite, or a tool result. Most failures do not look cinematic. They look like one subtly wrong argument passed to a tool.

Many teams start with band-aids. “Remind the model to be safe.” “Put untrusted text between delimiters.” “Ask it to verify its own plan.” These tricks help on happy paths. They do not survive creative inputs at scale. You need system design that treats prompt injection prevention as a first-class requirement, the same way classic software treats input validation and control flow integrity.

2. The Adversary Has Agents Too

SOC analyst watches bots probe defenses while dashboards highlight prompt injection prevention across tools.
SOC analyst watches bots probe defenses while dashboards highlight prompt injection prevention across tools.

Attackers moved from copy-paste scripts to agentic playbooks. We see campaigns that use models to triage stolen data, generate convincing extortion notes, and adapt messages to each target. Less skilled actors ride this wave, because modern tooling lowers the bar for complex operations. That matters for defenders because it changes tempo. Your detection, response, and hardening loops need to keep up. Prompt injection prevention is no longer a research curiosity. It is table stakes for any workflow that touches untrusted content and tools.

3. The Breakthrough: Design For Safety With Google CaMeL

The best recent idea comes from the CaMeL framework, short for Capabilities for Machine Learning, which treats prompt injection prevention as a systems problem. The core move is to separate control flow from data flow, then enforce what may pass into each tool at execution time. In evaluations, CaMeL completed 77 percent of AgentDojo tasks with provable security, compared to 84 percent for an undefended system, which is a good trade when you care about safety by design.

3.1 The Dual LLM Pattern, Instantiated

Split UI shows planner and quarantined parser separation enabling prompt injection prevention via dual LLM design.
Split UI shows planner and quarantined parser separation enabling prompt injection prevention via dual LLM design.

CaMeL makes the dual LLM pattern concrete. A Privileged LLM sees only the trusted user request. It writes the plan, as code, without ever peeking at untrusted data. A Quarantined LLM parses untrusted content into structured fields, and it cannot call tools. So injected text cannot hijack tool execution directly. This clean split is the backbone for prompt injection prevention that holds under pressure.

Yet separation alone is not enough. An attacker can still steer the arguments that flow into tools, which can be just as damaging. CaMeL addresses that gap by making security a property of values, not just prompts.

3.2 Capabilities And Policies

Capability passports gate a tool call as a policy modal enforces prompt injection prevention before sending an email.
Capability passports gate a tool call as a policy modal enforces prompt injection prevention before sending an email.

Every value in the plan carries a capability, a little passport that records where the data came from and who is allowed to see it. Before any tool runs, a policy checks those passports. For example, an email-sending policy can require that recipients already have read access to a referenced document. Policies are simple Python functions; they can gate specific tools, and they can be as strict as your environment demands. This is the heart of robust prompt injection prevention because the guard lives at the tool boundary, where harm happens.

3.3 An Interpreter That Tracks Data Flow

The CaMeL interpreter executes the Privileged LLM’s code, builds a dependency graph, and enforces policies at each tool call. The Quarantined LLM helps parse untrusted blobs into typed fields, which keeps the plan crisp and auditable. No model surgery required. That makes adoption far easier than retraining or fine-tuning for every edge case.

4. A Practical Playbook For Prompt Injection Prevention

You want something you can ship next week. Here is a no-nonsense playbook, tuned for teams building agents with tool access.

4.1 Separate Control And Data

Have one component that writes the plan, and a different one that parses untrusted content. Do not let the planner see untrusted data. Do not let the parser call tools. This structural decision gives you a place to anchor prompt injection prevention without awkward prompt gymnastics. It also helps with audits, since you can log plans and parser outputs independently.

4.2 Enforce Capabilities At Tool Boundaries

Tag values as they enter the system. Record provenance, allowed readers, and any special restrictions. Create tight policies for high-risk tools like file send, payment, external HTTP, and code execution. Fail closed. Ask for user confirmation if a policy would be violated. This is the one place where a small amount of friction pays for itself many times over.

4.3 Use Structured Outputs For The Parser

The Quarantined LLM should return structured outputs for the parser, not prose. Define minimal schemas, add a “have_enough_information” flag, and retry only through the planner. This defeats a whole class of indirect prompt injection tricks and keeps your dependency graph sane.

4.4 Whitelist Tools Per Task

Grant the minimum set of tools required for the plan. If the user asks for a calendar change, the agent does not need file deletion, repository push, or a network fetch. This dramatically reduces the blast radius when something slips past your first layers of prompt injection prevention. Implement a per-task tool whitelist.

4.5 Require Human Approval For High-Impact Actions

Anything with money, deletion, external sharing, or secrets gets a checkpoint. Keep the approval compact and contextual, show provenance, and log the decision. You will catch mistakes and attacks, and you will build trust with auditors.

4.6 Build Observability That Survives Incidents

Log the plan, the parser schema and outputs, the tool inputs, the tool results, and every policy decision. Store capability tags alongside values in your data lake. This supports both live monitoring and forensics. If you want lasting prompt injection prevention, you need eyes.

4.7 Red Team Regularly, Then Fix The Controls

Test direct prompt injection and indirect prompt injection across your top 20 workflows. Use red team content that looks like the wild: web reviews, meeting notes, financial summaries, and vendor emails. When you find issues, fix the plan, the schema, or the policy, not just the prompt.

5. Threats To Controls Mapping

One table you can keep near your runbook.

ThreatTypical SymptomPrimary ControlsNotes
Direct prompt injectionModel says it will ignore instructions, or rewrites the planPlanner isolation, plan logging, policy checks at toolsKeep the planner blind to untrusted content.
Indirect prompt injectionWrong arguments passed to a toolQuarantined parser, typed schemas, capability checksThis is where most real issues hide.
Data exfiltration via tool callUnintended sharing or network fetch with secretsCapability readers, “public vs specific recipients” policiesMake sharing explicit.
Tool misuseUnexpected tool gets calledPer-task tool whitelist, strict policy on high-risk toolsFail closed.
Memory poisoningOld notes steer a new actionProvenance tags, freshness checks, human approval for sensitive actionsTreat memory as untrusted unless explicitly tagged.
Side channelsInformation leaks through counts or timingStrict dependency mode, block derived state-changing calls after parser decisionsDesign for noisy channels, not perfection.

Prompt injection prevention is a portfolio, not a silver bullet. The more of these you turn on, the less exciting your incidents will be.

6. Implementation Notes That Pay Off

6.1 Keep The Plan Small And Deterministic

Short plans are easier to audit and harder to subvert. Make the planner write explicit steps, not loops over unknown data sets, unless you really need them. Deterministic plans simplify policy reasoning and improve the quality of your prompt injection prevention.

6.2 Design Schemas To Minimize Parser Power

If the only valid output is an email address and a filename, then define a schema that allows exactly that. Avoid free text. Add a boolean like “have_enough_information,” which lets the system retry the plan without giving the parser a chance to sneak suggestions back into planning.

6.3 Policies As Code, Reviewed Like Code

Express policies as simple functions. Keep them versioned, code reviewed, and tested. Start with high-value ones: send_email, external_http, transfer_money, write_file, run_code. The discipline matters more than any single rule.

6.4 Think In Values, Not Strings

Capabilities travel with values. If you split a document into chunks, the chunks inherit the document’s readers and provenance. If you compute a summary, the summary’s capability should reflect its inputs. This mindset shift is central to robust prompt injection prevention.

7. Limits, Honestly Stated

No system stops every failure mode. CaMeL’s guarantees target actions at the tool boundary. It does not try to stop text-only misinterpretations that never touch a tool. In tests, separation plus policies drove the attack success rate near zero, and the remaining “wins” were outside the threat model, such as simply printing a malicious review back to the user. The important part is that the model did not send money or leak files because those actions were gated by policies.

Side channels exist. If a loop runs N times based on a private value, a remote counter may infer that N. A strict dependency mode can reduce this class of leaks by treating control conditions as data dependencies for policy checks. That mindset lets you block or confirm risky operations that would otherwise leak through behavior, which strengthens your overall prompt injection prevention posture.

8. Strategy For Teams Getting Started

  1. Pick one agent that touches untrusted data and one high-risk tool.
  2. Split planner and parser. Give the parser structured outputs only.
  3. Add two policies, send_email and external_http, and run with confirmation prompts.
  4. Tag values with provenance and readers. Log everything.
  5. Red team for prompt injection and indirect prompt injection across three realistic tasks.
  6. Fix the plan, the schema, or the policy, not just the wording of your prompt.
  7. Repeat for the next agent.

This sequence takes you from slogans to concrete prompt injection prevention inside a sprint or two. It also sets up a foundation you can extend across your stack.

9. Where This Leaves Us

We do not need to beg the model to be good. We can make the system safe by construction. Google CaMeL shows a clear path, separate control and data, track provenance, and enforce policies at tool time. The research community has done the hard part of proving that careful design beats vibes. Now it is on us to build.

Call to action. Take one workflow that touches untrusted content, then ship the split-planner design with capabilities and two high-risk policies. Measure how often you block unsafe actions. Publish your results. If you want a sounding board on your architecture for AI agent security, reach out. The community needs more real stories about prompt injection prevention that actually work, not just clever prompts. You can start today.

Notes: CaMeL is described in “Defeating Prompt Injections by Design.” It formalizes the dual LLM pattern, tracks data dependencies, and enforces capabilities at tool time. The study demonstrates strong security with modest utility costs, and it requires no model modifications.

:
Prompt Injection
Crafted text that nudges an AI system to ignore its original instructions and take actions the attacker wants.
Indirect Prompt Injection
A hidden instruction embedded in content the model later reads, for example a web page or PDF, which then steers the system’s behavior.
Prompt Injection Prevention
A set of design and runtime controls that keep untrusted text from driving unsafe tool actions, usually by isolating data and enforcing checks at execution time.
Dual LLM Pattern
An architecture that splits roles between two models. One plans and calls tools, the other parses untrusted data. They never share privileges, which limits attack paths.
Privileged LLM
The planning model that sees trusted user intent and produces the step-by-step plan. It can call tools, but does not read untrusted data.
Quarantined LLM
The parsing model that reads untrusted content and returns structured fields. It cannot call tools, which prevents text from triggering actions.
Google CaMeL
Capabilities for Machine Learning, a framework that adds capability tags to values, tracks provenance, and checks policies before any tool runs.
Capability
A tag attached to a value that records where it came from and what it is allowed to do or share. Capabilities travel with the data through the plan.
Capability Policy
A short piece of code that decides if a tool call is permitted, based on the capabilities of its inputs and the destination. It blocks unsafe actions at the boundary.
Provenance
The recorded origin of data, for example which file, site, or tool produced it. Provenance helps policies reason about whether sharing is allowed.
Control Flow vs Data Flow
Control flow is the plan that decides what to do next. Data flow is the information that fills the plan’s inputs. Keeping them separate is central to security.
Tool Whitelist
A minimal set of tools the agent may use for a task. Reducing available tools limits what an attacker can misuse if something slips through.
Audit Logs
Complete records of plans, parser outputs, tool inputs and results, and policy decisions. These logs support monitoring, incident response, and compliance.
Dependency Graph
A map of how values are produced and used across steps. Interpreters use it to enforce capability checks and to keep untrusted data from reaching sensitive sinks.
Strict Dependency Mode
A runtime stance that treats control conditions as data dependencies for policy checks. It reduces leaks through behavior, for example timing or counts.
Data Exfiltration
Any unauthorized movement of sensitive information outside its allowed boundary, for example emailing a private document to a public address.
Least Privilege
A principle that grants only the permissions needed to complete a task. It shrinks the blast radius when failures occur.
Red Teaming
Regular, structured testing with realistic attacks to uncover weaknesses in prompts, schemas, tools, and policies, then fix the system design.
AgentDojo
A benchmark suite for agent tasks that has been used to measure designs like CaMeL, including their security and utility tradeoffs.
AI Firewall, Informal
A metaphor for architecture that blocks unsafe actions. In this context, the “firewall” is the combination of separation, capabilities, and policy checks, not a network appliance.

1) What Is Prompt Injection And Why Is It So Hard To Stop?

Prompt injection is when crafted text steers an AI system away from its instructions, often to exfiltrate data or misuse tools. It is hard to stop because the model cannot reliably tell trusted instructions from untrusted ones, and attackers can hide instructions inside web pages, PDFs, or emails that the model later reads. Lists of mitigations exist, but detection alone is not enough since even a small miss rate lets attacks through at scale. Treat it like a system design problem, not just prompt wording, and make prompt injection prevention a first class control.

2) What Is The Difference Between Direct And Indirect Prompt Injection?

Direct prompt injection happens when a user types instructions that override the system’s goals, for example “ignore previous instructions.” Indirect prompt injection hides a malicious instruction inside content the AI later processes, for example a web page or document with invisible text. Indirect attacks are more dangerous for agents with tools, because the model treats the embedded text as context and may attempt actions on its behalf. Effective prompt injection prevention requires isolating untrusted content and gating tool calls, not just filtering phrases.

3) What Is The “Dual LLM Pattern” For AI Security?

The dual LLM pattern splits responsibilities. A Privileged LLM plans and calls tools but never sees untrusted data. A Quarantined LLM reads untrusted data and returns structured outputs but cannot call tools. An orchestrator keeps them separate and treats the quarantine output as tainted. This limits how a hidden instruction can influence actions and improves prompt injection prevention for agents that browse, read files, or call APIs. The pattern has become a common reference point for developers hardening real systems.

4) What Is Google’s CaMeL Framework And Why Is It A Breakthrough For AI Agent Security?

CaMeL, short for Capabilities for Machine Learning, wraps the model in a protective layer. It extracts control flow from the trusted user request, parses untrusted data separately, and adds capability tags that carry provenance and permissions with values. Policies check those capabilities at tool time, so unsafe actions are blocked even if the model was influenced by a prompt. In public results, CaMeL demonstrated provable security on a substantial share of AgentDojo tasks, which is notable because most prior defenses were probabilistic filters. Think of it as architecture, not a patch, designed to make prompt injection prevention a property of the system.

5) How Does CaMeL’s “Design Based” Security Compare To Traditional Input Filtering?

Input filtering and detection try to spot bad strings. That helps, but attackers adapt and even a small false negative rate is enough to cause incidents. CaMeL changes the execution model. It separates planning from untrusted parsing, tracks data flow, and enforces capability policies before any tool runs. The result is stronger, more predictable control at the point of impact. In short, filters reduce some noise, architecture removes classes of failure. Use both, but anchor your prompt injection prevention on design, not hope.

Leave a Comment