Prompt Injection

How do we protect against Prompt Injection anyways?

Prompt injection sucks but some companies think they have the solution. Do they? Read my blog to find out they don't.

Fred Rohrer

16 Jun 2026 • 3 min read

Infosec on their way to a Sev 1 after prompt injection attack.

Prompt injection is perhaps the most dangerous and risky aspect of running an LLM in production. Generally we talk about when, not if, a system is injected against. As a quick reminder; prompt injection is the inclusion of tokens that steer an LLM to do something its bill-paying implementer objects to. Benign examples include asking for baking recipes in a legal tech app, or the now famous Chipotle Token Maxxer AI.

What is the industry doing about it?

Training the Instruction Hierarchy

or "weigh the system prompt higher" approach. The foundational work is OpenAI's instruction hierarchy paper. It argues the core vulnerability is that LLMs often treat system prompts as the same priority as text from untrusted users and third parties, and trains the model to follow higher-privilege instructions when they conflict while ignoring lower-privilege overrides. Notably, they found this drastically increases robustness even for attack types not seen during training while imposing minimal degradation on standard capabilities. Empirically OpenAI has proven that training in a native instruction hierarchy does not lower its scores. (Wallace et al., 2024, arXiv:2404.13208).

Follow-up work pushes this idea to its limits: a recent "Many-Tier Instruction Hierarchy" paper notes that current works assign privilege based on a small fixed set of role labels (system > developer > user > assistant > tool) and it's unclear whether models generalize to arbitrarily many tiers relevant once you have agents pulling from many sources of varying trust (arXiv:2604.09443). OpenAI also has an IH-Challenge training-dataset paper and has since added a "developer" role between system and user.

Temporal Encoding

or "provenance embedding". Wu et al.'s Instructional Segment Embedding adds a BERT-inspired embedding that embeds instruction priority information directly into the model, because they argue existing approaches like delimiters and instruction-based training don't address the issue at the architectural level. They report robust-accuracy increases of up to 15.75% and 18.68% on two benchmarks, plus a 4.1% improvement on instruction-following (AlpacaEval)(arXiv:2410.09102)

Training-based Injection Defenses

or "the cat and mouse game". The idea is to train on pairs where the injected instruction is the bad completion (Chen et al., arXiv:2410.05451). Meta extended this into a model called Meta-SecAlign-70B. The obvious caveat here is that model training will always be behind new prompt injection attempts and we have a million times cost asymmetry. RL-Hammer even strongly suggests that prompt injection attempts can be reinforcement-learned.

Architecture

or "more is better". This is the idea that all of these problems can be solved not in the models themselves, but in the architecture, pipelines and orchestration. DeepMind operationalizes it by borrowing classic software-security ideas: it explicitly extracts the control and data flows from the trusted query so that untrusted data retrieved by the LLM can never impact program flow, and does this without relying on more AI to catch the attack (Debenedetti et al., arXiv:2503.18813).

Simon Willison (probably) came up with Dual LLM patterns in which a priviledged LLM plans the work and another touches the dirty tokens but has no tool access. In practice this system has two issues: Untrusted data could still make its way to the planning agent and thus to the tools, since the tools need some kind of input from data. Meanwhile defining security policies and guardrails is a huge governance burden.

Finally, what actually works?

There is no single technique that has defeated prompt injection. LLMs do not have social constructs, they can't weigh parental authority higher than random input. They have no sense of time or the ability to truly learn. The goal of this research is to approximate authority, but the real innovation will happen in with continuous model training. For now we have to be careful and adhere to the lethal trifecta.