FROM AGPEDIA — AGENCY THROUGH KNOWLEDGE

Context poisoning

Context poisoning is the contamination of a large language model's context window with inaccurate, misleading, or adversarial information that biases the model's subsequent reasoning and output. The term spans two related phenomena: accidental poisoning, in which a model's own hallucinations or other errors enter the context and are then repeatedly referenced; and adversarial poisoning, in which an attacker injects false background material into a model's inputs to alter how it interprets a later request.^[1:1]^[2:1]

Context poisoning is most consequential in agentic AI systems, where models reason over long histories of tool calls, retrieved documents, and accumulated state. Once an error or falsehood enters the context, it can compound across turns and drive the agent toward goals or actions that no longer correspond to reality.^[3:1] The phenomenon is distinct from data poisoning, which targets a model's training data rather than its runtime inputs.

Definition and scope

Two related uses of the term predominate in current literature. Drew Breunig, summarising the Gemini 2.5 technical report, defines context poisoning narrowly as the case in which a hallucination or other error reaches the context and is then repeatedly referenced.^[1:1] Red-teaming frameworks, by contrast, use the term to describe an attacker-driven version, in which malicious or misleading background material is inserted into the context to shift the operational frame in which a later instruction is evaluated.^[2:1] Both senses share an essential mechanism: bad information enters the context and then exerts disproportionate influence on subsequent decisions, regardless of who put it there.

The accidental and adversarial senses can be hard to separate in practice. A retrieval-augmented generation assistant that pulls in a tampered document is being adversarially poisoned, but the model itself may not be able to tell that apart from a genuine but inaccurate source. Greshake and colleagues argued in 2023 that LLM-integrated applications fundamentally blur the line between data and instructions, since both arrive in the same natural-language input stream.^[4:1]

Mechanisms

Self-poisoning from hallucinations

In purely accidental cases, a model that produces an erroneous output may store or refer back to that output in subsequent steps, treating it as established fact. Long-horizon agentic loops are particularly vulnerable, because earlier decisions are typically preserved as part of the context that informs later ones. The Gemini 2.5 technical report described this for an agent playing Pokémon: as the context window grew, the agent's goals and summaries became "poisoned" with hallucinated game state, often taking many in-game hours to recover from.^[3:1]

Indirect prompt injection

Adversarial context poisoning often relies on indirect prompt injection. Greshake and colleagues introduced the concept in 2023, observing that augmenting an LLM with retrieval or external tools means data and instructions arrive in the same channel.^[4:1] An attacker can therefore plant content — in a web page, an email, an issue tracker, a calendar invite, or a code comment — that the model will later read and treat as part of its operating context. In agentic deployments, retrieved content has been shown to influence not just answers but tool choice and subsequent actions.^[4]

Memory and retrieval poisoning

Agents that maintain persistent memory or query external knowledge bases extend the attack surface beyond a single session. Chen and colleagues' AgentPoison work demonstrated that backdoor triggers can be embedded in an agent's long-term memory or retrieval index so that benign-looking queries containing the trigger reliably surface adversarial demonstrations.^[5] Elasticsearch researchers describe a non-adversarial form of the same problem in retrieval-augmented generation, in which lexical or semantic ambiguity causes the retriever to surface plausibly relevant but incorrect passages that then degrade answers.^[6]

Multi-turn semantic poisoning

Multi-turn conversational attacks can poison context without ever including a direct instruction. NeuralTrust's Echo Chamber technique uses storytelling and hypothetical framings across several turns to gradually establish premises that bias the model toward producing content it would otherwise refuse. The firm reports the attack reached a success rate above 90% on half of tested harm categories against several leading commercial models, including GPT-4o, GPT-4.1-nano, GPT-4o-mini, Gemini 2.0 Flash-lite, and Gemini 2.5 Flash, despite the individual turns appearing benign in isolation.^[7:1]

Notable examples

Gemini Plays Pokémon

The Gemini Plays Pokémon experiment described in the 2025 Gemini 2.5 technical report became a widely cited case study in context poisoning. Drew Breunig's analysis observed that the agent at several points became convinced it needed to retrieve a TEA item to progress, although the item exists only in the Fire Red and Leaf Green remakes and not in the original games the agent was playing; the agent then spent many hours attempting to acquire or deliver the non-existent item.^[8:1] The case is often used to illustrate how even a single hallucinated goal can dominate an agent's behaviour over a long horizon.^[3:1]

Echo Chamber

The Echo Chamber jailbreak, disclosed by NeuralTrust in June 2025, demonstrated that context can be poisoned for safety-bypass purposes through purely semantic means. Rather than including a forbidden instruction, the attacker leads the model through several turns of seemingly innocuous dialogue that establish premises and analogies, after which the model is more willing to produce restricted content.^[7:1] The technique attracted attention because it operates entirely in a black-box setting and exploits the same conversational coherence that makes multi-turn dialogue useful in the first place.

Relationship to other concepts

Context poisoning is closely related to, but not synonymous with, prompt injection. Prompt injection refers specifically to inputs that cause a model to follow attacker-supplied instructions rather than the operator's; context poisoning encompasses a broader set of failure modes, including non-adversarial cases in which the poisoned content is not formatted as an instruction at all. DeepTeam emphasises that context poisoning works by reframing the operational reality of a request rather than issuing a competing command.^[2:1]

It is also distinct from data poisoning, which manipulates a model's training data to produce a misbehaving model after training. Data poisoning attacks operate on weights; context poisoning operates on the input that a fully trained model sees at runtime.

Within taxonomies of long-context failure modes, Breunig groups context poisoning with three related phenomena: context distraction, in which a long history leads a model to repeat past actions rather than synthesise new plans; context confusion, in which superfluous content distorts the response; and context clash, in which contradictory items in context derail reasoning.^[1]

Mitigation

There is no consensus solution to context poisoning, and major model developers treat it as an open problem. Approaches discussed in the technical literature fall into several broad categories.

Context engineering

Anthropic and others describe a discipline of curating what enters and remains in the context window — including system prompts, tool outputs, retrieved documents, and prior turns — as an extension of prompt engineering aimed at minimising the surface for poisoning and other long-context failures.^[9:1] Typical techniques include trimming or summarising history, isolating untrusted content in clearly demarcated regions, and starting fresh sessions when a context appears to have been compromised.

Retrieval hardening

For retrieval-augmented systems, Elasticsearch researchers argue that hybrid lexical-and-semantic search, explicit metadata filtering, and source attribution can reduce the rate at which irrelevant or misleading passages enter the context, particularly as larger context windows make the cost of admitting bad material more severe.^[6]

Sandboxing and operator oversight

Agentic systems increasingly treat tool outputs and retrieved content as untrusted inputs that should not be allowed to escalate the agent's permissions or trigger irreversible actions without human review.^[4] Operator review remains the dominant defence in practice, because no current technique reliably distinguishes legitimate data from instructions hidden in that data.

Analysis: implications for human agency

Context poisoning poses a particular problem for AI-mediated workflows because it can produce outputs that look well-reasoned while resting on contaminated premises. A reviewer who sees only the final answer — and not the corrupted goal in a scratchpad or the tampered passage in a retrieval result — may approve work whose reasoning was never sound. The longer the agent's horizon and the less visibility a human reviewer has into intermediate state, the harder this becomes.^[9:1]

Designs that surface what is in the context, allow operators to inspect or strike items from it, and require human confirmation for consequential actions help keep accountability with the human in the loop rather than with whatever happened to enter the model's input. Without such designs, decisions made on the basis of poisoned context are difficult to attribute or audit after the fact, which weakens the practical capacity of operators and end-users to understand and correct what an AI system has done on their behalf.

^a ^b ↗ poisoning-definition ^ Breunig, Drew (2025-06-22). How Long Contexts Fail. https://www.dbreunig.com/2025/06/22/how-contexts-fail-and-how-to-fix-them.html.
^a ^b ^c ↗ deepteam-vs-prompt-injection DeepTeam by Confident AI. Context Poisoning. Confident AI. https://www.trydeepteam.com/docs/red-teaming-agentic-attacks-context-poisoning.
^a ^b ^c ↗ pokemon-context-poisoning Gemini Team, Google (2025-07-07). Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. Google. https://arxiv.org/abs/2507.06261.
^a ^b ↗ data-instruction-blur ^a ^b Greshake, Kai; Abdelnabi, Sahar; Mishra, Shailesh; Endres, Christoph; et al. (2023-02-23). Not What You’ve Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security (AISec ’23). https://doi.org/10.1145/3605764.3623985 https://arxiv.org/abs/2302.12173.
^ Chen, Zhaorun; Xiang, Zhen; Xiao, Chaowei; Song, Dawn; et al. (2024-07-17). AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases. arXiv. https://arxiv.org/abs/2407.12784.
^a ^b Elasticsearch Labs (2026-02-10). Context poisoning in LLMs: How to defend your RAG system. Elastic. https://www.elastic.co/search-labs/blog/context-poisoning-llm.
^a ^b ↗ echo-chamber-success-rate NeuralTrust (2025-06-23). Echo Chamber: A Context-Poisoning Jailbreak That Bypasses LLM Guardrails. NeuralTrust. https://neuraltrust.ai/blog/echo-chamber-context-poisoning-jailbreak.
^ ↗ tea-item-anecdote Breunig, Drew (2025-06-17). An Agentic Case Study: Playing Pokémon with Gemini. https://www.dbreunig.com/2025/06/17/an-agentic-case-study-playing-pok%C3%A9mon-with-gemini.html.
^a ^b ↗ context-engineering-definition Anthropic. Effective context engineering for AI agents. Anthropic. https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents.

Available in

en - English