Department of Computer Science, Virginia Tech
AI agents are increasingly trusted with complex, multi-step workflows: managing financial transactions, interacting with third-party APIs, and executing code autonomously. This autonomy, however, exposes a fundamental security gap: because agents process untrusted external content — emails, web pages, RAG results — they are susceptible to Indirect Prompt Injection (IPI) attacks, where malicious instructions embedded in the environment hijack the agent's behavior.
The security community has responded with monitoring-based defenses. The prevailing approach, exemplified by systems like LlamaFirewall, uses a secondary trusted model to jointly scrutinize the agent's Chain-of-Thought (CoT), tool calls, and tool outputs. The assumption: a comprehensive view of the agent's internal reasoning provides a strong defense.
We challenge this assumption. By forcing the monitor to process the agent's own outputs, defenders inadvertently create a direct channel through which attackers can manipulate the monitor. We show that this is not a theoretical concern — it is practically exploitable at scale.
A standard IPI attack tries to mislead the agent. Our approach goes further: the adversarial string is designed not only to hijack the agent's behavior, but to manipulate the monitor that observes the agent. By coercing the agent to repeat an optimized adversarial string within its chain-of-thought reasoning and tool call parameters, we force the monitor to classify a malicious trajectory as safe.
Crucially, the agent need not "know" it is participating in an attack. It simply repeats the injected string — a natural consequence of tasks like appending text to a message or including a string in a transaction field. The monitor, seeing the agent faithfully echo this string across multiple contexts, is manipulated at the token level.
The agent acts as an unwitting proxy, delivering adversarial strings directly into the monitor's context.
Standard GCG optimizes an adversarial string for a single insertion point. In agentic contexts, the monitor observes the adversarial string repeated across K distinct temporal contexts — a reasoning step, a function argument, an execution log. A string optimized in isolation may fail when re-contextualized.
Parallel-GCG addresses this by optimizing a single adversarial string jointly across all K insertion sites. Formally, given a fixed set of context slices {c0, ..., cK} extracted from a reference agent trace, we construct a composite sequence and minimize a joint loss over all occurrences:
Given a static context c, standard GCG finds an adversarial string a by minimizing the negative log-likelihood of a target response y:
In agentic settings, the monitor sees a repeated across K sites in the trajectory — reasoning steps, tool arguments, execution logs. A string optimized for one site may fail in another context. Parallel-GCG addresses this by constructing a composite trace that interleaves a between fixed context slices {c0, …, cK}:
We then minimize the loss over this full trace rather than a single insertion. By the chain rule, the gradient accumulates contributions from every site where a appears:
This means each token update to a is informed by all contexts simultaneously, producing a string that reliably deceives the monitor regardless of where it appears in the trajectory.
The coordinate gradient at each token position aggregates contributions from all K insertion sites, ensuring the optimized string remains effective regardless of its local context in the trajectory.
For multi-layer defenses like LlamaFirewall (PromptGuard 2 + AlignmentCheck), we employ multi-objective GCG, simultaneously minimizing loss against both the classifier and the LLM-based monitor.
We evaluate on the AgentDojo benchmark (v1.2.1), using the Slack suite (89 samples after filtering agent-level failures). We test three monitoring protocols — AlignmentCheck, PromptGuard 2 + AlignmentCheck, and Extract-and-Evaluate — across multiple open-source monitoring LLMs.
| Monitor | Model | Baseline | ASR@1 | ASR@5 | |
|---|---|---|---|---|---|
| AlignmentCheck | |||||
Llama-3.1-8B |
0.0 | 42.3 | 42.3 | ||
Mistral-7B |
14.4 | 90.2 | 93.3 | ||
Qwen2.5-7B |
2.1 | 62.5 | 90.7 | ||
| PromptGuard 2 + AlignmentCheck | |||||
Llama-3.1-8B |
1.0 | 44.3 | 76.3 | ||
Mistral-7B |
12.4 | 86.4 | 99.0 | ||
Qwen2.5-7B |
2.1 | 61.2 | 86.6 | ||
| Extract-and-Evaluate | |||||
Llama-3.1-8B |
0.0 | 11.2 | 54.7 | ||
Mistral-7B |
10.3 | 68.2 | 95.5 | ||
Qwen2.5-7B |
2.3 | 10.5 | 37.2 | ||
A counterintuitive finding: while hybrid monitoring (CoT + actions) is more robust than CoT-only monitoring against static baseline attacks, it is more vulnerable to our adaptive attacks. By monitoring tool calls, hybrid monitors are exposed to a larger surface of attacker-controlled content. This is a structural weakness, not a failure of any particular model.
We demonstrate that GPT-4o mini (46% ASR@5) and Llama 3.1 70B (60% ASR@5) can both be used to successfully bypass a Qwen2.5-72B monitor — a model with comparable capabilities. The attack does not require the agent to be smarter or larger than the monitor; it only requires that the agent faithfully repeat the optimized string. Scaling up the monitoring model cannot resolve a structural architectural vulnerability.
We propose the Agent-as-a-Proxy attack framework, which utilizes the victim agent to deliver adversarial attack strings directly to the monitoring system — without the agent's awareness.
We introduce Parallel-GCG, a gradient-based optimization algorithm that ensures adversarial strings remain effective when repeated across agent reasoning and tool-use contexts.
We uncover a paradox in hybrid monitoring: the very feature that makes it stronger against static attacks — observing both CoT and tool calls — creates a larger attack surface for adaptive adversaries.
We provide evidence that model scale offers no immunity. Agents with comparable capabilities to the monitor can reliably deceive frontier-scale monitoring models, challenging the scalable oversight hypothesis in the agentic domain.
If you find this work useful, please cite: