Preprint · February 2026

Bypassing AI Control Protocols via Agent-as-a-Proxy Attacks

Jafar Isbarov  ·  Murat Kantarcioglu

Department of Computer Science, Virginia Tech

arXiv Code Cite
TL;DR We show that monitoring-based defenses for AI agents — including state-of-the-art hybrid monitors — can be systematically bypassed by weaponizing the agent itself as a delivery mechanism for adversarial strings that fool the monitor. Model scale provides no protection.

01 Introduction

AI agents are increasingly trusted with complex, multi-step workflows: managing financial transactions, interacting with third-party APIs, and executing code autonomously. This autonomy, however, exposes a fundamental security gap: because agents process untrusted external content — emails, web pages, RAG results — they are susceptible to Indirect Prompt Injection (IPI) attacks, where malicious instructions embedded in the environment hijack the agent's behavior.

The security community has responded with monitoring-based defenses. The prevailing approach, exemplified by systems like LlamaFirewall, uses a secondary trusted model to jointly scrutinize the agent's Chain-of-Thought (CoT), tool calls, and tool outputs. The assumption: a comprehensive view of the agent's internal reasoning provides a strong defense.

We challenge this assumption. By forcing the monitor to process the agent's own outputs, defenders inadvertently create a direct channel through which attackers can manipulate the monitor. We show that this is not a theoretical concern — it is practically exploitable at scale.

02 Agent-as-a-Proxy Framework

A standard IPI attack tries to mislead the agent. Our approach goes further: the adversarial string is designed not only to hijack the agent's behavior, but to manipulate the monitor that observes the agent. By coercing the agent to repeat an optimized adversarial string within its chain-of-thought reasoning and tool call parameters, we force the monitor to classify a malicious trajectory as safe.

Crucially, the agent need not "know" it is participating in an attack. It simply repeats the injected string — a natural consequence of tasks like appending text to a message or including a string in a transaction field. The monitor, seeing the agent faithfully echo this string across multiple contexts, is manipulated at the token level.

Attacker
embeds IPI
Agent
reads + repeats
Monitor
sees adversarial string
SAFE ✓
bypassed

The agent acts as an unwitting proxy, delivering adversarial strings directly into the monitor's context.

03 Parallel-GCG

Standard GCG optimizes an adversarial string for a single insertion point. In agentic contexts, the monitor observes the adversarial string repeated across K distinct temporal contexts — a reasoning step, a function argument, an execution log. A string optimized in isolation may fail when re-contextualized.

Parallel-GCG addresses this by optimizing a single adversarial string jointly across all K insertion sites. Formally, given a fixed set of context slices {c0, ..., cK} extracted from a reference agent trace, we construct a composite sequence and minimize a joint loss over all occurrences:

Given a static context c, standard GCG finds an adversarial string a by minimizing the negative log-likelihood of a target response y:

ℒ(c ⊕ a; y) = −∑i log P(yi | c, a, y<i)

In agentic settings, the monitor sees a repeated across K sites in the trajectory — reasoning steps, tool arguments, execution logs. A string optimized for one site may fail in another context. Parallel-GCG addresses this by constructing a composite trace that interleaves a between fixed context slices {c0, …, cK}:

S(a) = c0a ⊕ c1a ⊕ · · · ⊕ cK−1a ⊕ cK

We then minimize the loss over this full trace rather than a single insertion. By the chain rule, the gradient accumulates contributions from every site where a appears:

E ℒ(S(a); y) = ∑k=1K ∂ℒ / ∂E(k)

This means each token update to a is informed by all contexts simultaneously, producing a string that reliably deceives the monitor regardless of where it appears in the trajectory.

The coordinate gradient at each token position aggregates contributions from all K insertion sites, ensuring the optimized string remains effective regardless of its local context in the trajectory.

For multi-layer defenses like LlamaFirewall (PromptGuard 2 + AlignmentCheck), we employ multi-objective GCG, simultaneously minimizing loss against both the classifier and the LLM-based monitor.

Agent trace
c₀ a c₁ a c₂ a c₃ ↓ site 1 ↓ site 2 ↓ site 3
Σ gradients → update a one string, robust across all contexts

03 Results

We evaluate on the AgentDojo benchmark (v1.2.1), using the Slack suite (89 samples after filtering agent-level failures). We test three monitoring protocols — AlignmentCheck, PromptGuard 2 + AlignmentCheck, and Extract-and-Evaluate — across multiple open-source monitoring LLMs.

>90%
Attack success rate against AlignmentCheck
Mistral-7B & Qwen2.5-7B monitors
99%
ASR@5 against PromptGuard 2 + AlignmentCheck
Mistral-7B monitor
60%
Llama 3.1 70B bypassing Qwen2.5-72B monitor
No capability gap required
Monitor Model Baseline ASR@1 ASR@5
AlignmentCheck
Llama-3.1-8B 0.042.3 42.3
Mistral-7B 14.490.2 93.3
Qwen2.5-7B 2.162.5 90.7
PromptGuard 2 + AlignmentCheck
Llama-3.1-8B 1.044.3 76.3
Mistral-7B 12.486.4 99.0
Qwen2.5-7B 2.161.2 86.6
Extract-and-Evaluate
Llama-3.1-8B 0.011.2 54.7
Mistral-7B 10.368.2 95.5
Qwen2.5-7B 2.310.5 37.2

The Hybrid Monitoring Paradox

A counterintuitive finding: while hybrid monitoring (CoT + actions) is more robust than CoT-only monitoring against static baseline attacks, it is more vulnerable to our adaptive attacks. By monitoring tool calls, hybrid monitors are exposed to a larger surface of attacker-controlled content. This is a structural weakness, not a failure of any particular model.

Model Scale Provides No Immunity

We demonstrate that GPT-4o mini (46% ASR@5) and Llama 3.1 70B (60% ASR@5) can both be used to successfully bypass a Qwen2.5-72B monitor — a model with comparable capabilities. The attack does not require the agent to be smarter or larger than the monitor; it only requires that the agent faithfully repeat the optimized string. Scaling up the monitoring model cannot resolve a structural architectural vulnerability.

04 Contributions

05 Citation

If you find this work useful, please cite:

@article{isbarov2026agentproxy, title = {Bypassing AI Control Protocols via Agent-as-a-Proxy Attacks}, author = {Isbarov, Jafar and Kantarcioglu, Murat}, journal = {arXiv preprint arXiv:2602.05066}, year = {2026} }