This Week's Signal
- Your agents are contaminating themselves, and no attacker is required. Multi-user LLM agents that share state across sessions silently corrupt one user's outputs with another's context at rates of 57-71% from benign interactions alone, and write-time sanitisation fails to prevent contamination in executable artefacts such as code and tool configurations (RAP-2026-016).
- MCP server detection is now possible, but the attack surface is worse than expected. The first systematic dataset of 114 malicious MCP servers reveals that direct code injection achieves 100% attack success universally, while a new two-stage detection tool (Connor) achieves F1 94.6% and has already found two malicious servers in the wild (RAP-2026-017).
- The gap between "safe model" and "safe agent" is quantified: 40-75% of attacks succeed. Five frontier LLMs that routinely refuse harmful requests in chat permit indirect prompt injection at rates of 40% to 75% when deployed as personal agents, with workspace instruction files (analogous to CLAUDE.md or .cursorrules) achieving 69.4% average attack success (RAP-2026-018).
ACT NOW
No Attacker Needed: Unintentional Cross-User Contamination in Shared-State LLM Agents
Authors: Tiankai Yang, Jiate Li, Yi Nian, Shen Dong, Ruiyao Xu, Ryan Rossi, Kaize Ding, Yue Zhao ArXiv: 2604.01350
Stream: S2 -- Agent Security | RAXE ID: RAP-2026-016
Executive Takeaway
Multi-user LLM agents that share state across sessions contaminate one user's outputs with another's context at rates of 57-71%, without any attacker. Write-time text sanitisation reduces contamination in conversational systems but leaves "substantial residual risk" when shared state includes executable artefacts such as code or tool configurations, with failures manifesting as silent wrong answers that evade detection.
Core Finding
The researchers formalise unintentional cross-user contamination (UCC) as a new failure class distinct from adversarial memory poisoning. UCC occurs when "information that is locally valid for one user can silently degrade another user's outcome when the agent reapplies it without regard for scope" (Abstract). The key distinction: "UCC requires no attacker; it arises from benign interactions whose scope-bound artifacts persist and are later misapplied" (Abstract).
The paper introduces a taxonomy of three contamination types evaluated across two shared-state agent platforms: EHRAgent (clinical database queries on MIMIC-III and eICU) and MURMUR (Slack workspace collaboration). Semantic contamination, where the agent inherits a user-specific interpretation of an ambiguous term, reaches 59-89% across datasets (§5, Table 2). Transformation contamination (inherited data transformation rules) reaches 21-69%. Procedural contamination (inherited workflow strategies) reaches 44-67%.
Technical Mechanism
The attack requires no adversary. A single agent instance serves multiple users through a shared knowledge layer. When User A's interaction produces scope-bound artefacts (a date interpretation, a rounding convention, a counting methodology), these persist in shared state. When User B later queries the agent, retrieved context includes User A's artefacts, which the agent applies without scope verification.
The write-time sanitisation defence (SSI) "interposes on the write path" to rewrite persisted interactions, "retaining reusable content while removing scope-bound artifacts" (§4). SSI nearly eliminates contamination on conversational platforms (Slack: 57% to 6%), but when shared state includes executable artefacts (solution code, SQL templates, tool configurations), "the convention survives in the solution code and propagates through exemplar reuse" (§5.5). Procedural contamination is particularly resistant because it "shapes the entire solution structure" rather than localising to any single function call (§5.5).
Defender Impact
Audit shared-state architecture now. (RAXE assessment) Any deployment where a single agent instance serves multiple users with shared memory should be reviewed. The question is whether user state is scoped per-user or shared across identities. Shared coding assistants, shared customer service agents, and shared research tools are all structurally exposed.
Do not rely on text-level sanitisation alone. SSI-style defences work for conversational content but fail for executable artefacts. Organisations should evaluate provenance tagging, access-controlled storage for non-textual state, and "generating fresh solutions from sanitised specifications rather than reusing compiled code" (§5.7). The finding that contamination produces "silent wrong answers" (§5.6) means standard error monitoring will not catch these failures.
Limitations and Open Questions
The evaluation uses GPT-4o as the sole backbone; UCC susceptibility may vary across model families (RAXE assessment). The 34 manually designed source conventions may not cover all contamination patterns in production deployments. Whether artefact-level defences (provenance tagging, per-user scoping) can reduce residual procedural contamination risk is left as future work. The interaction between UCC and retrieval-augmented generation architectures, where shared knowledge lives in a vector database, is not studied.
Radar Rating
| Threat Realism | Defensive Urgency | Novelty | Research Maturity |
|---|---|---|---|
| 5 | 5 | 5 | 4 |
From Component Manipulation to System Compromise: Understanding and Detecting Malicious MCP Servers
Authors: Yiheng Huang, Zhijia Zhao, Bihuan Chen, Susheng Wu, Zhuotong Zhou, Yiheng Cao, Xin Hu, Xin Peng ArXiv: 2604.01905
Stream: S2 -- Agent Security | RAXE ID: RAP-2026-017
Executive Takeaway
Researchers built the first systematic dataset of 114 malicious MCP servers covering 19 influence paths and 6 attack goals, then developed Connor, a two-stage detection system achieving F1 94.6%, outperforming existing approaches by 8.9-59.6%. Direct code injection in MCP server configurations achieves 100% attack success universally, while Claude Desktop shows substantially stronger defences against prompt injection than Cursor.
Core Finding
The paper introduces a component-centric attack framework for MCP servers, identifying seven attackable components: tool descriptions (TD), argument schemas (AS), tool source code (TSC), tool responses (TR), resource implementation code (RSC), resource responses (RR), and server configuration (CONFIG). Each component offers a distinct injection surface with different mediation paths through the LLM (§4).
From these components, the researchers derive 19 influence paths mapped to 6 attack goals (data leakage, reverse shell, download-and-execute, ransomware, sabotaging, backdoor), producing 114 proof-of-concept malicious servers, "the first component-centric PoC dataset" (Abstract). Direct code injection achieves "100% ASR across all host-LLM configurations" (§4.2, RQ1). For prompt injection attacks, "pre-execution context is more vulnerable...than post-execution artifacts" (§4.2, RQ1), and multi-component attacks that split "malicious logic across different components" achieve higher success rates than single-component attacks.
Technical Mechanism
The attack taxonomy characterises each influence path by four properties: mediation type (whether the attack flows through the LLM or runs directly), execution stage, code sink, and data carrier (§4.1). Each combination describes a distinct attack route from a manipulated MCP component to the final malicious action. After canonicalisation and deduplication, 26 feasible paths reduce to 16 semantically independent paths.
Connor detects malicious servers through a two-stage pipeline. The pre-execution stage comprises a Config Analyser that parses shell commands against 425 risky token patterns from GTFOBins and an Intent Inspector that distinguishes legitimate function descriptions from injected adversarial content (§5.1). The in-execution stage generates diverse queries aligned with each tool's stated intent, collects structured execution traces, and uses behavioural deviation analysis to identify trajectory-level anomalies, detecting deviations "at each interaction step rather than waiting for the complete execution" (§5.2). In real-world scanning of 1,672 MCP servers, Connor identified two previously unknown malicious servers (Abstract).
Defender Impact
MCP server vetting is now a measurable problem. (RAXE assessment) Connor demonstrates that automated detection of malicious MCP servers is feasible with high accuracy. Organisations deploying MCP integrations should implement pre-installation scanning, particularly for shell commands in server configurations, which are the highest-risk vector (100% ASR for direct code injection).
Host selection matters. Claude Desktop achieves "0% ASR for prompt injections targeting builtin operations" while Cursor ranges from 3.3% to 66.7% (§4.2, RQ2). Chain length provides partial defence: 3-stage attacks drop ASR by approximately 10.7 percentage points compared to 2-stage attacks (§4.2, RQ3, Table 6). This connects directly to the MCP threat model established in RAP-2026-013/014 (Issue 003). Connor provides the defensive detection capability those papers identified as missing.
Limitations and Open Questions
The PoC dataset uses bounded constraints (no more than 2 tools, 2 LLM rounds) that may not capture all real-world attack complexity (§4.1). Resource interactions are unsupported by Claude Desktop, limiting evaluation generality. Whether Connor's behavioural deviation approach transfers to non-stdio MCP transports (SSE, HTTP) is not evaluated. The 425 risky token patterns require ongoing maintenance as new attack utilities emerge (RAXE assessment).
Radar Rating
| Threat Realism | Defensive Urgency | Novelty | Research Maturity |
|---|---|---|---|
| 5 | 5 | 4 | 4 |
ClawSafety: "Safe" LLMs, Unsafe Agents
Authors: Bowen Wei, Yunbei Zhang, Jinhao Pan, Kai Mei, Xiao Wang, Jihun Hamm, Ziwei Zhu, Yingqiang Ge ArXiv: 2604.01438
Stream: S2 -- Agent Security | RAXE ID: RAP-2026-018
Executive Takeaway
Across 2,520 sandboxed trials on five frontier LLMs, personal AI agents with elevated privileges succumb to indirect prompt injection at rates of 40-75%, even when the underlying models routinely refuse harmful requests in chat. Workspace instruction files achieve 69.4% average attack success, and declarative framing ("does not match") bypasses all defences that imperative framing ("update X") reliably triggers. Only one model, Claude Sonnet 4.6, maintains hard boundaries against credential forwarding and destructive actions.
Core Finding
The paper introduces ClawSafety, a benchmark of 120 adversarial test scenarios spanning five professional domains (software engineering, finance, healthcare, legal, DevOps) and three injection channels (workspace skill files, emails from trusted senders, web pages). The central finding is that chat-level safety does not equal agent-level safety (§2.2). Attack success rates range from 40.0% (Sonnet 4.6) to 75.0% (GPT-5.1) on the OpenClaw scaffold, with a clear trust-level gradient: "SKILL injection consistently achieves the highest ASR, followed by email and then web content" (§4.4). This gradient is "steepest for Sonnet 4.6 (55.0/45.0/20.0)" and "flattest for GPT-5.1 (90.0/75.0/60.0), which is uniformly vulnerable regardless of injection channel" (§4.4).
The scaffold effect is significant: the same model (Sonnet 4.6) moves from 40.0% to 48.6% ASR across frameworks, "a shift of 8.6 percentage points from scaffold choice alone" (§4.4). On Nanobot, email injection overtakes skill injection, reversing the trust-level gradient observed on OpenClaw.
Technical Mechanism
The benchmark evaluates three injection vectors that exploit how agents process trusted workspace content. Skill injection embeds adversarial instructions in workspace instruction files (analogous to CLAUDE.md or .cursorrules) that the agent treats as "system-level operating procedures" (§B.1). Email injection plants attack content in messages from established colleagues. Web injection presents adversarial content in pages the agent browses during legitimate tasks.
The most significant mechanistic finding concerns defence boundaries: "the boundary is intent-sensitive but not content-sensitive: imperative phrasing triggers defenses regardless of presentation quality, while declarative phrasing bypasses all defenses regardless of content suspicion" (§4.6). Declarative framing succeeds because "reporting discrepancies is expected behavior during incident response; the most effective injections frame adversarial content as something to report, not something to execute" (§4.6).
Conversation length compounds the risk: ASR rises from 50.0% to 77.5% for Sonnet 4.6 between 10 and 64 turns, as "the agent internalises team norms, operational procedures, and colleague relationships" (§4.7). Removing named-colleague identities reduces exfiltration by 52.5 percentage points (§4.7, Table 4).
Defender Impact
Model selection is a security decision with quantified consequences. Sonnet 4.6 achieves 0% ASR on credential forwarding and destructive actions across all domains and vectors, "a hard boundary no other model maintains" (§4.5). GPT-5.1 permits both at 60-63%.
Design your workspaces defensively. (RAXE assessment) The 52.5-percentage-point reduction from removing named-colleague identities is the single largest controllable factor in the study. Shorter, more transactional agent sessions and role-based rather than personal identity designs meaningfully reduce attack surface. The skill injection vector (workspace instruction files treated as operating procedures) connects directly to the MCP tool description injection surface documented in RAP-2026-013/014: both exploit agents treating attacker-controlled configuration as system-level authority.
Limitations and Open Questions
The benchmark evaluates OpenClaw-style personal agents only; enterprise multi-agent systems (LangGraph, AutoGen, CrewAI) with different trust boundaries are not studied (RAXE assessment). The 64-turn default conversation length maximises attack success by design; real-world deployments span shorter, more transactional sessions that may exhibit lower ASR (RAXE assessment). No defensive countermeasures (input filtering, output monitoring, tool-call interception) are evaluated. The most effective attack payloads are not released "in a directly reusable form" (Ethics Statement).
Radar Rating
| Threat Realism | Defensive Urgency | Novelty | Research Maturity |
|---|---|---|---|
| 5 | 5 | 4 | 5 |
WATCH
CRaFT: Circuit-Guided Refusal Feature Selection via Cross-Layer Transcoders
Authors: Su-Hyeon Kim, Hyundong Jin, Yejin Lee, Yo-Sub Han ArXiv: 2604.01604
Stream: S4 -- Prompt Injection / S1 -- Adversarial ML | RAXE ID: RAP-2026-019
Executive Takeaway
The researchers demonstrate that mechanistic interpretability tools, specifically cross-layer transcoders and attribution graphs, can be weaponised to surgically disable LLM safety mechanisms. CRaFT improves jailbreak attack success from 6.7% to 48.2% on Gemma-3-1B-it by identifying the single internal feature that causally mediates refusal, rather than the features that merely activate strongly on harmful prompts. The resulting harmful outputs are approximately twice as detailed and convincing as those produced by prior methods.
Core Finding
CRaFT challenges the assumption that safety-relevant features are the ones that activate most strongly when processing harmful prompts. The paper demonstrates that "activation magnitude alone does not indicate causal influence" (§2.2) on the model's refusal decision. Instead, CRaFT uses circuit analysis through cross-layer transcoders to trace how features influence next-token predictions at the refusal-compliance boundary.
The results on Gemma-3-1B-it are substantial. CRaFT achieves an average ASR of 48.2% across four benchmarks (JailBreak, HarmBench, AdvBench, SorryBench) using the LlamaGuard4 classifier, up from 6.7% without any attack (§6.1, Table 2). Critically, CRaFT also scores 2.50 on the StrongREJECT judge metric, approximately twice the scores of Refusal-SAE (1.37) and Refusal-Direction (0.60), indicating that its harmful completions are "more specific and convincing" rather than superficially compliant but vacuous (§6.1).
Technical Mechanism
CRaFT operates in three stages. First, boundary-critical sampling identifies prompts where "refusal and compliance tokens both receive high probability" (§4.1), using the top 100 prompts from WildJailbreak scored by the minimum of refusal and compliance probabilities. These boundary prompts expose the computational fork where the model's decision could go either way.
Second, the framework extracts attribution graphs from a pre-trained cross-layer transcoder and computes each feature's influence on next-token logits by propagating influence scores through the circuit graph using iterative matrix aggregation (§4.2). Features are ranked by this circuit influence score rather than activation magnitude. The critical difference: activation-based selection identifies a feature in layer 22 (§6.2, Table 3) that has no effect on refusal when steered (ASR remains 5.0%). Influence-based selection identifies a feature in layer 3 that causally mediates refusal and achieves 52.0% ASR when steered.
Third, CRaFT applies layer-scaled steering to the single top-ranked feature, with the steering strength increasing proportionally from early to late layers (§4.3). Steering only one feature avoids the conflicting effects observed when manipulating multiple features simultaneously.
Defender Impact
Interpretability is now a dual-use capability. (RAXE assessment) CRaFT demonstrates that the same tools developed to understand model safety mechanisms (sparse autoencoders, transcoders, circuit analysis) can identify precisely which features to manipulate for maximum attack effectiveness. As interpretability tooling matures and becomes available for more model families, this attack surface expands. Defenders should consider this when evaluating the safety margin of refusal-trained models.
Refusal mechanisms may be shallower than assumed. The finding that a single feature in layer 3 (of a model with many layers) causally controls refusal suggests that safety alignment may be "concentrated in a surprisingly small number of features" (RAXE assessment). This is consistent with the "refusal direction" hypothesis from prior work, but CRaFT demonstrates it with higher precision. Models with safety behaviour distributed across many features and layers would be more resistant to this attack.
Limitations and Open Questions
CRaFT "depends on availability of pre-trained sparse model such as CLT" (§8) and is currently limited to Gemma-3-1B-it because interpretability tooling (GemmaScope2) exists only for this model. Whether the single-feature refusal pattern generalises to larger models, different model families (Llama, GPT), or models with more distributed safety mechanisms is unknown. The white-box access requirement (model weights and transcoder) limits the attack to open-weight models. The boundary-critical sampling uses prompts from WildJailbreak, which may not cover all refusal categories.
Radar Rating
| Threat Realism | Defensive Urgency | Novelty | Research Maturity |
|---|---|---|---|
| 3 | 3 | 5 | 3 |
Stream Coverage
| Stream | Papers This Week | Running Total (April) | Coverage |
|---|---|---|---|
| S1: Adversarial ML | 0 (1 cross-listed) | 0 (1 cross-listed) | Gap |
| S2: Agent Security | 3 | 3 | Surplus |
| S3: Supply Chain | 0 | 0 | Gap |
| S4: Prompt Injection | 0 (1 cross-listed) | 0 (1 cross-listed) | Gap |
| Total | 4 | 4 |
Coverage Notes
S2 dominates for the second consecutive week (3/4 in Issue 003, 3/4 in Issue 004). This reflects a genuine concentration of high-quality research output in agent security during late March and early April 2026, not a scanning blind spot. CRaFT (RAP-2026-019) is cross-listed S4/S1, providing partial coverage of those streams.
S1 and S3 remain unrepresented as standalone papers. The scan identified one S3 candidate (Combating Data Laundering in LLM Training, 2604.01904) but after full read the paper's focus on membership inference evasion for data rights enforcement was judged as low practitioner actionability for this audience. S1 candidate Spike-PTSD (2604.01750, adversarial attacks on spiking neural networks) was deferred as too niche for mainstream AI security practitioners.
Issue 005 should actively prioritise S1 and S3 papers to restore balance.
Cross-References to RAXE Advisories
| Paper | Related Finding | Connection |
|---|---|---|
| RAP-2026-016: No Attacker Needed | RAP-2026-012 (HEARTBEAT, Issue 003) | Both study agent memory integrity failures. HEARTBEAT requires an external attacker placing misinformation; UCC requires no attacker at all. Together they describe the full attack surface from external pollution to internal scope confusion. |
| RAP-2026-017: Connor | RAP-2026-013/014 (MCP threat model, Issue 003) | Issue 003 documented the MCP attack surface; Connor provides the first systematic detection system for malicious MCP servers, closing the defensive gap those papers identified. |
| RAP-2026-018: ClawSafety | RAP-2026-012 (HEARTBEAT, Issue 003); RAP-2026-013/014 (MCP, Issue 003) | The skill injection vector is architecturally identical to MCP tool description injection. The multi-source corroboration finding (§B.3.2) mirrors HEARTBEAT's social consensus attack lever. |
| RAP-2026-019: CRaFT | RAP-2026-015 (LLM judges, Issue 003) | RAP-2026-015 studied detecting prompt attacks from outside the model; CRaFT attacks refusal mechanisms from inside using interpretability tools. |
Selection Criteria
Selection criteria this week:
- Three S2 papers are included because each addresses a fundamentally different failure mode: unintentional shared-state corruption without any attacker (RAP-2026-016), detection of deliberately malicious MCP infrastructure (RAP-2026-017), and the safety evaluation gap between model alignment and agent deployment (RAP-2026-018). The three papers together map the agent security attack surface from design flaws through supply chain malice to adversarial exploitation.
- RAP-2026-017 (Connor) is the defensive continuity play from Issue 003: last week documented the MCP attack surface, this week delivers the first detection system.
- RAP-2026-019 (CRaFT) provides the non-S2 perspective by attacking safety mechanisms from inside the model using mechanistic interpretability, a genuinely novel approach at the S4/S1 intersection.
Deferred:
- Combating Data Laundering in LLM Training (2604.01904) -- S3 candidate, solid membership inference research, but low practitioner actionability for this audience
- SelfGrader: Stable Jailbreak Detection via Token-Level Logits (2604.01473) -- S4 defensive paper, first reserve
- Spike-PTSD: Adversarial Attack on Spiking Neural Networks (2604.01750) -- S1 candidate, niche SNN domain
Methodology
This issue of the RAXE Research Radar covers AI security papers published on arXiv between 2026-03-29 and 2026-04-05. Papers are selected based on four dimensions (Threat Realism, Defensive Urgency, Novelty, Research Maturity) with a minimum average score of 3.0. Each paper is read in full, with structured claim extractions documented in reading notes before summaries are written. Summaries are reviewed against reading notes to ensure factual traceability. Analytical claims beyond paper evidence are labelled "(RAXE assessment)".
Anti-hallucination protocol: All four papers were fetched from arXiv HTML and read in full. Reading notes with verbatim quotes and section references were completed for all four papers before summary drafting. Summaries were written from reading notes only, not from abstracts or memory. All arXiv IDs were verified against the arXiv API before inclusion.
Relevance badges:
- act_now -- immediate practical impact; evaluate this week
- watch -- emerging technique; track development
- horizon -- early-stage research; awareness only
The Research Radar is an independent publication from RAXE Labs' vulnerability advisory service. It does not constitute vulnerability disclosure or actionable threat intelligence. Papers are summarised for practitioner awareness; readers should consult the original publications for complete technical details.
RAXE Labs Research Radar Issue #4 -- Published 2026-04-05