This Week's Signal
- Your LLM API router may be stealing your credentials and rewriting your tool calls. A measurement study of 428 commodity routers found 9 actively injecting malicious code into agent tool-call payloads, 17 abusing researcher-owned AWS credentials, and 1 draining cryptocurrency. None of the four major agent frameworks tested verify response integrity, and rewriting overhead is indistinguishable from normal model jitter (RAP-2026-021).
- Skill documentation is the new attack surface, and it bypasses alignment where explicit instructions fail. Malicious logic embedded in code examples and configuration templates within skill documentation achieves 11.6% to 33.5% bypass rates across four agent frameworks, while explicit malicious instructions achieve 0% under strong defences. Four confirmed vulnerabilities disclosed (RAP-2026-022).
- Even benign, unmodified skills are exploitable through adversarial prompting. A closed-loop red-teaming framework achieves attack success rates of up to 0.26 on real-world production skills across 10 LLMs, with roughly 65% of successful exploits requiring three or more iterative refinement rounds (RAP-2026-023).
ACT NOW
Your Agent Is Mine: Measuring Malicious Intermediary Attacks on the LLM Supply Chain
Authors: Hanzhi Liu, Chaofan Shou, Hongbo Wen, Yanju Chen, Ryan Jingyang Fang, Yu Feng ArXiv: 2604.08407
Stream: S3 -- Supply Chain | RAXE ID: RAP-2026-021
Executive Takeaway
LLM API routers sit on a trust boundary the ecosystem currently treats as transparent transport. A measurement study of 428 commodity routers found 9 actively injecting malicious code into tool-call payloads, 17 abusing researcher-owned AWS credentials, and 1 draining cryptocurrency from a researcher-controlled wallet. None of the four major agent frameworks tested implement response-integrity verification, and rewriting overhead is indistinguishable from normal model jitter at 0.013 ms per request (§6.1).
Core Finding
The researchers formalise four attack classes against LLM API routers (§4, Table 1). AC-1, response-side payload injection, rewrites tool-call arguments in transit while preserving JSON schema validity, so "a single rewritten tool call is sufficient for arbitrary code execution on the client machine" (§4.1.1). AC-2, passive secret exfiltration, silently collects credentials from plaintext traffic without modifying responses. Two adaptive variants extend these: AC-1.a substitutes legitimate dependency names with typosquats in install commands, creating "a durable supply-chain foothold that persists beyond the current session" (§4.2.1), and AC-1.b gates injection on session-state predicates such as request-count thresholds or autonomous execution mode (§4.2.2, Table 4).
The measurement corpus comprises 28 paid routers purchased from Taobao, Xianyu, and Shopify storefronts, alongside 400 free routers from public communities (§5.1). Among paid routers, 3.57% inject malicious code. Among free routers, 2% inject code, 4.25% abuse canary credentials, and 2 deploy adaptive evasion triggers including a warm-up gate that activates only after 50 prior calls (§5.2).
Technical Mechanism
The fundamental vulnerability is architectural: routers terminate client-side TLS and re-originate a separate connection upstream, creating an application-layer man-in-the-middle position by design. "No deployed mechanism binds the provider-origin tool-call response to what the client finally receives" (§3). The weakest-link property means a single malicious router anywhere in a multi-hop chain invalidates all integrity guarantees (§4).
Poisoning studies quantify the blast radius. A single leaked OpenAI key generated 100M GPT-5.4 tokens (§5.3). Twenty weak router decoys attracted roughly 2B tokens, 13 GB of visible prompt and response data, and 99 credentials across 440 Codex sessions. Of those sessions, 401 (91.1%) were already running in autonomous YOLO mode, meaning "sophisticated adaptive evasion was unnecessary: basic payload injection would have been sufficient" (§5.3).
The researchers built Mine, a research proxy implementing all four attack classes, and tested it against OpenClaw, OpenCode, Codex, and Claude Code. Response-integrity checks: 0 out of 4 frameworks (§6.1, Table 5). AC-1 rewrite compatibility: 100%. Buffering overhead: 0.004 to 0.005 ms between final natural token and first rewritten token, "well below the 820 ms median upstream latency and indistinguishable from ordinary model jitter" (§6.1).
Defender Impact
Audit your API routing chain now. (RAXE assessment) Any organisation routing LLM traffic through third-party proxies, aggregators, or resellers should inventory its routing topology. The paper demonstrates that "paying for a router does not remove tool-call injection risk" (§5.4). Free community routers carry higher risk (4.25% credential abuse) but paid routers are not immune (3.57% code injection).
Deploy a policy gate as a first filter, not a final defence. The fail-closed policy gate blocks 100% of AC-1 and AC-1.a payloads with 1.0% false positives and 2.3 microsecond decision cost (§7.1, Table 6). However, it is "bypassed in 100% of samples" under adaptive attack when the attacker uses allowlisted domains or local stagers (§7.1). Anomaly screening catches 89.0% of AC-1 at 6.7% false positive rate but drops to 50.0% for AC-1.a and 50.9% for AC-1.b (§7.2). Append-only transparency logging at 1.26 KB per entry supports forensic analysis but cannot prevent attacks (§7.3).
The long-term fix requires provider cooperation. The researchers propose "provider-signed canonical response envelopes, similar in spirit to DKIM for email" (§8.2), but note that "none of the major provider tool-use APIs or the current MCP specification expose a deployed response-signing mechanism for tool-call arguments today" (§8.2).
Limitations and Open Questions
The measurement targets the most active publicly reachable commodity router markets; enterprise and invite-only deployments are not studied (§5.5). The Chinese marketplace focus (Taobao, Xianyu) may not generalise to Western enterprise procurement channels (RAXE assessment). The proposed provider-signed envelope requires provider cooperation, and adoption timeline and incentive alignment are not analysed (RAXE assessment). Whether the AC-1.b conditional delivery triggers observed (warm-up gates, YOLO mode gates) are representative of broader router behaviour is unknown.
Radar Rating
| Threat Realism | Defensive Urgency | Novelty | Research Maturity |
|---|---|---|---|
| 5 | 5 | 5 | 5 |
Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems
Authors: Yubin Qu, Yi Liu, Tongcheng Geng, Gelei Deng, Yuekang Li, Leo Yu Zhang, Ying Zhang, Lei Ma ArXiv: 2604.03081
Stream: S3 -- Supply Chain / S2 -- Agent Security | RAXE ID: RAP-2026-022
Executive Takeaway
Malicious logic embedded in skill documentation, not in skill code, bypasses safety alignment at rates of 11.6% to 33.5% across four agent frameworks and five models, while explicit malicious instructions achieve 0% execution under strong defences. The researchers generated 1,070 adversarial skills covering 15 MITRE ATT&CK categories and disclosed four confirmed vulnerabilities in Codex, Gemini CLI, Claude Code, and OpenHands, with two fixes deployed (§5.4).
Core Finding
The paper introduces Document-Driven Implicit Payload Execution (DDIPE), which "embeds malicious logic in code examples and configuration templates within skill documentation" (Abstract). Unlike prior skill-injection attacks that plant explicit instructions, DDIPE exploits how agents reuse documentation examples during normal tasks. Two embedding strategies are defined (§4.2): Strategy A plants payloads in Python or Bash code blocks within Markdown documentation, and Strategy B plants them in YAML, JSON, or Makefile configuration templates. Three camouflage techniques disguise the payloads: functional compliance disguise that "packages data theft as environment telemetry or security auditing," silent exception suppression that "wraps malicious calls in try/except: pass blocks," and false authority suggestion that "adds compliance certification language and names C2 endpoints to resemble internal services" (§4.2).
The adversarial skill pipeline expands 81 expert-crafted seeds to 1,070 samples across two LLM-driven generation stages, a 13.2 times expansion with 99.1% format validation pass rate (§5.2). The 1,070 skills span 15 MITRE ATT&CK categories, with supply-chain poisoning dominating at 47.5% (Table 1).
Technical Mechanism
Across eight test configurations (4 frameworks, varying models), DDIPE achieves bypass rates from 11.6% (Codex with GPT-5.4) to 33.5% (OpenHands with GLM-4.7), measured as the fraction of skills that generate or execute malicious code (§5.3, Table 3). Direct execution rates, where the agent actually runs the payload on the host, range from 2.3% (Claude Code with Sonnet 4.6) to 27.1% (OpenHands with GLM-4.7).
The interaction between model alignment and framework architecture is the paper's most operationally significant finding. GLM-4.7's direct execution rate amplifies 11.3 times when moving from Claude Code (2.4%) to OpenHands (27.1%), revealing 219 "sleeper payloads" that the architectural defences of Claude Code suppress but that surface on less defended frameworks (§5.3). A single 479-byte pip configuration file achieved universal breach across all three models under Claude Code (§5.3). Static analysis catches 90.7% of adversarial samples, but 27 payloads (2.5%) evade both static detection and model alignment (§6).
Defender Impact
Treat skill documentation as untrusted input. (RAXE assessment) The explicit instruction baseline achieves 0% execution under Claude Code with Sonnet 4.6, but DDIPE achieves 2.3% because agents treat documentation examples as legitimate reference material. Organisations deploying agent skill ecosystems should apply the same scrutiny to documentation code blocks and configuration templates as they apply to executable code.
Configuration file integrity is the highest-priority control. The universal breach sample is a 479-byte pip.conf file that redirects package installations to an attacker-controlled registry (§5.3). Defenders should implement allowlists for package index URLs and monitor for modifications to pip.conf, .npmrc, and similar configuration files in agent workspaces.
Multi-model ensemble reduces residual risk. Cross-model joint bypass drops to 1.6% (17 of 1,070 samples), compared to 13% to 20% for single models (§5.3).
Limitations and Open Questions
The generator model (Claude Opus 4.6) may cause same-family overfitting, though cross-family transfer is confirmed (§7). High error rates for OpenHands with MiniMax-M2.5 (64.5%) and Codex with GPT-5.4 (61.0%) mean bypass and execution rates are conservative lower bounds (§7). Only SkillScan is evaluated for static detection; dynamic sandboxing and LLM-based auditing remain untested (§7). The paper does not measure the current prevalence of malicious skills on SkillsMP's 631,813-skill marketplace (RAXE assessment).
Radar Rating
| Threat Realism | Defensive Urgency | Novelty | Research Maturity |
|---|---|---|---|
| 4 | 4 | 4 | 4 |
WATCH
SkillAttack: Automated Red Teaming of Agent Skills through Attack Path Refinement
Authors: Zenghao Duan, Yuxin Tian, Zhiyi Yin, Liang Pang, Jingcheng Deng, Zihao Wei, Shicheng Xu, Yuyao Ge, Xueqi Cheng ArXiv: 2604.04989
Stream: S2 -- Agent Security | RAXE ID: RAP-2026-023
Executive Takeaway
Even well-intended, unmodified agent skills harbour latent vulnerabilities that an attacker can exploit through adversarial prompting alone, without modifying the skill itself. SkillAttack, a closed-loop red-teaming framework, achieves attack success rates of 0.73 to 0.93 on adversarial skills and up to 0.26 on real-world skills across 10 LLMs, outperforming all baselines by a wide margin (§4.2, Table 1). Roughly 65% of successful exploits require three or more iterative refinement rounds, demonstrating that single-shot security audits miss the majority of exploitable vulnerabilities (§4.2, Table 2).
Core Finding
The paper demonstrates that the threat model for agent skill exploitation extends beyond malicious skill creation. "Non-malicious skills may also harbor latent vulnerabilities that an attacker can exploit solely through adversarial prompting, without modifying the skill itself" (Abstract). The attacker crafts user prompts that steer the agent towards sensitive operations exposed by the skill's interface without altering any skill code, system prompt, or runtime environment (§3.1).
Across 10 LLMs tested on 30 obvious-injection skills, 41 contextual-injection skills, and 100 real-world skills from ClawHub, SkillAttack achieves ASR of 0.87 (gpt-5.4, obvious), 0.93 (kimi-k2.5, obvious), and 0.26 (glm-5, real-world Hot100) (Table 1). The Direct Attack baseline peaks at 0.13 on obvious skills; Skill-Inject peaks at 0.43 (Table 1).
Technical Mechanism
SkillAttack operates as a three-stage pipeline (§3). Stage 1, vulnerability analysis, uses an agent-as-judge framework to extract vulnerability metadata from each skill: type, attacker-controllable inputs, sensitive operations, and triggering conditions (§3.2). Stage 2, surface-parallel attack generation, constructs multiple attack paths in parallel across all identified vulnerabilities (§3.3). Stage 3, feedback-driven exploit refinement, collects the execution trajectory, artefacts, and response from each attempt, evaluates success, and on failure refines the attack path based on feedback (§3.4).
Round 1 succeeds in only 12.8% of cases; round 3 is the most frequent breakthrough point at 36.0%, and rounds 3 and 4 combined account for roughly 65% of all successful exploits (§4.2, Table 2). A case study on a LinkedIn job-posting skill illustrates the pattern: the agent resists tool engagement for two rounds, but in round 3 "the agent finally attempted a read call on the scripts directory, surfacing the credential string and API endpoint" (§4.3). The threat profile for real-world skills "pivots toward operational threats: Data Exfiltration and Malware/Ransomware together exceed 70%" of successful attacks (§4.2, Figure 3).
Defender Impact
Static skill auditing is necessary but not sufficient. (RAXE assessment) SkillAttack demonstrates that a skill can pass static review and still be exploitable through adversarial interaction patterns. Organisations relying solely on pre-installation skill vetting should complement it with runtime monitoring of tool-call patterns and execution trajectories.
The 0.09 to 0.26 real-world ASR is the operationally relevant figure. While adversarial skill ASR is high (0.73 to 0.93), defenders should focus on the Hot100 real-world results. An ASR of 0.26 on glm-5 means roughly one in four real-world skills can be exploited through prompting under this attacker model. claude-sonnet-4-5 achieves 0.10, the lowest real-world ASR (Table 1).
Limitations and Open Questions
The framework uses a single judge model (Gemini 3.0 Pro Preview); multiple judges or human annotation would strengthen reliability (§6). Only prompt-level attacks are considered; multi-agent collusion and environment-level interventions are not addressed (§6). The 171 skills evaluated represent a fraction of real-world ecosystems (§6). Whether the 5-round attack budget reflects realistic attacker investment per skill is unclear (RAXE assessment). No defences are proposed (§6).
Radar Rating
| Threat Realism | Defensive Urgency | Novelty | Research Maturity |
|---|---|---|---|
| 4 | 3 | 4 | 4 |
SelfGrader: Stable Jailbreak Detection for Large Language Models using Token-Level Logits
Authors: Zikai Zhang, Rui Hu, Olivera Kotevska, Jiahao Xu ArXiv: 2604.01473
Stream: S4 -- Prompt Injection (defensive) | RAXE ID: RAP-2026-024
Executive Takeaway
SelfGrader reformulates jailbreak detection as a numerical grading problem over token-level logits, achieving 1.32% average attack success rate across six standard attacks on LLaMA-3-8B while using 173 times less memory and 26 times less latency than gradient-based alternatives (§4.2). The dual-perspective scoring rule, which evaluates both maliciousness and benignness of a query, reports 1.92% average false positive rate on benign LLaMA-3-8B benchmarks, including 0.00% on GSM8K and HumanEval, 1.36% on AlpacaEval, and 6.30% on OR-Bench (§4.2, Figure 1; Appendix, Evaluations on Benign Prompts).
Core Finding
Existing guardrail methods either inspect internal model features (introducing substantial latency and memory overhead) or analyse text responses (suffering from keyword-matching bias that paraphrasing-based attacks readily bypass). SelfGrader addresses both limitations by extracting logits over a compact set of numerical tokens (digits 0 through 9) and interpreting their distribution as an internal safety signal (§3.3).
The key innovation is dual-perspective logit (DPL) scoring. Two guardrail prompts evaluate the same query: one assesses maliciousness and the other assesses benignness. The ablation study demonstrates that this dual perspective is essential: removing the benignness assessment reduces attack success rate to 0.05% but inflates the false positive rate from 1.91% to 29.26%, making the guardrail unusable in production (§4.3, Table 4).
Across six standard attacks on LLaMA-3-8B, SelfGrader achieves 1.32% average ASR and 12.95% pass-guardrail rate (§4.2, Table 1). Against three adaptive attacks (TAP, LLM-Fuzzer, X-Teaming), the average ASR is 2.67%, compared to 43.33% for Llama Guard Pre and 52.33% for WildGuard (§4.2, Table 2).
Technical Mechanism
SelfGrader operates in three steps (§3.3). First, the guardrail extracts logits over numerical tokens from the target LLM's vocabulary for both a maliciousness prompt and a benignness prompt. "LLMs possess basic numerical reasoning abilities, such as recognizing that 9 is greater than 0" (§3.3). Second, temperature-normalised softmax produces perspective-specific scores, which are combined via the DPL formula with coefficient lambda equal to 0.5 and top-k tail trimming at 0.2Q. Third, a binary threshold produces the guardrail decision.
The method resists obfuscation attacks by design. "Emoji insertion does not cause degradation and even makes unsafe generations slightly easier to detect" (§4.2, Table 3). Against IJP with emoji insertion, SelfGrader achieves 0.2% ASR versus Llama Guard at 6.4% (Table 3). This resistance arises because token-level logits operate in a closed numerical space rather than depending on generated text that can be paraphrased or obfuscated.
Defender Impact
A practical guardrail option for resource-constrained deployments. SelfGrader requires 377.61 MB of memory and 0.78 seconds latency, compared to Token Highlighter at 65,536.66 MB and 15.20 seconds (§4.2). Organisations that need jailbreak detection on smaller hardware or with tighter latency budgets now have a viable alternative to generation-based methods.
Complements rather than replaces existing defences. (RAXE assessment) SelfGrader excels against obfuscation attacks that defeat keyword-matching guardrails but does not claim universal coverage. Layering SelfGrader as a lightweight pre-filter before a heavier guardrail could improve overall defence efficiency.
Limitations and Open Questions
Evaluation covers four LLMs at the 7B to 13B scale; performance on larger or frontier models is unknown (RAXE assessment). White-box gradient-based adaptive attacks specifically targeting the logit distribution are not evaluated (RAXE assessment). Cross-lingual robustness is not tested (RAXE assessment). The 0.78-second per-query overhead may not be acceptable for high-throughput API endpoints (RAXE assessment). DPL requires two inference passes per query (§5).
Radar Rating
| Threat Realism | Defensive Urgency | Novelty | Research Maturity |
|---|---|---|---|
| 4 | 3 | 3 | 4 |
Stream Coverage
| Stream | Papers This Week | Running Total (April) | Coverage |
|---|---|---|---|
| S1: Adversarial ML | 0 | 0 (1 cross-listed from Issue 004) | Gap |
| S2: Agent Security | 1 (SkillAttack) + 1 cross-listed (DDIPE) | 4 + 1 cross-listed | Surplus |
| S3: Supply Chain | 2 (Your Agent Is Mine, DDIPE) | 2 | Restored |
| S4: Prompt Injection | 1 (SelfGrader) | 2 | Partial |
| Total | 4 | 8 |
Coverage Notes
S3 receives its first standalone coverage in the Radar series, with two papers addressing distinct layers of the agent supply chain: API router intermediaries (RAP-2026-021) and skill documentation poisoning (RAP-2026-022). This restores the stream balance gap identified in Issues 003 and 004.
S2 receives one additional paper (SkillAttack), but the inclusion is justified by a genuinely distinct threat model: exploitation of unmodified benign skills through adversarial prompting, rather than creation or modification of malicious skills. S4 receives its first defensive paper, providing the issue with a constructive counterweight to the three offensive papers.
S1 remains unrepresented as a standalone paper. The April 2026 scan surfaced two S1 candidates (Safety, Security, and Cognitive Risks in World Models, 2604.01346; Spike-PTSD, 2604.01750) but both fell below the 3.0 threshold due to limited practitioner actionability and narrow evaluation scope. Issue 006 should actively prioritise S1 papers.
Cross-References to RAXE Advisories
| Paper | Related Finding | Connection |
|---|---|---|
| RAP-2026-021: Your Agent Is Mine | RAXE-2026-053 (LiteLLM JWT auth cache collision) | Same ecosystem (LiteLLM), different attack layer. RAXE-2026-053 targets application-level authentication; RAP-2026-021 targets transport-level tool-call integrity. |
| RAP-2026-021: Your Agent Is Mine | RAP-2026-017 (Connor/MCP detection, Issue 004) | Both address trust boundaries between agents and external services. §8.3 explicitly notes MCP servers face the same plaintext-access vulnerability. |
| RAP-2026-022: DDIPE | RAP-2026-018 (ClawSafety, Issue 004) | ClawSafety's skill injection vector is the explicit-instruction version of DDIPE. DDIPE demonstrates that implicit documentation poisoning succeeds where explicit instructions fail (0% baseline versus 2.3%+ DDIPE). |
| RAP-2026-022: DDIPE | RAP-2026-017 (Connor, Issue 004) | Connor detects malicious MCP servers; DDIPE poisons skill documentation. Both exploit agent trust in externally sourced content. |
| RAP-2026-023: SkillAttack | RAP-2026-022 (DDIPE, this issue) | DDIPE creates malicious skills; SkillAttack exploits unmodified ones. Together they demonstrate that neither creation nor vetting of skills is sufficient. |
| RAP-2026-024: SelfGrader | RAP-2026-019 (CRaFT, Issue 004) | CRaFT attacks refusal mechanisms from inside the model via interpretability tools; SelfGrader detects attacks from outside using logit distributions, a complementary defensive signal. |
Selection Criteria
Selection criteria this week: - Two S3 papers are included because each addresses a fundamentally different layer of the agent supply chain: transport-level router intermediaries (RAP-2026-021) and content-level skill documentation (RAP-2026-022). Together with the S2 paper on benign skill exploitation (RAP-2026-023), they map the full agent supply-chain attack surface from infrastructure through content to interaction. - RAP-2026-021 (Your Agent Is Mine) is the headliner as the first measurement study to quantify malicious router prevalence in the wild, with confirmed credential theft, code injection, and cryptocurrency drainage. - RAP-2026-023 (SkillAttack) provides the distinct S2 perspective that even vetted, benign skills remain exploitable through adversarial interaction patterns. - RAP-2026-024 (SelfGrader) provides the defensive counterweight, offering a practical, low-overhead guardrail approach using a novel formulation.
Deferred: - Combating Data Laundering in LLM Training (2604.01904) -- S3 candidate from Issue 004; abstract-only read completed; low practitioner actionability; superseded by two stronger S3 papers - Safety, Security, and Cognitive Risks in World Models (2604.01346) -- S1 candidate; single author, limited empirical evaluation, below threshold - Spike-PTSD: Adversarial Attack on Spiking Neural Networks (2604.01750) -- S1 candidate; niche SNN domain, low mainstream practitioner actionability
Methodology
This issue of the RAXE Research Radar covers AI security papers selected for the 2026-04-05 to 2026-04-13 research cycle. RAP-2026-021 (2026-04-09) and RAP-2026-023 (2026-04-05) are current-window arXiv papers; RAP-2026-022 (2026-04-03) and RAP-2026-024 (2026-04-01) were carried forward from earlier April scan results after full-read validation. Papers are selected based on four dimensions (Threat Realism, Defensive Urgency, Novelty, Research Maturity) with a minimum average score of 3.0. Each paper is read in full, with structured claim extractions documented in reading notes before summaries are written. Summaries are reviewed against reading notes to ensure factual traceability. Analytical claims beyond paper evidence are labelled "(RAXE assessment)".
Anti-hallucination protocol: All four papers were fetched from arXiv HTML and read in full. Reading notes with verbatim quotes and section references were completed for all four papers before summary drafting. Summaries were written from reading notes only, not from abstracts or memory. All arXiv IDs were verified against the arXiv API before inclusion.
Relevance badges: - act_now -- immediate practical impact; evaluate this week - watch -- emerging technique; track development - horizon -- early-stage research; awareness only
The Research Radar is an independent publication from RAXE Labs' vulnerability advisory service. It does not constitute vulnerability disclosure or actionable threat intelligence. Papers are summarised for practitioner awareness; readers should consult the original publications for complete technical details.
RAXE Labs Research Radar Issue #5 -- Published 2026-04-13