This Week's Signal
- The agent skill supply chain is broken — and automated scanners cannot tell you how. Five scanners disagree by an order of magnitude on malicious skill rates, but real supply-chain hijacking via abandoned GitHub usernames is exploitable today (121 skills across 7 vulnerable repositories), and greybox fuzzing of agent frameworks renders prompt-level defences largely ineffective. In separate evaluations, tool whitelisting (RAP-2026-009) and privilege separation (RAP-2026-008) were the only controls that materially reduced attack success (RAP-2026-005).
- Single-source telemetry has structural limits that no detection tuning can overcome. The best single log source covers less than 40% of advanced supply-chain attack steps; complementary two-source pairing lifts reconstruction to ~64%. Model-layer attacks are invisible to conventional host and network telemetry altogether (RAP-2026-006).
- Mechanistic understanding of AI safety failures is catching up to the attacks. In VLMs, jailbreaks are now measurable as a distinct internal state — not a perception failure — enabling targeted inference-time defences (RAP-2026-010). Multimodal safety gaps tend to widen with capability upgrades (RAP-2026-011), and concept unlearning controls remain vulnerable to image-modality bypass (RAP-2026-007). Across agent frameworks, privilege separation (0% ASR on LLMail-Inject, RAP-2026-008) and tool filtering (17.4% ASR on AgentDojo, RAP-2026-009) outperform prompt-level defences by wide margins.
ACT NOW
Malicious Or Not: Adding Repository Context to Agent Skill Classification
Authors: Florian Holzbauer, David Schmidt, Gabriel Gegenhuber, Sebastian Schrittwieser, Johanna Ullrich | ArXiv: 2603.16572
Stream: S2 — Agent Security | RAXE ID: RAP-2026-005
Executive Takeaway
Current automated scanners flag as many as 41.93% of marketplace skills as malicious, but the paper's repository-aware re-scoring leaves only 15 of 2,887 scanner-flagged skill-repository combinations (0.52%) in malicious-flagged repositories. More urgently, the paper identifies a real, previously undocumented supply-chain attack: adversaries can hijack abandoned GitHub repositories indexed by skill marketplaces, silently replacing legitimate skills before they are downloaded.
Core Finding
The largest empirical study of the AI agent skill ecosystem to date collected 238,180 unique skills from ClawHub, Skills.sh, SkillsDirectory, and GitHub (§3.1, Table 1). The paper asks whether the high malicious-classification rates reported by individual marketplaces reflect real risk or scanner artefacts.
On classification rates, the paper finds the answer is mostly artefact. Across five scanners, fail rates ranged from 3.79% (Snyk on Skills.sh) to 41.93% (the OpenClaw scanner on ClawHub) — a tenfold spread (§5, Table 2). Cross-scanner consensus was negligible: "only 33 out of 27,111 skills (0.12%) are flagged as malicious by all five scanners" (§5, Cross-Scanner Agreement). When the authors applied repository-context scoring to the 2,887 skills flagged by both the Cisco Skill Scanner and their LLM classifier, "only 0.52% remain in malicious flagged repositories" (§6, Takeaway).
In parallel, the paper identifies two structural attack vectors that are both real and underreported: repository hijacking and an API information disclosure bug in ClawHub (§4.0.2).
Technical Mechanism
Repository-context scoring operates in two stages. A codebase score (weighted 70%) uses an LLM to assess whether a skill's description aligns with the surrounding repository's code, README, and documentation. A metadata score (weighted 30%) estimates repository maturity through signals such as age, star count, fork count, and issue activity (§3.3). The composite penalises repositories that appear aligned but exhibit other suspicious characteristics.
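The two-stage composite described above reduces to a weighted sum. A minimal sketch, assuming both sub-scores have already been normalised to [0, 1] (the function name and normalisation are ours, not the paper's):

```python
def repository_score(codebase_alignment: float, metadata_maturity: float,
                     codebase_weight: float = 0.7) -> float:
    """Composite repository-context score: 70% LLM-judged alignment between
    skill description and surrounding repository, 30% repository maturity
    (age, stars, forks, issue activity). Inputs assumed in [0, 1]."""
    if not (0.0 <= codebase_alignment <= 1.0 and 0.0 <= metadata_maturity <= 1.0):
        raise ValueError("scores must be normalised to [0, 1]")
    return codebase_weight * codebase_alignment + (1 - codebase_weight) * metadata_maturity
```

A repository with perfect alignment but zero maturity signals would score 0.7, which is why the composite can still penalise young, unproven repositories that otherwise look plausible.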
Repository hijacking exploits the link-out distribution model used by Skills.sh and SkillsDirectory. Both platforms index skills by pointing to GitHub repository URLs rather than hosting files directly. When an original repository owner renames their GitHub account, the previous username becomes available for registration. An adversary who claims that username and recreates the repository will intercept future skill downloads. The authors found "121 skills that forward to seven vulnerable repositories," with the most-downloaded hijackable skill having reached 2,032 downloads (§4.0.2). ClawHub is not affected because it hosts skills directly.
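The hijacking precondition is mechanical to check: a link-out skill whose GitHub owner account no longer resolves is claimable. A sketch under assumed data shapes — the `repo_url` field and the `owner_exists` callable (which in practice would wrap a GitHub users-API lookup) are hypothetical, not from the paper:

```python
from typing import Callable, Iterable

def find_hijackable(skills: Iterable[dict],
                    owner_exists: Callable[[str], bool]) -> list[dict]:
    """Flag link-out skills whose GitHub owner username is vacant.

    Entries are assumed to look like
    {"name": ..., "repo_url": "https://github.com/<owner>/<repo>"}.
    A vacated username can be re-registered by an adversary, so any
    hit here is a candidate supply-chain hijacking target."""
    flagged = []
    for skill in skills:
        parts = skill["repo_url"].rstrip("/").split("/")
        owner = parts[-2]  # .../<owner>/<repo>
        if not owner_exists(owner):
            flagged.append(skill)
    return flagged
```

Running this periodically against an organisation's installed-skill inventory would surface the same class of exposure the authors found across Skills.sh and SkillsDirectory.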
The ClawHub API disclosure is a separate finding. The marketplace's API returns the GitHub-linked email address for each skill owner, even though this field is not exposed in the ClawHub web interface nor on default GitHub profiles (§4.0.2).
A static secrets scan using a modified TruffleHog found 12 functioning credentials embedded across the corpus, covering NVIDIA, ElevenLabs, Gemini, MongoDB, and others (§4.0.1).
Defender Impact
Immediate action — audit skill distribution dependencies. Organisations that allow AI agents to install third-party skills should identify which marketplace(s) they use. If the marketplace links out to GitHub repositories rather than hosting skills directly, every indexed skill is a potential hijacking target. Defenders should pin skills to specific commit hashes rather than mutable branch heads, and monitor for repository-ownership changes on skills already installed.
Scanner tuning. Single-scanner alerts on skill repositories should not be treated as high-confidence findings. Given the 0.12% five-scanner consensus rate, treating any single-scanner flag as a confirmed threat will generate unmanageable noise. A practical threshold (RAXE assessment) is to require at least two independent scanners to flag a skill, supplemented by manual repository review for high-value environments.
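The consensus threshold can be expressed as a small triage function. This is a sketch of the RAXE-recommended policy, not tooling from the paper; scanner names and the `min_agree` default are illustrative:

```python
def consensus_verdict(scanner_flags: dict[str, bool], min_agree: int = 2) -> str:
    """Triage a skill by how many independent scanners flag it.

    Given the 0.12% five-scanner consensus rate, a lone flag is treated
    as low-confidence noise rather than a confirmed threat."""
    hits = sum(scanner_flags.values())
    if hits == 0:
        return "pass"
    if hits < min_agree:
        return "low-confidence"  # queue for periodic re-scan, not alerting
    return "review"              # escalate to manual repository review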
Credential hygiene. Developers publishing agent skills should run secrets-scanning tooling (TruffleHog or equivalent) before committing, with API credential rotation as a mandatory step if a live credential is found in history.
Marketplace selection. ClawHub's direct-hosting model is more resistant to supply-chain hijacking than the link-out model. Organisations choosing skill marketplaces for their agent deployments should factor distribution architecture into their vendor assessment. The authentication gap at Skills.sh — which has no publisher authentication at all — is an additional concern (§4.0.2).
Limitations and Open Questions
The 15 repositories flagged after repository-context re-scoring are not confirmed malicious; the authors acknowledge this themselves (§6). The manual validation sample covers only 18 repositories. Repository-context scoring uses LLM analysis — adversaries who understand the scoring rubric could engineer repositories that pass it while hiding malicious intent. The skill-growth curves in Figure 2 show rapid ecosystem expansion; the attack surface is growing faster than measurement cadence. The 12 live credentials were valid at collection time, but their current status is unknown.
Radar Rating
| Threat Realism | Defensive Urgency | Novelty | Research Maturity |
|---|---|---|---|
| 5 | 5 | 4 | 4 |
SynthChain: A Synthetic Benchmark and Forensic Analysis of Advanced and Stealthy Software Supply Chain Attacks
Authors: Zhuoran Tan, Wenbo Guo, Taylor Brierley, Jiewen Luo, Jeremy Singer, Christos Anagnostopoulos | ArXiv: 2603.16694
Stream: S3 — Supply Chain | RAXE ID: RAP-2026-006
Executive Takeaway
Single-source telemetry is structurally insufficient for complete reconstruction of advanced software supply chain attacks. SynthChain quantifies this gap across seven representative attack scenarios grounded in real malicious packages and exploit campaigns, and recommends pairing complementary telemetry streams plus IAM/API audit logs for cloud and CI/CD paths.
Core Finding
The central claim of SynthChain is that single-source telemetry is not a detection tuning problem — it is a structural observability problem. The paper states this directly: "single-source detection is inherently incomplete for advanced supply-chain attacks: even an ideal detector operating on a single stream cannot recover a complete compromise chain when required evidence is missing by design" (Section 1).
The authors back this with measurement. Across seven controlled supply chain attack scenarios, the best single telemetry stream achieved weighted tag/step coverage of only 0.391 and mean chain reconstruction of 0.403 (Section 1). In plain terms, the best single log source covered less than 40% of the attack chain steps, even under laboratory conditions with full knowledge of what to look for.
Adding a second complementary source raised those figures to 0.636 coverage and 0.639 reconstruction — approximately a 1.6x gain (Section 1). Critically, the paper also finds that gains are not monotonic with source count: adding more sources that duplicate existing evidence introduces noise without improving detection. Complementarity of the sources chosen matters more than volume of logs collected (Section 1, cross-scenario analysis Section 7).
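The complementarity point — that the best pair of sources is not necessarily the two individually strongest — can be illustrated with a small selection sketch. Source names and step tags below are hypothetical; this computes union step coverage over all pairs, a simplified stand-in for the paper's weighted metrics:

```python
from itertools import combinations

def best_pair(coverage: dict[str, set[str]]) -> tuple[tuple[str, str], float]:
    """Pick the two telemetry sources whose combined (union) coverage of
    attack steps is largest, relative to all steps seen by any source.

    Illustrates why complementarity beats volume: a weak source that
    sees *different* steps can outrank a strong but redundant one."""
    all_steps = set().union(*coverage.values())
    scored = {
        pair: len(coverage[pair[0]] | coverage[pair[1]]) / len(all_steps)
        for pair in combinations(sorted(coverage), 2)
    }
    best = max(scored, key=scored.get)
    return best, scored[best]
```

In the toy data below, the IAM source is individually the weakest (2 of 6 steps) yet forms the best pair with endpoint telemetry, because it covers the cloud-layer steps the other sources structurally miss — mirroring the SC6 finding.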
Technical Mechanism
SynthChain is a near-production testbed comprising four physical hosts (Windows and Linux) and one containerised environment. It replays seven attack scenarios grounded in real malicious packages drawn from a statistical analysis of 16,272 packages from the OpenSSF corpus (Section 3.2). That analysis itself is instructive: 15,583 of those packages activate at install or download time, and Base64-based encoding is the dominant evasion technique — appearing in over 5,000 packages. Exotic evasion like steganography is statistically rare but present in targeted advanced scenarios.
The seven scenarios span a representative spectrum: steganography-based payload delivery (SC1: Stegano), persistence via Windows startup folder (SC2: Starter), parallel npm dependency chain attack (SC3: Parallel), sequential npm dependency chain (SC4: NPMEX), the 3CX multi-stage backdoor (SC5: 3CX), a cloud CI/CD pipeline attack targeting identity and access management (SC6: CloudEX), and a neural network model backdoor injected via a malicious model dependency (SC7: LayerInj) (Sections 4.3, 6.3).
Telemetry is collected across process lineage, Windows Event Logs, Syslog, Zeek network captures, Suricata alerts, and eBPF-based container instrumentation. ATT&CK annotations were generated using Mythic C2 exports for operator-driven steps, and GPT-5.1 proposals with human validation for payload-originated behaviours (Section 4.1.1).
Defender Impact
First, audit your telemetry architecture. If your SIEM or EDR ingests only one primary log type (endpoint telemetry or network capture but not both), SynthChain's data suggests you are structurally capped below 40% chain reconstruction for advanced supply chain intrusions. The paper frames this as an observability limit rather than a simple rule-tuning gap.
Second, prioritise complementary sources over log volume. In the benchmark, the strongest two-source combinations outperformed any single source on reconstruction metrics. The relevant metric is whether the sources share joinable identifiers (process ID, user identity, network endpoint) to connect attack phases across sources (Section 7, CSA-1 and CSA-4).
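The "joinable identifier" requirement can be made concrete with a toy correlation over a shared process ID. Field names below are hypothetical, not a real log schema; the point is that without a common key, two streams cannot be fused into one chain at all:

```python
def join_on_pid(endpoint_events: list[dict], network_events: list[dict]) -> list[dict]:
    """Correlate endpoint and network events that share a process ID.

    A minimal sketch of cross-source fusion: each network record is
    linked to the host-side process records carrying the same pid."""
    by_pid: dict[int, list[dict]] = {}
    for ev in endpoint_events:
        by_pid.setdefault(ev["pid"], []).append(ev)
    joined = []
    for net in network_events:
        for host in by_pid.get(net["pid"], []):
            joined.append({"pid": net["pid"],
                           "process": host["process"],
                           "dest": net["dest"]})
    return joined
```

Network events with no matching host-side pid simply drop out of the chain, which is exactly the evidence-gap failure mode the benchmark measures.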
Third, invest in IAM and API audit logging for cloud and CI/CD environments. The cloud pipeline scenario (SC6: CloudEX) achieved a step reconstruction score of only 0.25 even with multiple conventional log sources, because critical actions occurred in the cloud identity and API layer rather than on host endpoints. The paper explicitly recommends IAM/API audit streams as the targeted addition that yields "outsized gains for structurally hard cases" (Section 7, CSA-4). (RAXE assessment: this aligns with our findings on AI model supply chain attacks — RAXE-2026-024 — where package-level scanning misses post-deployment model tampering.)
Fourth, treat model-layer supply chain attacks as a distinct detection category. The SC7 (neural network model backdoor) scenario demonstrated that adding more host and network log sources can actually reduce detection precision, because the malicious behaviour is semantic — embedded in model weights — rather than expressed as anomalous OS-level events. Conventional telemetry cannot distinguish benign from malicious model loading (Section 6, SC7 case study). Dedicated model integrity checking and inference-time monitoring are required (RAXE assessment).
Limitations and Open Questions
The testbed covers seven scenarios across four hosts — rich for a research benchmark, but not representative of production-scale diversity. Benign background noise is simulated, not real user activity, which may inflate detection metrics. The study explicitly excludes macOS and cloud-native (serverless, SaaS) environments. The coarse rule-based chain reconstruction may under- or over-count steps where log schemas differ from the assumed canonical fields (Section 9).
The paper recommends IAM/API audit streams for cloud attack paths but does not evaluate that recommendation — the SC6 result remains at 0.25 with no measured uplift from the suggested fix. How much improvement those streams would deliver in practice is an open question. It is also unknown whether the two-source fusion result reproduces in real SIEM deployments with schema heterogeneity and retention limits.
Radar Rating
| Threat Realism | Defensive Urgency | Novelty | Research Maturity |
|---|---|---|---|
| 5 | 5 | 4 | 3 |
Agent Privilege Separation in OpenClaw: A Structural Defense Against Prompt Injection
Authors: Darren Cheng, Wen-Kwang Tsao | ArXiv: 2603.13424
Stream: S4 — Prompt Injection | RAXE ID: RAP-2026-008
Executive Takeaway
Splitting an LLM agent into a privilege-separated Reader (no action tools) and Actor (no raw input) reduces prompt injection success by a factor of 323 on an established benchmark. The full pipeline — adding JSON-structured inter-agent communication — achieves 0% attack success rate against 649 attacks that bypassed the single-agent baseline. Organisations running multi-tool agents should adopt privilege separation as a baseline architectural requirement.
Core Finding
The paper proposes a two-mechanism architectural defence against prompt injection in LLM agents, evaluated on the Microsoft LLMail-Inject benchmark running inside OpenClaw. The benchmark funnels 461,640 attack submissions down to 22,899 unique payloads, of which 649 succeed against a gpt-5-mini single-agent baseline — itself already a 2.83% ASR, placing it in the same range as GPT-4o-mini with all five original LLMail defences combined (§4.3, §5.3).
The four-condition ablation against these 649 attacks yields clear results: agent isolation alone achieves 0.31% ASR (2 attacks survive, 323x improvement); JSON formatting alone achieves 14.18% ASR (92 attacks, 7.1x improvement); the full pipeline combining both achieves 0% ASR (§4.1). The ablation confirms that "agent isolation is the dominant mechanism" and that "JSON formatting provides additional hardening but is not sufficient on its own" (§4.1).
The paper's strongest claim is architectural: "the action agent never receives raw injection content regardless of model behavior on any individual input" (§5.1). This is a structural guarantee, not a statistical one.
Technical Mechanism
The single-agent baseline — one agent that both reads untrusted emails and holds the send_email tool — is split into two cooperating agents with disjoint tool access (§3).
Agent 1 (Reader) has access only to store_summary. It processes raw email content and produces a JSON-structured summary with validated fields (sender, subject, body summary, action items). It cannot invoke send_email — this constraint is enforced by the platform's tool registration, not by prompt instructions (§3, §5.1).
Agent 2 (Actor) has access to send_email, get_pending_summary, and store_result. It reads only the Reader's JSON output. It never sees raw email content (§3).
A lightweight regex-based validator inspects summaries between agents, targeting email address literals, tool-call syntax fragments, and common injection trigger phrases. In experiments, the validator runs in audit mode to enable clean measurement (§3). Notably, 63.7% of injection content still leaks into the Reader's summaries (§6). The defence does not prevent content propagation — it prevents actionable content propagation.
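The design reduces to two pieces: a platform-level tool registry with disjoint grants, and a regex audit pass between the agents. A minimal sketch under stated assumptions — the class and regex are ours, illustrating the §3 design rather than reproducing the authors' implementation:

```python
import re

class ToolRegistry:
    """Platform-level tool registration with disjoint per-agent grants.

    The Actor structurally cannot be handed raw untrusted input, and the
    Reader structurally cannot invoke send_email — enforced here, not in
    prompts, which is the paper's core architectural guarantee."""
    def __init__(self, grants: dict[str, set[str]]):
        self.grants = grants

    def invoke(self, agent: str, tool: str) -> bool:
        return tool in self.grants.get(agent, set())

# Audit-mode validator between Reader and Actor: heuristics of the kind
# described in §3 (email literals, tool-call fragments, trigger phrases).
SUSPICIOUS = re.compile(
    r"[\w.+-]+@[\w-]+\.[\w.]+"                # email address literals
    r"|send_email\s*\("                        # tool-call syntax fragments
    r"|ignore (all )?previous instructions",   # common injection phrases
    re.IGNORECASE,
)

def audit_summary(summary: str) -> bool:
    """Return True if the Reader's summary looks injection-tainted."""
    return SUSPICIOUS.search(summary) is not None
```

Note that the audit pass is advisory: as the 63.7% leak rate shows, the security property comes from the registry's disjoint grants, not from catching tainted summaries.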
Defender Impact
The core principle is immediately adoptable. Any organisation running multi-tool LLM agents should separate the component that processes untrusted content from the component that holds dangerous tools (send, execute, delete, post). This separation should be enforced at the platform level (tool registration), not at the prompt level (§5.1).
Scenario 4 exfiltration attacks are fully blocked in this benchmark. Scenario 4 (data exfiltration) had 28.3% ASR under JSON-only defence but 0% under both two-agent configurations (§4.2).
Applicable beyond email agents. (RAXE assessment) The Reader/Actor pattern maps to any agent that reads untrusted content and takes actions: code assistants, customer-service bots, or MCP-integrated tools. This directly addresses the class of vulnerabilities seen in RAXE-2026-014 (MCP path traversal) and RAXE-2026-022 (Claude Code domain bypass).
Limitations and Open Questions
Only gpt-5-mini was tested; transfer to Claude, Gemini, or open-weight models is not demonstrated. The 649-attack corpus does not include adaptive attacks designed to target the two-agent architecture — the authors acknowledge this gap (§6). The regex-based validator is acknowledged as evadable. No performance overhead is reported. The 63.7% leak rate means that if future attacks encode executable instructions into structured JSON fields, defence degrades (§6).
Radar Rating
| Threat Realism | Defensive Urgency | Novelty | Research Maturity |
|---|---|---|---|
| 5 | 5 | 4 | 4 |
VeriGrey: Greybox Agent Validation
Authors: Yuntong Zhang, Sungmin Kang, Ruijie Meng, Marcel Bohme, Abhik Roychoudhury | ArXiv: 2603.17639
Stream: S2 — Agent Security | RAXE ID: RAP-2026-009
Executive Takeaway
VeriGrey adapts greybox fuzzing — the dominant technique in traditional software vulnerability discovery — to autonomous LLM agents. By using tool invocation sequences as coverage feedback and context-bridging mutations to craft semantically plausible injection prompts, it achieved 33 percentage points higher attack success than black-box methods on AgentDojo, 90% attack success against Google's Gemini CLI, and 100%/90%/80% success rates in triggering malicious supply-chain skills in OpenClaw across three LLM backends.
Core Finding
The central contribution is the transplant of greybox fuzzing's core loop into the agent security domain: tool invocation sequences stand in for branch coverage as the feedback signal guiding prompt exploration, and a context-bridging mutation operator rewrites injection payloads so that the malicious task reads as a necessary step in completing the legitimate one. Ablations identify context bridging as the dominant factor in attack success (Section 5.4, Table 2).
Critically, the approach was validated against production systems, not just benchmarks. Against Gemini CLI with Gemini-2.5-Pro backend, VeriGrey found 9 of 10 injection attack paths including SSH key exfiltration, bash history theft, and malicious cron job installation (Section 6, Table 5). Against OpenClaw, simple SKILL.md documentation rewrites — framing malicious installs as natural workflow steps, emphasising agent autonomy, fabricating usage examples — bypassed safety mechanisms with near-total success rates across Kimi-K2.5, Opus 4.6, and GPT-5.2 backends (Section 7.3, Table 6).
Technical Mechanism
VeriGrey combines two mechanisms.
Tool-call sequence feedback replaces traditional branch coverage. Instead of measuring which code paths are exercised, VeriGrey tracks the ordered sequence of tools an agent invokes during each execution. Prompts that trigger previously unseen tool-call sequences are prioritised via energy-based seed scheduling, ensuring systematic exploration of the agent's behavioural space rather than random prompt mutation (Section 3.2).
Context-bridging mutation rewrites injection payloads so that the malicious task appears to be a logically necessary step in completing the legitimate user task. Rather than inserting a disconnected instruction ("ignore previous instructions and..."), context bridging constructs a narrative link between the user's original goal and the attacker's desired action. Ablation confirms this is the dominant factor: removing context bridging reduced overall injection success by 25.8 percentage points, versus 11.1 points for removing the feedback mechanism and 4.5 points for removing prompt sandwiching (Section 5.4, Table 2).
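The feedback loop can be sketched as a coverage-guided scheduler in which tool-call sequences play the role of coverage. This is a simplified illustration of the mechanism, assuming caller-supplied stand-ins for the agent harness (`run_agent`) and the context-bridging mutator (`mutate`) — neither is an API from the paper:

```python
import random

def fuzz_agent(run_agent, mutate, seeds: list[str], budget: int = 100):
    """Coverage-guided prompt exploration in the style of VeriGrey:
    prompts that trigger previously unseen tool-call sequences are
    retained as seeds for further mutation.

    run_agent(prompt) -> tuple of invoked tool names
    mutate(prompt)    -> mutated prompt string"""
    seen: set[tuple] = set()
    corpus: list[str] = []
    for prompt in seeds:                     # establish baseline coverage
        seen.add(tuple(run_agent(prompt)))
        corpus.append(prompt)
    for _ in range(budget):
        prompt = mutate(random.choice(corpus))
        sequence = tuple(run_agent(prompt))
        if sequence not in seen:             # novel behaviour: keep the seed
            seen.add(sequence)
            corpus.append(prompt)
    return seen, corpus
```

The real system adds energy-based seed scheduling (prioritising seeds likely to reach new sequences) rather than the uniform `random.choice` used here.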
Defender Impact
Prompt-level defences are insufficient. Prompt sandwiching reduced VeriGrey's success by only 4.5 percentage points; data delimiters by 2.1 points (Section 5.5, Table 3). These defences assume injections are syntactically distinguishable from legitimate content, which context bridging deliberately defeats.
Tool filtering is the most effective current defence — reducing VeriGrey's success to 17.4% while maintaining 81.7% legitimate task completion (RAXE assessment: this validates architectural over prompt-level controls). Organisations deploying agents with MCP integrations or skill marketplaces should implement tool whitelisting as a minimum countermeasure.
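Tool filtering of the kind evaluated here amounts to executing only the tools pre-approved for the task at hand. A sketch of the idea — the task-to-whitelist mapping and tool names are hypothetical, not from the benchmark:

```python
TASK_WHITELISTS = {
    # hypothetical mapping from task type to the minimal tool set it needs
    "summarise_email": {"read_email", "store_summary"},
    "book_meeting":    {"read_calendar", "create_event"},
}

def execute_plan(task: str, plan: list[str]) -> list[str]:
    """Execute only whitelisted tools for the current task; drop the rest.

    The agent may *plan* whatever it is persuaded to plan — an injected
    'transfer_funds' step is simply never executable."""
    allowed = TASK_WHITELISTS.get(task, set())
    return [tool for tool in plan if tool in allowed]
```

This is why the defence survives context bridging: it does not try to recognise the injection, it removes the injected action's execution path.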
Supply-chain vetting needs to go beyond code scanning. The OpenClaw results demonstrate that malicious intent can be hidden in documentation (SKILL.md), not code. Current scanners (KOI, Snyk) flagged the original malicious skills, but the mutated documentation variants evaded detection entirely. Skill marketplace operators need documentation-level threat analysis, not just static code review (RAXE assessment).
Limitations and Open Questions
Scope is limited to single-session scenarios; multi-session attacks and persistent memory poisoning are not addressed (Section 9). Real-world testing covers only two agents (Gemini CLI, OpenClaw). Defence evaluation is limited to four established mechanisms — more recent architectural defences (dual-LLM, capability-based access control) are not evaluated. Budget sensitivity is unexplored: all experiments use a budget of 100 executions, and how results degrade under rate limiting is unknown. Cost per campaign is not reported. The responsible disclosure status of the Gemini CLI and OpenClaw vulnerabilities is not stated in the paper.
Radar Rating
| Threat Realism | Defensive Urgency | Novelty | Research Maturity |
|---|---|---|---|
| 4 | 4 | 5 | 3 |
WATCH
REFORGE: Multi-modal Attacks Reveal Vulnerable Concept Unlearning in Image Generation Models
Authors: Yong Zou, Haoran Li, Fanxiao Li, Shenyang Wei, Yunyun Dong, Li Tang, Wei Zhou, Renyang Liu | ArXiv: 2603.16576
Stream: S1 — Adversarial ML | RAXE ID: RAP-2026-007
Executive Takeaway
Safety controls that remove harmful or copyrighted concepts from image generation models can be bypassed using adversarial image inputs, without access to the target model's internals. REFORGE, a publicly released framework, achieves this in roughly 35 seconds per attack. Organisations relying on concept unlearning as a safety control should treat image-side inputs as a material bypass path. (RAXE assessment)
Core Finding
Image generation model unlearning (IGMU) — the practice of selectively deleting a concept from a trained model without full retraining — is vulnerable to adversarial image inputs. The paper introduces REFORGE, a black-box red-teaming framework that combines a crafted image prompt with an unmodified text prompt to cause erased concepts to re-emerge in model outputs (§I Introduction).
REFORGE was evaluated against six distinct unlearning methods — ESD, UCE, AdvUnlearn, DoCo, MACE, and ConceptPrune — across three concept categories: nudity (a sensitive content category), object removal (parachute), and artistic style (Van Gogh). Averaged across all six methods, REFORGE achieves attack success rates of 66.55% on nudity, 70.36% on object removal, and 74.99% on style unlearning (Table I, §IV-B).
Importantly, even AdvUnlearn — a method specifically designed to resist adversarial inputs during training — did not fully neutralise REFORGE. Against AdvUnlearn, REFORGE achieved 62.66% on nudity and 57.77% on parachute, compared to near-zero rates for competing text-based attacks (§IV-B). The paper states that "current IGMU methods remain vulnerable to multi-modal adversarial inputs" (§V).
Technical Mechanism
REFORGE operates in four stages (§III-B Overview).
First, a concept reference image — any publicly available image depicting the target concept — is converted into a simplified, stroke-based representation using a large-kernel median filter, colour quantisation to six colours, and region-based stroke rendering. This removes fine detail whilst preserving global layout, creating a visually innocuous starting point (§III-C).
Second, a spatial mask is derived from the cross-attention layers of a publicly available proxy model (Stable Diffusion v1.4). Cross-attention maps identify which spatial regions of the image are most strongly linked to the target concept token in the text prompt. The mask concentrates subsequent perturbation on these regions rather than across the whole image (§III-D).
Third, the adversarial image is iteratively refined by minimising the mean-squared error between its latent-space representation (in the proxy model's encoder) and the latent representation of the original concept reference image. The mask gates the gradient update so that only concept-relevant regions are modified (§III-E).
Finally, the refined image plus the original text prompt are submitted to the target model — which has no parameter connection to the proxy — and the output is assessed for concept re-emergence (§III-F).
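The third stage's masked objective can be written down concretely. A numpy sketch of one gradient step on the masked latent-matching loss — shapes, the learning rate, and the explicit gradient-descent form are illustrative; the paper optimises through a Stable Diffusion v1.4 proxy encoder rather than on raw arrays:

```python
import numpy as np

def masked_latent_step(adv_latent: np.ndarray, target_latent: np.ndarray,
                       mask: np.ndarray, lr: float = 0.1) -> np.ndarray:
    """One refinement step: minimise MSE between the adversarial image's
    latent and the concept reference's latent, with the cross-attention
    mask gating the update to concept-relevant regions only."""
    grad = 2 * (adv_latent - target_latent) / adv_latent.size  # d(MSE)/dx
    return adv_latent - lr * mask * grad                       # mask gates update
```

Regions where the mask is zero are never modified, which is how the attack concentrates perturbation where the proxy's cross-attention links image content to the erased concept.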
Defender Impact
Organisations that deploy image generation services with concept unlearning as a safety control — to prevent generation of copyrighted styles, NSFW content, or specific objects — should note the following:
The image input channel is an active attack surface. If the service accepts image prompts (image-to-image or inpainting modes), REFORGE-style attacks apply without access to the target model's internals. The attack requires API access to the target model, a public reference image, and a public proxy model for optimisation (§III-A Threat Model).
The speed of the attack (approximately 35 seconds versus 290–1,000 seconds for prior methods, §IV-D) makes repeated automated probing materially more practical. (RAXE assessment)
ConceptPrune, a structural pruning approach, was particularly susceptible: REFORGE achieved 97.77–98.00% ASR against it (Table I). AdvUnlearn, which trains adversarially, showed more resistance but was not robust.
Potential mitigations (RAXE assessment, not from the paper): restricting or monitoring image-side inputs, detecting highly blurred or stroke-like inputs as anomalous, and auditing whether unlearning controls are evaluated against image-modality attacks as well as text-modality attacks. The paper does not propose specific defences.
Limitations and Open Questions
No limitations section appears in the paper. All experiments use Stable Diffusion variants; applicability to other architectures (DALL-E, Flux, Midjourney) is untested. The "black-box" framing involves a public proxy model for gradient computation, which is more capable than a true query-only black-box scenario. Automated classifiers (NudeNet, ResNet-50) are used to measure ASR; their false-positive rate is not characterised. Whether input-validation controls (detecting sketch-style inputs) would constitute a low-cost defence remains an open question.
Radar Rating
| Threat Realism | Defensive Urgency | Novelty | Research Maturity |
|---|---|---|---|
| 4 | 3 | 4 | 3 |
Understanding and Defending VLM Jailbreaks via Jailbreak-Related Representation Shift
Authors: Zhihua Wei, Qiang Li, Jian Ruan, Zhenxin Qin, Leilei Wen, Dongrui Liu, Wen Shen | ArXiv: 2603.17372
Stream: S1 — Adversarial ML | RAXE ID: RAP-2026-010
Executive Takeaway
When images accompany harmful text prompts, vision-language models (VLMs) enter a distinct internal state that the authors can identify, measure, and surgically remove. The key insight is that VLMs do not fail to perceive harm during jailbreaks — they recognise it (49-76% of jailbreak responses contain safety warnings) but comply anyway. The proposed defence, JRS-Rem, subtracts the jailbreak-inducing component from model representations at inference time, reducing attack success by 65-72 percentage points on the strongest attack configurations while causing near-zero utility degradation.
Core Finding
The paper demonstrates that jailbreak, refusal, and benign inputs occupy three separable clusters in the VLM representation space, with AUROC exceeding 0.85 in middle and deep layers and linear probing F1 above 0.85. This directly contradicts the prevailing "safety perception failure" hypothesis: models internally distinguish harmful from benign inputs even when they ultimately comply with the harmful request. Table 1 shows that 49.55% to 76.18% of successful jailbreak responses contain explicit safety warnings before providing the harmful content — the model knows the content is dangerous and says so, then proceeds regardless.
The authors define a "jailbreak direction" as the normalised mean difference between jailbreak and refusal representations at each layer, then decompose each image-induced representation shift into a jailbreak-related component and a residual. JRS-Rem subtracts the jailbreak component when it exceeds a threshold (tau=0.2), reducing LLaVA-1.5-7B ASR from 84.9% to 12.4% on the hardest configuration and ShareGPT4V-7B from 71.7% to 2.1%, with utility loss of 0.5 points or less across benign benchmarks.
Technical Mechanism
JRS-Rem operates in three steps, all at inference time with no model retraining required.
Step 1 — Compute jailbreak direction. At each transformer layer, compute the mean representation for a calibration set of jailbreak samples and refusal samples separately, then take the normalised difference: d = (mu_jail - mu_ref) / ||mu_jail - mu_ref||. This unit vector captures the axis along which jailbreak diverges from refusal in representation space. Only 50 jailbreak + 50 refusal samples are needed; cross-dataset cosine similarity exceeds 0.7 across all tested distributions (Section 5.3, Figure 8a).
Step 2 — Decompose image-induced shift. For each input, run two forward passes: one with the image and one without. The difference gives the image-induced representation shift. Project this shift onto the jailbreak direction via scalar projection to isolate the jailbreak-related component. The normalised shift (proportion of total shift that is jailbreak-relevant) determines whether correction is triggered (Section 4.1).
Step 3 — Conditional correction. If the normalised shift exceeds threshold tau=0.2, subtract the jailbreak-related component from the representation. The correction formula is: h_corrected = h - s * d, where s is the scalar projection. The threshold prevents correction from firing on benign inputs, preserving utility (Section 5.1).
The two additional forward passes add negligible overhead relative to the full generation pass for typical response lengths (Section 5.1).
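The three steps above can be sketched directly in numpy, following the formulas given (d = (mu_jail - mu_ref) / ||mu_jail - mu_ref||, h_corrected = h - s * d). This is an illustration on flat vectors; in real use the operations apply to hidden-state activations at each layer of a self-hosted VLM:

```python
import numpy as np

def jailbreak_direction(jail_reps: np.ndarray, ref_reps: np.ndarray) -> np.ndarray:
    """Step 1: unit vector from mean refusal to mean jailbreak
    representation, computed from small calibration sets."""
    diff = jail_reps.mean(axis=0) - ref_reps.mean(axis=0)
    return diff / np.linalg.norm(diff)

def jrs_rem(h_with_image: np.ndarray, h_without_image: np.ndarray,
            d: np.ndarray, tau: float = 0.2) -> np.ndarray:
    """Steps 2-3: decompose the image-induced shift, then conditionally
    subtract its jailbreak-related component when the normalised shift
    exceeds tau (so benign inputs pass through untouched)."""
    shift = h_with_image - h_without_image     # image-induced shift
    s = float(shift @ d)                       # scalar projection onto d
    norm = np.linalg.norm(shift)
    if norm > 0 and abs(s) / norm > tau:
        return h_with_image - s * d            # h_corrected = h - s * d
    return h_with_image
```

The conditional threshold is what preserves utility: a benign image whose shift is orthogonal to the jailbreak direction produces a near-zero normalised projection and is never corrected.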
Defender Impact
The mechanistic insight matters more than the specific defence (RAXE assessment). The finding that jailbreaks are a distinct, measurable internal state — not a perception failure — reframes how defenders should think about VLM safety. It suggests that post-hoc detection of jailbreak states is feasible, even if the specific JRS-Rem intervention cannot be deployed.
Deployability is constrained to open-weight, self-hosted VLMs. JRS-Rem requires access to hidden-state activations during inference, excluding API-served models (GPT-4V, Gemini, Claude). For organisations running self-hosted VLMs, the 100-sample calibration cost and two-forward-pass overhead make this immediately practical (RAXE assessment).
Limitations and Open Questions
Only three VLMs were tested (LLaVA-1.5-7B, ShareGPT4V-7B, InternVL-Chat-19B) — all relatively small. Frontier-scale models (70B+, mixture-of-experts) and commercial API-served VLMs are absent; the jailbreak direction property may not hold or may require different calibration at larger scales. No adaptive attacks were evaluated: adversaries who know about JRS-Rem could potentially craft images that produce shifts orthogonal to the jailbreak direction, evading detection while still inducing harmful behaviour. Implicit attack residual ASR remains high (35.3% on LLaVA HADES-IH), meaning implicit attacks are not a solved problem. No production deployment evidence exists — inference latency, integration complexity, and interaction with other system-level defences are not assessed. The authors acknowledge a dual-use risk (the jailbreak direction could be used to amplify rather than suppress attacks) but do not analyse it in depth.
Radar Rating
| Threat Realism | Defensive Urgency | Novelty | Research Maturity |
|---|---|---|---|
| 4 | 3 | 5 | 3 |
UniSAFE: A Comprehensive Benchmark for Safety Evaluation of Unified Multimodal Models
Authors: Segyu Lee, Boryeong Cho, Hojung Jung, Seokhyun An, Juhyeong Kim, Jaehyun Kwak, Yongjin Yang, Sangwon Jang, Youngrok Park, Wonjun Chang, Se-Young Yun ArXiv: 2603.17476
Stream: S1 — Adversarial ML | RAXE ID: RAP-2026-011
Executive Takeaway
Unified multimodal models (UMMs) — systems that both understand and generate across text and image modalities — have a measurable safety gap that widens when tasks involve multiple input images or multi-turn interaction. UniSAFE benchmarks 15 models across 7 task types using 6,802 curated instances and finds that image generation tasks consistently produce higher attack success rates than text generation, even within the same model. Critically, more capable models are not safer: the correlation between generative quality and text-output attack success is r=0.9634.
Core Finding
Under UniSAFE's shared-target design (a single intended unsafe outcome projected across all seven task types, enabling principled cross-task comparison), GPT-5 achieves roughly 30.5% average attack success rate on image-output tasks versus 8.1% on text-output tasks. Gemini-2.5 shows the same pattern at higher absolute levels: 48.3% versus 32.6% (Section 4.3). Image generation pipelines have weaker safety alignment than text generation pipelines, even within the same model.
The Show-o to Show-o2 upgrade provides the starkest evidence that capability scaling degrades safety: text-to-text ASR jumps from 15.53% to 54.10% and multimodal understanding ASR from 5.31% to 39.94% in one generation (Section 4.3). Commercial models rely on system-level (external classifier) refusal for image outputs versus model-level refusal for text (Section 4.2), suggesting image safety is structurally more brittle (RAXE assessment).
Technical Mechanism
UniSAFE's core design innovation is shared-target evaluation: a single intended unsafe outcome is projected across all seven task types (text-to-image TI, image editing IE, image composition IC, multi-turn editing MT, text-to-text TT, image-to-text IT, and multimodal understanding MU). Each task type is defined by a formal input/output modality tuple, ensuring that differences in attack success reflect genuine modality-level safety gaps rather than artefacts of heterogeneous test content.
Inputs are individually benign — unsafety emerges only from the cross-modal combination of text and image inputs. This design prevents models from refusing based on surface-level keyword detection and forces genuine safety reasoning.
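The modality-tuple framing above can be made concrete with a small sketch. This is a RAXE illustration: the task abbreviations follow the paper, but the data structure, the exact per-task input tuples, and the helper function are our assumptions, not UniSAFE's released code.

```python
# Each UniSAFE task type as an (input modalities, output modality) tuple.
# Abbreviations are the paper's; the tuple contents are illustrative.
TASKS = {
    "TI": (("text",),                  "image"),  # text-to-image
    "IE": (("text", "image"),          "image"),  # image editing
    "IC": (("text", "image", "image"), "image"),  # image composition
    "MT": (("text", "image"),          "image"),  # multi-turn editing (per turn)
    "TT": (("text",),                  "text"),   # text-to-text
    "IT": (("image",),                 "text"),   # image-to-text
    "MU": (("text", "image"),          "text"),   # multimodal understanding
}

def tasks_by_output(modality):
    """Group task types by output modality -- the axis along which
    UniSAFE's safety gap appears (image outputs fail more often)."""
    return sorted(k for k, (_, out) in TASKS.items() if out == modality)
```

Projecting one unsafe target across every tuple is what lets the benchmark attribute ASR differences to the modality pipeline rather than to differing test content.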
Evaluation uses a three-judge MLLM ensemble (Gemini-2.5 Pro, GPT-5-nano, Qwen2.5-VL-72B) validated against human annotation with Pearson correlation r=0.962 (Section 3.3). The benchmark comprises 6,802 instances spanning 7 task types and 9 harm categories across 15 models.
Defender Impact
Image-output tasks require dedicated safety evaluation. The consistent modality gap means text-only safety benchmarks are insufficient for UMMs. Any organisation deploying a UMM for image generation must evaluate safety on image-output tasks specifically (RAXE assessment).
Capability upgrades demand safety re-evaluation. The near-perfect correlation means every model upgrade is a potential safety regression. Organisations should mandate safety re-benchmarking as a gate for any model version upgrade (RAXE assessment).
Limitations and Open Questions
Evaluation is limited to currently available UMMs in a fast-moving field. Some models could not be evaluated on all task types. The safety taxonomy may not cover all possible harm categories. MLLM judges, while validated against human annotation (r=0.962), are not perfect proxies for human judgement. No adversarial robustness testing of the benchmark itself is performed — models could potentially be tuned to game it. Visual assets for image composition and multi-turn tasks are synthetically generated, which may introduce artefacts.
Radar Rating
| Threat Realism | Defensive Urgency | Novelty | Research Maturity |
|---|---|---|---|
| 3 | 3 | 4 | 4 |
HORIZON
No papers this week are classified as horizon-only.
Stream Coverage
| Stream | Papers This Issue | Running Total (2026) |
|---|---|---|
| S1: Adversarial ML | 3 (RAP-007, RAP-010, RAP-011) | 4 |
| S2: Agent Security | 2 (RAP-005, RAP-009) | 4 |
| S3: Supply Chain | 1 (RAP-006) | 1 |
| S4: Prompt Injection | 1 (RAP-008) | 1 |
| Total | 7 | 10 |
Coverage Notes
All four research streams are covered. S2 (Agent Security) is particularly strong this issue: RAP-2026-005 maps the skill marketplace attack surface and RAP-2026-009 demonstrates automated exploitation of agent frameworks, forming an attack/defence pair with RAP-2026-008's privilege separation defence.
Cross-References to RAXE Advisories
| Paper | Related Advisory | Connection |
|---|---|---|
| RAP-2026-005: Agent Skill Ecosystem | RAP-2026-003 (OpenClaw, Week 12) | Direct extension — RAP-2026-005 analyses the broader marketplace ecosystem that RAP-2026-003's OpenClaw operates within; both find supply-chain attack vectors in agent skill distribution |
| RAP-2026-006: SynthChain | RAXE-2026-015 (PickleScan), RAXE-2026-024 (NeMo) | SynthChain's SC7 (model backdoor) finding — that conventional telemetry cannot detect semantic-level model poisoning — reinforces the limitations noted in RAXE supply chain advisories |
| RAP-2026-008: Agent Privilege Separation | RAXE-2026-014 (MCP Git), RAXE-2026-022 (Claude Code) | Privilege separation between reading and action layers is a direct structural mitigation for the injection-to-action class of vulnerability in both advisories |
| RAP-2026-009: VeriGrey | RAP-2026-008 (this issue), RAP-2026-005 (this issue) | VeriGrey demonstrates that prompt-level defences (sandwiching, delimiters) are largely ineffective under greybox fuzzing while tool filtering is the strongest tested defence — reinforcing RAP-2026-008's argument for architectural over prompt-level controls; OpenClaw skill mutation results extend RAP-2026-005's finding that documentation-level manipulation evades scanners |
| RAP-2026-007: REFORGE | None | No direct connection to current RAXE advisories; image generation safety is outside current advisory scope |
| RAP-2026-010: JRS-Rem | None | New research thread in VLM safety; no prior RAXE advisory coverage of representation-space defences |
| RAP-2026-011: UniSAFE | None | No direct advisory connection; establishes baseline measurement for multimodal model safety gaps |
Methodology
This issue of the RAXE Research Radar covers AI security papers published on arXiv between 2026-03-13 and 2026-03-20. Papers are selected based on four dimensions (Threat Realism, Defensive Urgency, Novelty, Research Maturity) with a minimum average score of 3.0. Each paper is read in full, with structured claim extractions documented in reading notes before summaries are written. Summaries are reviewed against reading notes to ensure factual traceability. Analytical claims beyond paper evidence are labelled "(RAXE assessment)".
The Research Radar is an independent publication from RAXE Labs' vulnerability advisory service. It does not constitute vulnerability disclosure or actionable threat intelligence. Papers are summarised for practitioner awareness; readers should consult the original publications for complete technical details.
Relevance badges:
- act_now — immediate practical impact; evaluate this week
- watch — emerging technique; track development
- horizon — early-stage research; awareness only