Protecting AI Agents in Production
Real-World Attack Patterns Targeting AI Systems
Analysis of 28,194 threat detections across 74,636 agent interactions in just 7 days reveals evolving attack patterns including the emergence of inter-agent attacks as a new threat category.
Executive Summary
Key insights for security leaders and practitioners
The Bottom Line
One in three AI agent interactions in our dataset contains adversarial content. During customer testing, teams intentionally probe with adversarial techniques — this ratio will normalise in production, but provides a strong indication of the attack surface organisations face. Our analysis of 74,636 interactions detected 28,194 threats with 92.8% high-confidence classification.
What's New This Week
- Inter-agent attacks emerged as a distinct category (3.4%), with attackers now targeting agent-to-agent communication channels
- RAG poisoning surged to 10% of all threats, exploiting document retrieval systems
- Jailbreak detection confidence reached 96.3%, indicating these attack patterns are now highly predictable
Top 3 Threat Vectors
Recommended Actions
- Protect system prompts: 7.7% of attacks specifically target prompt extraction
- Implement layered detection: combine pattern matching with ML classification
- Audit agent permissions: tool abuse, and goal hijacking are increasing
- Scan RAG documents: context poisoning is a growing attack vector
Key Findings
Critical insights from 7 days of production threat detection
Data Exfiltration Dominates
Attackers primarily target system prompts and confidential context. Average detection confidence: 90.9%
Jailbreaks Show Clear Signatures
Well-established attack patterns enable reliable detection with highest confidence scores.
Agent Attacks Are Growing
Tool abuse, goal hijacking, and inter-agent attacks target agentic capabilities with 97%+ confidence.
RAG Poisoning Emerges
Context injection and retrieval manipulation attacks targeting RAG systems.
Cybersecurity Dominates Harm Categories
Malware generation, exploit development, and security bypass remain primary attacker objectives.
Threat Family Distribution
Click on any segment to explore detailed analysis
Select a threat family
Click on the chart to view details
Data Exfiltration
5,416 detections • 90.9% confidenceAttempts to extract sensitive information from LLM systems, primarily targeting system prompts, training data hints, and user context.
Techniques Observed
- System prompt extraction via direct questioning
- Encoded extraction attempts (Base64, ROT13)
- Context window manipulation
- "Repeat your instructions" variants
Mitigation
- Implement system prompt protection layers
- Use prompt injection detection before processing
- Monitor for repeated extraction attempts
Jailbreak
3,455 detections • 96.3% confidenceAttempts to bypass safety guidelines and content policies through various manipulation techniques.
Techniques Observed
- DAN (Do Anything Now) variants
- Roleplay scenarios ("Act as a character who...")
- Hypothetical framing ("In a fictional world...")
- Multi-turn crescendo attacks
Mitigation
- Implement multi-turn context analysis
- Deploy escalation detection
- Use confidence-based blocking (>95%)
RAG/Context Attack
2,817 detections • 93.4% confidenceAttacks targeting Retrieval-Augmented Generation systems through document poisoning and context manipulation.
Techniques Observed
- Document injection with hidden instructions
- Context window overflow
- Retrieval manipulation
- Delimiter injection in retrieved content
Mitigation
- Scan all documents before ingestion
- Implement strict content sanitization
- Use separate context windows for user input vs retrieved content
Prompt Injection
2,476 detections • 95.4% confidenceClassic prompt injection attacks attempting to override system instructions or manipulate model behaviour.
Techniques Observed
- Direct instruction override
- Delimiter-based injection
- Context confusion attacks
- Nested instruction attacks
Mitigation
- Input validation and sanitization
- Instruction hierarchy enforcement
- Clear delineation between system and user content
Tool/Command Abuse
2,287 detections • 86.5% confidenceAttacks targeting LLM tool-calling capabilities to execute unintended actions.
Techniques Observed
- Command injection in tool parameters
- Tool chaining for privilege escalation
- Parameter manipulation
- Unintended tool invocation
Mitigation
- Implement strict parameter validation
- Use allowlists for tool capabilities
- Monitor tool call sequences for anomalies
Encoding/Obfuscation
1,979 detections • 95.5% confidenceAttacks using various encoding schemes to bypass detection.
Encoding Types Detected
- Base64
- ROT13
- Unicode manipulation
- Whitespace encoding
- Homoglyph substitution
Mitigation
- Decode all input variants before processing
- Implement multi-layer encoding detection
- Monitor for repeated encoding attempts
Attack Technique Frequency
Hover over bars for confidence scores and details
| Rank | Technique | Count | % of Total | Confidence | Risk |
|---|---|---|---|---|---|
| 1 | Instruction Override | 2,727 | 9.7% | 95.6% | HIGH |
| 2 | Tool/Command Injection | 2,322 | 8.2% | 88.6% | CRITICAL |
| 3 | RAG Poisoning | 2,272 | 8.1% | 93.3% | HIGH |
| 4 | System Prompt Extraction | 2,165 | 7.7% | 96.7% | HIGH |
| 5 | Role/Persona Manipulation | 2,002 | 7.1% | 90.8% | MEDIUM |
| 6 | Encoding/Obfuscation | 1,999 | 7.1% | 93.9% | HIGH |
| 7 | Indirect Injection | 1,954 | 6.9% | 94.8% | HIGH |
| 8 | Tool Abuse | 1,793 | 6.4% | 88.8% | HIGH |
| 9 | Chain-of-Thought Leak | 1,634 | 5.8% | 84.5% | MEDIUM |
Harm Category Analysis
What attackers are trying to achieve with LLM exploitation
Cybersecurity / Malware
Malware generation, exploit development, security bypass techniques, credential harvesting, and infrastructure attacks.
Emerging Threats
New attack patterns targeting agentic AI systems
Inter-Agent Attacks
Attacks targeting multi-agent systems where one LLM communicates with another. Highest confidence scores (97.7%) indicate clear attack signatures.
Attack Patterns
- Poisoned messages between agents
- Agent impersonation
- Recursive attack propagation
- Trust exploitation between agents
Agent Goal Hijacking
Attacks attempting to redirect an autonomous agent's objectives through goal redefinition, priority manipulation, and constraint removal.
Chain-of-Thought Manipulation
Attacks targeting the reasoning process of LLMs through reasoning injection, logic chain poisoning, and intermediate step manipulation.
Recommendations
Actionable guidance based on threat intelligence
Implement Layered Defense
- Combine pattern matching (L1) with ML classification (L2)
- Use confidence thresholds for graduated responses
- Log all detections for analysis
Protect System Prompts
- Never echo system prompts to users
- Implement extraction detection
- Use instruction hierarchy
Secure Tool Integrations
- Validate all tool parameters
- Implement strict allowlists
- Monitor tool call patterns
Handle Multi-Turn Contexts
- Track escalation patterns
- Analyse full conversation context
- Implement session-level risk scoring
Monitor Emerging Patterns
- Inter-agent attacks are growing
- RAG poisoning is significant
- Encoding attacks show sophistication
Implement Confidence-Based Policies
Audit Agent Capabilities
- Review tool permissions
- Implement least-privilege
- Log all agent actions
Assess LLM Deployment Risk
- Inventory all LLM-powered applications
- Identify data exposure risks
- Implement runtime protection
Establish Detection Baselines
Plan for Agentic AI
- Agent attacks are increasing
- Multi-agent systems need authentication
- Goal hijacking is a real risk
Methodology
How RAXE detects and classifies threats
Pattern-Based Detection
- Deterministic rule matching
- Sub-millisecond latency
- 200+ threat patterns
ML Classification
- Gemma-based 5-head classifier
- Voting ensemble with confidence
- Family, technique, harm classification
Enterprise Intelligence Services
AI security consulting, threat intelligence, and agent runtime protection
This report is classified TLP:WHITE for unrestricted public distribution. Our enterprise practice delivers higher-classification intelligence products, security assessments, and consulting services tailored to your AI agent infrastructure and threat landscape.
Public Intelligence
Unlimited distribution
- Weekly threat landscape reports
- Attack technique trend analysis
- OWASP AI Top 10 alignment mapping
- Anonymised detection statistics
- General mitigation frameworks
Community Intelligence
Shareable within your sector and partner network
- Sector-specific threat briefings (FinServ, Healthcare, Tech)
- Detection signature library access
- Emerging jailbreak and injection patterns
- Shared IOC feeds for prompt attacks
- Peer benchmarking and industry comparison
- Monthly analyst briefings
Organisation Intelligence
Restricted to your organisation only
- Custom threat modelling for your AI stack
- Agent security architecture review
- Multi-agent system risk assessment
- RAG and tool chain security audit
- Detection policy development
- Weekly executive threat briefings
- Red team exercise reports
Restricted Advisory
Named recipients only, verbal or secure channel
- Incident response and active threat support
- Zero-day vulnerability disclosure (pre-embargo)
- Threat actor attribution and profiling
- Agent compromise forensics
- Dedicated analyst team (24/7 on-call)
- Board-level strategic briefings
- Custom wargaming and tabletop exercises
Agent Security Assessment
Comprehensive threat model of your AI agent infrastructure, including tool chains, memory systems, and inter-agent communication.
Red Team Exercises
Adversarial testing using MITRE ATLAS techniques: prompt injection, jailbreaks, goal hijacking, privilege escalation, and data exfiltration.
Compliance Mapping
Accelerated path to ISO 42001, NIST AI RMF, and EU AI Act compliance with pre-built evidence and control documentation.
Runtime Protection
Managed detection and response for production AI agents. 514 detection rules, ML classification, and 24/7 monitoring.
Ready to secure your AI agents?
Our team specialises in LLM security, agentic AI protection, and multi-agent system defence. We bring deep expertise in prompt injection, jailbreak prevention, and agent runtime security.
enterprise@raxe.aiProtect Your AI Applications
Deploy RAXE to detect and block these threats in real-time with <10ms latency.