Tool Misuse Detection in AI Agents
Agents can misuse tools: rate limit evasion, prompt injection, scope creep, data exfiltration.
Task
Build a ToolMisuseDetector class that detects:
- Rate limit violations: Tool called too frequently.
- Blocked patterns: Args matching disallowed patterns.
- Prompt injection: Tool args containing LLM instructions (Ignore previous, Act as).
- Scope creep: Agent using tools outside its authorized set.
- Data exfiltration: Large data writes to external destinations.
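The prompt injection check could be sketched as a case-insensitive phrase scan over every string in the tool arguments. This is only an illustration: `INJECTION_PATTERNS`, `iter_strings`, and `find_injection` are hypothetical names, and the phrase list contains just the two phrases named above. A phrase such as "act as" also occurs in benign text, so a real rule set would need tuning against the false positive target.

```python
import re
from typing import Any, Iterator, Optional

# Assumed phrase list; only "Ignore previous" and "Act as" come from the task text.
INJECTION_PATTERNS = [
    re.compile(r"\bignore\s+(all\s+)?previous\b", re.IGNORECASE),
    re.compile(r"\bact\s+as\b", re.IGNORECASE),
]


def iter_strings(value: Any) -> Iterator[str]:
    """Yield every string found in a nested dict/list/tuple structure."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from iter_strings(item)
    elif isinstance(value, (list, tuple)):
        for item in value:
            yield from iter_strings(item)


def find_injection(args: dict) -> Optional[str]:
    """Return the first matched phrase, or None if no pattern matches."""
    for text in iter_strings(args):
        for pattern in INJECTION_PATTERNS:
            match = pattern.search(text)
            if match:
                return match.group(0)
    return None
```

Walking nested values matters because tool arguments are often structured (lists of messages, nested option dicts), and a scan over top-level strings alone would miss them.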
Non-Functional Requirements
- Detection latency < 10ms per call.
- False positive rate < 5%.
- All alerts include severity and recommended action.
- Detection rules must be configurable at runtime.
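One way to satisfy the last two requirements together is to keep rules in a mutable store that checks read on every call, so a change takes effect without restarting the detector. The sketch below applies that idea to scope creep. `RuntimeRules`, `check_scope`, the severity-to-action mapping, and the "high" severity for scope creep are all assumptions, not part of the starter code; agents with no entry are treated as having no authorized tools.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple


@dataclass
class RuntimeRules:
    """Mutable rule store; all names here are illustrative."""
    # agent_id -> tools the agent may call (hypothetical scope table)
    allowed_tools: Dict[str, Set[str]] = field(default_factory=dict)
    # severity -> default recommended action (assumed mapping;
    # critical -> block mirrors Example 1)
    action_for: Dict[str, str] = field(default_factory=lambda: {
        "low": "allow", "medium": "warn", "high": "block", "critical": "block",
    })

    def authorize(self, agent_id: str, tool: str) -> None:
        self.allowed_tools.setdefault(agent_id, set()).add(tool)

    def revoke(self, agent_id: str, tool: str) -> None:
        self.allowed_tools.get(agent_id, set()).discard(tool)


def check_scope(rules: RuntimeRules, agent_id: str, tool: str) -> Optional[Tuple[str, str]]:
    """Return (severity, recommended_action) if the tool is out of scope, else None."""
    if tool in rules.allowed_tools.get(agent_id, set()):
        return None
    severity = "high"  # assumed severity for scope creep
    return severity, rules.action_for[severity]
```

Because `check_scope` looks up both the scope table and the action mapping at call time, editing either one changes the outcome of the very next check.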
Constraints
- Rate limit window: sliding window (not fixed).
- Prompt injection: detect common jailbreak phrases.
- Exfiltration: flag any write of >10KB to external URLs.
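The sliding window and exfiltration constraints can be illustrated with a minimal sketch. It assumes limits are tracked per (agent, tool) pair, that ">10KB" means more than 10 * 1024 UTF-8 bytes, and that "external" means any URL whose host is not in a small allowlist. `SlidingWindowLimiter`, `is_exfiltration`, and `INTERNAL_HOSTS` are hypothetical names, and the task does not say which tool argument carries the URL or the payload, so both are passed in explicitly.

```python
from collections import deque
from typing import Deque, Dict, Tuple
from urllib.parse import urlparse

EXFIL_THRESHOLD_BYTES = 10 * 1024  # ">10KB", read as 10 * 1024 bytes (assumption)
INTERNAL_HOSTS = {"localhost", "127.0.0.1"}  # hypothetical allowlist of internal hosts


class SlidingWindowLimiter:
    """Tracks call timestamps per (agent, tool) over a sliding window."""

    def __init__(self, max_calls: int, window_s: float) -> None:
        self.max_calls = max_calls
        self.window_s = window_s
        self._calls: Dict[Tuple[str, str], Deque[float]] = {}

    def record(self, agent_id: str, tool: str, now: float) -> bool:
        """Record a call at time `now`; return True if it exceeds the limit."""
        stamps = self._calls.setdefault((agent_id, tool), deque())
        # Evict timestamps that have fallen out of the window ending at `now`.
        while stamps and now - stamps[0] >= self.window_s:
            stamps.popleft()
        stamps.append(now)
        return len(stamps) > self.max_calls


def is_exfiltration(url: str, payload: str) -> bool:
    """Flag payloads over the threshold that are sent to a non-internal host."""
    host = urlparse(url).hostname
    if host is None or host in INTERNAL_HOSTS:
        return False  # local path or internal host: not treated as external
    return len(payload.encode("utf-8")) > EXFIL_THRESHOLD_BYTES
```

With `max_calls=2` and `window_s=10.0`, calls at t=1.0, 2.0, and 10.5 are flagged on the third call because all three fall within one 10-second span. A fixed window that resets at t=10 would count the third call alone and miss the burst.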
Examples
Example 1:
Input:
detector.add_blocked_pattern('sql_query', 'DROP TABLE')
call = ToolCall('c1', 'sql_query', {'query': 'DROP TABLE users'}, None, time.time(), 'agent1', 'run1')
detector.analyze_call(call)
Output:
[MisuseAlert(misuse_type='blocked_pattern', severity='critical', recommended_action='block')]
Explanation: DROP TABLE matches blocked pattern; critical severity SQL injection.
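The matching step behind Example 1 might look like the sketch below. It treats each blocked pattern as a case-insensitive literal substring, which is an assumption: the task does not say whether patterns are literals or regular expressions. `first_blocked_match` is a hypothetical helper, not part of the starter code.

```python
import re
from typing import Dict, List, Optional


def first_blocked_match(
    blocked: Dict[str, List[str]], tool: str, args: Dict[str, object]
) -> Optional[str]:
    """Return the first blocked pattern found in any argument of `tool`, else None."""
    for pattern in blocked.get(tool, []):
        # re.escape makes the pattern a literal substring match (assumption).
        regex = re.compile(re.escape(pattern), re.IGNORECASE)
        for value in args.values():
            if regex.search(str(value)):
                return pattern
    return None


# Mirrors Example 1: one blocked pattern registered for the sql_query tool.
blocked = {"sql_query": ["DROP TABLE"]}
match = first_blocked_match(blocked, "sql_query", {"query": "DROP TABLE users"})
```

Patterns are keyed by tool name, so the same string passed to a different tool is not flagged unless that tool has its own entry.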
Starter Code
from typing import Dict, List, Any, Optional, Tuple
from dataclasses import dataclass, field
from datetime import datetime
import re

@dataclass
class ToolCall:
    call_id: str
    tool_name: str
    args: Dict
    result: Any
    timestamp: float
    agent_id: str
    run_id: str

@dataclass
class MisuseAlert:
    alert_id: str
    call_id: str
    misuse_type: str
    severity: str  # low|medium|high|critical
    description: str
    recommended_action: str  # allow|warn|block|terminate

class ToolMisuseDetector:
    def __init__(self):
        self.call_history: List[ToolCall] = []
        self.alerts: List[MisuseAlert] = []
        self.rate_limits: Dict[str, Tuple[int, float]] = {}  # tool -> (max_calls, window_s)
        self.blocked_patterns: Dict[str, List[str]] = {}  # tool -> [arg patterns]

    def configure_rate_limit(self, tool: str, max_calls: int, window_s: float) -> None:
        pass

    def add_blocked_pattern(self, tool: str, arg_pattern: str) -> None:
        pass

    def analyze_call(self, call: ToolCall) -> List[MisuseAlert]:
        # TODO: Run all detection checks, return alerts
        pass

    def _check_rate_limit(self, call: ToolCall) -> Optional[MisuseAlert]:
        pass

    def _check_blocked_patterns(self, call: ToolCall) -> Optional[MisuseAlert]:
        pass

    def _check_prompt_injection(self, call: ToolCall) -> Optional[MisuseAlert]:
        # TODO: Detect if tool args contain LLM instruction injection
        pass

    def _check_scope_creep(self, call: ToolCall) -> Optional[MisuseAlert]:
        # TODO: Detect if agent is using tools beyond its authorized scope
        pass

    def _check_exfiltration(self, call: ToolCall) -> Optional[MisuseAlert]:
        # TODO: Detect potential data exfiltration patterns
        pass