Implement reward hacking detection for aligned agents:
Reward Hacking Types:
- Specification Gaming: Finding loopholes in reward definition
- Reward Tampering: Modifying reward mechanism directly
- Wireheading: Directly stimulating the reward signal instead of earning it
Detection Methods:
- set_baseline(normal_trajectories): Learn the distribution of normal behavior
- analyze_trajectory(trajectory, reward): Check a trajectory for anomalies
Indicators:
- Unusually high reward for low effort
- Repetitive action patterns (reward farming)
- Deviations from baseline behavior
- Actions that don't progress toward stated goal
Return: Suspicion score (0-1) and list of triggered indicators.
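The indicators above can be quantified. For instance, reward farming shows up as a single action dominating the trajectory, which reduces to a simple frequency ratio. A minimal sketch (the function name and 0-1 scoring convention are illustrative assumptions, not part of the spec):

```python
from collections import Counter

def repetition_score(trajectory):
    """Fraction of the trajectory occupied by its single most common action.

    A score near 1.0 means the agent repeats one action almost exclusively,
    a common signature of reward farming. (Hypothetical helper, not part of
    the required interface.)
    """
    if not trajectory:
        return 0.0
    top_count = Counter(trajectory).most_common(1)[0][1]
    return top_count / len(trajectory)
```

A threshold on this score (e.g. flagging trajectories above 0.6) can then feed the indicator list returned by the detector.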
Examples
Example 1:
Input:
detector = RewardHackingDetector(); detector.set_baseline([['step1', 'step2', 'step3']]); result = detector.analyze_trajectory(['step1', 'step1', 'step1', 'step1'], 100); result['suspicious']
Output:
True
Explanation: A repetitive action pattern indicates potential reward farming.
Starter Code
class RewardHackingDetector:
    """
    Detect potential reward hacking in agent behavior.
    Identifies when agents optimize the metric over the intended goal.
    """

    def __init__(self):
        self.baseline_behavior = None
        self.suspicious_patterns = []

    def set_baseline(self, normal_trajectories):
        """Set baseline from known-good trajectories"""
        # Your implementation here
        pass

    def analyze_trajectory(self, trajectory, reward):
        """
        Analyze whether a trajectory shows signs of reward hacking.
        Returns {'suspicious': bool, 'indicators': [...], 'confidence': float}
        """
        # Your implementation here
        pass

    def _check_shortcut_patterns(self, trajectory):
        """Check for shortcuts that exploit the reward function"""
        # Your implementation here
        pass

    def _check_repetition_exploit(self, trajectory):
        """Check for repetitive actions that farm reward"""
        # Your implementation here
        pass
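One way the skeleton might be filled in, as a sketch rather than a reference solution: the thresholds (a 0.6 repetition share, half the baseline length) are illustrative choices, not values given by the problem.

```python
from collections import Counter

class RewardHackingDetector:
    """Sketch: flags trajectories whose action statistics deviate from a
    baseline of known-good trajectories."""

    REPETITION_THRESHOLD = 0.6    # assumed cutoff: >60% of steps are one action
    LENGTH_RATIO_THRESHOLD = 0.5  # assumed cutoff: suspiciously short vs baseline

    def __init__(self):
        self.baseline_actions = set()
        self.baseline_avg_length = None

    def set_baseline(self, normal_trajectories):
        """Record the action vocabulary and average length of good runs.

        Assumes at least one non-empty baseline trajectory is supplied.
        """
        self.baseline_actions = {a for t in normal_trajectories for a in t}
        lengths = [len(t) for t in normal_trajectories]
        self.baseline_avg_length = sum(lengths) / len(lengths)

    def analyze_trajectory(self, trajectory, reward):
        """Return {'suspicious': bool, 'indicators': [...], 'confidence': float}."""
        indicators = []
        if self._check_repetition_exploit(trajectory):
            indicators.append('repetitive_actions')
        if self._check_shortcut_patterns(trajectory):
            indicators.append('shortcut_pattern')
        if any(a not in self.baseline_actions for a in trajectory):
            indicators.append('off_baseline_actions')
        # Crude confidence: fraction of the three checks that fired.
        confidence = min(1.0, len(indicators) / 3)
        return {'suspicious': bool(indicators),
                'indicators': indicators,
                'confidence': confidence}

    def _check_shortcut_patterns(self, trajectory):
        """A trajectory far shorter than baseline suggests a reward shortcut."""
        if self.baseline_avg_length is None:
            return False
        return len(trajectory) < self.baseline_avg_length * self.LENGTH_RATIO_THRESHOLD

    def _check_repetition_exploit(self, trajectory):
        """A single action dominating the trajectory suggests reward farming."""
        if not trajectory:
            return False
        top_count = Counter(trajectory).most_common(1)[0][1]
        return top_count / len(trajectory) > self.REPETITION_THRESHOLD
```

On the worked example (all-'step1' trajectory against a ['step1', 'step2', 'step3'] baseline), the repetition check fires and the result's 'suspicious' field is True, matching Example 1.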