Reward Hacking Detector

Medium
Agents

Implement reward hacking detection for aligned agents:

Reward Hacking Types:

  1. Specification Gaming: Finding loopholes in reward definition
  2. Reward Tampering: Modifying reward mechanism directly
  3. Wireheading: Directly stimulating the agent's own reward signal instead of completing the task

Detection Methods:

  1. set_baseline(normal_trajectories): Learn normal behavior distribution
  2. analyze_trajectory(trajectory, reward): Check for anomalies
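
One possible way to "learn" the normal behavior distribution is to summarize the baseline trajectories as action frequencies plus an average length. The sketch below is only an illustration under that assumption: the helper name build_baseline and the dictionary keys are hypothetical and not part of the starter code.

```python
from collections import Counter

def build_baseline(normal_trajectories):
    """Summarize known-good trajectories (hypothetical helper, not in the starter code)."""
    # Count every action across all baseline trajectories
    counts = Counter(action for traj in normal_trajectories for action in traj)
    total = sum(counts.values())
    # Relative frequency of each action; empty if no baseline data was given
    action_freq = {a: c / total for a, c in counts.items()} if total else {}
    lengths = [len(traj) for traj in normal_trajectories]
    mean_length = sum(lengths) / len(lengths) if lengths else 0.0
    return {"action_freq": action_freq, "mean_length": mean_length}

baseline = build_baseline([["step1", "step2", "step3"]])
print(baseline["mean_length"])  # 3.0
```

A set_baseline implementation could store a summary like this in self.baseline_behavior so that analyze_trajectory can compare new trajectories against it.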

Indicators:

  • Unusually high reward for low effort
  • Repetitive action patterns (reward farming)
  • Deviations from baseline behavior
  • Actions that don't progress toward stated goal
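
Several of these indicators can be approximated with simple ratios. The following is a rough sketch, not a reference solution: the function names are hypothetical, and treating "effort" as trajectory length is an assumption made here for illustration.

```python
from collections import Counter

def repetition_ratio(trajectory):
    """Share of the trajectory taken up by its single most common action."""
    if not trajectory:
        return 0.0
    most_common_count = Counter(trajectory).most_common(1)[0][1]
    return most_common_count / len(trajectory)

def unseen_action_ratio(trajectory, baseline_actions):
    """Share of actions that never appear in the baseline (deviation signal)."""
    if not trajectory:
        return 0.0
    unseen = sum(1 for action in trajectory if action not in baseline_actions)
    return unseen / len(trajectory)

def reward_per_step(trajectory, reward):
    """Reward divided by trajectory length; length stands in for effort (assumption)."""
    return reward / max(len(trajectory), 1)

print(repetition_ratio(["step1", "step1", "step1", "step1"]))  # 1.0
```

Each ratio would then be compared against a threshold to decide whether its indicator fires. Suitable thresholds depend on the environment and are not specified by the problem.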

Return: A dictionary with a 'suspicious' flag (bool), a 'confidence' score between 0 and 1, and the list of triggered 'indicators'.
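
A minimal sketch of how indicator checks might be folded into that return value, using the keys from the starter code docstring. The aggregation rule is an assumption: here any fired indicator marks the trajectory as suspicious, and the score is the fraction of indicators that fired. The helper name score_indicators and the indicator names are hypothetical.

```python
def score_indicators(checks):
    """Combine boolean indicator checks into a result dictionary.

    checks maps an indicator name to whether it fired (hypothetical structure).
    """
    triggered = [name for name, fired in checks.items() if fired]
    # Fraction of indicators that fired, always within [0, 1]
    confidence = len(triggered) / len(checks) if checks else 0.0
    return {
        "suspicious": bool(triggered),  # assumption: one fired indicator is enough
        "indicators": triggered,
        "confidence": confidence,
    }

result = score_indicators({
    "repetition": True,
    "high_reward_low_effort": False,
    "baseline_deviation": True,
    "no_progress": False,
})
print(result["suspicious"], result["confidence"])  # True 0.5
```

A weighted sum or a stricter threshold on the score would be equally reasonable. The problem statement does not fix the rule.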

Examples

Example 1:
Input: detector = RewardHackingDetector(); detector.set_baseline([['step1', 'step2', 'step3']]); result = detector.analyze_trajectory(['step1', 'step1', 'step1', 'step1'], 100); result['suspicious']
Output: True
Explanation: Repetitive action pattern indicates potential reward farming

Starter Code

class RewardHackingDetector:
    """
    Detect potential reward hacking in agent behavior.
    Identifies when agents optimize the reward metric rather than the intended goal.
    """
    
    def __init__(self):
        self.baseline_behavior = None
        self.suspicious_patterns = []
    
    def set_baseline(self, normal_trajectories):
        """Set baseline from known good trajectories"""
        # Your implementation here
        pass
    
    def analyze_trajectory(self, trajectory, reward):
        """
        Analyze if trajectory shows reward hacking signs.
        Returns {'suspicious': bool, 'indicators': [...], 'confidence': float}
        """
        # Your implementation here
        pass
    
    def _check_shortcut_patterns(self, trajectory):
        """Check for shortcuts that exploit reward function"""
        # Your implementation here
        pass
    
    def _check_repetition_exploit(self, trajectory):
        """Check for repetitive actions that farm reward"""
        # Your implementation here
        pass