Implement reward hacking detection for aligned agents:
Reward Hacking Types:
- Specification Gaming: Finding loopholes in reward definition
- Reward Tampering: Modifying reward mechanism directly
- Wireheading: Directly stimulating the reward signal instead of earning it
Detection Methods:
- set_baseline(normal_trajectories): Learn the distribution of normal behavior
- analyze_trajectory(trajectory, reward): Check a trajectory for anomalies
Indicators:
- Unusually high reward for low effort
- Repetitive action patterns (reward farming)
- Deviations from baseline behavior
- Actions that don't progress toward stated goal
Return: Suspicion score (0-1) and list of triggered indicators.
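The indicators above can be quantified. For instance, reward farming shows up as a single action dominating the trajectory, which reduces to a simple frequency ratio. A minimal sketch (the function name and 0-1 scoring convention are illustrative assumptions, not part of the spec):

```python
from collections import Counter

def repetition_score(trajectory):
    """Fraction of the trajectory occupied by its single most common action.

    A score near 1.0 means the agent repeats one action almost exclusively,
    a common signature of reward farming. (Hypothetical helper, not part of
    the required interface.)
    """
    if not trajectory:
        return 0.0
    top_count = Counter(trajectory).most_common(1)[0][1]
    return top_count / len(trajectory)
```

A threshold on this score (e.g. flagging trajectories above 0.6) can then feed the indicator list returned by the detector.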
Examples
Example 1:
Input:
detector = RewardHackingDetector(); detector.set_baseline([['step1', 'step2', 'step3']]); result = detector.analyze_trajectory(['step1', 'step1', 'step1', 'step1'], 100); result['suspicious']
Output:
True
Explanation: A repetitive action pattern indicates potential reward farming.
Starter Code
class RewardHackingDetector:
    """
    Detect potential reward hacking in agent behavior.
    Identifies when agents optimize the metric over the intended goal.
    """

    def __init__(self):
        self.baseline_behavior = None
        self.suspicious_patterns = []

    def set_baseline(self, normal_trajectories):
        """Set baseline from known-good trajectories"""
        # Your implementation here
        pass

    def analyze_trajectory(self, trajectory, reward):
        """
        Analyze whether a trajectory shows signs of reward hacking.
        Returns {'suspicious': bool, 'indicators': [...], 'confidence': float}
        """
        # Your implementation here
        pass

    def _check_shortcut_patterns(self, trajectory):
        """Check for shortcuts that exploit the reward function"""
        # Your implementation here
        pass

    def _check_repetition_exploit(self, trajectory):
        """Check for repetitive actions that farm reward"""
        # Your implementation here
        pass
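One way the skeleton might be filled in, as a sketch rather than a reference solution: the thresholds (a 0.6 repetition share, half the baseline length) are illustrative choices, not values given by the problem.

```python
from collections import Counter

class RewardHackingDetector:
    """Sketch: flags trajectories whose action statistics deviate from a
    baseline of known-good trajectories."""

    REPETITION_THRESHOLD = 0.6    # assumed cutoff: >60% of steps are one action
    LENGTH_RATIO_THRESHOLD = 0.5  # assumed cutoff: suspiciously short vs baseline

    def __init__(self):
        self.baseline_actions = set()
        self.baseline_avg_length = None

    def set_baseline(self, normal_trajectories):
        """Record the action vocabulary and average length of good runs.

        Assumes at least one non-empty baseline trajectory is supplied.
        """
        self.baseline_actions = {a for t in normal_trajectories for a in t}
        lengths = [len(t) for t in normal_trajectories]
        self.baseline_avg_length = sum(lengths) / len(lengths)

    def analyze_trajectory(self, trajectory, reward):
        """Return {'suspicious': bool, 'indicators': [...], 'confidence': float}."""
        indicators = []
        if self._check_repetition_exploit(trajectory):
            indicators.append('repetitive_actions')
        if self._check_shortcut_patterns(trajectory):
            indicators.append('shortcut_pattern')
        if any(a not in self.baseline_actions for a in trajectory):
            indicators.append('off_baseline_actions')
        # Crude confidence: fraction of the three checks that fired.
        confidence = min(1.0, len(indicators) / 3)
        return {'suspicious': bool(indicators),
                'indicators': indicators,
                'confidence': confidence}

    def _check_shortcut_patterns(self, trajectory):
        """A trajectory far shorter than baseline suggests a reward shortcut."""
        if self.baseline_avg_length is None:
            return False
        return len(trajectory) < self.baseline_avg_length * self.LENGTH_RATIO_THRESHOLD

    def _check_repetition_exploit(self, trajectory):
        """A single action dominating the trajectory suggests reward farming."""
        if not trajectory:
            return False
        top_count = Counter(trajectory).most_common(1)[0][1]
        return top_count / len(trajectory) > self.REPETITION_THRESHOLD
```

On the worked example (all-'step1' trajectory against a ['step1', 'step2', 'step3'] baseline), the repetition check fires and the result's 'suspicious' field is True, matching Example 1.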