Implement the complete Dr. GRPO (GRPO Done Right) objective function for reinforcement learning with large language models. Dr. GRPO fixes two critical biases in GRPO: (1) response-level length bias from the 1/|o_i| normalization, and (2) question-level difficulty bias from the standard-deviation normalization. The objective uses unbiased advantages (reward minus the group mean) and computes token-level clipped importance ratios summed over all tokens. Given log probabilities from the new and old policies, rewards, and a clipping parameter epsilon, compute the Dr. GRPO objective value.
Examples
Example 1:
Input:
log_probs_new=[[-0.2, -0.3], [-0.1, -0.4]], log_probs_old=[[-0.5, -0.6], [-0.4, -0.7]], rewards=[1.0, 0.0], epsilon=0.2
Output:
-0.074929
Explanation:
Step 1: Compute advantages. Mean reward = 0.5, so Â_1 = 0.5, Â_2 = -0.5.
Step 2: Response 1, token 1: ratio = exp(-0.2 - (-0.5)) = exp(0.3) ≈ 1.35; obj = min(1.35·0.5, 1.2·0.5) = 0.6. Token 2: ratio = exp(-0.3 - (-0.6)) ≈ 1.35; obj = 0.6. Response 1 total = 1.2.
Step 3: Response 2, token 1: ratio = exp(-0.1 - (-0.4)) ≈ 1.35; obj = min(1.35·(-0.5), 1.2·(-0.5)) = -0.675. Token 2: ratio = exp(-0.4 - (-0.7)) ≈ 1.35; obj = -0.675. Response 2 total = -1.35.
Step 4: Average over G = 2 responses: (1.2 + (-1.35))/2 = -0.075 (-0.074929 without intermediate rounding).
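The step-by-step arithmetic can be checked numerically with a quick standalone sketch (not part of the starter code below):

```python
import math

log_new = [[-0.2, -0.3], [-0.1, -0.4]]
log_old = [[-0.5, -0.6], [-0.4, -0.7]]
rewards = [1.0, 0.0]
eps = 0.2

# Step 1: unbiased advantages = reward minus group mean (no std division).
mean_r = sum(rewards) / len(rewards)          # 0.5
advantages = [r - mean_r for r in rewards]    # [0.5, -0.5]

# Steps 2-3: token-level clipped objective, summed over all tokens.
total = 0.0
for lp_new, lp_old, adv in zip(log_new, log_old, advantages):
    for ln, lo in zip(lp_new, lp_old):
        ratio = math.exp(ln - lo)             # exp(0.3) ≈ 1.35 for every token here
        clipped = max(1 - eps, min(1 + eps, ratio))
        total += min(ratio * adv, clipped * adv)

# Step 4: average over the G = 2 responses.
objective = total / len(rewards)
print(round(objective, 6))  # -0.074929
```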
Starter Code
import numpy as np

def compute_dr_grpo_objective(log_probs_new: list[list[float]],
                              log_probs_old: list[list[float]],
                              rewards: list[float],
                              epsilon: float = 0.2) -> float:
    """
    Compute the Dr. GRPO (GRPO Done Right) clipped objective.

    Args:
        log_probs_new: Log probabilities from the new policy π_θ.
            Each response: [log π_θ(o_1|q), log π_θ(o_2|q,o_1), ...]
        log_probs_old: Log probabilities from the old policy π_θ_old.
        rewards: Rewards R(q, o_i) for each response.
        epsilon: Clipping parameter for the importance ratios.

    Returns:
        Dr. GRPO objective value.
    """
    # Your code here
    pass
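One possible completion of the starter code, following the worked example above (a sketch, not necessarily an official solution):

```python
import numpy as np

def compute_dr_grpo_objective(log_probs_new: list[list[float]],
                              log_probs_old: list[list[float]],
                              rewards: list[float],
                              epsilon: float = 0.2) -> float:
    # Unbiased advantages: reward minus group mean, with no std normalization
    # (removing GRPO's question-level difficulty bias).
    rewards_arr = np.asarray(rewards, dtype=float)
    advantages = rewards_arr - rewards_arr.mean()

    total = 0.0
    for lp_new, lp_old, adv in zip(log_probs_new, log_probs_old, advantages):
        # Token-level importance ratios; tokens are summed with no 1/|o_i|
        # normalization (removing the response-level length bias).
        ratios = np.exp(np.asarray(lp_new) - np.asarray(lp_old))
        clipped = np.clip(ratios, 1.0 - epsilon, 1.0 + epsilon)
        total += np.minimum(ratios * adv, clipped * adv).sum()

    # Average over the G responses in the group.
    return total / len(rewards)
```

On Example 1 this returns approximately -0.074929, matching the expected output.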