Dr. GRPO: Complete Objective Function

Medium
Reinforcement Learning

Implement the complete Dr. GRPO (GRPO Done Right) objective function for reinforcement learning with large language models. Dr. GRPO fixes two biases in GRPO: (1) a response-level length bias caused by the 1/|o_i| normalization, and (2) a question-level difficulty bias caused by dividing advantages by the reward standard deviation. The Dr. GRPO objective instead uses unbiased advantages (each reward minus the group mean) and sums token-level clipped importance-ratio terms over all tokens of each response, averaging only over the G responses. Given log probabilities from the new and old policies, per-response rewards, and a clipping parameter epsilon, compute the Dr. GRPO objective value.
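The description above can be written compactly (mean-only advantage baseline, per-token clipped surrogate summed over tokens, averaged over the G responses):

```latex
\mathcal{J}_{\text{Dr.GRPO}}(\theta)
  = \frac{1}{G}\sum_{i=1}^{G}\sum_{t=1}^{|o_i|}
    \min\!\Big( r_{i,t}(\theta)\,\hat{A}_i,\;
                \operatorname{clip}\big(r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_i \Big)
\quad\text{where}\quad
r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t}\mid q,\, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t}\mid q,\, o_{i,<t})},
\qquad
\hat{A}_i = R(q, o_i) - \operatorname{mean}\big(\{R(q, o_j)\}_{j=1}^{G}\big)
```

Note there is no 1/|o_i| inside the outer sum and no standard deviation in the advantage, which is exactly what distinguishes Dr. GRPO from GRPO.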

Examples

Example 1:
Input: log_probs_new=[[-0.2, -0.3], [-0.1, -0.4]], log_probs_old=[[-0.5, -0.6], [-0.4, -0.7]], rewards=[1.0, 0.0], epsilon=0.2
Output: -0.074929
Explanation: Step 1: Compute advantages without std normalization. Mean reward = 0.5, so Â_1 = 1.0 - 0.5 = 0.5 and Â_2 = 0.0 - 0.5 = -0.5. Step 2: For response 1, both tokens have ratio = exp(-0.2 - (-0.5)) = exp(0.3) ≈ 1.3499, so each token term is min(1.3499 × 0.5, 1.2 × 0.5) = 0.6 (the clipped branch wins). Response 1 total = 1.2. Step 3: For response 2, both tokens also have ratio ≈ 1.3499, so each token term is min(1.3499 × (-0.5), 1.2 × (-0.5)) ≈ -0.6749 (the unclipped branch wins, since the advantage is negative). Response 2 total ≈ -1.3499. Step 4: Average over G = 2 responses, without dividing by response length: (1.2 - 1.3499)/2 ≈ -0.074929.
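The arithmetic above can be checked directly. Every new-minus-old log-prob gap in this example happens to equal 0.3, so all four token ratios are exp(0.3):

```python
import numpy as np

# All four token importance ratios equal exp(0.3) ~ 1.3499, because every
# new-minus-old log-prob difference in Example 1 is 0.3.
r = np.exp(0.3)

# Response 1 (advantage +0.5): the clipped branch 1.2 * 0.5 = 0.6 is smaller.
resp1 = 2 * min(r * 0.5, 1.2 * 0.5)    # 2 tokens -> 1.2

# Response 2 (advantage -0.5): the unclipped branch r * -0.5 is smaller.
resp2 = 2 * min(r * -0.5, 1.2 * -0.5)  # 2 tokens -> about -1.3499

# Average over the G = 2 responses (no division by response length).
objective = (resp1 + resp2) / 2
print(round(objective, 6))  # -0.074929
```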

Starter Code

import numpy as np

def compute_dr_grpo_objective(log_probs_new: list[list[float]], 
                               log_probs_old: list[list[float]], 
                               rewards: list[float], 
                               epsilon: float = 0.2) -> float:
	"""
	Compute the Dr. GRPO (GRPO Done Right) clipped objective.
	
	Args:
		log_probs_new: Log probabilities from new policy π_θ
		              Each response: [log π_θ(o_1|q), log π_θ(o_2|q,o_1), ...]
		log_probs_old: Log probabilities from old policy π_θ_old
		rewards: Rewards R(q, o_i) for each response
		epsilon: Clipping parameter for importance ratios
	
	Returns:
		Dr. GRPO objective value
	"""
	# Your code here
	pass
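One possible reference solution, sketched with NumPy under the definitions above (token-level min of unclipped and clipped terms, mean-only advantages, sum over tokens, average over responses):

```python
import numpy as np

def compute_dr_grpo_objective(log_probs_new: list[list[float]],
                              log_probs_old: list[list[float]],
                              rewards: list[float],
                              epsilon: float = 0.2) -> float:
    rewards_arr = np.asarray(rewards, dtype=float)
    # Unbiased advantages: reward minus group mean, with NO std division
    # (this removes GRPO's question-level difficulty bias).
    advantages = rewards_arr - rewards_arr.mean()

    total = 0.0
    for lp_new, lp_old, adv in zip(log_probs_new, log_probs_old, advantages):
        # Token-level importance ratios pi_new / pi_old.
        ratios = np.exp(np.asarray(lp_new) - np.asarray(lp_old))
        clipped = np.clip(ratios, 1.0 - epsilon, 1.0 + epsilon)
        # Sum over tokens with NO 1/|o_i| normalization
        # (this removes GRPO's response-level length bias).
        total += np.sum(np.minimum(ratios * adv, clipped * adv))

    # Average over the G responses only.
    return total / len(rewards)

print(round(compute_dr_grpo_objective(
    [[-0.2, -0.3], [-0.1, -0.4]],
    [[-0.5, -0.6], [-0.4, -0.7]],
    [1.0, 0.0],
    0.2), 6))  # -0.074929
```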