Implement the complete Dr. GRPO (GRPO Done Right) objective function for reinforcement learning with large language models. Dr. GRPO fixes two critical biases in GRPO: (1) response-level length bias from the 1/|o_i| normalization, and (2) question-level difficulty bias from the standard-deviation normalization. The objective uses unbiased advantages (reward minus the group mean) and computes token-level clipped importance ratios summed over all tokens. Given log probabilities from the new and old policies, rewards, and a clipping parameter epsilon, compute the Dr. GRPO objective value.
Examples
Example 1:
Input:
log_probs_new=[[-0.2, -0.3], [-0.1, -0.4]], log_probs_old=[[-0.5, -0.6], [-0.4, -0.7]], rewards=[1.0, 0.0], epsilon=0.2
Output:
-0.074929
Explanation:
Step 1: Compute advantages. Mean reward = 0.5, so Â_1 = 0.5, Â_2 = -0.5.
Step 2: Response 1, token 1: ratio = exp(-0.2 - (-0.5)) = exp(0.3) ≈ 1.35; obj = min(1.35·0.5, 1.2·0.5) = 0.6. Token 2: ratio = exp(-0.3 - (-0.6)) ≈ 1.35; obj = 0.6. Response 1 total = 1.2.
Step 3: Response 2, token 1: ratio = exp(-0.1 - (-0.4)) ≈ 1.35; obj = min(1.35·(-0.5), 1.2·(-0.5)) = -0.675. Token 2: ratio = exp(-0.4 - (-0.7)) ≈ 1.35; obj = -0.675. Response 2 total = -1.35.
Step 4: Average over G = 2 responses: (1.2 + (-1.35))/2 = -0.075 (-0.074929 without intermediate rounding).
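The step-by-step arithmetic can be checked numerically with a quick standalone sketch (not part of the starter code below):

```python
import math

log_new = [[-0.2, -0.3], [-0.1, -0.4]]
log_old = [[-0.5, -0.6], [-0.4, -0.7]]
rewards = [1.0, 0.0]
eps = 0.2

# Step 1: unbiased advantages = reward minus group mean (no std division).
mean_r = sum(rewards) / len(rewards)          # 0.5
advantages = [r - mean_r for r in rewards]    # [0.5, -0.5]

# Steps 2-3: token-level clipped objective, summed over all tokens.
total = 0.0
for lp_new, lp_old, adv in zip(log_new, log_old, advantages):
    for ln, lo in zip(lp_new, lp_old):
        ratio = math.exp(ln - lo)             # exp(0.3) ≈ 1.35 for every token here
        clipped = max(1 - eps, min(1 + eps, ratio))
        total += min(ratio * adv, clipped * adv)

# Step 4: average over the G = 2 responses.
objective = total / len(rewards)
print(round(objective, 6))  # -0.074929
```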
Starter Code
import numpy as np

def compute_dr_grpo_objective(log_probs_new: list[list[float]],
                              log_probs_old: list[list[float]],
                              rewards: list[float],
                              epsilon: float = 0.2) -> float:
    """
    Compute the Dr. GRPO (GRPO Done Right) clipped objective.

    Args:
        log_probs_new: Log probabilities from the new policy π_θ.
            Each response: [log π_θ(o_1|q), log π_θ(o_2|q,o_1), ...]
        log_probs_old: Log probabilities from the old policy π_θ_old.
        rewards: Rewards R(q, o_i) for each response.
        epsilon: Clipping parameter for the importance ratios.

    Returns:
        Dr. GRPO objective value.
    """
    # Your code here
    pass
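One possible completion of the starter code, following the worked example above (a sketch, not necessarily an official solution):

```python
import numpy as np

def compute_dr_grpo_objective(log_probs_new: list[list[float]],
                              log_probs_old: list[list[float]],
                              rewards: list[float],
                              epsilon: float = 0.2) -> float:
    # Unbiased advantages: reward minus group mean, with no std normalization
    # (removing GRPO's question-level difficulty bias).
    rewards_arr = np.asarray(rewards, dtype=float)
    advantages = rewards_arr - rewards_arr.mean()

    total = 0.0
    for lp_new, lp_old, adv in zip(log_probs_new, log_probs_old, advantages):
        # Token-level importance ratios; tokens are summed with no 1/|o_i|
        # normalization (removing the response-level length bias).
        ratios = np.exp(np.asarray(lp_new) - np.asarray(lp_old))
        clipped = np.clip(ratios, 1.0 - epsilon, 1.0 + epsilon)
        total += np.minimum(ratios * adv, clipped * adv).sum()

    # Average over the G responses in the group.
    return total / len(rewards)
```

On Example 1 this returns approximately -0.074929, matching the expected output.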