Implement the GRPO (Group Relative Policy Optimization) objective function as defined in the DeepSeekMath paper. Your task is to compute the objective given likelihood ratios, advantages, and policy probabilities.
The GRPO objective combines a PPO-style clipped surrogate with an importance-weighted KL divergence penalty. The KL term uses the unbiased estimator:
D_KL^(i) = ρ_i · ( π_ref(o_i) / π_θ(o_i) − log( π_ref(o_i) / π_θ(o_i) ) − 1 )
where π_θ(o_i) = ρ_i · π_θ_old(o_i) is the current policy probability.
The final objective is the mean over the n samples of the clipped surrogate minus the beta-weighted KL penalty:
J = (1/n) Σ_i [ min( ρ_i A_i, clip(ρ_i, 1−ε, 1+ε) A_i ) − β · D_KL^(i) ]
Note: The input probabilities (pi_theta_old and pi_theta_ref) are per-sample likelihood values, not distributions. Do not normalize them; use them directly as provided.
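As a quick illustration of the estimator, here is the per-sample KL computation for a single sample (the values are illustrative, taken from the second sample of Example 1 below):

```python
import math

# One sample: rho = pi_theta / pi_theta_old; likelihoods are per-sample values
rho, pi_old, pi_ref = 0.8, 1.1, 0.5

pi_theta = rho * pi_old            # current policy likelihood: 0.88
r = pi_ref / pi_theta              # ratio inside the estimator: pi_ref / pi_theta
kl = rho * (r - math.log(r) - 1)   # importance-weighted unbiased KL estimate

print(kl)  # ~0.1068; r - log(r) - 1 >= 0, so the estimate is never negative
```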
Examples
Example 1:
Input:
grpo_objective([1.2, 0.8, 1.1], [1.0, 1.0, 1.0], [0.9, 1.1, 1.0], [1.0, 0.5, 1.5], epsilon=0.2, beta=0.01)
Output:
1.03277
Explanation: The function computes the clipped surrogate objective and subtracts the importance-weighted KL penalty. The KL uses the estimator: rho * (r - log(r) - 1) where r = pi_ref / pi_theta.
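To see where 1.03277 comes from, the example can be reproduced sample by sample in plain Python (a sketch following the formulas above, not the grader's code):

```python
import math

rhos = [1.2, 0.8, 1.1]
A = [1.0, 1.0, 1.0]
pi_old = [0.9, 1.1, 1.0]
pi_ref = [1.0, 0.5, 1.5]
epsilon, beta = 0.2, 0.01

total = 0.0
for rho, a, p_old, p_ref in zip(rhos, A, pi_old, pi_ref):
    # PPO-style clipped surrogate: min(rho * A, clip(rho, 1-eps, 1+eps) * A)
    clipped = max(1 - epsilon, min(1 + epsilon, rho))
    surrogate = min(rho * a, clipped * a)
    # importance-weighted KL estimate with r = pi_ref / pi_theta
    r = p_ref / (rho * p_old)
    kl = rho * (r - math.log(r) - 1)
    total += surrogate - beta * kl

print(round(total / len(rhos), 5))  # 1.03277
```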
Starter Code
import numpy as np

def grpo_objective(rhos, A, pi_theta_old, pi_theta_ref, epsilon=0.2, beta=0.01) -> float:
    """
    Compute the GRPO objective function.

    Args:
        rhos: List of likelihood ratios (pi_theta / pi_theta_old).
        A: List of advantage estimates.
        pi_theta_old: List of old policy probabilities (per-sample, not normalized).
        pi_theta_ref: List of reference policy probabilities (per-sample, not normalized).
        epsilon: Clipping parameter for the surrogate objective.
        beta: KL divergence penalty coefficient.

    Returns:
        The computed GRPO objective value.
    """
    # Your code here
    pass
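One way to fill in the starter function, vectorized with NumPy and matching the estimator and example above (a sketch under the stated conventions, not necessarily the official reference solution):

```python
import numpy as np

def grpo_objective(rhos, A, pi_theta_old, pi_theta_ref, epsilon=0.2, beta=0.01) -> float:
    rhos = np.asarray(rhos, dtype=float)
    A = np.asarray(A, dtype=float)
    pi_theta_old = np.asarray(pi_theta_old, dtype=float)
    pi_theta_ref = np.asarray(pi_theta_ref, dtype=float)

    # Clipped surrogate: min(rho * A, clip(rho, 1-eps, 1+eps) * A)
    surrogate = np.minimum(rhos * A, np.clip(rhos, 1 - epsilon, 1 + epsilon) * A)

    # KL estimator: rho * (r - log(r) - 1) with r = pi_ref / pi_theta,
    # where pi_theta = rho * pi_theta_old (probabilities are per-sample, not normalized)
    pi_theta = rhos * pi_theta_old
    r = pi_theta_ref / pi_theta
    kl = rhos * (r - np.log(r) - 1)

    # Mean of surrogate minus beta-weighted KL penalty
    return float(np.mean(surrogate - beta * kl))
```

Calling it on Example 1 should reproduce the expected output to five decimal places.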