Implement the GRPO (Group Relative Policy Optimization) objective function as defined in the DeepSeekMath paper. Your task is to compute the objective given likelihood ratios, advantages, and policy probabilities.
The GRPO objective combines a PPO-style clipped surrogate with an importance-weighted KL divergence penalty. The KL term uses the unbiased estimator:
D_KL^(i) = ρ_i · ( π_ref(o_i) / π_θ(o_i) − log( π_ref(o_i) / π_θ(o_i) ) − 1 )
where π_θ(o_i) = ρ_i · π_θ_old(o_i) is the current policy probability.
The final objective is the mean over the n samples of the clipped surrogate minus the beta-weighted KL penalty:
J = (1/n) Σ_i [ min( ρ_i A_i, clip(ρ_i, 1−ε, 1+ε) A_i ) − β · D_KL^(i) ]
Note: The input probabilities (pi_theta_old and pi_theta_ref) are per-sample likelihood values, not distributions. Do not normalize them; use them directly as provided.
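As a quick illustration of the estimator, here is the per-sample KL computation for a single sample (the values are illustrative, taken from the second sample of Example 1 below):

```python
import math

# One sample: rho = pi_theta / pi_theta_old; likelihoods are per-sample values
rho, pi_old, pi_ref = 0.8, 1.1, 0.5

pi_theta = rho * pi_old            # current policy likelihood: 0.88
r = pi_ref / pi_theta              # ratio inside the estimator: pi_ref / pi_theta
kl = rho * (r - math.log(r) - 1)   # importance-weighted unbiased KL estimate

print(kl)  # ~0.1068; r - log(r) - 1 >= 0, so the estimate is never negative
```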
Examples
Example 1:
Input:
grpo_objective([1.2, 0.8, 1.1], [1.0, 1.0, 1.0], [0.9, 1.1, 1.0], [1.0, 0.5, 1.5], epsilon=0.2, beta=0.01)
Output:
1.03277
Explanation: The function computes the clipped surrogate objective and subtracts the importance-weighted KL penalty. The KL uses the estimator: rho * (r - log(r) - 1) where r = pi_ref / pi_theta.
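To see where 1.03277 comes from, the example can be reproduced sample by sample in plain Python (a sketch following the formulas above, not the grader's code):

```python
import math

rhos = [1.2, 0.8, 1.1]
A = [1.0, 1.0, 1.0]
pi_old = [0.9, 1.1, 1.0]
pi_ref = [1.0, 0.5, 1.5]
epsilon, beta = 0.2, 0.01

total = 0.0
for rho, a, p_old, p_ref in zip(rhos, A, pi_old, pi_ref):
    # PPO-style clipped surrogate: min(rho * A, clip(rho, 1-eps, 1+eps) * A)
    clipped = max(1 - epsilon, min(1 + epsilon, rho))
    surrogate = min(rho * a, clipped * a)
    # importance-weighted KL estimate with r = pi_ref / pi_theta
    r = p_ref / (rho * p_old)
    kl = rho * (r - math.log(r) - 1)
    total += surrogate - beta * kl

print(round(total / len(rhos), 5))  # 1.03277
```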
Starter Code
import numpy as np

def grpo_objective(rhos, A, pi_theta_old, pi_theta_ref, epsilon=0.2, beta=0.01) -> float:
    """
    Compute the GRPO objective function.

    Args:
        rhos: List of likelihood ratios (pi_theta / pi_theta_old).
        A: List of advantage estimates.
        pi_theta_old: List of old policy probabilities (per-sample, not normalized).
        pi_theta_ref: List of reference policy probabilities (per-sample, not normalized).
        epsilon: Clipping parameter for the surrogate objective.
        beta: KL divergence penalty coefficient.

    Returns:
        The computed GRPO objective value.
    """
    # Your code here
    pass
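One way to fill in the starter function, vectorized with NumPy and matching the estimator and example above (a sketch under the stated conventions, not necessarily the official reference solution):

```python
import numpy as np

def grpo_objective(rhos, A, pi_theta_old, pi_theta_ref, epsilon=0.2, beta=0.01) -> float:
    rhos = np.asarray(rhos, dtype=float)
    A = np.asarray(A, dtype=float)
    pi_theta_old = np.asarray(pi_theta_old, dtype=float)
    pi_theta_ref = np.asarray(pi_theta_ref, dtype=float)

    # Clipped surrogate: min(rho * A, clip(rho, 1-eps, 1+eps) * A)
    surrogate = np.minimum(rhos * A, np.clip(rhos, 1 - epsilon, 1 + epsilon) * A)

    # KL estimator: rho * (r - log(r) - 1) with r = pi_ref / pi_theta,
    # where pi_theta = rho * pi_theta_old (probabilities are per-sample, not normalized)
    pi_theta = rhos * pi_theta_old
    r = pi_theta_ref / pi_theta
    kl = rhos * (r - np.log(r) - 1)

    # Mean of surrogate minus beta-weighted KL penalty
    return float(np.mean(surrogate - beta * kl))
```

Calling it on Example 1 should reproduce the expected output to five decimal places.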