Implement the Group Relative Advantage calculation used in GRPO (Group Relative Policy Optimization), the reinforcement-learning algorithm behind DeepSeek-R1. In GRPO, the model generates a group of G outputs for each prompt. Each output receives a reward, and the advantage of each output is computed by normalizing the rewards within the group. This normalization makes each policy update relative to the other outputs for the same prompt, which is key to GRPO's effectiveness.
Examples

Example 1:

Input:
rewards = [0.0, 1.0, 0.0, 1.0]

Output:
[-1.0, 1.0, -1.0, 1.0]

Explanation: Mean = 0.5, Std = 0.5. Each reward is normalized: (0 - 0.5) / 0.5 = -1.0 for incorrect outputs, (1 - 0.5) / 0.5 = 1.0 for correct outputs. This gives a positive advantage to correct responses and a negative advantage to incorrect ones.
Starter Code
import numpy as np

def compute_group_relative_advantage(rewards: list[float]) -> list[float]:
    """
    Compute the Group Relative Advantage for GRPO.

    For each reward r_i in a group, compute:
        A_i = (r_i - mean(rewards)) / std(rewards)

    If all rewards are identical (std = 0), return zeros.

    Args:
        rewards: List of rewards for a group of outputs from the same prompt.

    Returns:
        List of normalized advantages.
    """
    # Your code here
    pass
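One possible reference solution is sketched below. It uses NumPy's population standard deviation (the default, ddof=0), which matches the arithmetic in Example 1; an exact zero check handles the degenerate all-identical-rewards case called out in the docstring.

```python
import numpy as np

def compute_group_relative_advantage(rewards: list[float]) -> list[float]:
    """Group Relative Advantage: A_i = (r_i - mean) / std; zeros if std == 0."""
    r = np.asarray(rewards, dtype=np.float64)
    std = r.std()  # population std (ddof=0), matching Example 1
    if std == 0.0:
        # All rewards identical: no relative signal within the group
        return [0.0] * len(rewards)
    return ((r - r.mean()) / std).tolist()

# Example from the problem statement:
print(compute_group_relative_advantage([0.0, 1.0, 0.0, 1.0]))
# [-1.0, 1.0, -1.0, 1.0]
```

Note the design choice: with ddof=0 the example's std of 0.5 falls out directly; using the sample standard deviation (ddof=1) would give std ≈ 0.577 and advantages of about ±0.866 instead, so the convention matters for matching the expected output.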