Group Relative Advantage for GRPO

Easy
Reinforcement Learning

Implement the Group Relative Advantage calculation used in GRPO (Group Relative Policy Optimization), introduced in the DeepSeekMath paper and used to train DeepSeek R1. In GRPO, for each prompt, the model generates a group of G outputs. Each output receives a reward, and the advantage for each output is computed by normalizing rewards within the group: subtract the group mean, then divide by the group standard deviation. This makes each policy update relative to the other outputs for the same prompt, which lets GRPO skip the separate value-function (critic) model that PPO requires.

Examples

Example 1:
Input: rewards = [0.0, 1.0, 0.0, 1.0]
Output: [-1.0, 1.0, -1.0, 1.0]
Explanation: Mean = 0.5, Std = 0.5 (population standard deviation). Each reward is normalized: (0 - 0.5)/0.5 = -1.0 for incorrect outputs, (1 - 0.5)/0.5 = 1.0 for correct outputs. This gives positive advantage to correct responses and negative advantage to incorrect ones.

Starter Code

import numpy as np

def compute_group_relative_advantage(rewards: list[float]) -> list[float]:
	"""
	Compute the Group Relative Advantage for GRPO.
	
	For each reward r_i in a group, compute:
	A_i = (r_i - mean(rewards)) / std(rewards)
	
	If all rewards are identical (std=0), return zeros.
	
	Args:
		rewards: List of rewards for a group of outputs from the same prompt
		
	Returns:
		List of normalized advantages
	"""
	# Your code here
	pass
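
One possible solution sketch (not the only valid approach): NumPy's `np.std` defaults to the population standard deviation, which matches Std = 0.5 in Example 1, and the std = 0 case is handled per the docstring by returning zeros.

```python
import numpy as np

def compute_group_relative_advantage(rewards: list[float]) -> list[float]:
	"""Compute A_i = (r_i - mean) / std within a group; zeros if std == 0."""
	arr = np.asarray(rewards, dtype=np.float64)
	mean = arr.mean()
	std = arr.std()  # population std (ddof=0), matching Example 1
	if std == 0.0:
		# All rewards identical: no output is better than another,
		# so every advantage is zero.
		return [0.0] * len(rewards)
	return ((arr - mean) / std).tolist()
```

Using sample standard deviation (`ddof=1`) instead would scale the advantages by a constant factor per group; the population version is what reproduces the expected output above.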