GSPO: Group Sequence Policy Optimization

Hard
Reinforcement Learning

Implement Group Sequence Policy Optimization (GSPO), a reinforcement learning algorithm for training large language models. Unlike token-level methods such as GRPO, which can suffer from training instability, GSPO computes importance ratios at the sequence level with length normalization. Given per-token log probabilities from the new and old policies along with a scalar reward for each sequence in a group, compute the clipped surrogate objective. Applying clipping at the sequence level rather than the token level helps prevent catastrophic model collapse.

Examples

Example 1:
Input: log_probs_new=[[1.0, 0.5], [0.8, 1.2]], log_probs_old=[[0.0, 0.0], [0.0, 0.0]], rewards=[0.9, 0.7], epsilon=0.2
Output: Approximately -0.759
Explanation: First compute group-normalized advantages: mean=0.8, std=0.1, so A=[1.0, -1.0]. For seq 1: log-ratio sum = 1.5, length = 2, so s_1 = exp(1.5/2) ≈ 2.117; with A = 1.0 the pessimistic min takes the clipped term: min(2.117 × 1.0, 1.2 × 1.0) = 1.2. For seq 2: log-ratio sum = 2.0, length = 2, so s_2 = exp(2.0/2) ≈ 2.718; with A = -1.0 the pessimistic min keeps the unclipped term: min(2.718 × (-1.0), 1.2 × (-1.0)) = -2.718. Average = (1.2 - 2.718)/2 ≈ -0.759. The length normalization prevents the sequence-level importance ratio from growing exponentially with sequence length.

Starter Code

import numpy as np

def gspo_objective(log_probs_new: list[list[float]], 
                   log_probs_old: list[list[float]], 
                   rewards: list[float], 
                   epsilon: float = 0.2) -> float:
	"""
	Compute GSPO (Group Sequence Policy Optimization) clipped objective.
	
	Args:
		log_probs_new: Per-token log probabilities under the new policy, one inner list per sequence
		log_probs_old: Per-token log probabilities under the old policy, same shape as log_probs_new
		rewards: Scalar reward for each sequence in the group
		epsilon: Clipping range for the sequence-level importance ratios
	
	Returns:
		Average clipped objective value
	"""
	# Your code here
	pass
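One way to complete the stub above — a minimal sketch assuming group-normalized advantages (population standard deviation with a small stabilizer, both assumptions not fixed by the problem statement) and the standard PPO-style pessimistic min applied to the sequence-level ratio:

```python
import numpy as np

def gspo_objective(log_probs_new: list[list[float]],
                   log_probs_old: list[list[float]],
                   rewards: list[float],
                   epsilon: float = 0.2) -> float:
	# Group-normalized advantages: (r - mean) / (std + eps) over the group.
	r = np.asarray(rewards, dtype=float)
	adv = (r - r.mean()) / (r.std() + 1e-8)

	objectives = []
	for lp_new, lp_old, a in zip(log_probs_new, log_probs_old, adv):
		# Sequence-level importance ratio with length normalization:
		# s_i = exp( (1/|y_i|) * sum_t (log pi_new - log pi_old) )
		log_ratio = (np.asarray(lp_new) - np.asarray(lp_old)).sum()
		s = np.exp(log_ratio / len(lp_new))
		clipped = np.clip(s, 1.0 - epsilon, 1.0 + epsilon)
		# Pessimistic (min) bound, applied once per sequence.
		objectives.append(min(s * a, clipped * a))
	return float(np.mean(objectives))

# Example 1 from above:
# gspo_objective([[1.0, 0.5], [0.8, 1.2]],
#                [[0.0, 0.0], [0.0, 0.0]],
#                [0.9, 0.7], 0.2)  # ≈ -0.759
```

Note that the min matters for negative advantages: clipping alone would cap the penalty for sequence 2 at -1.2, whereas the pessimistic bound keeps the larger penalty -2.718, mirroring PPO's behavior at the sequence level.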