Implement Group Sequence Policy Optimization (GSPO), a reinforcement learning algorithm for training large language models. Unlike token-level methods such as GRPO, which can suffer from training instability, GSPO uses length-normalized, sequence-level importance ratios. Given per-token log probabilities from the new and old policies along with a reward for each sequence in a group, compute the clipped objective. Applying clipping at the sequence level rather than the token level helps prevent catastrophic model collapse.
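The key quantity is the length-normalized, sequence-level importance ratio. A minimal sketch of how it is computed from per-token log probabilities (the log-prob values below are hypothetical, chosen only for illustration):

```python
import numpy as np

# Hypothetical per-token log-probs for one sequence of length 3.
log_new = np.array([-1.0, -0.5, -0.2])
log_old = np.array([-1.2, -0.6, -0.3])

# Sequence-level ratio with length normalization:
# s = exp((1/|y|) * sum_t (log pi_new(y_t) - log pi_old(y_t)))
s = np.exp((log_new - log_old).sum() / len(log_new))
```

Taking the mean of the log ratios before exponentiating keeps s near 1 even for long sequences, which is exactly what makes sequence-level clipping stable.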
Examples
Example 1:
Input:
log_probs_new=[[1.0, 0.5]], [[0.8, 1.2]] is not valid; the input is: log_probs_new=[[1.0, 0.5], [0.8, 1.2]], log_probs_old=[[0.0, 0.0], [0.0, 0.0]], rewards=[0.9, 0.7], epsilon=0.2
Output:
Approximately -0.759
Explanation: First compute advantages: mean=0.8, std=0.1, so A=[1.0, -1.0]. For sequence 1: the summed log ratio is 1.5 over length 2, so s_1 = exp(1.5/2) ≈ 2.117; with A_1 = 1.0, the clipped term is min(s_1 × A_1, clip(s_1, 0.8, 1.2) × A_1) = min(2.117, 1.2) = 1.2. For sequence 2: the summed log ratio is 2.0 over length 2, so s_2 = exp(2.0/2) ≈ 2.718; with A_2 = -1.0, the clipped term is min(s_2 × A_2, clip(s_2, 0.8, 1.2) × A_2) = min(-2.718, -1.2) = -2.718. Average = (1.2 - 2.718)/2 ≈ -0.759. The length normalization prevents extreme importance ratios.
Starter Code
import numpy as np

def gspo_objective(log_probs_new: list[list[float]],
                   log_probs_old: list[list[float]],
                   rewards: list[float],
                   epsilon: float = 0.2) -> float:
    """
    Compute the GSPO (Group Sequence Policy Optimization) clipped objective.

    Args:
        log_probs_new: Per-token log probabilities for each sequence under the new policy
        log_probs_old: Per-token log probabilities for each sequence under the old policy
        rewards: Reward for each sequence in the group
        epsilon: Clipping range for the sequence-level importance ratios

    Returns:
        Average clipped objective value over the group
    """
    # Your code here
    pass
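One possible solution sketch, assuming the canonical GSPO formulation: group-normalized advantages A_i = (r_i - mean(r)) / std(r), length-normalized sequence ratios s_i = exp(mean_t(log π_new - log π_old)), and a PPO-style term min(s_i A_i, clip(s_i, 1-ε, 1+ε) A_i) averaged over the group. The 1e-8 added to the std is a stabilizer of my choosing, not part of the problem statement:

```python
import numpy as np

def gspo_objective(log_probs_new: list[list[float]],
                   log_probs_old: list[list[float]],
                   rewards: list[float],
                   epsilon: float = 0.2) -> float:
    rewards_arr = np.array(rewards, dtype=float)
    # Group-normalized advantages (population std; 1e-8 guards against std == 0).
    advantages = (rewards_arr - rewards_arr.mean()) / (rewards_arr.std() + 1e-8)

    objectives = []
    for lp_new, lp_old, adv in zip(log_probs_new, log_probs_old, advantages):
        # Length-normalized sequence-level importance ratio:
        # s_i = exp((1/|y_i|) * sum_t (log pi_new - log pi_old))
        log_ratio = (np.array(lp_new) - np.array(lp_old)).sum() / len(lp_new)
        s = np.exp(log_ratio)
        # PPO-style clipping, applied once per sequence rather than per token.
        clipped_s = np.clip(s, 1.0 - epsilon, 1.0 + epsilon)
        objectives.append(min(s * adv, clipped_s * adv))

    return float(np.mean(objectives))
```

Under these assumptions, the Example 1 inputs give approximately -0.759: sequence 1 contributes min(2.117 × 1.0, 1.2 × 1.0) = 1.2 and sequence 2 contributes min(2.718 × (-1.0), 1.2 × (-1.0)) = -2.718.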