Generator-Critic Self-Play Loop

Medium
Agents

Generator-Critic Self-Play

Self-play iteratively improves outputs by alternating generation and critique.

Task

Build SelfPlayAgent where:

  1. Generator produces an attempt (given task + prior critique).
  2. Critic scores 0-1 and provides critique text.
  3. Loop repeats until score >= threshold or max_rounds reached.
  4. Returns best attempt (highest score across all rounds).

Constraints

  • Stop early on threshold met.
  • Best attempt = highest scored, not necessarily last.
  • Critique is always fed back to generator as context.

Examples

Example 1:
Input: scores=[0.5,0.8,0.95]; idx=[0] def gen(t,p='',c=''): return f'v{idx[0]}' def crit(t,a): s=scores[idx[0]]; idx[0]+=1; return ('ok',s) agent=SelfPlayAgent(gen,crit,max_rounds=10) agent.run('task',threshold=0.9)
Output: {'best_attempt':'v2','best_score':0.95,'rounds_taken':3}
Explanation: Score exceeds 0.9 at round 3; early stop.

Starter Code

from typing import List, Dict, Callable, Tuple
from dataclasses import dataclass

@dataclass
class Round:
    round_num: int
    attempt: str
    critique: str
    score: float

class SelfPlayAgent:
    def __init__(self, generator: Callable, critic: Callable, max_rounds: int = 5):
        # generator(task, previous_attempt, critique) -> str
        # critic(task, attempt) -> Tuple[str, float]  (critique, score 0-1)
        self.generator = generator
        self.critic = critic
        self.max_rounds = max_rounds
        self.rounds: List[Round] = []

    def run(self, task: str, threshold: float = 0.9) -> Dict:
        # TODO: Run loop, stop early if score >= threshold
        # Return {'best_attempt': str, 'best_score': float, 'rounds_taken': int}
        pass
Lines: 1Characters: 0
Ready
The AI Interview - Master AI/ML Interviews