Generator-Critic Self-Play
Self-play iteratively improves outputs by alternating generation and critique.
Task
Build SelfPlayAgent where:
- Generator produces an attempt (given task + prior critique).
- Critic scores 0-1 and provides critique text.
- Loop repeats until score >= threshold or max_rounds reached.
- Returns best attempt (highest score across all rounds).
Constraints
- Stop early on threshold met.
- Best attempt = highest scored, not necessarily last.
- Critique is always fed back to generator as context.
Examples
Example 1:
Input:
scores=[0.5,0.8,0.95]; idx=[0]
def gen(t,p='',c=''): return f'v{idx[0]}'
def crit(t,a): s=scores[idx[0]]; idx[0]+=1; return ('ok',s)
agent=SelfPlayAgent(gen,crit,max_rounds=10)
agent.run('task',threshold=0.9)Output:
{'best_attempt':'v2','best_score':0.95,'rounds_taken':3}Explanation: Score exceeds 0.9 at round 3; early stop.
Starter Code
from typing import List, Dict, Callable, Tuple
from dataclasses import dataclass
@dataclass
class Round:
round_num: int
attempt: str
critique: str
score: float
class SelfPlayAgent:
def __init__(self, generator: Callable, critic: Callable, max_rounds: int = 5):
# generator(task, previous_attempt, critique) -> str
# critic(task, attempt) -> Tuple[str, float] (critique, score 0-1)
self.generator = generator
self.critic = critic
self.max_rounds = max_rounds
self.rounds: List[Round] = []
def run(self, task: str, threshold: float = 0.9) -> Dict:
# TODO: Run loop, stop early if score >= threshold
# Return {'best_attempt': str, 'best_score': float, 'rounds_taken': int}
pass
Python3
ReadyLines: 1Characters: 0
Ready