Implement agent evaluation metrics:
record_episode(goal, actions, success, tokens, time): Store episode data
task_success_rate(): Successful episodes / total episodes, returned as a fraction (e.g. 0.5)
average_steps_to_success(): Mean number of actions across successful episodes
efficiency_score(): success_rate / (avg_tokens / 1000)
- Normalize tokens to thousands
tool_usage_distribution(): Count each tool across all episodes
Episode Data:
{'goal': ..., 'actions': [...], 'success': bool, 'tokens': int, 'time': float}
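The success-rate and efficiency formulas can be sketched as plain functions over a list of episode dictionaries of this shape. This is a minimal sketch, not a reference solution: the function names `success_rate` and `efficiency` are hypothetical, and it assumes that avg_tokens is averaged over all episodes (the problem statement does not say whether failed episodes count) and that a zero token average yields 0.

```python
def success_rate(episodes):
    """Fraction of episodes with success=True; 0.0 when there are no episodes."""
    if not episodes:
        return 0.0
    return sum(1 for ep in episodes if ep["success"]) / len(episodes)


def efficiency(episodes):
    """success_rate / (avg_tokens / 1000).

    Assumption: tokens are averaged over ALL episodes, not only successful ones.
    """
    if not episodes:
        return 0.0
    avg_tokens = sum(ep["tokens"] for ep in episodes) / len(episodes)
    if avg_tokens == 0:
        return 0.0  # assumption: avoid dividing by zero
    return success_rate(episodes) / (avg_tokens / 1000)


# The two episodes from Example 1, written as episode dictionaries.
episodes = [
    {"goal": "g", "actions": [{"tool": "t"}], "success": True, "tokens": 100, "time": 1.0},
    {"goal": "g", "actions": [], "success": False, "tokens": 50, "time": 0.5},
]
print(success_rate(episodes))  # 0.5
print(efficiency(episodes))    # 0.5 / (75 / 1000), roughly 6.67
```

Under these assumptions, the two episodes average 75 tokens, so the score is 0.5 / 0.075.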
Actions Format:
[{'tool': 'search', 'input': ...}, {'tool': 'calc', ...}]
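Given this action format, counting tool calls across episodes might look like the sketch below. The helper name `tool_usage` is hypothetical, and returning a plain dict of tool name to call count is an assumption, since the statement does not fix the return type. Actions without a 'tool' key are skipped, which is also an assumption.

```python
from collections import Counter


def tool_usage(episodes):
    """Count how often each tool appears across all episodes' actions."""
    counts = Counter()
    for ep in episodes:
        for action in ep["actions"]:
            tool = action.get("tool")
            if tool is not None:  # assumption: ignore actions with no 'tool' key
                counts[tool] += 1
    return dict(counts)


sample = [
    {"goal": "a", "actions": [{"tool": "search", "input": "q"},
                              {"tool": "calc", "input": "1+1"}],
     "success": True, "tokens": 120, "time": 1.2},
    {"goal": "b", "actions": [{"tool": "search", "input": "q2"}],
     "success": False, "tokens": 80, "time": 0.7},
]
print(tool_usage(sample))  # {'search': 2, 'calc': 1}
```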
Edge Cases:
- No episodes: return 0 or empty
- No successes: average_steps returns None
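The "no successes" case can be illustrated with a small standalone sketch. It is written as a function over a list of episode dictionaries rather than as the class method, so the name and signature here are illustrative only.

```python
def average_steps_to_success(episodes):
    """Mean number of actions over successful episodes, or None if none succeeded."""
    successful = [ep for ep in episodes if ep["success"]]
    if not successful:
        return None  # also covers the "no episodes at all" case
    return sum(len(ep["actions"]) for ep in successful) / len(successful)


failed_only = [{"goal": "g", "actions": [{"tool": "t"}], "success": False,
                "tokens": 10, "time": 0.1}]
mixed = [
    {"goal": "g", "actions": [{"tool": "t"}], "success": True, "tokens": 10, "time": 0.1},
    {"goal": "g", "actions": [{"tool": "t"}] * 3, "success": True, "tokens": 30, "time": 0.3},
    {"goal": "g", "actions": [{"tool": "t"}] * 9, "success": False, "tokens": 90, "time": 0.9},
]
print(average_steps_to_success(failed_only))  # None
print(average_steps_to_success(mixed))        # 2.0
```

The failed nine-action episode in `mixed` does not affect the mean, because only successful episodes are averaged.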
Examples
Example 1:
Input:
ev = AgentEvaluator()
ev.record_episode('g', [{'tool':'t'}], True, 100, 1.0)
ev.record_episode('g', [], False, 50, 0.5)
ev.task_success_rate()
Output:
0.5
Explanation: 1 success out of 2 episodes = 50%
Starter Code
class AgentEvaluator:
    """
    Evaluate agent performance across multiple dimensions.
    """

    def __init__(self):
        self.metrics = {}
        self.episodes = []

    def record_episode(self, goal, actions, success, tokens_used, time_taken):
        """Record an agent episode"""
        # Your implementation here
        pass

    def task_success_rate(self):
        """Calculate overall success rate"""
        # Your implementation here
        pass

    def average_steps_to_success(self):
        """Average number of actions for successful episodes"""
        # Your implementation here
        pass

    def efficiency_score(self):
        """
        Calculate efficiency: success_rate / (avg_tokens_per_task / 1000)
        Higher is better (more success with fewer tokens)
        """
        # Your implementation here
        pass

    def tool_usage_distribution(self):
        """Return distribution of tool calls across episodes"""
        # Your implementation here
        pass