Agent Evaluation Framework
Production agents need systematic evaluation across diverse tasks.
Task
Build an AgentEvaluator that:
- Runs individual test cases against an agent function.
- Supports multiple grading strategies: exact match, contains, LLM-as-judge, and custom (a custom-grader sketch follows this list).
- Reports aggregate metrics: accuracy, average latency, token usage, weighted score.
- Handles agent exceptions gracefully (mark as failed, score 0).
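For instance, a 'custom' case can carry its own grading callable, constructed with the EvalCase dataclass from the starter code below. The (expected, actual) -> float signature and the numeric_closeness name used here are illustrative assumptions; the starter code only declares custom_grader: Callable.

# Hypothetical custom grader: full credit at the exact value, partial
# credit that falls off linearly with numeric distance.
def numeric_closeness(expected: str, actual: str) -> float:
    try:
        return max(0.0, 1.0 - abs(float(expected) - float(actual)))
    except ValueError:
        return 0.0

case = EvalCase('calc-1', 'What is 2+2?', '4', 'custom',
                weight=2.0, custom_grader=numeric_closeness)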
Constraints
- Weighted score: sum(case.weight * case.score) / sum(weights) (a worked example follows this list).
- Track latency per case (milliseconds).
- LLM judge prompt:
'Does this answer the question? Q: {input} A: {actual} Expected: {expected}. Score 0-1:'.
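As a quick illustration of the weighted score, using hypothetical scores and weights (not taken from the example below):

scores  = [1.0, 0.0]   # per-case scores, e.g. one pass and one fail
weights = [2.0, 1.0]   # per-case weights
weighted_score = sum(w * s for w, s in zip(weights, scores)) / sum(weights)
# (2.0*1.0 + 1.0*0.0) / 3.0 ≈ 0.667, whereas unweighted accuracy would be 0.5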
Examples
Example 1:
Input:
cases = [EvalCase('1', '2+2', '4', 'contains')]
evaluator = AgentEvaluator(lambda q: '4')
evaluator.run_suite(cases)
Output:
{'accuracy': 1.0, 'weighted_score': 1.0, 'avg_latency_ms': ..., 'total_cases': 1}
Explanation: The agent returns '4', which contains the expected output '4'.
Starter Code
from typing import List, Dict, Any, Callable, Optional
from dataclasses import dataclass


@dataclass
class EvalCase:
    case_id: str
    input: str
    expected_output: str
    grader: str  # 'exact' | 'contains' | 'llm' | 'custom'
    weight: float = 1.0
    custom_grader: Optional[Callable] = None


@dataclass
class EvalResult:
    case_id: str
    passed: bool
    score: float
    actual_output: str
    latency_ms: float
    tokens_used: int


class AgentEvaluator:
    def __init__(self, agent_fn: Callable[[str], str], llm_judge_fn: Optional[Callable] = None):
        self.agent_fn = agent_fn
        self.llm_judge_fn = llm_judge_fn

    def run_case(self, case: EvalCase) -> EvalResult:
        # TODO: Run the agent on one case, grade the output, record latency
        pass

    def run_suite(self, cases: List[EvalCase]) -> Dict:
        # TODO: Run all cases, return aggregate metrics
        pass

    def _grade(self, case: EvalCase, actual: str) -> float:
        # TODO: Grade based on the case's grader type
        pass