Implement Agent Evaluation Framework

Medium
Agents

Agent Evaluation Framework

Production agents need systematic evaluation across diverse tasks.

Task

Build an AgentEvaluator that:

  1. Runs individual test cases against an agent function.
  2. Supports multiple grading strategies: exact match, contains, LLM-as-judge, and custom (one possible dispatch is sketched after this list).
  3. Reports aggregate metrics: accuracy, average latency, token usage, weighted score.
  4. Handles agent exceptions gracefully (mark as failed, score 0).
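
A minimal sketch of how the grading dispatch could look (illustrative only; grade_output and the judge/custom-grader signatures are assumptions, and the starter code below expects this logic inside the _grade method):

from typing import Callable, Optional

def grade_output(grader: str, actual: str, expected: str, question: str,
                 llm_judge_fn: Optional[Callable[[str], float]] = None,
                 custom_grader: Optional[Callable[[str, str], float]] = None) -> float:
    # Return a score in [0, 1] for a single case (illustrative dispatch only).
    if grader == 'exact':
        return 1.0 if actual.strip() == expected.strip() else 0.0
    if grader == 'contains':
        return 1.0 if expected in actual else 0.0
    if grader == 'llm':
        # Assumes the judge callable takes the rendered prompt and returns a 0-1 score.
        prompt = (f"Does this answer the question? Q: {question} "
                  f"A: {actual} Expected: {expected}. Score 0-1:")
        return float(llm_judge_fn(prompt)) if llm_judge_fn else 0.0
    if grader == 'custom':
        return float(custom_grader(actual, expected)) if custom_grader else 0.0
    raise ValueError(f"Unknown grader: {grader}")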

Constraints

  • Weighted score: sum(case.weight * case.score) / sum(weights) (see the sketch after these constraints).
  • Track latency per case (milliseconds).
  • LLM judge prompt: 'Does this answer the question? Q: {input} A: {actual} Expected: {expected}. Score 0-1:'.
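
A minimal sketch of the aggregation math and latency measurement implied by these constraints (illustrative; weighted_score and timed_call are assumed helper names, not part of the starter code):

import time
from typing import Callable, List, Tuple

def weighted_score(weights: List[float], scores: List[float]) -> float:
    # sum(case.weight * case.score) / sum(weights), guarding against an empty suite.
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, scores)) / total if total else 0.0

def timed_call(agent_fn: Callable[[str], str], question: str) -> Tuple[str, float]:
    # Wrap the agent call in a wall-clock timer; report latency in milliseconds.
    start = time.perf_counter()
    output = agent_fn(question)
    latency_ms = (time.perf_counter() - start) * 1000.0
    return output, latency_ms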

Examples

Example 1:
Input:
  cases = [EvalCase('1', '2+2', '4', 'contains')]
  evaluator = AgentEvaluator(lambda q: '4')
  evaluator.run_suite(cases)
Output: {'accuracy': 1.0, 'weighted_score': 1.0, 'avg_latency_ms': ..., 'total_cases': 1}
Explanation: The agent returns '4', which contains the expected '4', so the case passes.

Starter Code

from typing import List, Dict, Any, Callable, Optional
from dataclasses import dataclass

@dataclass
class EvalCase:
    case_id: str
    input: str
    expected_output: str
    grader: str  # 'exact' | 'contains' | 'llm' | 'custom'
    weight: float = 1.0
    custom_grader: Optional[Callable] = None

@dataclass
class EvalResult:
    case_id: str
    passed: bool
    score: float
    actual_output: str
    latency_ms: float
    tokens_used: int

class AgentEvaluator:
    def __init__(self, agent_fn: Callable[[str], str], llm_judge_fn: Optional[Callable] = None):
        self.agent_fn = agent_fn
        self.llm_judge_fn = llm_judge_fn

    def run_case(self, case: EvalCase) -> EvalResult:
        # TODO: Call the agent, measure latency, grade the output, handle exceptions
        pass

    def run_suite(self, cases: List[EvalCase]) -> Dict:
        # TODO: Run all cases, return aggregate metrics
        pass

    def _grade(self, case: EvalCase, actual: str) -> float:
        # TODO: Grade based on grader type
        pass
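
Once the TODO methods are implemented, a run might look like this (the agent and judge lambdas below are stand-ins for illustration):

cases = [
    EvalCase('1', '2+2', '4', 'contains'),
    EvalCase('2', 'Capital of France?', 'Paris', 'exact', weight=2.0),
]
evaluator = AgentEvaluator(
    agent_fn=lambda q: '4' if '2+2' in q else 'Paris',
    llm_judge_fn=lambda prompt: 1.0,  # stub judge; a real one would call an LLM
)
print(evaluator.run_suite(cases))
# Expected shape: {'accuracy': ..., 'weighted_score': ..., 'avg_latency_ms': ..., 'total_cases': 2}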