Code Execution Verifier for Programming Benchmarks

Medium
LLM

Implement a code execution verifier used in programming benchmarks like HumanEval and MBPP. When evaluating code generation models, we need to verify whether the generated code produces correct outputs for given test cases.

Your verifier should process a list of test case results, where each result contains:

  • 'expected': The expected output as a string
  • 'actual': The actual output from code execution (may be None if execution failed)
  • 'status': Execution status ('success', 'error', or 'timeout')

The verifier should:

  1. Mark tests with non-success status as 'error'
  2. For successful executions, compare outputs after stripping whitespace
  3. Support numeric tolerance for floating-point comparisons (try parsing as floats first)
  4. Fall back to exact string matching for non-numeric outputs

Return a dictionary containing pass_rate, error_rate, passed_count, total_count, and a list of verdicts ('pass', 'fail', or 'error') for each test. Handle an empty test case list without dividing by zero (both rates should be 0.0).
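The comparison logic in steps 2–4 can be sketched as a small helper. This is one reasonable reading of the spec, assuming an absolute (not relative) tolerance; the helper name `outputs_match` is illustrative, not part of the required API:

```python
def outputs_match(expected: str, actual: str, tol: float = 1e-6) -> bool:
    """Compare two outputs: numeric tolerance first, exact string fallback."""
    e, a = expected.strip(), actual.strip()
    try:
        # Step 3: if both parse as floats, compare within an absolute tolerance.
        return abs(float(e) - float(a)) <= tol
    except ValueError:
        # Step 4: otherwise fall back to exact string matching.
        return e == a
```

Note that stripping happens before parsing, so outputs like `' 42 '` and `'42'` compare equal both numerically and as strings.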

Examples

Example 1:
Input: test_cases = [{'expected': '5', 'actual': '5', 'status': 'success'}, {'expected': 'foo', 'actual': 'bar', 'status': 'success'}, {'expected': '10', 'actual': None, 'status': 'error'}]
Output: {'pass_rate': 0.3333, 'error_rate': 0.3333, 'passed_count': 1, 'total_count': 3, 'verdicts': ['pass', 'fail', 'error']}
Explanation: The first test case passes (exact string match '5'). The second test case fails because 'foo' does not equal 'bar'. The third test case is marked as 'error' because the status is 'error'. Thus 1 out of 3 tests passed (0.3333), and 1 out of 3 had execution errors (0.3333).
Example 2:
Input: test_cases = [{'expected': '3.14159', 'actual': '3.1415901', 'status': 'success'}, {'expected': '7', 'actual': None, 'status': 'timeout'}]
Output: {'pass_rate': 0.5, 'error_rate': 0.5, 'passed_count': 1, 'total_count': 2, 'verdicts': ['pass', 'error']}
Explanation: The first test case passes because |3.1415901 - 3.14159| = 1e-7 is within the default numeric tolerance of 1e-6. The second test case is marked 'error' because its status is 'timeout'. Thus 1 out of 2 tests passed (0.5), and 1 out of 2 had execution errors (0.5).

Starter Code

import numpy as np

def verify_code_execution(
    test_cases: list[dict],
    numeric_tolerance: float = 1e-6
) -> dict:
    """
    Verify code execution results for a programming benchmark.
    
    Args:
        test_cases: List of dicts with keys:
            - 'expected': Expected output string
            - 'actual': Actual output string (or None if execution failed)
            - 'status': 'success', 'error', or 'timeout'
        numeric_tolerance: Tolerance for floating-point comparisons
        
    Returns:
        Dict with keys:
            - 'pass_rate': Proportion of passed tests (float, rounded to 4 decimals)
            - 'error_rate': Proportion of execution errors (float, rounded to 4 decimals)
            - 'passed_count': Number of passed tests (int)
            - 'total_count': Total number of tests (int)
            - 'verdicts': List of 'pass', 'fail', or 'error' for each test
    """
    # Your code here
    pass
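
A minimal reference sketch is below. It is one possible solution, not the only valid one; it assumes an absolute tolerance for the numeric comparison and treats a None actual output as an execution error even when the status is 'success':

```python
def verify_code_execution(
    test_cases: list[dict],
    numeric_tolerance: float = 1e-6,
) -> dict:
    verdicts = []
    for case in test_cases:
        # Step 1: any non-success status (or missing output) is an error.
        if case['status'] != 'success' or case['actual'] is None:
            verdicts.append('error')
            continue
        # Step 2: strip whitespace before comparing.
        expected = case['expected'].strip()
        actual = case['actual'].strip()
        try:
            # Step 3: numeric comparison within an absolute tolerance.
            matched = abs(float(expected) - float(actual)) <= numeric_tolerance
        except ValueError:
            # Step 4: exact string match for non-numeric outputs.
            matched = expected == actual
        verdicts.append('pass' if matched else 'fail')

    total = len(test_cases)
    passed = verdicts.count('pass')
    errors = verdicts.count('error')
    return {
        'pass_rate': round(passed / total, 4) if total else 0.0,
        'error_rate': round(errors / total, 4) if total else 0.0,
        'passed_count': passed,
        'total_count': total,
        'verdicts': verdicts,
    }
```

On Example 1 this returns verdicts ['pass', 'fail', 'error'] with a pass_rate of 0.3333, and an empty input list yields rates of 0.0 with an empty verdicts list.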
The AI Interview - Master AI/ML Interviews