Implement a code execution verifier used in programming benchmarks like HumanEval and MBPP. When evaluating code generation models, we need to verify whether the generated code produces correct outputs for given test cases.
Your verifier should process a list of test case results, where each result contains:
- 'expected': The expected output as a string
- 'actual': The actual output from code execution (may be None if execution failed)
- 'status': Execution status ('success', 'error', or 'timeout')
The verifier should:
- Mark tests with non-success status as 'error'
- For successful executions, compare outputs after stripping whitespace
- Support numeric tolerance for floating-point comparisons (try parsing as floats first)
- Fall back to exact string matching for non-numeric outputs
Return a dictionary containing pass_rate, error_rate, passed_count, total_count, and a list of verdicts ('pass', 'fail', or 'error') for each test. Handle empty test case lists appropriately.
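The tolerance-then-fallback comparison described above can be sketched as follows. This is a minimal sketch, not the required solution; `outputs_match` is a hypothetical helper name, not part of the required interface:

```python
def outputs_match(expected: str, actual: str, tol: float = 1e-6) -> bool:
    """Compare two outputs, trying numeric comparison before exact matching."""
    expected, actual = expected.strip(), actual.strip()
    try:
        # Numeric path: both sides parse as floats, compare within tolerance
        return abs(float(expected) - float(actual)) <= tol
    except ValueError:
        # Fallback: exact string match for non-numeric outputs
        return expected == actual
```

Note that the `try`/`except ValueError` pattern avoids writing a separate "is this a number?" check: `float()` itself decides whether the numeric path applies.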
Examples
Example 1:
Input:
test_cases = [
    {'expected': '5', 'actual': '5', 'status': 'success'},
    {'expected': 'foo', 'actual': 'bar', 'status': 'success'},
    {'expected': '10', 'actual': None, 'status': 'error'},
]
Output:
{'pass_rate': 0.3333, 'error_rate': 0.3333, 'passed_count': 1, 'total_count': 3, 'verdicts': ['pass', 'fail', 'error']}
Explanation: The first test passes (exact string match on '5'). The second fails because 'foo' does not equal 'bar'. The third is marked 'error' because its status is 'error'. Thus 1 of 3 tests passed (0.3333) and 1 of 3 had execution errors (0.3333).
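The rounded rates in the expected output follow directly from dividing the counts and rounding to four decimal places with Python's built-in `round`:

```python
# Counts from Example 1: 1 pass, 1 execution error, 3 tests total
passed_count, error_count, total_count = 1, 1, 3

pass_rate = round(passed_count / total_count, 4)   # 1/3 rounded to 4 decimals
error_rate = round(error_count / total_count, 4)

print(pass_rate, error_rate)  # 0.3333 0.3333
```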
Example 2:
Input:
test_cases = [{'expected': '0.1', 'actual': '0.10000000001', 'status': 'success'}]
Output:
{'pass_rate': 1.0, 'error_rate': 0.0, 'passed_count': 1, 'total_count': 1, 'verdicts': ['pass']}
Explanation: The strings differ, but both parse as floats and the absolute difference (1e-11) is within the default numeric tolerance of 1e-6, so the test passes.
Starter Code
import numpy as np

def verify_code_execution(
    test_cases: list[dict],
    numeric_tolerance: float = 1e-6,
) -> dict:
    """
    Verify code execution results for a programming benchmark.

    Args:
        test_cases: List of dicts with keys:
            - 'expected': Expected output string
            - 'actual': Actual output string (or None if execution failed)
            - 'status': 'success', 'error', or 'timeout'
        numeric_tolerance: Tolerance for floating-point comparisons

    Returns:
        Dict with keys:
            - 'pass_rate': Proportion of passed tests (float, rounded to 4 decimals)
            - 'error_rate': Proportion of execution errors (float, rounded to 4 decimals)
            - 'passed_count': Number of passed tests (int)
            - 'total_count': Total number of tests (int)
            - 'verdicts': List of 'pass', 'fail', or 'error' for each test
    """
    # Your code here
    pass