Implement evaluation metrics for reasoning models: pass@1 and majority voting (consensus). These metrics are essential for evaluating models where multiple samples are generated per problem. Pass@1 measures the average correctness across samples, while majority voting selects the most common answer, often improving accuracy by filtering out inconsistent errors.
Examples
Example 1:
Input:
responses_correct = np.array([True, False, True, False])
Output:
0.5
Explanation: 2 out of 4 responses are correct, so pass@1 = 2/4 = 0.5. This represents a 50% chance of getting a correct answer when sampling once.
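Majority voting can be illustrated the same way; the responses below are hypothetical values, not part of the problem's test cases:

```python
from collections import Counter

# Three samples for the same problem; "42" appears most often,
# so majority voting selects it even though one sample disagrees.
responses = ["42", "41", "42"]
print(Counter(responses).most_common(1)[0][0])  # prints "42"
```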
Starter Code
import numpy as np
from collections import Counter

def pass_at_1(responses_correct: np.ndarray) -> float:
    """
    Compute pass@1 by averaging correctness.

    Args:
        responses_correct: Boolean array for each response
    Returns:
        pass@1 score
    """
    # Your code here
    pass

def majority_voting(responses: list[str]) -> str:
    """
    Return the most common response.

    Args:
        responses: List of response strings
    Returns:
        Most frequent response
    """
    # Your code here
    pass

def pass_at_k(n: int, c: int, k: int) -> float:
    """
    Compute unbiased pass@k from n samples with c correct.
    Formula: pass@k = 1 - C(n-c, k) / C(n, k)

    Args:
        n: Total samples
        c: Correct samples
        k: k in pass@k
    Returns:
        Estimated pass@k
    """
    # Your code here
    pass
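One way the three functions could be filled in is sketched below; this is an illustrative solution, not the official one, and it assumes Python 3.8+ for math.comb:

```python
import math
import numpy as np
from collections import Counter

def pass_at_1(responses_correct: np.ndarray) -> float:
    # Fraction of samples that are correct.
    return float(np.mean(responses_correct))

def majority_voting(responses: list[str]) -> str:
    # most_common(1) returns [(response, count)] for the top entry.
    return Counter(responses).most_common(1)[0][0]

def pass_at_k(n: int, c: int, k: int) -> float:
    # If fewer than k samples are incorrect, every size-k draw
    # contains at least one correct sample.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are incorrect.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

On the example above, pass_at_1(np.array([True, False, True, False])) gives 0.5, matching the expected output; pass_at_k(4, 2, 1) reduces to the same value, since pass@1 is the k=1 case of the estimator.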