Implement evaluation metrics for reasoning models: pass@1 and majority voting (consensus). These metrics are essential for evaluating models where multiple samples are generated per problem. Pass@1 measures the average correctness across samples, while majority voting selects the most common answer, often improving accuracy by filtering out inconsistent errors.
Examples
Example 1:
Input:
responses_correct = np.array([True, False, True, False])
Output:
0.5
Explanation: 2 out of 4 responses are correct, so pass@1 = 2/4 = 0.5. This represents a 50% chance of getting a correct answer when sampling once.
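Majority voting can be illustrated the same way; the responses below are hypothetical values, not part of the problem's test cases:

```python
from collections import Counter

# Three samples for the same problem; "42" appears most often,
# so majority voting selects it even though one sample disagrees.
responses = ["42", "41", "42"]
print(Counter(responses).most_common(1)[0][0])  # prints "42"
```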
Starter Code
import numpy as np
from collections import Counter

def pass_at_1(responses_correct: np.ndarray) -> float:
    """
    Compute pass@1 by averaging correctness.

    Args:
        responses_correct: Boolean array for each response
    Returns:
        pass@1 score
    """
    # Your code here
    pass

def majority_voting(responses: list[str]) -> str:
    """
    Return the most common response.

    Args:
        responses: List of response strings
    Returns:
        Most frequent response
    """
    # Your code here
    pass

def pass_at_k(n: int, c: int, k: int) -> float:
    """
    Compute unbiased pass@k from n samples with c correct.
    Formula: pass@k = 1 - C(n-c, k) / C(n, k)

    Args:
        n: Total samples
        c: Correct samples
        k: k in pass@k
    Returns:
        Estimated pass@k
    """
    # Your code here
    pass
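One way the three functions could be filled in is sketched below; this is an illustrative solution, not the official one, and it assumes Python 3.8+ for math.comb:

```python
import math
import numpy as np
from collections import Counter

def pass_at_1(responses_correct: np.ndarray) -> float:
    # Fraction of samples that are correct.
    return float(np.mean(responses_correct))

def majority_voting(responses: list[str]) -> str:
    # most_common(1) returns [(response, count)] for the top entry.
    return Counter(responses).most_common(1)[0][0]

def pass_at_k(n: int, c: int, k: int) -> float:
    # If fewer than k samples are incorrect, every size-k draw
    # contains at least one correct sample.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are incorrect.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

On the example above, pass_at_1(np.array([True, False, True, False])) gives 0.5, matching the expected output; pass_at_k(4, 2, 1) reduces to the same value, since pass@1 is the k=1 case of the estimator.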