Implement a function to evaluate language model responses on MMLU (Massive Multitask Language Understanding) benchmark questions using letter-matching.
In MMLU evaluation, models are given multiple-choice questions with options A, B, C, or D. The model generates a text response, and we need to extract the predicted answer letter and compare it against the ground truth.
Given:
- model_outputs: A list of strings containing the model's generated responses
- ground_truth: A list of correct answer letters (A, B, C, or D)
- subjects: A list of subject/category names for each question
Your function should:
- Extract the predicted letter from each model output (handle various formats like 'A', 'a', 'A.', '(A)', 'A)', or phrases containing the answer)
- Compare predictions against ground truth
- Track valid vs invalid responses (outputs where no letter can be extracted)
- Calculate accuracy metrics overall and per-subject
Return a dictionary containing:
- 'overall_accuracy': Proportion of correct predictions out of all questions
- 'subject_accuracy': Dictionary mapping each subject to its accuracy
- 'valid_response_rate': Proportion of responses where a valid letter was extracted
- 'total_correct': Number of correct predictions
- 'total_questions': Total number of questions
Note: Invalid responses (where no letter can be extracted) count as incorrect for accuracy calculation.
Examples
Example 1:
Input:
model_outputs = ['A', 'C', '(B)', 'The answer is D']
ground_truth = ['A', 'B', 'B', 'D']
subjects = ['math', 'math', 'history', 'history']
Output:
{'overall_accuracy': 0.75, 'subject_accuracy': {'history': 1.0, 'math': 0.5}, 'valid_response_rate': 1.0, 'total_correct': 3, 'total_questions': 4}
Explanation: For each question: (1) 'A' matches 'A' - correct; (2) 'C' extracted, doesn't match 'B' - incorrect; (3) '(B)' -> 'B' matches 'B' - correct; (4) 'The answer is D' -> 'D' matches 'D' - correct. Total: 3/4 correct = 0.75 accuracy. Math subject: 1/2 = 0.5. History subject: 2/2 = 1.0. All 4 responses were valid letter extractions.
Starter Code
def mmlu_letter_matching(model_outputs: list[str], ground_truth: list[str], subjects: list[str]) -> dict:
"""
Evaluate MMLU predictions using letter-matching.
Args:
model_outputs: List of model generated responses
ground_truth: List of correct answer letters (A, B, C, or D)
subjects: List of subject names for each question
Returns:
Dictionary with evaluation metrics
"""
passPython3
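One possible reference solution is sketched below. It uses a word-boundary regex as the letter-extraction heuristic (a reasonable choice for the formats listed above, though other extraction rules are possible) and tallies per-subject counts with a defaultdict; invalid responses are simply never counted as correct.

```python
import re
from collections import defaultdict

def mmlu_letter_matching(model_outputs: list[str], ground_truth: list[str], subjects: list[str]) -> dict:
    """Evaluate MMLU predictions using letter-matching."""
    total = len(model_outputs)
    correct = 0
    valid = 0
    subject_totals: dict[str, int] = defaultdict(int)
    subject_correct: dict[str, int] = defaultdict(int)

    for output, truth, subject in zip(model_outputs, ground_truth, subjects):
        subject_totals[subject] += 1
        # Extract the first standalone A-D letter; handles 'A', 'a', 'A.', '(A)', 'A)', phrases.
        match = re.search(r'\b([ABCD])\b', output.strip(), re.IGNORECASE)
        if match:
            valid += 1
            if match.group(1).upper() == truth.upper():
                correct += 1
                subject_correct[subject] += 1
        # No match: the response is invalid and counts as incorrect.

    return {
        'overall_accuracy': correct / total if total else 0.0,
        'subject_accuracy': {s: subject_correct[s] / subject_totals[s] for s in subject_totals},
        'valid_response_rate': valid / total if total else 0.0,
        'total_correct': correct,
        'total_questions': total,
    }
```

On the Example 1 inputs this returns {'overall_accuracy': 0.75, 'subject_accuracy': {'math': 0.5, 'history': 1.0}, 'valid_response_rate': 1.0, 'total_correct': 3, 'total_questions': 4}.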