MMLU Letter-Matching Evaluation

Medium
NLP

Implement a function to evaluate language model responses on MMLU (Massive Multitask Language Understanding) benchmark questions using letter-matching.

In MMLU evaluation, models are given multiple-choice questions with options A, B, C, or D. The model generates a text response, and we need to extract the predicted answer letter and compare it against the ground truth.

Given:

  • model_outputs: A list of strings containing the model's generated responses
  • ground_truth: A list of correct answer letters (A, B, C, or D)
  • subjects: A list of subject/category names for each question

Your function should:

  1. Extract the predicted letter from each model output (handle various formats like 'A', 'a', 'A.', '(A)', 'A)', or phrases containing the answer)
  2. Compare predictions against ground truth
  3. Track valid vs invalid responses (outputs where no letter can be extracted)
  4. Calculate accuracy metrics overall and per-subject

Return a dictionary containing:

  • 'overall_accuracy': Proportion of correct predictions out of all questions
  • 'subject_accuracy': Dictionary mapping each subject to its accuracy
  • 'valid_response_rate': Proportion of responses where a valid letter was extracted
  • 'total_correct': Number of correct predictions
  • 'total_questions': Total number of questions

Note: Invalid responses (where no letter can be extracted) count as incorrect for accuracy calculation.

Examples

Example 1:
Input:
model_outputs = ['A', 'C', '(B)', 'The answer is D']
ground_truth = ['A', 'B', 'B', 'D']
subjects = ['math', 'math', 'history', 'history']
Output: {'overall_accuracy': 0.75, 'subject_accuracy': {'history': 1.0, 'math': 0.5}, 'valid_response_rate': 1.0, 'total_correct': 3, 'total_questions': 4}
Explanation: For each question: (1) 'A' matches 'A' - correct; (2) 'C' extracted, doesn't match 'B' - incorrect; (3) '(B)' -> 'B' matches 'B' - correct; (4) 'The answer is D' -> 'D' matches 'D' - correct. Total: 3/4 correct = 0.75 accuracy. Math subject: 1/2 = 0.5. History subject: 2/2 = 1.0. All 4 responses were valid letter extractions.

Starter Code

def mmlu_letter_matching(model_outputs: list[str], ground_truth: list[str], subjects: list[str]) -> dict:
    """
    Evaluate MMLU predictions using letter-matching.
    
    Args:
        model_outputs: List of model generated responses
        ground_truth: List of correct answer letters (A, B, C, or D)
        subjects: List of subject names for each question
    
    Returns:
        Dictionary with evaluation metrics
    """
    pass
The AI Interview - Master AI/ML Interviews