Implement the BLEU (Bilingual Evaluation Understudy) score metric, which is widely used to evaluate the quality of machine-generated text by comparing it against one or more reference texts.
Given a candidate sentence (as a list of tokens), a list of reference sentences (each as a list of tokens), and a maximum n-gram order, compute the BLEU score.
Your function should:
- Calculate modified n-gram precision for each n from 1 to max_n, where counts are clipped to avoid gaming by repetition
- Apply a brevity penalty to discourage overly short translations
- Combine the precisions using a geometric mean
- Return 0.0 if any n-gram precision is zero or if the candidate is empty
- When selecting the reference length for brevity penalty with multiple references, choose the length closest to the candidate length (if tied, choose shorter)
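The steps above can be sketched as one possible implementation (the helper names `bleu_sketch` and `ngrams` are illustrative, not part of the starter code):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-grams as tuples
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu_sketch(candidate, references, max_n=4):
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        if not cand_counts:
            return 0.0  # candidate too short to form any n-gram
        # Clip each n-gram count to its maximum count across all references
        max_ref = Counter()
        for ref in references:
            for g, c in Counter(ngrams(ref, n)).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        if clipped == 0:
            return 0.0  # zero precision at this order -> BLEU is 0
        log_precisions.append(math.log(clipped / sum(cand_counts.values())))
    # Reference length closest to the candidate length (ties -> shorter)
    c_len = len(candidate)
    r_len = min((len(r) for r in references),
                key=lambda length: (abs(length - c_len), length))
    # Brevity penalty: 1 if the candidate is at least as long as the reference
    bp = 1.0 if c_len >= r_len else math.exp(1 - r_len / c_len)
    return bp * math.exp(sum(log_precisions) / max_n)
```

On Example 1 below, this sketch returns 0.5, matching the expected output.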
Examples
Example 1:
Input:
candidate = ['a', 'b', 'c', 'd'], references = [['a', 'b', 'x', 'd']], max_n = 2
Output:
0.5
Explanation: For 1-grams: candidate has {a, b, c, d}, reference has {a, b, x, d}. Clipped counts: a=1, b=1, c=0, d=1, total clipped=3, total candidate=4, so p1=3/4=0.75. For 2-grams: candidate has {(a,b), (b,c), (c,d)}, reference has {(a,b), (b,x), (x,d)}. Only (a,b) matches, so p2=1/3. Geometric mean = exp((log(0.75) + log(0.333))/2) = exp(-0.693) = 0.5. Since candidate length equals reference length, brevity penalty = 1.0. Final BLEU = 1.0 * 0.5 = 0.5.
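The clipped counts in the explanation can be reproduced directly with collections.Counter (a small standalone check, not part of the required solution):

```python
from collections import Counter

candidate = ['a', 'b', 'c', 'd']
reference = ['a', 'b', 'x', 'd']

# Unigram precision: clip each candidate count to the reference count
cand_1 = Counter(candidate)
ref_1 = Counter(reference)
clipped_1 = sum(min(c, ref_1[g]) for g, c in cand_1.items())
print(clipped_1, sum(cand_1.values()))  # 3 4 -> p1 = 3/4

# Bigram precision: same clipping applied to 2-grams
def bigrams(tokens):
    return [tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1)]

cand_2 = Counter(bigrams(candidate))
ref_2 = Counter(bigrams(reference))
clipped_2 = sum(min(c, ref_2[g]) for g, c in cand_2.items())
print(clipped_2, sum(cand_2.values()))  # 1 3 -> p2 = 1/3
```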
Starter Code
import numpy as np
from collections import Counter

def bleu_score(candidate: list[str], references: list[list[str]], max_n: int = 4) -> float:
    """
    Calculate BLEU score for a candidate sentence against reference sentences.

    Args:
        candidate: List of tokens in the candidate sentence
        references: List of reference sentences, each as a list of tokens
        max_n: Maximum n-gram order (default: 4)

    Returns:
        BLEU score between 0 and 1
    """
    # Your code here
    pass