Implement a pairwise preference judge that compares responses from two LLMs across multiple evaluation criteria.
In LLM evaluation, pairwise comparison is a common approach where human annotators or automated systems compare two model outputs and determine which is better. This function should analyze such comparisons.
Function Inputs:
- comparisons: A list of comparison dictionaries, each containing:
  - 'id': Unique identifier for the comparison
  - 'scores_a': Dictionary mapping criterion names to scores for response A
  - 'scores_b': Dictionary mapping criterion names to scores for response B
- criteria_weights: Dictionary mapping criterion names to their importance weights
- tie_threshold: Float threshold; if the absolute weighted score difference is less than or equal to this value, declare a tie
Function Output:
A dictionary containing:
- 'results': List of result dictionaries, each with:
  - 'id': The comparison identifier
  - 'winner': String 'A', 'B', or 'tie'
  - 'margin': The absolute weighted score difference (rounded to 4 decimal places)
- 'win_rate_a': Proportion of comparisons won by A (rounded to 4 decimal places)
- 'win_rate_b': Proportion of comparisons won by B (rounded to 4 decimal places)
- 'tie_rate': Proportion of tied comparisons (rounded to 4 decimal places)
- 'avg_margin': Average absolute margin across all comparisons (rounded to 4 decimal places)
Note: Weights should be normalized so they sum to 1 when computing weighted scores. For empty comparison lists, return all rates and average margin as 0.0.
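The normalization note can be illustrated with a small sketch. The helper name `weighted_score` is hypothetical and not part of the required interface; it only shows the intended arithmetic under the assumption that every criterion in the weights dict also appears in the scores dict:

```python
def weighted_score(scores: dict, weights: dict) -> float:
    # Divide by the total weight so the weights effectively sum to 1.
    total = sum(weights.values())
    return sum(scores[c] * w for c, w in weights.items()) / total

# With weights {0.6, 0.4} (already summing to 1), this is a plain
# weighted average: 8*0.6 + 7*0.4, i.e. approximately 7.6.
print(weighted_score({'quality': 8, 'relevance': 7},
                     {'quality': 0.6, 'relevance': 0.4}))
```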
Examples
Example 1:
Input:
comparisons = [{'id': 1, 'scores_a': {'quality': 8, 'relevance': 7}, 'scores_b': {'quality': 6, 'relevance': 8}}, {'id': 2, 'scores_a': {'quality': 5, 'relevance': 6}, 'scores_b': {'quality': 7, 'relevance': 8}}], criteria_weights = {'quality': 0.6, 'relevance': 0.4}, tie_threshold = 0.5
Output:
{'results': [{'id': 1, 'winner': 'A', 'margin': 0.8}, {'id': 2, 'winner': 'B', 'margin': 2.0}], 'win_rate_a': 0.5, 'win_rate_b': 0.5, 'tie_rate': 0.0, 'avg_margin': 1.4}
Explanation: For comparison 1: weighted_a = (8*0.6 + 7*0.4)/1.0 = 7.6, weighted_b = (6*0.6 + 8*0.4)/1.0 = 6.8, margin = 0.8 > 0.5, so A wins. For comparison 2: weighted_a = 5.4, weighted_b = 7.4, margin = 2.0, so B wins. Win rates are each 50%, no ties, average margin is 1.4.
Example 2:
Input:
Hidden test case or specific edge case
Output:
Correct evaluated result
Explanation: An additional example to demonstrate the robustness of the implementation.
Starter Code
def pairwise_preference_judge(comparisons: list, criteria_weights: dict, tie_threshold: float) -> dict:
    """
    Analyze pairwise comparisons between LLM responses.

    Args:
        comparisons: List of comparison dicts with 'id', 'scores_a', 'scores_b'
        criteria_weights: Dict mapping criterion names to importance weights
        tie_threshold: Maximum difference to declare a tie

    Returns:
        Dict with 'results', 'win_rate_a', 'win_rate_b', 'tie_rate', 'avg_margin'
    """
    # Your code here
    pass
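One way the specification above could be implemented is sketched below. This is a possible solution, not the official one; it assumes ties are decided on the unrounded margin, with rounding applied only to the reported values:

```python
def pairwise_preference_judge(comparisons: list, criteria_weights: dict, tie_threshold: float) -> dict:
    # Empty input: all rates and the average margin are defined as 0.0.
    if not comparisons:
        return {'results': [], 'win_rate_a': 0.0, 'win_rate_b': 0.0,
                'tie_rate': 0.0, 'avg_margin': 0.0}

    total_weight = sum(criteria_weights.values())  # normalize so weights sum to 1
    results, margins = [], []
    wins_a = wins_b = ties = 0

    for comp in comparisons:
        weighted_a = sum(comp['scores_a'][c] * w for c, w in criteria_weights.items()) / total_weight
        weighted_b = sum(comp['scores_b'][c] * w for c, w in criteria_weights.items()) / total_weight
        margin = abs(weighted_a - weighted_b)

        # A tie when the difference does not exceed the threshold; otherwise
        # the response with the higher weighted score wins.
        if margin <= tie_threshold:
            winner = 'tie'
            ties += 1
        elif weighted_a > weighted_b:
            winner = 'A'
            wins_a += 1
        else:
            winner = 'B'
            wins_b += 1

        margins.append(margin)
        results.append({'id': comp['id'], 'winner': winner, 'margin': round(margin, 4)})

    n = len(comparisons)
    return {
        'results': results,
        'win_rate_a': round(wins_a / n, 4),
        'win_rate_b': round(wins_b / n, 4),
        'tie_rate': round(ties / n, 4),
        'avg_margin': round(sum(margins) / n, 4),
    }
```

On Example 1 this reproduces the expected output: margins 0.8 and 2.0, one win each, no ties, average margin 1.4.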