Pairwise Preference Judge for LLM Comparison

Medium
LLM

Implement a pairwise preference judge that compares responses from two LLMs across multiple evaluation criteria.

In LLM evaluation, pairwise comparison is a common approach where human annotators or automated systems compare two model outputs and determine which is better. This function should analyze such comparisons.

Function Inputs:

  • comparisons: A list of comparison dictionaries, each containing:
    • 'id': Unique identifier for the comparison
    • 'scores_a': Dictionary mapping criterion names to scores for response A
    • 'scores_b': Dictionary mapping criterion names to scores for response B
  • criteria_weights: Dictionary mapping criterion names to their importance weights
  • tie_threshold: Float threshold; if the absolute weighted score difference is less than or equal to this value, the comparison is declared a tie

Function Output:

A dictionary containing:

  • 'results': List of result dictionaries, each with:
    • 'id': The comparison identifier
    • 'winner': String 'A', 'B', or 'tie'
    • 'margin': The absolute weighted score difference (rounded to 4 decimal places)
  • 'win_rate_a': Proportion of comparisons won by A (rounded to 4 decimal places)
  • 'win_rate_b': Proportion of comparisons won by B (rounded to 4 decimal places)
  • 'tie_rate': Proportion of tied comparisons (rounded to 4 decimal places)
  • 'avg_margin': Average absolute margin across all comparisons (rounded to 4 decimal places)

Note: Weights should be normalized so they sum to 1 when computing weighted scores. For empty comparison lists, return all rates and average margin as 0.0.
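The normalization rule above can be illustrated with a short snippet. This is only a sketch of the per-comparison scoring step; the variable names are illustrative, and the values are taken from Example 1 below.

```python
# Sketch: weighted score under normalized weights (names are illustrative).
weights = {'quality': 0.6, 'relevance': 0.4}   # example criteria_weights
scores = {'quality': 8, 'relevance': 7}        # example scores for one response

total = sum(weights.values())                  # normalize so weights sum to 1
weighted = sum(scores[c] * w for c, w in weights.items()) / total
# For these values: (8*0.6 + 7*0.4) / 1.0 = 7.6
```

Here the weights already sum to 1, so dividing by `total` changes nothing; the division matters when callers pass unnormalized weights such as `{'quality': 3, 'relevance': 2}`.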

Examples

Example 1:
Input: comparisons = [{'id': 1, 'scores_a': {'quality': 8, 'relevance': 7}, 'scores_b': {'quality': 6, 'relevance': 8}}, {'id': 2, 'scores_a': {'quality': 5, 'relevance': 6}, 'scores_b': {'quality': 7, 'relevance': 8}}], criteria_weights = {'quality': 0.6, 'relevance': 0.4}, tie_threshold = 0.5
Output: {'results': [{'id': 1, 'winner': 'A', 'margin': 0.8}, {'id': 2, 'winner': 'B', 'margin': 2.0}], 'win_rate_a': 0.5, 'win_rate_b': 0.5, 'tie_rate': 0.0, 'avg_margin': 1.4}
Explanation: For comparison 1: weighted_a = (8*0.6 + 7*0.4)/1.0 = 7.6, weighted_b = (6*0.6 + 8*0.4)/1.0 = 6.8, margin = 0.8 > 0.5, so A wins. For comparison 2: weighted_a = 5.4, weighted_b = 7.4, margin = 2.0, so B wins. Win rates are each 50%, no ties, average margin is 1.4.

Starter Code

def pairwise_preference_judge(comparisons: list, criteria_weights: dict, tie_threshold: float) -> dict:
    """
    Analyze pairwise comparisons between LLM responses.
    
    Args:
        comparisons: List of comparison dicts with 'id', 'scores_a', 'scores_b'
        criteria_weights: Dict mapping criterion names to importance weights
        tie_threshold: Maximum difference to declare a tie
    
    Returns:
        Dict with 'results', 'win_rate_a', 'win_rate_b', 'tie_rate', 'avg_margin'
    """
    # Your code here
    pass
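One possible way to fill in the stub, shown here as a sketch rather than the canonical solution, is to normalize the weights once, score each comparison, and tally outcomes:

```python
def pairwise_preference_judge(comparisons: list, criteria_weights: dict, tie_threshold: float) -> dict:
    # Empty input: all rates and the average margin are 0.0, per the spec.
    if not comparisons:
        return {'results': [], 'win_rate_a': 0.0, 'win_rate_b': 0.0,
                'tie_rate': 0.0, 'avg_margin': 0.0}

    total_weight = sum(criteria_weights.values())  # normalize weights to sum to 1
    results = []
    wins_a = wins_b = ties = 0
    margins = []

    for comp in comparisons:
        weighted_a = sum(comp['scores_a'][c] * w for c, w in criteria_weights.items()) / total_weight
        weighted_b = sum(comp['scores_b'][c] * w for c, w in criteria_weights.items()) / total_weight
        margin = abs(weighted_a - weighted_b)
        margins.append(margin)

        # Tie check uses the unrounded margin; only the reported value is rounded.
        if margin <= tie_threshold:
            winner = 'tie'
            ties += 1
        elif weighted_a > weighted_b:
            winner = 'A'
            wins_a += 1
        else:
            winner = 'B'
            wins_b += 1

        results.append({'id': comp['id'], 'winner': winner, 'margin': round(margin, 4)})

    n = len(comparisons)
    return {
        'results': results,
        'win_rate_a': round(wins_a / n, 4),
        'win_rate_b': round(wins_b / n, 4),
        'tie_rate': round(ties / n, 4),
        'avg_margin': round(sum(margins) / n, 4),
    }
```

On Example 1 this produces winners 'A' and 'B' with margins 0.8 and 2.0, win rates of 0.5 each, and an average margin of 1.4. Note the assumption that every criterion in `criteria_weights` appears in both score dictionaries; the spec does not define behavior for missing criteria.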