Rubric-Based LLM Judge Evaluation

Medium
LLM

Implement a function that performs rubric-based evaluation of LLM outputs using multiple judges. This is a common technique in LLM-as-a-judge evaluation pipelines where multiple language models score a response across different quality criteria.

Your function should take:

  • judge_scores: A 2D list where judge_scores[i][j] represents the score given by judge i for criterion j. Scores range from 0 to max_score.
  • criteria_weights: A list of weights for each criterion (weights sum to 1)
  • passing_threshold: The minimum normalized score (0 to 1) required to pass (default: 0.6)
  • max_score: The maximum possible score for any criterion (default: 5.0)

Your function should return a dictionary containing:

  • weighted_score: The overall weighted score combining all criteria and judges
  • normalized_score: The weighted score normalized to a 0-1 scale
  • criterion_scores: A list of average scores per criterion across all judges
  • pass_status: Boolean indicating if the response passes the threshold
  • judge_agreement: A metric from 0 to 1 indicating how much judges agree (1 = perfect agreement)

For judge agreement, use a standard deviation-based approach: compute the population standard deviation of the judges' scores for each criterion, average these across criteria, and divide by the maximum possible standard deviation, max_score / 2 (the std of scores split evenly between 0 and max_score). Agreement is 1 minus this ratio, so it decreases toward 0 as judges' scores spread apart and equals 1 when all judges agree exactly.
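This agreement metric can be sketched in a few lines of NumPy. The sketch assumes population standard deviation per criterion and a maximum possible std of max_score / 2, which reproduces the agreement figure in Example 1 below:

```python
import numpy as np

def judge_agreement(judge_scores, max_score=5.0):
    # Rows are judges, columns are criteria; take the population std of each column.
    stds = np.array(judge_scores, dtype=float).std(axis=0)
    max_std = max_score / 2  # largest possible std for values confined to [0, max_score]
    return float(1 - stds.mean() / max_std)
```

For the scores in Example 1 this yields approximately 0.8114; identical scores from every judge yield exactly 1.0.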

Examples

Example 1:
Input: judge_scores = [[4, 5, 3], [4, 4, 4], [5, 5, 4]], criteria_weights = [0.3, 0.5, 0.2]
Output: {'weighted_score': 4.3667, 'normalized_score': 0.8733, 'criterion_scores': [4.3333, 4.6667, 3.6667], 'pass_status': True, 'judge_agreement': 0.8114}
Explanation: Three judges evaluate a response on 3 criteria. For criterion 0, judges give [4, 4, 5] with average 4.3333. For criterion 1, judges give [5, 4, 5] with average 4.6667. For criterion 2, judges give [3, 4, 4] with average 3.6667. The weighted score is 4.3333*0.3 + 4.6667*0.5 + 3.6667*0.2 = 4.3667. Normalized to the 0-1 scale: 4.3667/5 = 0.8733, which passes the 0.6 threshold. Each criterion's scores have a population standard deviation of 0.4714, so judge_agreement = 1 - 0.4714/2.5 = 0.8114.
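The per-criterion averages and the weighted score in the explanation above can be checked directly with NumPy (judges are rows, so criterion averages are column means):

```python
import numpy as np

scores = np.array([[4, 5, 3], [4, 4, 4], [5, 5, 4]], dtype=float)
weights = np.array([0.3, 0.5, 0.2])

criterion_scores = scores.mean(axis=0)        # one average per criterion (column)
weighted = float(criterion_scores @ weights)  # 4.3667 after rounding
```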
Example 2:
Input: judge_scores = [[2, 3], [4, 1]], criteria_weights = [0.5, 0.5]
Output: {'weighted_score': 2.5, 'normalized_score': 0.5, 'criterion_scores': [3.0, 2.0], 'pass_status': False, 'judge_agreement': 0.6}
Explanation: Two judges evaluate a response on 2 criteria. Criterion 0 receives [2, 4] (average 3.0) and criterion 1 receives [3, 1] (average 2.0), so the weighted score is 3.0*0.5 + 2.0*0.5 = 2.5 and the normalized score is 2.5/5 = 0.5, which falls below the default 0.6 threshold. Each criterion's scores have a population standard deviation of 1.0, so judge_agreement = 1 - 1.0/2.5 = 0.6. The judges disagree more here, and the response fails.

Starter Code

import numpy as np

def rubric_llm_judge_evaluation(
    judge_scores: list[list[float]],
    criteria_weights: list[float],
    passing_threshold: float = 0.6,
    max_score: float = 5.0
) -> dict:
    """
    Evaluate LLM response using rubric-based multi-judge scoring.
    
    Args:
        judge_scores: 2D list where judge_scores[i][j] is judge i's score for criterion j
        criteria_weights: Weights for each criterion (should sum to 1)
        passing_threshold: Minimum normalized score to pass (0 to 1)
        max_score: Maximum possible score for each criterion
    
    Returns:
        Dictionary with evaluation results
    """
    pass
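
One possible reference implementation, consistent with Example 1 (population standard deviation per criterion, agreement defined as 1 - mean std / (max_score / 2), and numeric fields rounded to 4 decimal places as in the example output):

```python
import numpy as np

def rubric_llm_judge_evaluation(
    judge_scores: list[list[float]],
    criteria_weights: list[float],
    passing_threshold: float = 0.6,
    max_score: float = 5.0,
) -> dict:
    scores = np.array(judge_scores, dtype=float)   # shape: (num_judges, num_criteria)
    weights = np.array(criteria_weights, dtype=float)

    crit_means = scores.mean(axis=0)               # average per criterion across judges
    weighted_score = float(crit_means @ weights)
    normalized_score = weighted_score / max_score

    # Agreement: 1 minus the mean per-criterion std, scaled by the max possible std.
    mean_std = float(scores.std(axis=0).mean())
    agreement = 1 - mean_std / (max_score / 2)

    return {
        'weighted_score': round(weighted_score, 4),
        'normalized_score': round(normalized_score, 4),
        'criterion_scores': [round(float(s), 4) for s in crit_means],
        'pass_status': normalized_score >= passing_threshold,
        'judge_agreement': round(agreement, 4),
    }
```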
The AI Interview - Master AI/ML Interviews