Implement a function that performs rubric-based evaluation of LLM outputs using multiple judges. This is a common technique in LLM-as-a-judge evaluation pipelines where multiple language models score a response across different quality criteria.
Your function should take:
- judge_scores: A 2D list where judge_scores[i][j] represents the score given by judge i for criterion j. Scores range from 0 to max_score.
- criteria_weights: A list of weights for each criterion (weights sum to 1).
- passing_threshold: The minimum normalized score (0 to 1) required to pass (default: 0.6).
- max_score: The maximum possible score for any criterion (default: 5.0).
Your function should return a dictionary containing:
- weighted_score: The overall weighted score combining all criteria and judges
- normalized_score: The weighted score normalized to a 0-1 scale
- criterion_scores: A list of average scores per criterion across all judges
- pass_status: Boolean indicating whether the response passes the threshold
- judge_agreement: A metric from 0 to 1 indicating how much the judges agree (1 = perfect agreement)
For judge agreement, use a standard deviation-based approach where agreement decreases as the average standard deviation across criteria increases relative to the maximum possible standard deviation.
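The description leaves one detail open: the value of the maximum possible standard deviation. A natural choice, and the one consistent with the expected output in Example 1, is max_score / 2, the spread obtained when judges split evenly between 0 and max_score. A minimal sketch of the agreement metric under that assumption:

```python
import numpy as np

def judge_agreement(judge_scores: list[list[float]], max_score: float = 5.0) -> float:
    """Agreement in [0, 1]; 1 means every judge gave identical scores on each criterion."""
    scores = np.asarray(judge_scores, dtype=float)
    # Population std (ddof=0) of each criterion's column of judge scores.
    per_criterion_std = scores.std(axis=0)
    # Assumed worst case: judges split evenly between 0 and max_score -> std = max_score / 2.
    max_std = max_score / 2
    return 1.0 - float(per_criterion_std.mean()) / max_std
```

With the scores from Example 1 this yields 0.8114, matching the expected output.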
Examples
Example 1:
Input:
judge_scores = [[4, 5, 3], [4, 4, 4], [5, 5, 4]], criteria_weights = [0.3, 0.5, 0.2]
Output:
{'weighted_score': 4.3667, 'normalized_score': 0.8733, 'criterion_scores': [4.3333, 4.6667, 3.6667], 'pass_status': True, 'judge_agreement': 0.8114}
Explanation: Three judges evaluate a response on 3 criteria. For criterion 0, the judges give [4, 4, 5], averaging 4.3333. For criterion 1, they give [5, 4, 5], averaging 4.6667. For criterion 2, they give [3, 4, 4], averaging 3.6667. The weighted score is 4.3333*0.3 + 4.6667*0.5 + 3.6667*0.2 = 4.3667. Normalized to a 0-1 scale: 4.3667/5 = 0.8733, which passes the 0.6 threshold. Judge agreement is computed from the average standard deviation across criteria relative to the maximum possible standard deviation.
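The numbers in Example 1 can be checked with a few lines of numpy; the @ operator applies the criteria weights to the per-criterion means (the maximum possible std of max_score / 2 used for agreement is an assumption consistent with the expected output):

```python
import numpy as np

scores = np.array([[4, 5, 3], [4, 4, 4], [5, 5, 4]], dtype=float)
weights = np.array([0.3, 0.5, 0.2])

criterion_means = scores.mean(axis=0)        # per-criterion averages: [4.3333, 4.6667, 3.6667]
weighted = float(criterion_means @ weights)  # 4.3667
normalized = weighted / 5.0                  # 0.8733 -> passes the 0.6 threshold
# Agreement assumes max possible std = max_score / 2 = 2.5.
agreement = 1.0 - float(scores.std(axis=0).mean()) / 2.5  # 0.8114
```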
Example 2:
Input:
Hidden test case or specific edge case
Output:
Correct evaluated result
Explanation: An additional example demonstrating the robustness of the implementation.
Starter Code
import numpy as np
def rubric_llm_judge_evaluation(
    judge_scores: list[list[float]],
    criteria_weights: list[float],
    passing_threshold: float = 0.6,
    max_score: float = 5.0
) -> dict:
    """
    Evaluate an LLM response using rubric-based multi-judge scoring.

    Args:
        judge_scores: 2D list where judge_scores[i][j] is judge i's score for criterion j
        criteria_weights: Weights for each criterion (should sum to 1)
        passing_threshold: Minimum normalized score to pass (0 to 1)
        max_score: Maximum possible score for each criterion

    Returns:
        Dictionary with evaluation results
    """
    pass