In production ML systems, canary deployments are a critical strategy for safely rolling out new model versions. A small percentage of traffic is routed to the new (canary) model while the majority continues to use the existing (baseline) model. By comparing their performance, you can decide whether to promote the canary to full production or roll back.
Given prediction results from both canary and baseline models, compute key comparison metrics to determine if the canary deployment is healthy.
Each result in both lists is a dictionary with:
- 'latency_ms': Response latency in milliseconds (float)
- 'prediction': The model's predicted value
- 'ground_truth': The actual correct value
Write a function analyze_canary_deployment(canary_results, baseline_results, accuracy_tolerance, latency_tolerance) that computes:
- canary_accuracy: Fraction of correct predictions for canary model (0-1)
- baseline_accuracy: Fraction of correct predictions for baseline model (0-1)
- accuracy_change_pct: Relative change in accuracy as a percentage of the baseline, i.e. (canary_accuracy - baseline_accuracy) / baseline_accuracy * 100
- canary_avg_latency: Average latency of canary model (ms)
- baseline_avg_latency: Average latency of baseline model (ms)
- latency_change_pct: Relative change in average latency as a percentage of the baseline, i.e. (canary_avg_latency - baseline_avg_latency) / baseline_avg_latency * 100
- promote_recommended: Boolean; True if the canary's accuracy did not degrade by more than accuracy_tolerance (relative) AND its average latency did not increase by more than latency_tolerance (relative)
If either input list is empty, return an empty dictionary.
All numeric values should be rounded to 2 decimal places, except accuracy values, which should be rounded to 4 decimal places.
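The two "relative change" metrics use the same formula. A minimal sketch (the helper name relative_change_pct is illustrative, not part of the required interface):

```python
def relative_change_pct(new_value: float, old_value: float) -> float:
    """Relative change of new_value vs. old_value, as a percentage.

    Positive means new_value is higher than old_value; negative means lower.
    """
    return (new_value - old_value) / old_value * 100

# For the latencies in the example below: (48.4 - 53.6) / 53.6 * 100
print(round(relative_change_pct(48.4, 53.6), 2))  # -9.7
```

Note that a negative latency_change_pct is good (the canary is faster), while a negative accuracy_change_pct is bad (the canary is less accurate).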
Examples
Input:

canary_results = [
    {'latency_ms': 45, 'prediction': 1, 'ground_truth': 1},
    {'latency_ms': 50, 'prediction': 0, 'ground_truth': 0},
    {'latency_ms': 48, 'prediction': 1, 'ground_truth': 1},
    {'latency_ms': 52, 'prediction': 1, 'ground_truth': 0},
    {'latency_ms': 47, 'prediction': 0, 'ground_truth': 0},
]
baseline_results = [
    {'latency_ms': 50, 'prediction': 1, 'ground_truth': 1},
    {'latency_ms': 55, 'prediction': 0, 'ground_truth': 0},
    {'latency_ms': 52, 'prediction': 1, 'ground_truth': 0},
    {'latency_ms': 58, 'prediction': 0, 'ground_truth': 0},
    {'latency_ms': 53, 'prediction': 1, 'ground_truth': 1},
]

Output:

{'canary_accuracy': 0.8, 'baseline_accuracy': 0.8, 'accuracy_change_pct': 0.0, 'canary_avg_latency': 48.4, 'baseline_avg_latency': 53.6, 'latency_change_pct': -9.7, 'promote_recommended': True}

Starter Code
def analyze_canary_deployment(canary_results: list, baseline_results: list, accuracy_tolerance: float = 0.05, latency_tolerance: float = 0.10) -> dict:
    """
    Analyze canary deployment health metrics for model rollout decision.

    Args:
        canary_results: list of prediction results from canary (new) model.
            Each dict has 'latency_ms', 'prediction', 'ground_truth'.
        baseline_results: list of prediction results from baseline (existing) model.
            Each dict has 'latency_ms', 'prediction', 'ground_truth'.
        accuracy_tolerance: max acceptable relative accuracy degradation (0.05 = 5%).
        latency_tolerance: max acceptable relative latency increase (0.10 = 10%).

    Returns:
        dict with canary/baseline metrics and promotion recommendation.
    """
    pass
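One possible solution sketch follows. It assumes the tolerances are compared against the relative change percentages (so accuracy_tolerance = 0.05 means accuracy_change_pct may not fall below -5.0), which is consistent with the example above; it also guards against division by zero when a baseline metric is 0.

```python
def analyze_canary_deployment(canary_results: list, baseline_results: list,
                              accuracy_tolerance: float = 0.05,
                              latency_tolerance: float = 0.10) -> dict:
    # No basis for comparison if either side has no results.
    if not canary_results or not baseline_results:
        return {}

    def accuracy(results):
        correct = sum(1 for r in results if r['prediction'] == r['ground_truth'])
        return correct / len(results)

    def avg_latency(results):
        return sum(r['latency_ms'] for r in results) / len(results)

    canary_acc = accuracy(canary_results)
    baseline_acc = accuracy(baseline_results)
    canary_lat = avg_latency(canary_results)
    baseline_lat = avg_latency(baseline_results)

    # Relative changes as percentages of the baseline; 0.0 if baseline is 0.
    acc_change = ((canary_acc - baseline_acc) / baseline_acc * 100) if baseline_acc else 0.0
    lat_change = ((canary_lat - baseline_lat) / baseline_lat * 100) if baseline_lat else 0.0

    # Promote when accuracy degraded no more than the tolerance and
    # latency increased no more than the tolerance (both relative).
    promote = (acc_change >= -accuracy_tolerance * 100) and \
              (lat_change <= latency_tolerance * 100)

    return {
        'canary_accuracy': round(canary_acc, 4),
        'baseline_accuracy': round(baseline_acc, 4),
        'accuracy_change_pct': round(acc_change, 2),
        'canary_avg_latency': round(canary_lat, 2),
        'baseline_avg_latency': round(baseline_lat, 2),
        'latency_change_pct': round(lat_change, 2),
        'promote_recommended': promote,
    }
```

On the example inputs this yields canary and baseline accuracies of 0.8 each (4 of 5 predictions correct per model), average latencies of 48.4 ms and 53.6 ms, and a latency change of -9.7%, so promotion is recommended.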