Analyze Canary Deployment Health for Model Rollout

Difficulty: Medium
Topic: MLOps

In production ML systems, canary deployments are a critical strategy for safely rolling out new model versions. A small percentage of traffic is routed to the new (canary) model while the majority continues to use the existing (baseline) model. By comparing their performance, you can decide whether to promote the canary to full production or roll back.

Given prediction results from both canary and baseline models, compute key comparison metrics to determine if the canary deployment is healthy.

Each result in both lists is a dictionary with:

  • 'latency_ms': Response latency in milliseconds (float)
  • 'prediction': The model's predicted value
  • 'ground_truth': The actual correct value

Write a function analyze_canary_deployment(canary_results, baseline_results, accuracy_tolerance, latency_tolerance) that computes:

  1. canary_accuracy: Fraction of correct predictions for canary model (0-1)
  2. baseline_accuracy: Fraction of correct predictions for baseline model (0-1)
  3. accuracy_change_pct: Relative change in accuracy as a percentage, (canary_accuracy - baseline_accuracy) / baseline_accuracy * 100
  4. canary_avg_latency: Average latency of canary model (ms)
  5. baseline_avg_latency: Average latency of baseline model (ms)
  6. latency_change_pct: Relative change in latency as a percentage, (canary_avg_latency - baseline_avg_latency) / baseline_avg_latency * 100
  7. promote_recommended: Boolean - True if the canary's relative accuracy degradation does not exceed accuracy_tolerance AND its relative latency increase does not exceed latency_tolerance

If either input list is empty, return an empty dictionary.

All numeric values should be rounded to 2 decimal places, except accuracy values, which should be rounded to 4 decimal places.
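The subtlest part of the spec is the promotion rule: both tolerances are relative fractions (matching the starter-code defaults of 0.05 and 0.10), not absolute differences. A minimal sketch of just that check, with a hypothetical helper name `promote`:

```python
def promote(canary_acc: float, base_acc: float,
            canary_lat: float, base_lat: float,
            accuracy_tolerance: float = 0.05,
            latency_tolerance: float = 0.10) -> bool:
    # Relative changes; assumes the baseline values are nonzero.
    acc_change = (canary_acc - base_acc) / base_acc    # negative = canary degraded
    lat_change = (canary_lat - base_lat) / base_lat    # positive = canary slower
    return acc_change >= -accuracy_tolerance and lat_change <= latency_tolerance
```

With a 5% accuracy tolerance, a drop from 0.80 to 0.70 is a 12.5% relative degradation and blocks promotion, even though the absolute drop is only 0.10.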

Examples

Example 1:
Input: canary_results = [{'latency_ms': 45, 'prediction': 1, 'ground_truth': 1}, {'latency_ms': 50, 'prediction': 0, 'ground_truth': 0}, {'latency_ms': 48, 'prediction': 1, 'ground_truth': 1}, {'latency_ms': 52, 'prediction': 1, 'ground_truth': 0}, {'latency_ms': 47, 'prediction': 0, 'ground_truth': 0}], baseline_results = [{'latency_ms': 50, 'prediction': 1, 'ground_truth': 1}, {'latency_ms': 55, 'prediction': 0, 'ground_truth': 0}, {'latency_ms': 52, 'prediction': 1, 'ground_truth': 0}, {'latency_ms': 58, 'prediction': 0, 'ground_truth': 0}, {'latency_ms': 53, 'prediction': 1, 'ground_truth': 1}]
Output: {'canary_accuracy': 0.8, 'baseline_accuracy': 0.8, 'accuracy_change_pct': 0.0, 'canary_avg_latency': 48.4, 'baseline_avg_latency': 53.6, 'latency_change_pct': -9.7, 'promote_recommended': True}
Explanation: Canary has 4/5 correct predictions (accuracy 0.8), baseline also has 4/5 (accuracy 0.8), so accuracy change is 0%. Canary average latency is (45+50+48+52+47)/5 = 48.4ms, baseline is (50+55+52+58+53)/5 = 53.6ms. Latency change is (48.4-53.6)/53.6 * 100 = -9.7% (improved). Since accuracy did not degrade and latency improved, promote_recommended is True.
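The latency figures in the explanation above can be checked directly:

```python
canary_avg = sum([45, 50, 48, 52, 47]) / 5    # 242 / 5 = 48.4
baseline_avg = sum([50, 55, 52, 58, 53]) / 5  # 268 / 5 = 53.6
change_pct = round((canary_avg - baseline_avg) / baseline_avg * 100, 2)
print(canary_avg, baseline_avg, change_pct)   # 48.4 53.6 -9.7
```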

Starter Code

def analyze_canary_deployment(canary_results: list, baseline_results: list, accuracy_tolerance: float = 0.05, latency_tolerance: float = 0.10) -> dict:
    """
    Analyze canary deployment health metrics for model rollout decision.
    
    Args:
        canary_results: list of prediction results from canary (new) model
                       Each dict has 'latency_ms', 'prediction', 'ground_truth'
        baseline_results: list of prediction results from baseline (existing) model
                         Each dict has 'latency_ms', 'prediction', 'ground_truth'
        accuracy_tolerance: max acceptable relative accuracy degradation (0.05 = 5%)
        latency_tolerance: max acceptable relative latency increase (0.10 = 10%)
    
    Returns:
        dict with canary/baseline metrics and promotion recommendation
    """
    pass
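One possible solution sketch (not the only valid approach) that fills in the starter stub according to the spec above:

```python
def analyze_canary_deployment(canary_results: list, baseline_results: list,
                              accuracy_tolerance: float = 0.05,
                              latency_tolerance: float = 0.10) -> dict:
    # Empty input on either side: no basis for comparison.
    if not canary_results or not baseline_results:
        return {}

    def accuracy(results):
        correct = sum(1 for r in results if r['prediction'] == r['ground_truth'])
        return correct / len(results)

    def avg_latency(results):
        return sum(r['latency_ms'] for r in results) / len(results)

    canary_acc, base_acc = accuracy(canary_results), accuracy(baseline_results)
    canary_lat, base_lat = avg_latency(canary_results), avg_latency(baseline_results)

    # Relative changes; assumes baseline accuracy and latency are nonzero.
    acc_change = (canary_acc - base_acc) / base_acc
    lat_change = (canary_lat - base_lat) / base_lat

    return {
        'canary_accuracy': round(canary_acc, 4),
        'baseline_accuracy': round(base_acc, 4),
        'accuracy_change_pct': round(acc_change * 100, 2),
        'canary_avg_latency': round(canary_lat, 2),
        'baseline_avg_latency': round(base_lat, 2),
        'latency_change_pct': round(lat_change * 100, 2),
        'promote_recommended': (acc_change >= -accuracy_tolerance
                                and lat_change <= latency_tolerance),
    }
```

Running this on Example 1 reproduces the expected output, including promote_recommended = True.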