A/B Test Statistical Analysis for Model Comparison

Hard
MLE Interview Prep

In production ML systems, A/B testing is the gold standard for comparing model versions. When rolling out a new model (treatment) against an existing model (control), you need rigorous statistical analysis to make data-driven decisions.

Given binary outcome data from both control and treatment groups, implement a comprehensive A/B test analyzer that computes:

  1. Success rates for both groups
  2. Absolute lift: The raw difference in success rates (treatment - control)
  3. Relative lift: The absolute lift as a percentage of the control rate, (treatment - control) / control * 100
  4. Z-statistic: The test statistic from a two-proportion z-test using pooled variance
  5. P-value: Two-tailed p-value for the hypothesis test
  6. Confidence interval: For the difference in proportions using unpooled standard error
  7. Statistical significance: Whether the p-value is below alpha, where alpha = 1 - confidence_level
  8. Practical significance: Whether absolute lift meets minimum detectable effect threshold
  9. Required sample size: Minimum samples needed per group for 80% power
  10. Recommendation: One of 'launch_treatment', 'keep_control', or 'continue_testing'
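
The statistical quantities in the list above can be sketched as follows. This is a minimal illustration, not a reference solution: it uses only the standard library (`math` and `statistics.NormalDist`), the helper names `two_proportion_summary` and `required_sample_size` are hypothetical, and the `ci_lower` / `ci_upper` keys are assumed names. The problem statement does not say which sample-size formula or effect size to use for item 9, so the sketch assumes the common normal-approximation formula for two proportions and leaves the effect size as a parameter.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def two_proportion_summary(control, treatment, confidence_level=0.95):
    """Items 1-6: rates, lifts, pooled z-test, unpooled confidence interval."""
    n_c, n_t = len(control), len(treatment)
    p_c, p_t = sum(control) / n_c, sum(treatment) / n_t
    diff = p_t - p_c

    # Pooled standard error: under H0 both groups share one success rate.
    p_pool = (sum(control) + sum(treatment)) / (n_c + n_t)
    se_pooled = math.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = diff / se_pooled if se_pooled > 0 else 0.0
    p_value = 2 * (1 - _STD_NORMAL.cdf(abs(z)))  # two-tailed

    # Unpooled standard error: each group keeps its own variance for the CI.
    se_unpooled = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z_crit = _STD_NORMAL.inv_cdf(1 - (1 - confidence_level) / 2)

    return {
        "control_rate": p_c,
        "treatment_rate": p_t,
        "absolute_lift": diff,
        # Assumption: relative lift is undefined (None) when control rate is 0.
        "relative_lift_pct": 100 * diff / p_c if p_c > 0 else None,
        "z_statistic": z,
        "p_value": p_value,
        "ci_lower": diff - z_crit * se_unpooled,  # assumed key name
        "ci_upper": diff + z_crit * se_unpooled,  # assumed key name
    }


def required_sample_size(baseline_rate, effect, confidence_level=0.95, power=0.80):
    """Item 9 (assumed formula): per-group n to detect `effect` at given power."""
    if effect == 0:
        raise ValueError("effect must be non-zero")
    z_alpha = _STD_NORMAL.inv_cdf(1 - (1 - confidence_level) / 2)
    z_beta = _STD_NORMAL.inv_cdf(power)
    p1 = baseline_rate
    p2 = min(max(baseline_rate + effect, 0.0), 1.0)  # keep the rate in [0, 1]
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)
```

The required sample size depends strongly on the effect plugged in: because it scales with 1 / effect², a smaller target effect demands far more samples per group than a larger one.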

The recommendation logic should be:

  • 'launch_treatment': statistically significant AND practically significant AND treatment is better
  • 'keep_control': statistically significant AND (treatment is worse OR effect is not practically significant)
  • 'continue_testing': not statistically significant OR current sample size is insufficient
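
The three rules above overlap when a result is statistically significant but the sample is small, and the problem statement does not say which rule wins. The sketch below is one possible reading, with the assumptions labeled: the two significant branches are checked first and 'continue_testing' is the fallback, so the sample-size clause never changes the outcome and is omitted. Practical significance is assumed to compare the magnitude of the lift against `min_detectable_effect`. The function name `recommend` is hypothetical.

```python
def recommend(p_value, absolute_lift, confidence_level=0.95,
              min_detectable_effect=0.02):
    """Map test results to a recommendation (assumed precedence, see above)."""
    alpha = 1 - confidence_level
    statistically_significant = p_value < alpha
    # Assumption: practical significance is judged on the lift's magnitude.
    practically_significant = abs(absolute_lift) >= min_detectable_effect
    treatment_better = absolute_lift > 0

    if not statistically_significant:
        return "continue_testing"
    if practically_significant and treatment_better:
        return "launch_treatment"
    # Significant, but treatment is worse or the effect is too small to matter.
    return "keep_control"
```

A stricter reading that checks the sample size before anything else would return 'continue_testing' in more cases, so the precedence should be confirmed against the expected outputs before relying on this ordering.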

If either input list is empty, return an empty dictionary.

Write a function analyze_ab_test(control_outcomes, treatment_outcomes, confidence_level, min_detectable_effect) that performs this analysis.

Examples

Example 1:
Input: control_outcomes = [1,1,1,0,0,0,1,0,1,0]*50, treatment_outcomes = [1,1,1,1,0,0,1,0,1,0]*50, confidence_level = 0.95, min_detectable_effect = 0.02
Output: {'control_rate': 0.5, 'treatment_rate': 0.6, 'absolute_lift': 0.1, 'relative_lift_pct': 20.0, 'z_statistic': 3.1782, 'p_value': 0.0015, ..., 'recommendation': 'launch_treatment'}
Explanation: The control group has 250/500 successes (50% rate) and the treatment group has 300/500 successes (60% rate), so the absolute lift is 0.1 (10 percentage points) and the relative lift is 0.1/0.5 = 20%. For the two-proportion z-test with pooled variance, the pooled rate is 550/1000 = 0.55, giving z = (0.6-0.5)/sqrt(0.55*0.45*(1/500+1/500)) = 3.18. The two-tailed p-value is 0.0015 < 0.05, so the result is statistically significant. Since the lift (0.1) exceeds min_detectable_effect (0.02) and treatment is better, the recommendation is 'launch_treatment'.

Starter Code

import numpy as np

def analyze_ab_test(control_outcomes: list, treatment_outcomes: list, confidence_level: float = 0.95, min_detectable_effect: float = 0.02) -> dict:
    """
    Analyze A/B test results for model comparison with statistical rigor.
    
    Args:
        control_outcomes: List of binary outcomes (0 or 1) for control group
        treatment_outcomes: List of binary outcomes (0 or 1) for treatment group
        confidence_level: Confidence level for statistical tests (default 0.95)
        min_detectable_effect: Minimum absolute effect size considered practically significant
    
    Returns:
        dict with statistical analysis results and recommendation
    """
    pass