In production ML systems, A/B testing is the gold standard for comparing model versions. When rolling out a new model (treatment) against an existing model (control), you need rigorous statistical analysis to make data-driven decisions.
Given binary outcome data from both control and treatment groups, implement a comprehensive A/B test analyzer that computes:
- Success rates for both groups
- Absolute lift: The raw difference in success rates (treatment - control)
- Relative lift: The percentage improvement over control
- Z-statistic: The test statistic from a two-proportion z-test using pooled variance
- P-value: Two-tailed p-value for the hypothesis test
- Confidence interval: For the difference in proportions using unpooled standard error
- Statistical significance: Whether p-value is below alpha (1 - confidence_level)
- Practical significance: Whether absolute lift meets minimum detectable effect threshold
- Required sample size: Minimum samples needed per group for 80% power
- Recommendation: One of 'launch_treatment', 'keep_control', or 'continue_testing'
The recommendation logic should be:
- 'launch_treatment': statistically significant AND practically significant AND treatment is better
- 'keep_control': statistically significant AND (treatment is worse OR effect is not practically significant)
- 'continue_testing': not statistically significant OR current sample size is insufficient
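Those three rules can be sketched directly as boolean logic (a minimal sketch; the function name and the separate n_current/n_required arguments are illustrative, not part of the required signature):

```python
def recommend(stat_sig: bool, prac_sig: bool, lift: float,
              n_current: int, n_required: int) -> str:
    """Map test outcomes to one of the three recommendations."""
    # Not significant, or too few samples per group: keep collecting data
    if not stat_sig or n_current < n_required:
        return 'continue_testing'
    # Significant, practically meaningful, and treatment wins
    if prac_sig and lift > 0:
        return 'launch_treatment'
    # Significant, but treatment is worse or the effect is too small to matter
    return 'keep_control'
```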
If either input list is empty, return an empty dictionary.
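The z-statistic, p-value, and confidence-interval pieces described above can be sketched with the standard library's NormalDist (a minimal sketch; the helper name two_proportion_test is illustrative):

```python
from statistics import NormalDist

def two_proportion_test(successes_c, n_c, successes_t, n_t, confidence_level=0.95):
    """z-statistic (pooled variance), two-tailed p-value, and a confidence
    interval for the difference in proportions (unpooled standard error)."""
    nd = NormalDist()  # standard normal
    p_c, p_t = successes_c / n_c, successes_t / n_t
    # Pooled proportion gives the null-hypothesis variance for the z-test
    p_pool = (successes_c + successes_t) / (n_c + n_t)
    se_pooled = (p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t)) ** 0.5
    z = (p_t - p_c) / se_pooled
    p_value = 2.0 * (1.0 - nd.cdf(abs(z)))
    # Unpooled SE for the interval around the observed difference
    se_unpooled = (p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t) ** 0.5
    z_crit = nd.inv_cdf(1 - (1 - confidence_level) / 2)
    ci = ((p_t - p_c) - z_crit * se_unpooled, (p_t - p_c) + z_crit * se_unpooled)
    return z, p_value, ci
```

Note the asymmetry: the test statistic uses the pooled variance (valid under the null hypothesis of equal rates), while the interval uses the unpooled standard error, since it describes the difference without assuming the null.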
Write a function analyze_ab_test(control_outcomes, treatment_outcomes, confidence_level, min_detectable_effect) that performs this analysis.
Examples
Example 1:
Input:
control_outcomes = [1,1,1,0,0,0,1,0,1,0]*50, treatment_outcomes = [1,1,1,1,0,0,1,0,1,0]*50, confidence_level = 0.95, min_detectable_effect = 0.02
Output:
{'control_rate': 0.5, 'treatment_rate': 0.6, 'absolute_lift': 0.1, 'relative_lift_pct': 20.0, 'z_statistic': 3.1782, 'p_value': 0.0015, ..., 'recommendation': 'launch_treatment'}
Explanation: The control group has 250/500 successes (a 50% rate); the treatment group has 300/500 (60%). The absolute lift is 0.1 (10 percentage points) and the relative lift is 20%. Using the two-proportion z-test with pooled variance (pooled rate 0.55), z = (0.6 - 0.5)/sqrt(0.55*0.45*(1/500 + 1/500)) ≈ 3.18. The two-tailed p-value is about 0.0015 < 0.05, so the result is statistically significant. Since the lift (0.1) exceeds min_detectable_effect (0.02) and treatment is better, the recommendation is 'launch_treatment'.
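For the required-sample-size output, a common approximation for a two-proportion test is n ≥ (z_alpha/2 + z_beta)^2 · 2·p̄(1 − p̄) / δ^2, where p̄ is the average of the two rates and δ the effect to detect. A sketch (the pooled-mean variance form is an assumption, since the problem statement does not pin down the exact formula):

```python
import math
from statistics import NormalDist

def required_sample_size(p_c, p_t, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-proportion z-test (pooled-mean variance)."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)  # two-sided critical value (1.96 at alpha=0.05)
    z_beta = nd.inv_cdf(power)           # power quantile (0.8416 at 80% power)
    p_bar = (p_c + p_t) / 2              # average of the two rates
    delta = abs(p_t - p_c)               # effect size to detect
    n = ((z_alpha + z_beta) ** 2 * 2 * p_bar * (1 - p_bar)) / delta ** 2
    return math.ceil(n)
```

For the worked example's rates (0.5 vs 0.6 at 95% confidence, 80% power) this gives roughly 389 per group, so the 500 observations per arm are sufficient.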
Starter Code
import numpy as np
def analyze_ab_test(control_outcomes: list, treatment_outcomes: list, confidence_level: float = 0.95, min_detectable_effect: float = 0.02) -> dict:
"""
Analyze A/B test results for model comparison with statistical rigor.
Args:
control_outcomes: List of binary outcomes (0 or 1) for control group
treatment_outcomes: List of binary outcomes (0 or 1) for treatment group
confidence_level: Confidence level for statistical tests (default 0.95)
min_detectable_effect: Minimum absolute effect size considered practically significant
Returns:
dict with statistical analysis results and recommendation
"""
passPython3