In production ML systems, monitoring model inference performance is essential for maintaining service quality. Given a list of inference latency measurements (in milliseconds), compute key statistics that are commonly used in MLOps dashboards:
- Throughput: The number of requests that can be processed per second (assuming single-threaded sequential processing)
- Average Latency: The mean latency across all measurements
- Percentiles (p50, p95, p99): The latency values below which 50%, 95%, and 99% of requests fall
Write a function calculate_inference_stats(latencies_ms) that takes a list of latency measurements and returns a dictionary with the computed statistics. Use linear interpolation for percentile calculations.
If the input list is empty, return an empty dictionary.
Examples
Example 1:
Input:
latencies_ms = [10, 20, 30, 40, 50]
Output:
{'throughput_per_sec': 33.33, 'avg_latency_ms': 30.0, 'p50_ms': 30.0, 'p95_ms': 48.0, 'p99_ms': 49.6}
Explanation: With 5 latency measurements [10, 20, 30, 40, 50], the average latency is 30 ms. Throughput is calculated as 1000 / 30 = 33.33 requests/sec. The p50 (median) is 30 ms. For p95, the fractional index is 0.95 * 4 = 3.8; interpolating between positions 3 and 4 gives 40 + 0.8 * (50 - 40) = 48 ms. Similarly, p99 uses index 0.99 * 4 = 3.96, giving 49.6 ms.
Starter Code
def calculate_inference_stats(latencies_ms: list) -> dict:
    """
    Calculate inference statistics for model monitoring.

    Args:
        latencies_ms: list of latency measurements in milliseconds

    Returns:
        dict with keys: 'throughput_per_sec', 'avg_latency_ms',
        'p50_ms', 'p95_ms', 'p99_ms'.
        All values rounded to 2 decimal places.
    """
    pass
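One way to fill in the starter signature is sketched below. It is not the only valid solution; it simply follows the linear-interpolation rule from the example, computing the fractional index p/100 * (n - 1) over the sorted data and interpolating between the two neighboring ranks.

```python
def calculate_inference_stats(latencies_ms: list) -> dict:
    """Sketch implementation using linear interpolation for percentiles."""
    if not latencies_ms:
        return {}

    data = sorted(latencies_ms)
    n = len(data)
    avg = sum(data) / n

    def percentile(p: float) -> float:
        # Fractional index into the sorted data, e.g. p95 of 5 items -> 3.8.
        idx = (p / 100) * (n - 1)
        lo = int(idx)
        hi = min(lo + 1, n - 1)
        frac = idx - lo
        # Interpolate between the two nearest ranks.
        return data[lo] + frac * (data[hi] - data[lo])

    return {
        # Single-threaded sequential processing: 1000 ms / avg latency.
        'throughput_per_sec': round(1000 / avg, 2),
        'avg_latency_ms': round(avg, 2),
        'p50_ms': round(percentile(50), 2),
        'p95_ms': round(percentile(95), 2),
        'p99_ms': round(percentile(99), 2),
    }
```

On the example input [10, 20, 30, 40, 50] this reproduces the expected dictionary, including p95 = 48.0 and p99 = 49.6.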