In production ML systems, monitoring model inference performance is essential for maintaining service quality. Given a list of inference latency measurements (in milliseconds), compute key statistics that are commonly used in MLOps dashboards:
- Throughput: The number of requests that can be processed per second (assuming single-threaded sequential processing)
- Average Latency: The mean latency across all measurements
- Percentiles (p50, p95, p99): The latency values below which 50%, 95%, and 99% of requests fall
Write a function calculate_inference_stats(latencies_ms) that takes a list of latency measurements and returns a dictionary with the computed statistics. Use linear interpolation for percentile calculations.
If the input list is empty, return an empty dictionary.
Examples
Example 1:
Input:
latencies_ms = [10, 20, 30, 40, 50]
Output:
{'throughput_per_sec': 33.33, 'avg_latency_ms': 30.0, 'p50_ms': 30.0, 'p95_ms': 48.0, 'p99_ms': 49.6}
Explanation: With 5 latency measurements [10, 20, 30, 40, 50], the average latency is 30 ms. Throughput is calculated as 1000 / 30 = 33.33 requests/sec. The p50 (median) is 30 ms. For p95, the fractional index is 0.95 * 4 = 3.8; interpolating between positions 3 and 4 gives 40 + 0.8 * (50 - 40) = 48 ms. Similarly, p99 uses index 0.99 * 4 = 3.96, giving 49.6 ms.
Starter Code
def calculate_inference_stats(latencies_ms: list) -> dict:
    """
    Calculate inference statistics for model monitoring.

    Args:
        latencies_ms: list of latency measurements in milliseconds

    Returns:
        dict with keys: 'throughput_per_sec', 'avg_latency_ms',
        'p50_ms', 'p95_ms', 'p99_ms'.
        All values rounded to 2 decimal places.
    """
    pass
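One way to fill in the starter signature is sketched below. It is not the only valid solution; it simply follows the linear-interpolation rule from the example, computing the fractional index p/100 * (n - 1) over the sorted data and interpolating between the two neighboring ranks.

```python
def calculate_inference_stats(latencies_ms: list) -> dict:
    """Sketch implementation using linear interpolation for percentiles."""
    if not latencies_ms:
        return {}

    data = sorted(latencies_ms)
    n = len(data)
    avg = sum(data) / n

    def percentile(p: float) -> float:
        # Fractional index into the sorted data, e.g. p95 of 5 items -> 3.8.
        idx = (p / 100) * (n - 1)
        lo = int(idx)
        hi = min(lo + 1, n - 1)
        frac = idx - lo
        # Interpolate between the two nearest ranks.
        return data[lo] + frac * (data[hi] - data[lo])

    return {
        # Single-threaded sequential processing: 1000 ms / avg latency.
        'throughput_per_sec': round(1000 / avg, 2),
        'avg_latency_ms': round(avg, 2),
        'p50_ms': round(percentile(50), 2),
        'p95_ms': round(percentile(95), 2),
        'p99_ms': round(percentile(99), 2),
    }
```

On the example input [10, 20, 30, 40, 50] this reproduces the expected dictionary, including p95 = 48.0 and p99 = 49.6.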