Data Quality Scoring for ML Pipelines

Difficulty: Medium
Topic: MLOps

In production ML systems, data quality is critical for model performance. Poor-quality data can lead to model degradation, biased predictions, and system failures. Your task is to implement a data quality scoring function that evaluates incoming data against a defined schema.

Given a list of data records (dictionaries) and a schema definition, compute the following quality metrics:

  1. Completeness: Percentage of non-null values across all expected fields
  2. Type Validity: Percentage of values that match their expected data types (including null handling based on nullable flag)
  3. Uniqueness Ratio: Percentage of unique records in the dataset
  4. Overall Score: Weighted combination of metrics (40% completeness, 40% type validity, 20% uniqueness)

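The weighted combination in metric 4 can be sketched as a small helper (the function name `overall_score` is illustrative, not part of the required API; note that the weights are applied to the unrounded percentages before the final rounding):

```python
def overall_score(completeness: float, type_validity: float, uniqueness: float) -> float:
    # Weights from the problem statement: 40% / 40% / 20%.
    return round(0.4 * completeness + 0.4 * type_validity + 0.2 * uniqueness, 2)

# With the unrounded percentages from Example 1 below (10/12 = 83.333...%):
print(overall_score(1000 / 12, 1000 / 12, 100.0))  # -> 86.67
```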
The schema is a dictionary where each key is a column name and the value is a specification with:

  • 'type': One of 'numeric', 'categorical', or 'boolean'
  • 'nullable': Boolean indicating if null values are acceptable

For type validity:

  • Numeric type accepts int and float (but not boolean)
  • Categorical type accepts strings
  • Boolean type accepts True/False only
  • If a value is None and the field is nullable, it counts as type-valid
  • If a value is None and the field is not nullable, it counts as type-invalid

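The type rules above can be expressed as a single predicate. This is a minimal sketch; `is_valid_value` is an illustrative helper name, and the one Python-specific subtlety is that `bool` is a subclass of `int`, so booleans must be excluded explicitly from the numeric check:

```python
def is_valid_value(value, spec: dict) -> bool:
    # None is type-valid only when the field is marked nullable.
    if value is None:
        return spec['nullable']
    if spec['type'] == 'numeric':
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if spec['type'] == 'categorical':
        return isinstance(value, str)
    return isinstance(value, bool)  # 'boolean'
```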
Write a function calculate_data_quality_score(data, schema) that returns a dictionary with all four metrics. Return an empty dictionary if the input data is empty. All returned values should be rounded to 2 decimal places; compute the overall score from the unrounded percentages, then round the result.

Examples

Example 1:
Input: data = [{'age': 25, 'name': 'Alice', 'active': True}, {'age': 'thirty', 'name': 'Bob', 'active': False}, {'age': None, 'name': None, 'active': True}, {'age': 40, 'name': 'Dave', 'active': 'yes'}], schema = {'age': {'type': 'numeric', 'nullable': True}, 'name': {'type': 'categorical', 'nullable': True}, 'active': {'type': 'boolean', 'nullable': False}}
Output: {'completeness': 83.33, 'type_validity': 83.33, 'uniqueness_ratio': 100.0, 'overall_score': 86.67}
Explanation: Total fields = 4 rows x 3 columns = 12. Non-null values = 10 (row 3 has 2 nulls), so completeness = 10/12 = 83.33%. For type validity: row 1 has 3 valid values; row 2 has 2 (age is a string, not numeric); row 3 has 3 (both nulls are in nullable fields); row 4 has 2 (active is a string, not a boolean). Type validity = 10/12 = 83.33%. All 4 rows are unique, so uniqueness = 100%. Overall = 0.4*(10/12) + 0.4*(10/12) + 0.2*1.0 = 86.67% (the weighted sum uses the unrounded percentages; rounding 83.33 first would give 86.66 instead).

Starter Code

def calculate_data_quality_score(data: list, schema: dict) -> dict:
    """
    Calculate data quality metrics for ML pipeline monitoring.
    
    Args:
        data: list of dictionaries representing rows of data
        schema: dictionary defining expected columns and their types
                {'column_name': {'type': 'numeric'|'categorical'|'boolean', 'nullable': True|False}}
    
    Returns:
        dict with keys: 'completeness', 'type_validity', 'uniqueness_ratio', 'overall_score'
        All values as percentages (0-100), rounded to 2 decimal places.
    """
    pass
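One possible reference solution is sketched below. It counts non-null and type-valid cells in a single pass, and compares rows for uniqueness via their sorted key-value pairs so that dictionary key order does not affect the result (a design choice; the problem statement does not specify how records are compared):

```python
def calculate_data_quality_score(data: list, schema: dict) -> dict:
    if not data:
        return {}

    total_fields = len(data) * len(schema)
    non_null = 0
    type_valid = 0

    for row in data:
        for col, spec in schema.items():
            value = row.get(col)
            if value is None:
                # Nulls are type-valid only for nullable fields.
                if spec['nullable']:
                    type_valid += 1
                continue
            non_null += 1
            if spec['type'] == 'numeric':
                # bool is a subclass of int, so exclude it explicitly.
                ok = isinstance(value, (int, float)) and not isinstance(value, bool)
            elif spec['type'] == 'categorical':
                ok = isinstance(value, str)
            else:  # 'boolean'
                ok = isinstance(value, bool)
            if ok:
                type_valid += 1

    completeness = non_null / total_fields * 100
    type_validity = type_valid / total_fields * 100

    # Hash rows by their sorted items so dict ordering does not matter.
    unique_rows = {tuple(sorted(row.items())) for row in data}
    uniqueness = len(unique_rows) / len(data) * 100

    # Weight the unrounded percentages, then round once at the end.
    overall = 0.4 * completeness + 0.4 * type_validity + 0.2 * uniqueness

    return {
        'completeness': round(completeness, 2),
        'type_validity': round(type_validity, 2),
        'uniqueness_ratio': round(uniqueness, 2),
        'overall_score': round(overall, 2),
    }
```

On the Example 1 input this returns {'completeness': 83.33, 'type_validity': 83.33, 'uniqueness_ratio': 100.0, 'overall_score': 86.67}, matching the expected output.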