In production ML systems, data quality is critical for model performance. Poor quality data can lead to model degradation, biased predictions, and system failures. You need to implement a data quality scoring function that evaluates incoming data against a defined schema.
Given a list of data records (dictionaries) and a schema definition, compute the following quality metrics:
- Completeness: Percentage of non-null values across all expected fields
- Type Validity: Percentage of values that match their expected data types (including null handling based on nullable flag)
- Uniqueness Ratio: Percentage of unique records in the dataset
- Overall Score: Weighted combination of metrics (40% completeness, 40% type validity, 20% uniqueness)
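The 40/40/20 weighting can be sketched as follows (a minimal illustrative helper, not part of the required API; note that the weights should be applied to the unrounded metric values, with rounding only at the end):

```python
def overall_score(completeness: float, type_validity: float, uniqueness: float) -> float:
    """Combine the three metrics with 40/40/20 weights.

    Round only the final result; rounding the inputs first can shift
    the last digit (e.g. 0.4*83.33 + 0.4*83.33 + 0.2*100 = 86.66,
    while the unrounded metrics give 86.67).
    """
    return round(0.4 * completeness + 0.4 * type_validity + 0.2 * uniqueness, 2)
```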
The schema is a dictionary where each key is a column name and the value is a specification with:
- 'type': One of 'numeric', 'categorical', or 'boolean'
- 'nullable': Boolean indicating if null values are acceptable
For type validity:
- Numeric type accepts int and float (but not boolean)
- Categorical type accepts strings
- Boolean type accepts True/False only
- If a value is None and the field is nullable, it counts as type-valid
- If a value is None and the field is not nullable, it counts as type-invalid
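The type-validity rules above can be sketched as a small helper (an illustrative sketch, not part of the required function signature). One Python-specific pitfall: `bool` is a subclass of `int`, so a plain `isinstance(value, (int, float))` check would wrongly accept `True` as numeric.

```python
def is_type_valid(value, spec: dict) -> bool:
    """Check one value against a column spec like {'type': 'numeric', 'nullable': True}."""
    if value is None:
        # Nulls are valid only when the field is declared nullable
        return spec['nullable']
    expected = spec['type']
    if expected == 'numeric':
        # Exclude bool explicitly: isinstance(True, int) is True in Python
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if expected == 'categorical':
        return isinstance(value, str)
    if expected == 'boolean':
        return isinstance(value, bool)
    return False
```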
Write a function calculate_data_quality_score(data, schema) that returns a dictionary with all four metrics. Return an empty dictionary if the input data is empty. All values should be rounded to 2 decimal places.
Examples
Example 1:
Input:
data = [{'age': 25, 'name': 'Alice', 'active': True}, {'age': 'thirty', 'name': 'Bob', 'active': False}, {'age': None, 'name': None, 'active': True}, {'age': 40, 'name': 'Dave', 'active': 'yes'}], schema = {'age': {'type': 'numeric', 'nullable': True}, 'name': {'type': 'categorical', 'nullable': True}, 'active': {'type': 'boolean', 'nullable': False}}
Output:
{'completeness': 83.33, 'type_validity': 83.33, 'uniqueness_ratio': 100.0, 'overall_score': 86.67}
Explanation: Total fields = 4 rows x 3 columns = 12. Non-null values = 10 (row 3 has 2 nulls), so completeness = 10/12 = 83.33%. For type validity: row 1 has 3 valid values, row 2 has 2 ('thirty' is a string, not numeric), row 3 has 3 (both nulls are in nullable fields), row 4 has 2 ('yes' is a string, not boolean). Type validity = 10/12 = 83.33%. All 4 rows are unique, so uniqueness = 100%. Overall = 0.4*(10/12*100) + 0.4*(10/12*100) + 0.2*100 = 86.67% (weights are applied to the unrounded metrics; only the final result is rounded).
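One wrinkle in the uniqueness ratio: records are dicts, which are not hashable, so they cannot go straight into a set. A hypothetical helper (an illustrative sketch, not part of the required API) might normalize each record to a sorted tuple of items first:

```python
def uniqueness_ratio(data: list) -> float:
    """Percentage of unique records in the dataset."""
    # Dicts aren't hashable; a sorted item tuple gives a canonical,
    # hashable form that is equal for equal records regardless of key order.
    seen = {tuple(sorted(record.items())) for record in data}
    return round(len(seen) / len(data) * 100, 2)
```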
Starter Code
def calculate_data_quality_score(data: list, schema: dict) -> dict:
    """
    Calculate data quality metrics for ML pipeline monitoring.

    Args:
        data: list of dictionaries representing rows of data
        schema: dictionary defining expected columns and their types
            {'column_name': {'type': 'numeric'|'categorical'|'boolean', 'nullable': True|False}}

    Returns:
        dict with keys: 'completeness', 'type_validity', 'uniqueness_ratio', 'overall_score'.
        All values as percentages (0-100), rounded to 2 decimal places.
        Empty dict if data is empty.
    """
    pass