Machine Learning Systems Design
By Chip Huyen

About the Author
Chip Huyen is a renowned machine learning engineer and educator, known for her work on ML systems design and production ML.
- Website: huyenchip.com
- Twitter: @chipro
1. Introduction
Research vs Production
Research and production machine learning differ significantly in their goals and constraints, so each calls for a different approach.
Performance Requirements
- Research: Focus on accuracy and novel approaches
- Production: Balance accuracy with latency, throughput, and cost
- Trade-offs between model complexity and deployment constraints
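The latency side of this trade-off is usually tracked with tail percentiles rather than averages, since a few slow requests can dominate user experience. The following is a minimal sketch, not a benchmark harness: `dummy_model` is a hypothetical stand-in for a real model call, and the nearest-rank percentile is only one of several common definitions.

```python
import math
import time

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of the data at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def measure_latency_ms(predict_fn, inputs):
    """Time each call to predict_fn and return per-request latencies in milliseconds."""
    latencies = []
    for x in inputs:
        start = time.perf_counter()
        predict_fn(x)
        latencies.append((time.perf_counter() - start) * 1000)
    return latencies

def dummy_model(x):
    # Hypothetical stand-in for a real model call
    return x * 2

latencies = measure_latency_ms(dummy_model, range(1000))
print(f"p50={percentile(latencies, 50):.4f} ms, p99={percentile(latencies, 99):.4f} ms")
```

Comparing p50 and p99 for two candidate models on the same inputs gives a concrete basis for the accuracy versus latency discussion.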
Compute Requirements
- Research: Often trains on powerful GPUs, with compute cost a secondary concern
- Production: Must optimize for cost-effective inference
- Considerations for edge deployment and mobile devices
2. Design a Machine Learning System
Project Setup
- Define clear objectives and success metrics
- Understand stakeholders and requirements
- Establish baseline performance
- Plan for iterative development
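As an illustration of establishing a baseline, the sketch below builds a majority-class predictor for a classification task. It is only one possible baseline, and the labels and helper names are made up for the example; any later model should beat this number to justify its complexity.

```python
from collections import Counter

def majority_class_baseline(train_labels):
    """Return a predictor that always outputs the most frequent training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    return lambda _features: most_common_label

def accuracy(predict, features, labels):
    """Fraction of examples where the prediction matches the label."""
    correct = sum(predict(x) == y for x, y in zip(features, labels))
    return correct / len(labels)

# Toy data (illustrative only): features are ignored by this baseline
train_labels = ["not_fraud", "not_fraud", "fraud", "not_fraud"]
test_features = [None, None, None, None]
test_labels = ["not_fraud", "fraud", "not_fraud", "not_fraud"]

baseline = majority_class_baseline(train_labels)
print("baseline accuracy:", accuracy(baseline, test_features, test_labels))
```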
Data Pipeline
The foundation of any ML system is robust data infrastructure:
Key Components:
- Data Collection
  - Source identification
  - Data quality assessment
  - Privacy and compliance considerations
- Data Storage
  - Database selection (SQL vs NoSQL)
  - Data lakes and warehouses
  - Version control for datasets
- Data Processing
  - ETL (Extract, Transform, Load) pipelines
  - Feature engineering
  - Data validation
- Data Versioning
  - Track data lineage
  - Reproduce experiments
  - Handle schema evolution
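The data validation step listed above can be as simple as checking each record against a declared schema before it enters the pipeline. The sketch below assumes a hypothetical schema format (field name mapped to an expected type and an optional value range); production systems typically use a dedicated validation library instead.

```python
def validate_record(record, schema):
    """Return a list of problems found in record; an empty list means it is valid.

    schema maps field name -> (expected_type, (low, high) or None).
    """
    errors = []
    for field, (expected_type, bounds) in schema.items():
        if field not in record or record[field] is None:
            errors.append(f"{field}: missing")
            continue
        value = record[field]
        if not isinstance(value, expected_type):
            errors.append(
                f"{field}: expected {expected_type.__name__}, got {type(value).__name__}"
            )
            continue
        if bounds is not None:
            low, high = bounds
            if not (low <= value <= high):
                errors.append(f"{field}: {value} outside [{low}, {high}]")
    return errors

# Hypothetical schema for illustration
SCHEMA = {"age": (int, (0, 120)), "country": (str, None)}

print(validate_record({"age": 30, "country": "VN"}, SCHEMA))
print(validate_record({"age": 150, "country": "VN"}, SCHEMA))
```

Returning a list of problems, rather than raising on the first one, makes it easier to log and aggregate data quality issues across a batch.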
Modeling
Model Selection
Choosing the right model involves:
- Problem type (classification, regression, etc.)
- Data characteristics
- Latency requirements
- Interpretability needs
- Resource constraints
Common Model Types:
# Example model selection logic
def select_model(problem_type, data_size, latency_requirement):
    if problem_type == "classification":
        if data_size < 10000:
            return "Logistic Regression"
        elif latency_requirement == "low":
            return "Random Forest"
        else:
            return "Neural Network"
    # ... more conditions
Training
Debugging
ML models can fail in subtle ways:
- Vanishing/Exploding Gradients: Use gradient clipping, batch normalization
- Overfitting: Apply regularization, dropout, data augmentation
- Underfitting: Increase model capacity, add features
- Data Leakage: Careful feature engineering, proper train/test split
Debugging Checklist:
- Verify data pipeline correctness
- Check for data leakage
- Start with a simple model
- Monitor training metrics
- Use validation sets effectively
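One common source of data leakage is computing preprocessing statistics on the full dataset before splitting. The sketch below shows the safer order under simple assumptions (a single numeric feature, a random split): split first, fit the standardizer on the training portion only, then reuse those statistics for the test portion. The helper names are illustrative.

```python
import random
import statistics

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it; the seed makes the split reproducible."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def fit_standardizer(train_values):
    """Compute mean and standard deviation from the training split only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values) or 1.0  # fall back to 1.0 for constant features
    return mean, std

def standardize(values, mean, std):
    """Apply previously fitted statistics to any split."""
    return [(v - mean) / std for v in values]

values = [float(v) for v in range(100)]
train, test = train_test_split(values)
mean, std = fit_standardizer(train)          # statistics come from train only
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)   # test reuses the train statistics
```

For time-dependent data, a random split can itself leak future information; splitting by time is usually the safer choice there.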
Hyperparameter Tuning
Systematic approach to finding optimal parameters:
Methods:
- Grid Search: Exhaustive but expensive
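- Random Search: Samples configurations at random; often more efficient than grid search when only a few hyperparameters matter
- Bayesian Optimization: Uses the results of past trials to choose the next configuration to try
- Hyperband: Multi-fidelity optimization that gives many configurations a small budget and stops poor ones early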
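# Example hyperparameter tuning
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neural_network import MLPClassifier

# MLPClassifier is an illustrative choice of estimator (an assumption);
# the parameter names below follow its API.
model = MLPClassifier(max_iter=200)
param_distributions = {
    'learning_rate_init': [0.001, 0.01, 0.1],
    'batch_size': [16, 32, 64],
    # 2, 3, or 4 hidden layers; the width of 64 units is an arbitrary choice
    'hidden_layer_sizes': [(64,) * 2, (64,) * 3, (64,) * 4],
}
search = RandomizedSearchCV(
    model,
    param_distributions,
    n_iter=20,  # sample 20 of the 27 possible combinations
    cv=5,       # 5-fold cross-validation
)
# search.fit(X_train, y_train)  # X_train, y_train: your training data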
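Hyperband, listed among the methods above, builds on a simpler routine called successive halving. The sketch below implements only that inner routine, not full Hyperband, and assumes a hypothetical `evaluate(config, budget)` function that returns a score where higher is better (for example, validation accuracy after `budget` epochs).

```python
def successive_halving(configs, evaluate, min_budget=1, eta=2):
    """Score all surviving configs, keep the best 1/eta, multiply the budget by eta, repeat.

    evaluate(config, budget) -> score, higher is better.
    """
    if not configs:
        raise ValueError("configs must be non-empty")
    survivors = list(configs)
    budget = min_budget
    while len(survivors) > 1:
        scored = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        survivors = scored[:max(1, len(scored) // eta)]
        budget *= eta
    return survivors[0]

def toy_evaluate(config, budget):
    # Hypothetical objective: pretends 0.01 is the ideal learning rate.
    # A real evaluate would train for `budget` epochs and return a validation score.
    return -abs(config["learning_rate"] - 0.01)

candidates = [{"learning_rate": lr} for lr in [0.001, 0.01, 0.1, 1.0]]
best = successive_halving(candidates, toy_evaluate)
print(best)
```

The design choice is to spend little compute on clearly poor configurations, at the risk of discarding ones that only improve with a larger budget.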
Scaling
As data grows, systems must scale:
Strategies:
- Data Parallelism: Distribute data across multiple workers
- Model Parallelism: Distribute model across devices
- Pipeline Parallelism: Split model into stages
- Distributed Training: Use frameworks like Horovod, PyTorch Distributed
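To make data parallelism concrete, the sketch below simulates one synchronous training loop in plain Python: each "worker" computes a gradient on its own shard, and the gradients are averaged before the shared weight is updated. The one-parameter linear model and the in-process loop are simplifying assumptions; real frameworks run the workers on separate devices and perform the averaging with an all-reduce.

```python
def gradient(w, shard):
    """Gradient of mean squared error for the model y = w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(w, shards, lr):
    """One synchronous step: local gradients per shard, then an average and an update."""
    local_grads = [gradient(w, shard) for shard in shards]  # parallel on real workers
    avg_grad = sum(local_grads) / len(local_grads)          # stands in for an all-reduce
    return w - lr * avg_grad

data = [(x, 3.0 * x) for x in [1.0, 2.0, 3.0, 4.0]]  # generated with true weight 3.0
shards = [data[:2], data[2:]]                         # two equally sized shards
w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards, lr=0.01)
print(round(w, 3))
```

With equally sized shards, the averaged gradient equals the full-batch gradient, so the result matches single-worker training on the whole dataset.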
Serving
Deploying models to production requires:
Considerations:
- Latency: Real-time vs batch predictions
- Throughput: Requests per second
- Cost: Infrastructure and compute costs
- Reliability: Uptime and error handling
Serving Options:
- REST API: Standard HTTP endpoints
- gRPC: Binary protocol over HTTP/2, typically with lower overhead than JSON over REST
- Batch Predictions: Offline processing
- Edge Deployment: On-device inference
# Example serving code
from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)
model = joblib.load('model.pkl')  # load once at startup, not per request

def extract_features(data):
    # Placeholder (assumption): expects a body like {"features": [...]}
    return data['features']

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    features = extract_features(data)
    prediction = model.predict([features])
    return jsonify({'prediction': prediction.tolist()})
3. Case Studies
Real-world examples demonstrating ML systems design:
Topics Covered:
- E-commerce recommendation systems
- Fraud detection at scale
- Real-time personalization
- Content moderation
- Predictive maintenance
4. Exercises
Practical exercises to reinforce learning:
Exercise Categories:
- System Design Questions
  - Design a recommendation system
  - Build a fraud detection pipeline
  - Create a real-time prediction service
- Implementation Tasks
  - Implement a feature store
  - Build a model monitoring system
  - Create a deployment pipeline
- Optimization Challenges
  - Reduce model latency
  - Improve throughput
  - Minimize infrastructure costs
Key Takeaways
Research vs Production
Research ML:
- Focus on novel algorithms
- Maximize accuracy
- Use clean, curated datasets
- Compute cost is often a secondary concern
Production ML:
- Focus on reliability and maintainability
- Balance accuracy with other metrics
- Deal with messy, real-world data
- Optimize for cost and latency
Critical Success Factors
- Data Quality
  - Garbage in, garbage out
  - Invest in data infrastructure
  - Monitor data drift
- Monitoring
  - Track model performance
  - Detect degradation early
  - Alert on anomalies
- Iteration
  - Start simple
  - Measure everything
  - Improve incrementally
- Collaboration
  - Bridge ML and engineering teams
  - Clear communication
  - Shared understanding of goals
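Monitoring data drift, mentioned under Data Quality above, needs a numeric measure of how far live data has moved from the training data. One widely used option is the Population Stability Index (PSI); the sketch below is a minimal version that assumes a single numeric feature and hand-picked bin edges. The `eps` floor for empty bins is an implementation choice, and any alert threshold should be tuned per feature.

```python
import math

def population_stability_index(expected, actual, bin_edges, eps=1e-6):
    """PSI = sum over bins of (a - e) * ln(a / e), where e and a are bin fractions."""
    def fractions(values):
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            idx = sum(v >= edge for edge in bin_edges)  # number of edges at or below v
            counts[idx] += 1
        # Floor at eps so empty bins do not cause log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    e_frac = fractions(expected)
    a_frac = fractions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(e_frac, a_frac))

training_values = list(range(100))               # reference distribution
live_values = [v + 50 for v in training_values]  # shifted distribution (illustrative)
edges = [25, 50, 75]

print(population_stability_index(training_values, training_values, edges))
print(population_stability_index(training_values, live_values, edges))
```

A PSI of zero means the binned distributions match; larger values indicate a larger shift.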
Best Practices
Development
- Use version control for everything (code, data, models)
- Automate repetitive tasks
- Write tests for data and models
- Document decisions and trade-offs
Deployment
- Start with simple baselines
- A/B test new models
- Gradual rollouts
- Easy rollback procedures
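Gradual rollouts and A/B tests both need a stable way to decide which users see the new model. A minimal sketch, assuming string user IDs and a hypothetical salt name: hash the user ID into one of 100 buckets and send buckets below the rollout percentage to the candidate model.

```python
import hashlib

def assign_variant(user_id, rollout_percent, salt="model-v2-rollout"):
    """Deterministically route a user to 'candidate' or 'control'."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    return "candidate" if bucket < rollout_percent else "control"

# Raising the percentage only adds users to the candidate group;
# setting it back to 0 acts as a rollback.
for percent in (0, 10, 50, 100):
    share = sum(
        assign_variant(f"user-{i}", percent) == "candidate" for i in range(10000)
    ) / 10000
    print(f"rollout={percent}% -> observed candidate share={share:.3f}")
```

Because assignment depends only on the salt and the user ID, a user keeps the same variant across requests, and changing the salt reshuffles users for a new experiment.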
Monitoring
- Business metrics first
- Model-specific metrics
- System health metrics
- User feedback loops
Tools and Technologies
Data Processing
- Apache Spark
- Apache Beam
- Dask
- Pandas
Model Training
- TensorFlow
- PyTorch
- Scikit-learn
- XGBoost
Model Serving
- TensorFlow Serving
- TorchServe
- FastAPI
- BentoML
MLOps
- MLflow
- Kubeflow
- Airflow
- DVC
Conclusion
Designing ML systems for production requires a different mindset than research. Success depends on building robust, scalable systems that deliver value to users while being maintainable and cost-effective.
The journey from research to production involves:
- Understanding requirements
- Building solid data foundations
- Selecting appropriate models
- Iterating based on feedback
- Monitoring and maintaining systems
Resources
This guide provides a structured approach to machine learning systems design, bridging the gap between research and production.