Machine Learning Systems Design - Table of Contents

Machine Learning Systems Design

Chip Huyen
2025

About the Author

Chip Huyen is a renowned machine learning engineer and educator, known for her work on ML systems design and production ML.

Table of Contents

1. Introduction

Research vs Production

Research and production machine learning optimize for different things, so a model that performs well in a research setting usually needs a different approach before it can be deployed.

Performance Requirements
  • Research: Focus on accuracy and novel approaches
  • Production: Balance accuracy with latency, throughput, and cost
  • Trade-offs between model complexity and deployment constraints

Compute Requirements
  • Research: Often trains on powerful GPUs, with cost a secondary concern
  • Production: Must optimize for cost-effective inference
  • Considerations for edge deployment and mobile devices
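
To make the production-side metrics above concrete, here is a minimal sketch of measuring latency percentiles and throughput for a prediction function. `toy_model` is a hypothetical stand-in for a real model, and the numbers it produces depend entirely on the machine running it.

```python
import statistics
import time


def measure_latency(predict_fn, inputs):
    """Time predict_fn on each input; return per-call latencies in milliseconds."""
    latencies_ms = []
    for x in inputs:
        start = time.perf_counter()
        predict_fn(x)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms


def summarize(latencies_ms):
    """Summarize latencies as median, 99th percentile, and sequential throughput."""
    # quantiles(n=100) returns 99 cut points; index 98 is the 99th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100)
    total_seconds = sum(latencies_ms) / 1000.0
    return {
        "p50_ms": statistics.median(latencies_ms),
        "p99_ms": cuts[98],
        "throughput_rps": len(latencies_ms) / total_seconds,
    }


def toy_model(x):
    # Hypothetical stand-in for a real model's predict call.
    return sum(i * x for i in range(1000))


report = summarize(measure_latency(toy_model, range(200)))
```

Tail latency (p99) is reported alongside the median because a production system is often judged by its slowest requests, not its typical one.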

2. Design a Machine Learning System

Project Setup

  • Define clear objectives and success metrics
  • Understand stakeholders and requirements
  • Establish baseline performance
  • Plan for iterative development
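
One simple way to establish a baseline for a classification task is a predictor that always returns the most common training label. The sketch below uses made-up labels; any real model should beat this number before it is worth deploying.

```python
from collections import Counter


def majority_class_baseline(train_labels):
    """Return a predictor that always outputs the most common training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    return lambda _features: most_common_label


def accuracy(predict_fn, features, labels):
    """Fraction of examples where predict_fn matches the label."""
    correct = sum(predict_fn(x) == y for x, y in zip(features, labels))
    return correct / len(labels)


# Toy, made-up labels: 1 = positive class, 0 = negative class.
train_labels = [0, 0, 0, 1, 0, 1, 0, 0]
test_features = [[0.2], [0.9], [0.4], [0.7]]
test_labels = [0, 1, 0, 0]

baseline = majority_class_baseline(train_labels)
baseline_accuracy = accuracy(baseline, test_features, test_labels)
print(f"baseline accuracy: {baseline_accuracy:.2f}")  # prints "baseline accuracy: 0.75"
```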

Data Pipeline

The foundation of any ML system is robust data infrastructure:

Key Components:

  1. Data Collection

    • Source identification
    • Data quality assessment
    • Privacy and compliance considerations
  2. Data Storage

    • Database selection (SQL vs NoSQL)
    • Data lakes and warehouses
    • Version control for datasets
  3. Data Processing

    • ETL (Extract, Transform, Load) pipelines
    • Feature engineering
    • Data validation
  4. Data Versioning

    • Track data lineage
    • Reproduce experiments
    • Handle schema evolution
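
The data validation and versioning steps above can be sketched as follows. The schema, field names, and ranges are hypothetical, and a content hash is only one lightweight way to identify a dataset version; dedicated tools track far more metadata.

```python
import hashlib
import json

# Hypothetical schema: field name -> (expected type, allowed range or None).
SCHEMA = {
    "user_id": (int, None),
    "age": (int, (0, 120)),
    "purchase_amount": (float, (0.0, 1e6)),
}


def validate_record(record, schema=SCHEMA):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for field, (expected_type, bounds) in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
            continue
        value = record[field]
        if not isinstance(value, expected_type):
            problems.append(f"{field}: expected {expected_type.__name__}")
            continue
        if bounds is not None and not (bounds[0] <= value <= bounds[1]):
            problems.append(f"{field}: {value} outside {bounds}")
    return problems


def dataset_fingerprint(records):
    """Content hash of a dataset, usable as a lightweight version identifier."""
    digest = hashlib.sha256()
    for record in records:
        # sort_keys makes the hash independent of dict key order.
        digest.update(json.dumps(record, sort_keys=True).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


records = [
    {"user_id": 1, "age": 34, "purchase_amount": 19.99},
    {"user_id": 2, "age": 150, "purchase_amount": 5.0},
]
invalid = {r["user_id"]: validate_record(r) for r in records}
version = dataset_fingerprint(records)
```

Storing `version` next to each trained model gives a cheap way to tell which data an experiment was run on.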

Modeling

Model Selection

Choosing the right model involves weighing:

  • Problem type (classification, regression, etc.)
  • Data characteristics
  • Latency requirements
  • Interpretability needs
  • Resource constraints

Common Model Types:

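# Example model selection logic (thresholds are illustrative, not fixed rules)
def select_model(problem_type, data_size, latency_requirement):
    """Return the name of a suggested model family for the given constraints."""
    if problem_type == "classification":
        if data_size < 10000:
            # Small dataset: prefer a simple model
            return "Logistic Regression"
        elif latency_requirement == "low":
            # "low" is read here as a tight latency budget
            return "Random Forest"
        else:
            return "Neural Network"
    # ... more conditions
    # Fail loudly instead of silently returning None for unhandled cases
    raise ValueError(f"No rule defined for problem type: {problem_type!r}")

Training

Debugging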

ML models can fail in subtle ways:

  • Vanishing/Exploding Gradients: Use gradient clipping (for exploding gradients), batch normalization, and careful weight initialization
  • Overfitting: Apply regularization, dropout, data augmentation
  • Underfitting: Increase model capacity, add features
  • Data Leakage: Careful feature engineering, proper train/test split
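
As an illustration of the gradient clipping remedy listed above, here is a minimal framework-free sketch of clipping by global norm. Deep learning frameworks ship their own versions of this; the function name and the flat list of gradients are simplifications for the example.

```python
import math


def clip_by_global_norm(gradients, max_norm):
    """Scale all gradients by one factor so their joint L2 norm is at most max_norm."""
    global_norm = math.sqrt(sum(g * g for g in gradients))
    if global_norm <= max_norm:
        # Already within bounds: leave the gradients unchanged.
        return list(gradients), global_norm
    scale = max_norm / global_norm
    return [g * scale for g in gradients], global_norm


# A gradient of norm 5 is rescaled to norm 1; its direction is preserved.
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
```

Scaling every component by the same factor keeps the update direction intact, which is why this form is usually preferred over clipping each value independently.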

Debugging Checklist:

  1. Verify data pipeline correctness
  2. Check for data leakage
  3. Start with a simple model
  4. Monitor training metrics
  5. Use validation sets effectively
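
Two of the checklist items, checking for leakage and using held-out data properly, can be partly automated. The sketch below assumes rows are tuples with a timestamp column and only catches two common problems: test rows duplicated in the training set, and training on data from after the evaluation period. Other forms of leakage (for example, features derived from the label) need manual review.

```python
def find_overlap(train_rows, test_rows):
    """Return test rows that also appear verbatim in the training set."""
    train_set = {tuple(row) for row in train_rows}
    return [row for row in test_rows if tuple(row) in train_set]


def split_by_time(rows, timestamp_index, cutoff):
    """Train on rows before the cutoff and test on rows at or after it."""
    train = [r for r in rows if r[timestamp_index] < cutoff]
    test = [r for r in rows if r[timestamp_index] >= cutoff]
    return train, test


# Made-up rows of (timestamp, feature, label).
rows = [(1, 0.5, 0), (2, 0.7, 1), (3, 0.1, 0), (4, 0.9, 1)]
train, test = split_by_time(rows, timestamp_index=0, cutoff=3)
leaked = find_overlap(train, test)  # empty list when the split is clean
```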

Hyperparameter Tuning

Hyperparameter tuning is a systematic search for the settings that give the best validation performance:

Methods:

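  • Grid Search: Exhaustive, but cost grows quickly with the number of parameters
  • Random Search: Usually finds good settings with fewer trials than grid search
  • Bayesian Optimization: Uses results of past trials to choose the next configuration
  • Hyperband: Multi-fidelity optimization that stops unpromising trials early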
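
# Example hyperparameter tuning
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neural_network import MLPClassifier

# MLPClassifier is used as an example estimator; the keys below are its
# parameter names for learning rate, batch size, and layer structure.
model = MLPClassifier()

param_distributions = {
    'learning_rate_init': [0.001, 0.01, 0.1],
    'batch_size': [16, 32, 64],
    'hidden_layer_sizes': [(64,) * n for n in (2, 3, 4)],  # 2, 3, or 4 layers
}

search = RandomizedSearchCV(
    model,
    param_distributions,
    n_iter=20,  # sample 20 of the 27 possible combinations
    cv=5,       # 5-fold cross-validation for each sampled configuration
    random_state=0,
)
# search.fit(X_train, y_train)  # supply your own training data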
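
The multi-fidelity idea behind Hyperband can be illustrated with successive halving, the subroutine Hyperband builds on: evaluate many configurations on a small budget, keep the best fraction, and repeat with a larger budget. This is a simplified sketch, not a full Hyperband implementation, and `toy_loss` is a hypothetical objective standing in for real training.

```python
import random


def successive_halving(configs, evaluate, min_budget=1, eta=2):
    """Keep the best 1/eta of configs each round while multiplying the budget by eta.

    evaluate(config, budget) must return a loss (lower is better).
    """
    survivors = list(configs)
    budget = min_budget
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget))
        survivors = ranked[: max(1, len(ranked) // eta)]
        budget *= eta
    return survivors[0]


def toy_loss(config, budget):
    # Hypothetical objective: loss is lowest at learning_rate = 0.01 and
    # shrinks as the training budget grows. A real version would train a model.
    return abs(config["learning_rate"] - 0.01) + 1.0 / budget


rng = random.Random(0)
candidates = [{"learning_rate": 10 ** rng.uniform(-4, 0)} for _ in range(16)]
candidates.append({"learning_rate": 0.01})
best = successive_halving(candidates, toy_loss)
```

Most of the compute goes to the few configurations that survive, which is where the savings over evaluating every candidate at full budget come from.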

Scaling

As data volume and model size grow, training has to be spread across more hardware:

Strategies:

  1. Data Parallelism: Replicate the model on each worker and split the data across them
  2. Model Parallelism: Split the model across devices when it does not fit on one
  3. Pipeline Parallelism: Split the model into sequential stages that process different batches concurrently
  4. Distributed Training: Use frameworks like Horovod, PyTorch Distributed
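
To show why data parallelism gives the same result as single-worker training, here is a toy sketch for a one-parameter linear model: each simulated worker computes a gradient on its shard, and averaging those gradients reproduces the full-batch gradient. It assumes the batch divides evenly across workers; real frameworks perform the averaging with a collective operation across machines.

```python
def mse_gradient(weight, batch):
    """Gradient of mean squared error for the model y = weight * x."""
    return sum(2 * (weight * x - y) * x for x, y in batch) / len(batch)


def data_parallel_gradient(weight, batch, num_workers):
    """Split the batch into equal shards, compute a gradient per shard, then average."""
    shard_size = len(batch) // num_workers
    shards = [batch[i * shard_size:(i + 1) * shard_size] for i in range(num_workers)]
    worker_grads = [mse_gradient(weight, shard) for shard in shards]  # one per worker
    return sum(worker_grads) / num_workers  # stands in for the all-reduce step


# Made-up (x, y) pairs; the batch size is divisible by the worker count.
batch = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
single = mse_gradient(0.5, batch)
parallel = data_parallel_gradient(0.5, batch, num_workers=2)
```

`single` and `parallel` agree up to floating-point rounding, because the mean of equal-sized shard means equals the overall mean.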

Serving

Deploying a model to production means meeting operational requirements as well as accuracy targets.

Considerations:

  • Latency: Real-time vs batch predictions
  • Throughput: Requests per second
  • Cost: Infrastructure and compute costs
  • Reliability: Uptime and error handling
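
For the reliability item above, one common pattern is to wrap the model call so that a failure returns a safe default instead of an error to the user. The sketch below is one possible shape for this; `flaky_model` and the fallback value are hypothetical, and whether a fallback is acceptable at all depends on the application.

```python
import logging

logger = logging.getLogger("serving")


def predict_with_fallback(predict_fn, features, fallback):
    """Return (prediction, used_fallback); never raise to the caller."""
    try:
        return predict_fn(features), False
    except Exception:
        # Log the full traceback so the failure is visible in monitoring.
        logger.exception("prediction failed; returning fallback")
        return fallback, True


def flaky_model(features):
    # Hypothetical model that rejects empty input.
    if not features:
        raise ValueError("empty feature vector")
    return sum(features) / len(features)


ok = predict_with_fallback(flaky_model, [1.0, 3.0], fallback=0.0)
failed = predict_with_fallback(flaky_model, [], fallback=0.0)
```

Returning the `used_fallback` flag lets the caller count fallbacks, so a rising fallback rate can trigger an alert rather than going unnoticed.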

Serving Options:

  1. REST API: Standard HTTP endpoints
  2. gRPC: Binary protocol over HTTP/2, usually lower overhead than JSON over REST
  3. Batch Predictions: Offline processing
  4. Edge Deployment: On-device inference
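
# Example serving code
from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)
model = joblib.load('model.pkl')  # load once at startup, not per request

FEATURE_NAMES = ['feature_1', 'feature_2']  # placeholder: use the model's real inputs

def extract_features(data):
    # Placeholder helper: order the JSON fields the way the model expects
    return [data[name] for name in FEATURE_NAMES]

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json(silent=True)
    if data is None:
        return jsonify({'error': 'request body must be JSON'}), 400
    try:
        features = extract_features(data)
    except (KeyError, TypeError):
        return jsonify({'error': 'missing or malformed input fields'}), 400
    prediction = model.predict([features])
    return jsonify({'prediction': prediction.tolist()})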

3. Case Studies

Real-world examples demonstrating ML systems design:

Topics Covered:

  • E-commerce recommendation systems
  • Fraud detection at scale
  • Real-time personalization
  • Content moderation
  • Predictive maintenance

4. Exercises

Practical exercises to reinforce learning:

Exercise Categories:

  1. System Design Questions

    • Design a recommendation system
    • Build a fraud detection pipeline
    • Create a real-time prediction service
  2. Implementation Tasks

    • Implement a feature store
    • Build a model monitoring system
    • Create a deployment pipeline
  3. Optimization Challenges

    • Reduce model latency
    • Improve throughput
    • Minimize infrastructure costs

Key Takeaways

Research vs Production

Research ML:

  • Focus on novel algorithms
  • Maximize accuracy
  • Use clean, curated datasets
  • Compute cost is rarely the main constraint

Production ML:

  • Focus on reliability and maintainability
  • Balance accuracy with other metrics
  • Deal with messy, real-world data
  • Optimize for cost and latency

Critical Success Factors

  1. Data Quality

    • Garbage in, garbage out
    • Invest in data infrastructure
    • Monitor data drift
  2. Monitoring

    • Track model performance
    • Detect degradation early
    • Alert on anomalies
  3. Iteration

    • Start simple
    • Measure everything
    • Improve incrementally
  4. Collaboration

    • Bridge ML and engineering teams
    • Clear communication
    • Shared understanding of goals
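
Data drift monitoring, mentioned under Data Quality above, is often implemented by comparing the distribution of a feature in live traffic against a reference sample. The sketch below uses the Population Stability Index (PSI) as one possible drift score. The bin count, the `eps` floor, and any alert threshold are assumptions to calibrate per feature.

```python
import math


def population_stability_index(expected, actual, num_bins=10, eps=1e-6):
    """PSI between a reference sample and a live sample of one numeric feature."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / num_bins or 1.0  # avoid zero width for constant features

    def bin_fractions(values):
        counts = [0] * num_bins
        for v in values:
            # Clamp so live values outside the reference range land in an edge bin.
            index = min(max(int((v - lo) / width), 0), num_bins - 1)
            counts[index] += 1
        # Floor at eps so empty bins do not cause log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


reference = [i / 100 for i in range(100)]  # roughly uniform on [0, 1)
same = population_stability_index(reference, reference)
shifted = population_stability_index(reference, [v + 0.5 for v in reference])
```

Identical samples score zero, and the score grows as the live distribution moves away from the reference, so it can be tracked over time and alerted on.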

Best Practices

Development

  • Use version control for everything (code, data, models)
  • Automate repetitive tasks
  • Write tests for data and models
  • Document decisions and trade-offs

Deployment

  • Start with simple baselines
  • A/B test new models
  • Gradual rollouts
  • Easy rollback procedures
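
A gradual rollout or A/B test needs a stable way to decide which users see the new model. One common approach, sketched here under the assumption that each request carries a user ID, is to hash the ID into a bucket and compare it with the rollout percentage. The salt string and model names are made up for the example.

```python
import hashlib


def rollout_bucket(user_id, salt="model-v2-rollout"):
    """Map a user ID to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def choose_model(user_id, rollout_percent):
    """Send rollout_percent of users to the new model, the rest to the current one."""
    return "candidate" if rollout_bucket(user_id) < rollout_percent else "current"


assignments = [choose_model(f"user-{i}", rollout_percent=10) for i in range(10_000)]
candidate_share = assignments.count("candidate") / len(assignments)
```

Because the bucket depends only on the user ID and salt, a user stays in the same group across requests, raising the percentage only adds users to the candidate group, and setting it to 0 acts as an immediate rollback.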

Monitoring

  • Business metrics first
  • Model-specific metrics
  • System health metrics
  • User feedback loops

Tools and Technologies

Data Processing

  • Apache Spark
  • Apache Beam
  • Dask
  • Pandas

Model Training

  • TensorFlow
  • PyTorch
  • Scikit-learn
  • XGBoost

Model Serving

  • TensorFlow Serving
  • TorchServe
  • FastAPI
  • BentoML

MLOps

  • MLflow
  • Kubeflow
  • Airflow
  • DVC

Conclusion

Designing ML systems for production requires a different mindset than research. Success depends on building robust, scalable systems that deliver value to users while being maintainable and cost-effective.

The journey from research to production involves:

  • Understanding requirements
  • Building solid data foundations
  • Selecting appropriate models
  • Iterating based on feedback
  • Monitoring and maintaining systems

This guide provides a structured approach to machine learning systems design, bridging the gap between research and production.