First published: March 30, 2025

Introduction

In the fields of Artificial Intelligence (AI) and Machine Learning (ML), building a fantastic ML model is only part of the story. A truly successful ML project isn’t just about the model itself; it’s about getting that model into the hands of users, continuously improving it, and ensuring it runs reliably.

As an example, consider a SaaS platform that offers a basic genAI model. To truly benefit users, you need to update it continuously with user feedback; without incorporating user data, the genAI model will quickly become outdated. The goal is to use this data to deliver a better, more responsive AI-powered service.

That’s where MLOps (Machine Learning Operations) comes in.

What is MLOps?

MLOps is essentially about streamlining the process of taking an ML model from initial development to production and ongoing management – all with a focus on automation, collaboration, and continuous improvement.

How Does it Work?

The following is a good starting point that captures the key components and workflow:

  • Datasets. The workflow starts with obtaining datasets from data sources, which can include databases, files, APIs, and more.

  • Data Pre-processing. Preparing the data, including data cleaning and feature engineering.

  • Model Development. This includes model training (if we are not using a pre-trained model) and model tuning.

  • Model Packaging. Includes uploading the model to a registry (public or private), facilitating its subsequent download and deployment.

  • Model Deployment. This step also entails A/B testing as well as “canary” deployment (where the model is gradually rolled out to a small subset of users before being fully deployed to all users).

  • Continuous Retraining. Based on new data obtained from users, the model is continuously retrained, closing the feedback loop. Retraining can be triggered by various factors, including data drift, performance degradation, or a fixed schedule.

  • Model Monitoring. In this feedback loop, the model is continuously observed in production, yielding metrics that inform the retraining process. (A minimal code sketch of the overall workflow follows this list.)
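
To make the workflow concrete, below is a minimal sketch of these stages as plain Python functions. It assumes scikit-learn and joblib are available, and `load_raw_data()` is a hypothetical stand-in for whatever database, file, or API the data actually comes from; treat it as an illustration of the flow, not a production pipeline.

```python
# A minimal, illustrative sketch of the workflow above.
# load_raw_data() is a hypothetical stand-in for a real data source.
import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def load_raw_data():
    # Datasets: stand-in for a database query, file read, or API call.
    rng = np.random.default_rng(42)
    X = rng.normal(size=(1000, 4))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    return X, y


def preprocess(X):
    # Data pre-processing: crude cleaning plus one engineered feature.
    X = np.nan_to_num(X)
    extra = (X[:, 0] * X[:, 1]).reshape(-1, 1)
    return np.hstack([X, extra])


def train_and_tune(X, y):
    # Model development: training (tuning omitted for brevity).
    model = LogisticRegression(max_iter=1000)
    model.fit(X, y)
    return model


def package(model, path="model.joblib"):
    # Model packaging: serialize the artifact; a real setup would push it
    # to a model registry rather than the local filesystem.
    joblib.dump(model, path)
    return path


if __name__ == "__main__":
    X, y = load_raw_data()
    X = preprocess(X)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    model = train_and_tune(X_train, y_train)
    print("holdout accuracy:", accuracy_score(y_test, model.predict(X_test)))
    package(model)
```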

Key Closed-Loop Components

  1. Model Monitoring → Trigger Retraining:
    • Detects data drift, concept drift, or performance decay (a minimal drift check is sketched after this list).
    • Triggers retraining via automated rules (e.g., accuracy drops below a threshold) or on a preset schedule (e.g., once a month).
  2. Continuous Retraining:
    • Ingests new data (from production feedback or updated datasets).
    • Repeats preprocessing, training, and validation.
    • Validates the retrained (challenger) model against the current production (champion) model before redeployment.
  3. Feedback Loop:
    • Production inferences/logs feed back into monitoring and retraining.
    • Human-in-the-loop (HITL) reviews for critical edge cases.
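
As a rough illustration of this closed loop, the sketch below flags drift in a single feature with a two-sample Kolmogorov–Smirnov test (SciPy) and only promotes a retrained challenger if it beats the current champion on a holdout set. The test choice, the thresholds, and the `retrain_model()` helper are assumptions made for the example, not a standard recipe.

```python
# Illustrative closed-loop logic: drift check -> retrain -> champion/challenger gate.
# Thresholds and the retrain_model() callable are hypothetical.
from scipy.stats import ks_2samp
from sklearn.metrics import accuracy_score


def feature_drifted(reference, live, p_threshold=0.05):
    # Two-sample KS test: a small p-value suggests the live feature
    # distribution no longer matches the training-time reference.
    statistic, p_value = ks_2samp(reference, live)
    return p_value < p_threshold


def maybe_retrain(champion, ref_feature, live_feature,
                  retrain_model, X_holdout, y_holdout):
    # 1. Monitoring: act only on drift or decayed accuracy.
    champion_acc = accuracy_score(y_holdout, champion.predict(X_holdout))
    if not feature_drifted(ref_feature, live_feature) and champion_acc >= 0.9:
        return champion  # nothing to do

    # 2. Continuous retraining: fit a challenger on fresh data.
    challenger = retrain_model()

    # 3. Champion/challenger gate: redeploy only if the challenger wins.
    challenger_acc = accuracy_score(y_holdout, challenger.predict(X_holdout))
    return challenger if challenger_acc > champion_acc else champion
```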

Tools for Continuous Retraining

Some commonly used tools for continuous retraining:

  • Drift Detection: Amazon SageMaker Model Monitor
  • Automated Pipelines: Kubeflow Pipelines, Apache Airflow, MLflow
  • Retraining Triggers: CI/CD (e.g., GitHub Actions, Jenkins) + Lambda functions
  • Model Registry: MLflow, TensorFlow Extended (TFX), Vertex AI
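
For the model-registry step specifically, a retrained model is typically logged and registered so that deployment picks up a versioned artifact rather than a loose file. Below is a minimal MLflow-based sketch; the experiment name, model name, and SQLite-backed tracking store are assumptions for illustration, to be adapted to your own MLflow setup.

```python
# Sketch of registering a (re)trained model in MLflow's model registry.
import mlflow
import mlflow.sklearn
import numpy as np
from sklearn.linear_model import LogisticRegression

# The registry needs a database-backed store (a plain local file store won't do);
# the SQLite URI here is just a convenient local assumption.
mlflow.set_tracking_uri("sqlite:///mlflow.db")
mlflow.set_experiment("continuous-retraining-demo")

# Stand-in for the model produced by a retraining run.
X = np.random.default_rng(0).normal(size=(200, 4))
y = (X[:, 0] > 0).astype(int)
model = LogisticRegression(max_iter=1000).fit(X, y)

with mlflow.start_run():
    mlflow.log_metric("train_accuracy", model.score(X, y))
    # registered_model_name creates a new version in the registry, which a
    # deployment job can later fetch by name (and stage or alias).
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name="demo-classifier",
    )
```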
