Building an End-to-End MLOps Pipeline with Open-Source Tools

By: Grig Duta, Solution Architect at Qwak

MLOps Open Source: TL;DR

This article serves as a focused guide for data scientists and ML engineers who are looking to transition from experimental machine learning to production-ready MLOps pipelines. We identify the limitations of traditional ML setups and introduce you to essential open-source tools that can help you build a more robust, scalable, and maintainable ML system.‍

The tools discussed include Feast for feature management, MLflow for model tracking and versioning, Seldon for model deployment, Evidently for real-time monitoring, and Kubeflow for workflow orchestration.

Introduction

The machine learning landscape is constantly changing, and the transition from model development to production deployment presents its own set of challenges. While Jupyter notebooks and isolated scripts are useful for experimentation, they often lack the features needed for a production-grade system. This article aims to help you navigate these challenges by introducing the concept of MLOps and a selection of open-source tools that can facilitate the creation of a production-ready ML pipeline.‍

Whether you’re a data scientist looking to transition into production or an ML engineer seeking to optimize your existing workflows, this article aims to provide a focused overview of essential MLOps practices and tools.‍

Why Production-Grade ML is Different

Dynamic vs Static: Unlike experimental ML, which often uses fixed datasets, production environments are dynamic. They require systems that can adapt to fluctuating user demand and data variability.

Unpredictable Demand: User requests in production can arrive at any time and in any volume, requiring a system that can auto-scale to meet demand.

Data Drift: Production systems need constant monitoring for changes in data distribution, known as data drift, which can affect model performance.

Real-Time Needs: Many production applications require real-time predictions, necessitating low-latency data processing.

The Traditional Machine Learning Setup

In a traditional machine learning setup, the focus is often on experimentation and proof-of-concept rather than production readiness. The workflow is generally linear and manual, lacking the automation and scalability required for a production environment.‍

Let’s break down what this traditional setup usually entails, using a Credit Risk prediction model as an example:‍

  • Data preprocessing and feature engineering are typically done in an ad-hoc manner using tools like Jupyter notebooks. There’s usually no version control for the data transformations or features, making it difficult to reproduce results.
  • The model is trained and validated using the same Jupyter notebook environment. Hyperparameters are often tuned manually, and the training process lacks systematic tracking of model versions, performance metrics, or experiment metadata.
  • Once the model is trained, predictions are run in batch mode. This is a manual step where the model is applied to a dataset to generate predictions, which are then saved for further analysis or reporting.
  • The prediction results, along with any model artifacts, are manually saved to a data storage service, such as a cloud-based data store. There’s usually no versioning or tracking, making it challenging to manage model updates or rollbacks.‍

You can picture this setup in the following image.

The Journey from Notebooks to Production: An Integrated ML System

While there’s a wealth of information available on the MLOps lifecycle, this article aims to provide a focused overview that bridges the gap between a model in a Jupyter notebook and a full-fledged, production-grade ML system.‍

Our primary objective is to delve into how specific open-source tools can be orchestrated to create a robust, production-ready ML setup. Although we won’t be discussing Kubernetes in detail, it’s worth noting that the architecture we explore can be deployed on container orchestrators like Kubernetes for further scalability and robustness.

To ground our discussion, let’s consider a real-world scenario: a data scientist is developing a predictive model for credit risk. The model relies on financial and behavioral data, which is housed in a Snowflake database but could just as easily reside in other data sources.‍

The following diagram illustrates an integrated ML system capable of real-time predictions. The subsequent sections will guide you through the transformation from a traditional ML setup to this more integrated, streamlined system. To achieve this, we’ll introduce you to the top 5 open-source tools that are making waves in the industry today.

Integrated ML system diagram

Data Management and Feature Engineering

Managing and engineering features for machine learning models is a critical step that often determines the quality of the model. A feature store serves as a centralized hub for these features, ensuring that they are consistently managed, stored, and served.‍

What is a Feature Store?

A feature store is a centralized repository that acts as a bridge between raw data and machine learning models. It streamlines the feature engineering process and ensures consistency across different models. Feature stores can be broadly categorized into two types:‍

  • Offline Feature Store — primarily used for batch processing of features. It stores historical feature data used to train machine learning models and is optimized for analytical workloads, letting you query large volumes of data efficiently. The offline store is typically backed by a data warehouse or a distributed file system like S3.
  • Online Feature Store — serves features in real time for model inference. When a prediction request comes in, the online feature store quickly retrieves the relevant features to be fed into the model, which is crucial for applications that require low-latency predictions. Online feature stores are often backed by high-performance databases like Redis to ensure quick data retrieval. A sketch contrasting the two access patterns follows this list.
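To make the distinction concrete, here is a minimal, hypothetical retrieval sketch using the SDK of Feast (the feature store introduced in the next section). The feature view name customer_features and the entity key customer_id are illustrative, and the exact method signatures can vary across Feast versions.

import pandas as pd
from feast import FeatureStore

# Assumes a Feast feature repository has already been configured in the current directory
store = FeatureStore(repo_path=".")

# Offline store: point-in-time correct historical features for model training
entity_df = pd.DataFrame({
    "customer_id": [1001, 1002],
    "event_timestamp": pd.to_datetime(["2023-09-01", "2023-09-01"]),
})
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["customer_features:credit_utilization",
              "customer_features:num_late_payments"],
).to_df()

# Online store: low-latency feature lookup at inference time
online_features = store.get_online_features(
    features=["customer_features:credit_utilization",
              "customer_features:num_late_payments"],
    entity_rows=[{"customer_id": 1001}],
).to_dict()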

Tool of Choice: Feast

Feast (Feature Store) is an open-source platform that manages features for both training and inference. Unlike traditional setups where feature engineering is often ad-hoc and siloed, Feast provides a unified platform that reduces training-serving skew by ensuring consistent data preprocessing during both training and inference.

What we like about Feast:

  • Consistency: Ensures that the same features and data preprocessing steps are used during both training and inference, reducing the risk of model skew.
  • Traceability: Offers feature versioning, allowing you to trace back to the exact state of features used to train a particular model version.
  • Modularity: Works well with other MLOps tools, offering easy integration points for model training, deployment, and monitoring.
  • Real-time and Batch Support: Provides both online and offline feature stores, catering to real-time inference as well as batch training needs.

Additional Considerations:

  • Infrastructure Requirements: Feast is not a data processing engine but an abstraction layer that integrates with your existing data infrastructure. For batch feature computation, tasks can be offloaded to an Apache Spark cluster, while real-time feature computation can be handled by a Kafka Streams application.‍
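To give a flavor of how features are declared, below is a minimal, hypothetical Feast feature repository definition for customer-level credit features backed by a Parquet file. The entity, view, and column names are illustrative, and exact class signatures differ somewhat between Feast releases.

from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

# Entity: the business key the features are joined and looked up on
customer = Entity(name="customer", join_keys=["customer_id"])

# Source: a Parquet file here; in practice this could be a warehouse table (e.g., Snowflake)
customer_source = FileSource(
    path="data/customer_features.parquet",
    timestamp_field="event_timestamp",
)

# Feature view: a named, versionable group of features served both offline and online
customer_features = FeatureView(
    name="customer_features",
    entities=[customer],
    ttl=timedelta(days=1),
    schema=[
        Field(name="credit_utilization", dtype=Float32),
        Field(name="num_late_payments", dtype=Int64),
    ],
    source=customer_source,
)

Running feast apply registers these definitions, and feast materialize loads the latest values into the online store for serving.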

Model Tracking and Versioning

Model Tracking and Versioning are integral parts of MLOps that provide a systematic way to manage various aspects of machine learning models. While tracking focuses on capturing parameters, metrics, and artifacts during the model training process, versioning deals with managing different iterations of trained models. Both are crucial for ensuring reproducibility, facilitating collaboration, and enabling easy deployment and rollback.

Tool of Choice: MLflow

MLflow excels in providing a unified platform for both tracking and versioning. It simplifies the otherwise complex task of logging different model runs with varying hyperparameters and algorithms. With MLflow, you can easily compare metrics to identify the best-performing model.‍

Moreover, MLflow’s model registry feature comes in handy when you have multiple model runs. It allows you to store serialized models and their associated artifacts in a centralized repository. This makes it easier to identify which model corresponds to a specific run and to keep track of deployment statuses across different environments.

What We Like About MLflow

  • Simplicity: MLflow is Python-friendly and allows for easy logging of model artifacts and metadata, often in just a few lines of code.
  • Comprehensive Tracking: It offers a centralized dashboard for all your machine learning experiments, streamlining the comparison process.
  • Version Management: The model registry feature is invaluable for managing different versions of your models, facilitating easy rollbacks or promotions from staging to production.

Additional Considerations

  • Access Control: While MLflow provides basic access control features, you may need to integrate it with third-party identity providers for more granular permissions, especially in larger teams or organizations.
  • Operational Overheads: Running MLflow on a centralized server requires attention to server uptime, backups, and scalability. These operational aspects are not unique to MLflow and would apply to any centralized tool in your stack.
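The snippet below illustrates this flow on a toy dataset: it trains a scikit-learn classifier, logs its parameters, metrics, and model artifact to an MLflow run, and then registers the resulting model in the Model Registry.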
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the Iris dataset and split it into train and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Start a new MLflow run
with mlflow.start_run() as run:
    # Train a RandomForest model
    clf = RandomForestClassifier()
    clf.fit(X_train, y_train)

    # Log parameters and metrics
    mlflow.log_param("criterion", clf.criterion)
    mlflow.log_metric("accuracy", clf.score(X_test, y_test))

    # Log the trained model
    mlflow.sklearn.log_model(clf, "RandomForestModel")

# Register the model to the MLflow Model Registry
mlflow.register_model(f"runs:/{run.info.run_id}/RandomForestModel", "IrisClassifier")

Model Deployment and Serving

After training and registering your ML model, the next critical step is deployment. While batch predictions are often executed offline where latency is less critical, real-time predictions demand low-latency responses to HTTP requests. Without deployment, an ML model remains just a theoretical construct; it’s the deployment that brings it to life in a real-world context.‍

Tool of Choice: Seldon

Seldon Core, Seldon's open-source serving framework, deploys machine learning models on Kubernetes and exposes them as scalable microservices behind REST or gRPC endpoints.

What we like about Seldon:

  • Flexibility: Supports multiple machine learning frameworks, including MLflow models, and is highly compatible with Python.
  • Customization: Enables tailored prediction pipelines by allowing custom pre-processing and post-processing steps. For instance, you can query online features from Feast as a pre-processing step.
  • Monitoring: Comes with built-in monitoring capabilities, including Prometheus metrics and Grafana dashboards, for real-time performance tracking.
  • Advanced Deployment Strategies: Supports complex deployments like A/B testing to optimize model performance in live environments.
  • Community and Ecosystem: Being open-source, Seldon has a strong community, making it easier to find support and resources.

Additional Considerations:

  • Scalability Management: While Seldon’s seamless integration with Kubernetes allows for autoscaling, managing this scalability can be complex. You’ll need to set appropriate scaling policies and thresholds to ensure efficient resource utilization on top of the Kubernetes cluster management.
  • Resource Allocation: While robust, Seldon can be resource-intensive and may require a well-provisioned infrastructure.
  • Metrics Overhead: Seldon can be configured to capture and display metrics like requests per minute, failure rate, latency, and resource consumption (CPU, memory), which are great for monitoring your models. However, the overhead of storing and analyzing these metrics can be significant, so you'll need to decide which metrics are essential and how they will be stored and accessed.
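As a rough, illustrative sketch of the customization point mentioned above, Seldon Core's Python wrapper convention expects a class that exposes a predict method; the class below loads a model from the MLflow registry and enriches incoming requests with online features from Feast. The class, model, and feature names are hypothetical, and the container packaging and SeldonDeployment manifest it would run under are omitted.

import mlflow.sklearn
from feast import FeatureStore


class CreditRiskModel:
    # Seldon Core Python-wrapper style model class (sketch only)

    def __init__(self):
        # Hypothetical registry URI pointing at the "Production" stage of the model
        self.model = mlflow.sklearn.load_model("models:/CreditRiskModel/Production")
        self.store = FeatureStore(repo_path=".")

    def predict(self, X, features_names=None):
        # Pre-processing: look up online features for the incoming customer IDs
        rows = [{"customer_id": int(customer_id)} for customer_id in X[:, 0]]
        features = self.store.get_online_features(
            features=["customer_features:credit_utilization",
                      "customer_features:num_late_payments"],
            entity_rows=rows,
        ).to_df()
        # Score with the model loaded from the MLflow registry
        return self.model.predict_proba(
            features[["credit_utilization", "num_late_payments"]].values
        )

In a real deployment, a class like this would be packaged into a container image and referenced from a SeldonDeployment resource, which is what provides the Kubernetes-level autoscaling and monitoring discussed above.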

Model Monitoring

In a production environment, machine learning models interact with dynamic and constantly changing data. This fluidity necessitates vigilant monitoring to ensure that the model’s performance remains consistent and reliable over time. Without proper monitoring, you risk model drift, data anomalies, and degraded performance, which could have significant business impact.‍

Tool of Choice: Evidently

Evidently is an open-source tool designed for comprehensive model monitoring. It provides real-time insights into model performance and data quality, helping you identify issues like model drift and anomalies as they occur.‍

What we like about Evidently:

  • Real-Time Monitoring: Evidently provides real-time monitoring capabilities, allowing you to catch issues as they arise, rather than after they’ve impacted the model’s performance or the business.
  • Comprehensive Metrics: It offers a wide range of metrics for monitoring model performance, including data drift, prediction drift, and various statistical divergences.
  • User-Friendly Dashboards: Evidently comes with intuitive dashboards that make it easy to visualize and understand the model’s performance metrics.
  • Integration: It can be easily integrated into existing MLOps pipelines and works well with other tools like Seldon for model deployment and Feast for feature management.

Additional considerations:

  • Initial Data Flow Design: Before you even start collecting data, you’ll need to architect a data flow that can handle both training and prediction data. This involves deciding how data will be ingested, processed, and eventually fed into Evidently for monitoring.
  • Data Storage Strategy: Where you store this integrated data is crucial. You’ll need a storage solution that allows for easy retrieval and is scalable, especially if you’re dealing with large volumes of real-time data.
  • Automated Workflows: Consider automating the data flow from Seldon and your training data source to Evidently. This could involve setting up automated ETL jobs or utilizing orchestration tools to ensure data is consistently fed into the monitoring tool.‍
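To illustrate what that data flow feeds into, here is a minimal sketch using Evidently's report API to compare a reference (training) dataset with recent production data. The file paths and dataframe names are illustrative, and module paths differ slightly between Evidently versions.

import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# Reference data the model was trained on vs. recent data captured at serving time
reference_df = pd.read_parquet("data/training_features.parquet")
current_df = pd.read_parquet("data/recent_serving_features.parquet")

# Build a data drift report that compares the two datasets column by column
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference_df, current_data=current_df)

# Persist as an HTML dashboard, or export as a dict/JSON for automated alerting
report.save_html("credit_risk_drift_report.html")
drift_summary = report.as_dict()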

Workflow Orchestration

Orchestrating the various moving parts of an ML system is no small feat. From data ingestion to model deployment, each step is a cog in a larger machine, and coordinating them manually is not just tedious but error-prone.

Tool of Choice: Kubeflow

Kubeflow is an open-source platform optimized for the deployment of machine learning workflows in Kubernetes environments. It simplifies the process of taking machine learning models from the lab to production by automating the coordination of complex workflows.‍

What we like about Kubeflow:

  • Simplicity: Kubeflow abstracts the complexity of orchestrating machine learning workflows, especially in Kubernetes environments.
  • Scalability: Being native to Kubernetes, it allows for easy scaling of your machine learning models and data pipelines.
  • Extensibility: It supports a wide range of machine learning frameworks and languages, making it a versatile choice for different kinds of ML projects.

Additional considerations:

  • Kubernetes Expertise: Managing Kubeflow effectively requires a good understanding of Kubernetes. This includes knowledge of Kubernetes namespaces, RBAC, and other security features, which can add complexity to the management overhead.
  • State Management: Kubeflow pipelines are stateless by default. If your workflows require stateful operations, you’ll need to manage this yourself, adding to the complexity.
  • Resource Requirements: While Kubeflow is powerful, it can be resource-intensive and may require a well-provisioned infrastructure to run efficiently.‍
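As a deliberately simplified, hypothetical example of what a pipeline definition looks like, the sketch below uses KFP v2-style lightweight components to chain a feature-materialization step and a training step. The component bodies are placeholders, and decorator signatures differ between KFP versions.

from kfp import dsl, compiler


@dsl.component(base_image="python:3.10")
def materialize_features(end_date: str) -> str:
    # Placeholder: e.g., trigger `feast materialize-incremental` up to end_date
    print(f"Materializing features up to {end_date}")
    return end_date


@dsl.component(base_image="python:3.10")
def train_model(end_date: str) -> str:
    # Placeholder: pull training data, fit the model, and log it to MLflow
    print(f"Training credit risk model with data up to {end_date}")
    return "models:/CreditRiskModel/1"


@dsl.pipeline(name="credit-risk-training-pipeline")
def credit_risk_pipeline(end_date: str = "2023-09-01"):
    features_task = materialize_features(end_date=end_date)
    train_model(end_date=features_task.output)


if __name__ == "__main__":
    # Compile to a pipeline spec that can be uploaded to a Kubeflow Pipelines instance
    compiler.Compiler().compile(credit_risk_pipeline, "credit_risk_pipeline.yaml")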

The following diagram can serve as a visual guide to understanding how Kubeflow orchestrates the various components in an ML workflow, including the feedback loop for continuous improvement.

MLOps workflow and feedback loop visual representation

Exploring Managed MLOps Solutions

While open-source tools offer flexibility and community support, they often come with the operational burden of setup, maintenance, and integration. One often overlooked aspect is the need for Machine Learning Engineering (MLE) skills within the organization to effectively utilize these tools. Additionally, the time required for full deployment of these open-source solutions can range from 3 to 6 months, a period during which models may be held back from production, affecting business objectives.

If you’re looking for an alternative that combines the best of both worlds, managed MLOps platforms are worth considering.‍

One such platform is Qwak, which aims to simplify the MLOps lifecycle by offering an integrated suite of features. From feature management to model deployment and monitoring, Qwak provides a unified solution that can be deployed either as a SaaS or on your own cloud infrastructure. This allows you to focus more on model development and business impact, rather than the intricacies of tool management.‍

Conclusion

The transition from experimental machine learning to production-grade MLOps is a complex but necessary journey for organizations aiming to leverage ML at scale. While traditional setups may suffice for isolated experiments, they lack the robustness and scalability required for production environments. Open-source tools like Feast, MLflow, Seldon, and others offer valuable capabilities but come with their own set of challenges, including operational overhead and integration complexities.

Managed MLOps platforms like Qwak offer a compelling alternative, providing an integrated, hassle-free experience that allows teams to focus on what truly matters: building impactful machine learning models. Whether you're a data scientist looking to make the leap into production or an ML engineer aiming to streamline your workflows, the landscape of MLOps tools and practices is rich and varied, offering something for everyone.

By understanding the limitations of traditional setups and exploring both open-source and managed solutions, you can make informed decisions that best suit your organization’s needs and take a significant step towards operationalizing machine learning effectively.‍

This article was originally published in the Qwak Blog.
