MLOps - a new hot word or a necessity?

People have always sought patterns in their environment, and simple forms of pattern recognition can be traced back to the earliest days of mankind. Much later, starting with the third industrial revolution, data began to take digital form, and so did the effort to recognize patterns in it.

Today, machine learning models are embedded in nearly every application we use, from games all the way up to industry-grade systems. Academia and industry have joined forces and put enormous resources into the research and development of state-of-the-art (SOTA) models, many of them available as open source. Combined with transfer learning and a vast amount of open data, those models allow one to develop a machine learning application within an extremely short time after the initial idea is defined. Ideas turned into applications spawn new startups at a rate never seen before.

Every model (from failed to SOTA) starts in a research environment, commonly in a notebook. As the project becomes more serious, utility libraries are generalized for reuse; scripts for training, deployment, evaluation, and visualization of results are written; logging is added at several levels; and, once the model is in production, a performance monitoring setup is built. Retraining and prevention of performance degradation are often part of that monitoring setup, which analyzes input drift or evaluates predictions against ground truth data.
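
To make the idea of input drift analysis concrete, here is a minimal sketch that compares the distribution of one feature at training time with the distribution seen in production. It uses the Population Stability Index (PSI), which is my choice for illustration and not necessarily what a given monitoring setup uses. The function name, the number of bins, and the 0.25 alert threshold (a commonly quoted rule of thumb) are all assumptions.

```python
import math
from typing import List, Sequence


def population_stability_index(
    reference: Sequence[float],
    live: Sequence[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """PSI between a reference (training) sample and a live (serving) sample."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant feature

    def bin_shares(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Clamp the index so live values outside the reference range
            # land in the first or the last bin.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or a division by zero.
        return [max(c / len(values), eps) for c in counts]

    ref_shares = bin_shares(reference)
    live_shares = bin_shares(live)
    return sum((l - r) * math.log(l / r) for r, l in zip(ref_shares, live_shares))


# Illustrative data only: a feature seen in training and a shifted copy of it.
train = [i / 100 for i in range(1000)]
shifted = [x + 3.0 for x in train]

ALERT_THRESHOLD = 0.25  # assumption: rule-of-thumb value, tune per feature
print(population_stability_index(train, train))                        # 0.0
print(population_stability_index(train, shifted) > ALERT_THRESHOLD)     # True
```

In practice such a check would run on a schedule for every monitored feature, and a value above the threshold would raise an alert or trigger retraining.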

After the research phase, much of the focus and effort is pure software engineering. Processes and technologies that are now widely recognized in the industry have revolutionized and modernized software development: different flavors of agile methodologies, review and release processes, version control, conventions, CI/CD, and so on. Big players in the industry frequently share their experiences with adopting such processes through various media. One example is Why Google Stores Billions of Lines of Code in a Single Repository.

Data Scientists should focus on exploring and developing models. Everything else should be uniform and standardized. Model development should move beyond Jupyter Notebooks and custom scripts.

Data should be versioned, results visualized, training monitored, and real-time reports made available to all stakeholders. All in all, a step forward should be taken towards the explainability and interoperability of results. This can be done (and currently is done) from scratch within every organization and team, and it works perfectly: data is versioned following conventions and rules, scripts for model evaluation and reports are developed, and monitoring dashboards and reports for stakeholders are drafted. Now, let's shift scope and imagine that every software company developed its own version control system and container technology: the industry would be a total mess.
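
As an illustration of what "versioning data from scratch" tends to look like inside a team, here is a minimal sketch that identifies a dataset by the SHA-256 hash of its content and appends it to a JSON manifest. The function names, the manifest layout, and the file names are hypothetical; dedicated tools cover this far more thoroughly.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_version(data_file: Path, manifest_file: Path) -> str:
    """Append the dataset's content hash to the manifest and return the hash.

    Nothing is appended when the content matches the latest recorded version.
    """
    manifest = json.loads(manifest_file.read_text()) if manifest_file.exists() else []
    version = file_sha256(data_file)
    if not manifest or manifest[-1]["sha256"] != version:
        manifest.append(
            {"file": data_file.name, "sha256": version, "bytes": data_file.stat().st_size}
        )
        manifest_file.write_text(json.dumps(manifest, indent=2))
    return version


# Usage with throwaway files; the names below are made up for the example.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "train.csv"
    manifest = Path(tmp) / "manifest.json"
    data.write_text("a,b\n1,2\n")
    first = record_version(data, manifest)
    data.write_text("a,b\n1,3\n")  # the data changed
    second = record_version(data, manifest)
    print(first != second)  # True
```

Every team that writes its own variant of this ends up with a slightly different manifest format and slightly different rules, which is exactly the duplication of effort described above.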


What is it?

Formally, MLOps is a set of practices that aims to deploy and maintain machine learning models in production reliably and efficiently.


Who does it?

In software development, there are many different roles, each carrying its own responsibilities and viewing the same product from a different angle. In a data science project, the following roles can be distinguished:

  • Business Analyst or Domain Expert
  • Data Analyst
  • Software Engineers and Developers
  • Data and ML Architects
  • Data Engineers
  • MLOps Engineers
  • Optimization Engineers
  • Data Scientists

What are the common tasks?

One of the first attempts to identify the problems in the current workflow of ML applications was made by a group of researchers at Google in their paper Hidden Technical Debt in Machine Learning Systems.

This table summarizes some of the core MLOps principles. Full table and text are available at

| MLOps Principles | Data | ML Model | Code |
| --- | --- | --- | --- |
| Versioning | 1) Data preparation pipelines; 2) Features store; 3) Datasets; 4) Metadata | 1) ML model training pipeline; 2) ML model (object); 3) Hyperparameters; 4) Experiment tracking | 1) Application code; 2) Configurations |
| Testing | 1) Data validation (error detection); 2) Feature creation unit testing | 1) Model specification is unit tested; 2) ML model training pipeline is integration tested; 3) ML model is validated before being operationalized; 4) ML model staleness test (in production); 5) Testing ML model relevance and correctness; 6) Testing non-functional requirements (security, fairness, interpretability) | 1) Unit testing; 2) Integration testing for the end-to-end pipeline |
| Automation | 1) Data transformation; 2) Feature creation and manipulation | 1) Data engineering pipeline; 2) ML model training pipeline; 3) Hyperparameter/parameter selection | 1) ML model deployment with CI/CD; 2) Application build |
| Reproducibility | 1) Backup data; 2) Data versioning; 3) Extract metadata; 4) Versioning of feature engineering | 1) Hyperparameter tuning is identical between dev and prod; 2) The order of features is the same; 3) Ensemble learning: the combination of ML models is the same; 4) The model pseudo-code is documented | 1) Versions of all dependencies in dev and prod are identical; 2) Same technical stack for dev and production environments; 3) Reproducing results by providing container images or virtual machines |
| Deployment | 1) Feature store is used in dev and prod environments | 1) Containerization of the ML stack; 3) On-premise, cloud, or edge | 1) On-premise, cloud, or edge |
| Monitoring | 1) Data distribution changes (training vs. serving data); 2) Training vs. serving features | 1) ML model decay; 2) Numerical stability; 3) Computational performance of the ML model | 1) Predictive quality of the application on serving data |
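
Some of the Reproducibility items in the table (identical hyperparameters between dev and prod, same order of features) can be checked mechanically. The following is a minimal sketch under my own assumptions: the record layout and the function names are hypothetical, and a real setup would also record dependency versions and the data version.

```python
import hashlib
import json
import platform
from typing import Any, Dict, List


def run_record(hyperparameters: Dict[str, Any], feature_names: List[str]) -> Dict[str, Any]:
    """Collect what is needed to compare two environments or two runs."""
    return {
        "hyperparameters": dict(hyperparameters),
        "feature_names": list(feature_names),  # order matters for the model input
        "python_version": platform.python_version(),
    }


def fingerprint(record: Dict[str, Any]) -> str:
    """Stable hash of a record: dict keys are sorted, list order is preserved."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def differences(dev: Dict[str, Any], prod: Dict[str, Any]) -> List[str]:
    """Names of the top-level fields whose values differ between two records."""
    return sorted(k for k in dev.keys() | prod.keys() if dev.get(k) != prod.get(k))


# Same hyperparameters (written in a different order), but swapped features.
dev = run_record({"learning_rate": 0.01, "max_depth": 6}, ["age", "income"])
prod = run_record({"max_depth": 6, "learning_rate": 0.01}, ["income", "age"])

print(fingerprint(dev) == fingerprint(prod))  # False
print(differences(dev, prod))                 # ['feature_names']
```

Storing the fingerprint next to the trained model makes it cheap to refuse a deployment when the production record does not match the one produced in development.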

Where are we now and where should we be heading?

A clear set of tasks and rules cannot be defined, given the nature of software development in general, but from my point of view there are several things worth paying attention to.

Since a large share of an ML application's codebase is not ML code, good practices for writing clean code should be applied wherever possible. Effort should go into drawing abstractions, and common practices should evolve into patterns. Teams and organizations should be encouraged to share the experience and knowledge they gain during research and development. Data versioning, experiment management and reproducibility, sanity checks of data, pipelines, dependencies, and performance monitoring should be part of every ML project, regardless of its size, industry, or impact.
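
A sanity check of data does not have to be elaborate to be useful. Here is a minimal sketch of one, assuming a batch arrives as a list of dictionaries; the schema, the column names, and the value ranges are made up for the example.

```python
from typing import Any, Dict, Iterable, List, Tuple

# Hypothetical schema: column name -> (expected type, minimum, maximum).
SCHEMA: Dict[str, Tuple[type, float, float]] = {
    "age": (int, 0, 120),
    "income": (float, 0.0, 1e7),
}


def validate_rows(
    rows: Iterable[Dict[str, Any]],
    schema: Dict[str, Tuple[type, float, float]] = SCHEMA,
) -> List[str]:
    """Return human-readable problems; an empty list means the batch passed."""
    problems = []
    for i, row in enumerate(rows):
        for column, (expected_type, low, high) in schema.items():
            value = row.get(column)
            if value is None:
                problems.append(f"row {i}: '{column}' is missing")
            elif not isinstance(value, expected_type):
                problems.append(f"row {i}: '{column}' has type {type(value).__name__}")
            elif not low <= value <= high:
                problems.append(f"row {i}: '{column}'={value} outside [{low}, {high}]")
    return problems


batch = [
    {"age": 30, "income": 1000.0},   # valid
    {"age": -1, "income": 1000.0},   # age out of range
    {"age": 40, "income": "n/a"},    # wrong type
]
for problem in validate_rows(batch):
    print(problem)
```

Running a check like this at the start of both the training pipeline and the serving path catches many errors before they reach the model.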

But those processes and tasks will not define themselves or evolve on their own; the people involved have to start practicing them.

Starting with Data Scientists, more effort should be put into practicing clean code, design patterns, and knowledge sharing. Educators (formal or informal) should focus more on creating resources that cover the topics discussed in this post. We see Computer Science graduates with extraordinary Data Science skills who still write messy code and lack fundamental knowledge of software engineering in general. In extreme cases, CS graduates are unfamiliar with version control systems and the HTTP protocol. Because after all:

Every Data Scientist should be an engineer first. And every ML model is just another microservice within the boundaries of its system.

Where should you start?

In my opinion, every data scientist should have solid knowledge of the fundamental concepts discussed in this post. I strongly recommend reading some books, starting with Introducing MLOps: How to Scale Machine Learning in the Enterprise, Practical MLOps: Operationalizing Machine Learning Models, and related textbooks.

Exploring tools like Weights & Biases, Data Version Control, TensorBoard, and others will give you a great start.