The feature store is becoming a fully-fledged component of the modern data stack and has attracted a lot of interest in the past three years. This relatively new category of tools promises to make the productionization of machine learning models effortless, compared to the hurdle it currently is. This is an extremely ambitious promise, as 85% of ML projects reportedly never make it to production. Although this statistic is dubious, as the figure is extremely hard to estimate, it reflects a tough reality: the majority of ML projects never see the light of day. Feature stores promise to change that. How so? By providing data scientists with a catalog of neat, production-ready features, improving the efficiency and scalability of ML models.
This article seeks to explain how feature stores manage to deliver this huge promise. This task first requires us to understand what features and models are, and what makes it so hard to productionize ML models at scale. We'll then be equipped to dive into the technicalities of how feature stores solve the problem.
Here’s a quick overview:
I - Models and features: what are they?
II - What is a feature store?
To keep my explanations from feeling dry and abstract, I'll use a real-life example of an ML model that will carry us throughout the article and help us understand features, models, and the functioning of a feature store platform.
A nice example of a machine learning model that most of us are familiar with is the model powering the Uber Eats application. The ultimate goal of this model is to accurately predict how long it will take for your meal to be prepared and brought to your door, answering the fundamental question of human existence: "when are we eating?". That's an ambitious endeavour, as Uber takes on the task of answering this question for thousands of users at the same time, for different locations, different restaurants, and different drivers. This model is especially interesting as it gave birth to the very concept of the feature store. I'll explain how later. For now, let's dive into the fun part.
When building the ML model used to power the Uber Eats app, the first step is to isolate the signals that can help predict delivery time. A signal is simply an element that helps predict the outcome of interest. In our case, the signals include the following: how long it typically takes for the restaurant to prepare a dish, an indication of how busy the restaurant is, information about the weather, etc. These signals are what we call features, and analyzing feature values over a certain period of time helps us predict the delivery time of our food. For instance, if the restaurant is extremely busy and it's raining, it will take much longer to deliver a meal than when the restaurant is empty and the sun is shining. Signals come in all shapes and sizes, and we shall explore the nuances between different kinds of signals in the second part.
A machine learning model is a system that takes all these signals/features into account to make the most accurate prediction possible about delivery time. Once the appropriate model has been chosen by a data scientist, the model is trained. That is, it learns from a variety of historical examples, comparing its predictions with what actually happened.
The dataset containing all the historical values is called the training set. It allows the model to look back at its past predictions, learn from its mistakes and adjust its predictions accordingly. Running all our examples through a training algorithm gives a decision-making system that can act on an input and output a predicted delivery time. The model depends entirely on these "signals" we mentioned above. It needs that data in a neat and structured format to be able to interpret it and output a decision.
The issue is, raw data is never in a neat and structured format. Special data pipelines are thus set up to put the data into this specific format. We call that feature engineering. For example, to understand how busy a restaurant is, you need to aggregate the number of orders over a given window of time. You need aggregated data to be fed into your model, not just raw data. On its own, raw data won't tell the model anything.
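To make the aggregation idea concrete, here is a minimal sketch. It assumes raw order events arrive as (restaurant id, timestamp) pairs; the function name `orders_in_window`, the sample data, and the 30-minute window are illustrative choices, not a description of Uber's actual pipeline.

```python
from datetime import datetime, timedelta

# Hypothetical raw events: one (restaurant_id, order_time) pair per order.
RAW_ORDERS = [
    ("resto_1", datetime(2024, 1, 1, 12, 5)),
    ("resto_1", datetime(2024, 1, 1, 12, 10)),
    ("resto_1", datetime(2024, 1, 1, 12, 25)),
    ("resto_2", datetime(2024, 1, 1, 12, 20)),
    ("resto_1", datetime(2024, 1, 1, 11, 0)),  # too old for a 30-minute window
]


def orders_in_window(orders, restaurant_id, as_of, window=timedelta(minutes=30)):
    """Count a restaurant's orders placed in the `window` ending at `as_of`."""
    start = as_of - window
    return sum(
        1
        for rid, placed_at in orders
        if rid == restaurant_id and start < placed_at <= as_of
    )


as_of = datetime(2024, 1, 1, 12, 30)
busy_score = orders_in_window(RAW_ORDERS, "resto_1", as_of)
print(busy_score)  # 3
```

The raw rows say nothing about "busyness" individually; only the aggregated count is a value a model can consume.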
To make predictions the model uses two things:
In short, a machine learning model is two things: a model artifact (aka the decision-making system), and these data dependencies that are always changing and that need to be maintained.
You can probably start sensing the complexity of ML models from the previous section. Going deeper into our example allows us to pinpoint exactly where the complexity comes from, so we can explain how feature stores solve it.
Coming back to our Uber Eats example, let's talk about two signals that could become predictive features for our model and what they entail in terms of data infrastructure:
There is a first operational challenge that comes from managing different refresh timeframes according to the kind of feature you’re dealing with. Online and offline features can’t be dealt with the same way: they have to be stored in different places and treated differently.
The pathway to productionize ML models is the following:
This is probably giving you a headache already, and we're just talking about two signals. Complex ML models can include dozens of signals. This gives rise to an extremely complicated nest of data pipelines and services built to support and process all the different types of data, translate it and extract signals from it.
Deploying machine learning is thus not accessible to organizations that can't afford a big engineering team. Building these pipelines and keeping them running demands deep technical expertise and a huge engineering effort. We have seen previously that the model makes predictions based on historical data (batch data), as well as current feature values (real-time data). The data needs to be put in a format that's consumable by the model and delivered to the model in real time. Without an engineering team to rebuild the prototype in the operational environment, the data science project never sees the light of day. Even with a big engineering team, the cost per ML use case is simply too high, as each project requires rebuilding complex data pipelines.
Problems encountered when productionizing ML models:
❌ The machine learning lifecycle takes forever, as it demands the synchronization of the data science and data engineering teams.
❌ Productionization of ML models comes at a huge cost, as it demands that complex data pipelines be built from scratch each time we want to put a model in production.
Now that we have more context surrounding machine learning applications, let's turn to feature stores and how they can solve the aforementioned problems.
A feature store is a platform where all features are centralized and accessible to everyone, allowing employees to re-use them across different projects. More precisely, a feature store is an ML-specific data platform that:
Features in the feature store are calculated and updated daily, to ensure the model remains accurate.
There are five key components to a feature store: Storage, Transformation, Monitoring, Serving, and Cataloging. You can find them explained in more detail in a great article written by Tecton, a feature store platform provider.
Basically, the feature store is a data warehouse of features for machine learning; it is a central vault for storing documented and curated features that can be used across many different models, thus providing support for easy feature management. It differs architecturally from a traditional data warehouse in the sense that it is a dual database.
These two databases support the requirements of different feature serving systems, according to the model needs.
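The dual-database idea can be sketched in a few lines. This is a toy, in-memory illustration under stated assumptions: the class `DualFeatureStore`, its method names, and the integer timestamps are hypothetical, and a real feature store would back each side with a dedicated storage system.

```python
from collections import defaultdict


class DualFeatureStore:
    """Toy dual store: an append-only offline history plus an online latest-value map."""

    def __init__(self):
        self._offline = defaultdict(list)  # (entity, feature) -> [(timestamp, value), ...]
        self._online = {}                  # (entity, feature) -> (timestamp, latest value)

    def write(self, entity, feature, timestamp, value):
        key = (entity, feature)
        # Offline side: keep every value so training sets can be rebuilt later.
        self._offline[key].append((timestamp, value))
        # Online side: keep only the freshest value for low-latency serving.
        current = self._online.get(key)
        if current is None or timestamp >= current[0]:
            self._online[key] = (timestamp, value)

    def get_online(self, entity, feature):
        """Key lookup used at prediction time."""
        return self._online[(entity, feature)][1]

    def get_history(self, entity, feature, start, end):
        """Time-range scan used to build training sets."""
        rows = sorted(self._offline[(entity, feature)])
        return [(ts, v) for ts, v in rows if start <= ts <= end]


store = DualFeatureStore()
store.write("resto_1", "orders_last_30m", 100, 4)
store.write("resto_1", "orders_last_30m", 200, 7)
store.write("resto_1", "orders_last_30m", 300, 5)

latest = store.get_online("resto_1", "orders_last_30m")               # 5
history = store.get_history("resto_1", "orders_last_30m", 100, 200)   # [(100, 4), (200, 7)]
```

The same write feeds both sides, but each side answers a different question: "what is the value right now?" versus "what were the values over this period?".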
The main component of a feature store is the feature catalog or feature registry. This is the repository of neat features, available for all to develop and use. At the training stage, data scientists define features in the feature store's registry. A feature is comprised of:
On the basis of these definitions and metadata, the feature store schedules and configures data ingestion, transformation, and storage. That is, it uses the definitions to write new data pipelines in an automated manner.
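As a rough illustration of what registering a feature might look like, here is a minimal sketch. The `FeatureDefinition` fields (name, entity, transformation, metadata) and the `FeatureRegistry` class are assumptions made for this example; actual feature store products each define their own schema and API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class FeatureDefinition:
    """Hypothetical feature definition: a transformation plus descriptive metadata."""
    name: str
    entity: str
    transform: Callable[[List[dict]], float]
    metadata: Dict[str, str] = field(default_factory=dict)


class FeatureRegistry:
    """Toy registry mapping feature names to their definitions."""

    def __init__(self):
        self._definitions = {}

    def register(self, definition):
        if definition.name in self._definitions:
            raise ValueError(f"feature {definition.name!r} is already registered")
        self._definitions[definition.name] = definition

    def get(self, name):
        return self._definitions[name]

    def compute(self, name, raw_rows):
        """Apply the registered transformation to raw rows."""
        return self.get(name).transform(raw_rows)


def average_prep_minutes(rows):
    """Average preparation time over the raw order rows passed in."""
    return sum(r["prep_minutes"] for r in rows) / len(rows)


registry = FeatureRegistry()
registry.register(
    FeatureDefinition(
        name="avg_prep_minutes",
        entity="restaurant",
        transform=average_prep_minutes,
        metadata={"owner": "delivery-eta-team", "description": "Mean dish preparation time"},
    )
)

rows = [{"prep_minutes": 10}, {"prep_minutes": 14}, {"prep_minutes": 12}]
value = registry.compute("avg_prep_minutes", rows)  # 12.0
```

Because the definition carries both the logic and the metadata, the platform has what it needs to schedule the computation, and other teams have what they need to discover and re-use the feature.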
As previously mentioned, ML models usually require transforming new, raw data into neat features. Feature stores orchestrate these data transformations based on the feature definitions discussed just above. Say the feature definition contains an aggregation, as in our Uber Eats example. The feature store will take this definition in, create a Spark streaming job that connects to a Kafka pipeline, and maintain these aggregations to feed them to the online store. This store only holds the latest values of these signals. These feature values will also be fed to the offline storage layer, in which historical feature values are kept for the purpose of model training and retraining.
There are various types of data transformations:
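The streaming aggregation described above can be approximated with a small in-memory stand-in. This sketch does not use Spark or Kafka; the class `RollingOrderCount`, the plain dict and list playing the roles of online and offline stores, and the assumption that events arrive in timestamp order are all simplifications for illustration.

```python
from collections import deque


class RollingOrderCount:
    """Maintains the number of orders per restaurant in a trailing time window."""

    def __init__(self, window_seconds, online_store, offline_log):
        self.window_seconds = window_seconds
        self.online_store = online_store  # dict: restaurant_id -> latest count
        self.offline_log = offline_log    # list of (timestamp, restaurant_id, count)
        self._events = {}                 # restaurant_id -> deque of order timestamps

    def on_order(self, restaurant_id, timestamp):
        """Process one order event (events are assumed to arrive in time order)."""
        events = self._events.setdefault(restaurant_id, deque())
        events.append(timestamp)
        # Evict orders that have fallen out of the trailing window.
        while events and events[0] <= timestamp - self.window_seconds:
            events.popleft()
        count = len(events)
        # Online store: overwrite, so only the latest value is kept.
        self.online_store[restaurant_id] = count
        # Offline store: append, so the full history is kept for training.
        self.offline_log.append((timestamp, restaurant_id, count))


online, offline = {}, []
agg = RollingOrderCount(window_seconds=1800, online_store=online, offline_log=offline)
for ts in (0, 600, 1200, 2000):
    agg.on_order("resto_1", ts)

print(online["resto_1"])  # 3 (the order at t=0 has left the 30-minute window)
print(len(offline))       # 4 (one historical row per processed event)
```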
The feature store serves feature data to models. This part is crucial as well, as it ensures that the right feature values are constantly fed to the model. It also ensures consistency of definitions between the features used to train a model and those used for online serving. Offline data is retrieved for the purpose of model training. For online serving, responses are served through a high-performance API backed by a low-latency database.
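One way to picture this consistency is a single feature definition shared by the training path and the serving path. The sketch below is an illustration under assumptions: the feature `busy_ratio`, the record fields, and the two builder functions are hypothetical names, not any product's API.

```python
def busy_ratio(orders_last_30m, typical_orders_per_30m):
    """Single feature definition shared by training and serving."""
    if typical_orders_per_30m == 0:
        return 0.0
    return orders_last_30m / typical_orders_per_30m


def build_training_rows(history):
    """Offline path: apply the shared definition to historical records."""
    return [
        {
            "busy_ratio": busy_ratio(h["orders_last_30m"], h["typical"]),
            "label": h["delivery_minutes"],
        }
        for h in history
    ]


def build_serving_vector(latest):
    """Online path: apply the same definition to the freshest values."""
    return {"busy_ratio": busy_ratio(latest["orders_last_30m"], latest["typical"])}


history = [
    {"orders_last_30m": 12, "typical": 8, "delivery_minutes": 41},
    {"orders_last_30m": 4, "typical": 8, "delivery_minutes": 22},
]
training_rows = build_training_rows(history)
serving_vector = build_serving_vector({"orders_last_30m": 12, "typical": 8})

# Identical inputs give identical feature values in both environments.
print(training_rows[0]["busy_ratio"] == serving_vector["busy_ratio"])  # True
```

If the two paths each re-implemented the calculation, a small divergence (a different window, a different default for missing values) would silently degrade predictions; sharing one definition removes that risk.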
Once these complex systems are finally up and running, it's important to monitor them. That is, keep the pulse on the health of the system to ensure data problems can be detected quickly and efficiently. Monitoring allows for immediate alerting when something goes wrong in an ML application.
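A monitoring check can be as simple as verifying that a feature is fresh and mostly non-null. The sketch below assumes two illustrative health criteria (staleness and null rate); the function name `check_feature_health` and the default thresholds are hypothetical.

```python
def check_feature_health(values, last_updated, now, max_age_seconds=3600, max_null_rate=0.1):
    """Return a list of human-readable alerts; an empty list means healthy."""
    alerts = []

    # Freshness: has the pipeline written a value recently enough?
    age = now - last_updated
    if age > max_age_seconds:
        alerts.append(f"stale: last update {age}s ago")

    # Completeness: is the share of missing values acceptable?
    if values:
        null_rate = sum(v is None for v in values) / len(values)
        if null_rate > max_null_rate:
            alerts.append(f"null rate {null_rate:.0%} exceeds {max_null_rate:.0%}")
    else:
        alerts.append("no values received")

    return alerts


healthy = check_feature_health([3, 4, 5, 6], last_updated=9_500, now=10_000)
broken = check_feature_health([3, None, None, 6], last_updated=1_000, now=10_000)

print(healthy)  # []
print(broken)   # ['stale: last update 9000s ago', 'null rate 50% exceeds 10%']
```

In practice the returned alerts would be routed to whatever alerting channel the team already uses.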
The automated development of these feature pipelines is made possible through data abstractions that make it easy to build feature pipelines across environments.
Feature stores solve most of the problems mentioned in the first section:
✅ They allow data scientists to finally become end-to-end owners of the ML process. They can productionize ML models just by selecting features in the feature store. Productionization of the ML model doesn't have to be handled by an engineering team.
✅ They help achieve consistency between training and serving data. They do so by ensuring that data transformations are consistently applied across different environments.
✅ They bring economies of scale to ML organizations. Once registered in the feature store, features are available for immediate re-use by other models across the organization. If the feature "Outside temperature during the past five days" is useful for a given model, it might also serve to predict something else. New ML projects can simply bootstrap with a library of production-ready features. You no longer need an army of engineers each time you want to make slightly complex predictions.
✅ They allow for easy monitoring of the health of feature pipelines in production.
You're probably wondering whether you need a feature store platform in your organization. Before getting a feature store, there are two things you need to figure out first:
Feature stores have only emerged recently. The concept was introduced by Uber in 2017 with the launch of their Michelangelo platform. If you have a few minutes to spare, have a look at how the Michelangelo platform was built and the motivations behind it; this can only deepen your understanding of the matter.
After this, it only took a year until Hopsworks and Feast, two open-source feature store projects, were launched. Tech giants such as Google (Vertex AI), Amazon (SageMaker), and Databricks were also quick to propose their own fully-managed feature store platforms. Here's a quick timeline to understand the evolution of the tools: