The feature store is becoming a fully-fledged component of the modern data stack and has attracted a lot of interest over the past three years. This relatively new category of tools promises to make the productionization of machine learning models effortless, compared to the hurdle it currently is. This is an extremely ambitious promise, as an often-quoted figure holds that 85% of ML projects never make it to production. Although this statistic is dubious, because the figure is extremely hard to estimate, it reflects a tough reality: the majority of ML projects never see the light of day. Feature stores promise to change that. How so? By providing data scientists with a catalog of neat, production-ready features that make ML models more efficient and scalable.
This article seeks to explain how feature stores manage to deliver on this huge promise. This task first requires us to understand what features and models are, and what makes it so hard to productionize ML models at scale. We'll then be equipped to dive into the technicalities of how feature stores solve the problem.
Here’s a quick overview:
I - Models and features: what are they?
- Defining models & features through the example of Uber Eats
- The challenges in operationalizing ML models
II - What is a feature store?
- The 5 components of a feature store
- What do feature stores really achieve?
- Do you need a feature store?
Models & features: what are they?
To avoid my explanations feeling dry and abstract, I'll use a real-life example of an ML model that will carry us throughout the article and help us understand features, models, and the functioning of a feature store platform.
A nice example of a machine learning model that most of us are familiar with is the model powering the Uber Eats application. The ultimate goal of this model is to predict accurately how long it will take for your meal to be prepared and brought to your door, answering the fundamental question of human existence: "when are we eating?". That's an ambitious endeavour, as Uber takes on the task of answering this question for thousands of users at the same time, for different locations, different restaurants, and different drivers. This model is especially interesting as it gave birth to the very concept of the feature store. I'll explain how later. For now, let's dive into the fun part.
When building the ML model used to power the Uber Eats app, the first step is to isolate the signals that can help predict delivery time. A signal is simply an element that influences, or at least helps predict, the outcome of interest. In our case, the signals include the following: how long it typically takes for the restaurant to prepare a dish, an indication of how busy the restaurant is, information about the weather, etc. These signals are what we call features, and analyzing feature values over a certain period of time helps us predict the delivery time of our food. Indeed, if the restaurant is extremely busy and it's raining, it will take much longer to deliver a meal than when the restaurant is empty and the sun is shining. Signals come in all shapes and sizes, and we shall explore the nuances between different kinds of signals in the second part.
A machine learning model is a system that takes all these signals/features into account to make the most accurate prediction possible about delivery time. Once a data scientist has chosen the appropriate model, the model is trained. That is, it works through a variety of historical examples and learns from its past predictions.
The dataset containing all the historical values is called the training set. It allows the model to look back at its past predictions, learn from its mistakes and adjust its predictions accordingly. Running all our examples through a training algorithm gives a decision-making system that can act on an input and output a predicted delivery time. The model depends entirely on these "signals" we mentioned above. It needs that data in a neat and structured format to be able to interpret it and output a decision.
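To make the idea of training more concrete, here is a minimal sketch in Python. It assumes a made-up training set with a single feature (the number of orders the restaurant received in the previous 30 minutes) and fits a straight line by ordinary least squares. The numbers, the function names, and the choice of a linear model are all illustrative assumptions; a real delivery-time model would use many features and a richer algorithm.

```python
# Hypothetical training set: (orders in the previous 30 min, minutes until delivery).
# The values are invented for illustration only.
training_set = [(2, 24.0), (5, 30.0), (8, 36.0), (11, 42.0)]


def fit_line(rows):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    variance = sum((x - mean_x) ** 2 for x, _ in rows)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# "Training" produces the model artifact: here, just two numbers.
slope, intercept = fit_line(training_set)


def predict_delivery_minutes(orders_last_30_min):
    """Act on a fresh feature value and output a predicted delivery time."""
    return intercept + slope * orders_last_30_min


print(predict_delivery_minutes(10))
```

The point of the sketch is the split it shows: the fitted numbers are the decision-making system, while the value passed to `predict_delivery_minutes` is the data the model depends on at prediction time.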
The issue is that raw data is rarely in a neat and structured format. Special data pipelines are thus set up to put the data into this specific format. We call that feature engineering. For example, to understand how busy a restaurant is, you need to aggregate the number of orders over a given window of time. You need aggregated data to be fed into your model, not just raw data. Raw data on its own tells the model very little.
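Here is a small sketch of what that aggregation step might look like. The event format, the restaurant names, and the `orders_in_window` helper are assumptions made for illustration; they are not taken from any real pipeline.

```python
from datetime import datetime, timedelta

# Hypothetical raw order events: (restaurant_id, time the order was placed).
raw_orders = [
    ("burger_place", datetime(2024, 1, 1, 12, 5)),
    ("burger_place", datetime(2024, 1, 1, 12, 10)),
    ("burger_place", datetime(2024, 1, 1, 12, 25)),
    ("sushi_bar", datetime(2024, 1, 1, 12, 20)),
    ("burger_place", datetime(2024, 1, 1, 11, 0)),  # outside the window
]


def orders_in_window(events, restaurant_id, now, window=timedelta(minutes=30)):
    """Count a restaurant's orders placed within `window` before `now`."""
    start = now - window
    return sum(
        1
        for rid, placed_at in events
        if rid == restaurant_id and start < placed_at <= now
    )


now = datetime(2024, 1, 1, 12, 30)
busy_feature = orders_in_window(raw_orders, "burger_place", now)
print(busy_feature)
```

The raw rows say nothing about busyness by themselves; the single number produced by the aggregation is what the model can actually consume.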
To make predictions, the model uses two things:
- The training dataset: the historical feature values described above
- The current state of those features: that is, the values of these features at time T, when you order your meal. At time T, the model needs data about what the weather is right now and how busy the restaurant is right now (not 6 months ago, because that data is useless for predicting how fast your rider will drive in the next 20 min)
In short, a machine learning model is two things: a model artifact (aka the decision-making system), and these data dependencies that are always changing and that need to be maintained.
The challenges in operationalizing ML models
You can probably start to sense the complexity of ML models from the previous section. Going deeper into our example allows us to pinpoint exactly where the complexity comes from, so we can explain how feature stores address it.
First challenge: managing different refresh timeframes
Coming back to our Uber Eats example, let's talk about two signals that could become predictive features for our model and what they entail in terms of data infrastructure:
- How long it typically takes for the restaurant to prepare a dish. This signal is pretty straightforward. You can calculate it once a week or once a month and make sure it is always available in your data warehouse to be passed into the model. The time taken to make a burger usually doesn't vary that much, so you're safe. This type of feature is called an offline feature because it is mainly used by batch processes. Usually, these features are calculated via frameworks such as Spark or by simply running SQL queries against a database.
- How many orders the restaurant has, and how that compares to the usual number of orders it gets over 30 min. This gives an indication of how busy the restaurant is. If it is busier than usual, you can expect your meal to take longer to arrive. This signal is much trickier. It can't be computed in the data warehouse, as the data warehouse doesn't contain the data from the latest 30 min and it's not meant for fast operations anyway. So what typically happens is that the order events are published to a Kafka stream, and streaming aggregation jobs read this Kafka stream and aggregate the data over a window of 15-30 min. The window needs to be kept short so that when you need to predict when your meal will be prepared, the signal is available immediately. The window is continuously updated in real time. This type of feature is called an online feature because it needs to be calculated extremely fast and is often served at millisecond latency. The calculations needed are more challenging than for offline features, as they require fast computation and fast access to data.
The first operational challenge thus comes from managing different refresh timeframes according to the kind of feature you're dealing with. Online and offline features can't be dealt with the same way: they have to be stored in different places and treated differently.
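To give a feel for the online side, here is a minimal in-memory sketch of a sliding-window counter. It stands in for the streaming aggregation job; it is not Kafka or Spark code. The class name, the use of plain numeric timestamps in seconds, and the assumption that events arrive in time order are all simplifications of mine.

```python
from collections import deque


class SlidingWindowCounter:
    """Counts events seen in the last `window_seconds` (events arrive in time order)."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def add(self, timestamp):
        """Record one order event."""
        self._timestamps.append(timestamp)

    def count(self, now):
        """Evict events that fell out of the window, then return the count."""
        cutoff = now - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps)


def busyness(current_orders, usual_orders):
    """Ratio of current orders to the restaurant's usual volume (offline feature)."""
    return current_orders / usual_orders if usual_orders > 0 else 0.0


counter = SlidingWindowCounter(window_seconds=30 * 60)
for order_time in (0, 600, 1500):  # seconds since some reference point
    counter.add(order_time)

current = counter.count(now=1700)
print(current, busyness(current, usual_orders=2))
```

Notice that the ratio combines both kinds of features: the live count comes from the streaming side, while the "usual" volume is the sort of value you would compute in batch once a week.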
Second challenge: an intricate productionization workflow
The pathway to productionize ML models is the following:
- Data scientists own the beginning of the ML workflow. They begin by figuring out the data needed to build the features, and then export this data to a local Python environment. They clean the data and transform it to extract the relevant signals for the model (feature engineering). The model is then trained and evaluated to ensure it fares well when it comes to predicting new values. Once the model is deemed performant, the prototype is completely handed off to the engineering team.
- Data engineers take the Python notebook and take care of "productionizing" the ML model. That is, they build the data pipelines necessary to feed the model from scratch. They write the data pipelines for serving the model in production and build production services and monitoring. They basically build a custom engineering micro-service to sustain these predictions. This workflow is repeated for every feature, and for every model an organization wants to productionize.
This is probably giving you a headache already, and we're just talking about two signals. Complex ML models can include dozens of signals. This gives rise to an extremely complicated nest of data pipelines and services built to support and process all the different types of data, translate it and extract signals from it.
Deploying machine learning is thus not accessible to organizations that can't afford a big engineering team. Building these pipelines and keeping them running demands deep technical expertise and a huge engineering effort. We have seen previously that the model makes predictions based on historical data (batch data), as well as current feature values (real-time data). The data needs to be put in a format that's consumable by the model and delivered to the model in real time. Without an engineering team to rebuild the prototype in the operational environment, the data science project never sees the light of day. Even with a big engineering team, the cost per ML use case is simply too high, as each project requires rebuilding complex data pipelines.
Problems encountered in ML model productionization:
❌ The machine learning lifecycle takes forever, as it demands the synchronization of the data science and data engineering teams.
❌ Productionization of ML models comes at huge costs, as it demands that complex data pipelines be built from scratch each time we want to put a model in production.
What is a feature store?
Now that we have more context surrounding machine learning applications, let's turn to feature stores and how they can solve the aforementioned problems.
A feature store is a platform where all features are centralized and accessible to everyone, allowing employees to re-use them across different projects. More precisely, a feature store is an ML-specific data platform that:
- Runs data pipelines that transform raw data into feature values. Think aggregating orders to get the feature value "number of orders at a restaurant over the past 30 minutes"
- Stores and manages the feature data itself (in an online or offline setting)
- Serves feature data consistently for model training and inference purposes
Features in the feature store are recalculated and refreshed on a regular schedule (daily for many batch features, continuously for streaming ones), to ensure the model remains accurate.
There are 5 key components to a feature store: Storage, Transformation, Monitoring, Serving, and Cataloging. You can find them explained in more detail in a great article written by Tecton, a feature store platform provider.
Basically, the feature store is a data warehouse of features for machine learning: a central vault for storing documented and curated features that can be used across many different models, which makes feature management easy. Architecturally, it differs from a traditional data warehouse in that it is a dual database.
- The first database stores large volumes of offline features with the aim of (1) creating training datasets and (2) performing offline model scoring.
- The second database serves online features at low latency to online applications, once the model has been put in production. This online feature store caters to the need to deliver the freshest feature values to the predictive model.
These two databases support the requirements of different feature serving systems, according to the model needs.
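The dual-database idea can be sketched with two plain Python containers. This is a toy model under my own assumptions (a list for the offline history, a dictionary for the online store, integer timestamps); real feature stores back these with a warehouse or data lake on one side and a low-latency key-value database on the other.

```python
class DualFeatureStore:
    """Toy dual store: full history offline, latest value per key online."""

    def __init__(self):
        self.offline = []  # rows of (entity_id, feature_name, timestamp, value)
        self.online = {}   # (entity_id, feature_name) -> (timestamp, value)

    def write(self, entity_id, feature_name, timestamp, value):
        # The offline store keeps every value ever written.
        self.offline.append((entity_id, feature_name, timestamp, value))
        # The online store only keeps the freshest value.
        key = (entity_id, feature_name)
        current = self.online.get(key)
        if current is None or timestamp >= current[0]:
            self.online[key] = (timestamp, value)

    def get_online(self, entity_id, feature_name):
        """Single-key lookup, as a model would do at prediction time."""
        entry = self.online.get((entity_id, feature_name))
        return None if entry is None else entry[1]

    def get_history(self, entity_id, feature_name):
        """Full history, as used to build training datasets."""
        return [
            (ts, value)
            for eid, name, ts, value in self.offline
            if eid == entity_id and name == feature_name
        ]


store = DualFeatureStore()
for ts, orders in [(1, 4), (2, 7), (3, 5)]:
    store.write("burger_place", "orders_last_30_min", ts, orders)

print(store.get_online("burger_place", "orders_last_30_min"))
print(store.get_history("burger_place", "orders_last_30_min"))
```

The same write feeds both sides, which is what lets the training path and the serving path stay in sync.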
The main component of a feature store is the feature catalog or feature registry. This is the repository of neat features, available for all to develop and use. At the training stage, data scientists define features in the feature store's registry. A feature is comprised of:
- A feature definition: this might be a SQL query, or the specification of a data transformation (such as an aggregation, as seen in our Uber Eats example)
- Metadata: configuration code specifying how we want the signal to be backfilled, who owns the feature, and whether it is a production feature (one that can be trusted) or an experimental feature (one that should be used with more care).
On the basis of these definitions and metadata, the feature store schedules and configures data ingestion, transformation, and storage. That is, it uses the definitions to write new data pipelines in an automated manner.
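As an illustration, a registry entry could be modeled as below. The field names, the `FeatureRegistry` class, and the SQL strings (including the table names) are hypothetical; each feature store product has its own definition format, so treat this as a sketch of the idea rather than any vendor's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureDefinition:
    """One registry entry: a definition plus its metadata (hypothetical schema)."""

    name: str
    transformation: str           # e.g. a SQL query or an aggregation spec
    owner: str
    backfill_days: int            # how far back to compute historical values
    status: str = "experimental"  # or "production"


class FeatureRegistry:
    """In-memory catalog of feature definitions."""

    def __init__(self):
        self._features = {}

    def register(self, feature):
        if feature.name in self._features:
            raise ValueError(f"feature already registered: {feature.name}")
        self._features[feature.name] = feature

    def get(self, name):
        return self._features[name]

    def production_features(self):
        """Names of the features that are trusted for production use."""
        return sorted(
            f.name for f in self._features.values() if f.status == "production"
        )


registry = FeatureRegistry()
registry.register(FeatureDefinition(
    name="avg_prep_minutes",
    transformation="SELECT restaurant_id, AVG(prep_minutes) FROM orders GROUP BY 1",
    owner="delivery-ml",
    backfill_days=90,
    status="production",
))
registry.register(FeatureDefinition(
    name="orders_last_30_min",
    transformation="COUNT(*) OVER 30 MINUTE WINDOW ON order_events",
    owner="delivery-ml",
    backfill_days=7,
))

print(registry.production_features())
```

Because the entry is declarative, a platform can read it and decide on its own which pipelines to schedule, which is the automation the paragraph above describes.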
As previously mentioned, ML models usually require transforming new, raw data into neat features. Feature stores orchestrate these data transformations based on the feature definition discussed just above. Say the feature definition contains an aggregation, as in our Uber Eats example. The feature store will take this definition in, create a streaming Spark job that connects to a Kafka stream, and maintain these aggregations to feed them to the online store. This store only holds the latest values of these signals. These feature values will also be fed to the offline storage layer, in which historical feature values are kept for the purpose of model training & retraining.
There are various types of data transformations:
- Batch transformations are applied only to data at rest, within the data warehouse, data lake or database. This kind of transformation would be applied to obtain the average time it takes to prepare a meal in a restaurant.
- Streaming transformations are applied to data in motion, i.e. streaming sources. That is, data is transformed as it moves through the pipelines. These transformations are usually performed on streams from platforms such as Kafka, Kinesis, or Pub/Sub. This kind of transformation can be used to obtain the number of orders for a given restaurant in the past 30 min.
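The contrast between the two styles can be sketched in a few lines. Both functions below are illustrative assumptions: the batch one sees the whole dataset at once, while the streaming one updates its result event by event (for brevity it keeps a running total and leaves out the 30-minute window).

```python
from collections import defaultdict


def average_prep_minutes(completed_orders):
    """Batch transformation: one pass over data at rest."""
    totals = defaultdict(lambda: [0.0, 0])  # restaurant_id -> [sum, count]
    for restaurant_id, prep_minutes in completed_orders:
        totals[restaurant_id][0] += prep_minutes
        totals[restaurant_id][1] += 1
    return {rid: total / count for rid, (total, count) in totals.items()}


def streaming_order_counts(order_events):
    """Streaming transformation: emit an updated count as each event arrives."""
    counts = defaultdict(int)
    for restaurant_id in order_events:
        counts[restaurant_id] += 1
        yield restaurant_id, counts[restaurant_id]


# Batch: the full history is available up front.
history = [("burger_place", 10), ("burger_place", 20), ("sushi_bar", 6)]
print(average_prep_minutes(history))

# Streaming: results are produced incrementally, one event at a time.
for update in streaming_order_counts(["burger_place", "sushi_bar", "burger_place"]):
    print(update)
```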
The feature store serves feature data to models. This part is crucial as well, as it ensures that the right feature values are constantly fed to the model. It also ensures consistency of definitions between the features used to train a model and those used for online serving. Offline data is retrieved for the purpose of model training. For online serving, responses are served through a high-performance API backed by a low-latency database.
Once these complex systems are finally up and running, it's important to monitor them. That is, keep a finger on the pulse of the system to ensure data problems can be detected quickly and efficiently. Monitoring allows for immediate alerting when something goes wrong in an ML application.
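One simple way to picture that consistency is a single transformation function shared by the training path and the serving path. The function and field names below are assumptions of mine, and real platforms enforce this through their definition layer rather than a shared helper, but the principle is the same.

```python
def busyness_ratio(orders_last_30_min, usual_orders_per_30_min):
    """The one and only definition of the feature, used everywhere."""
    if usual_orders_per_30_min <= 0:
        return 0.0
    return orders_last_30_min / usual_orders_per_30_min


def build_training_rows(history):
    """Offline path: turn historical records into training examples."""
    return [
        {
            "busyness_ratio": busyness_ratio(row["orders"], row["usual"]),
            "delivery_minutes": row["delivery_minutes"],
        }
        for row in history
    ]


def build_online_vector(live):
    """Online path: turn the current state into a feature vector for inference."""
    return {"busyness_ratio": busyness_ratio(live["orders"], live["usual"])}


history = [{"orders": 9, "usual": 6, "delivery_minutes": 41.0}]
live = {"orders": 9, "usual": 6}

training_rows = build_training_rows(history)
online_vector = build_online_vector(live)
print(training_rows[0]["busyness_ratio"] == online_vector["busyness_ratio"])
```

If the two paths each had their own copy of the formula, a small divergence (a different window, a different rounding rule) would make the model see different values in production than it saw during training.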
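A basic health check of this kind is a freshness test: flag any feature whose last update is older than what its refresh timeframe allows. The sketch below is a hypothetical helper with made-up thresholds and timestamps in seconds, not the monitoring API of any particular product.

```python
def find_stale_features(last_updated, max_age_seconds, now):
    """Return the names of features whose latest value is too old."""
    stale = []
    for name, updated_at in last_updated.items():
        age = now - updated_at
        if age > max_age_seconds[name]:
            stale.append(name)
    return sorted(stale)


now = 10_000
last_updated = {
    "orders_last_30_min": 9_940,  # online feature, updated 60 s ago
    "avg_prep_minutes": 1_000,    # offline feature, updated 9,000 s ago
}
max_age_seconds = {
    "orders_last_30_min": 30,              # must be near real time
    "avg_prep_minutes": 7 * 24 * 60 * 60,  # weekly refresh is fine
}

print(find_stale_features(last_updated, max_age_seconds, now))
```

Here the online feature is flagged even though it was updated only a minute ago, while the much older offline feature is still considered healthy. The acceptable age depends on the kind of feature.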
The automated development of these feature pipelines is made possible through data abstractions that make it easy to build feature pipelines across environments.
What do feature stores really achieve?
Feature stores solve most of the problems mentioned in the first section:
✅ They allow data scientists to finally become end-to-end owners of the ML process. They can productionize ML models just by selecting features in the feature store. Productionization of the ML model doesn't have to be handled by an engineering team.
✅ They help achieve consistency between training and serving data. They do so by ensuring that data transformations are consistently applied across different environments.
✅ They bring economies of scale to ML organizations. Once registered in the feature store, features are available for immediate re-use by other models across the organization. If the feature "Outside temperature during the past five days" is useful for a given model, it might also serve to predict something else. New ML projects can simply bootstrap from a library of production-ready features. You no longer need an army of engineers each time you want to make slightly complex predictions.
✅ They allow for easy monitoring of the health of feature pipelines in production.
Do you need a feature store?
You're probably wondering whether you need a feature store platform in your organization. Before getting one, there are two things you need to figure out:
- Does your product depend on machine learning? Some ML models are simple to deploy. With complex ML products such as Uber Eats or Amazon's live recommendation service, the whole product depends on the accuracy of the model. This motivates the need to bring in additional signals: the more signals, the more context you get, and the better your predictions will be. The result is complex pipelines and high costs, in which case you probably need a feature store.
- Who is the consumer of the ML use case? The success and the cost of an ML use case differ greatly depending on who consumes it. Usually, if its purpose is internal, analytical consumption (e.g. sales forecasting), it tends to be far less costly than operational ML. Data scientists are the end-to-end owners of internal ML projects, and the stakes are lower. For operational ML, you need everything to work perfectly, because your product entirely depends on the performance of your model. There are SLAs, more engineering requirements, and huge business costs if things go wrong for these use cases. If you're engaging in operational ML, you probably need a feature store more than if your use cases remain internal.
The evolution of feature store tools
Feature stores have only emerged recently. The concept was introduced by Uber in 2017 with the launch of their Michelangelo platform. If you have a few minutes to spare, have a look at how the Michelangelo platform was built and the motivations behind it. This can only deepen your understanding of the matter.
After this, it only took a year until Hopsworks and Feast, two open-source feature store projects, were launched. Tech giants such as Google (Vertex AI), Amazon (SageMaker), and Databricks were also quick to propose their own fully-managed feature store platforms.
[Figure: timeline of the evolution of feature store tools]