Deciding whether to adopt a feature store

Who should decide whether it's worth the investment for a team or company to build a feature store? Feature stores offer benefits for both data science and infrastructure/platform teams, so it makes sense to include both groups in the discussion (see Data science factors and Platform factors below).

In large teams and organizations, platform engineers are often the ones who recommend adopting a feature store; in smaller teams and organizations, data scientists are usually the first to see the benefits. Buy-in from both groups is critical to implementing a solution like this successfully.

Data platform teams are looking for ways to provide most of the functionality the data science team needs, in a form that is easy to manage and control, ideally through self-service. Data scientists are looking for a platform where they can find and share features and abstract away feature access during training and inference.

The following content highlights decision points most teams reach when considering whether to implement a feature store. We'll start by defining some terms that are used frequently when discussing feature stores.

Feature store terminology

Feature transformation: refers to the process of converting raw data into features. Transformation generally requires building data pipelines to ingest both historical and real-time data.
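As a minimal sketch of a batch transformation (the event schema and feature names below are made up for illustration, not taken from any particular feature store), raw purchase events could be aggregated into per-customer features with pandas:

```python
import pandas as pd

# Hypothetical raw event data ingested from an upstream source.
raw_events = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "event_timestamp": pd.to_datetime([
        "2023-01-01", "2023-01-03", "2023-01-02", "2023-01-04", "2023-01-05"
    ]),
    "purchase_amount": [20.0, 35.0, 15.0, 50.0, 10.0],
})

# Transform raw events into per-customer features.
features = (
    raw_events
    .groupby("customer_id")
    .agg(
        purchase_count=("purchase_amount", "count"),
        total_spend=("purchase_amount", "sum"),
        last_purchase=("event_timestamp", "max"),
    )
    .reset_index()
)
print(features)
```

In practice the same kind of logic runs inside scheduled batch pipelines or streaming jobs rather than an ad hoc script.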

Feature registry: refers to a location where all features used are defined and registered. By using a registry, data scientists can search, find, and reuse features in their models. Feature definitions include information like type, source, and other relevant metadata.
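Registry schemas differ between products; as an illustrative sketch only (the fields below are hypothetical, not any specific feature store's format), a registered feature definition might capture metadata like this:

```python
from dataclasses import dataclass, field

@dataclass
class FeatureDefinition:
    """Hypothetical registry entry describing a single feature."""
    name: str                 # e.g., "total_spend_30d"
    entity: str               # entity the feature is keyed on, e.g., "customer_id"
    dtype: str                # value type, e.g., "float"
    source: str               # where the raw data comes from, e.g., "warehouse.purchases"
    description: str = ""     # free-text documentation that aids discovery
    tags: dict = field(default_factory=dict)  # owner, domain, compliance flags, etc.

# A toy registry: a searchable mapping from feature name to its definition.
registry = {
    "total_spend_30d": FeatureDefinition(
        name="total_spend_30d",
        entity="customer_id",
        dtype="float",
        source="warehouse.purchases",
        description="Sum of purchase amounts over the trailing 30 days.",
        tags={"owner": "growth-team"},
    )
}
```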

Feature serving: refers to serving feature values both for batch operations like training (where higher latency is acceptable) and for low-latency inference. It abstracts away the complexity of querying feature values while providing functionality like point-in-time joins.

Observation data: refers to the raw input for the data being queried in the feature serving layer. At a minimum, observation data is composed of the IDs of the entities of interest and timestamps; both serve as join keys. This concept is called an entity data frame in some other feature stores.

Point-in-time joins (PITJ): for time-series data, it is important to ensure that the data used for training is not mixed with values that were only ingested later. Mixing them creates feature leakage (also known as label leakage). A point-in-time join ensures that the data served corresponds to the feature values that were valid at (or closest before) each observation timestamp.
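A point-in-time join can be sketched with pandas' merge_asof, which, for each observation row, selects the most recent feature value at or before the observation timestamp (the entity and feature names are made up for illustration):

```python
import pandas as pd

# Observation data: entity IDs plus the timestamps at which labels were recorded.
observations = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "event_timestamp": pd.to_datetime(["2023-01-02", "2023-01-06", "2023-01-03"]),
})

# Feature values, with the time at which each value became known.
feature_values = pd.DataFrame({
    "customer_id": [1, 1, 2, 2],
    "event_timestamp": pd.to_datetime(["2023-01-01", "2023-01-05", "2023-01-01", "2023-01-04"]),
    "total_spend": [20.0, 55.0, 15.0, 65.0],
})

# Point-in-time correct join: for each observation, take the latest feature
# value at or before the observation timestamp, avoiding feature leakage.
training_df = pd.merge_asof(
    observations.sort_values("event_timestamp"),
    feature_values.sort_values("event_timestamp"),
    on="event_timestamp",
    by="customer_id",
    direction="backward",
)
print(training_df)
```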

Data science factors

Data scientists should review the following questions to help decide whether a feature store is worth the investment.

For each decision point, the comparison covers the situation without a feature store and with one.

Decision point: Do your data scientists have problems finding available features for reuse?

Without a feature store: Without a centralized repository for features, data scientists often jump directly to creating feature transformation pipelines. These pipelines increase the complexity of the platform as the number of supported use cases grows, and previously acquired domain knowledge loses its value because it isn't captured anywhere reusable.

With a feature store: A key component of a feature store is the feature registry, a module that works as a centralized repository for all features created within an organization. It makes discovering and managing features easier. A feature registry contains information about feature definitions and their sources and, depending on the feature store, might also include the transformation code and lineage information. Ideally, this component is searchable, easy to understand, and accessible from a centralized endpoint.

Decision point: Do you want to share your features with business users?

Without a feature store: Information about features is scattered across documentation and code, so it isn't easy to share with the business users who provide domain knowledge about which features to use, and it can quickly become outdated.

With a feature store: The feature store acts as a single source of truth, with a standardized and structured way of viewing information about features.

Decision point: Do many of your features need to be served or computed in real time?

Without a feature store: Clients request predictions without supplying the feature values, and there's no way to inject those values into the requests, so the features must be computed in near real time. A typical example is a real-time recommendation engine that aggregates streamed events and generates new recommendations on demand.

With a feature store: Feature stores provide a component called an online store, which holds the latest value of each feature; writing the latest values into it is often called materialization. The values persist in a low-latency data store, so features are served to your model in near real time, and the feature store abstracts the materialization process away from you.

Decision point: Are many of your features time-dependent? Do your data scientists spend much time handling complex point-in-time joins?

Without a feature store: Data scientists need to spend time learning how to perform point-in-time correct joins. Constructing point-in-time correct data is time-consuming and error-prone.

With a feature store: A feature store has built-in point-in-time join capabilities, abstracting this complexity away from data scientists.

Decision point: Do your data scientists spend time writing complex queries or code to access the feature data?

Without a feature store: To retrieve feature values, data scientists must write code that is specific to each data source. The lack of abstraction means writing complex queries or code that adds little direct value to their work; sometimes the time goes into debugging infrastructure issues instead of higher-value activities like feature engineering or building the ML model itself.

With a feature store: Feature stores provide a serving layer that abstracts the infrastructure away. Data scientists can minimize the time spent dealing with infrastructure and source-specific syntax and focus on the features they need. This layer combines the feature registry and point-in-time joins, giving data scientists a powerful mechanism for accessing data without knowing the underlying infrastructure (a minimal sketch of this abstraction follows this comparison).
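APIs vary between feature stores, so the following is only a toy sketch of the serving-layer abstraction (the ToyFeatureStore class and its method names are hypothetical, not a specific product's SDK): one call builds point-in-time correct training data, and another returns the latest materialized values for online inference.

```python
import pandas as pd

class ToyFeatureStore:
    """Illustrative serving layer: not a real product, just a sketch of the
    abstraction a feature store exposes to data scientists."""

    def __init__(self, feature_values: pd.DataFrame, entity_key: str):
        self.feature_values = feature_values.sort_values("event_timestamp")
        self.entity_key = entity_key

    def get_historical_features(self, observations: pd.DataFrame) -> pd.DataFrame:
        # Point-in-time correct join for building training datasets.
        return pd.merge_asof(
            observations.sort_values("event_timestamp"),
            self.feature_values,
            on="event_timestamp",
            by=self.entity_key,
            direction="backward",
        )

    def get_online_features(self, entity_ids: list) -> pd.DataFrame:
        # Latest materialized value per entity, as an online store would serve it.
        latest = self.feature_values.groupby(self.entity_key).tail(1)
        return latest[latest[self.entity_key].isin(entity_ids)]

# Usage: the data scientist works against these two calls instead of writing
# source-specific queries or join logic.
feature_values = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "event_timestamp": pd.to_datetime(["2023-01-01", "2023-01-05", "2023-01-04"]),
    "total_spend": [20.0, 55.0, 65.0],
})
store = ToyFeatureStore(feature_values, entity_key="customer_id")

observations = pd.DataFrame({
    "customer_id": [1, 2],
    "event_timestamp": pd.to_datetime(["2023-01-06", "2023-01-03"]),
})
print(store.get_historical_features(observations))   # training path
print(store.get_online_features([1]))                # inference path
```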

Platform factors

Infrastructure and data platform teams should consider the following questions when evaluating the pros and cons of building a feature store.

For each decision point, the comparison covers the situation without a feature store and with one.

Decision point: Do you maintain many duplicated feature transformation pipelines?

Without a feature store: When data scientists are unaware of existing features, they tend to create and manage duplicate pipelines to perform feature transformations. Maintaining all of these pipelines is expensive and demands a lot of attention from the platform team whenever changes or upgrades are made.

With a feature store: Because a feature store is built around shareability, the number of duplicated feature transformation pipelines should drop in favor of reusing existing features.

Decision point: Do you have to serve features for both training (batch or high-latency) and inference (low-latency)?

Without a feature store: Historical data (for training) and streaming data (for inference) are processed differently and require separate pipelines. These pipelines might use different methods and technologies, depending on how the data is ingested (batch versus streaming), and store the results in different data stores according to the latency requirements. All of these factors increase the complexity of maintaining the pipelines.

With a feature store: Most feature stores provide a feature computation module that stores data in a data store suited to each feature's requirements. Such a module can process batch data from ETL processes or historical data in a data warehouse, as well as streaming data from low-latency message bus systems. To keep this processing consistent, a feature store can provide a domain-specific language (DSL) for transformations that delivers the same results no matter how the data is ingested. The DSL allows a feature to be computed once for both training (batch) and inference (real-time) and reused across models; Feathr is one example of a feature store that takes this approach (see the sketch after this comparison).

Decision point: Do you need to keep your data systems compliant?

Without a feature store: Maintaining control of every training dataset used by your data science team can be daunting, especially as the number of use cases grows.

With a feature store: Some feature stores provide the governance tools an enterprise needs to exercise control over feature data. Access control, data quality checks, policies, auditing, and similar capabilities let the platform team control the data ingested and transformed in the feature store from one centralized place.
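The transformation DSL itself is product-specific (Feathr is one example), but the underlying idea can be sketched in plain Python with made-up feature and column names: define the transformation once and apply the same logic both to a historical batch and to individual streamed events.

```python
import pandas as pd

def spend_ratio(purchase_amount: float, account_limit: float) -> float:
    """Single transformation definition, reused by the batch and streaming paths."""
    return purchase_amount / account_limit if account_limit else 0.0

# Batch path: applied to historical data, e.g. for training or backfills.
historical = pd.DataFrame({
    "purchase_amount": [20.0, 50.0],
    "account_limit": [100.0, 200.0],
})
historical["spend_ratio"] = historical.apply(
    lambda row: spend_ratio(row["purchase_amount"], row["account_limit"]), axis=1
)

# Streaming path: the same function applied to each incoming event before the
# result is written to the online store.
event = {"purchase_amount": 30.0, "account_limit": 100.0}
event["spend_ratio"] = spend_ratio(event["purchase_amount"], event["account_limit"])

print(historical)
print(event)
```

Keeping one definition for both paths is what prevents training/serving skew; the feature store's DSL or SDK is what makes that single definition executable on both batch and streaming infrastructure.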