Understanding top-level entities in managed feature store

This document describes the top-level entities in the managed feature store.

Diagram depicting the main components of managed feature store.

For more information on the managed feature store, visit the What is managed feature store? resource.

Feature store

You can create and manage feature sets through a feature store. A feature set is a collection of features. You can optionally associate a materialization store (offline store connection) with a feature store, to precompute and persist the features on a regular basis. This approach can make feature retrieval during training or inference faster and more reliable.
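To illustrate why a materialization store can speed up retrieval, the following minimal sketch contrasts computing a feature from raw data with reading a precomputed value. It doesn't use the managed feature store API: the names `compute_total_spend` and `materialize`, and the in-memory dictionary that stands in for an offline store, are hypothetical.

```python
from collections import defaultdict

# Hypothetical raw source data: (customer_id, amount) transaction rows.
transactions = [
    ("c1", 20.0),
    ("c2", 5.0),
    ("c1", 30.0),
]


def compute_total_spend(rows):
    """Transformation: aggregate raw rows into one feature value per customer."""
    totals = defaultdict(float)
    for customer_id, amount in rows:
        totals[customer_id] += amount
    return dict(totals)


def materialize(rows):
    """Precompute the feature once and persist it (a dict stands in for the store)."""
    return compute_total_spend(rows)


# A scheduled job would rerun this step on a regular basis.
materialization_store = materialize(transactions)

# At training or inference time, retrieval is a lookup rather than a recomputation.
total_spend_c1 = materialization_store["c1"]
```

In this sketch, the cost of the aggregation is paid once per materialization run instead of once per retrieval.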

For more information about the configuration, visit the CLI (v2) feature store YAML schema resource.

Entities

An entity encapsulates the index columns for logical entities in an enterprise. Examples include an account entity and a customer entity. As a best practice, entities help enforce the use of the same index column definitions across the feature sets that use the same logical entities.

Entities are typically created once, and then reused across feature sets. Entities are versioned.
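The reuse described above can be sketched in plain Python. This is an illustration of the idea only, not the managed feature store SDK: the `IndexColumn`, `Entity`, and `FeatureSet` classes and the `accountID` column name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexColumn:
    name: str
    type: str


@dataclass(frozen=True)
class Entity:
    """A versioned definition of the index columns for one logical entity."""
    name: str
    version: str
    index_columns: tuple


@dataclass(frozen=True)
class FeatureSet:
    name: str
    entities: tuple

    def index_column_names(self):
        # Index columns come from the entities, not from each feature set.
        return [col.name for entity in self.entities for col in entity.index_columns]


# Create the entity once...
account = Entity("account", "1", (IndexColumn("accountID", "string"),))

# ...then reuse it across feature sets, so both share one index definition.
transactions = FeatureSet("transactions", (account,))
accounts = FeatureSet("accounts", (account,))
```

Because both feature sets reference the same entity version, their index columns can't drift apart.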

For more information about the configuration, visit the CLI (v2) feature entity YAML schema resource.

Feature set specification and asset

A feature set is a collection of features generated by the application of a transformation on source system data. Feature sets encapsulate a source, the transformation function, and the materialization settings. We currently support PySpark feature transformation code.

First, create a feature set specification. A feature set specification is a self-contained definition of a feature set that you can locally develop and test.

A feature set specification typically consists of these parameters:

  • source: The source or sources that this feature set maps to.
  • transformation (optional): The transformation logic, applied to the source data, that creates the features. Spark is the supported compute.
  • index_columns and timestamp_column: The names of the columns that represent the index columns and the timestamp column. These names are required when users join feature data with observation data.
  • materialization_settings (optional): Required if you want to cache the feature values in a materialization store for efficient retrieval.
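As a rough sketch of how these parameters fit together, the following example models a specification as a plain Python object that you can exercise locally. It's an assumption-laden illustration: the `FeatureSetSpec` class, its `to_rows` method, and the `add_is_large` function are hypothetical, and the transformation runs on Python dictionaries rather than PySpark DataFrames so that the example stays self-contained.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class FeatureSetSpec:
    source: list                      # stand-in for a source definition
    index_columns: list
    timestamp_column: str
    transformation: Optional[Callable[[list], list]] = None
    materialization_settings: Optional[dict] = None

    def to_rows(self):
        """Apply the optional transformation and check the required columns."""
        rows = self.transformation(self.source) if self.transformation else self.source
        required = set(self.index_columns) | {self.timestamp_column}
        for row in rows:
            missing = required - set(row)
            if missing:
                raise ValueError(f"row is missing columns: {sorted(missing)}")
        return rows


def add_is_large(rows):
    """Hypothetical transformation: flag transactions above 100."""
    return [{**row, "is_large": row["amount"] > 100} for row in rows]


spec = FeatureSetSpec(
    source=[
        {"accountID": "a1", "timestamp": "2024-01-01", "amount": 150.0},
        {"accountID": "a2", "timestamp": "2024-01-02", "amount": 20.0},
    ],
    index_columns=["accountID"],
    timestamp_column="timestamp",
    transformation=add_is_large,
)

# Local development and testing: produce the feature rows from the spec.
features = spec.to_rows()
```

The column check mirrors the requirement that the index and timestamp columns be present, since later joins depend on them.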

After you develop and test the feature set specification in your local or development environment, you can register it as a feature set asset with the feature store. The feature set asset provides managed capabilities, such as versioning and materialization.

For more information about the feature set YAML specification, visit the CLI (v2) feature set specification YAML schema resource.

Feature retrieval specification

A feature retrieval specification is a portable definition of the list of features associated with a model. It can help streamline machine learning model development and operationalization. It's typically an input to the training pipeline, where it helps generate the training data, and you can package it with the model. The inference step then uses it to look up the features. Because the same specification serves every phase of the machine learning lifecycle, you can minimize changes to your training and inference pipelines as you experiment and deploy.
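The portability idea can be sketched as follows. This is a simplified illustration, not the actual specification format: the dictionary layout, the feature names, and the `feature_names` helper are hypothetical, and JSON is used instead of YAML only to keep the example within the Python standard library.

```python
import json

# Hypothetical feature references: feature set name, version, and feature name.
retrieval_spec = {
    "features": [
        {"feature_set": "transactions", "version": "1", "name": "spend_7d"},
        {"feature_set": "accounts", "version": "1", "name": "account_age"},
    ]
}


def feature_names(spec):
    """Return the ordered list of features that the model expects."""
    return [feature["name"] for feature in spec["features"]]


# Training: the spec drives training data generation, then is stored
# alongside the model artifact.
training_columns = feature_names(retrieval_spec)
serialized = json.dumps(retrieval_spec)

# Inference: the packaged spec is loaded and drives the feature lookup.
loaded = json.loads(serialized)
inference_columns = feature_names(loaded)
```

Since both phases read the same definition, the training and inference feature lists stay identical without duplicated configuration.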

Use of a feature retrieval specification and the built-in feature retrieval component is optional. If you prefer, you can call the get_offline_features() API directly.
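The following sketch illustrates the kind of join between observation data and feature data that offline retrieval involves. It doesn't call get_offline_features() and makes no claim about its signature or internals. It assumes point-in-time semantics, where each observation receives the most recent feature value at or before its timestamp. The `point_in_time_join` function and all data values are hypothetical.

```python
from datetime import datetime

# Hypothetical feature data, keyed by an index column and a timestamp column.
feature_rows = [
    {"accountID": "a1", "timestamp": datetime(2024, 1, 1), "spend_7d": 10.0},
    {"accountID": "a1", "timestamp": datetime(2024, 1, 5), "spend_7d": 40.0},
    {"accountID": "a2", "timestamp": datetime(2024, 1, 3), "spend_7d": 7.0},
]

# Hypothetical observation data: the events to build training rows for.
observations = [
    {"accountID": "a1", "timestamp": datetime(2024, 1, 4)},
    {"accountID": "a2", "timestamp": datetime(2024, 1, 2)},
]


def point_in_time_join(observations, feature_rows, index_column, timestamp_column, feature):
    """Attach the latest feature value at or before each observation timestamp."""
    joined = []
    for obs in observations:
        candidates = [
            row for row in feature_rows
            if row[index_column] == obs[index_column]
            and row[timestamp_column] <= obs[timestamp_column]
        ]
        latest = max(candidates, key=lambda row: row[timestamp_column], default=None)
        # No earlier feature value exists: leave the feature empty.
        joined.append({**obs, feature: latest[feature] if latest else None})
    return joined


training_rows = point_in_time_join(
    observations, feature_rows, "accountID", "timestamp", "spend_7d"
)
```

The join relies on the index columns and the timestamp column, which is why a feature set specification must name them. Excluding feature values newer than the observation avoids leaking future data into training rows.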

For more information about the feature retrieval YAML specification, visit the CLI (v2) feature retrieval specification YAML schema resource.

Next steps