Tutorial 2: Experiment and train models by using features

2024-09-30

This tutorial series shows how features seamlessly integrate all phases of the machine learning lifecycle: prototyping, training, and operationalization.

The first tutorial showed how to create a feature set specification with custom transformations. It then showed how to use that feature set to generate training data, enable materialization, and perform a backfill. This tutorial shows how to experiment with features as a way to improve model performance.

In this tutorial, you learn how to:

Prototype a new accounts feature set specification, through use of existing precomputed values as features. Then, register the local feature set specification as a feature set in the feature store. This process differs from the first tutorial, where you created a feature set that had custom transformations.
Select features for the model from the transactions and accounts feature sets, and save them as a feature retrieval specification.
Run a training pipeline that uses the feature retrieval specification to train a new model. This pipeline uses the built-in feature retrieval component to generate the training data.

Prerequisites

Before you proceed with this tutorial, be sure to complete the first tutorial in the series.

Set up

Configure the Azure Machine Learning Spark notebook.

You can create a new notebook and execute the instructions in this tutorial step by step. You can also open and run the existing notebook named 2.Experiment-train-models-using-features.ipynb from the featurestore_sample/notebooks directory. You can choose sdk_only or sdk_and_cli. Keep this tutorial open and refer to it for documentation links and more explanation.
1. On the top menu, in the Compute dropdown list, select Serverless Spark Compute under Azure Machine Learning Serverless Spark.
2. Configure the session:
  1. When the toolbar displays Configure session, select it.
  2. On the Python packages tab, select Upload Conda file.
  3. Upload the conda.yml file that you uploaded in the first tutorial.
  4. As an option, you can increase the session time-out (idle time) to avoid frequent prerequisite reruns.

Start the Spark session.

# Run this cell to start the Spark session (any code cell starts the session). This can take around 10 minutes.
print("start spark session")

Set up the root directory for the samples.

import os

# please update the dir to ./Users/<your_user_alias> (or any custom directory you uploaded the samples to).
# You can find the name from the directory structure in the left nav
root_dir = "./Users/<your_user_alias>/featurestore_sample"

if os.path.isdir(root_dir):
    print("The folder exists.")
else:
    print("The folder does not exist. Please create or fix the path")

Set up the CLI.

Python SDK: Not applicable.

Azure CLI: Complete the following steps.

Install the Azure Machine Learning extension.
```
!az extension add --name ml
```
Authenticate.
```
!az login
```

Set the default subscription.

import os

subscription_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]

!az account set -s $subscription_id

Initialize the project workspace variables.

This is the current workspace, and the tutorial notebook runs in this resource.

### Initialize the MLClient of this project workspace
import os
from azure.ai.ml import MLClient
from azure.ai.ml.identity import AzureMLOnBehalfOfCredential

project_ws_sub_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]
project_ws_rg = os.environ["AZUREML_ARM_RESOURCEGROUP"]
project_ws_name = os.environ["AZUREML_ARM_WORKSPACE_NAME"]

# connect to the project workspace
ws_client = MLClient(
    AzureMLOnBehalfOfCredential(), project_ws_sub_id, project_ws_rg, project_ws_name
)
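
The cell above assumes that the notebook environment provides these three variables. If one is absent, the lookup fails with a bare KeyError. As a minimal sketch, a small helper can report every missing variable at once. The name require_env and the WORKSPACE_ENV_VARS list are hypothetical and not part of any SDK; only the variable names come from the cell above.

```python
import os


def require_env(names, environ=None):
    """Return the values of the given environment variables as a dict.

    Raises RuntimeError that lists every missing (or empty) name,
    instead of failing on the first missing key.
    """
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in names}


# Variable names taken from the workspace cell in this tutorial.
WORKSPACE_ENV_VARS = [
    "AZUREML_ARM_SUBSCRIPTION",
    "AZUREML_ARM_RESOURCEGROUP",
    "AZUREML_ARM_WORKSPACE_NAME",
]
```

For example, settings = require_env(WORKSPACE_ENV_VARS) returns a dictionary of the three values, or raises one error that names everything that is missing.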

Initialize the feature store variables.

Be sure to update the featurestore_name value to reflect what you created in the first tutorial.

from azure.ai.ml import MLClient
from azure.ai.ml.identity import AzureMLOnBehalfOfCredential

# feature store
featurestore_name = (
    "<FEATURESTORE_NAME>"  # use the same name from part #1 of the tutorial
)
featurestore_subscription_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]
featurestore_resource_group_name = os.environ["AZUREML_ARM_RESOURCEGROUP"]

# feature store ml client
fs_client = MLClient(
    AzureMLOnBehalfOfCredential(),
    featurestore_subscription_id,
    featurestore_resource_group_name,
    featurestore_name,
)

Initialize the feature store consumption client.

# feature store client
from azureml.featurestore import FeatureStoreClient
from azure.ai.ml.identity import AzureMLOnBehalfOfCredential

featurestore = FeatureStoreClient(
    credential=AzureMLOnBehalfOfCredential(),
    subscription_id=featurestore_subscription_id,
    resource_group_name=featurestore_resource_group_name,
    name=featurestore_name,
)

Create a compute cluster named cpu-cluster-fs in the project workspace.

You need this compute cluster when you run the training/batch inference jobs.

from azure.ai.ml.entities import AmlCompute

cluster_basic = AmlCompute(
    name="cpu-cluster-fs",
    type="amlcompute",
    size="STANDARD_F4S_V2",  # you can replace it with other supported VM SKUs
    location=ws_client.workspaces.get(ws_client.workspace_name).location,
    min_instances=0,
    max_instances=1,
    idle_time_before_scale_down=360,
)
ws_client.begin_create_or_update(cluster_basic).result()

Create the accounts feature set in a local environment

In the first tutorial, you created a transactions feature set that had custom transformations. Here, you create an accounts feature set that uses precomputed values.

To onboard precomputed features, you can create a feature set specification without writing any transformation code. You use a feature set specification to develop and test a feature set in a fully local development environment.

You don't need to connect to a feature store. In this procedure, you create the feature set specification locally, and then sample the values from it. To benefit from the capabilities of managed feature store, you must use a feature asset definition to register the feature set specification with a feature store. Later steps in this tutorial provide more details.

Explore the source data for the accounts.

Note

This notebook uses sample data hosted in a publicly accessible blob container. Only a wasbs driver can read it in Spark. When you create feature sets from your own source data, host that data in an Azure Data Lake Storage Gen2 account, and use an abfss driver in the data path.
```
accounts_data_path = "wasbs://data@azuremlexampledata.blob.core.windows.net/feature-store-prp/datasources/accounts-precalculated/*.parquet"
accounts_df = spark.read.parquet(accounts_data_path)

display(accounts_df.head(5))
```

Create the accounts feature set specification locally, from these precomputed features.

You don't need any transformation code here, because you reference precomputed features.

from azureml.featurestore import create_feature_set_spec, FeatureSetSpec
from azureml.featurestore.contracts import (
    DateTimeOffset,
    Column,
    ColumnType,
    SourceType,
    TimestampColumn,
)
from azureml.featurestore.feature_source import ParquetFeatureSource


accounts_featureset_spec = create_feature_set_spec(
    source=ParquetFeatureSource(
        path="wasbs://data@azuremlexampledata.blob.core.windows.net/feature-store-prp/datasources/accounts-precalculated/*.parquet",
        timestamp_column=TimestampColumn(name="timestamp"),
    ),
    index_columns=[Column(name="accountID", type=ColumnType.string)],
    # account profiles in the source are updated once a year. set temporal_join_lookback to 365 days
    temporal_join_lookback=DateTimeOffset(days=365, hours=0, minutes=0),
    infer_schema=True,
)
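
The temporal_join_lookback value matters when feature values are joined to observation data by timestamp. The following is a simplified, pure-Python sketch of that idea, not the feature store's implementation: for each observation, it picks the most recent feature row at or before the observation time, and ignores rows older than the lookback window. The function name, the row layout, the sample values, and the exact boundary handling are assumptions for illustration.

```python
from datetime import datetime, timedelta


def point_in_time_lookup(feature_rows, account_id, event_time, lookback):
    """Return the latest feature row for account_id whose timestamp is
    within [event_time - lookback, event_time], or None if there is none."""
    earliest = event_time - lookback
    candidates = [
        row
        for row in feature_rows
        if row["accountID"] == account_id
        and earliest <= row["timestamp"] <= event_time
    ]
    return max(candidates, key=lambda row: row["timestamp"], default=None)


# Hypothetical account profile rows, updated about once a year.
rows = [
    {"accountID": "A1", "timestamp": datetime(2022, 1, 1), "accountAge": 3},
    {"accountID": "A1", "timestamp": datetime(2023, 1, 1), "accountAge": 4},
]
lookback = timedelta(days=365)

# An observation in mid-2023 sees the 2023 profile, not the 2022 one.
recent = point_in_time_lookup(rows, "A1", datetime(2023, 6, 1), lookback)

# An observation more than 365 days after the last update finds no row.
stale = point_in_time_lookup(rows, "A1", datetime(2024, 6, 1), lookback)
```

In this sketch, a lookback that is shorter than the update interval of the source would leave some observations without feature values, which is why the specification above uses 365 days for data that changes once a year.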

Export as a feature set specification.

To register the feature set specification with the feature store, you must save the feature set specification in a specific format.

After you run the next cell, inspect the generated accounts feature set specification. To see the specification, open the featurestore/featuresets/accounts/spec/FeatureSetSpec.yaml file from the file tree.

The specification has these important elements:
- source: A reference to a storage resource. In this case, it's a Parquet file in a blob storage resource.
- features: A list of features and their datatypes. With provided transformation code, the code must return a DataFrame that maps to the features and datatypes. Without the provided transformation code, the system builds the query to map the features and datatypes to the source. In this case, the generated accounts feature set specification doesn't contain transformation code, because features are precomputed.
- index_columns: The join keys required to access values from the feature set.
To learn more, visit the Understanding top-level entities in managed feature store and the CLI (v2) feature set specification YAML schema resources.

As an extra benefit, persisting the feature set specification supports source control.
```
import os

# create a new folder to dump the feature set spec
accounts_featureset_spec_folder = root_dir + "/featurestore/featuresets/accounts/spec"

# create the folder if it doesn't already exist
os.makedirs(accounts_featureset_spec_folder, exist_ok=True)

accounts_featureset_spec.dump(accounts_featureset_spec_folder, overwrite=True)
```

Locally experiment with unregistered features and register with feature store when ready

As you develop features, you might want to locally test and validate them, before you register them with the feature store or run training pipelines in the cloud. A combination of a local unregistered feature set (accounts) and a feature set registered in the feature store (transactions) generates training data for the machine learning model.

Select features for the model.

# get the registered transactions feature set, version 1
transactions_featureset = featurestore.feature_sets.get("transactions", "1")
# Notice that account feature set spec is in your local dev environment (this notebook): not registered with feature store yet
features = [
    accounts_featureset_spec.get_feature("accountAge"),
    accounts_featureset_spec.get_feature("numPaymentRejects1dPerUser"),
    transactions_featureset.get_feature("transaction_amount_7d_sum"),
    transactions_featureset.get_feature("transaction_amount_3d_sum"),
    transactions_featureset.get_feature("transaction_amount_7d_avg"),
]

Locally generate training data.

This step generates training data for illustrative purposes. As an option, you can locally train models here. Later steps in this tutorial explain how to train a model in the cloud.

from azureml.featurestore import get_offline_features

# Load the observation data. To understand observation data, refer to part 1 of this tutorial
observation_data_path = "wasbs://data@azuremlexampledata.blob.core.windows.net/feature-store-prp/observation_data/train/*.parquet"
observation_data_df = spark.read.parquet(observation_data_path)
obs_data_timestamp_column = "timestamp"

# generate training dataframe by using feature data and observation data
training_df = get_offline_features(
    features=features,
    observation_data=observation_data_df,
    timestamp_column=obs_data_timestamp_column,
)

# Ignore the message that says feature set is not materialized (materialization is optional). We will enable materialization in the next part of the tutorial.
display(training_df)
# Note: display(training_df.head(5)) displays the timestamp column in a different format. You can call training_df.show() to see correctly formatted values.

After you locally experiment with feature definitions, and if they seem reasonable, you can register a feature set asset definition with the feature store.

from azure.ai.ml.entities import FeatureSet, FeatureSetSpecification

accounts_fset_config = FeatureSet(
    name="accounts",
    version="1",
    description="accounts featureset",
    entities=["azureml:account:1"],
    stage="Development",
    specification=FeatureSetSpecification(path=accounts_featureset_spec_folder),
    tags={"data_type": "nonPII"},
)

poller = fs_client.feature_sets.begin_create_or_update(accounts_fset_config)
print(poller.result())

Get the registered feature set and test it.

# look up the featureset by providing name and version
accounts_featureset = featurestore.feature_sets.get("accounts", "1")

Run a training experiment

In these steps, you select a list of features, run a training pipeline, and register the model. You can repeat these steps until the model performs as you want.

Optionally, discover features from the feature store UI.

The first tutorial covered this step, when you registered the transactions feature set. Because you also have an accounts feature set, you can browse through the available features:
1. Go to the Azure Machine Learning global landing page.
2. On the left pane, select Feature stores.
3. In the list of feature stores, select the feature store that you created earlier.
The UI shows the feature sets and entity that you created. Select the feature sets to browse through the feature definitions. You can use the global search box to search for feature sets across feature stores.

Optionally, discover features from the SDK.

# List available feature sets
all_featuresets = featurestore.feature_sets.list()
for fs in all_featuresets:
    print(fs)

# List of versions for transactions feature set
all_transactions_featureset_versions = featurestore.feature_sets.list(
    name="transactions"
)
for fs in all_transactions_featureset_versions:
    print(fs)

# See properties of the transactions featureset including list of features
featurestore.feature_sets.get(name="transactions", version="1").features

Select features for the model, and export them as a feature retrieval specification.

In the previous steps, you selected features from a combination of registered and unregistered feature sets for local experimentation and testing. You can now experiment in the cloud. Your model-shipping agility increases if you save the selected features as a feature retrieval specification, and then use the specification in the machine learning operations (MLOps) or continuous integration and continuous delivery (CI/CD) flow for training and inference.
1. Select features for the model.
```
# you can select features in pythonic way
features = [
    accounts_featureset.get_feature("accountAge"),
    transactions_featureset.get_feature("transaction_amount_7d_sum"),
    transactions_featureset.get_feature("transaction_amount_3d_sum"),
]

# you can also specify features in string form: featureset:version:feature
more_features = [
    "accounts:1:numPaymentRejects1dPerUser",
    "transactions:1:transaction_amount_7d_avg",
]

more_features = featurestore.resolve_feature_uri(more_features)

features.extend(more_features)
```
2. Export the selected features as a feature retrieval specification.
  
  A feature retrieval specification is a portable definition of the feature list associated with a model. It can help streamline the development and operationalization of a machine learning model. It becomes an input to the training pipeline that generates the training data. Then, it's packaged with the model.
  
  The inference phase uses the feature retrieval specification to look up the features. In this way, the specification integrates all phases of the machine learning lifecycle, and changes to the training and inference pipelines can stay at a minimum as you experiment and deploy.
  
  Use of the feature retrieval specification and the built-in feature retrieval component is optional. You can directly use the get_offline_features() API, as shown earlier. The name of the specification should be feature_retrieval_spec.yaml when you package it with the model. This way, the system can recognize it.
```
# Create feature retrieval spec
feature_retrieval_spec_folder = root_dir + "/project/fraud_model/feature_retrieval_spec"

# create the folder if it doesn't already exist
os.makedirs(feature_retrieval_spec_folder, exist_ok=True)

featurestore.generate_feature_retrieval_spec(feature_retrieval_spec_folder, features)
```
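
Because the packaged file name matters, a quick check after generation can catch a wrong folder early. This sketch assumes that generate_feature_retrieval_spec writes a file named feature_retrieval_spec.yaml directly into the target folder. The helper name find_retrieval_spec is hypothetical.

```python
import os

SPEC_FILE_NAME = "feature_retrieval_spec.yaml"  # file name given in this tutorial


def find_retrieval_spec(folder):
    """Return the path of the feature retrieval spec inside folder.

    Raises FileNotFoundError if the expected file is not there.
    """
    spec_path = os.path.join(folder, SPEC_FILE_NAME)
    if not os.path.isfile(spec_path):
        raise FileNotFoundError(f"{SPEC_FILE_NAME} not found in {folder}")
    return spec_path
```

For example, you could call find_retrieval_spec(feature_retrieval_spec_folder) right after the generation cell, before you submit the training pipeline.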

Train in the cloud with pipelines, and register the model

In this procedure, you manually trigger the training pipeline. In a production scenario, a CI/CD pipeline could trigger it, based on changes to the feature retrieval specification in the source repository. You can register the model if it's satisfactory.

Run the training pipeline.

The training pipeline has these steps:
1. Feature retrieval: For its input, this built-in component takes the feature retrieval specification, the observation data, and the timestamp column name. It then generates the training data as output. It runs as a managed Spark job.
2. Training: Based on the training data, this step trains the model and then generates a model (not yet registered).
3. Evaluation: This step validates whether the model performance and quality fall within a threshold. (In this tutorial, it's a placeholder step for illustration purposes.)
4. Register the model: This step registers the model.
  
  Note
  
  In the first tutorial, you ran a backfill job to materialize data for the transactions feature set. The feature retrieval step reads feature values from the offline store for this feature set. The behavior is the same, even if you use the get_offline_features() API.
```
from azure.ai.ml import load_job  # will be used later

training_pipeline_path = (
    root_dir + "/project/fraud_model/pipelines/training_pipeline.yaml"
)
training_pipeline_definition = load_job(source=training_pipeline_path)
training_pipeline_job = ws_client.jobs.create_or_update(training_pipeline_definition)
ws_client.jobs.stream(training_pipeline_job.name)
# Note: First time it runs, each step in pipeline can take ~ 15 mins. However subsequent runs can be faster (assuming spark pool is warm - default timeout is 30 mins)
```
Inspect the training pipeline and the model.

To display the pipeline steps, select the hyperlink for the Web View pipeline, and open it in a new window.

Use the feature retrieval specification in the model artifacts:
1. On the left pane of the current workspace, right-click Models.
2. Select Open in a new tab or window.
3. Select fraud_model.
4. Select Artifacts.
The feature retrieval specification is packaged along with the model. The model registration step in the training pipeline handled this step. You created the feature retrieval specification during experimentation. Now it's part of the model definition. In the next tutorial, you'll see how the inferencing process uses it.

View the feature set and model dependencies

View the list of feature sets associated with the model.

On the same Models page, select the Feature sets tab. This tab shows both the transactions and accounts feature sets. This model depends on these feature sets.

View the list of models that use the feature sets:
1. Open the feature store UI (explained earlier in this tutorial).
2. On the left pane, select Feature sets.
3. Select a feature set.
4. Select the Models tab.
The feature retrieval specification determined this list when the model was registered.
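
The relationship between a model's feature list and these two dependency views can be sketched in plain Python. The example below parses feature strings in the featureset:version:feature form used earlier in this tutorial and builds both directions of the mapping. It is an illustration only, not how the service computes these lists. The function names and the model-to-features dictionary are hypothetical.

```python
from collections import defaultdict


def parse_feature_string(feature_string):
    """Split 'featureset:version:feature' into its three parts."""
    parts = feature_string.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(
            f"Expected featureset:version:feature, got {feature_string!r}"
        )
    feature_set, version, feature = parts
    return feature_set, version, feature


def build_dependencies(model_features):
    """Map each model to its feature sets, and each feature set to its models."""
    sets_by_model = {}
    models_by_set = defaultdict(set)
    for model, feature_strings in model_features.items():
        # Keep only (feature set name, version) for each feature.
        used = {parse_feature_string(s)[:2] for s in feature_strings}
        sets_by_model[model] = used
        for feature_set in used:
            models_by_set[feature_set].add(model)
    return sets_by_model, dict(models_by_set)


# Feature names from this tutorial; the dictionary layout is hypothetical.
model_features = {
    "fraud_model": [
        "accounts:1:accountAge",
        "accounts:1:numPaymentRejects1dPerUser",
        "transactions:1:transaction_amount_7d_sum",
        "transactions:1:transaction_amount_3d_sum",
        "transactions:1:transaction_amount_7d_avg",
    ]
}
sets_by_model, models_by_set = build_dependencies(model_features)
```

Here, sets_by_model corresponds to the Feature sets tab of a model, and models_by_set corresponds to the Models tab of a feature set.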

Clean up

The fifth tutorial in the series describes how to delete the resources.

Next steps

Go to the next tutorial in the series: Enable recurrent materialization and run batch inference.
Learn about feature store concepts and top-level entities in managed feature store.
Learn about identity and access control for managed feature store.
View the troubleshooting guide for managed feature store.
View the YAML reference.
