Tutorial 2: Experiment and train models by using features

Article
03/18/2024

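This tutorial series shows how features seamlessly integrate all phases of the machine learning lifecycle: prototyping, training, and operationalization.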

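The first tutorial showed how to create a feature set specification with custom transformations, and then use that feature set to generate training data, enable materialization, and perform a backfill. This tutorial shows how to experiment with features as a way to improve model performance.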

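In this tutorial, you learn how to:

- Prototype a new accounts feature set specification by using existing precomputed values as features. Then, register the local feature set specification as a feature set in the feature store. This process differs from the first tutorial, where you created a feature set that had custom transformations.
- Select features for the model from the transactions and accounts feature sets, and save them as a feature retrieval specification.
- Run a training pipeline that uses the feature retrieval specification to train a new model. This pipeline uses the built-in feature retrieval component to generate the training data.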

Prerequisites

Before you proceed with this tutorial, be sure to complete the first tutorial in the series.

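Set up

Configure the Azure Machine Learning Spark notebook.

You can create a new notebook and run the instructions in this tutorial step by step. You can also open and run the existing notebook named 2.Experiment-train-models-using-features.ipynb from the featurestore_sample/notebooks directory. You can choose sdk_only or sdk_and_cli. Keep this tutorial open so that you can refer to its documentation links and further explanations.
1. On the top menu, in the [Compute] dropdown list, select [Serverless Spark Compute] under [Azure Machine Learning Serverless Spark].
2. Configure the session:
  1. When the toolbar displays [Configure session], select it.
  2. On the [Python packages] tab, select [Upload Conda file].
  3. Upload the conda.yml file that you uploaded in the first tutorial.
  4. (Optional) Increase the session time-out (idle time) to avoid frequent reruns of the prerequisites.

Start the Spark session.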

```
# Run this cell to start the Spark session (any code block starts the session). This can take around 10 minutes.
print("start spark session")
```

Set up the root directory for the samples.

```
import os

# please update the dir to ./Users/<your_user_alias> (or any custom directory you uploaded the samples to).
# You can find the name from the directory structure in the left nav
root_dir = "./Users/<your_user_alias>/featurestore_sample"

if os.path.isdir(root_dir):
    print("The folder exists.")
else:
    print("The folder does not exist. Please create or fix the path")
```

Set up the CLI.

Python SDK: Not applicable.

Azure CLI:

Install the Azure Machine Learning extension.
```
!az extension add --name ml
```
Authenticate.
```
!az login
```

Set the default subscription.

```
import os

subscription_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]

!az account set -s $subscription_id
```

Initialize the project workspace variables.

This is the current workspace, and the tutorial notebook runs in this resource.

```
### Initialize the MLClient of this project workspace
import os
from azure.ai.ml import MLClient
from azure.ai.ml.identity import AzureMLOnBehalfOfCredential

project_ws_sub_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]
project_ws_rg = os.environ["AZUREML_ARM_RESOURCEGROUP"]
project_ws_name = os.environ["AZUREML_ARM_WORKSPACE_NAME"]

# connect to the project workspace
ws_client = MLClient(
    AzureMLOnBehalfOfCredential(), project_ws_sub_id, project_ws_rg, project_ws_name
)
```
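
The cell above reads three environment variables with os.environ[...], which raises a KeyError if any of them is not set. As a minimal sketch, assuming you want a clearer message before you create the client, the following check reports which variables are missing. The helper name missing_env_vars is hypothetical and is not part of any Azure SDK.

```python
import os

# Variable names taken from the workspace cell above.
REQUIRED_VARS = (
    "AZUREML_ARM_SUBSCRIPTION",
    "AZUREML_ARM_RESOURCEGROUP",
    "AZUREML_ARM_WORKSPACE_NAME",
)


def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in the given mapping.

    Hypothetical helper for illustration; defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All workspace environment variables are set.")
```

Inside an Azure Machine Learning notebook these variables are normally expected to be present, so this check is mostly useful when you run the code somewhere else.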

Initialize the feature store variables.

Be sure to update the featurestore_name and featurestore_location values to reflect what you created in the first tutorial.

```
import os
from azure.ai.ml import MLClient
from azure.ai.ml.identity import AzureMLOnBehalfOfCredential

# feature store
featurestore_name = (
    "<FEATURESTORE_NAME>"  # use the same name from part #1 of the tutorial
)
featurestore_subscription_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]
featurestore_resource_group_name = os.environ["AZUREML_ARM_RESOURCEGROUP"]

# feature store ml client
fs_client = MLClient(
    AzureMLOnBehalfOfCredential(),
    featurestore_subscription_id,
    featurestore_resource_group_name,
    featurestore_name,
)
```

Initialize the feature store consumption client.

```
# feature store client
from azureml.featurestore import FeatureStoreClient
from azure.ai.ml.identity import AzureMLOnBehalfOfCredential

featurestore = FeatureStoreClient(
    credential=AzureMLOnBehalfOfCredential(),
    subscription_id=featurestore_subscription_id,
    resource_group_name=featurestore_resource_group_name,
    name=featurestore_name,
)
```

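Create a compute cluster named cpu-cluster-fs in the project workspace.

You need this compute cluster when you run the training and batch inference jobs.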

```
from azure.ai.ml.entities import AmlCompute

cluster_basic = AmlCompute(
    name="cpu-cluster-fs",
    type="amlcompute",
    size="STANDARD_F4S_V2",  # you can replace it with other supported VM SKUs
    location=ws_client.workspaces.get(ws_client.workspace_name).location,
    min_instances=0,
    max_instances=1,
    idle_time_before_scale_down=360,
)
ws_client.begin_create_or_update(cluster_basic).result()
```

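Create the accounts feature set in a local environment

In the first tutorial, you created a transactions feature set that had custom transformations. Here, you create an accounts feature set that uses precomputed values.

To onboard precomputed features, you can create a feature set specification without writing any transformation code. You use a feature set specification to develop and test a feature set in a fully local development environment.

You don't need to connect to a feature store. In this procedure, you create the feature set specification locally, and then sample values from it. To benefit from the capabilities of the managed feature store, you must use a feature asset definition to register the feature set specification with a feature store. Later steps in this tutorial provide more details.

Explore the source data for the accounts.

Note

This notebook uses sample data hosted in a publicly accessible blob container. Only a wasbs driver can read that data in Spark. When you create feature sets by using your own source data, host those feature sets in an Azure Data Lake Storage Gen2 account, and use an abfss driver in the data path.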
```
accounts_data_path = "wasbs://data@azuremlexampledata.blob.core.windows.net/feature-store-prp/datasources/accounts-precalculated/*.parquet"
accounts_df = spark.read.parquet(accounts_data_path)

display(accounts_df.head(5))
```

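Create the accounts feature set specification locally, from these precomputed features.

You don't need any transformation code here, because you reference precomputed features.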

```
from azureml.featurestore import create_feature_set_spec, FeatureSetSpec
from azureml.featurestore.contracts import (
    DateTimeOffset,
    Column,
    ColumnType,
    SourceType,
    TimestampColumn,
)
from azureml.featurestore.feature_source import ParquetFeatureSource


accounts_featureset_spec = create_feature_set_spec(
    source=ParquetFeatureSource(
        path="wasbs://data@azuremlexampledata.blob.core.windows.net/feature-store-prp/datasources/accounts-precalculated/*.parquet",
        timestamp_column=TimestampColumn(name="timestamp"),
    ),
    index_columns=[Column(name="accountID", type=ColumnType.string)],
    # account profiles in the source are updated once a year. set temporal_join_lookback to 365 days
    temporal_join_lookback=DateTimeOffset(days=365, hours=0, minutes=0),
    infer_schema=True,
)
```
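
To make the 365-day temporal_join_lookback setting more concrete, the following plain-Python sketch shows one way to think about a lookback window: for a given event time, take the latest feature row for the account that is not newer than the event and not older than the lookback. This is a simplified illustration only, not the feature store's implementation. The function name lookup_as_of, the sample rows, and the inclusive boundary handling are all assumptions.

```python
from datetime import datetime, timedelta


def lookup_as_of(feature_rows, account_id, event_time, lookback):
    """Return the latest row for account_id whose timestamp lies in
    [event_time - lookback, event_time], or None if there is none.

    Illustrative only; boundary handling here is an assumption.
    """
    earliest = event_time - lookback
    candidates = [
        row
        for row in feature_rows
        if row["accountID"] == account_id
        and earliest <= row["timestamp"] <= event_time
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda row: row["timestamp"])


# Made-up rows: one profile update per year for a single account.
rows = [
    {"accountID": "A1", "timestamp": datetime(2022, 1, 1), "accountAge": 3},
    {"accountID": "A1", "timestamp": datetime(2023, 1, 1), "accountAge": 4},
]
lookback = timedelta(days=365)

# An event within a year of the latest update finds that update.
print(lookup_as_of(rows, "A1", datetime(2023, 6, 1), lookback))
# An event more than a year after the latest update finds nothing.
print(lookup_as_of(rows, "A1", datetime(2024, 6, 1), lookback))
```

Because the source comment says account profiles are updated once a year, a window shorter than 365 days could leave some events without a matching profile row in a scheme like this one.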

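Export as a feature set specification.

To register the feature set specification with the feature store, you must save the feature set specification in a specific format.

After you run the next cell, inspect the generated accounts feature set specification. To see the specification, open the featurestore/featuresets/accounts/spec/FeatureSetSpec.yaml file from the file tree.

The specification has these important elements:
- source: A reference to a storage resource. In this case, it's a Parquet file in a blob storage resource.
- features: A list of features and their data types. With provided transformation code, the code must return a DataFrame that maps to the features and data types. Without provided transformation code, the system builds the query to map the features and data types to the source. In this case, the generated accounts feature set specification doesn't contain transformation code, because the features are precomputed.
- index_columns: The join keys required to access values from the feature set.

To learn more, see Understanding top-level entities in managed feature store and CLI (v2) feature set specification YAML schema.

An added benefit of persisting the specification is that it can be kept under source control.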
```
import os

# create a new folder to dump the feature set spec
accounts_featureset_spec_folder = root_dir + "/featurestore/featuresets/accounts/spec"

# check if the folder exists, create one if not
if not os.path.exists(accounts_featureset_spec_folder):
    os.makedirs(accounts_featureset_spec_folder)

accounts_featureset_spec.dump(accounts_featureset_spec_folder, overwrite=True)
```

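Experiment locally with unregistered features, and register them with the feature store when ready

As you develop features, you might want to test and validate them locally before you register them with the feature store or run training pipelines in the cloud. A combination of a local unregistered feature set (accounts) and a feature set registered in the feature store (transactions) generates the training data for the machine learning model.

Select features for the model.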

```
# get the registered transactions feature set, version 1
transactions_featureset = featurestore.feature_sets.get("transactions", "1")
# Notice that the accounts feature set spec is in your local dev environment (this notebook); it is not registered with the feature store yet
features = [
    accounts_featureset_spec.get_feature("accountAge"),
    accounts_featureset_spec.get_feature("numPaymentRejects1dPerUser"),
    transactions_featureset.get_feature("transaction_amount_7d_sum"),
    transactions_featureset.get_feature("transaction_amount_3d_sum"),
    transactions_featureset.get_feature("transaction_amount_7d_avg"),
]
```

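Generate training data locally.

This step generates training data for illustrative purposes. As an option, you can train models locally here. Later steps in this tutorial explain how to train a model in the cloud.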

```
from azureml.featurestore import get_offline_features

# Load the observation data. To understand observation data, refer to part 1 of this tutorial
observation_data_path = "wasbs://data@azuremlexampledata.blob.core.windows.net/feature-store-prp/observation_data/train/*.parquet"
observation_data_df = spark.read.parquet(observation_data_path)
obs_data_timestamp_column = "timestamp"

# generate training dataframe by using feature data and observation data
training_df = get_offline_features(
    features=features,
    observation_data=observation_data_df,
    timestamp_column=obs_data_timestamp_column,
)

# Ignore the message that says feature set is not materialized (materialization is optional). We will enable materialization in the next part of the tutorial.
display(training_df)
# Note: display(training_df.head(5)) displays the timestamp column in a different format. You can call training_df.show() to see correctly formatted values
```

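Register the accounts feature set with the feature store.

After you experiment locally with the feature definitions, and they seem reasonable, you can register a feature set asset definition with the feature store.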

```
from azure.ai.ml.entities import FeatureSet, FeatureSetSpecification

accounts_fset_config = FeatureSet(
    name="accounts",
    version="1",
    description="accounts featureset",
    entities=["azureml:account:1"],
    stage="Development",
    specification=FeatureSetSpecification(path=accounts_featureset_spec_folder),
    tags={"data_type": "nonPII"},
)

poller = fs_client.feature_sets.begin_create_or_update(accounts_fset_config)
print(poller.result())
```

Get the registered feature set, and test it.

```
# look up the featureset by providing name and version
accounts_featureset = featurestore.feature_sets.get("accounts", "1")
```

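Run a training experiment

In the following steps, you select a list of features, run a training pipeline, and register the model. You can repeat these steps until the model performs as you want.

(Optional) Discover features from the feature store UI.

The first tutorial covered this step, when you registered the transactions feature set. Because you also have an accounts feature set, you can browse through the available features:
1. Go to the Azure Machine Learning global landing page.
2. On the left pane, select [Feature stores].
3. In the list of feature stores, select the feature store that you created earlier.

The UI shows the feature sets and entities that you created. Select a feature set to browse through its feature definitions. You can use the global search box to search for feature sets across feature stores.

(Optional) Discover features from the SDK.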

```
# List available feature sets
all_featuresets = featurestore.feature_sets.list()
for fs in all_featuresets:
    print(fs)

# List of versions for transactions feature set
all_transactions_featureset_versions = featurestore.feature_sets.list(
    name="transactions"
)
for fs in all_transactions_featureset_versions:
    print(fs)

# See properties of the transactions featureset including list of features
featurestore.feature_sets.get(name="transactions", version="1").features
```

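Select features for the model, and export them as a feature retrieval specification.

In the previous steps, you selected features from a combination of registered and unregistered feature sets, for local experimentation and testing. You can now experiment in the cloud. Your model-shipping agility increases if you save the selected features as a feature retrieval specification, and then use the specification in the machine learning operations (MLOps) or continuous integration and continuous delivery (CI/CD) flow for training and inference.
1. Select features for the model.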
```
# you can select features in pythonic way
features = [
    accounts_featureset.get_feature("accountAge"),
    transactions_featureset.get_feature("transaction_amount_7d_sum"),
    transactions_featureset.get_feature("transaction_amount_3d_sum"),
]

# you can also specify features in string form: featureset:version:feature
more_features = [
    "accounts:1:numPaymentRejects1dPerUser",
    "transactions:1:transaction_amount_7d_avg",
]

more_features = featurestore.resolve_feature_uri(more_features)

features.extend(more_features)
```
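2. Export the selected features as a feature retrieval specification.
  
  A feature retrieval specification is a portable definition of the feature list associated with a model. It can help streamline the development and operationalization of a machine learning model. It becomes an input to the training pipeline that generates the training data. It's then packaged with the model.
  
  The inference phase uses the feature retrieval specification to look up the features. It integrates all phases of the machine learning lifecycle. Changes to the training and inference pipelines can stay at a minimum as you experiment and deploy.
  
  Use of the feature retrieval specification and the built-in feature retrieval component is optional. You can directly use the get_offline_features() API, as shown earlier. When the specification is packaged with the model, its name should be feature_retrieval_spec.yaml. This way, the system can recognize it.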
```
# Create feature retrieval spec
feature_retrieval_spec_folder = root_dir + "/project/fraud_model/feature_retrieval_spec"

# check if the folder exists, create one if not
if not os.path.exists(feature_retrieval_spec_folder):
    os.makedirs(feature_retrieval_spec_folder)

featurestore.generate_feature_retrieval_spec(feature_retrieval_spec_folder, features)
```
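
The string-form feature references in the feature selection code above, such as accounts:1:numPaymentRejects1dPerUser, have three colon-separated parts: feature set name, version, and feature name. As a small sketch for checking such strings before you pass them on, the following helper splits and validates them. The names FeatureRef and parse_feature_ref are hypothetical and are not part of the azureml-featurestore package.

```python
from typing import NamedTuple


class FeatureRef(NamedTuple):
    """The three parts of a 'featureset:version:feature' string."""

    feature_set: str
    version: str
    feature: str


def parse_feature_ref(text):
    """Split a 'featureset:version:feature' string into its parts.

    Hypothetical helper; raises ValueError on malformed input.
    """
    parts = text.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected 'featureset:version:feature', got {text!r}")
    return FeatureRef(*parts)


ref = parse_feature_ref("accounts:1:numPaymentRejects1dPerUser")
print(ref.feature_set, ref.version, ref.feature)
```

A check like this catches a typo, for example a missing version, locally before the string reaches the feature store client.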

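Train in the cloud by using pipelines, and register the model

In this procedure, you manually trigger the training pipeline. In a production scenario, a CI/CD pipeline could trigger it, based on changes to the feature retrieval specification in the source repository. You can register the model if it's satisfactory.

Run the training pipeline.

The training pipeline has these steps:
1. Feature retrieval: For its input, this built-in component takes the feature retrieval specification, the observation data, and the timestamp column name. It then generates the training data as output. It runs these steps as a managed Spark job.
2. Training: Based on the training data, this step trains the model and then generates a model (not yet registered).
3. Evaluation: This step validates whether the model performance and quality fall within a threshold. (In this tutorial, it's a placeholder step for illustration purposes.)
4. Register the model: This step registers the model.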
  
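  Note
  
  In the first tutorial, you ran a backfill job to materialize data for the transactions feature set. The feature retrieval step reads feature values from the offline store for this feature set. The behavior is the same even if you use the get_offline_features() API.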
```
from azure.ai.ml import load_job  # will be used later

training_pipeline_path = (
    root_dir + "/project/fraud_model/pipelines/training_pipeline.yaml"
)
training_pipeline_definition = load_job(source=training_pipeline_path)
training_pipeline_job = ws_client.jobs.create_or_update(training_pipeline_definition)
ws_client.jobs.stream(training_pipeline_job.name)
# Note: First time it runs, each step in pipeline can take ~ 15 mins. However subsequent runs can be faster (assuming spark pool is warm - default timeout is 30 mins)
```
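
The evaluation step in this pipeline is only a placeholder. As a rough sketch of what a threshold check in such a step could look like, the following function returns True only when every required metric reaches its minimum. The function name passes_thresholds, the metric names, and the threshold values are all made up for illustration; they don't come from the sample pipeline.

```python
def passes_thresholds(metrics, minimums):
    """Return True only if every metric in minimums is present in
    metrics and is at least its minimum value."""
    return all(
        name in metrics and metrics[name] >= minimum
        for name, minimum in minimums.items()
    )


# Hypothetical metric names and threshold values.
minimums = {"accuracy": 0.90, "recall": 0.75}
candidate = {"accuracy": 0.93, "recall": 0.71}

if passes_thresholds(candidate, minimums):
    print("Model meets the thresholds; continue to registration.")
else:
    print("Model is below a threshold; skip registration.")
```

Treating a missing metric as a failure is a deliberate choice in this sketch, so that an incomplete evaluation can't pass the gate by accident.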
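Inspect the training pipeline and the model.
- To display the pipeline steps, select the hyperlink for the [Web View] pipeline, and open it in a new window.

Use the feature retrieval specification in the model artifacts:
1. On the left pane of the current workspace, right-click [Models].
2. Select [Open in new tab or window].
3. Select [fraud_model].
4. Select [Artifacts].

The feature retrieval specification is packaged along with the model. The model registration step in the training pipeline handled this step. You created the feature retrieval specification during experimentation. Now it's part of the model definition. In the next tutorial, you'll see how inference uses it.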

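View the feature set and model dependencies

View the list of feature sets associated with the model.

On the same [Models] page, select the [Feature sets] tab. This tab shows both the transactions and accounts feature sets that this model depends on.

View the list of models that use the feature sets:
1. Open the feature store UI (explained earlier in this tutorial).
2. On the left pane, select [Feature sets].
3. Select a feature set.
4. Select the [Models] tab.

The feature retrieval specification determined this list when the model was registered.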
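
To see why both feature sets appear as dependencies, it can help to group the selected features by the feature set they come from. The following sketch does that for the five features used in this tutorial, written in featureset:version:feature string form. It only illustrates the idea; the service derives the dependency list itself, and the function name feature_sets_used is hypothetical.

```python
def feature_sets_used(feature_refs):
    """Return the sorted, distinct (feature_set, version) pairs found in
    a list of 'featureset:version:feature' strings."""
    pairs = set()
    for ref in feature_refs:
        feature_set, version, _feature = ref.split(":")
        pairs.add((feature_set, version))
    return sorted(pairs)


# The five features selected earlier in this tutorial, in string form.
selected = [
    "accounts:1:accountAge",
    "accounts:1:numPaymentRejects1dPerUser",
    "transactions:1:transaction_amount_7d_sum",
    "transactions:1:transaction_amount_3d_sum",
    "transactions:1:transaction_amount_7d_avg",
]
print(feature_sets_used(selected))
```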

Clean up

The fifth tutorial in the series describes how to delete the resources.
