ワークスペース特徴量ストアで特徴量を操作する

[アーティクル]
03/01/2024

Note

このドキュメントでは、ワークスペース特徴量ストアについて説明します。 Databricks では、Unity Catalog の特徴エンジニアリングを使用することをお勧めします。ワークスペース特徴量ストアは、今後非推奨となる予定です。

このページでは、ワークスペース特徴ストアで特徴テーブルを作成して操作する方法について説明します。

Note

ワークスペースが Unity Catalog に対して有効になっている場合、主キーを持つ Unity Catalog によって管理されるテーブルは、モデルのトレーニングと推論に使用できる特徴テーブルとして自動的に作成されます。セキュリティ、系列、タグ付け、ワークスペース間アクセスなど、すべての Unity Catalog 機能が特徴テーブルで自動的に使用できます。 Unity Catalog 対応ワークスペースでの特徴テーブルの操作の詳細については、「Unity Catalog の特徴エンジニアリング」を参照してください。

特徴系列の追跡と鮮度の詳細については、「特徴の検出と特徴系列の追跡」を参照してください。

Note

データベース名と特徴テーブル名には英数字とアンダースコア (_) のみを使用できます。

特徴テーブルのデータベースを作成する

特徴テーブルを作成する前に、それらを格納するデータベースを作成する必要があります。

%sql CREATE DATABASE IF NOT EXISTS <database-name>

特徴テーブルは、Delta テーブルとして格納されます。 create_table (Feature Store クライアント v0.3.6 以降) または create_feature_table (v0.3.5 以前) を使用して特徴テーブルを作成する場合は、データベース名を指定する必要があります。たとえば、この引数は、customer_features という名前の Delta テーブルをデータベース recommender_system に作成します。

name='recommender_system.customer_features'

オンラインストアに特徴テーブルを公開する場合、既定のテーブルとデータベース名は、テーブルの作成時に指定された名前になります。publish_table メソッドを使用すると、異なる名前を指定できます。

Databricks Feature Store UI には、オンラインストア内のテーブルとデータベースの名前と、他のメタデータが表示されます。

Databricks Feature Store で特徴テーブルを作成する

注意

また、既存の Delta テーブルを特徴テーブルとして登録できるようになりました。既存の Delta テーブルを特徴テーブルとして登録することに関する記事を参照してください。

特徴テーブルを作成する基本的な手順は次のとおりです。

特徴をコンピューティングする Python 関数を記述します。各関数の出力は、一意の主キーを持つ Apache Spark DataFrame である必要があります。主キーは、1 つ以上の列で構成できます。
FeatureStoreClient をインスタンス化し、create_table (v0.3.6 以降) または create_feature_table (v0.3.5 以前) を使用して、特徴テーブルを作成します。
write_table を使用して特徴テーブルを設定します。

以下の例で使用されているコマンドとパラメーターの詳細については、「Feature Store Python API リファレンス」を参照してください。

V0.3.6 以降

from databricks.feature_store import feature_table

def compute_customer_features(data):
  ''' Feature computation code returns a DataFrame with 'customer_id' as primary key'''
  pass

# create feature table keyed by customer_id
# take schema from DataFrame output by compute_customer_features
from databricks.feature_store import FeatureStoreClient

customer_features_df = compute_customer_features(df)

fs = FeatureStoreClient()

customer_feature_table = fs.create_table(
  name='recommender_system.customer_features',
  primary_keys='customer_id',
  schema=customer_features_df.schema,
  description='Customer features'
)

# An alternative is to use `create_table` and specify the `df` argument.
# This code automatically saves the features to the underlying Delta table.

# customer_feature_table = fs.create_table(
#  ...
#  df=customer_features_df,
#  ...
# )

# To use a composite key, pass all keys in the create_table call

# customer_feature_table = fs.create_table(
#   ...
#   primary_keys=['customer_id', 'date'],
#   ...
# )

# Use write_table to write data to the feature table
# Overwrite mode does a full refresh of the feature table

fs.write_table(
  name='recommender_system.customer_features',
  df = customer_features_df,
  mode = 'overwrite'
)

V0.3.5 以前

from databricks.feature_store import feature_table

def compute_customer_features(data):
  ''' Feature computation code returns a DataFrame with 'customer_id' as primary key'''
  pass

# create feature table keyed by customer_id
# take schema from DataFrame output by compute_customer_features
from databricks.feature_store import FeatureStoreClient

customer_features_df = compute_customer_features(df)

fs = FeatureStoreClient()

customer_feature_table = fs.create_feature_table(
  name='recommender_system.customer_features',
  keys='customer_id',
  schema=customer_features_df.schema,
  description='Customer features'
)

# An alternative is to use `create_feature_table` and specify the `features_df` argument.
# This code automatically saves the features to the underlying Delta table.

# customer_feature_table = fs.create_feature_table(
#  ...
#  features_df=customer_features_df,
#  ...
# )

# To use a composite key, pass all keys in the create_feature_table call

# customer_feature_table = fs.create_feature_table(
#   ...
#   keys=['customer_id', 'date'],
#   ...
# )

# Use write_table to write data to the feature table
# Overwrite mode does a full refresh of the feature table

fs.write_table(
  name='recommender_system.customer_features',
  df = customer_features_df,
  mode = 'overwrite'
)from databricks.feature_store import feature_table

def compute_customer_features(data):
  ''' Feature computation code returns a DataFrame with 'customer_id' as primary key'''
  pass

# create feature table keyed by customer_id
# take schema from DataFrame output by compute_customer_features
from databricks.feature_store import FeatureStoreClient

customer_features_df = compute_customer_features(df)

fs = FeatureStoreClient()

customer_feature_table = fs.create_feature_table(
  name='recommender_system.customer_features',
  keys='customer_id',
  schema=customer_features_df.schema,
  description='Customer features'
)

# An alternative is to use `create_feature_table` and specify the `features_df` argument.
# This code automatically saves the features to the underlying Delta table.

# customer_feature_table = fs.create_feature_table(
#  ...
#  features_df=customer_features_df,
#  ...
# )

# To use a composite key, pass all keys in the create_feature_table call

# customer_feature_table = fs.create_feature_table(
#   ...
#   keys=['customer_id', 'date'],
#   ...
# )

# Use write_table to write data to the feature table
# Overwrite mode does a full refresh of the feature table

fs.write_table(
  name='recommender_system.customer_features',
  df = customer_features_df,
  mode = 'overwrite'
)

既存の Delta テーブルを特徴テーブルとして登録する

v0.3.8 以降では、既存の Delta テーブルを特徴テーブルとして登録できます。 Delta テーブルはメタストアに存在している必要があります。

注意

登録されている特徴テーブルを更新するには、Feature Store Python API を使用する必要があります。

fs.register_table(
  delta_table='recommender.customer_features',
  primary_keys='customer_id',
  description='Customer features'
)

特徴テーブルへのアクセスの制御

「特徴テーブルへのアクセスの制御」を参照してください。

特徴テーブルを更新する

特徴テーブルを更新するには、新しい特徴を追加するか、主キーに基づいて特定の行を変更します。

次の特徴テーブルのメタデータは更新できません。

Primary key (プライマリキー)
パーティションキー
既存の特徴の名前または種類

既存の特徴テーブルに新しい機能を追加する

既存の特徴テーブルに新しい特徴を追加するには、次の 2 つの方法があります。

既存の特徴評価関数を更新し、返された DataFrame を使用して write_table を実行します。これにより、特徴テーブルスキーマが更新され、主キーに基づいて新しい特徴値がマージされます。
新しい特徴の値を計算する新しい特徴評価関数を作成します。この新しい評価関数によって返される DataFrame には、特徴テーブルの主キーとパーティションキー (定義されている場合) が含まれている必要があります。 DataFrame を使用して write_table を実行し、同じ主キーを使用して既存の特徴テーブルに新しい特徴を書き込みます。

特徴テーブル内の特定の行のみを更新する

write_table で mode = "merge" を使用します。 write_table 呼び出しで送信された DataFrame に主キーが存在しない行は変更されません。

fs.write_table(
  name='recommender.customer_features',
  df = customer_features_df,
  mode = 'merge'
)

特徴テーブルを更新するジョブをスケジュールする

特徴テーブルの特徴に常に最新の値が含まれるようにするため、Databricks では、毎日のように定期的に特徴テーブルを更新するノートブックを実行するジョブを作成することをお勧めします。スケジュールされていないジョブが既に作成されている場合は、スケジュールされたジョブに変換して、特徴の値が常に最新になるようにします。

特徴テーブルを更新するコードでは、次の例に示すように mode='merge' を使用します。

fs = FeatureStoreClient()

customer_features_df = compute_customer_features(data)

fs.write_table(
  df=customer_features_df,
  name='recommender_system.customer_features',
  mode='merge'
)

日次特徴の過去の値を格納する

複合主キーを使用して特徴テーブルを定義します。主キーに日付を含めます。たとえば、特徴テーブル store_purchases の場合、効率的な読み取りのために複合主キー (date、user_id) とパーティションキー date を使用できます。

fs.create_table(
  name='recommender_system.customer_features',
  primary_keys=['date', 'customer_id'],
  partition_columns=['date'],
  schema=customer_features_df.schema,
  description='Customer features'
)

その後、特徴テーブルの date のフィルター処理から関心のある期間まで読み取るコードを作成できます。

timestamp_keys 引数を利用し、タイムスタンプキーとして date 列を指定する方法で時系列特徴テーブルを作成することもできます。

fs.create_table(
  name='recommender_system.customer_features',
  primary_keys=['date', 'customer_id'],
  timestamp_keys=['date'],
  schema=customer_features_df.schema,
  description='Customer timeseries features'
)

これにより、create_training_set または score_batch を使うときのポイントインタイム検索が可能になります。システムは、ユーザーが指定した timestamp_lookup_key を使って、タイムスタンプの時点での結合を実行します。

特徴テーブルを最新の状態に保つには、定期的にスケジュールされたジョブを設定して特徴を書き込むか、新しい特徴の値を特徴テーブルにストリーム配信します。

ストリーミング機能評価パイプラインを作成して特徴を更新する

ストリーミング機能評価パイプラインを作成するには、ストリーミング DataFrame を引数として write_table に渡します。このメソッドから StreamingQuery オブジェクトが返されます。

def compute_additional_customer_features(data):
  ''' Returns Streaming DataFrame
  '''
  pass  # not shown

customer_transactions = spark.readStream.load("dbfs:/events/customer_transactions")
stream_df = compute_additional_customer_features(customer_transactions)

fs.write_table(
  df=stream_df,
  name='recommender_system.customer_features',
  mode='merge'
)

特徴テーブルから読み取る

特徴の値を読み取る場合は、read_table を使用します。

fs = feature_store.FeatureStoreClient()
customer_features_df = fs.read_table(
  name='recommender.customer_features',
)

特徴テーブルを検索および参照する

Feature Store UI を使用して、特徴テーブルを検索または参照します。

サイドバーで [Machine Learning > Feature Store] を選択して、Feature Store UI を表示します。
検索ボックスには、特徴テーブル、特徴、または特徴計算に使用されるデータソースの名前のすべてまたは一部を入力できます。タグのキーまたは値のすべてまたは一部を入力することもできます。検索テキストでは大文字と小文字が区別されません。

特徴テーブルのメタデータを取得する

特徴テーブルのメタデータを取得する API は、使用している Databricks ランタイムのバージョンによって異なります。 v0.3.6 以降では、get_table を使用します。 v0.3.5 以前では、get_feature_table を使用します。

# this example works with v0.3.6 and above
# for v0.3.5, use `get_feature_table`
from databricks.feature_store import FeatureStoreClient
fs = FeatureStoreClient()
fs.get_table("feature_store_example.user_feature_table")

特徴テーブルを操作する

タグはキーと値のペアであり、作成して特徴テーブルを検索するために使用できます。 Feature Store UI または Feature Store Python API を使用して、タグを作成、編集、削除できます。

UI で特徴テーブルを操作する

Feature Store UI を使用して、特徴テーブルを検索または参照します。 UI にアクセスするには、サイドバーで [Machine Learning > Feature Store] を選択します。

Feature Store UI を使用してタグを追加する

をクリックします (開いていない場合)。タグテーブルが表示されます。
[名前] フィールドと [値] フィールドをクリックし、タグのキーと値を入力します。
[追加] をクリックします。

Feature Store UI を使用してタグを編集または削除する

既存のタグを編集または削除するには、[アクション] 列のアイコンを使用します。

tag actions

Feature Store Python API を使用して特徴テーブルタグを操作する

v0.4.1 以降を実行しているクラスターでは、Feature Store Python API を使用してタグを作成、編集、削除できます。

必要条件

Feature Store クライアント v0.4.1 以降

Feature Store Python API を使用してタグ付きの特徴テーブルを作成する

from databricks.feature_store import FeatureStoreClient
fs = FeatureStoreClient()

customer_feature_table = fs.create_table(
  ...
  tags={"tag_key_1": "tag_value_1", "tag_key_2": "tag_value_2", ...},
  ...
)

Feature Store Python API を使用してタグを追加、更新、削除する

from databricks.feature_store import FeatureStoreClient
fs = FeatureStoreClient()

# Upsert a tag
fs.set_feature_table_tag(table_name="my_table", key="quality", value="gold")

# Delete a tag
fs.delete_feature_table_tag(table_name="my_table", key="quality")

特徴テーブルのデータソースを更新する

特徴ストアでは、特徴の計算に使用されるデータソースが自動的に追跡されます。 Feature Store Python API を利用し、データソースを手動更新することもできます。

必要条件

Feature Store クライアント v0.5.0 以降

Feature Store Python API を使用してデータソースを追加する

以下にコマンド例を示します。詳しくは、API ドキュメントをご覧ください。

from databricks.feature_store import FeatureStoreClient
fs = FeatureStoreClient()

# Use `source_type="table"` to add a table in the metastore as data source.
fs.add_data_sources(feature_table_name="clicks", data_sources="user_info.clicks", source_type="table")

# Use `source_type="path"` to add a data source in path format.
fs.add_data_sources(feature_table_name="user_metrics", data_sources="dbfs:/FileStore/user_metrics.json", source_type="path")

# Use `source_type="custom"` if the source is not a table or a path.
fs.add_data_sources(feature_table_name="user_metrics", data_sources="user_metrics.txt", source_type="custom")

Feature Store Python API を使用してデータソースを削除する

詳しくは、API ドキュメントをご覧ください。

注意

次のコマンドでは、ソース名と一致するすべての型 ("table"、"path"、"custom") のデータソースが削除されます。

from databricks.feature_store import FeatureStoreClient
fs = FeatureStoreClient()
fs.delete_data_sources(feature_table_name="clicks", sources_names="user_info.clicks")

特徴テーブルを削除する

Feature Store UI または Feature Store Python API を使用して、特徴テーブルを削除できます。

注意

特徴テーブルを削除すると、アップストリームプロデューサーとダウンストリームコンシューマー (モデル、エンドポイント、スケジュールされたジョブ) で予期しないエラーが発生する可能性があります。公開済みのオンラインストアは、クラウドプロバイダーで削除する必要があります。
API を使用して特徴テーブルを削除すると、基になる Delta テーブルも削除されます。 UI から特徴テーブルを削除する場合は、基になる Delta テーブルを個別に削除する必要があります。

UI を使用して特徴テーブルを削除する

特徴テーブルページで、特徴テーブル名の右側にあるをクリックし、[削除]を選択します。特徴テーブルに対する管理可能アクセス許可が付与されていない場合は、このオプションは表示されません。
[特徴テーブルの削除] ダイアログで、[削除] をクリックして確定します。
基になる差分テーブルもドロップする場合は、ノートブックで次のコマンドを実行します。
```
%sql DROP TABLE IF EXISTS <feature-table-name>;
```

Feature Store Python API を使用して特徴テーブルを削除する

Feature Store クライアント v0.4.1 以降では、drop_table を使用して特徴テーブルを削除できます。 drop_table が含まれるテーブルを削除すると、基になる Delta テーブルも削除されます。

fs.drop_table(
  name='recommender_system.customer_features'
)

ワークスペース特徴量ストアで特徴量を操作する

特徴テーブルのデータベースを作成する

Databricks Feature Store で特徴テーブルを作成する

V0.3.6 以降

V0.3.5 以前

既存の Delta テーブルを特徴テーブルとして登録する

特徴テーブルへのアクセスの制御

特徴テーブルを更新する

既存の特徴テーブルに新しい機能を追加する

特徴テーブル内の特定の行のみを更新する

特徴テーブルを更新するジョブをスケジュールする

日次特徴の過去の値を格納する

ストリーミング機能評価パイプラインを作成して特徴を更新する

特徴テーブルから読み取る

特徴テーブルを検索および参照する

特徴テーブルのメタデータを取得する

特徴テーブルを操作する

UI で特徴テーブルを操作する

Feature Store UI を使用してタグを追加する

Feature Store UI を使用してタグを編集または削除する

Feature Store Python API を使用して特徴テーブル タグを操作する

必要条件

Feature Store Python API を使用してタグ付きの特徴テーブルを作成する

Feature Store Python API を使用してタグを追加、更新、削除する

特徴テーブルのデータ ソースを更新する

必要条件

Feature Store Python API を使用してデータ ソースを追加する

Feature Store Python API を使用してデータ ソースを削除する

特徴テーブルを削除する

UI を使用して特徴テーブルを削除する

Feature Store Python API を使用して特徴テーブルを削除する

その他のリソース

Feature Store Python API を使用して特徴テーブルタグを操作する

特徴テーブルのデータソースを更新する

Feature Store Python API を使用してデータソースを追加する

Feature Store Python API を使用してデータソースを削除する