Multivariate anomaly detection with Isolation Forest

2024-01-18

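This article shows how to run multivariate anomaly detection on Apache Spark with SynapseML. Multivariate anomaly detection finds anomalies across many variables or time series while taking into account the inter-correlations and dependencies between them. In this scenario, you use SynapseML to train an Isolation Forest model for multivariate anomaly detection, and then use the trained model to infer multivariate anomalies in a dataset that contains synthetic measurements from three IoT sensors.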

For more information about the Isolation Forest model, see the original paper by Liu et al.
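For orientation, the anomaly score defined in that paper can be sketched in a few lines of plain Python. This is only an illustration of the published formula, s(x, n) = 2^(-E[h(x)] / c(n)), and not SynapseML's code; the function names are hypothetical, and the sample size of 256 is borrowed from the max_samples setting used later in this article.

```python
import math

EULER_GAMMA = 0.5772156649  # used to approximate the harmonic number H(i)


def average_path_length(n: int) -> float:
    """c(n) = 2*H(n-1) - 2*(n-1)/n, the normalising term from Liu et al."""
    if n < 2:
        raise ValueError("c(n) is only meaningful for n >= 2")
    harmonic = math.log(n - 1) + EULER_GAMMA  # H(n-1) ~ ln(n-1) + gamma
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(mean_path_length: float, n: int) -> float:
    """s(x, n) = 2 ** (-E[h(x)] / c(n)); values close to 1 suggest an anomaly."""
    return 2.0 ** (-mean_path_length / average_path_length(n))


c_256 = average_path_length(256)
print(round(c_256, 2))
print(anomaly_score(c_256, 256))  # a point with average path length scores 0.5
print(anomaly_score(3.0, 256) > anomaly_score(12.0, 256))  # shorter path, higher score
```

Points that are isolated after only a few random splits have a short average path length and therefore a score closer to 1.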

Prerequisites

Attach your notebook to a lakehouse. On the left side, select Add to add an existing lakehouse or to create one.

Import libraries

from IPython import get_ipython
from IPython.terminal.interactiveshell import TerminalInteractiveShell
import uuid
import mlflow

from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.types import *
from pyspark.ml import Pipeline

from synapse.ml.isolationforest import *

from synapse.ml.explainers import *

%matplotlib inline

from pyspark.sql import SparkSession

# Bootstrap Spark Session
spark = SparkSession.builder.getOrCreate()

from synapse.ml.core.platform import *

if running_on_synapse():
    shell = TerminalInteractiveShell.instance()
    shell.define_macro("foo", """a,b=10,20""")

Input data

# Table inputs
timestampColumn = "timestamp"  # str: the name of the timestamp column in the table
inputCols = [
    "sensor_1",
    "sensor_2",
    "sensor_3",
]  # list(str): the names of the input variables

# Training and inference windows, given as ISO 8601 UTC strings:
trainingStartTime = (
    "2022-02-24T06:00:00Z"  # str: timestamp at which the training window starts
)
trainingEndTime = (
    "2022-03-08T23:55:00Z"  # str: timestamp at which the training window ends
)
inferenceStartTime = (
    "2022-03-09T09:30:00Z"  # str: timestamp at which the inference window starts
)
inferenceEndTime = (
    "2022-03-20T23:55:00Z"  # str: timestamp at which the inference window ends
)

# Isolation Forest parameters
contamination = 0.021
num_estimators = 100
max_samples = 256
max_features = 1.0
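The contamination value is the fraction of training rows expected to be outliers (here 2.1%). One way to picture its effect, assuming the label threshold is simply the score quantile that flags that fraction of the training rows, is the plain-Python sketch below. The scores are made up, score_threshold is a hypothetical helper, and how SynapseML actually derives the threshold (including how it uses the contamination error set later) should be checked against its documentation.

```python
import random


def score_threshold(train_scores, contamination):
    """Return the lowest score that still falls in the top `contamination` share."""
    if not 0.0 < contamination < 1.0:
        raise ValueError("contamination must be between 0 and 1")
    ordered = sorted(train_scores, reverse=True)
    n_outliers = max(1, round(contamination * len(ordered)))
    return ordered[n_outliers - 1]


random.seed(1)
scores = [random.random() for _ in range(1000)]  # made-up outlier scores

threshold = score_threshold(scores, contamination=0.021)
flagged = [s for s in scores if s >= threshold]
print(len(flagged), "of", len(scores), "rows flagged")
```

With 1,000 rows and a contamination of 0.021, the top 21 scores end up above the threshold.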

Read data

df = (
    spark.read.format("csv")
    .option("header", "true")
    .load(
        "wasbs://publicwasb@mmlspark.blob.core.windows.net/generated_sample_mvad_data.csv"
    )
)

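Cast columns to appropriate data types

# The CSV is read without schema inference, so every column arrives as a string.
df = (
    df.orderBy(timestampColumn)
    # Normalize timestamps to the same ISO 8601 format as the window boundaries.
    .withColumn(timestampColumn, F.date_format(timestampColumn, "yyyy-MM-dd'T'HH:mm:ss'Z'"))
    .withColumn("sensor_1", F.col("sensor_1").cast(DoubleType()))
    .withColumn("sensor_2", F.col("sensor_2").cast(DoubleType()))
    .withColumn("sensor_3", F.col("sensor_3").cast(DoubleType()))
    .drop("_c5")
)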

display(df)

Prepare training data

# filter to data with timestamps within the training window
df_train = df.filter(
    (F.col(timestampColumn) >= trainingStartTime)
    & (F.col(timestampColumn) <= trainingEndTime)
)
display(df_train)

Prepare test data

# filter to data with timestamps within the inference window
df_test = df.filter(
    (F.col(timestampColumn) >= inferenceStartTime)
    & (F.col(timestampColumn) <= inferenceEndTime)
)
display(df_test)
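Both filters compare the timestamp column with string literals. That works because every value uses the same fixed-width ISO 8601 UTC format, so lexicographic order matches chronological order. The plain-Python sketch below checks this on a few hypothetical timestamps; in_window and parse are illustrative helpers, not part of the notebook.

```python
from datetime import datetime, timezone

FMT = "%Y-%m-%dT%H:%M:%SZ"

training_start, training_end = "2022-02-24T06:00:00Z", "2022-03-08T23:55:00Z"
inference_start, inference_end = "2022-03-09T09:30:00Z", "2022-03-20T23:55:00Z"


def in_window(ts: str, start: str, end: str) -> bool:
    """Inclusive window test using plain string comparison."""
    return start <= ts <= end


def parse(ts: str) -> datetime:
    """Parse the same string into a timezone-aware datetime."""
    return datetime.strptime(ts, FMT).replace(tzinfo=timezone.utc)


# Hypothetical timestamps around the window boundaries
samples = [
    "2022-02-24T05:55:00Z",  # before the training window
    "2022-02-24T06:00:00Z",  # first training instant (inclusive)
    "2022-03-09T00:00:00Z",  # in the gap between the two windows
    "2022-03-09T09:30:00Z",  # first inference instant (inclusive)
    "2022-03-21T00:00:00Z",  # after the inference window
]

train = [ts for ts in samples if in_window(ts, training_start, training_end)]
test = [ts for ts in samples if in_window(ts, inference_start, inference_end)]

# String comparison agrees with real datetime comparison for this format.
agrees = all(
    in_window(ts, training_start, training_end)
    == (parse(training_start) <= parse(ts) <= parse(training_end))
    for ts in samples
)
print(train, test, agrees)
```

The two windows do not overlap, so rows used for training are never scored during inference.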

Train the Isolation Forest model

isolationForest = (
    IsolationForest()
    .setNumEstimators(num_estimators)
    .setBootstrap(False)
    .setMaxSamples(max_samples)
    .setMaxFeatures(max_features)
    .setFeaturesCol("features")
    .setPredictionCol("predictedLabel")
    .setScoreCol("outlierScore")
    .setContamination(contamination)
    .setContaminationError(0.01 * contamination)
    .setRandomSeed(1)
)

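Next, create an ML pipeline to train the Isolation Forest model. This step also shows how to create an MLflow experiment and register the trained model.

Registering the model with MLflow is strictly required only if you need to access the trained model at a later time. To train the model and run inference in the same notebook, the model object is sufficient.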

va = VectorAssembler(inputCols=inputCols, outputCol="features")
pipeline = Pipeline(stages=[va, isolationForest])
model = pipeline.fit(df_train)
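To give a feel for what fitting such a model involves, here is a small pure-Python sketch of the isolation idea from Liu et al.: build trees from random splits, then score a point by how quickly it gets isolated. It is a toy under stated assumptions, not SynapseML's implementation; all function names are hypothetical, the three "sensor" columns are synthetic Gaussian noise, and only the defaults (100 trees, 256 samples per tree) mirror the parameters used above.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def avg_path_length(n):
    """c(n) from Liu et al.; taken as 0 when fewer than two rows remain."""
    if n <= 1:
        return 0.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(rows, depth, max_depth, rng):
    """Split on a random feature at a random value until rows are isolated."""
    if depth >= max_depth or len(rows) <= 1:
        return {"size": len(rows)}
    feature = rng.randrange(len(rows[0]))
    low = min(r[feature] for r in rows)
    high = max(r[feature] for r in rows)
    if low == high:  # nothing left to split on this feature
        return {"size": len(rows)}
    split = rng.uniform(low, high)
    left = [r for r in rows if r[feature] < split]
    right = [r for r in rows if r[feature] >= split]
    return {
        "feature": feature,
        "split": split,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def path_length(tree, row):
    """Depth at which `row` lands, plus c(size) for the unsplit remainder."""
    depth = 0
    while "feature" in tree:
        side = "left" if row[tree["feature"]] < tree["split"] else "right"
        tree = tree[side]
        depth += 1
    return depth + avg_path_length(tree["size"])


def fit_forest(rows, num_trees=100, max_samples=256, seed=1):
    """Grow each tree on a random subsample drawn without replacement."""
    rng = random.Random(seed)
    sample_size = min(max_samples, len(rows))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [
        build_tree(rng.sample(rows, sample_size), 0, max_depth, rng)
        for _ in range(num_trees)
    ]
    return trees, sample_size


def outlier_score(forest, row):
    """Average the path length over all trees and normalise it to (0, 1)."""
    trees, sample_size = forest
    mean_depth = sum(path_length(t, row) for t in trees) / len(trees)
    return 2.0 ** (-mean_depth / avg_path_length(sample_size))


# Synthetic stand-in for three sensor columns
data_rng = random.Random(0)
train_rows = [tuple(data_rng.gauss(0.0, 1.0) for _ in range(3)) for _ in range(1000)]

forest = fit_forest(train_rows)
typical_score = outlier_score(forest, (0.0, 0.0, 0.0))
unusual_score = outlier_score(forest, (8.0, 8.0, 8.0))
print(f"typical={typical_score:.3f} unusual={unusual_score:.3f}")
```

A reading far outside the training range is separated from the rest after fewer splits, so it receives the higher score.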

Perform inference

df_test_pred = model.transform(df_test)
display(df_test_pred)

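Prebuilt Anomaly Detector

Azure AI Anomaly Detector

Anomaly status of the latest point: generates a model from the preceding points and determines whether the latest point is anomalous (Scala, Python)
Find anomalies: generates a model from the entire series and finds anomalies in the series (Scala, Python)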
