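Tutorial: Create, evaluate, and score a text classification model

Article
01/27/2025

This tutorial presents an end-to-end example of a Synapse Data Science workflow for a text classification model in Microsoft Fabric. The scenario uses word2vec and logistic regression on Spark to determine the genre of a book in the British Library book dataset, based solely on the book's title.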

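This tutorial covers these steps:

Install custom libraries
Load the data
Understand and process the data with exploratory data analysis
Train a machine learning model with word2vec and logistic regression, and track experiments with MLflow and the Fabric autologging feature
Load the machine learning model for scoring and predictions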

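Prerequisites

Get a Microsoft Fabric subscription. Or, sign up for a free Microsoft Fabric trial.
Sign in to Microsoft Fabric.
Use the experience switcher at the bottom left of your home page to switch to Fabric.

If you don't have a Microsoft Fabric lakehouse, create one by following the steps in Create a lakehouse in Microsoft Fabric.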

Follow along in a notebook

You can choose one of these options to follow along in a notebook:

Open and run the built-in notebook.
Upload your notebook from GitHub.

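Open the built-in notebook

The sample Title genre classification notebook accompanies this tutorial.

To open the sample notebook for this tutorial, follow the instructions in Prepare your system for data science tutorials.
Make sure to attach a lakehouse to the notebook before you start running code.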

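Import the notebook from GitHub

AIsample - Title Genre Classification.ipynb is the notebook that accompanies this tutorial.

To open the accompanying notebook for this tutorial, follow the instructions in Prepare your system for data science tutorials to import the notebook to your workspace.
If you'd rather copy and paste the code from this page, you can create a new notebook.
Make sure to attach a lakehouse to the notebook before you start running code.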

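Step 1: Install custom libraries

For machine learning model development or ad-hoc data analysis, you might need to quickly install a custom library for your Apache Spark session. You have two options to install libraries.

Use the inline installation capabilities (%pip or %conda) of your notebook to install a library in your current notebook only.
Alternatively, you can create a Fabric environment, install libraries from public sources or upload custom libraries to it, and then your workspace admin can attach the environment as the default for the workspace. All the libraries in the environment then become available to any notebook and Spark job definition in the workspace. For more information about environments, see Create, configure, and use an environment in Microsoft Fabric.

For the classification model, use the wordcloud library to represent word frequency in text, where the size of a word represents its frequency. For this tutorial, use %pip install to install wordcloud in your notebook.

Note

The PySpark kernel restarts after %pip install runs. Install the libraries you need before you run any other cells.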

# Install wordcloud for text visualization by using pip
%pip install wordcloud

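Step 2: Load the data

The dataset has metadata about books from the British Library that were digitized through a collaboration between the library and Microsoft. The metadata is classification information that indicates whether a book is fiction or nonfiction. With this dataset, the goal is to train a classification model that determines the genre of a book based only on its title.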

BL record ID	Type of resource	Name	Dates associated with name	Type of name	Role	All names	Title	Variant titles	Series title	Number within series	Country of publication	Place of publication	Publisher	Date of publication	Edition	Physical description	Dewey classification	BL shelfmark	Topics	Genre	Languages	Notes	BL record ID for physical resource	classification_id	user_id	created_at	subject_ids	annotator_date_pub	annotator_normalised_date_pub	annotator_edition_statement	annotator_genre	annotator_FAST_genre_terms	annotator_FAST_subject_terms	annotator_comments	annotator_main_language	annotator_other_languages_summaries	annotator_summaries_language	annotator_translation	annotator_original_language	annotator_publisher	annotator_place_pub	annotator_country	annotator_title	Link to digitised book	annotated
014602826	Monograph	Yearsley, Ann	1753-1806	person		More, Hannah, 1745-1833 [person]; Yearsley, Ann, 1753-1806 [person]	Poems on several occasions [With a prefatory letter by Hannah More.]				England	London		1786	Fourth edition MANUSCRIPT NOTE			Digital Store 11644.d.32			English		003996603																						False
014602830	Monograph	A, T.		person		Oldham, John, 1653-1683 [person]; A, T. [person]	A Satyr against Vertue. (A poem: supposed to be spoken by a Town-Hector [By John Oldham. The preface signed: T. A.])				England	London		1679		15 pages (4°)		Digital Store 11602.ee.10. (2.)			English		000001143																						False

Define the following parameters so that you can apply this notebook to different datasets:

IS_CUSTOM_DATA = False  # If True, the user must manually upload the dataset
DATA_FOLDER = "Files/title-genre-classification"
DATA_FILE = "blbooksgenre.csv"

# Data schema
TEXT_COL = "Title"
LABEL_COL = "annotator_genre"
LABELS = ["Fiction", "Non-fiction"]

EXPERIMENT_NAME = "sample-aisample-textclassification"  # MLflow experiment name

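Download the dataset and upload it to the lakehouse

This code downloads a publicly available version of the dataset and then stores it in a Fabric lakehouse.

Important

Add a lakehouse to the notebook before you run it. Otherwise, the code raises an error.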

if not IS_CUSTOM_DATA:
    # Download demo data files into the lakehouse, if they don't exist
    import os, requests

    remote_url = "https://synapseaisolutionsa.blob.core.windows.net/public/Title_Genre_Classification"
    fname = "blbooksgenre.csv"
    download_path = f"/lakehouse/default/{DATA_FOLDER}/raw"

    if not os.path.exists("/lakehouse/default"):
        # Add a lakehouse, if no default lakehouse was added to the notebook
        # A new notebook won't link to any lakehouse by default
        raise FileNotFoundError(
            "Default lakehouse not found, please add a lakehouse and restart the session."
        )
    os.makedirs(download_path, exist_ok=True)
    if not os.path.exists(f"{download_path}/{fname}"):
        r = requests.get(f"{remote_url}/{fname}", timeout=30)
        r.raise_for_status()  # Fail early instead of saving an HTTP error page as the CSV
        with open(f"{download_path}/{fname}", "wb") as f:
            f.write(r.content)
    print("Downloaded demo data files into lakehouse.")

Import required libraries

Before any processing, you need to import the required libraries, including the libraries for Spark and SynapseML:

import numpy as np
from itertools import chain

from wordcloud import WordCloud
import matplotlib.pyplot as plt
import seaborn as sns

import pyspark.sql.functions as F

from pyspark.ml import Pipeline
from pyspark.ml.feature import StopWordsRemover, StringIndexer, Tokenizer, Word2Vec
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import (
    BinaryClassificationEvaluator,
    MulticlassClassificationEvaluator,
)

from synapse.ml.stages import ClassBalancer
from synapse.ml.train import ComputeModelStatistics

import mlflow

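Define hyperparameters

Define some hyperparameters for model training.

Important

Modify these hyperparameters only if you understand each parameter.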

# Hyperparameters 
word2vec_size = 128  # The length of the vector for each word
min_word_count = 3  # The minimum number of times that a word must appear to be considered
max_iter = 10  # The maximum number of training iterations
k_folds = 3  # The number of folds for cross-validation

Start recording the time needed to run this notebook:

# Record the notebook running time
import time

ts = time.time()

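Set up MLflow experiment tracking

Autologging extends the MLflow logging capabilities. Autologging automatically captures the input parameter values and output metrics of a machine learning model as you train it. This information is then logged to the workspace, where you can access and visualize it with the MLflow APIs or the corresponding experiment in the workspace. To learn more about autologging, see Autologging in Microsoft Fabric.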

To disable Microsoft Fabric autologging in a notebook session, call mlflow.autolog() and set disable=True:

# Set up MLflow for experiment tracking
mlflow.set_experiment(EXPERIMENT_NAME)
mlflow.autolog(disable=True)  # Disable MLflow autologging

Read raw data from the lakehouse

raw_df = spark.read.csv(f"{DATA_FOLDER}/raw/{DATA_FILE}", header=True, inferSchema=True)

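Step 3: Perform exploratory data analysis

Explore the dataset with the display command to view high-level statistics for the dataset and to show the chart views: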

display(raw_df.limit(20))

Prepare the data

To clean the data, keep only the rows with a known label and remove duplicate titles:

df = (
    raw_df.select([TEXT_COL, LABEL_COL])
    .where(F.col(LABEL_COL).isin(LABELS))
    .dropDuplicates([TEXT_COL])
    .cache()
)

display(df.limit(20))

Apply class balancing to address any bias:

# Create a ClassBalancer instance, and set the input column to LABEL_COL
cb = ClassBalancer().setInputCol(LABEL_COL)

# Fit the ClassBalancer instance to the input DataFrame, and transform the DataFrame
df = cb.fit(df).transform(df)

# Display the first 20 rows of the transformed DataFrame
display(df.limit(20))
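
As a rough illustration of what the weight column means, the following sketch reproduces inverse-frequency class weighting in plain Python. It assumes that each row's weight is the size of the largest class divided by the size of the row's own class, which is our understanding of how ClassBalancer behaves rather than something this tutorial states. The helper name class_weights and the toy labels are hypothetical.

```python
from collections import Counter


def class_weights(labels):
    """Return a weight per label: largest class count / this class count.

    Assumption: this mirrors the inverse-frequency weighting that
    ClassBalancer is understood to apply; it is not the SynapseML code.
    """
    counts = Counter(labels)
    largest = max(counts.values())
    return {label: largest / count for label, count in counts.items()}


# Toy, made-up label column: three non-fiction titles for every fiction title
toy_labels = ["Non-fiction"] * 6 + ["Fiction"] * 2
weights = class_weights(toy_labels)
print(weights)
```

Under this assumption, rows of the rarer class carry a larger weight, so both classes contribute equally to a weighted loss or a weighted metric.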

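Split the paragraphs and sentences into smaller units to tokenize the dataset, which makes it easier to assign meaning. Then remove the stopwords to improve performance. Stopword removal removes words that commonly occur across all documents in the corpus. It is one of the most commonly used preprocessing steps in natural language processing (NLP) applications.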

# Text transformer
tokenizer = Tokenizer(inputCol=TEXT_COL, outputCol="tokens")
stopwords_remover = StopWordsRemover(inputCol="tokens", outputCol="filtered_tokens")

# Build the pipeline
pipeline = Pipeline(stages=[tokenizer, stopwords_remover])

token_df = pipeline.fit(df).transform(df)

display(token_df.limit(20))
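
To see the effect of these two stages on a single title, here is a minimal plain-Python sketch. It assumes the tokenizer lowercases the text and splits it on whitespace, and it uses a deliberately tiny, hypothetical stop-word set (STOP_WORDS) in place of Spark's much longer default English list.

```python
# Hypothetical, deliberately tiny stop-word list for illustration only
STOP_WORDS = {"a", "an", "and", "by", "of", "on", "the", "to", "with"}


def tokenize(title):
    """Lowercase the title and split it on whitespace."""
    return title.lower().split()


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


title = "Poems on several occasions"
tokens = tokenize(title)
filtered_tokens = remove_stop_words(tokens)
print(tokens)
print(filtered_tokens)
```

Because the split is on whitespace only, punctuation stays attached to its neighboring word in this sketch.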

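Display a word cloud for each class. A word cloud is a visually prominent presentation of the keywords that appear frequently in text data. It is effective because the rendering of keywords forms a cloud-like color picture that captures the main text data at a glance. Learn more about wordcloud.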

# WordCloud
for label in LABELS:
    tokens = (
        token_df.where(F.col(LABEL_COL) == label)
        .select(F.explode("filtered_tokens").alias("token"))
        .where(F.col("token").rlike(r"^\w+$"))
    )

    top50_tokens = (
        tokens.groupBy("token").count().orderBy(F.desc("count")).limit(50).collect()
    )

    # Generate a wordcloud image
    wordcloud = WordCloud(
        scale=10,
        background_color="white",
        random_state=42,  # Make sure the output is always the same for the same input
    ).generate_from_frequencies(dict(top50_tokens))

    # Display the generated image by using matplotlib
    plt.figure(figsize=(10, 10))
    plt.title(label, fontsize=20)
    plt.axis("off")
    plt.imshow(wordcloud, interpolation="bilinear")
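
The counting step in the loop above (explode, filter, group, count, order, limit) can be sketched without Spark. The token lists below are made up, and the cutoff of 2 stands in for the top 50 used in the notebook.

```python
import re
from collections import Counter

# Made-up filtered tokens from a handful of titles in one class
filtered_tokens = [
    ["history", "england"],
    ["history", "london,"],
    ["tales", "history"],
    ["england", "tales"],
]

word_pattern = re.compile(r"^\w+$")  # same pattern as the rlike filter

# Flatten the token lists, keep word-only tokens, and count them
counts = Counter(
    token
    for tokens in filtered_tokens
    for token in tokens
    if word_pattern.match(token)
)

# Keep the most frequent tokens as a token -> count mapping
top_tokens = dict(counts.most_common(2))
print(top_tokens)
```

A token-to-count mapping like top_tokens is the shape of input that generate_from_frequencies takes.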

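Finally, use word2vec to vectorize the text. The word2vec technique creates a vector representation of each word in the text. Words that are used in similar contexts, or that have semantic relationships, are captured through their closeness in the vector space. This closeness means that similar words have similar word vectors.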

# Label transformer
label_indexer = StringIndexer(inputCol=LABEL_COL, outputCol="labelIdx")
vectorizer = Word2Vec(
    vectorSize=word2vec_size,
    minCount=min_word_count,
    inputCol="filtered_tokens",
    outputCol="features",
)

# Build the pipeline
pipeline = Pipeline(stages=[label_indexer, vectorizer])
vec_df = (
    pipeline.fit(token_df)
    .transform(token_df)
    .select([TEXT_COL, LABEL_COL, "features", "labelIdx", "weight"])
)

display(vec_df.limit(20))
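
Spark's Word2Vec model turns each title into one feature vector by averaging the vectors of its words. The sketch below illustrates that idea, and the notion of closeness, with hypothetical three-dimensional vectors. The real vectors are learned from the titles and have word2vec_size dimensions, and the handling of unknown tokens here is a simplification.

```python
import math

# Hypothetical word vectors, chosen by hand for illustration
word_vectors = {
    "poems": [0.9, 0.1, 0.0],
    "verse": [0.8, 0.2, 0.1],
    "history": [0.0, 0.9, 0.3],
}


def title_vector(tokens):
    """Average the vectors of the known tokens in a title.

    Simplification: unknown tokens are skipped; Spark may treat them differently.
    """
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0, 0.0, 0.0]
    return [sum(dim) / len(known) for dim in zip(*known)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


features = title_vector(["poems", "verse"])
close = cosine(word_vectors["poems"], word_vectors["verse"])
far = cosine(word_vectors["poems"], word_vectors["history"])
print(features, close, far)
```

With these toy vectors, "poems" is much closer to "verse" than to "history", which is the kind of signal the logistic regression model can pick up from the averaged features.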

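Step 4: Train and evaluate the model

With the data in place, define the model. In this section, you train a logistic regression model to classify the vectorized text.

Prepare training and test datasets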

# Split the dataset into training and testing
(train_df, test_df) = vec_df.randomSplit((0.8, 0.2), seed=42)

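Track machine learning experiments

A machine learning experiment is the primary unit of organization and control for all related machine learning runs. A run corresponds to a single execution of model code.

Machine learning experiment tracking manages all the experiments and their components, for example parameters, metrics, models, and other artifacts. Tracking organizes all the required components of a specific machine learning experiment. It also lets you easily reproduce past results with saved experiments. Learn more about machine learning experiments in Microsoft Fabric.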

# Build the logistic regression classifier
lr = (
    LogisticRegression()
    .setMaxIter(max_iter)
    .setFeaturesCol("features")
    .setLabelCol("labelIdx")
    .setWeightCol("weight")
)

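Tune hyperparameters

Build a grid of parameters to search over the hyperparameters. Then build a cross-validator estimator to produce a CrossValidator model: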

# Build a grid search to select the best values for the training parameters
param_grid = (
    ParamGridBuilder()
    .addGrid(lr.regParam, [0.03, 0.1])
    .addGrid(lr.elasticNetParam, [0.0, 0.1])
    .build()
)

if len(LABELS) > 2:
    evaluator_cls = MulticlassClassificationEvaluator
    evaluator_metrics = ["f1", "accuracy"]
else:
    evaluator_cls = BinaryClassificationEvaluator
    evaluator_metrics = ["areaUnderROC", "areaUnderPR"]
evaluator = evaluator_cls(labelCol="labelIdx", weightCol="weight")

# Build a cross-validator estimator
crossval = CrossValidator(
    estimator=lr,
    estimatorParamMaps=param_grid,
    evaluator=evaluator,
    numFolds=k_folds,
    collectSubModels=True,
)
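
To get a feel for the size of this search, the following sketch enumerates the same candidate values in plain Python. It only counts combinations and does not train anything. The variable names are chosen for illustration.

```python
from itertools import product

# Same candidate values as the parameter grid
reg_params = [0.03, 0.1]
elastic_net_params = [0.0, 0.1]
k_folds = 3

# Every combination of the two hyperparameters is one candidate model
grid = [
    {"regParam": reg, "elasticNetParam": enet}
    for reg, enet in product(reg_params, elastic_net_params)
]

# Each candidate is trained once per fold during the search
fits_during_search = len(grid) * k_folds

for candidate in grid:
    print(candidate)
print(f"{len(grid)} candidates x {k_folds} folds = {fits_during_search} fits")
```

Adding a value to either list multiplies the number of candidates, so the cost of the search grows quickly with the size of the grid.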

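Evaluate the model

You can evaluate the models on the test dataset to compare them. A well-trained model should show high performance on the relevant metrics when it runs against the validation and test datasets.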

def evaluate(model, df):
    log_metric = {}
    prediction = model.transform(df)
    for metric in evaluator_metrics:
        value = evaluator.evaluate(prediction, {evaluator.metricName: metric})
        log_metric[metric] = value
        print(f"{metric}: {value:.4f}")
    return prediction, log_metric

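Track experiments by using MLflow

Start the training and evaluation process. Use MLflow to track all experiments, and to log parameters, metrics, and models. All of this information is logged under the experiment name in the workspace.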

with mlflow.start_run(run_name="lr"):
    models = crossval.fit(train_df)
    best_metrics = {k: 0 for k in evaluator_metrics}
    best_index = 0
    for idx, model in enumerate(models.subModels[0]):
        with mlflow.start_run(nested=True, run_name=f"lr_{idx}") as run:
            print("\nEvaluating on test data:")
            print(f"subModel No. {idx + 1}")
            prediction, log_metric = evaluate(model, test_df)

            if log_metric[evaluator_metrics[0]] > best_metrics[evaluator_metrics[0]]:
                best_metrics = log_metric
                best_index = idx

            print("log model")
            mlflow.spark.log_model(
                model,
                f"{EXPERIMENT_NAME}-lrmodel",
                registered_model_name=f"{EXPERIMENT_NAME}-lrmodel",
                dfs_tmpdir="Files/spark",
            )

            print("log metrics")
            mlflow.log_metrics(log_metric)

            print("log parameters")
            mlflow.log_params(
                {
                    "word2vec_size": word2vec_size,
                    "min_word_count": min_word_count,
                    "max_iter": max_iter,
                    "k_folds": k_folds,
                    "DATA_FILE": DATA_FILE,
                }
            )

    # Log the best model and its relevant metrics and parameters to the parent run
    mlflow.spark.log_model(
        models.subModels[0][best_index],
        f"{EXPERIMENT_NAME}-lrmodel",
        registered_model_name=f"{EXPERIMENT_NAME}-lrmodel",
        dfs_tmpdir="Files/spark",
    )
    mlflow.log_metrics(best_metrics)
    mlflow.log_params(
        {
            "word2vec_size": word2vec_size,
            "min_word_count": min_word_count,
            "max_iter": max_iter,
            "k_folds": k_folds,
            "DATA_FILE": DATA_FILE,
        }
    )
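
The selection rule inside the loop can be isolated as a small plain-Python sketch. The metric values are made up. The point is that only the first entry of evaluator_metrics decides which candidate is treated as the best one.

```python
# Made-up test metrics for four candidate models
candidate_metrics = [
    {"areaUnderROC": 0.81, "areaUnderPR": 0.78},
    {"areaUnderROC": 0.84, "areaUnderPR": 0.80},
    {"areaUnderROC": 0.83, "areaUnderPR": 0.82},
    {"areaUnderROC": 0.79, "areaUnderPR": 0.75},
]
primary_metric = "areaUnderROC"  # first entry of evaluator_metrics in the binary case

best_index = 0
best_metrics = {primary_metric: 0}
for idx, metrics in enumerate(candidate_metrics):
    # Strictly greater: on a tie, the earlier candidate stays selected
    if metrics[primary_metric] > best_metrics[primary_metric]:
        best_metrics = metrics
        best_index = idx

print(best_index, best_metrics)
```

In this toy data, the candidate at index 2 has the highest areaUnderPR but is not selected, because the comparison looks only at the primary metric.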

To view your experiments:

In the left navigation, select your workspace.
Find and select the experiment name, in this case sample-aisample-textclassification.

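Step 5: Score and save prediction results

Microsoft Fabric lets users operationalize machine learning models with the PREDICT scalable function. This function supports batch scoring (or batch inferencing) in any compute engine. You can create batch predictions directly from a notebook or from the item page for a particular model. To learn more about PREDICT and how to use it in Fabric, see Machine learning model scoring with PREDICT in Microsoft Fabric.

In the preceding evaluation results, model 1 has the largest values for both the Area Under the Precision-Recall Curve (AUPRC) and the Area Under the Receiver Operating Characteristic Curve (AUC-ROC). For this reason, use model 1 for prediction.

The AUC-ROC measure is widely used to measure binary classifier performance. However, it is sometimes more appropriate to evaluate the classifier based on AUPRC measurements. The AUC-ROC chart visualizes the trade-off between the true positive rate (TPR) and the false positive rate (FPR). The AUPRC curve combines precision (positive predictive value, or PPV) and recall (true positive rate, or TPR) in a single visualization.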
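
To make the quantities behind both curves concrete, the following sketch computes TPR, FPR, and precision at a single score threshold. The labels, scores, and the helper name confusion_counts are made up for illustration. Each curve is traced by repeating this calculation over many thresholds.

```python
def confusion_counts(labels, scores, threshold):
    """Count true/false positives and negatives at one score threshold."""
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


# Made-up labels (1 = positive class) and classifier scores
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.2, 0.1]

tp, fp, tn, fn = confusion_counts(labels, scores, threshold=0.5)
tpr = tp / (tp + fn)        # recall, the y-axis of the ROC curve
fpr = fp / (fp + tn)        # the x-axis of the ROC curve
precision = tp / (tp + fp)  # PPV, paired with recall on the PR curve
print(f"TPR={tpr:.2f} FPR={fpr:.2f} precision={precision:.2f}")
```

FPR is divided by the number of negatives, so it stays small when negatives are plentiful, while precision reacts directly to every false positive. This is one reason AUPRC can be more informative on imbalanced data.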

# Load the best model
model_uri = f"models:/{EXPERIMENT_NAME}-lrmodel/1"
loaded_model = mlflow.spark.load_model(model_uri, dfs_tmpdir="Files/spark")

# Verify the loaded model
batch_predictions = loaded_model.transform(test_df)
batch_predictions.show(5)

# Save the batch predictions to the lakehouse
batch_predictions.write.format("delta").mode("overwrite").save(
    f"{DATA_FOLDER}/predictions/batch_predictions"
)

# Determine the entire runtime
print(f"Full run cost {int(time.time() - ts)} seconds.")
