Train PyTorch models at scale with Azure Machine Learning

Article
09/17/2024

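Applies to: Python SDK azure-ai-ml v2 (current)

In this article, you learn how to train, hyperparameter tune, and deploy a PyTorch model using the Azure Machine Learning Python SDK v2.

You use example scripts to classify chicken and turkey images and build a deep learning neural network (DNN) based on PyTorch's transfer learning tutorial. Transfer learning is a technique that applies knowledge gained from solving one problem to a different but related problem. It shortens the training process by requiring less data, time, and compute resources than training from scratch. To learn more about transfer learning, see Deep learning vs. machine learning.

Whether you're training a deep learning PyTorch model from the ground up or bringing an existing model into the cloud, you can use Azure Machine Learning to scale out open-source training jobs using elastic cloud compute resources. You can build, deploy, version, and monitor production-grade models with Azure Machine Learning.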

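Prerequisites

An Azure subscription. If you don't have one already, create a free account.
Run the code in this article using either an Azure Machine Learning compute instance or your own Jupyter notebook:
- Azure Machine Learning compute instance (no downloads or installation necessary):
  - Complete Quickstart: Get started with Azure Machine Learning to create a dedicated notebook server preloaded with the SDK and the sample repository.
  - Under the Samples tab in the Notebooks section of your workspace, find a completed and expanded notebook by navigating to this directory: SDK v2/sdk/python/jobs/single-step/pytorch/train-hyperparameter-tune-deploy-with-pytorch
- Your Jupyter notebook server:
  - Install the Azure Machine Learning SDK (v2).
  - Download the training script file pytorch_train.py.

You can also find a completed Jupyter notebook version of this guide on the GitHub samples page.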

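Set up the job

This section sets up the job for training by loading the required Python packages, connecting to a workspace, creating a compute resource to run a command job, and creating an environment to run the job.

Connect to the workspace

First, you need to connect to your Azure Machine Learning workspace. The workspace is the top-level resource for the service. It gives you a centralized place to work with all the artifacts you create when you use Azure Machine Learning.

We're using DefaultAzureCredential to get access to the workspace. This credential should be capable of handling most Azure SDK authentication scenarios.

If DefaultAzureCredential doesn't work for you, see the azure.identity package or Set up authentication for more available credentials.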

# Handle to the workspace
from azure.ai.ml import MLClient

# Authentication package
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()

If you prefer to use a browser to sign in and authenticate, uncomment the following code and use it instead.

# Handle to the workspace
# from azure.ai.ml import MLClient

# Authentication package
# from azure.identity import InteractiveBrowserCredential
# credential = InteractiveBrowserCredential()

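Next, get a handle to the workspace by providing your subscription ID, resource group name, and workspace name. To find these parameters:

Look for your workspace name in the upper-right corner of the Azure Machine Learning studio toolbar.
Select your workspace name to show your resource group and subscription ID.
Copy the values for resource group and subscription ID into the code.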

# Get a handle to the workspace
ml_client = MLClient(
    credential=credential,
    subscription_id="<SUBSCRIPTION_ID>",
    resource_group_name="<RESOURCE_GROUP>",
    workspace_name="<AML_WORKSPACE_NAME>",
)

The result of running this script is a workspace handle that you can use to manage other resources and jobs.
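Hard-coding identifiers in a notebook makes it easy to leak them into source control. As a minimal sketch, the three values could instead be read from environment variables. The variable names (AZURE_SUBSCRIPTION_ID, AZURE_RESOURCE_GROUP, AZURE_ML_WORKSPACE) and the load_workspace_settings helper are assumptions made for illustration, not names that the SDK requires.

```python
import os


def load_workspace_settings(env=None):
    """Collect the three workspace identifiers from environment variables.

    The variable names are illustrative assumptions, not SDK requirements.
    """
    env = os.environ if env is None else env
    names = {
        "subscription_id": "AZURE_SUBSCRIPTION_ID",
        "resource_group_name": "AZURE_RESOURCE_GROUP",
        "workspace_name": "AZURE_ML_WORKSPACE",
    }
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {key: env[var] for key, var in names.items()}


# Example with an explicit mapping so the sketch runs anywhere.
settings = load_workspace_settings(
    {
        "AZURE_SUBSCRIPTION_ID": "<SUBSCRIPTION_ID>",
        "AZURE_RESOURCE_GROUP": "<RESOURCE_GROUP>",
        "AZURE_ML_WORKSPACE": "<AML_WORKSPACE_NAME>",
    }
)
print(settings)
```

The returned dictionary uses the same keyword names as the MLClient call above, so it could be passed as MLClient(credential=credential, **settings).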

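Note

Creating MLClient doesn't connect the client to the workspace. The client initialization is lazy: it waits until the first time it needs to make a call. In this article, that happens during compute creation.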
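Lazy initialization means that construction is cheap and the expensive connection is deferred until the first call that needs it. The following is a minimal sketch of that pattern in plain Python. The LazyClient class and its _connect method are hypothetical and don't reflect how MLClient is implemented internally.

```python
class LazyClient:
    """Hypothetical client that defers its connection until first use."""

    def __init__(self, workspace_name):
        self.workspace_name = workspace_name
        self._connection = None  # nothing is contacted at construction time
        self.connect_calls = 0

    def _connect(self):
        # Stand-in for an expensive network handshake.
        self.connect_calls += 1
        return f"connected:{self.workspace_name}"

    @property
    def connection(self):
        # Connect on first access, then reuse the cached connection.
        if self._connection is None:
            self._connection = self._connect()
        return self._connection

    def get_compute(self, name):
        # The first call that needs the service triggers the connection.
        return f"{self.connection}/compute/{name}"


client = LazyClient("my-workspace")
print(client.connect_calls)  # no connection has been made yet
client.get_compute("gpu-cluster")
client.get_compute("gpu-cluster")
print(client.connect_calls)  # the connection was made exactly once
```

One practical consequence of this pattern is that a wrong identifier or an expired credential tends to surface on the first service call rather than when the client object is created.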

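Create a compute resource to run the job

Azure Machine Learning needs a compute resource to run a job. This resource can be a single-node or multi-node machine with a Linux or Windows OS, or a specific compute fabric like Spark.

In the following example script, we provision a Linux compute cluster. You can see the Azure Machine Learning pricing page for the full list of VM sizes and prices. Because this example needs a GPU cluster, let's pick the STANDARD_NC6s_v3 size and create an Azure Machine Learning compute.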

from azure.ai.ml.entities import AmlCompute

gpu_compute_target = "gpu-cluster"

try:
    # let's see if the compute target already exists
    gpu_cluster = ml_client.compute.get(gpu_compute_target)
    print(
        f"You already have a cluster named {gpu_compute_target}, we'll reuse it as is."
    )

except Exception:
    print("Creating a new gpu compute target...")

    # Let's create the Azure ML compute object with the intended parameters
    gpu_cluster = AmlCompute(
        # Name assigned to the compute cluster
        name=gpu_compute_target,
        # Azure ML Compute is the on-demand VM service
        type="amlcompute",
        # VM Family
        size="STANDARD_NC6s_v3",
        # Minimum running nodes when there is no job running
        min_instances=0,
        # Nodes in cluster
        max_instances=4,
        # Seconds a node can sit idle before the cluster scales it down
        idle_time_before_scale_down=180,
        # Dedicated or LowPriority. The latter is cheaper but there is a chance of job termination
        tier="Dedicated",
    )

    # Now, we pass the object to MLClient's create_or_update method
    gpu_cluster = ml_client.begin_create_or_update(gpu_cluster).result()

print(
    f"AMLCompute with name {gpu_cluster.name} is ready, the compute size is {gpu_cluster.size}"
)

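Create a job environment

To run an Azure Machine Learning job, you need an environment. An Azure Machine Learning environment encapsulates the dependencies (such as software runtime and libraries) needed to run your machine learning training script on your compute resource. This environment is similar to a Python environment on your local machine.

Azure Machine Learning allows you to either use a curated (or ready-made) environment or create a custom environment using a Docker image or a Conda configuration. In this article, you reuse the curated Azure Machine Learning environment AzureML-acpt-pytorch-2.2-cuda12.1. Use the latest version of this environment by using the @latest directive.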

curated_env_name = "AzureML-acpt-pytorch-2.2-cuda12.1@latest"

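Configure and submit your training job

In this section, we begin by introducing the data for training. We then cover how to run a training job using a training script that we've provided. You learn to build the training job by configuring the command for running the training script. Then, you submit the training job to run in Azure Machine Learning.

Obtain the training data

You can use the dataset in this zipped file. This dataset consists of about 120 training images each for two classes (turkeys and chickens), with 100 validation images for each class. The images are a subset of the Open Images v5 Dataset. The training script pytorch_train.py downloads and extracts the dataset.

Prepare the training script

In the prerequisites section, we provided the training script pytorch_train.py. In practice, you should be able to take any custom training script as is and run it with Azure Machine Learning without having to modify your code.

The provided training script downloads the data, trains a model, and registers the model.

Build the training job

Now that you have all the assets required to run your job, it's time to build it using the Azure Machine Learning Python SDK v2. For this example, we create a command.

An Azure Machine Learning command is a resource that specifies all the details needed to execute your training code in the cloud. These details include the inputs and outputs, type of hardware to use, software to install, and how to run your code. The command contains information to execute a single command.

Configure the command

You use the general-purpose command to run the training script and perform your desired tasks. Create a command object to specify the configuration details of your training job.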

from azure.ai.ml import command
from azure.ai.ml import Input

job = command(
    inputs=dict(
        num_epochs=30, learning_rate=0.001, momentum=0.9, output_dir="./outputs"
    ),
    compute=gpu_compute_target,
    environment=curated_env_name,
    code="./src/",  # location of source code
    command="python pytorch_train.py --num_epochs ${{inputs.num_epochs}} --output_dir ${{inputs.output_dir}}",
    experiment_name="pytorch-birds",
    display_name="pytorch-birds-image",
)

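The inputs for this command include the number of epochs, learning rate, momentum, and output directory.
For the parameter values:
1. Provide the compute cluster gpu_compute_target = "gpu-cluster" that you created for running this command.
2. Provide the curated environment that you initialized earlier.
3. If you're not using the completed notebook in the Samples folder, specify the location of the pytorch_train.py file.
4. Configure the command line action itself; in this case, the command is python pytorch_train.py. You can access the inputs and outputs in the command through the ${{ ... }} notation.
5. Configure metadata such as the display name and experiment name, where an experiment is a container for all the iterations done on a certain project. All the jobs submitted under the same experiment name are listed next to each other in Azure Machine Learning studio.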
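To make the ${{ ... }} notation concrete, here is a small sketch that resolves placeholders against a dictionary of inputs. The render_command helper is hypothetical and only mimics the substitution for illustration; the service performs the real resolution when the job runs.

```python
import re

# Matches placeholders of the form ${{inputs.name}}, allowing inner spaces.
PLACEHOLDER = re.compile(r"\$\{\{\s*inputs\.(\w+)\s*\}\}")


def render_command(template, inputs):
    """Replace each ${{inputs.name}} placeholder with the matching input value."""

    def substitute(match):
        name = match.group(1)
        if name not in inputs:
            raise KeyError(f"Unknown input: {name}")
        return str(inputs[name])

    return PLACEHOLDER.sub(substitute, template)


template = (
    "python pytorch_train.py --num_epochs ${{inputs.num_epochs}} "
    "--output_dir ${{inputs.output_dir}}"
)
inputs = dict(num_epochs=30, learning_rate=0.001, momentum=0.9, output_dir="./outputs")
rendered = render_command(template, inputs)
print(rendered)
```

In this sketch, an input that isn't referenced by a placeholder doesn't appear in the rendered command line, so it's worth checking that every input you intend the script to receive has a matching placeholder.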

Submit the job

It's now time to submit the job to run in Azure Machine Learning. This time, you use create_or_update on ml_client.jobs.

ml_client.jobs.create_or_update(job)

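Once completed, the job registers a model in your workspace (as a result of training) and outputs a link for viewing the job in Azure Machine Learning studio.

Warning

Azure Machine Learning runs training scripts by copying the entire source directory. If you have sensitive data that you don't want to upload, use a .ignore file or don't include it in the source directory.

What happens during job execution

As the job is executed, it goes through the following stages:

Preparing: A Docker image is created according to the environment defined. The image is uploaded to the workspace's container registry and cached for later runs. Logs are also streamed to the job history and can be viewed to monitor progress. If a curated environment is specified, the cached image backing that curated environment is used.
Scaling: The cluster attempts to scale up if it requires more nodes to execute the run than are currently available.
Running: All scripts in the script folder src are uploaded to the compute target, data stores are mounted or copied, and the script is executed. Outputs from stdout and the ./logs folder are streamed to the job history and can be used to monitor the job.

Tune model hyperparameters

You trained the model with one set of parameters; let's now see if you can further improve the accuracy of your model. You can tune and optimize your model's hyperparameters by using Azure Machine Learning's sweep capabilities.

To tune the model's hyperparameters, define the parameter space in which to search during training. You do this by replacing some of the parameters passed to the training job with special inputs from the azure.ai.ml.sweep package.

Since the training script uses a learning rate schedule to decay the learning rate every several epochs, you can tune the initial learning rate and the momentum parameters.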
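A step schedule multiplies the learning rate by a fixed factor every few epochs, so the initial value sets the scale for the whole run. The sketch below assumes a step size of 7 epochs and a decay factor of 0.1, the values used in the PyTorch transfer learning tutorial; the values in pytorch_train.py may differ, and the step_decay helper is written here only for illustration.

```python
def step_decay(initial_lr, epoch, step_size=7, gamma=0.1):
    """Learning rate at a given epoch under a step schedule.

    step_size and gamma are assumed values for illustration.
    """
    return initial_lr * gamma ** (epoch // step_size)


# Compare the two ends of a candidate range for the initial learning rate.
for initial_lr in (0.0005, 0.005):
    schedule = [step_decay(initial_lr, epoch) for epoch in (0, 7, 14, 21)]
    print(initial_lr, schedule)
```

Because every later value is a fixed multiple of the starting one, tuning the initial learning rate shifts the entire schedule up or down.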

from azure.ai.ml.sweep import Uniform

# we will reuse the command_job created before. we call it as a function so that we can apply inputs
job_for_sweep = job(
    learning_rate=Uniform(min_value=0.0005, max_value=0.005),
    momentum=Uniform(min_value=0.9, max_value=0.99),
)
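Each Uniform input describes a range from which trial values are drawn. As a rough sketch of what random sampling over this space looks like, the following draws a handful of configurations with Python's random module. The sample_configs helper and the fixed seed are illustrative assumptions; the service does its own sampling.

```python
import random

# Same ranges as the Uniform inputs above.
search_space = {
    "learning_rate": (0.0005, 0.005),
    "momentum": (0.9, 0.99),
}


def sample_configs(space, num_trials, seed=0):
    """Draw independent uniform samples for every hyperparameter."""
    rng = random.Random(seed)
    return [
        {name: rng.uniform(low, high) for name, (low, high) in space.items()}
        for _ in range(num_trials)
    ]


configs = sample_configs(search_space, num_trials=8)
for config in configs:
    print(config)
```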

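Then, you can configure sweep on the command job, using some sweep-specific parameters, such as the primary metric to watch and the sampling algorithm to use.

In the following code, we use random sampling to try different configuration sets of hyperparameters in an attempt to maximize our primary metric, best_val_acc.

We also define an early termination policy, BanditPolicy, to terminate poorly performing runs early. The BanditPolicy terminates any run that doesn't fall within the slack factor of our primary evaluation metric. You apply this policy every epoch (since we report our best_val_acc metric every epoch and evaluation_interval=1). Notice that we delay the first policy evaluation until after the first 10 epochs (delay_evaluation=10).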

from azure.ai.ml.sweep import BanditPolicy

sweep_job = job_for_sweep.sweep(
    compute="gpu-cluster",
    sampling_algorithm="random",
    primary_metric="best_val_acc",
    goal="Maximize",
    max_total_trials=8,
    max_concurrent_trials=4,
    early_termination_policy=BanditPolicy(
        slack_factor=0.15, evaluation_interval=1, delay_evaluation=10
    ),
)
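For a metric that is being maximized, a bandit policy with a slack factor compares each run against the best run so far at the same evaluation interval and stops runs that fall below best / (1 + slack_factor). The sketch below applies that rule with the settings used above. The should_terminate helper and the sample accuracies are made up for illustration, and the sketch assumes epochs are counted from 1.

```python
def should_terminate(metric, best_metric, slack_factor, epoch, delay_evaluation=0):
    """Return True if a run falls outside the slack allowed by the policy.

    Applies to a metric that is being maximized. Assumes epochs start at 1.
    """
    if epoch <= delay_evaluation:
        return False  # the policy isn't evaluated during the delay
    threshold = best_metric / (1 + slack_factor)
    return metric < threshold


best_so_far = 0.92  # hypothetical best best_val_acc across all runs
print(best_so_far / (1 + 0.15))  # cutoff, roughly 0.80

# Hypothetical runs evaluated at epoch 11, after the 10-epoch delay.
print(should_terminate(0.75, best_so_far, 0.15, epoch=11, delay_evaluation=10))
print(should_terminate(0.85, best_so_far, 0.15, epoch=11, delay_evaluation=10))

# The same weak run at epoch 5 is still protected by the delay.
print(should_terminate(0.75, best_so_far, 0.15, epoch=5, delay_evaluation=10))
```

A larger slack factor lowers the cutoff and lets more runs continue, while a smaller one stops runs more aggressively.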

Now, you can submit this job as before. This time, you're running a sweep job that sweeps over your training job.

returned_sweep_job = ml_client.create_or_update(sweep_job)

# stream the output and wait until the job is finished
ml_client.jobs.stream(returned_sweep_job.name)

# refresh the latest status of the job after streaming
returned_sweep_job = ml_client.jobs.get(name=returned_sweep_job.name)

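You can monitor the job by using the studio user interface link that's presented during the job run.

Find the best model

Once all the runs complete, you can find the run that produced the model with the highest accuracy.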

from azure.ai.ml.entities import Model

if returned_sweep_job.status == "Completed":

    # First let us get the run which gave us the best result
    best_run = returned_sweep_job.properties["best_child_run_id"]

    # lets get the model from this run
    model = Model(
        # the script stores the model as "outputs"
        path="azureml://jobs/{}/outputs/artifacts/paths/outputs/".format(best_run),
        name="run-model-example",
        description="Model created from run.",
        type="custom_model",
    )

else:
    print(
        "Sweep job status: {}. Please wait until it completes".format(
            returned_sweep_job.status
        )
    )

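Deploy the model as an online endpoint

You can now deploy your model as an online endpoint, that is, as a web service in the Azure cloud.

To deploy a machine learning service, you typically need:

The model assets that you want to deploy. These assets include the model's file and metadata that you already registered in your training job.
Some code to run as a service. The code executes the model on a given input request (an entry script). This entry script receives data submitted to a deployed web service and passes it to the model. After the model processes the data, the script returns the model's response to the client. The script is specific to your model and must understand the data that the model expects and returns. When you use an MLflow model, Azure Machine Learning automatically creates this script for you.

For more information about deployment, see Deploy and score a machine learning model with managed online endpoint using Python SDK v2.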
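An entry script for an online deployment conventionally defines two functions: init(), which runs once when the container starts and loads the model, and run(), which handles each request. The following sketch shows that shape only. The stand-in model, the class names, and the response format are assumptions for illustration; the score.py file used in this article loads the trained PyTorch model instead.

```python
import json

model = None
CLASSES = ["chicken", "turkey"]  # assumed label order, for illustration only


def init():
    """Run once at startup: load the model into a module-level variable."""
    global model

    # Stand-in for loading trained weights from disk. This hypothetical
    # "model" scores each class by summing one half of the input values.
    def fake_model(values):
        half = len(values) // 2
        return [sum(values[:half]), sum(values[half:])]

    model = fake_model


def run(raw_data):
    """Run per request: parse the JSON body, call the model, return a result."""
    values = json.loads(raw_data)["data"]
    scores = model(values)
    best = max(range(len(scores)), key=scores.__getitem__)
    return {"label": CLASSES[best], "scores": scores}


init()
response = run(json.dumps({"data": [0.1, 0.2, 0.9, 0.8]}))
print(response)
```

Loading the model in init() rather than in run() keeps the expensive step out of the per-request path.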

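Create a new online endpoint

As a first step to deploying your model, you need to create your online endpoint. The endpoint name must be unique in the entire Azure region. For this article, you create a unique name using a universally unique identifier (UUID).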

import uuid

# Creating a unique name for the endpoint
online_endpoint_name = "aci-birds-endpoint-" + str(uuid.uuid4())[:8]

from azure.ai.ml.entities import ManagedOnlineEndpoint

# create an online endpoint
endpoint = ManagedOnlineEndpoint(
    name=online_endpoint_name,
    description="Classify turkey/chickens using transfer learning with PyTorch",
    auth_mode="key",
    tags={"data": "birds", "method": "transfer learning", "framework": "pytorch"},
)

endpoint = ml_client.begin_create_or_update(endpoint).result()

print(f"Endpoint {endpoint.name} provisioning state: {endpoint.provisioning_state}")

Once you create the endpoint, you can retrieve it as follows:

endpoint = ml_client.online_endpoints.get(name=online_endpoint_name)

print(
    f'Endpoint "{endpoint.name}" with provisioning state "{endpoint.provisioning_state}" is retrieved'
)

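Deploy the model to the endpoint

You can now deploy the model with the entry script. An endpoint can have multiple deployments. Using rules, the endpoint can then direct traffic to these deployments.

In the following code, you create a single deployment that handles 100% of the incoming traffic. We specified an arbitrary color name, aci-blue, for the deployment. You could also use any other name, such as aci-green or aci-red.

The code to deploy the model to the endpoint:

Deploys the best version of the model that you registered earlier.
Scores the model by using the score.py file.
Uses the curated environment (that you specified earlier) to perform inferencing.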

from azure.ai.ml.entities import (
    ManagedOnlineDeployment,
    Model,
    Environment,
    CodeConfiguration,
)

online_deployment_name = "aci-blue"

# create an online deployment.
blue_deployment = ManagedOnlineDeployment(
    name=online_deployment_name,
    endpoint_name=online_endpoint_name,
    model=model,
    environment=curated_env_name,
    code_configuration=CodeConfiguration(code="./score/", scoring_script="score.py"),
    instance_type="Standard_NC6s_v3",
    instance_count=1,
)

blue_deployment = ml_client.begin_create_or_update(blue_deployment).result()

Note

Expect this deployment to take a bit of time to finish.
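Because an endpoint can host several deployments, incoming traffic is divided among them by percentage. The sketch below illustrates percentage-based routing in plain Python. The route helper, the rule that percentages add up to 100, and the blue/green split are assumptions of this sketch; the article itself sends 100% of traffic to aci-blue.

```python
import random


def route(traffic, rng):
    """Pick a deployment name according to its share of traffic (percentages)."""
    if sum(traffic.values()) != 100:
        raise ValueError("Traffic percentages must add up to 100 in this sketch")
    names = list(traffic)
    weights = [traffic[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


rng = random.Random(0)

# Single deployment, as in this article: every request goes to aci-blue.
single = {"aci-blue": 100}
print({route(single, rng) for _ in range(5)})

# Hypothetical split while a second deployment is being tried out.
split = {"aci-blue": 90, "aci-green": 10}
counts = {"aci-blue": 0, "aci-green": 0}
for _ in range(1000):
    counts[route(split, rng)] += 1
print(counts)
```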

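Test the deployed model

Now that you deployed the model to the endpoint, you can predict the output of the deployed model by using the invoke method on the endpoint.

To test the endpoint, let's use a sample image for prediction. First, let's display the image.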

# install pillow if PIL cannot be imported
%pip install pillow
import json
from PIL import Image
import matplotlib.pyplot as plt

%matplotlib inline
plt.imshow(Image.open("test_img.jpg"))

Create a function to format and resize the image.

# install torch and torchvision if needed
%pip install torch
%pip install torchvision

import torch
from torchvision import transforms


def preprocess(image_file):
    """Preprocess the input image."""
    data_transforms = transforms.Compose(
        [
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
        ]
    )

    image = Image.open(image_file)
    image = data_transforms(image).float()
    image = torch.tensor(image)
    image = image.unsqueeze(0)
    return image.numpy()
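To see what the last two transforms do to an individual pixel, the arithmetic can be worked by hand: ToTensor scales 8-bit channel values to the range [0, 1], and Normalize then subtracts the per-channel mean and divides by the per-channel standard deviation. The normalize_pixel helper below is a plain-Python sketch of that arithmetic, written for illustration rather than taken from torchvision.

```python
# Per-channel statistics passed to transforms.Normalize above.
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]


def normalize_pixel(rgb):
    """Apply the ToTensor scaling and the Normalize step to one RGB pixel."""
    scaled = [channel / 255 for channel in rgb]  # 0-255 -> 0.0-1.0
    return [(value - m) / s for value, m, s in zip(scaled, MEAN, STD)]


print(normalize_pixel((0, 0, 0)))        # black pixel: all values negative
print(normalize_pixel((255, 255, 255)))  # white pixel: all values positive
```

The scoring side has to apply the same statistics that were used during training; otherwise the model sees inputs on a different scale than it learned from.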

Format the image and convert it to a JSON file.

image_data = preprocess("test_img.jpg")
input_data = json.dumps({"data": image_data.tolist()})
with open("request.json", "w") as outfile:
    outfile.write(input_data)

You can then invoke the endpoint with this JSON and print the result.

# test the blue deployment
result = ml_client.online_endpoints.invoke(
    endpoint_name=online_endpoint_name,
    request_file="request.json",
    deployment_name=online_deployment_name,
)

print(result)

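Clean up resources

If you don't need the endpoint anymore, delete it to stop using resources. Make sure no other deployments are using the endpoint before you delete it.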

ml_client.online_endpoints.begin_delete(name=online_endpoint_name)

Note

Expect this cleanup to take a bit of time to finish.

In this article, you trained and registered a deep learning neural network using PyTorch on Azure Machine Learning. You also deployed the model to an online endpoint.
