Start, monitor, and track run history

Article
03/06/2024

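The Azure Machine Learning SDK for Python v1 and the Machine Learning CLI provide several ways to monitor, organize, and track your training and experimentation runs. Your ML run history is an important part of an explainable and repeatable ML development process.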

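Tip

For information on using the studio, see Track, monitor, and analyze runs with studio.

If you're using Azure Machine Learning SDK v2, see the following articles:

- Log and view metrics and log files (v2).
- Track experiments with MLflow and CLI (v2).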

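This article shows how to do the following tasks:

- Monitor run performance.
- Tag and find runs.
- Run search over your run history.
- Cancel or fail runs.
- Create child runs.
- Monitor the run status by email notification.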

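Tip

If you're looking for information on monitoring the Azure Machine Learning service and associated Azure services, see How to monitor Azure Machine Learning. If you're looking for information on monitoring models deployed as web services, see Collect model data and Monitor with Application Insights.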

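Prerequisites

You'll need the following items:

- An Azure subscription. If you don't have an Azure subscription, create a free account before you begin. Try the free or paid version of Azure Machine Learning today.
- An Azure Machine Learning workspace.
- The Azure Machine Learning SDK for Python (version 1.0.21 or later). To install or update to the latest version of the SDK, see Install or update the SDK.

To check your version of the Azure Machine Learning SDK, use the following code: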
```
import azureml.core

print(azureml.core.VERSION)
```
- The Azure CLI and the CLI extension for Azure Machine Learning.
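
The version string printed by the check above can also be compared against the 1.0.21 minimum in code. The following is a minimal sketch, not part of the SDK: the helper names `version_tuple` and `meets_minimum` are hypothetical, and the sketch assumes the version is a dotted string whose components start with digits. It compares the components numerically, because a plain string comparison would rank "1.0.9" above "1.0.21".

```python
import re

MINIMUM_SDK_VERSION = "1.0.21"  # minimum version named in the prerequisites


def version_tuple(version):
    """Convert '1.0.21' to (1, 0, 21); stop at the first component without leading digits."""
    numbers = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        numbers.append(int(match.group()))
    return tuple(numbers)


def meets_minimum(installed, minimum=MINIMUM_SDK_VERSION):
    """Return True when the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


# In a real session you would pass azureml.core.VERSION instead of a literal.
print(meets_minimum("1.0.9"))   # False
print(meets_minimum("1.55.0"))  # True
```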

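Important

The Azure CLI commands in this article use the azure-cli-ml, or v1, extension for Azure Machine Learning. Support for the v1 extension ends on September 30, 2025. You can install and use the v1 extension until that date.

We recommend that you transition to the ml, or v2, extension before September 30, 2025. For more information on the v2 extension, see Azure ML CLI extension and Python SDK v2.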

Monitor run performance

Start a run and its logging process
- Python SDK
- Azure CLI
Applies to: Python SDK azureml v1
1. Set up your experiment by importing the Workspace, Experiment, Run, and ScriptRunConfig classes from the azureml.core package.
```
import azureml.core
from azureml.core import Workspace, Experiment, Run
from azureml.core import ScriptRunConfig

ws = Workspace.from_config()
exp = Experiment(workspace=ws, name="explore-runs")
```
2. Start a run and its logging process with the start_logging() method.
```
notebook_run = exp.start_logging()
notebook_run.log(name="message", value="Hello from run!")
```
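
When a run produces several values, the same `log(name=..., value=...)` call shown above can be repeated in a loop. The following is a minimal sketch under stated assumptions: `log_metrics` is a hypothetical helper, not an SDK function, and `RecordingRun` is a stand-in class that only exists so the sketch runs without a workspace. With the real SDK, you would pass the run returned by `exp.start_logging()`.

```python
def log_metrics(run, metrics):
    """Log each name/value pair through run.log(name=..., value=...)."""
    for name, value in metrics.items():
        run.log(name=name, value=value)


class RecordingRun:
    """Hypothetical stand-in that records calls instead of contacting a workspace."""

    def __init__(self):
        self.logged = []

    def log(self, name, value):
        self.logged.append((name, value))


demo_run = RecordingRun()
log_metrics(demo_run, {"message": "Hello from run!", "accuracy": 0.91})
print(demo_run.logged)
```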
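Applies to: Azure CLI ml extension v1

To start a run of your experiment, use the following steps:
1. From a shell or command prompt, use the Azure CLI to authenticate to your Azure subscription: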
```
az login
```
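  Tip
  
  After you sign in, you see a list of subscriptions associated with your Azure account. The subscription with isDefault: true is the one currently active for Azure CLI commands. This subscription must be the same one that contains your Azure Machine Learning workspace. You can find the subscription ID in the Azure portal on the overview page for your workspace.
  
  To select another subscription, use the az account set -s <subscription name or ID> command and specify the subscription name or ID to switch to. For more information about subscription selection, see Use multiple Azure subscriptions.
2. Attach a workspace configuration to the folder that contains your training script. Replace myworkspace with your Azure Machine Learning workspace. Replace myresourcegroup with the Azure resource group that contains your workspace: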
```
az ml folder attach -w myworkspace -g myresourcegroup
```
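  This command creates a .azureml subdirectory that contains example runconfig and conda environment files. It also contains a config.json file that is used to communicate with your Azure Machine Learning workspace.
  
  For more information, see az ml folder attach.
3. To start the run, use the following command. For the -c parameter, specify the name of the runconfig file (the text before *.runconfig if you're looking at your file system).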
```
az ml run submit-script -c sklearn -e testexperiment train.py
```
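  Tip
  
  The az ml folder attach command creates a .azureml subdirectory, which contains two example runconfig files.
  
  If you have a Python script that creates a run configuration object programmatically, you can use RunConfig.save() to save it as a runconfig file.
  
  For more information about runconfig files, see https://github.com/MicrosoftDocs/pipelines-azureml/.
  
  For more information, see az ml run submit-script.
Monitor the status of a run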
- Python SDK
- Azure CLI
Applies to: Python SDK azureml v1
- Get the status of a run with the get_status() method.
```
print(notebook_run.get_status())
```
- To get the run ID, execution time, and other details about the run, use the get_details() method.
```
print(notebook_run.get_details())
```
- When your run finishes successfully, use the complete() method to mark it as completed.
```
notebook_run.complete()
print(notebook_run.get_status())
```
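- If you use Python's with...as design pattern, the run automatically marks itself as completed when it goes out of scope. You don't need to mark the run as completed manually.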
```
with exp.start_logging() as notebook_run:
    notebook_run.log(name="message", value="Hello from run!")
    print(notebook_run.get_status())

print(notebook_run.get_status())
```
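
The `get_status()` call shown above can also be polled in a loop when you want to wait for a run with your own timeout. The following is a minimal sketch under stated assumptions: `wait_for_terminal_status` is a hypothetical helper, the set of terminal states ("Completed", "Failed", "Canceled") is an assumption to check against the SDK reference, and `ScriptedRun` is a stand-in that replays fixed statuses so the sketch runs without a workspace.

```python
import time

# Assumed terminal states; confirm the full list in the SDK reference.
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}


def wait_for_terminal_status(run, poll_seconds=5.0, timeout_seconds=600.0):
    """Poll run.get_status() until it reports a terminal state or the timeout passes."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = run.get_status()
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Run still in state {status!r} after {timeout_seconds} s")
        time.sleep(poll_seconds)


class ScriptedRun:
    """Hypothetical stand-in that replays a fixed sequence of statuses."""

    def __init__(self, statuses):
        self._statuses = list(statuses)

    def get_status(self):
        # Keep returning the last status once the sequence is exhausted.
        if len(self._statuses) > 1:
            return self._statuses.pop(0)
        return self._statuses[0]


demo_run = ScriptedRun(["Starting", "Running", "Completed"])
print(wait_for_terminal_status(demo_run, poll_seconds=0.01))  # Completed
```

The SDK's own wait_for_completion() method blocks in a similar way; a hand-written loop like this one is mainly useful when you need custom timeout handling.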
Applies to: Azure CLI ml extension v1
- To view a list of runs for your experiment, use the following command. Replace experiment with the name of your experiment:
```
az ml run list --experiment-name experiment
```
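  This command returns a JSON document that lists information about the runs for this experiment.
  
  For more information, see az ml run list.
- To view information on a specific run, use the following command. Replace runid with the ID of the run: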
```
az ml run show -r runid
```
  This command returns a JSON document that lists information about the run.
  
  For more information, see az ml run show.

Tag and find runs

In Azure Machine Learning, you can use properties and tags to help organize your runs and query them for important information.

Add properties and tags
- Python SDK
- Azure CLI
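Applies to: Python SDK azureml v1

To add searchable metadata to your runs, use the add_properties() method. For example, the following code adds the "author" property to the run (here, local_run is any Run object, such as one returned by exp.submit()):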
```
local_run.add_properties({"author":"azureml-user"})
print(local_run.get_properties())
```
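Properties are immutable, so they create a permanent record for auditing purposes. The following code example results in an error, because "azureml-user" was already added as the "author" property value in the preceding code: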
```
try:
    local_run.add_properties({"author":"different-user"})
except Exception as e:
    print(e)
```
Unlike properties, tags are mutable. To add searchable and meaningful information for consumers of your experiment, use the tag() method.
```
local_run.tag("quality", "great run")
print(local_run.get_tags())

local_run.tag("quality", "fantastic run")
print(local_run.get_tags())
```
You can also add simple string tags. When these tags appear in the tag dictionary as keys, they have a value of None.
```
local_run.tag("worth another look")
print(local_run.get_tags())
```
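
Because properties are immutable, code that may run more than once can check the existing properties before adding new ones. The following is a minimal sketch under stated assumptions: `add_missing_properties` is a hypothetical helper built on the get_properties() and add_properties() methods shown above, and `InMemoryRun` is a stand-in that mimics the immutability described here so the sketch runs without a workspace. The exact exception type raised by the real SDK isn't specified in this article, so the stand-in's ValueError is illustrative only.

```python
def add_missing_properties(run, properties):
    """Add only the keys the run doesn't have yet; return the keys that were skipped."""
    existing = run.get_properties()
    new_items = {k: v for k, v in properties.items() if k not in existing}
    if new_items:
        run.add_properties(new_items)
    return sorted(set(properties) - set(new_items))


class InMemoryRun:
    """Hypothetical stand-in that rejects changes to existing properties."""

    def __init__(self):
        self._properties = {}

    def get_properties(self):
        return dict(self._properties)

    def add_properties(self, properties):
        clash = set(properties) & set(self._properties)
        if clash:
            raise ValueError(f"Properties are immutable: {sorted(clash)}")
        self._properties.update(properties)


demo_run = InMemoryRun()
add_missing_properties(demo_run, {"author": "azureml-user"})
skipped = add_missing_properties(demo_run, {"author": "different-user", "team": "vision"})
print(skipped)                    # ['author']
print(demo_run.get_properties())  # {'author': 'azureml-user', 'team': 'vision'}
```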
Applies to: Azure CLI ml extension v1

Note

With the CLI, you can only add or update tags.

To add or update a tag, use the following command:
```
az ml run update -r runid --add-tag quality='fantastic run'
```
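For more information, see az ml run update.

Query properties and tags

You can query runs within an experiment to return a list of runs that match specific properties and tags.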

Python SDK
Azure CLI

Applies to: Python SDK azureml v1

```
list(exp.get_runs(properties={"author":"azureml-user"},tags={"quality":"fantastic run"}))
list(exp.get_runs(properties={"author":"azureml-user"},tags="worth another look"))
```
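
If you already hold a list of Run objects, a comparable filter can be applied on the client side. The following is a minimal sketch under stated assumptions: `run_matches` is a hypothetical helper, its matching rules (exact value match for dictionaries, key presence for a plain string tag) are an assumption modeled on the two calls above rather than a statement of how the service matches, and `TaggedRun` is a stand-in so the sketch runs without a workspace.

```python
def run_matches(run, properties=None, tags=None):
    """Return True when the run has every requested property and tag.

    `tags` may be a dict (key and value must match) or a plain string
    (the key only has to exist).
    """
    run_properties = run.get_properties()
    run_tags = run.get_tags()
    for key, value in (properties or {}).items():
        if key not in run_properties or run_properties[key] != value:
            return False
    if isinstance(tags, str):
        return tags in run_tags
    for key, value in (tags or {}).items():
        if key not in run_tags or run_tags[key] != value:
            return False
    return True


class TaggedRun:
    """Hypothetical stand-in holding fixed properties and tags."""

    def __init__(self, properties, tags):
        self._properties, self._tags = properties, tags

    def get_properties(self):
        return dict(self._properties)

    def get_tags(self):
        return dict(self._tags)


runs = [
    TaggedRun({"author": "azureml-user"}, {"quality": "fantastic run"}),
    TaggedRun({"author": "azureml-user"}, {"worth another look": None}),
]
by_value = [r for r in runs if run_matches(r, {"author": "azureml-user"}, {"quality": "fantastic run"})]
by_key = [r for r in runs if run_matches(r, {"author": "azureml-user"}, "worth another look")]
print(len(by_value), len(by_key))  # 1 1
```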

Applies to: Azure CLI ml extension v1

The Azure CLI supports JMESPath queries, which can be used to filter runs based on properties and tags. To use a JMESPath query with the Azure CLI, specify it with the --query parameter. The following examples show some queries that use properties and tags:

```
# list runs where the author property = 'azureml-user'
az ml run list --experiment-name experiment --query "[?properties.author=='azureml-user']"
# list runs where the tags contain a key that starts with 'worth another look'
az ml run list --experiment-name experiment --query "[?tags.keys(@)[?starts_with(@, 'worth another look')]]"
# list runs where the author property = 'azureml-user' and the 'quality' tag = 'fantastic run'
az ml run list --experiment-name experiment --query "[?properties.author=='azureml-user' && tags.quality=='fantastic run']"
```

For more information on querying Azure CLI results, see Query Azure CLI command output.

Cancel or fail runs

If you notice a mistake, or if your run is taking too long to finish, you can cancel the run.

Python SDK
Azure CLI

Applies to: Python SDK azureml v1

To cancel a run using the SDK, use the cancel() method:

```
src = ScriptRunConfig(source_directory='.', script='hello_with_delay.py')
local_run = exp.submit(src)
print(local_run.get_status())

local_run.cancel()
print(local_run.get_status())
```

If your run finishes but contains an error (for example, the wrong training script was used), you can use the fail() method to mark it as failed.

```
local_run = exp.submit(src)
local_run.fail()
print(local_run.get_status())
```
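
The "taking too long" case can be automated by combining get_status() and cancel(). The following is a minimal sketch under stated assumptions: `cancel_if_slow` is a hypothetical helper, the set of finished states is an assumption to check against the SDK reference, and `SlowRun` is a stand-in that never finishes on its own so the sketch runs without a workspace.

```python
import time

# Assumed finished states; confirm the full list in the SDK reference.
FINISHED_STATES = {"Completed", "Failed", "Canceled"}


def cancel_if_slow(run, max_seconds, poll_seconds=5.0):
    """Wait up to max_seconds, then call run.cancel() if the run hasn't finished.

    Returns True when the run was canceled by this helper.
    """
    deadline = time.monotonic() + max_seconds
    status = run.get_status()
    while status not in FINISHED_STATES and time.monotonic() < deadline:
        time.sleep(poll_seconds)
        status = run.get_status()
    if status not in FINISHED_STATES:
        run.cancel()
        return True
    return False


class SlowRun:
    """Hypothetical stand-in that stays in 'Running' until it is canceled."""

    def __init__(self):
        self.status = "Running"

    def get_status(self):
        return self.status

    def cancel(self):
        self.status = "Canceled"


demo_run = SlowRun()
was_canceled = cancel_if_slow(demo_run, max_seconds=0.05, poll_seconds=0.01)
print(was_canceled, demo_run.get_status())  # True Canceled
```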

Applies to: Azure CLI ml extension v1

To cancel a run using the CLI, use the following command. Replace runid with the ID of the run:

```
az ml run cancel -r runid -w workspace_name -e experiment_name
```

For more information, see az ml run cancel.

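Create child runs

Applies to: Python SDK azureml v1

Create child runs to group together related runs, such as for different hyperparameter-tuning iterations.

Note

Child runs can only be created by using the SDK.

This code example uses the hello_with_children.py script to create a batch of five child runs from within a submitted run by using the child_run() method: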

```
# Notebook cell: the leading "!" is a Jupyter shell escape that prints the script.
!more hello_with_children.py
```

```
src = ScriptRunConfig(source_directory='.', script='hello_with_children.py')

local_run = exp.submit(src)
local_run.wait_for_completion(show_output=True)
print(local_run.get_status())

# Each child run is marked as completed when its "with" block exits.
with exp.start_logging() as parent_run:
    for c in range(5):
        with parent_run.child_run() as child:
            child.log(name="Hello from child run", value=c)
```

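Note

As they move out of scope, child runs are automatically marked as completed.

To create many child runs efficiently, use the create_children() method. Because each creation results in a network call, creating a batch of runs is more efficient than creating them one by one.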
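
Batch creation with create_children() might look like the following. This is a minimal sketch under stated assumptions: `log_values_in_child_runs` is a hypothetical helper, it assumes that create_children(count=...) returns a list of child runs that each support log() and complete(), and `FakeParent` and `FakeChild` are stand-ins so the sketch runs without a workspace. Because these children are not created in a with block, the sketch marks each one as completed explicitly.

```python
def log_values_in_child_runs(parent_run, values, metric_name="value"):
    """Create one child run per value with a single create_children() call."""
    children = parent_run.create_children(count=len(values))
    for child, value in zip(children, values):
        child.log(name=metric_name, value=value)
        child.complete()  # no "with" block here, so complete each child by hand
    return children


class FakeChild:
    """Hypothetical stand-in for a child run."""

    def __init__(self):
        self.metrics = {}
        self.completed = False

    def log(self, name, value):
        self.metrics[name] = value

    def complete(self):
        self.completed = True


class FakeParent:
    """Hypothetical stand-in that counts how many creation calls were made."""

    def __init__(self):
        self.calls = 0

    def create_children(self, count):
        self.calls += 1
        return [FakeChild() for _ in range(count)]


parent = FakeParent()
children = log_values_in_child_runs(parent, [0.1, 0.01, 0.001], metric_name="learning_rate")
print(parent.calls, [c.metrics["learning_rate"] for c in children])  # 1 [0.1, 0.01, 0.001]
```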

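Submit child runs

You can also submit child runs from a parent run. This lets you create hierarchies of parent and child runs. You can't create a parentless child run: even if the parent run does nothing but launch child runs, the hierarchy is still required. The statuses of all runs are independent: a parent can be in the "Completed" successful state even if one or more child runs were canceled or failed.

You might want your child runs to use a different run configuration than the parent run. For example, you might use a less powerful, CPU-based configuration for the parent and GPU-based configurations for the children. Another common scenario is passing different arguments and data to each child. To customize a child run, create a ScriptRunConfig object for it.

Important

To submit child runs from a parent run on a remote compute, you must first sign in to the workspace in the parent run code. By default, the run context object in a remote run doesn't have credentials to submit child runs. Use service principal or managed identity credentials to sign in. For more information on authenticating, see Set up authentication.

The following code:

- Retrieves a compute resource named "gpu-cluster" from the workspace ws
- Iterates over different argument values to be passed to the child ScriptRunConfig objects
- Creates and submits a new child run by using the custom compute resource and argument
- Blocks until all of the child runs complete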

```
# parent.py
# This script controls the launching of child scripts
from azureml.core import Run, ScriptRunConfig

# `ws` must be an authenticated Workspace object; on remote compute, sign in
# first as described in the Important note above.
compute_target = ws.compute_targets["gpu-cluster"]

run = Run.get_context()

child_args = ['Apple', 'Banana', 'Orange']
for arg in child_args:
    run.log('Status', f'Launching {arg}')
    child_config = ScriptRunConfig(source_directory=".", script='child.py', arguments=['--fruit', arg], compute_target=compute_target)
    # Starts the run asynchronously
    run.submit_child(child_config)

# Experiment will "complete" successfully at this point.
# Instead of returning immediately, block until child runs complete

for child in run.get_children():
    child.wait_for_completion()
```

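To create many child runs with identical configurations, arguments, and inputs efficiently, use the create_children() method. Because each creation results in a network call, creating a batch of runs is more efficient than creating them one by one.

Within a child run, you can view the parent run ID:

```
# In the child run script
child_run = Run.get_context()
child_run.parent.id
```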

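Query child runs

To query the child runs of a specific parent, use the get_children() method. The recursive=True argument lets you query a nested tree of children and grandchildren.

```
# Wrap the result in list() so the runs themselves are printed.
print(list(parent_run.get_children()))
```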
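
Because run statuses are independent, a parent can report "Completed" while a descendant has failed, so it can help to tally the statuses of the whole tree. The following is a minimal sketch under stated assumptions: `summarize_descendants` is a hypothetical helper built on get_children(recursive=True) and get_status(), and `TreeRun` is a stand-in that models a small run hierarchy so the sketch runs without a workspace.

```python
from collections import Counter


def summarize_descendants(parent_run):
    """Count the statuses of all child and grandchild runs of parent_run."""
    return Counter(child.get_status() for child in parent_run.get_children(recursive=True))


class TreeRun:
    """Hypothetical stand-in for a run with nested child runs."""

    def __init__(self, status, children=()):
        self._status = status
        self._children = list(children)

    def get_status(self):
        return self._status

    def get_children(self, recursive=False):
        for child in self._children:
            yield child
            if recursive:
                yield from child.get_children(recursive=True)


parent = TreeRun("Completed", [
    TreeRun("Completed", [TreeRun("Failed")]),
    TreeRun("Completed"),
])
print(dict(summarize_descendants(parent)))  # {'Completed': 2, 'Failed': 1}
```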

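Log to parent or root run

You can use the Run.parent field to access the run that launched the current child run. A common use case for Run.parent is to combine log results in a single place. Child runs execute asynchronously, and there's no guarantee of ordering or synchronization beyond the parent's ability to wait for its child runs to complete.

```
# In the child (or even grandchild) run script

def root_run(self: Run) -> Run:
    # Walk up the hierarchy until a run without a parent is reached.
    if self.parent is None:
        return self
    return root_run(self.parent)

current_child_run = Run.get_context()
root_run(current_child_run).log("MyMetric", f"Data from child run {current_child_run.id}")
```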

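Monitor the run status by email notification

1. In the Azure portal, in the left navigation bar, select the [Monitor] tab.
2. Select [Diagnostic settings], and then select [+ Add diagnostic setting].
3. In the diagnostic setting:
   1. Under [Category details], select [AmlRunStatusChangedEvent].
   2. In [Destination details], select [Send to Log Analytics workspace], and specify the [Subscription] and [Log Analytics workspace].

   Note

   The [Azure Log Analytics workspace] is a different type of Azure resource than the [Azure Machine Learning service workspace]. If there are no options in that list, you can create a Log Analytics workspace.
4. In the [Logs] tab, add a [New alert rule].
5. See how to create and manage log alerts by using Azure Monitor.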

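Example notebooks

The following notebooks demonstrate the concepts in this article:

- To learn more about the logging APIs, see the logging API notebook.
- For more information about managing runs with the Azure Machine Learning SDK, see the manage runs notebook.

Next steps

- To learn how to log metrics for your experiments, see Log metrics during training runs.
- To learn how to monitor resources and logs from Azure Machine Learning, see Monitoring Azure Machine Learning.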
