Model training on serverless compute

Article
10/03/2024

Applies to: Azure CLI ml extension v2 (current), Python SDK azure-ai-ml v2 (current)

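You no longer need to create and manage compute to train your model in a scalable way. Instead, you can submit your job to a compute target type called serverless compute. Serverless compute is the easiest way to run training jobs on Azure Machine Learning. It is fully managed, on-demand compute: Azure Machine Learning creates, scales, and manages the compute for you. By training models on serverless compute, machine learning professionals can focus on building machine learning models without having to learn about or configure compute infrastructure.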

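Machine learning professionals can specify the resources a job needs. Azure Machine Learning manages the compute infrastructure and provides managed network isolation, which reduces the burden on you.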

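Enterprises can also reduce costs by specifying optimal resources for each job. IT admins can still retain control by specifying core quota at the subscription and workspace level and by applying Azure policies.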

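Serverless compute can be used to fine-tune models in the model catalog, such as LLAMA 2. It can run all types of jobs from Azure Machine Learning studio, the SDK, and the CLI. It can also be used to build environment images and for responsible AI dashboard scenarios. Serverless jobs consume the same quota as Azure Machine Learning compute. You can choose the standard (dedicated) tier or spot (low-priority) VMs. Serverless jobs support managed identity and user identity. The billing model is the same as for Azure Machine Learning compute.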

Advantages of serverless compute

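- Azure Machine Learning manages creating, setting up, scaling, deleting, and patching the compute infrastructure, which reduces management overhead.
- You don't need to learn about compute, the various compute types, and their related properties.
- You don't need to repeatedly create clusters for each VM size you need, reuse the same settings, and replicate them for each workspace.
- You can optimize costs by specifying the exact resources each job needs at runtime, in terms of instance type (VM size) and instance count. You can monitor the job's utilization metrics to tune the resources the job needs.
- Fewer steps are involved in running a job.
- To simplify job submission further, you can skip the resources altogether. Azure Machine Learning defaults the instance count and chooses an instance type (VM size) based on factors such as quota, cost, performance, and disk size.
- In some cases, jobs wait less time before they start running.
- Job submission supports user identity and workspace user-assigned managed identity.
- With managed network isolation, you can streamline and automate your network isolation configuration. Customer virtual networks are also supported.
- Admins keep control through quota and Azure policies.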

How to use serverless compute

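You can fine-tune foundation models such as LLAMA 2 by using notebooks, as shown here:
- Fine-tune LLAMA 2
- Fine-tune LLAMA 2 using multiple nodes
When you create your own compute cluster, you use its name in the command job, for example compute="cpu-cluster". With serverless compute, you can skip creating a compute cluster and omit the compute parameter. When compute isn't specified for a job, the job runs on serverless compute. Omit the compute name in your CLI or SDK jobs to use serverless compute in the following job types, and optionally provide the resources the job needs in terms of instance count and instance type: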
- Command jobs, including interactive jobs and distributed training
- AutoML jobs
- Sweep jobs
- Parallel jobs
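For pipeline jobs through the CLI, use default_compute: azureml:serverless as the pipeline-level default compute. For pipeline jobs through the SDK, use default_compute="serverless". For an example, see Pipeline job.
When you submit a training job in studio (preview), select Serverless as the compute type.
When you use Azure Machine Learning designer, select Serverless as the default compute.
You can use serverless compute for the responsible AI dashboard:
- AutoML image classification scenario with RAI dashboard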
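
The rule described above, that a job with no compute target runs on serverless compute, can be sketched in plain Python. This is only an illustration and not the SDK: `build_command_job_spec` and `uses_serverless` are hypothetical helpers that assemble a dictionary with the same fields a command job YAML file carries (`command`, `environment`, `compute`, `resources`).

```python
from typing import Optional


def build_command_job_spec(
    command: str,
    image: str,
    compute: Optional[str] = None,
    instance_type: Optional[str] = None,
    instance_count: Optional[int] = None,
) -> dict:
    """Assemble a dict shaped like a command job YAML file (hypothetical helper)."""
    spec = {"command": command, "environment": {"image": image}}
    if compute is not None:
        # Naming a cluster pins the job to that cluster.
        spec["compute"] = compute
    # Resources are optional; include only the fields the caller provided.
    resources = {}
    if instance_type is not None:
        resources["instance_type"] = instance_type
    if instance_count is not None:
        resources["instance_count"] = instance_count
    if resources:
        spec["resources"] = resources
    return spec


def uses_serverless(spec: dict) -> bool:
    """A job spec without a compute target runs on serverless compute."""
    return "compute" not in spec


cluster_job = build_command_job_spec(
    "echo 'hello world'", "library/python:latest", compute="cpu-cluster"
)
serverless_job = build_command_job_spec(
    "echo 'hello world'",
    "library/python:latest",
    instance_type="Standard_E4s_v3",
    instance_count=1,
)

print(uses_serverless(cluster_job))
print(uses_serverless(serverless_job))
```

The serverless job differs from the cluster job only by leaving out `compute`; the optional `resources` block is how it states what it needs.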

Performance considerations

Serverless compute can help speed up your training in the following ways:

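Insufficient quota: When you create your own compute cluster, you're responsible for working out which VM size and node count to create. If you don't have enough quota for the cluster when your job runs, the job fails. Serverless compute uses information about your quota to select an appropriate VM size by default.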

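Scale-down optimization: When a compute cluster is scaling down, a new job has to wait for the scale-down to finish and for the cluster to scale up again before it can run. With serverless compute, you don't have to wait for the scale-down; your job can start running on another cluster or node (assuming you have quota).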

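Cluster-busy optimization: When a job is running on a compute cluster and another job is submitted, your job is queued behind the currently running job. With serverless compute, you get another node or another cluster to start running the job (assuming you have quota).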

Quota

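When you submit a job, you still need enough Azure Machine Learning compute quota to proceed (both workspace-level and subscription-level quota). The default VM size for serverless jobs is selected based on this quota. If you specify your own VM size or family: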

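- If you have some quota for your VM size or family but not enough for the number of instances, you see an error. The error recommends reducing the number of instances to a valid number based on your quota limit, requesting a quota increase for this VM family, or changing the VM size.
- If you have no quota for the VM size you specified, you see an error. The error recommends selecting a different VM size for which you do have quota, or requesting quota for this VM family.
- If you have enough quota for the VM family to run the serverless job, but other jobs are using that quota, you get a message that your job must wait in a queue until quota becomes available.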

When you view your usage and quota in the Azure portal, you see the name "Serverless", which shows all the quota consumed by serverless jobs.
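
The three quota outcomes above can be sketched as a small decision function. This is a simplified model for illustration, not the service's actual logic: `check_quota`, `QuotaDecision`, and the idea of counting quota in cores per VM family are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class QuotaDecision:
    outcome: str  # "run", "queue", or "error"
    message: str


def check_quota(
    family_quota_cores: int,
    cores_in_use: int,
    cores_per_instance: int,
    instance_count: int,
) -> QuotaDecision:
    """Classify a job request against a VM-family core quota (hypothetical model)."""
    if cores_per_instance <= 0 or instance_count <= 0:
        raise ValueError("cores_per_instance and instance_count must be positive")
    requested = cores_per_instance * instance_count
    if family_quota_cores <= 0:
        # No quota at all for the chosen VM size or family.
        return QuotaDecision(
            "error", "No quota: select a different VM size or request quota."
        )
    if requested > family_quota_cores:
        # Some quota exists, but not enough for this many instances.
        max_instances = family_quota_cores // cores_per_instance
        return QuotaDecision(
            "error",
            f"Reduce instance count to at most {max_instances}, "
            "request more quota, or change the VM size.",
        )
    if requested > family_quota_cores - cores_in_use:
        # The quota is large enough, but other jobs currently hold it.
        return QuotaDecision("queue", "Job waits until quota becomes available.")
    return QuotaDecision("run", "Enough free quota: job can start.")


print(check_quota(0, 0, 4, 1).outcome)
print(check_quota(16, 0, 4, 6).message)
print(check_quota(16, 12, 4, 2).outcome)
print(check_quota(16, 4, 4, 2).outcome)
```

The distinction that matters is between a request that can never fit the quota (an error) and one that fits but has to wait for other jobs to release cores (a queue).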

Identity support and credential pass-through

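User credential pass-through: Serverless compute fully supports user credential pass-through. The user token of the user who submits the job is used for storage access. These credentials come from your Microsoft Entra ID.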

Python SDK
Azure CLI

from azure.ai.ml import command
from azure.ai.ml import MLClient     # Handle to the workspace
from azure.identity import DefaultAzureCredential     # Authentication package
from azure.ai.ml.entities import ResourceConfiguration
from azure.ai.ml.entities import UserIdentityConfiguration 

credential = DefaultAzureCredential()
# Get a handle to the workspace. You can find the info on the workspace tab on ml.azure.com
ml_client = MLClient(
    credential=credential,
    subscription_id="<Azure subscription id>", 
    resource_group_name="<Azure resource group>",
    workspace_name="<Azure Machine Learning Workspace>",
)
job = command(
    command="echo 'hello world'",
    environment="azureml://registries/azureml/environments/sklearn-1.5/labels/latest",
    identity=UserIdentityConfiguration(),
)
# submit the command job
ml_client.create_or_update(job)

Create a file named hello.yaml with the following contents:

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: echo "hello world"
environment:
  image: library/python:latest
identity:
  type: user_identity

Submit the job with the following command:

az ml job create --file hello.yaml --resource-group my-resource-group --workspace-name my-workspace

The rest of the CLI examples show variations of the hello.yaml file. Run each one the same way.

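User-assigned managed identity: When your workspace is configured with a user-assigned managed identity, you can use that identity with serverless jobs for storage access. To access secrets, see Use authentication credential secrets in Azure Machine Learning jobs.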

Python SDK
Azure CLI

from azure.ai.ml import command
from azure.ai.ml import MLClient     # Handle to the workspace
from azure.identity import DefaultAzureCredential    # Authentication package
from azure.ai.ml.entities import ResourceConfiguration
from azure.ai.ml.entities import ManagedIdentityConfiguration

credential = DefaultAzureCredential()
# Get a handle to the workspace. You can find the info on the workspace tab on ml.azure.com
ml_client = MLClient(
    credential=credential,
    subscription_id="<Azure subscription id>", 
    resource_group_name="<Azure resource group>",
    workspace_name="<Azure Machine Learning Workspace>",
)
job = command(
    command="echo 'hello world'",
    environment="azureml://registries/azureml/environments/sklearn-1.5/labels/latest",
    identity=ManagedIdentityConfiguration(),
)
# submit the command job
ml_client.create_or_update(job)

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: echo "hello world"
environment:
  image: library/python:latest
identity:
  type: managed

For more information about attaching a user-assigned managed identity, see Attach a user-assigned managed identity.

Configure properties for command jobs

If no compute target is specified for command, sweep, and AutoML jobs, the compute defaults to serverless compute. For example, for this command job:

Python SDK
Azure CLI

from azure.ai.ml import command
from azure.ai.ml import MLClient # Handle to the workspace
from azure.identity import DefaultAzureCredential # Authentication package

credential = DefaultAzureCredential()
# Get a handle to the workspace. You can find the info on the workspace tab on ml.azure.com
ml_client = MLClient(
    credential=credential,
    subscription_id="<Azure subscription id>", 
    resource_group_name="<Azure resource group>",
    workspace_name="<Azure Machine Learning Workspace>",
)
job = command(
    command="echo 'hello world'",
    environment="azureml://registries/azureml/environments/sklearn-1.5/labels/latest",
)
# submit the command job
ml_client.create_or_update(job)

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: echo "hello world"
environment:
  image: library/python:latest

The compute defaults to serverless compute with:

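- A single node for this job. The default number of nodes depends on the job type. For other job types, see the following sections.
- A CPU virtual machine, chosen based on quota, performance, cost, and disk size.
- Dedicated virtual machines.
- The workspace location.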

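You can override these defaults. To specify the VM type or the number of nodes for serverless compute, add resources to your job:

- instance_type chooses a specific VM. Use this parameter if you want a specific CPU or GPU VM size.
- instance_count specifies the number of nodes.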

Python SDK
Azure CLI

from azure.ai.ml import command 
from azure.ai.ml import MLClient # Handle to the workspace
from azure.identity import DefaultAzureCredential # Authentication package
from azure.ai.ml.entities import JobResourceConfiguration 

credential = DefaultAzureCredential()
# Get a handle to the workspace. You can find the info on the workspace tab on ml.azure.com
ml_client = MLClient(
    credential=credential,
    subscription_id="<Azure subscription id>", 
    resource_group_name="<Azure resource group>",
    workspace_name="<Azure Machine Learning Workspace>",
)
job = command(
    command="echo 'hello world'",
    environment="azureml://registries/azureml/environments/sklearn-1.5/labels/latest",
    resources=JobResourceConfiguration(instance_type="Standard_NC24", instance_count=4),
)
# submit the command job
ml_client.create_or_update(job)

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: echo "hello world"
environment:
  image: library/python:latest
resources:
  instance_count: 4
  instance_type: Standard_NC24

To change the job tier, use queue_settings to choose between dedicated VMs (job_tier: Standard) and low-priority VMs (job_tier: Spot).

Python SDK
Azure CLI

from azure.ai.ml import command
from azure.ai.ml import MLClient    # Handle to the workspace
from azure.identity import DefaultAzureCredential    # Authentication package
credential = DefaultAzureCredential()
# Get a handle to the workspace. You can find the info on the workspace tab on ml.azure.com
ml_client = MLClient(
    credential=credential,
    subscription_id="<Azure subscription id>", 
    resource_group_name="<Azure resource group>",
    workspace_name="<Azure Machine Learning Workspace>",
)
job = command(
    command="echo 'hello world'",
    environment="azureml://registries/azureml/environments/sklearn-1.5/labels/latest",
    queue_settings={
      "job_tier": "spot"  
    }
)
# submit the command job
ml_client.create_or_update(job)

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
component: ./train.yml 
queue_settings:
   job_tier: Standard #Possible Values are Standard (dedicated), Spot (low priority). Default is Standard.
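
If you build queue_settings programmatically, it can help to validate the tier before you submit the job. The following is a minimal sketch under stated assumptions: `queue_settings_for` is a hypothetical helper, and the canonical spellings `Standard` and `Spot` are taken from the YAML comment above. It makes no claim about which spellings the service itself accepts.

```python
# Canonical tier names, keyed by lowercase input (assumption: taken from the
# "Possible Values" comment in the YAML example).
VALID_JOB_TIERS = {"standard": "Standard", "spot": "Spot"}


def queue_settings_for(tier: str = "Standard") -> dict:
    """Return a queue_settings dict for the given tier (hypothetical helper)."""
    key = tier.strip().lower()
    if key not in VALID_JOB_TIERS:
        allowed = ", ".join(sorted(VALID_JOB_TIERS.values()))
        raise ValueError(f"Unknown job tier {tier!r}; expected one of: {allowed}")
    return {"job_tier": VALID_JOB_TIERS[key]}


print(queue_settings_for())        # dedicated VMs, the default
print(queue_settings_for("spot"))  # low-priority VMs
```

The returned dictionary has the same shape as the `queue_settings` argument used in the Python SDK example.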

Example with all fields for command jobs

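Here's an example with all fields specified, including the identity the job should use. You don't need to specify virtual network settings, because workspace-level managed network isolation is used automatically.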

Python SDK
Azure CLI

from azure.ai.ml import command
from azure.ai.ml import MLClient      # Handle to the workspace
from azure.identity import DefaultAzureCredential     # Authentication package
from azure.ai.ml.entities import ResourceConfiguration
from azure.ai.ml.entities import UserIdentityConfiguration 

credential = DefaultAzureCredential()
# Get a handle to the workspace. You can find the info on the workspace tab on ml.azure.com
ml_client = MLClient(
    credential=credential,
    subscription_id="<Azure subscription id>", 
    resource_group_name="<Azure resource group>",
    workspace_name="<Azure Machine Learning Workspace>",
)
job = command(
    command="echo 'hello world'",
    environment="azureml://registries/azureml/environments/sklearn-1.5/labels/latest",
    identity=UserIdentityConfiguration(),
    queue_settings={
      "job_tier": "Standard"  
    }
)
job.resources = ResourceConfiguration(instance_type="Standard_E4s_v3", instance_count=1)
# submit the command job
ml_client.create_or_update(job)

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: echo "hello world"
environment:
  image: library/python:latest
queue_settings:
   job_tier: Standard #Possible Values are Standard, Spot. Default is Standard.
identity:
  type: user_identity #Possible values are Managed, user_identity
resources:
  instance_count: 1
  instance_type: Standard_E4s_v3

View more examples of training with serverless compute at:

AutoML job

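You don't need to specify compute for AutoML jobs. Resources can optionally be specified. If the instance count isn't specified, it defaults based on the max_concurrent_trials and max_nodes parameters. If you submit an AutoML image classification or NLP task without an instance type, a GPU VM size is selected automatically. You can submit AutoML jobs through the CLI, the SDK, or studio. To submit AutoML jobs with serverless compute in studio, first enable the Submit a training job in studio (preview) feature in the preview panel.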

Python SDK
Azure CLI

If you want to specify the instance type or instance count, use the ResourceConfiguration class.

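from azure.ai.ml import automl
from azure.ai.ml.automl import ClassificationModels
from azure.ai.ml.entities import ResourceConfiguration

# Create the AutoML classification job with the related factory-function.
# exp_name, my_training_data_input, and max_trials are assumed to be defined earlier.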

classification_job = automl.classification(
    experiment_name=exp_name,
    training_data=my_training_data_input,
    target_column_name="y",
    primary_metric="accuracy",
    n_cross_validations=5,
    enable_model_explainability=True,
    tags={"my_custom_tag": "My custom value"},
)

# Limits are all optional
classification_job.set_limits(
    timeout_minutes=600,
    trial_timeout_minutes=20,
    max_trials=max_trials,
    # max_concurrent_trials = 4,
    # max_cores_per_trial: -1,
    enable_early_termination=True,
)

# Training properties are optional
classification_job.set_training(
    blocked_training_algorithms=[ClassificationModels.LOGISTIC_REGRESSION],
    enable_onnx_compatible_models=True,
)

# Serverless compute resources used to run the job
classification_job.resources = ResourceConfiguration(
    instance_type="Standard_E4s_v3", instance_count=6
)

If you want to specify the instance type or instance count, add a resources section.

$schema: https://azuremlsdk2.blob.core.windows.net/preview/0.0.1/autoMLJob.schema.json
type: automl
experiment_name: dpv2-cli-automl-classifier-experiment
description: A Classification job using bank marketing
# Serverless compute is used to run this AutoML job. 
# Through serverless compute, Azure Machine Learning takes care of creating, scaling, deleting, patching and managing compute, along with providing managed network isolation, reducing the burden on you.

task: classification
log_verbosity: debug
primary_metric: accuracy

target_column_name: "y"

#validation_data_size: 0.20
#n_cross_validations: 5
#test_data_size: 0.1

training_data:
  path: "./training-mltable-folder"
  type: mltable
validation_data:
  path: "./validation-mltable-folder"
  type: mltable
test_data:
  path: "./test-mltable-folder"
  type: mltable

limits:
  timeout_minutes: 180
  max_trials: 40
  max_concurrent_trials: 5
  trial_timeout_minutes: 20
  enable_early_termination: true
  exit_score: 0.92

featurization:
  mode: custom
  transformer_params:
    imputer:
      - fields: ["job"]
        parameters:
          strategy: most_frequent
  blocked_transformers:
    - WordEmbedding
training:
  enable_model_explainability: true
  allowed_training_algorithms:
    - gradient_boosting
    - logistic_regression
# Resources to run this serverless job
resources:
  instance_type: Standard_E4s_v3
  instance_count: 5

Pipeline job

For a pipeline job in the Python SDK, specify "serverless" as the default compute to use serverless compute.

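from azure.ai.ml import Input
from azure.ai.ml.dsl import pipeline

# Construct pipeline
# train_model, score_data, eval_model (loaded components) and parent_dir
# are assumed to be defined earlier.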
@pipeline()
def pipeline_with_components_from_yaml(
    training_input,
    test_input,
    training_max_epochs=20,
    training_learning_rate=1.8,
    learning_rate_schedule="time-based",
):
    """E2E dummy train-score-eval pipeline with components defined via yaml."""
    # Call component obj as function: apply given inputs & parameters to create a node in pipeline
    train_with_sample_data = train_model(
        training_data=training_input,
        max_epochs=training_max_epochs,
        learning_rate=training_learning_rate,
        learning_rate_schedule=learning_rate_schedule,
    )

    score_with_sample_data = score_data(
        model_input=train_with_sample_data.outputs.model_output, test_data=test_input
    )
    score_with_sample_data.outputs.score_output.mode = "upload"

    eval_with_sample_data = eval_model(
        scoring_result=score_with_sample_data.outputs.score_output
    )

    # Return: pipeline outputs
    return {
        "trained_model": train_with_sample_data.outputs.model_output,
        "scored_data": score_with_sample_data.outputs.score_output,
        "evaluation_report": eval_with_sample_data.outputs.eval_output,
    }


pipeline_job = pipeline_with_components_from_yaml(
    training_input=Input(type="uri_folder", path=parent_dir + "/data/"),
    test_input=Input(type="uri_folder", path=parent_dir + "/data/"),
    training_max_epochs=20,
    training_learning_rate=1.8,
    learning_rate_schedule="time-based",
)

# set pipeline to use serverless compute
pipeline_job.settings.default_compute = "serverless"

For a pipeline job in the CLI, specify azureml:serverless as the default compute to use serverless compute.

$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: 1b_e2e_registered_components
description: E2E dummy train-score-eval pipeline with registered components
# Serverless compute is used to run this pipeline job. 
# Through serverless compute, Azure Machine Learning takes care of creating, scaling, deleting, patching and managing compute, along with providing managed network isolation, reducing the burden on you.
inputs:
  pipeline_job_training_max_epocs: 20
  pipeline_job_training_learning_rate: 1.8
  pipeline_job_learning_rate_schedule: 'time-based'

outputs: 
  pipeline_job_trained_model:
    mode: upload
  pipeline_job_scored_data:
    mode: upload
  pipeline_job_evaluation_report:
    mode: upload

settings:
 default_compute: azureml:serverless

jobs:
  train_job:
    type: command
    component: azureml:my_train@latest
    inputs:
      training_data: 
        type: uri_folder 
        path: ./data      
      max_epocs: ${{parent.inputs.pipeline_job_training_max_epocs}}
      learning_rate: ${{parent.inputs.pipeline_job_training_learning_rate}}
      learning_rate_schedule: ${{parent.inputs.pipeline_job_learning_rate_schedule}}
    outputs:
      model_output: ${{parent.outputs.pipeline_job_trained_model}}
    services:
      my_vscode:
        type: vs_code
      my_jupyter_lab:
        type: jupyter_lab
      my_tensorboard:
        type: tensor_board
        log_dir: "outputs/tblogs"
    #  my_ssh:
    #    type: ssh
    #    ssh_public_keys: <paste the entire pub key content>
    #    nodes: all # Use the `nodes` property to pick which node you want to enable interactive services on. If `nodes` are not selected, by default, interactive applications are only enabled on the head node.

  score_job:
    type: command
    component: azureml:my_score@latest
    inputs:
      model_input: ${{parent.jobs.train_job.outputs.model_output}}
      test_data: 
        type: uri_folder 
        path: ./data
    outputs:
      score_output: ${{parent.outputs.pipeline_job_scored_data}}

  evaluate_job:
    type: command
    component: azureml:my_eval@latest
    inputs:
      scoring_result: ${{parent.jobs.score_job.outputs.score_output}}
    outputs:
      eval_output: ${{parent.outputs.pipeline_job_evaluation_report}}

You can also set serverless compute as the default compute in designer.

Next steps