Train scikit-learn models at scale with Azure Machine Learning

Article
03/26/2024

APPLIES TO: Python SDK azure-ai-ml v2 (current)

In this article, learn how to run your scikit-learn training scripts using the Azure Machine Learning Python SDK v2.

The example scripts in this article classify iris flowers by building a machine learning model on scikit-learn's iris dataset.

Whether you're training a scikit-learn model from the ground up or bringing an existing model into the cloud, you can use Azure Machine Learning to scale out open-source training jobs on elastic cloud compute resources. You can also build, deploy, version, and monitor production-grade models with Azure Machine Learning.

Prerequisites

You can run the code for this article in either an Azure Machine Learning compute instance or your own Jupyter Notebook.

Azure Machine Learning compute instance
- To create a compute instance, complete Create resources to get started. Every compute instance includes a dedicated notebook server preloaded with the SDK and the notebooks sample repository.
- Select the Notebook tab in Azure Machine Learning studio. In the samples training folder, find a completed and expanded notebook by navigating to this directory: v2 > sdk > jobs > single-step > scikit-learn > train-hyperparameter-tune-deploy-with-sklearn.
- You can use the prepopulated code in the sample training folder to complete this tutorial.
Jupyter Notebook server
- Install the Azure Machine Learning SDK (v2).

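Set up the job

This section sets up the training job by loading the required Python packages, connecting to a workspace, creating a compute resource to run a command job, and creating an environment to run the job.

Connect to the workspace

First, you need to connect to your Azure Machine Learning workspace. The workspace is the top-level resource for the service. It provides a centralized place to work with all the artifacts you create when you use Azure Machine Learning.

We use DefaultAzureCredential to get access to the workspace. This credential should be capable of handling most Azure SDK authentication scenarios.

If DefaultAzureCredential doesn't work for you, see azure-identity reference documentation or Set up authentication for more available credentials.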

# Handle to the workspace
from azure.ai.ml import MLClient

# Authentication package
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()

If you prefer to sign in and authenticate through a browser, uncomment the following code and use it instead.

# Handle to the workspace
# from azure.ai.ml import MLClient

# Authentication package
# from azure.identity import InteractiveBrowserCredential
# credential = InteractiveBrowserCredential()

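Next, get a handle to the workspace by providing your subscription ID, resource group name, and workspace name. To find these parameters:

- Look for your workspace name in the upper-right corner of the Azure Machine Learning studio toolbar.
- Select your workspace name to show your resource group and subscription ID.
- Copy the values for the resource group and subscription ID into the code.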

# Get a handle to the workspace
ml_client = MLClient(
    credential=credential,
    subscription_id="<SUBSCRIPTION_ID>",
    resource_group_name="<RESOURCE_GROUP>",
    workspace_name="<AML_WORKSPACE_NAME>",
)

The result of running this script is a workspace handle that you use to manage other resources and jobs.
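
To keep the three identifiers out of the notebook, one option is to read them from environment variables. The following is a minimal sketch, not part of the original sample: the variable names AZURE_SUBSCRIPTION_ID, AZURE_RESOURCE_GROUP, and AZURE_ML_WORKSPACE and the helper workspace_settings_from_env are assumptions chosen for illustration.

```python
import os

# Hypothetical environment variable names; pick whatever fits your setup.
ENV_VARS = {
    "subscription_id": "AZURE_SUBSCRIPTION_ID",
    "resource_group_name": "AZURE_RESOURCE_GROUP",
    "workspace_name": "AZURE_ML_WORKSPACE",
}


def workspace_settings_from_env(environ=os.environ):
    """Collect the three workspace identifiers, failing early if any is missing."""
    missing = [var for var in ENV_VARS.values() if not environ.get(var)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {param: environ[var] for param, var in ENV_VARS.items()}


# Demonstration with a stand-in mapping instead of the real environment
example_env = {
    "AZURE_SUBSCRIPTION_ID": "<SUBSCRIPTION_ID>",
    "AZURE_RESOURCE_GROUP": "<RESOURCE_GROUP>",
    "AZURE_ML_WORKSPACE": "<AML_WORKSPACE_NAME>",
}
settings = workspace_settings_from_env(example_env)
print(settings)
```

With real values in the environment, the returned dictionary can be unpacked into the constructor, for example MLClient(credential=credential, **workspace_settings_from_env()).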

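Note

Creating MLClient doesn't connect the client to the workspace. The client initialization is lazy and waits until the first time it needs to make a call. In this article, that happens during compute creation.

Create a compute resource

Azure Machine Learning needs a compute resource to run a job. This resource can be a single-node or multi-node machine with a Linux or Windows OS, or a specific compute fabric like Spark.

The following example script provisions a Linux compute cluster. See the Azure Machine Learning pricing page for the full list of VM sizes and prices. Because this example only needs a basic cluster, we pick the Standard_DS3_v2 size, which has 4 vCPU cores and 14 GB of RAM, to create an Azure Machine Learning compute.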

from azure.ai.ml.entities import AmlCompute

# Name assigned to the compute cluster
cpu_compute_target = "cpu-cluster"

try:
    # let's see if the compute target already exists
    cpu_cluster = ml_client.compute.get(cpu_compute_target)
    print(
        f"You already have a cluster named {cpu_compute_target}, we'll reuse it as is."
    )

except Exception:
    print("Creating a new cpu compute target...")

    # Let's create the Azure ML compute object with the intended parameters
    cpu_cluster = AmlCompute(
        name=cpu_compute_target,
        # Azure ML Compute is the on-demand VM service
        type="amlcompute",
        # VM Family
        size="STANDARD_DS3_V2",
        # Minimum running nodes when there is no job running
        min_instances=0,
        # Nodes in cluster
        max_instances=4,
        # Seconds a node stays idle after a job ends before it is scaled down
        idle_time_before_scale_down=180,
        # Dedicated or LowPriority. The latter is cheaper but there is a chance of job termination
        tier="Dedicated",
    )

    # Now, we pass the object to MLClient's create_or_update method
    cpu_cluster = ml_client.compute.begin_create_or_update(cpu_cluster).result()

print(
    f"AMLCompute with name {cpu_cluster.name} is created, the compute size is {cpu_cluster.size}"
)

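Create a job environment

To run an Azure Machine Learning job, you need an environment. An Azure Machine Learning environment encapsulates the dependencies (such as the software runtime and libraries) needed to run your machine learning training script on your compute resource. This environment is similar to a Python environment on your local machine.

Azure Machine Learning lets you either use a curated (ready-made) environment or create a custom environment from a Docker image or a Conda configuration. In this article, you create a custom environment for your jobs by using a Conda YAML file.

Create a custom environment

To create your custom environment, define your Conda dependencies in a YAML file. First, create a directory for storing the file. In this example, the directory is named env.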

import os

dependencies_dir = "./env"
os.makedirs(dependencies_dir, exist_ok=True)

Then, create the file in the dependencies directory. In this example, the file is named conda.yaml.

%%writefile {dependencies_dir}/conda.yaml
name: sklearn-env
channels:
  - conda-forge
dependencies:
  - python=3.8
  - pip=21.2.4
  - scikit-learn=0.24.2
  - scipy=1.7.1
  - pip:  
    - azureml-mlflow==1.42.0
    - mlflow-skinny==2.3.2

The specification contains some common packages (such as scikit-learn, scipy, and pip) that you use in your job.
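
The %%writefile magic only works inside Jupyter. If you run these steps from a plain Python script, the same file can be written with the standard library. This is a sketch under that assumption; the helper name write_conda_file is illustrative and not part of the original sample.

```python
from pathlib import Path
from textwrap import dedent

# Same specification as above, kept as a string so no notebook magic is needed
CONDA_SPEC = dedent(
    """\
    name: sklearn-env
    channels:
      - conda-forge
    dependencies:
      - python=3.8
      - pip=21.2.4
      - scikit-learn=0.24.2
      - scipy=1.7.1
      - pip:
        - azureml-mlflow==1.42.0
        - mlflow-skinny==2.3.2
    """
)


def write_conda_file(directory):
    """Create the directory if needed and write conda.yaml into it."""
    target = Path(directory) / "conda.yaml"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(CONDA_SPEC, encoding="utf-8")
    return target


conda_path = write_conda_file("./env")
print(conda_path.read_text(encoding="utf-8"))
```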

Next, use the YAML file to create and register this custom environment in your workspace. The environment is packaged into a Docker container at runtime.

from azure.ai.ml.entities import Environment

custom_env_name = "sklearn-env"

job_env = Environment(
    name=custom_env_name,
    description="Custom environment for sklearn image classification",
    conda_file=os.path.join(dependencies_dir, "conda.yaml"),
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest",
)
job_env = ml_client.environments.create_or_update(job_env)

print(
    f"Environment with name {job_env.name} is registered to workspace, the environment version is {job_env.version}"
)

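For more information on creating and using environments, see Create and use software environments in Azure Machine Learning.

[Optional] Create a custom environment with Intel® Extension for Scikit-Learn

Want to speed up your scikit-learn scripts on Intel hardware? Try adding Intel® Extension for Scikit-Learn to your conda yaml file and then following the subsequent steps described above. A later part of this example shows how to enable these optimizations.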

%%writefile {dependencies_dir}/conda.yaml
name: sklearn-env
channels:
  - conda-forge
dependencies:
  - python=3.8
  - pip=21.2.4
  - scikit-learn=0.24.2
  - scikit-learn-intelex
  - scipy=1.7.1
  - pip:  
    - azureml-mlflow==1.42.0
    - mlflow-skinny==2.3.2

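Configure and submit your training job

This section covers how to run a training job by using the provided training script. To begin, you build the training job by configuring the command that runs the training script. Then, you submit the training job to run in Azure Machine Learning.

Prepare the training script

This article provides the training script train_iris.py. In practice, you should be able to take any custom training script as is and run it with Azure Machine Learning without having to modify your code.

Note

The provided training script does the following:

- Shows how to log some metrics to your Azure Machine Learning run.
- Loads the training data by using iris = datasets.load_iris().
- Trains a model, then saves and registers it.

To use and access your own data, see how to read and write data in a job to make data available during training.

To use the training script, first create a directory where you store the file.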

import os

src_dir = "./src"
os.makedirs(src_dir, exist_ok=True)

Next, create the script file in the source directory.

%%writefile {src_dir}/train_iris.py
# Modified from https://www.geeksforgeeks.org/multiclass-classification-using-scikit-learn/

import argparse
import os

# importing necessary libraries
import numpy as np

from sklearn import datasets
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

import joblib

import mlflow
import mlflow.sklearn

def main():
    parser = argparse.ArgumentParser()

    parser.add_argument('--kernel', type=str, default='linear',
                        help='Kernel type to be used in the algorithm')
    parser.add_argument('--penalty', type=float, default=1.0,
                        help='Penalty parameter of the error term')

    # Start Logging
    mlflow.start_run()

    # enable autologging
    mlflow.sklearn.autolog()

    args = parser.parse_args()
    mlflow.log_param('Kernel type', str(args.kernel))
    mlflow.log_metric('Penalty', float(args.penalty))

    # loading the iris dataset
    iris = datasets.load_iris()

    # X -> features, y -> label
    X = iris.data
    y = iris.target

    # dividing X, y into train and test data
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # training a linear SVM classifier
    from sklearn.svm import SVC
    svm_model_linear = SVC(kernel=args.kernel, C=args.penalty)
    svm_model_linear = svm_model_linear.fit(X_train, y_train)
    svm_predictions = svm_model_linear.predict(X_test)

    # model accuracy for X_test
    accuracy = svm_model_linear.score(X_test, y_test)
    print('Accuracy of SVM classifier on test set: {:.2f}'.format(accuracy))
    mlflow.log_metric('Accuracy', float(accuracy))
    # creating a confusion matrix
    cm = confusion_matrix(y_test, svm_predictions)
    print(cm)

    registered_model_name="sklearn-iris-flower-classify-model"

    ##########################
    #<save and register model>
    ##########################
    # Registering the model to the workspace
    print("Registering the model via MLFlow")
    mlflow.sklearn.log_model(
        sk_model=svm_model_linear,
        registered_model_name=registered_model_name,
        artifact_path=registered_model_name
    )

    # # Saving the model to a file
    print("Saving the model via MLFlow")
    mlflow.sklearn.save_model(
        sk_model=svm_model_linear,
        path=os.path.join(registered_model_name, "trained_model"),
    )
    ###########################
    #</save and register model>
    ###########################
    mlflow.end_run()

if __name__ == '__main__':
    main()
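
Before submitting a job, it can help to check locally how the script's two arguments are parsed. The sketch below copies only the argparse portion of train_iris.py into a helper (build_parser is a name introduced here for illustration) and parses an explicit argument list instead of the real command line.

```python
import argparse


def build_parser():
    """Recreate the argument parser used by train_iris.py."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--kernel", type=str, default="linear",
                        help="Kernel type to be used in the algorithm")
    parser.add_argument("--penalty", type=float, default=1.0,
                        help="Penalty parameter of the error term")
    return parser


parser = build_parser()

# No arguments: the defaults apply
defaults = parser.parse_args([])
print(defaults.kernel, defaults.penalty)

# Explicit arguments: --penalty is converted from text to float
custom = parser.parse_args(["--kernel", "rbf", "--penalty", "0.5"])
print(custom.kernel, custom.penalty)
```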

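[Optional] Enable Intel® Extension for Scikit-Learn optimizations for more performance on Intel® hardware

If you installed Intel® Extension for Scikit-Learn (as described in the previous section), you can enable the performance optimizations by adding two lines of code to the top of the script file, as shown below.

To learn more about Intel® Extension for Scikit-Learn, see the package's documentation.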

%%writefile {src_dir}/train_iris.py
# Modified from https://www.geeksforgeeks.org/multiclass-classification-using-scikit-learn/

import argparse
import os

# Import and enable Intel Extension for Scikit-learn optimizations
# where possible

from sklearnex import patch_sklearn
patch_sklearn()

# importing necessary libraries
import numpy as np


from sklearn import datasets
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

import joblib

import mlflow
import mlflow.sklearn

def main():
    parser = argparse.ArgumentParser()

    parser.add_argument('--kernel', type=str, default='linear',
                        help='Kernel type to be used in the algorithm')
    parser.add_argument('--penalty', type=float, default=1.0,
                        help='Penalty parameter of the error term')

    # Start Logging
    mlflow.start_run()

    # enable autologging
    mlflow.sklearn.autolog()

    args = parser.parse_args()
    mlflow.log_param('Kernel type', str(args.kernel))
    mlflow.log_metric('Penalty', float(args.penalty))

    # loading the iris dataset
    iris = datasets.load_iris()

    # X -> features, y -> label
    X = iris.data
    y = iris.target

    # dividing X, y into train and test data
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # training a linear SVM classifier
    from sklearn.svm import SVC
    svm_model_linear = SVC(kernel=args.kernel, C=args.penalty)
    svm_model_linear = svm_model_linear.fit(X_train, y_train)
    svm_predictions = svm_model_linear.predict(X_test)

    # model accuracy for X_test
    accuracy = svm_model_linear.score(X_test, y_test)
    print('Accuracy of SVM classifier on test set: {:.2f}'.format(accuracy))
    mlflow.log_metric('Accuracy', float(accuracy))
    # creating a confusion matrix
    cm = confusion_matrix(y_test, svm_predictions)
    print(cm)

    registered_model_name="sklearn-iris-flower-classify-model"

    ##########################
    #<save and register model>
    ##########################
    # Registering the model to the workspace
    print("Registering the model via MLFlow")
    mlflow.sklearn.log_model(
        sk_model=svm_model_linear,
        registered_model_name=registered_model_name,
        artifact_path=registered_model_name
    )

    # # Saving the model to a file
    print("Saving the model via MLFlow")
    mlflow.sklearn.save_model(
        sk_model=svm_model_linear,
        path=os.path.join(registered_model_name, "trained_model"),
    )
    ###########################
    #</save and register model>
    ###########################
    mlflow.end_run()

if __name__ == '__main__':
    main()
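
If the same script has to run in environments both with and without the extension, the import can be guarded. This is a sketch of one possible pattern rather than something the original sample does; the flag name intel_optimizations_enabled is an assumption. As in the script above, the guard belongs at the top of the file, before any scikit-learn imports.

```python
# Enable the Intel optimizations only when the package is installed
try:
    from sklearnex import patch_sklearn
except ImportError:
    patch_sklearn = None

if patch_sklearn is not None:
    patch_sklearn()
    intel_optimizations_enabled = True
else:
    intel_optimizations_enabled = False

print(f"Intel optimizations enabled: {intel_optimizations_enabled}")
```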

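Build the training job

Now that you have all the assets required to run your job, it's time to build it by using the Azure Machine Learning Python SDK v2. To run the job, you create a command.

An Azure Machine Learning command is a resource that specifies all the details needed to execute your training code in the cloud. These details include the inputs and outputs, the type of hardware to use, the software to install, and how to run your code. A command contains the information needed to execute a single command.

Configure the command

You use the general-purpose command to run the training script and perform your desired tasks. Create a Command object to specify the configuration details of your training job.

The inputs for this command are the kernel type and the penalty parameter.
For the parameter values:
- Provide the compute cluster cpu_compute_target = "cpu-cluster" that you created for running this command.
- Provide the custom environment sklearn-env that you created for running the Azure Machine Learning job.
- Configure the command line action itself. In this case, the command is python train_iris.py. You can access the inputs and outputs in the command through the ${{ ... }} notation.
- Configure metadata such as the display name and experiment name. An experiment is a container for all the iterations you do on a certain project. All jobs submitted under the same experiment name are listed next to each other in Azure Machine Learning studio.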

from azure.ai.ml import command
from azure.ai.ml import Input

job = command(
    inputs=dict(kernel="linear", penalty=1.0),
    compute=cpu_compute_target,
    environment=f"{job_env.name}:{job_env.version}",
    code="./src/",
    command="python train_iris.py --kernel ${{inputs.kernel}} --penalty ${{inputs.penalty}}",
    experiment_name="sklearn-iris-flowers",
    display_name="sklearn-classify-iris-flower-images",
)
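
To make the ${{ ... }} notation concrete, the following sketch mimics the substitution locally with a regular expression. It is purely illustrative: the service performs the real substitution when the job runs, and render_command is a hypothetical helper, not an SDK function.

```python
import re

# Matches placeholders of the form ${{inputs.<name>}}
PLACEHOLDER = re.compile(r"\$\{\{\s*inputs\.(\w+)\s*\}\}")


def render_command(template, inputs):
    """Substitute ${{inputs.<name>}} placeholders with values from `inputs`."""
    def replace(match):
        name = match.group(1)
        if name not in inputs:
            raise KeyError(f"No input named {name!r}")
        return str(inputs[name])

    return PLACEHOLDER.sub(replace, template)


template = "python train_iris.py --kernel ${{inputs.kernel}} --penalty ${{inputs.penalty}}"
rendered = render_command(template, {"kernel": "linear", "penalty": 1.0})
print(rendered)
```

The rendered string is the command line that train_iris.py would see with the inputs defined in the job above.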

Submit the job

It's now time to submit the job to run in Azure Machine Learning. This time, you use create_or_update on ml_client.jobs.

ml_client.jobs.create_or_update(job)

Once completed, the job registers a model in your workspace (as a result of training) and outputs a link for viewing the job in Azure Machine Learning studio.

Warning

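Azure Machine Learning runs training scripts by copying the entire source directory. If you have sensitive data that you don't want to upload, use an ignore file (.amlignore or .gitignore) or don't include the data in the source directory.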

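What happens during job execution

As the job is executed, it goes through the following stages:

- Preparing: A Docker image is created according to the environment defined. The image is uploaded to the workspace's container registry and cached for later runs. Logs are also streamed to the run history and can be viewed to monitor progress. If a curated environment is specified, the cached image backing that curated environment is used.
- Scaling: The cluster attempts to scale up if it requires more nodes to execute the run than are currently available.
- Running: All scripts in the script folder src are uploaded to the compute target, data stores are mounted or copied, and the script is executed. Outputs from stdout and the ./logs folder are streamed to the run history and can be used to monitor the run.

Tune model hyperparameters

Now that you've seen how to do a simple scikit-learn training run with the SDK, let's see if you can further improve the accuracy of your model. You can tune and optimize your model's hyperparameters by using Azure Machine Learning's sweep capabilities.

To tune the model's hyperparameters, define the parameter space to search during training. You do this by replacing some of the parameters passed to the training job (kernel and penalty) with special inputs from the azure.ai.ml.sweep package.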

from azure.ai.ml.sweep import Choice

# Reuse the command job created earlier; calling it as a function applies new inputs.
# kernel and penalty become search-space expressions instead of fixed values.
job_for_sweep = job(
    kernel=Choice(values=["linear", "rbf", "poly", "sigmoid"]),
    penalty=Choice(values=[0.5, 1, 1.5]),
)

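Then, configure the sweep on the command job by using sweep-specific parameters, such as the primary metric to watch and the sampling algorithm to use.

The following code uses random sampling to try different configuration sets of hyperparameters in an attempt to maximize the primary metric, Accuracy.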

sweep_job = job_for_sweep.sweep(
    compute="cpu-cluster",
    sampling_algorithm="random",
    primary_metric="Accuracy",
    goal="Maximize",
    max_total_trials=12,
    max_concurrent_trials=4,
)
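
Because both hyperparameters are Choice inputs, the search space is discrete and small enough to enumerate. The sketch below only counts the combinations, which can help when picking max_total_trials. It makes no claim about which combinations random sampling actually draws.

```python
from itertools import product

# The same values passed to Choice(...) above
kernels = ["linear", "rbf", "poly", "sigmoid"]
penalties = [0.5, 1, 1.5]

# Every (kernel, penalty) pair in the search space
combinations = list(product(kernels, penalties))
print(len(combinations))
```

Four kernels times three penalty values gives 12 combinations, which matches max_total_trials=12 in the sweep configuration.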

Now, you can submit this job as before. This time, you run a sweep job that sweeps over your training job.

returned_sweep_job = ml_client.create_or_update(sweep_job)

# stream the output and wait until the job is finished
ml_client.jobs.stream(returned_sweep_job.name)

# refresh the latest status of the job after streaming
returned_sweep_job = ml_client.jobs.get(name=returned_sweep_job.name)

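You can monitor the job by using the studio user interface link that is presented during the job run.

Find and register the best model

Once all the runs complete, you can find the run that produced the model with the highest accuracy.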

from azure.ai.ml.entities import Model

if returned_sweep_job.status == "Completed":

    # First let us get the run which gave us the best result
    best_run = returned_sweep_job.properties["best_child_run_id"]

    # lets get the model from this run
    model = Model(
        # the script stores the model as "sklearn-iris-flower-classify-model"
        path="azureml://jobs/{}/outputs/artifacts/paths/sklearn-iris-flower-classify-model/".format(
            best_run
        ),
        name="run-model-example",
        description="Model created from run.",
        type="custom_model",
    )

else:
    print(
        "Sweep job status: {}. Please wait until it completes".format(
            returned_sweep_job.status
        )
    )

You can then register this model.

registered_model = ml_client.models.create_or_update(model=model)
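
The model path used above follows a fixed pattern: the job name, followed by the artifact folder that the training script logged. As a small sketch, a helper can build that string so the folder name is written in one place. The name job_artifact_uri is hypothetical, and "example-child-run" is a placeholder rather than a real run ID.

```python
def job_artifact_uri(job_name, artifact_path):
    """Build an azureml:// URI for an artifact folder logged by a job."""
    return f"azureml://jobs/{job_name}/outputs/artifacts/paths/{artifact_path.strip('/')}/"


# "example-child-run" is a placeholder, not a real run ID
uri = job_artifact_uri("example-child-run", "sklearn-iris-flower-classify-model")
print(uri)
```

In the code above, the result of job_artifact_uri(best_run, "sklearn-iris-flower-classify-model") could be passed as the path argument of Model.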

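Deploy the model

After you register your model, you can deploy it the same way as any other registered model in Azure Machine Learning. For more information about deployment, see Deploy and score a machine learning model with managed online endpoint using Python SDK v2.

Next steps

In this article, you trained and registered a scikit-learn model, and you learned about deployment options. See these other articles to learn more about Azure Machine Learning.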