Multi-GPU workloads

Important

This feature is in Beta. Workspace admins can control access to this feature from the Previews page. See Manage Azure Databricks previews.

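You can launch distributed workloads across multiple GPUs on a single node using the Serverless GPU Python API. The API provides a simple, unified interface that abstracts away the details of GPU provisioning, environment setup, and workload distribution. With minimal code changes, you can move from single-GPU training to multi-GPU distributed execution within the same notebook.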

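Supported frameworks

The @distributed API integrates with the major distributed training libraries:

PyTorch DDP (Distributed Data Parallel) - Standard multi-GPU data parallelism.
FSDP (Fully Sharded Data Parallel) - Memory-efficient training for large models.
DeepSpeed - Microsoft's optimization library for training large models.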

serverless_gpu API vs. TorchDistributor

The following table compares the serverless_gpu @distributed API with TorchDistributor.

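Feature	`serverless_gpu` `@distributed` API	TorchDistributor
Infrastructure	Fully serverless, no cluster management	Requires a Spark cluster with GPU workers
Setup	Single decorator, minimal configuration	Requires Spark cluster and TorchDistributor setup
Framework support	PyTorch DDP, FSDP, DeepSpeed	Primarily PyTorch DDP
Data loading	Inside the decorator, using Unity Catalog volumes	Through Spark or the file system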

The serverless_gpu API is the recommended approach for new deep learning workloads on Databricks. TorchDistributor remains available for workloads that are tightly coupled to Spark clusters.

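Quickstart

The Serverless GPU API for distributed training is preinstalled when you connect to serverless GPU compute from Databricks notebooks and jobs. GPU environment 4 or above is recommended. To use the API for distributed training, import the distributed decorator and apply it to your training function.

Wrap your model training code in a function and decorate the function with @distributed. The decorated function becomes the entry point for distributed execution. Define all training logic, data loading, and model initialization inside this function.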

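Warning

The @distributed decorator takes a gpu_type parameter that must match the accelerator type the notebook is attached to. For example, @distributed(gpus=8, gpu_type='H100') requires the notebook to be attached to an H100 accelerator. Using a mismatched accelerator type (for example, specifying H100 while attached to A10) causes the workload to fail.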

The following code snippet shows the basic usage of @distributed:

# Import the distributed decorator
from serverless_gpu import distributed

# Decorate your training function with @distributed and specify the number of GPUs and GPU type
@distributed(gpus=8, gpu_type='H100')
def run_train():
    ...

The following is a complete example that trains a multilayer perceptron (MLP) model on 8 H100 GPUs from a notebook.

Set up the model and define utility functions.


# Define the model
import os
import torch
import torch.distributed as dist
import torch.nn as nn

def setup():
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

def cleanup():
    dist.destroy_process_group()

class SimpleMLP(nn.Module):
    def __init__(self, input_dim=10, hidden_dim=64, output_dim=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_dim, output_dim)
        )

    def forward(self, x):
        return self.net(x)
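
The setup() function above depends on environment variables. With its default env:// initialization, dist.init_process_group reads RANK and WORLD_SIZE (along with MASTER_ADDR and MASTER_PORT), and the example reads LOCAL_RANK to select a GPU. The sketch below assumes the launcher sets these torchrun-style variables for each process. DistEnv and read_dist_env are hypothetical names used only for illustration and are not part of serverless_gpu.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DistEnv:
    rank: int        # global index of this process
    local_rank: int  # index of this process on its node; selects the GPU
    world_size: int  # total number of processes


def read_dist_env(env=None) -> DistEnv:
    """Read torchrun-style variables, defaulting to a single-process run."""
    env = os.environ if env is None else env
    return DistEnv(
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
    )


# Example: the values the fourth of eight processes on one node would see.
print(read_dist_env({"RANK": "3", "LOCAL_RANK": "3", "WORLD_SIZE": "8"}))
```

Falling back to rank 0 and a world size of 1 lets the same helper run in a plain single-process session, which can be convenient when debugging outside a distributed launch.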

Import the serverless_gpu library and the distributed module.
```
import serverless_gpu
from serverless_gpu import distributed
```

Wrap the model training code in a function and decorate the function with @distributed.

@distributed(gpus=8, gpu_type='H100')
def run_train(num_epochs: int, batch_size: int) -> None:
    import mlflow
    import torch.optim as optim
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

    # 1. Set up multi-GPU environment
    setup()
    device = torch.device(f"cuda:{int(os.environ['LOCAL_RANK'])}")

    # 2. Apply the Torch distributed data parallel (DDP) library for data-parallel training.
    model = SimpleMLP().to(device)
    model = DDP(model, device_ids=[device])

    # 3. Create and load dataset.
    x = torch.randn(5000, 10)
    y = torch.randn(5000, 1)

    dataset = TensorDataset(x, y)
    sampler = DistributedSampler(dataset)
    dataloader = DataLoader(dataset, sampler=sampler, batch_size=batch_size)

    # 4. Define the training loop.
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    loss_fn = nn.MSELoss()

    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)
        model.train()
        total_loss = 0.0
        for step, (xb, yb) in enumerate(dataloader):
            xb, yb = xb.to(device), yb.to(device)
            optimizer.zero_grad()
            loss = loss_fn(model(xb), yb)
            # Log loss to MLflow metric
            mlflow.log_metric("loss", loss.item(), step=step)

            loss.backward()
            optimizer.step()
            total_loss += loss.item() * xb.size(0)

        mlflow.log_metric("total_loss", total_loss)
        print(f"Total loss for epoch {epoch}: {total_loss}")

    cleanup()
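
To make the role of DistributedSampler in the function above concrete, the following sketch reproduces its index partitioning in plain Python. It mirrors the sampler's behavior with shuffle=False and drop_last=False; the example above uses the default shuffle=True, which permutes the indices each epoch before the same partitioning is applied. shard_indices is a hypothetical helper written for illustration, not a PyTorch function.

```python
import math


def shard_indices(dataset_len: int, num_replicas: int, rank: int) -> list[int]:
    """Indices one rank sees, without shuffling and without dropping the tail."""
    num_samples = math.ceil(dataset_len / num_replicas)  # per-rank sample count
    total_size = num_samples * num_replicas
    indices = list(range(dataset_len))
    # Pad by repeating indices from the start so every rank gets the same count.
    padding = total_size - len(indices)
    if padding > 0:
        indices += (indices * math.ceil(padding / len(indices)))[:padding]
    # Each rank takes every num_replicas-th index, starting at its own rank.
    return indices[rank:total_size:num_replicas]


shards = [shard_indices(5000, 8, rank) for rank in range(8)]
print(len(shards[0]))  # 625
print(shards[3][:3])   # [3, 11, 19]
```

With 5,000 samples and 8 processes, each process handles 625 samples per epoch, so the total_loss value printed by a process covers only its own shard.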

Run the distributed training by calling the function's distributed method with your own arguments.
```
run_train.distributed(num_epochs=3, batch_size=1)
```
When the call runs, an MLflow run link appears in the notebook cell output. Click the link, or find the run in the Experiment panel, to view the results.
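
When reading the resulting charts, note that the training loop logs loss with step=step, and step restarts at 0 in every epoch, so points from different epochs share the same x-axis values. Every process also executes the logging call. The sketch below shows one possible adjustment, assuming the launcher sets a RANK variable: a step that keeps increasing across epochs, and a guard that is true only for rank 0. global_step and is_main_process are hypothetical helpers, and whether per-rank logging is desirable depends on your setup.

```python
import os


def global_step(epoch: int, step: int, steps_per_epoch: int) -> int:
    """Step index that keeps increasing across epochs."""
    return epoch * steps_per_epoch + step


def is_main_process() -> bool:
    """True for global rank 0; assumes the launcher sets RANK (defaults to 0)."""
    return int(os.environ.get("RANK", "0")) == 0


# With 625 batches per epoch, the first batch of epoch 2 gets step 1250.
print(global_step(epoch=2, step=0, steps_per_epoch=625))
```

Inside the loop, this could be used as `if is_main_process(): mlflow.log_metric("loss", loss.item(), step=global_step(epoch, step, len(dataloader)))`.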

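Distributed execution details

The Serverless GPU API consists of several key components:

Compute manager: Handles resource allocation and management
Runtime environment: Manages Python environments and dependencies
Launcher: Orchestrates job execution and monitoring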

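When running in distributed mode:

The function is serialized and distributed across the specified number of GPUs.
Each GPU runs a copy of the function with the same parameters.
The environment is synchronized across all GPUs.
Results are collected from all GPUs and returned.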
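
The steps above can be pictured with a toy, sequential stand-in. This is not the serverless_gpu implementation: real workers run in parallel in separate processes on separate GPUs. The names simulate_distributed and train_stub are hypothetical, and the shape of the return value (one entry per rank) is an assumption made for illustration.

```python
import os
import pickle


def simulate_distributed(fn, num_gpus: int, **kwargs):
    """Toy launcher: run fn once per simulated rank with identical arguments."""
    payload = pickle.dumps(kwargs)  # arguments are serialized once
    saved = {k: os.environ.get(k) for k in ("RANK", "WORLD_SIZE")}
    results = []
    try:
        for rank in range(num_gpus):
            os.environ["RANK"] = str(rank)  # each copy learns its own rank
            os.environ["WORLD_SIZE"] = str(num_gpus)
            results.append(fn(**pickle.loads(payload)))  # same arguments per copy
    finally:
        # Restore the caller's environment.
        for key, value in saved.items():
            if value is None:
                os.environ.pop(key, None)
            else:
                os.environ[key] = value
    return results  # one result per simulated rank


def train_stub(num_epochs: int) -> str:
    return f"rank {os.environ['RANK']} ran {num_epochs} epochs"


print(simulate_distributed(train_stub, num_gpus=2, num_epochs=3))
# ['rank 0 ran 3 epochs', 'rank 1 ran 3 epochs']
```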

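The API supports popular parallel training libraries such as Distributed Data Parallel (DDP), Fully Sharded Data Parallel (FSDP), and DeepSpeed.

You can find more realistic distributed training scenarios that use these libraries in the notebook examples.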

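Frequently asked questions (FAQ)

Where should the data loading code be placed?

When you use the Serverless GPU API for distributed training, move the data loading code inside the function decorated with @distributed. A dataset can exceed the maximum size that pickle allows, so create the dataset inside the decorated function, as shown below: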

from serverless_gpu import distributed

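# Avoid: a dataset created outside the function must be pickled and may cause a pickle error
dataset = get_dataset(file_path)

@distributed(gpus=8, gpu_type='H100')
def run_train():
    # Good practice: create the dataset inside the decorated function
    dataset = get_dataset(file_path)
    ...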
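
To get a feel for the difference, the sketch below compares the pickled size of an in-memory stand-in dataset with the pickled size of a path string. It only illustrates the size gap and makes no claim about how serverless_gpu serializes functions internally. The list of floats and the Unity Catalog volume path are hypothetical placeholders.

```python
import pickle

# Stand-in for an in-memory dataset: 100,000 floats.
dataset = [float(i) for i in range(100_000)]
# Hypothetical path; the data would be loaded from it inside the decorated function.
file_path = "/Volumes/catalog/schema/volume/train.parquet"

dataset_payload = pickle.dumps(dataset)
path_payload = pickle.dumps(file_path)

print(f"dataset payload: {len(dataset_payload):,} bytes")
print(f"path payload:    {len(path_payload):,} bytes")
```

Passing only the path keeps the serialized payload small regardless of how large the data is, because each process loads the data itself.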

Learn more

For the API reference, see the Serverless GPU Python API documentation.


Last updated on 2026-03-21