Optimized large language model (LLM) serving

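Important

This feature is in Public Preview.

Important

The code examples in this guide use deprecated APIs. Databricks recommends using the provisioned throughput experience for optimized inference of LLMs. See Migrate optimized LLM serving endpoints to provisioned throughput.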

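This article shows how to enable optimizations for large language models (LLMs) on Mosaic AI Model Serving.

Optimized LLM serving improves throughput and latency by a factor of 3 to 5 compared with traditional serving approaches. The following table summarizes the supported LLM families and their variants.

Databricks recommends installing foundation models from Databricks Marketplace. Search for a model family, select Get access on the model page, and provide your login credentials to install the model to Unity Catalog.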

Model family	Install from Marketplace
Llama 2	Llama 2 models
MPT
Mistral	Mistral models

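Requirements

Optimized LLM serving is supported as part of the GPU deployments Public Preview.
Your model must be logged using MLflow 2.4 or above, or Databricks Runtime 13.2 ML or above.
When deploying a model, match its parameter size to the appropriate compute size. For models with 50 billion or more parameters, contact your Azure Databricks account team to get access to the necessary GPUs.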

Model parameter size	Recommended compute size	Workload type
7 billion	1xA100	`GPU_LARGE`
13 billion	1xA100	`GPU_LARGE`
30-34 billion	1xA100	`GPU_LARGE`
70 billion	2xA100	`GPU_LARGE_2`
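
The sizing table can be encoded as a small lookup, which may be convenient when a deployment script has to choose a workload type. The following is a minimal sketch, not a Databricks API: the name `recommend_compute` is hypothetical, and treating sizes that fall between the listed rows as belonging to the next larger row is an assumption.

```python
from typing import NamedTuple


class ComputeRecommendation(NamedTuple):
    compute_size: str
    workload_type: str
    contact_account_team: bool


def recommend_compute(params_billions: float) -> ComputeRecommendation:
    """Map a parameter count (in billions) to a row of the sizing table.

    Assumption: sizes between the listed rows use the next larger row.
    """
    if params_billions <= 0:
        raise ValueError("parameter count must be positive")
    # Models with 50 billion or more parameters need GPUs that are
    # requested through the Azure Databricks account team.
    needs_team = params_billions >= 50
    if params_billions <= 34:
        return ComputeRecommendation("1xA100", "GPU_LARGE", needs_team)
    if params_billions <= 70:
        return ComputeRecommendation("2xA100", "GPU_LARGE_2", needs_team)
    raise ValueError("the table lists no recommendation above 70 billion parameters")


print(recommend_compute(13))
```

The returned `workload_type` is the value that goes into the `workload_type` field of the endpoint configuration.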

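Log your large language model

First, log your model with the MLflow transformers flavor and specify the task field in the MLflow metadata with metadata = {"task": "llm/v1/completions"}. This sets the API signature used by the model serving endpoint.

Optimized LLM serving is compatible with the route types supported by Azure Databricks AI Gateway, which is currently llm/v1/completions. If you want to serve a model family or task type that is not supported, contact your Azure Databricks account team.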

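import mlflow
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model weights in bfloat16 along with the matching tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "mosaicml/mpt-7b-instruct", torch_dtype=torch.bfloat16, trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("mosaicml/mpt-7b-instruct")
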
with mlflow.start_run():
    components = {
        "model": model,
        "tokenizer": tokenizer,
    }
    mlflow.transformers.log_model(
        artifact_path="model",
        transformers_model=components,
        input_example=["Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nWhat is Apache Spark?\n\n### Response:\n"],
        metadata={"task": "llm/v1/completions"},
        registered_model_name='mpt'
    )

After the model is logged, you can register it in Unity Catalog as follows, replacing CATALOG.SCHEMA.MODEL_NAME with the three-level name of the model.


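# Point the MLflow model registry at Unity Catalog
mlflow.set_registry_uri("databricks-uc")

# Three-level name, passed as registered_model_name when logging the model
registered_model_name = "CATALOG.SCHEMA.MODEL_NAME"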
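
A bare name such as 'mpt' has no catalog or schema, so it can help to check the name before registering. The sketch below uses a hypothetical helper, `split_uc_model_name`, and assumes that none of the three parts contains a dot.

```python
from typing import Tuple


def split_uc_model_name(full_name: str) -> Tuple[str, str, str]:
    """Split CATALOG.SCHEMA.MODEL_NAME into its three parts.

    Assumption: no individual part contains a dot.
    """
    parts = full_name.split(".")
    if len(parts) != 3 or any(not part.strip() for part in parts):
        raise ValueError(
            f"expected a three-level name CATALOG.SCHEMA.MODEL_NAME, got {full_name!r}"
        )
    catalog, schema, model_name = parts
    return catalog, schema, model_name


catalog, schema, model_name = split_uc_model_name("ml.llm-catalog.llama-13b")
print(catalog, schema, model_name)
```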

Create your model serving endpoint

Next, create your model serving endpoint. If your model is supported by optimized LLM serving, Azure Databricks automatically creates an optimized model serving endpoint when you try to serve it.

import requests
import json

# Set the name of the MLflow endpoint
endpoint_name = "llama2-13b-chat"

# Name of the registered MLflow model
model_name = "ml.llm-catalog.llama-13b"

# Version of the registered MLflow model to serve
model_version = 3

# Specify the type of compute (CPU, GPU_SMALL, GPU_LARGE, etc.)
workload_type = "GPU_LARGE"

# Specify the scale-out size of compute (Small, Medium, Large, etc.)
workload_size = "Small"

# Specify Scale to Zero (only supported for CPU endpoints)
scale_to_zero = False

# Get the API endpoint and token for the current notebook context
API_ROOT = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiUrl().get()
API_TOKEN = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get()

# send the POST request to create the serving endpoint

data = {
    "name": endpoint_name,
    "config": {
        "served_models": [
            {
                "model_name": model_name,
                "model_version": model_version,
                "workload_size": workload_size,
                "scale_to_zero_enabled": scale_to_zero,
                "workload_type": workload_type,
            }
        ]
    },
}

headers = {"Content-Type": "application/json", "Authorization": f"Bearer {API_TOKEN}"}

response = requests.post(
    url=f"{API_ROOT}/api/2.0/serving-endpoints", json=data, headers=headers
)

print(json.dumps(response.json(), indent=4))
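
The create request returns before the endpoint can serve traffic, so a script usually has to wait for it. The following is a minimal polling sketch. The helper `wait_until_ready` is hypothetical, and the shape of the endpoint description (a `state.ready` field equal to "READY") is an assumption to confirm against the REST API reference. The fetch function is passed in so that the loop can be exercised with stubbed responses.

```python
import time
from typing import Any, Callable, Dict


def wait_until_ready(
    fetch_endpoint: Callable[[], Dict[str, Any]],
    timeout_s: float = 3600.0,
    poll_interval_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll fetch_endpoint() until the endpoint reports ready.

    Raises TimeoutError if the endpoint is not ready within timeout_s.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        endpoint = fetch_endpoint()
        # Assumption: readiness is reported as state.ready == "READY".
        if endpoint.get("state", {}).get("ready") == "READY":
            return endpoint
        if time.monotonic() >= deadline:
            raise TimeoutError("endpoint did not become ready in time")
        sleep(poll_interval_s)


# Stubbed demonstration: the second poll reports the endpoint as ready.
fake_responses = iter([
    {"state": {"ready": "NOT_READY"}},
    {"state": {"ready": "READY"}},
])
result = wait_until_ready(lambda: next(fake_responses), poll_interval_s=0)
print(result["state"]["ready"])
```

In a notebook, `fetch_endpoint` could wrap a GET request to `{API_ROOT}/api/2.0/serving-endpoints/{endpoint_name}` that sends the same `Authorization` header as the create request and returns the parsed JSON.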

Input and output schema format

An optimized LLM serving endpoint has input and output schemas that Azure Databricks controls. Four formats are supported.

dataframe_split is a JSON-serialized Pandas Dataframe in the split orientation.

{
  "dataframe_split": {
    "columns": ["prompt"],
    "index": [0],
    "data": [
      [
        "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instructions:\nWhat is Apache Spark?\n\n### Response:\n"
      ]
    ]
  },
  "params": {
    "temperature": 0.5,
    "max_tokens": 100,
    "stop": ["word1", "word2"],
    "candidate_count": 1
  }
}

dataframe_records is a JSON-serialized Pandas Dataframe in the records orientation.

{
  "dataframe_records": [
    {
      "prompt": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instructions:\nWhat is Apache Spark?\n\n### Response:\n"
    }
  ],
  "params": {
    "temperature": 0.5,
    "max_tokens": 100,
    "stop": ["word1", "word2"],
    "candidate_count": 1
  }
}

instances

{
  "instances": [
    {
      "prompt": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instructions:\nWhat is Apache Spark?\n\n### Response:\n"
    }
  ],
  "params": {
    "temperature": 0.5,
    "max_tokens": 100,
    "stop": ["word1", "word2"],
    "candidate_count": 1
  }
}

inputs

{
  "inputs": {
    "prompt": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instructions:\nWhat is Apache Spark?\n\n### Response:\n"
  },
  "params": {
    "temperature": 0.5,
    "max_tokens": 100,
    "stop": ["word1", "word2"],
    "candidate_count": 1
  }
}
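
The four formats carry the same prompts and parameters and differ only in their envelope. The sketch below builds each one from a list of prompts. The helper `build_payload` is hypothetical, and passing a list of prompts under `inputs` is an assumption about what the endpoint accepts for batches.

```python
import json
from typing import Any, Dict, List


def build_payload(prompts: List[str], fmt: str, **params: Any) -> Dict[str, Any]:
    """Wrap prompts in one of the four supported request formats."""
    if fmt == "dataframe_split":
        body: Dict[str, Any] = {
            "dataframe_split": {
                "columns": ["prompt"],
                "index": list(range(len(prompts))),
                "data": [[prompt] for prompt in prompts],
            }
        }
    elif fmt == "dataframe_records":
        body = {"dataframe_records": [{"prompt": prompt} for prompt in prompts]}
    elif fmt == "instances":
        body = {"instances": [{"prompt": prompt} for prompt in prompts]}
    elif fmt == "inputs":
        # Assumption: a list of prompts is accepted under "inputs".
        body = {"inputs": {"prompt": list(prompts)}}
    else:
        raise ValueError(f"unsupported format: {fmt!r}")
    if params:
        # Generation parameters such as temperature, max_tokens, and stop
        body["params"] = dict(params)
    return body


payload = build_payload(
    ["What is Apache Spark?"], "dataframe_records", temperature=0.5, max_tokens=100
)
print(json.dumps(payload, indent=2))
```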

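Query your endpoint

After your endpoint is ready, you can query it by making an API request. Depending on the model size and complexity, it can take 30 minutes or more for the endpoint to become ready.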


data = {
    "inputs": {
        "prompt": [
            "Hello, I'm a language model,"
        ]
    },
    "params": {
        "max_tokens": 100,
        "temperature": 0.0
    }
}

headers = {"Content-Type": "application/json", "Authorization": f"Bearer {API_TOKEN}"}

response = requests.post(
    url=f"{API_ROOT}/serving-endpoints/{endpoint_name}/invocations", json=data, headers=headers
)

print(json.dumps(response.json()))

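Limitations

Because models served on GPU have more installation requirements, creating a container image for GPU serving takes longer than creating one for CPU serving.
- Model size also affects image creation. For example, models with 30 billion or more parameters can take at least an hour to build.
- Databricks reuses the same container the next time the same version of the model is deployed, so subsequent deployments take less time.
Autoscaling for GPU serving takes longer than for CPU serving because models served on GPU compute have a longer setup time. Databricks recommends over-provisioning to avoid request timeouts.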


Last updated on 2025-05-09