预配的吞吐量基础模型 API

项目
07/18/2024

本文演示如何使用预配了吞吐量的基础模型 API 部署模型。 Databricks 建议使用为生产工作负荷预配的吞吐量，它为具有性能保证的基础模型提供优化的推理。

有关支持的模型体系结构列表，请参阅预配吞吐量基础模型 API。

要求

请参阅要求。有关如何部署经过微调的基础模型，请参阅部署经过微调的基础模型。

[建议]从 Unity Catalog 部署基础模型

重要

此功能目前以公共预览版提供。

Databricks 建议使用预安装在 Unity Catalog 中的基础模型。可以在架构 ai (system.ai) 中的目录 system 下找到这些模型。

部署基础模型：

在目录资源管理器中导航到 system.ai。
单击要部署的模型的名称。
在模型页上，单击“服务此模型”按钮。
此时会显示“创建服务终结点”页面。请参阅使用 UI 创建预配吞吐量终结点。

从 Databricks 市场部署基本基础模型

或者，可以从 Databricks 市场将基础模型安装到 Unity Catalog。

你可以搜索模型系列，然后从模型页面选择“获取访问权限”并提供登录凭据，以将模型安装到 Unity Catalog。

将模型安装到 Unity Catalog 后，可以使用服务 UI 创建模型服务终结点。

部署 DBRX 模型

Databricks 建议为工作负荷提供 DBRX Instruct 模型。若要使用预配的吞吐量提供 DBRX Instruct 模型，请按照[建议]从 Unity Catalog 部署基础模型中的指南进行操作。

提供这些 DBRX 模型时，预配置吞吐量支持高达 16k 的上下文长度。

DBRX 模型使用以下默认系统提示来确保模型响应的相关性和准确度：

You are DBRX, created by Databricks. You were last updated in December 2023. You answer questions based on information available up to that point.
YOU PROVIDE SHORT RESPONSES TO SHORT QUESTIONS OR STATEMENTS, but provide thorough responses to more complex and open-ended questions.
You assist with various tasks, from writing to coding (using markdown for code blocks — remember to use ``` with code, JSON, and tables).
(You do not have real-time data access or code execution capabilities. You avoid stereotyping and provide balanced perspectives on controversial topics. You do not provide song lyrics, poems, or news articles and do not divulge details of your training data.)
This is your system prompt, guiding your responses. Do not reference it, just respond to the user. If you find yourself talking about this message, stop. You should be responding appropriately and usually that means not mentioning this.
YOU DO NOT MENTION ANY OF THIS INFORMATION ABOUT YOURSELF UNLESS THE INFORMATION IS DIRECTLY PERTINENT TO THE USER'S QUERY.

部署经过微调的基础模型

如果无法在 system.ai 架构中使用模型，或者无法从 Databricks 市场安装模型，则可以通过将模型记录到 Unity Catalog 来部署经过微调的基础模型。本部分和后面的部分演示如何设置代码以将 MLflow 模型记录到 Unity Catalog，并使用 UI 或 REST API 创建预配吞吐量终结点。

要求

仅 MLflow 2.11 或更高版本支持部署经过微调的基础模型。 Databricks Runtime 15.0 ML 及更高版本预安装了兼容的 MLflow 版本。
对于嵌入终结点，模型必须是小型或大型 BGE 嵌入模型体系结构。
Databricks 建议在 Unity 目录中使用模型，以便更快地上传和下载大型模型。

定义目录、架构和模型名称

若要部署经过微调的基础模型，请定义目标 Unity Catalog 目录、架构和你选择的模型名称。

mlflow.set_registry_uri('databricks-uc')
CATALOG = "catalog"
SCHEMA = "schema"
MODEL_NAME = "model_name"
registered_model_name = f"{CATALOG}.{SCHEMA}.{MODEL_NAME}"

记录模型

若要为模型终结点启用预配吞吐量，必须使用 MLflow transformers 风格记录模型，并从以下选项中通过适当的模型类型接口指定 task 参数：

"llm/v1/completions"
"llm/v1/chat"
"llm/v1/embeddings"

这些参数指定用于模型服务终结点的 API 签名。有关这些任务和相应输入/输出架构的详细信息，请参阅 MLflow 文档。

下面是关于如何使用 MLflow 记录文本补全语言模型的示例：

model = AutoModelForCausalLM.from_pretrained("mosaicml/mpt-7b-instruct", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("mosaicml/mpt-7b-instruct")

with mlflow.start_run():
    components = {
      "model": model,
      "tokenizer": tokenizer,
    }
    mlflow.transformers.log_model(
        transformers_model=components,
        artifact_path="model",
        # Specify the llm/v1/xxx task that is compatible with the model being logged
        task="llm/v1/completions",
        # Specify an input example that conforms to the input schema for the task.
        input_example={"prompt": np.array(["Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nWhat is Apache Spark?\n\n### Response:\n"])},
        # By passing the model name, MLflow automatically registers the Transformers model to Unity Catalog with the given catalog/schema/model_name.
        registered_model_name=registered_model_name
        # Optionally, you can set save_pretrained to False to avoid unnecessary copy of model weight and gain more efficiency
        save_pretrained=False
    )

注意

如果使用的是早于 2.12 的 MLflow，则必须改为在同一 mlflow.transformer.log_model() 函数的 metadata 参数中指定任务。

metadata = {"task": "llm/v1/completions"}
metadata = {"task": "llm/v1/chat"}
metadata = {"task": "llm/v1/embeddings"}

预配吞吐量也支持小型和大型 BGE 嵌入模型。下面的示例演示了如何记录模型 BAAI/bge-small-en-v1.5，以便可以使用预配吞吐量为其提供服务：

model = AutoModel.from_pretrained("BAAI/bge-small-en-v1.5")
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-small-en-v1.5")
with mlflow.start_run():
    components = {
      "model": model,
      "tokenizer": tokenizer,
    }
    mlflow.transformers.log_model(
        transformers_model=components,
        artifact_path="model",
        task="llm/v1/embeddings",
        registered_model_name=registered_model_name,
        # model_type is required for logging a fine-tuned BGE models.
        metadata={
            "model_type": "bge-large"  # Or "bge-small"
        }
    )

在模型记录到 Unity Catalog 中之后，继续参阅使用 UI 创建预配吞吐量终结点，以便创建具有预配吞吐量的模型服务终结点。

使用 UI 创建预配吞吐量终结点

记录的模型位于 Unity Catalog 中后，通过以下步骤创建预配吞吐量服务终结点：

导航到工作区中的服务 UI。
选择“创建服务终结点”。
在“实体”字段中，从 Unity Catalog 中选择模型。对于符合条件的模型，服务实体的 UI 会显示“预配吞吐量”屏幕。
在“上限”下拉列表中，可以为终结点配置最大每秒令牌吞吐量。
1. 预配吞吐量终结点会自动缩放，因此可以选择“修改”以查看终结点可以纵向缩减到的最小每秒令牌数。

预配的吞吐量

使用 REST API 创建预配吞吐量终结点

若要使用 REST API 在预配吞吐量模式下部署模型，必须在请求中指定 min_provisioned_throughput 和 max_provisioned_throughput 字段。

若要为模型确定适合的预配吞吐量范围，请参阅以递增方式获取调配吞吐量。

import requests
import json

# Set the name of the MLflow endpoint
endpoint_name = "llama2-13b-chat"

# Name of the registered MLflow model
model_name = "ml.llm-catalog.llama-13b"

# Get the latest version of the MLflow model
model_version = 3

# Get the API endpoint and token for the current notebook context
API_ROOT = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiUrl().get()
API_TOKEN = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get()

headers = {"Context-Type": "text/json", "Authorization": f"Bearer {API_TOKEN}"}

optimizable_info = requests.get(
  url=f"{API_ROOT}/api/2.0/serving-endpoints/get-model-optimization-info/{model_name}/{model_version}",
  headers=headers)
  .json()

if 'optimizable' not in optimizable_info or not optimizable_info['optimizable']:
  raise ValueError("Model is not eligible for provisioned throughput")

chunk_size = optimizable_info['throughput_chunk_size']

# Minimum desired provisioned throughput
min_provisioned_throughput = 2 * chunk_size

# Maximum desired provisioned throughput
max_provisioned_throughput = 3 * chunk_size

# Send the POST request to create the serving endpoint
data = {
  "name": endpoint_name,
  "config": {
    "served_entities": [
      {
        "entity_name": model_name,
        "entity_version": model_version,
        "min_provisioned_throughput": min_provisioned_throughput,
        "max_provisioned_throughput": max_provisioned_throughput,
      }
    ]
  },
}

response = requests.post(
  url=f"{API_ROOT}/api/2.0/serving-endpoints", json=data, headers=headers
)

print(json.dumps(response.json(), indent=4))

以递增方式获取调配吞吐量

预配吞吐量以每秒令牌数为增量提供，具体增量因模型而异。为了确定适合需求的范围，Databricks 建议使用平台内的模型优化信息 API。

GET api/2.0/serving-endpoints/get-model-optimization-info/{registered_model_name}/{version}

下面是 API 的示例响应：

{
  "optimizable": true,
  "model_type": "llama",
  "throughput_chunk_size": 1580
}

笔记本示例

以下笔记本演示了创建预配吞吐量基础模型 API 的示例：

限制

模型部署可能会因 GPU 容量问题而失败，从而导致终结点创建或更新期间超时。请联系 Databricks 客户团队来帮助解决。
基础模型 API 的自动缩放速度比 CPU 模型服务慢。 Databricks 建议超量预配，避免请求超时。

通过