How to use Mistral-7B and Mixtral chat models with Azure AI Foundry

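Mistral-7B and Mixtral chat models

The Mistral-7B and Mixtral chat models include the following models:

The Mistral-7B-Instruct large language model (LLM) is an instruction-tuned version of Mistral-7B, a transformer model with the following architecture choices:

Grouped-Query Attention
Sliding-Window Attention
Byte-fallback BPE tokenizer

The following models are available:

Tip

MistralAI also supports a tailored API for use with specific features of the model. To use the model provider-specific API, check the MistralAI documentation or see the inference examples section for code examples.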

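Prerequisites

To use Mistral-7B and Mixtral chat models with Azure AI Foundry, you need the following prerequisites:

A model deployment

Deployment to a self-hosted managed compute

Mistral-7B and Mixtral chat models can be deployed to our self-hosted managed inference solution, which lets you customize and control all the details of how the model is served.

To deploy to a self-hosted managed compute, your subscription must have enough quota. If you don't have enough quota available, you can use our temporary quota access by selecting the option "I want to use shared quota and I acknowledge that this endpoint will be deleted in 168 hours." (168 hours is 7 days.)

Deploy the model to managed compute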

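The inference package installed

You can consume predictions from this model by using the azure-ai-inference package with Python. To install this package, you need the following prerequisites:

Python 3.8 or later installed, including pip.
The endpoint URL. To construct the client library, you need to pass in the endpoint URL. The endpoint URL has the form https://your-host-name.your-azure-region.inference.ai.azure.com, where your-host-name is your unique model deployment host name and your-azure-region is the Azure region where the model is deployed (for example, eastus2).
Depending on your model deployment and authentication preference, you need either a key to authenticate against the service, or Microsoft Entra ID credentials. The key is a 32-character string.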
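
As a quick illustration of the URL pattern above, the following sketch assembles an endpoint URL from a host name and a region and loosely checks that a URL follows the documented form. The helper names build_endpoint_url and looks_like_endpoint_url and the host name my-deployment are hypothetical, not part of any Azure SDK. In practice you would normally copy the URL from your deployment's details.

```python
from urllib.parse import urlparse


def build_endpoint_url(host_name: str, azure_region: str) -> str:
    """Assemble an endpoint URL in the documented form (illustrative helper)."""
    if not host_name or not azure_region:
        raise ValueError("host_name and azure_region must be non-empty")
    return f"https://{host_name}.{azure_region}.inference.ai.azure.com"


def looks_like_endpoint_url(url: str) -> bool:
    """Loosely check the scheme and host suffix; this is not an official validator."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    # Expect <host-name>.<region>.inference.ai.azure.com, so at least five dots.
    return (
        parsed.scheme == "https"
        and host.endswith(".inference.ai.azure.com")
        and host.count(".") >= 5
    )


# "my-deployment" is a made-up host name used only for this example.
url = build_endpoint_url("my-deployment", "eastus2")
print(url)
print(looks_like_endpoint_url(url))
```

A check like this can catch a pasted URL with the wrong scheme or a missing region before the client is constructed.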

Once you have these prerequisites, install the Azure AI inference package with the following command:

pip install azure-ai-inference

Read more about the Azure AI inference package and reference.

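Work with chat completions

In this section, you use the Azure AI Foundry Models API with a chat completions model for chat.

Tip

The Foundry Models API lets you talk to most models deployed in the Azure AI Foundry portal, including Mistral-7B and Mixtral chat models, with the same code and structure.

Create a client to consume the model

First, create the client to consume the model. The following code uses an endpoint URL and key that are stored in environment variables.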

import os
from azure.ai.inference import ChatCompletionsClient
from azure.core.credentials import AzureKeyCredential

client = ChatCompletionsClient(
    endpoint=os.environ["AZURE_INFERENCE_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["AZURE_INFERENCE_CREDENTIAL"]),
)

When you deploy the model to a self-hosted online endpoint with Microsoft Entra ID support, you can use the following code snippet to create a client.

import os
from azure.ai.inference import ChatCompletionsClient
from azure.identity import DefaultAzureCredential

client = ChatCompletionsClient(
    endpoint=os.environ["AZURE_INFERENCE_ENDPOINT"],
    credential=DefaultAzureCredential(),
)

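Get the model's capabilities

The /info route returns information about the model that is deployed to the endpoint. Return the model's information by calling the following method: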

model_info = client.get_model_info()

The response is as follows:

print("Model name:", model_info.model_name)
print("Model type:", model_info.model_type)
print("Model provider name:", model_info.model_provider_name)

Model name: mistralai-Mistral-7B-Instruct-v01
Model type: chat-completions
Model provider name: MistralAI

Create a chat completion request

The following example shows how to create a basic chat completions request to the model.

from azure.ai.inference.models import SystemMessage, UserMessage

response = client.complete(
    messages=[
        SystemMessage(content="You are a helpful assistant."),
        UserMessage(content="How many languages are in the world?"),
    ],
)

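Note

mistralai-Mistral-7B-Instruct-v01, mistralai-Mistral-7B-Instruct-v02, and mistralai-Mixtral-8x22B-Instruct-v0-1 don't support system messages (role="system"). When you use the Foundry Models API, system messages are translated to user messages, which is the closest capability available. This translation is offered for convenience, but you should verify that the model follows the instructions in the system message with the right level of confidence.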
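
To make the effect of this translation easier to reason about, the sketch below folds system instructions into the first user message on the client side. This is only one plausible approach, shown under the assumption that plain role/content dictionaries are used. How the service performs its own conversion is not described here, and fold_system_into_user is a hypothetical helper name.

```python
from typing import Dict, List

Message = Dict[str, str]


def fold_system_into_user(messages: List[Message]) -> List[Message]:
    """Return a copy of messages with system content prepended to the first user message."""
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    rest = [dict(m) for m in messages if m["role"] != "system"]
    if not system_parts:
        return rest
    preamble = "\n".join(system_parts)
    for message in rest:
        if message["role"] == "user":
            message["content"] = f"{preamble}\n\n{message['content']}"
            return rest
    # No user message to attach to: send the instructions as a user message.
    return [{"role": "user", "content": preamble}] + rest


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "How many languages are in the world?"},
]
folded = fold_system_into_user(messages)
print(folded[0]["content"])
```

Sending the folded messages yourself makes it explicit what the model receives, which can help when you verify that the instructions are actually followed.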

The response is as follows, where you can see the model's usage statistics:

print("Response:", response.choices[0].message.content)
print("Model:", response.model)
print("Usage:")
print("\tPrompt tokens:", response.usage.prompt_tokens)
print("\tTotal tokens:", response.usage.total_tokens)
print("\tCompletion tokens:", response.usage.completion_tokens)

Response: As of now, it's estimated that there are about 7,000 languages spoken around the world. However, this number can vary as some languages become extinct and new ones develop. It's also important to note that the number of speakers can greatly vary between languages, with some having millions of speakers and others only a few hundred.
Model: mistralai-Mistral-7B-Instruct-v01
Usage: 
  Prompt tokens: 19
  Total tokens: 91
  Completion tokens: 72

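Inspect the usage section in the response to see the number of tokens used for the prompt, the total number of tokens, and the number of tokens used for the completion.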
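
The sample numbers above can be checked with a few lines of arithmetic. The sketch below assumes that the total equals prompt tokens plus completion tokens, which holds for the sample response (19 + 72 = 91) but is an assumption rather than a documented guarantee. summarize_usage is a hypothetical helper that works on a plain dictionary.

```python
def summarize_usage(usage: dict) -> str:
    """Summarize token usage and check that the parts add up to the total."""
    prompt = usage["prompt_tokens"]
    completion = usage["completion_tokens"]
    total = usage["total_tokens"]
    # Assumption: total_tokens is the sum of prompt and completion tokens.
    if prompt + completion != total:
        raise ValueError(f"unexpected totals: {prompt} + {completion} != {total}")
    share = completion / total
    return f"prompt={prompt} completion={completion} total={total} completion_share={share:.0%}"


# Values copied from the sample response above.
sample_usage = {"prompt_tokens": 19, "total_tokens": 91, "completion_tokens": 72}
print(summarize_usage(sample_usage))
```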

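Stream content

By default, the completions API returns the entire generated content in a single response. If you're generating long completions, waiting for the response can take several seconds.

You can stream the content to get it as it's being generated. Streaming lets you start processing the completion as content becomes available. This mode returns an object that streams back the response as data-only server-sent events. Extract chunks from the delta field, rather than the message field.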

result = client.complete(
    messages=[
        SystemMessage(content="You are a helpful assistant."),
        UserMessage(content="How many languages are in the world?"),
    ],
    temperature=0,
    top_p=1,
    max_tokens=2048,
    stream=True,
)

To stream completions, set stream=True when you call the model.

To visualize the output, define a helper function to print the stream.

def print_stream(result):
    """
    Prints the chat completion with streaming.
    """
    for update in result:
        if update.choices:
            # delta.content can be None on some chunks, so fall back to ""
            print(update.choices[0].delta.content or "", end="")

You can visualize how streaming generates content:

print_stream(result)
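
If you want the full text as well as the live output, a small variation on the helper can accumulate the chunks. The sketch below relies only on the choices[0].delta.content shape used above. fake_update builds stand-in objects so the example runs without a deployed endpoint, and both helper names are hypothetical rather than part of azure-ai-inference.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(updates: Iterable) -> str:
    """Concatenate delta.content across streamed updates, skipping empty chunks."""
    parts = []
    for update in updates:
        if not update.choices:
            continue
        content = update.choices[0].delta.content
        if content:
            parts.append(content)
    return "".join(parts)


def fake_update(text):
    """Build a stand-in update object for this example only."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [
    fake_update("About "),
    fake_update(None),             # a chunk without content
    SimpleNamespace(choices=[]),   # a chunk without choices
    fake_update("7,000."),
]
print(collect_stream(fake_stream))
```

With a real client, you would pass the object returned by the streaming call in place of fake_stream.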

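Explore more parameters supported by the inference client

Explore other parameters that you can specify in the inference client. For a full list of all the supported parameters and their corresponding documentation, see the Foundry Models API reference.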

from azure.ai.inference.models import ChatCompletionsResponseFormatText

response = client.complete(
    messages=[
        SystemMessage(content="You are a helpful assistant."),
        UserMessage(content="How many languages are in the world?"),
    ],
    presence_penalty=0.1,
    frequency_penalty=0.8,
    max_tokens=2048,
    stop=["<|endoftext|>"],
    temperature=0,
    top_p=1,
    response_format={ "type": ChatCompletionsResponseFormatText() },
)

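Warning

Mistral models don't support JSON output formatting (response_format = { "type": "json_object" }). You can always prompt the model to generate JSON outputs. However, such outputs are not guaranteed to be valid JSON.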
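
Because prompted JSON is not guaranteed to be valid, it is worth validating the reply before using it. The following sketch shows one defensive approach: it tries to parse the whole reply and then falls back to the outermost braces. parse_json_reply is a hypothetical helper, and the fallback heuristic is an assumption that will not suit every reply format.

```python
import json
from typing import Any, Optional


def parse_json_reply(text: str) -> Optional[Any]:
    """Return the parsed JSON value, or None when no valid JSON can be recovered."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Fallback: the model may have wrapped the object in extra prose.
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        try:
            return json.loads(text[start:end + 1])
        except json.JSONDecodeError:
            return None
    return None


print(parse_json_reply('{"count": 7000}'))
print(parse_json_reply('Sure! {"count": 7000} Hope that helps.'))
print(parse_json_reply("about 7,000"))
```

When the helper returns None, you might retry the request with a stricter prompt or report the failure to the caller.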

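If you want to pass a parameter that isn't in the list of supported parameters, you can pass it to the underlying model using extra parameters. See Pass extra parameters to the model.

Pass extra parameters to the model

The Foundry Models API lets you pass extra parameters to the model. The following code example shows how to pass the extra parameter logprobs to the model.

Before you pass extra parameters to the Foundry Models API, make sure your model supports them. When the request is made to the underlying model, the header extra-parameters is passed to the model with the value pass-through. This value tells the endpoint to pass the extra parameters to the model. Using extra parameters doesn't guarantee that the model can actually handle them. Read the model's documentation to understand which extra parameters are supported.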

response = client.complete(
    messages=[
        SystemMessage(content="You are a helpful assistant."),
        UserMessage(content="How many languages are in the world?"),
    ],
    model_extras={
        "logprobs": True
    }
)

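The following extra parameters can be passed to Mistral-7B and Mixtral chat models:

Name	Description	Type
`logit_bias`	Accepts a JSON object that maps tokens (specified by their token ID in the tokenizer) to an associated bias value from -100 to 100. Mathematically, the bias is added to the logits generated by the model prior to sampling. The exact effect varies per model, but values between -1 and 1 should decrease or increase the likelihood of selection; values like -100 or 100 should result in a ban or exclusive selection of the relevant token.	`object`
`logprobs`	Whether to return log probabilities of the output tokens. If true, returns the log probabilities of each output token returned in the `content` of `message`.	`bool`
`top_logprobs`	An integer between 0 and 20 specifying the number of most likely tokens to return at each token position, each with an associated log probability. `logprobs` must be set to `true` if this parameter is used.	`int`
`n`	How many chat completion choices to generate for each input message. You're charged based on the number of generated tokens across all of the choices.	`int`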
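
The logit_bias description can be made concrete with a toy calculation. The sketch below adds the bias to a three-token logit vector and applies a softmax, which is a simplified stand-in for the model's sampling step. The vocabulary, logit values, and function names are invented for illustration, and integer token IDs are used as keys for simplicity.

```python
import math
from typing import Dict, List


def apply_logit_bias(logits: List[float], logit_bias: Dict[int, float]) -> List[float]:
    """Add each bias to the logit of its token ID, as described for logit_bias."""
    biased = list(logits)
    for token_id, bias in logit_bias.items():
        if not -100 <= bias <= 100:
            raise ValueError("bias values must be between -100 and 100")
        biased[token_id] += bias
    return biased


def softmax(logits: List[float]) -> List[float]:
    """Convert logits to probabilities, subtracting the maximum for stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # toy vocabulary of three tokens
# Effectively ban token 0 and mildly boost token 2.
probs = softmax(apply_logit_bias(logits, {0: -100, 2: 1.0}))
print([round(p, 4) for p in probs])
```

A bias of -100 drives the probability of token 0 to practically zero, while the bias of 1 moves token 2 ahead of token 1.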

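The inference package installed

You can consume predictions from this model by using the @azure-rest/ai-inference package from npm. To install this package, you need the following prerequisites:

LTS versions of Node.js with npm.
The endpoint URL. To construct the client library, you need to pass in the endpoint URL. The endpoint URL has the form https://your-host-name.your-azure-region.inference.ai.azure.com, where your-host-name is your unique model deployment host name and your-azure-region is the Azure region where the model is deployed (for example, eastus2).
Depending on your model deployment and authentication preference, you need either a key to authenticate against the service, or Microsoft Entra ID credentials. The key is a 32-character string.

Once you have these prerequisites, install the Azure Inference library for JavaScript with the following command: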

npm install @azure-rest/ai-inference

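Work with chat completions

In this section, you use the Azure AI Foundry Models API with a chat completions model for chat.

Tip

The Foundry Models API lets you talk to most models deployed in the Azure AI Foundry portal, including Mistral-7B and Mixtral chat models, with the same code and structure.

Create a client to consume the model

First, create the client to consume the model. The following code uses an endpoint URL and key that are stored in environment variables.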

import ModelClient from "@azure-rest/ai-inference";
import { isUnexpected } from "@azure-rest/ai-inference";
import { AzureKeyCredential } from "@azure/core-auth";

const client = new ModelClient(
    process.env.AZURE_INFERENCE_ENDPOINT, 
    new AzureKeyCredential(process.env.AZURE_INFERENCE_CREDENTIAL)
);

When you deploy the model to a self-hosted online endpoint with Microsoft Entra ID support, you can use the following code snippet to create a client.

import ModelClient from "@azure-rest/ai-inference";
import { isUnexpected } from "@azure-rest/ai-inference";
import { DefaultAzureCredential }  from "@azure/identity";

const client = new ModelClient(
    process.env.AZURE_INFERENCE_ENDPOINT, 
    new DefaultAzureCredential()
);

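Get the model's capabilities

The /info route returns information about the model that is deployed to the endpoint. Return the model's information by calling the following method: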

var model_info = await client.path("/info").get()

The response is as follows:

console.log("Model name: ", model_info.body.model_name)
console.log("Model type: ", model_info.body.model_type)
console.log("Model provider name: ", model_info.body.model_provider_name)

Model name: mistralai-Mistral-7B-Instruct-v01
Model type: chat-completions
Model provider name: MistralAI

Create a chat completion request

The following example shows how to create a basic chat completions request to the model.

var messages = [
    { role: "system", content: "You are a helpful assistant" },
    { role: "user", content: "How many languages are in the world?" },
];

var response = await client.path("/chat/completions").post({
    body: {
        messages: messages,
    }
});

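Note

mistralai-Mistral-7B-Instruct-v01, mistralai-Mistral-7B-Instruct-v02, and mistralai-Mixtral-8x22B-Instruct-v0-1 don't support system messages (role="system"). When you use the Foundry Models API, system messages are translated to user messages, which is the closest capability available. This translation is offered for convenience, but you should verify that the model follows the instructions in the system message with the right level of confidence.

The response is as follows, where you can see the model's usage statistics: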

if (isUnexpected(response)) {
    throw response.body.error;
}

console.log("Response: ", response.body.choices[0].message.content);
console.log("Model: ", response.body.model);
console.log("Usage:");
console.log("\tPrompt tokens:", response.body.usage.prompt_tokens);
console.log("\tTotal tokens:", response.body.usage.total_tokens);
console.log("\tCompletion tokens:", response.body.usage.completion_tokens);

Response: As of now, it's estimated that there are about 7,000 languages spoken around the world. However, this number can vary as some languages become extinct and new ones develop. It's also important to note that the number of speakers can greatly vary between languages, with some having millions of speakers and others only a few hundred.
Model: mistralai-Mistral-7B-Instruct-v01
Usage: 
  Prompt tokens: 19
  Total tokens: 91
  Completion tokens: 72

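Inspect the usage section in the response to see the number of tokens used for the prompt, the total number of tokens, and the number of tokens used for the completion.

Stream content

By default, the completions API returns the entire generated content in a single response. If you're generating long completions, waiting for the response can take several seconds.

You can stream the content to get it as it's being generated. Streaming lets you start processing the completion as content becomes available. This mode returns an object that streams back the response as data-only server-sent events. Extract chunks from the delta field, rather than the message field.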

var messages = [
    { role: "system", content: "You are a helpful assistant" },
    { role: "user", content: "How many languages are in the world?" },
];

var response = await client.path("/chat/completions").post({
    body: {
        messages: messages,
        // Ask the service to send the reply as server-sent events
        stream: true,
    }
}).asNodeStream();

To stream completions, use .asNodeStream() when you call the model.

You can visualize how streaming generates content:

import { createSseStream } from "@azure/core-sse";

var stream = response.body;
if (!stream) {
    // There is no stream to destroy here, so only report the failure
    throw new Error(`Failed to get chat completions with status: ${response.status}`);
}

if (response.status !== "200") {
    stream.destroy();
    throw new Error(`Failed to get chat completions with status: ${response.status}`);
}

var sses = createSseStream(stream);

for await (const event of sses) {
    if (event.data === "[DONE]") {
        break;
    }
    for (const choice of (JSON.parse(event.data)).choices) {
        // Write chunks without a newline so the text reads continuously
        process.stdout.write(choice.delta?.content ?? "");
    }
}

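Explore more parameters supported by the inference client

Explore other parameters that you can specify in the inference client. For a full list of all the supported parameters and their corresponding documentation, see the Foundry Models API reference.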

var messages = [
    { role: "system", content: "You are a helpful assistant" },
    { role: "user", content: "How many languages are in the world?" },
];

var response = await client.path("/chat/completions").post({
    body: {
        messages: messages,
        presence_penalty: 0.1,
        frequency_penalty: 0.8,
        max_tokens: 2048,
        stop: ["<|endoftext|>"],
        temperature: 0,
        top_p: 1,
        response_format: { type: "text" },
    }
});

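Warning

Mistral models don't support JSON output formatting (response_format = { "type": "json_object" }). You can always prompt the model to generate JSON outputs. However, such outputs are not guaranteed to be valid JSON.

If you want to pass a parameter that isn't in the list of supported parameters, you can pass it to the underlying model using extra parameters. See Pass extra parameters to the model.

Pass extra parameters to the model

The Foundry Models API lets you pass extra parameters to the model. The following code example shows how to pass the extra parameter logprobs to the model.

Before you pass extra parameters to the Foundry Models API, make sure your model supports them. When the request is made to the underlying model, the header extra-parameters is passed to the model with the value pass-through. This value tells the endpoint to pass the extra parameters to the model. Using extra parameters doesn't guarantee that the model can actually handle them. Read the model's documentation to understand which extra parameters are supported.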

var messages = [
    { role: "system", content: "You are a helpful assistant" },
    { role: "user", content: "How many languages are in the world?" },
];

var response = await client.path("/chat/completions").post({
    headers: {
        "extra-parameters": "pass-through"
    },
    body: {
        messages: messages,
        logprobs: true
    }
});

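The inference package installed

You can consume predictions from this model by using the Azure.AI.Inference package from NuGet. To install this package, you need the following prerequisites:

The endpoint URL. To construct the client library, you need to pass in the endpoint URL. The endpoint URL has the form https://your-host-name.your-azure-region.inference.ai.azure.com, where your-host-name is your unique model deployment host name and your-azure-region is the Azure region where the model is deployed (for example, eastus2).
Depending on your model deployment and authentication preference, you need either a key to authenticate against the service, or Microsoft Entra ID credentials. The key is a 32-character string.

Once you have these prerequisites, install the Azure AI inference library with the following command: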

dotnet add package Azure.AI.Inference --prerelease

You can also authenticate with Microsoft Entra ID (formerly Azure Active Directory). To use the credential providers that come with the Azure SDK, install the Azure.Identity package:

dotnet add package Azure.Identity

Import the following namespaces:

using Azure;
using Azure.Identity;
using Azure.AI.Inference;

This example also uses the following namespaces, but you may not always need them:

using System.Text.Json;
using System.Text.Json.Serialization;
using System.Reflection;

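Work with chat completions

In this section, you use the Azure AI Foundry Models API with a chat completions model for chat.

Tip

The Foundry Models API lets you talk to most models deployed in the Azure AI Foundry portal, including Mistral-7B and Mixtral chat models, with the same code and structure.

Create a client to consume the model

First, create the client to consume the model. The following code uses an endpoint URL and key that are stored in environment variables.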

ChatCompletionsClient client = new ChatCompletionsClient(
    new Uri(Environment.GetEnvironmentVariable("AZURE_INFERENCE_ENDPOINT")),
    new AzureKeyCredential(Environment.GetEnvironmentVariable("AZURE_INFERENCE_CREDENTIAL"))
);

When you deploy the model to a self-hosted online endpoint with Microsoft Entra ID support, you can use the following code snippet to create a client.

client = new ChatCompletionsClient(
    new Uri(Environment.GetEnvironmentVariable("AZURE_INFERENCE_ENDPOINT")),
    new DefaultAzureCredential(includeInteractiveCredentials: true)
);

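Get the model's capabilities

The /info route returns information about the model that is deployed to the endpoint. Return the model's information by calling the following method: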

Response<ModelInfo> modelInfo = client.GetModelInfo();

The response is as follows:

Console.WriteLine($"Model name: {modelInfo.Value.ModelName}");
Console.WriteLine($"Model type: {modelInfo.Value.ModelType}");
Console.WriteLine($"Model provider name: {modelInfo.Value.ModelProviderName}");

Model name: mistralai-Mistral-7B-Instruct-v01
Model type: chat-completions
Model provider name: MistralAI

Create a chat completion request

The following example shows how to create a basic chat completions request to the model.

ChatCompletionsOptions requestOptions = new ChatCompletionsOptions()
{
    Messages = {
        new ChatRequestSystemMessage("You are a helpful assistant."),
        new ChatRequestUserMessage("How many languages are in the world?")
    },
};

Response<ChatCompletions> response = client.Complete(requestOptions);

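Note

mistralai-Mistral-7B-Instruct-v01, mistralai-Mistral-7B-Instruct-v02, and mistralai-Mixtral-8x22B-Instruct-v0-1 don't support system messages (role="system"). When you use the Foundry Models API, system messages are translated to user messages, which is the closest capability available. This translation is offered for convenience, but you should verify that the model follows the instructions in the system message with the right level of confidence.

The response is as follows, where you can see the model's usage statistics: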

Console.WriteLine($"Response: {response.Value.Choices[0].Message.Content}");
Console.WriteLine($"Model: {response.Value.Model}");
Console.WriteLine("Usage:");
Console.WriteLine($"\tPrompt tokens: {response.Value.Usage.PromptTokens}");
Console.WriteLine($"\tTotal tokens: {response.Value.Usage.TotalTokens}");
Console.WriteLine($"\tCompletion tokens: {response.Value.Usage.CompletionTokens}");

Response: As of now, it's estimated that there are about 7,000 languages spoken around the world. However, this number can vary as some languages become extinct and new ones develop. It's also important to note that the number of speakers can greatly vary between languages, with some having millions of speakers and others only a few hundred.
Model: mistralai-Mistral-7B-Instruct-v01
Usage: 
  Prompt tokens: 19
  Total tokens: 91
  Completion tokens: 72

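Inspect the usage section in the response to see the number of tokens used for the prompt, the total number of tokens, and the number of tokens used for the completion.

Stream content

By default, the completions API returns the entire generated content in a single response. If you're generating long completions, waiting for the response can take several seconds.

You can stream the content to get it as it's being generated. Streaming lets you start processing the completion as content becomes available. This mode returns an object that streams back the response as data-only server-sent events. Extract chunks from the delta field, rather than the message field.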

static async Task StreamMessageAsync(ChatCompletionsClient client)
{
    ChatCompletionsOptions requestOptions = new ChatCompletionsOptions()
    {
        Messages = {
            new ChatRequestSystemMessage("You are a helpful assistant."),
            new ChatRequestUserMessage("How many languages are in the world? Write an essay about it.")
        },
        MaxTokens=4096
    };

    StreamingResponse<StreamingChatCompletionsUpdate> streamResponse = await client.CompleteStreamingAsync(requestOptions);

    await PrintStream(streamResponse);
}

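To stream completions, use the CompleteStreamingAsync method when you call the model. Note that in this example the call is wrapped in an asynchronous method.

To visualize the output, define an asynchronous method that prints the stream to the console.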

static async Task PrintStream(StreamingResponse<StreamingChatCompletionsUpdate> response)
{
    await foreach (StreamingChatCompletionsUpdate chatUpdate in response)
    {
        if (chatUpdate.Role.HasValue)
        {
            Console.Write($"{chatUpdate.Role.Value.ToString().ToUpperInvariant()}: ");
        }
        if (!string.IsNullOrEmpty(chatUpdate.ContentUpdate))
        {
            Console.Write(chatUpdate.ContentUpdate);
        }
    }
}

You can visualize how streaming generates content:

StreamMessageAsync(client).GetAwaiter().GetResult();

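Explore more parameters supported by the inference client

Explore other parameters that you can specify in the inference client. For a full list of all the supported parameters and their corresponding documentation, see the Foundry Models API reference.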

requestOptions = new ChatCompletionsOptions()
{
    Messages = {
        new ChatRequestSystemMessage("You are a helpful assistant."),
        new ChatRequestUserMessage("How many languages are in the world?")
    },
    PresencePenalty = 0.1f,
    FrequencyPenalty = 0.8f,
    MaxTokens = 2048,
    StopSequences = { "<|endoftext|>" },
    Temperature = 0,
    NucleusSamplingFactor = 1,
    ResponseFormat = new ChatCompletionsResponseFormatText()
};

response = client.Complete(requestOptions);
Console.WriteLine($"Response: {response.Value.Choices[0].Message.Content}");

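Warning

Mistral models don't support JSON output formatting (response_format = { "type": "json_object" }). You can always prompt the model to generate JSON outputs. However, such outputs are not guaranteed to be valid JSON.

If you want to pass a parameter that isn't in the list of supported parameters, you can pass it to the underlying model using extra parameters. See Pass extra parameters to the model.

Pass extra parameters to the model

The Foundry Models API lets you pass extra parameters to the model. The following code example shows how to pass the extra parameter logprobs to the model.

Before you pass extra parameters to the Foundry Models API, make sure your model supports them. When the request is made to the underlying model, the header extra-parameters is passed to the model with the value pass-through. This value tells the endpoint to pass the extra parameters to the model. Using extra parameters doesn't guarantee that the model can actually handle them. Read the model's documentation to understand which extra parameters are supported.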

requestOptions = new ChatCompletionsOptions()
{
    Messages = {
        new ChatRequestSystemMessage("You are a helpful assistant."),
        new ChatRequestUserMessage("How many languages are in the world?")
    },
    AdditionalProperties = { { "logprobs", BinaryData.FromString("true") } },
};

response = client.Complete(requestOptions, extraParams: ExtraParameters.PassThrough);
Console.WriteLine($"Response: {response.Value.Choices[0].Message.Content}");

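A REST client

Models deployed with the Foundry Models API can be consumed using any REST client. To use the REST client, you need the following prerequisites:

To construct the requests, you need to pass in the endpoint URL. The endpoint URL has the form https://your-host-name.your-azure-region.inference.ai.azure.com, where your-host-name is your unique model deployment host name and your-azure-region is the Azure region where the model is deployed (for example, eastus2).
Depending on your model deployment and authentication preference, you need either a key to authenticate against the service, or Microsoft Entra ID credentials. The key is a 32-character string.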

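Work with chat completions

In this section, you use the Azure AI Foundry Models API with a chat completions model for chat.

Tip

The Foundry Models API lets you talk to most models deployed in the Azure AI Foundry portal, including Mistral-7B and Mixtral chat models, with the same code and structure.

Create a client to consume the model

With a REST client, there is no client object to create. Each request targets the endpoint URL and carries the key or the Microsoft Entra ID token in the Authorization header, as the requests in this section show.

Get the model's capabilities

The /info route returns information about the model that is deployed to the endpoint. Return the model's information by sending the following request: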

GET /info HTTP/1.1
Host: <ENDPOINT_URI>
Authorization: Bearer <TOKEN>
Content-Type: application/json

The response is as follows:

{
    "model_name": "mistralai-Mistral-7B-Instruct-v01",
    "model_type": "chat-completions",
    "model_provider_name": "MistralAI"
}
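
For readers who prefer to script the request, the sketch below builds the same GET /info request with Python's standard library and formats the sample response. It only constructs the request and leaves the network call as a comment, so it runs without a deployment. The helper names are hypothetical, and reading the endpoint and token from AZURE_INFERENCE_ENDPOINT and AZURE_INFERENCE_CREDENTIAL is an assumption borrowed from the SDK examples.

```python
import json
import os
import urllib.request


def build_info_request(endpoint: str, token: str) -> urllib.request.Request:
    """Build (but do not send) the GET /info request shown above."""
    return urllib.request.Request(
        endpoint.rstrip("/") + "/info",
        method="GET",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def describe_model(info: dict) -> str:
    """Format the fields returned by the /info route."""
    return f"{info['model_name']} ({info['model_type']}) by {info['model_provider_name']}"


request = build_info_request(
    os.environ.get(
        "AZURE_INFERENCE_ENDPOINT",
        "https://your-host-name.your-azure-region.inference.ai.azure.com",
    ),
    os.environ.get("AZURE_INFERENCE_CREDENTIAL", "<TOKEN>"),
)
# To send it: with urllib.request.urlopen(request) as resp: info = json.load(resp)
sample = json.loads(
    '{"model_name": "mistralai-Mistral-7B-Instruct-v01",'
    ' "model_type": "chat-completions", "model_provider_name": "MistralAI"}'
)
print(request.get_method(), request.full_url)
print(describe_model(sample))
```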

Create a chat completion request

The following example shows how to create a basic chat completions request to the model.

{
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "How many languages are in the world?"
        }
    ]
}

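Note

mistralai-Mistral-7B-Instruct-v01, mistralai-Mistral-7B-Instruct-v02, and mistralai-Mixtral-8x22B-Instruct-v0-1 don't support system messages (role="system"). When you use the Foundry Models API, system messages are translated to user messages, which is the closest capability available. This translation is offered for convenience, but you should verify that the model follows the instructions in the system message with the right level of confidence.

The response is as follows, where you can see the model's usage statistics: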

{
    "id": "0a1234b5de6789f01gh2i345j6789klm",
    "object": "chat.completion",
    "created": 1718726686,
    "model": "mistralai-Mistral-7B-Instruct-v01",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "As of now, it's estimated that there are about 7,000 languages spoken around the world. However, this number can vary as some languages become extinct and new ones develop. It's also important to note that the number of speakers can greatly vary between languages, with some having millions of speakers and others only a few hundred.",
                "tool_calls": null
            },
            "finish_reason": "stop",
            "logprobs": null
        }
    ],
    "usage": {
        "prompt_tokens": 19,
        "total_tokens": 91,
        "completion_tokens": 72
    }
}

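Inspect the usage section in the response to see the number of tokens used for the prompt, the total number of tokens, and the number of tokens used for the completion.

Stream content

By default, the completions API returns the entire generated content in a single response. If you're generating long completions, waiting for the response can take several seconds.

You can stream the content to get it as it's being generated. Streaming lets you start processing the completion as content becomes available. This mode returns an object that streams back the response as data-only server-sent events. Extract chunks from the delta field, rather than the message field.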

{
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "How many languages are in the world?"
        }
    ],
    "stream": true,
    "temperature": 0,
    "top_p": 1,
    "max_tokens": 2048
}

You can visualize how streaming generates content:

{
    "id": "23b54589eba14564ad8a2e6978775a39",
    "object": "chat.completion.chunk",
    "created": 1718726371,
    "model": "mistralai-Mistral-7B-Instruct-v01",
    "choices": [
        {
            "index": 0,
            "delta": {
                "role": "assistant",
                "content": ""
            },
            "finish_reason": null,
            "logprobs": null
        }
    ]
}

The last message in the stream has finish_reason set, indicating the reason for the generation process to stop.

{
    "id": "23b54589eba14564ad8a2e6978775a39",
    "object": "chat.completion.chunk",
    "created": 1718726371,
    "model": "mistralai-Mistral-7B-Instruct-v01",
    "choices": [
        {
            "index": 0,
            "delta": {
                "content": ""
            },
            "finish_reason": "stop",
            "logprobs": null
        }
    ],
    "usage": {
        "prompt_tokens": 19,
        "total_tokens": 91,
        "completion_tokens": 72
    }
}
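
To show how such chunks are typically consumed, the sketch below reads data-only server-sent event lines and yields the text found in each delta field. The sample lines are shortened, hand-written versions of the chunks above, and the "data:" prefix and "[DONE]" sentinel follow the usual server-sent events convention for chat completions. Treat both as assumptions to verify against your endpoint; iter_delta_content is a hypothetical helper.

```python
import json
from typing import Iterable, Iterator


def iter_delta_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield delta.content values from data-only server-sent event lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                yield content


sample_lines = [
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant", "content": ""}, "finish_reason": null}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": "About 7,000"}, "finish_reason": null}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": ""}, "finish_reason": "stop"}]}',
    "",
    "data: [DONE]",
]
print("".join(iter_delta_content(sample_lines)))
```

The first and last chunks carry empty content, so only the middle chunk contributes text.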

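Explore more parameters supported by the inference client

Explore other parameters that you can specify in the inference client. For a full list of all the supported parameters and their corresponding documentation, see the Foundry Models API reference.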

{
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "How many languages are in the world?"
        }
    ],
    "presence_penalty": 0.1,
    "frequency_penalty": 0.8,
    "max_tokens": 2048,
    "stop": ["<|endoftext|>"],
    "temperature" :0,
    "top_p": 1,
    "response_format": { "type": "text" }
}

{
    "id": "0a1234b5de6789f01gh2i345j6789klm",
    "object": "chat.completion",
    "created": 1718726686,
    "model": "mistralai-Mistral-7B-Instruct-v01",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "As of now, it's estimated that there are about 7,000 languages spoken around the world. However, this number can vary as some languages become extinct and new ones develop. It's also important to note that the number of speakers can greatly vary between languages, with some having millions of speakers and others only a few hundred.",
                "tool_calls": null
            },
            "finish_reason": "stop",
            "logprobs": null
        }
    ],
    "usage": {
        "prompt_tokens": 19,
        "total_tokens": 91,
        "completion_tokens": 72
    }
}

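Warning

Mistral models don't support JSON output formatting (response_format = { "type": "json_object" }). You can always prompt the model to generate JSON outputs. However, such outputs are not guaranteed to be valid JSON.

If you want to pass a parameter that isn't in the list of supported parameters, you can pass it to the underlying model using extra parameters. See Pass extra parameters to the model.

Pass extra parameters to the model

The Foundry Models API lets you pass extra parameters to the model. The following code example shows how to pass the extra parameter logprobs to the model.

Before you pass extra parameters to the Foundry Models API, make sure your model supports them. When the request is made to the underlying model, the header extra-parameters is passed to the model with the value pass-through. This value tells the endpoint to pass the extra parameters to the model. Using extra parameters doesn't guarantee that the model can actually handle them. Read the model's documentation to understand which extra parameters are supported.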

POST /chat/completions HTTP/1.1
Host: <ENDPOINT_URI>
Authorization: Bearer <TOKEN>
Content-Type: application/json
extra-parameters: pass-through

{
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "How many languages are in the world?"
        }
    ],
    "logprobs": true
}
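
The same request can be assembled in a script. The sketch below builds the POST request with the extra-parameters header using Python's standard library and stops short of sending it, so it runs without a deployment. build_chat_request is a hypothetical helper, and the endpoint and token values are placeholders.

```python
import json
import urllib.request


def build_chat_request(endpoint, token, body, pass_through=False):
    """Build (but do not send) a POST /chat/completions request."""
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    if pass_through:
        # Tells the endpoint to forward parameters it does not recognize.
        headers["extra-parameters"] = "pass-through"
    return urllib.request.Request(
        endpoint.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


body = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "How many languages are in the world?"},
    ],
    "logprobs": True,
}
request = build_chat_request(
    "https://your-host-name.your-azure-region.inference.ai.azure.com",
    "<TOKEN>",
    body,
    pass_through=True,
)
# urllib normalizes header capitalization, so compare names in lowercase.
sent_headers = {name.lower(): value for name, value in request.header_items()}
print(sent_headers["extra-parameters"])
```

Keeping the header behind a flag makes it harder to forward unsupported parameters by accident.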

More inference examples

Description	Language	Sample
CURL request	Bash	Link
Azure AI Inference package for C#	C#	Link
Azure AI Inference package for JavaScript	JavaScript	Link
Azure AI Inference package for Python	Python	Link
Python web requests	Python	Link
OpenAI SDK (experimental)	Python	Link
LangChain	Python	Link
Mistral AI	Python	Link
LiteLLM	Python	Link