How to deploy Meta Llama models with Azure AI Studio

Important

Some of the features described in this article might only be available in preview. This preview is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

In this article, you learn about the Meta Llama models. You also learn how to use Azure AI Studio to deploy models from this set either to serverless APIs with pay-as-you-go billing or to managed compute.

Important

Read more about the announcement of Meta Llama 3 models, now available on the Azure AI Model Catalog, in the Microsoft Tech Community Blog and the Meta Announcement Blog.

Meta Llama 3 models and tools are a collection of pretrained and fine-tuned generative text models ranging in scale from 8 billion to 70 billion parameters. The model family also includes fine-tuned versions optimized for dialogue use cases with reinforcement learning from human feedback (RLHF), called Meta-Llama-3-8B-Instruct and Meta-Llama-3-70B-Instruct. See the following GitHub samples to explore integrations with LangChain, LiteLLM, OpenAI and the Azure API.

Deploy Meta Llama models as a serverless API

Certain models in the model catalog can be deployed as a serverless API with pay-as-you-go billing. This kind of deployment provides a way to consume models as an API without hosting them on your subscription, while keeping the enterprise security and compliance that organizations need. This deployment option doesn't require quota from your subscription.

Meta Llama 3 models are deployed as a serverless API with pay-as-you-go billing through Microsoft Azure Marketplace, and the offering might add terms of use and pricing of its own.

Azure Marketplace model offerings

The following models are available in Azure Marketplace for Llama 3 when deployed as a service with pay-as-you-go:

If you need to deploy a different model, deploy it to managed compute instead.

Prerequisites

  • An Azure subscription with a valid payment method. Free or trial Azure subscriptions won't work. If you don't have an Azure subscription, create a paid Azure account to begin.

  • An AI Studio hub.

    Important

    For Meta Llama 3 models, the pay-as-you-go model deployment offering is only available with hubs created in East US 2 and Sweden Central regions.

  • An AI Studio project in Azure AI Studio.

  • Azure role-based access control (Azure RBAC) is used to grant access to operations in Azure AI Studio. To perform the steps in this article, your user account must be assigned the Owner or Contributor role for the Azure subscription. Alternatively, your account can be assigned a custom role that has the following permissions:

    • On the Azure subscription—to subscribe the AI Studio project to the Azure Marketplace offering, once for each project, per offering:

      • Microsoft.MarketplaceOrdering/agreements/offers/plans/read
      • Microsoft.MarketplaceOrdering/agreements/offers/plans/sign/action
      • Microsoft.MarketplaceOrdering/offerTypes/publishers/offers/plans/agreements/read
      • Microsoft.Marketplace/offerTypes/publishers/offers/plans/agreements/read
      • Microsoft.SaaS/register/action
    • On the resource group—to create and use the SaaS resource:

      • Microsoft.SaaS/resources/read
      • Microsoft.SaaS/resources/write
    • On the AI Studio project—to deploy endpoints (the Azure AI Developer role contains these permissions already):

      • Microsoft.MachineLearningServices/workspaces/marketplaceModelSubscriptions/*
      • Microsoft.MachineLearningServices/workspaces/serverlessEndpoints/*

    For more information on permissions, see Role-based access control in Azure AI Studio.

Create a new deployment

To create a deployment:

  1. Sign in to Azure AI Studio.

  2. Choose the model you want to deploy from the Azure AI Studio model catalog.

    Alternatively, you can initiate deployment by starting from your project in AI Studio. Select a project and then select Deployments > + Create.

  3. On the model's Details page, select Deploy and then select Serverless API with Azure AI Content Safety.

  4. Select the project in which you want to deploy your models. To use the pay-as-you-go model deployment offering, your workspace must belong to the East US 2 or Sweden Central region.

  5. On the deployment wizard, select the link to Azure Marketplace Terms to learn more about the terms of use. You can also select the Marketplace offer details tab to learn about pricing for the selected model.

  6. If this is your first time deploying the model in the project, you have to subscribe your project for the particular offering (for example, Meta-Llama-3-70B) from Azure Marketplace. This step requires that your account has the Azure subscription permissions and resource group permissions listed in the prerequisites. Each project has its own subscription to the particular Azure Marketplace offering, which allows you to control and monitor spending. Select Subscribe and Deploy.

    Note

    Subscribing a project to a particular Azure Marketplace offering (in this case, Meta-Llama-3-70B) requires that your account has Contributor or Owner access at the subscription level where the project is created. Alternatively, your user account can be assigned a custom role that has the Azure subscription permissions and resource group permissions listed in the prerequisites.

  7. After you subscribe the project to the particular Azure Marketplace offering, subsequent deployments of the same offering in the same project don't require subscribing again. Therefore, you don't need the subscription-level permissions for subsequent deployments. If this scenario applies to you, select Continue to deploy.

  8. Give the deployment a name. This name becomes part of the deployment API URL. This URL must be unique in each Azure region.

  9. Select Deploy. Wait until the deployment is ready and you're redirected to the Deployments page.

  10. Select Open in playground to start interacting with the model.

  11. You can return to the Deployments page, select the deployment, and note the endpoint's Target URL and the Secret Key, which you can use to call the deployment and generate completions.

  12. You can always find the endpoint's details, URL, and access keys by navigating to the project page and selecting Deployments from the left menu.

To learn about billing for Meta Llama models deployed with pay-as-you-go, see Cost and quota considerations for Llama 3 models deployed as a service.

Consume Meta Llama models as a service

Models deployed as a service can be consumed using either the chat or the completions API, depending on the type of model you deployed.

  1. Select your project or hub and then select Deployments from the left menu.

  2. Find and select the deployment you created.

  3. Select Open in playground.

  4. Select View code and copy the Endpoint URL and the Key value.

  5. Make an API request based on the type of model you deployed.

    • For completions models, such as Meta-Llama-3-8B, use the /completions API.
    • For chat models, such as Meta-Llama-3-8B-Instruct, use the /chat/completions API.

    For more information on using the APIs, see the reference section.
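
As a minimal illustration of this step, the following Python sketch uses the `requests` package to post to the route that matches the deployed model type. The endpoint URL and key are placeholders; substitute the values you copied from the View code pane.

```python
import requests

# Placeholders: copy these values from View code in the playground.
ENDPOINT_URL = "https://<your-endpoint>"
API_KEY = "<your-api-key>"

headers = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}

# Chat models (for example, Meta-Llama-3-8B-Instruct) use the /v1/chat/completions route.
chat_payload = {
    "messages": [{"role": "user", "content": "What's the distance to the moon?"}],
    "max_tokens": 128,
}
chat_response = requests.post(f"{ENDPOINT_URL}/v1/chat/completions", headers=headers, json=chat_payload)
chat_response.raise_for_status()
print(chat_response.json()["choices"][0]["message"]["content"])

# Completions models (for example, Meta-Llama-3-8B) use the /v1/completions route.
completion_payload = {"prompt": "What's the distance to the moon?", "max_tokens": 128}
completion_response = requests.post(f"{ENDPOINT_URL}/v1/completions", headers=headers, json=completion_payload)
completion_response.raise_for_status()
print(completion_response.json()["choices"][0]["text"])
```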

Reference for Meta Llama models deployed as a service

Llama models accept either the Azure AI Model Inference API on the route /chat/completions or a Llama Chat API on /v1/chat/completions. In the same way, text completions can be generated by using either the Azure AI Model Inference API on the route /completions or a Llama Completions API on /v1/completions.

The Azure AI Model Inference API schema can be found in the reference for Chat Completions article and an OpenAPI specification can be obtained from the endpoint itself.

Completions API

Use the method POST to send the request to the /v1/completions route:

Request

POST /v1/completions HTTP/1.1
Host: <DEPLOYMENT_URI>
Authorization: Bearer <TOKEN>
Content-type: application/json

Request schema

The payload is a JSON-formatted string containing the following parameters:

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| prompt | string | No default. This value must be specified. | The prompt to send to the model. |
| stream | boolean | False | Streaming allows the generated tokens to be sent as data-only server-sent events whenever they become available. |
| max_tokens | integer | 16 | The maximum number of tokens to generate in the completion. The token count of your prompt plus max_tokens can't exceed the model's context length. |
| top_p | float | 1 | An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering top_p or temperature, but not both. |
| temperature | float | 1 | The sampling temperature to use, between 0 and 2. Higher values mean the model samples more broadly from the distribution of tokens. Zero means greedy sampling. We recommend altering this or top_p, but not both. |
| n | integer | 1 | How many completions to generate for each prompt. Note: Because this parameter generates many completions, it can quickly consume your token quota. |
| stop | array | null | A string or a list of strings at which the API stops generating further tokens. The returned text won't contain the stop sequence. |
| best_of | integer | 1 | Generates best_of completions server-side and returns the "best" (the one with the lowest log probability per token). Results can't be streamed. When used with n, best_of controls the number of candidate completions and n specifies how many to return; best_of must be greater than n. Note: Because this parameter generates many completions, it can quickly consume your token quota. |
| logprobs | integer | null | Includes the log probabilities of the logprobs most likely tokens and the chosen token. For example, if logprobs is 10, the API returns a list of the 10 most likely tokens. The API always returns the logprob of the sampled token, so there might be up to logprobs+1 elements in the response. |
| presence_penalty | float | null | Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. |
| ignore_eos | boolean | True | Whether to ignore the EOS token and continue generating tokens after the EOS token is generated. |
| use_beam_search | boolean | False | Whether to use beam search instead of sampling. If enabled, best_of must be greater than 1 and temperature must be 0. |
| stop_token_ids | array | null | List of IDs for tokens that, when generated, stop further token generation. The returned output contains the stop tokens unless the stop tokens are special tokens. |
| skip_special_tokens | boolean | null | Whether to skip special tokens in the output. |

Example

Body

{
    "prompt": "What's the distance to the moon?",
    "temperature": 0.8,
    "max_tokens": 512
}
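
The request body above can be sent with any HTTP client. As an illustrative sketch, the following Python code uses the `requests` package; the deployment URI and key are placeholders for your own deployment's values.

```python
import requests

# Placeholders: replace with your deployment's Target URI and key.
DEPLOYMENT_URI = "https://<your-endpoint>"
API_KEY = "<your-api-key>"

body = {
    "prompt": "What's the distance to the moon?",
    "temperature": 0.8,
    "max_tokens": 512,
}

response = requests.post(
    f"{DEPLOYMENT_URI}/v1/completions",
    headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
    json=body,
)
response.raise_for_status()
print(response.json()["choices"][0]["text"])
```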

Response schema

The response payload is a dictionary with the following fields.

| Key | Type | Description |
| --- | --- | --- |
| id | string | A unique identifier for the completion. |
| choices | array | The list of completion choices the model generated for the input prompt. |
| created | integer | The Unix timestamp (in seconds) of when the completion was created. |
| model | string | The model_id used for completion. |
| object | string | The object type, which is always text_completion. |
| usage | object | Usage statistics for the completion request. |

Tip

In streaming mode, finish_reason is null for each chunk of the response except the last one; the stream is terminated by a [DONE] payload.

The choices object is a dictionary with the following fields.

| Key | Type | Description |
| --- | --- | --- |
| index | integer | Choice index. When best_of > 1, the index in this array might not be in order and might not be 0 to n-1. |
| text | string | Completion result. |
| finish_reason | string | The reason the model stopped generating tokens: stop (the model hit a natural stop point or a provided stop sequence), length (the maximum number of tokens was reached), content_filter (when RAI moderates and CMP forces moderation), content_filter_error (an error occurred during moderation and a decision couldn't be made on the response), or null (the API response is still in progress or incomplete). |
| logprobs | object | The log probabilities of the generated tokens in the output text. |

The usage object is a dictionary with the following fields.

| Key | Type | Value |
| --- | --- | --- |
| prompt_tokens | integer | Number of tokens in the prompt. |
| completion_tokens | integer | Number of tokens generated in the completion. |
| total_tokens | integer | Total tokens. |

The logprobs object is a dictionary with the following fields:

| Key | Type | Value |
| --- | --- | --- |
| text_offsets | array of integers | The position or index of each token in the completion output. |
| token_logprobs | array of float | Selected logprobs from the dictionaries in the top_logprobs array. |
| tokens | array of string | Selected tokens. |
| top_logprobs | array of dictionary | Array of dictionaries. In each dictionary, the key is the token and the value is its probability. |

Example

{
    "id": "12345678-1234-1234-1234-abcdefghijkl",
    "object": "text_completion",
    "created": 217877,
    "choices": [
        {
            "index": 0,
            "text": "The Moon is an average of 238,855 miles away from Earth, which is about 30 Earths away.",
            "logprobs": null,
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 7,
        "total_tokens": 23,
        "completion_tokens": 16
    }
}
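
For illustration, the following sketch reads the documented fields from a parsed completions response such as the one above (the dictionary is abbreviated example data, not live output).

```python
# Abbreviated example data based on the response payload shown above.
completion = {
    "choices": [
        {
            "index": 0,
            "text": "The Moon is an average of 238,855 miles away from Earth.",
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 7, "completion_tokens": 16, "total_tokens": 23},
}

# Each choice carries the generated text and the reason generation stopped.
for choice in completion["choices"]:
    print(f"[{choice['finish_reason']}] {choice['text']}")

# The usage object reports token consumption for the request.
usage = completion["usage"]
print(f"prompt={usage['prompt_tokens']} completion={usage['completion_tokens']} total={usage['total_tokens']}")
```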

Chat API

Use the method POST to send the request to the /v1/chat/completions route:

Request

POST /v1/chat/completions HTTP/1.1
Host: <DEPLOYMENT_URI>
Authorization: Bearer <TOKEN>
Content-type: application/json

Request schema

The payload is a JSON-formatted string containing the following parameters:

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| messages | string | No default. This value must be specified. | The message or history of messages to use to prompt the model. |
| stream | boolean | False | Streaming allows the generated tokens to be sent as data-only server-sent events whenever they become available. |
| max_tokens | integer | 16 | The maximum number of tokens to generate in the completion. The token count of your prompt plus max_tokens can't exceed the model's context length. |
| top_p | float | 1 | An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering top_p or temperature, but not both. |
| temperature | float | 1 | The sampling temperature to use, between 0 and 2. Higher values mean the model samples more broadly from the distribution of tokens. Zero means greedy sampling. We recommend altering this or top_p, but not both. |
| n | integer | 1 | How many completions to generate for each prompt. Note: Because this parameter generates many completions, it can quickly consume your token quota. |
| stop | array | null | A string or a list of strings at which the API stops generating further tokens. The returned text won't contain the stop sequence. |
| best_of | integer | 1 | Generates best_of completions server-side and returns the "best" (the one with the lowest log probability per token). Results can't be streamed. When used with n, best_of controls the number of candidate completions and n specifies how many to return; best_of must be greater than n. Note: Because this parameter generates many completions, it can quickly consume your token quota. |
| logprobs | integer | null | Includes the log probabilities of the logprobs most likely tokens and the chosen token. For example, if logprobs is 10, the API returns a list of the 10 most likely tokens. The API always returns the logprob of the sampled token, so there might be up to logprobs+1 elements in the response. |
| presence_penalty | float | null | Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. |
| ignore_eos | boolean | True | Whether to ignore the EOS token and continue generating tokens after the EOS token is generated. |
| use_beam_search | boolean | False | Whether to use beam search instead of sampling. If enabled, best_of must be greater than 1 and temperature must be 0. |
| stop_token_ids | array | null | List of IDs for tokens that, when generated, stop further token generation. The returned output contains the stop tokens unless the stop tokens are special tokens. |
| skip_special_tokens | boolean | null | Whether to skip special tokens in the output. |

The messages object has the following fields:

| Key | Type | Value |
| --- | --- | --- |
| content | string | The contents of the message. Content is required for all messages. |
| role | string | The role of the message's author. One of system, user, or assistant. |

Example

Body

{
    "messages":
    [
        {
            "role": "system",
            "content": "You are a helpful assistant that translates English to Italian."
        },
        {
            "role": "user",
            "content": "Translate the following sentence from English to Italian: I love programming."
        }
    ],
    "temperature": 0.8,
    "max_tokens": 512
}
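
As with the completions example, this body can be posted to the chat route with any HTTP client. A brief Python sketch using `requests` and placeholder endpoint values follows.

```python
import requests

# Placeholders: replace with your deployment's Target URI and key.
DEPLOYMENT_URI = "https://<your-endpoint>"
API_KEY = "<your-api-key>"

body = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant that translates English to Italian."},
        {"role": "user", "content": "Translate the following sentence from English to Italian: I love programming."},
    ],
    "temperature": 0.8,
    "max_tokens": 512,
}

response = requests.post(
    f"{DEPLOYMENT_URI}/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
    json=body,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```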

Response schema

The response payload is a dictionary with the following fields.

| Key | Type | Description |
| --- | --- | --- |
| id | string | A unique identifier for the completion. |
| choices | array | The list of completion choices the model generated for the input messages. |
| created | integer | The Unix timestamp (in seconds) of when the completion was created. |
| model | string | The model_id used for completion. |
| object | string | The object type, which is always chat.completion. |
| usage | object | Usage statistics for the completion request. |

Tip

In streaming mode, finish_reason is null for each chunk of the response except the last one; the stream is terminated by a [DONE] payload. In each choices object, the messages key is replaced by delta.
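
As an illustration of this streaming behavior, the sketch below reads the data-only server-sent events with the `requests` package. The exact event framing (a `data: ` prefix per event) is an assumption based on standard server-sent events, so treat this as a starting point rather than a definitive client.

```python
import json
import requests

# Placeholders: replace with your deployment's Target URI and key.
DEPLOYMENT_URI = "https://<your-endpoint>"
API_KEY = "<your-api-key>"

body = {
    "messages": [{"role": "user", "content": "What's the distance to the moon?"}],
    "max_tokens": 256,
    "stream": True,
}

with requests.post(
    f"{DEPLOYMENT_URI}/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
    json=body,
    stream=True,
) as response:
    response.raise_for_status()
    for raw_line in response.iter_lines():
        if not raw_line:
            continue
        chunk = raw_line.decode("utf-8")
        # Assumed SSE framing: each event is prefixed with "data: " and the stream ends with [DONE].
        if chunk.startswith("data: "):
            chunk = chunk[len("data: "):]
        if chunk.strip() == "[DONE]":
            break
        event = json.loads(chunk)
        # In streaming mode, each choice carries a delta instead of a message.
        delta = event["choices"][0].get("delta", {})
        print(delta.get("content", ""), end="", flush=True)
```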

The choices object is a dictionary with the following fields.

| Key | Type | Description |
| --- | --- | --- |
| index | integer | Choice index. When best_of > 1, the index in this array might not be in order and might not be 0 to n-1. |
| messages or delta | string | Chat completion result in the messages object. When streaming mode is used, the delta key is used instead. |
| finish_reason | string | The reason the model stopped generating tokens: stop (the model hit a natural stop point or a provided stop sequence), length (the maximum number of tokens was reached), content_filter (when RAI moderates and CMP forces moderation), content_filter_error (an error occurred during moderation and a decision couldn't be made on the response), or null (the API response is still in progress or incomplete). |
| logprobs | object | The log probabilities of the generated tokens in the output text. |

The usage object is a dictionary with the following fields.

| Key | Type | Value |
| --- | --- | --- |
| prompt_tokens | integer | Number of tokens in the prompt. |
| completion_tokens | integer | Number of tokens generated in the completion. |
| total_tokens | integer | Total tokens. |

The logprobs object is a dictionary with the following fields:

| Key | Type | Value |
| --- | --- | --- |
| text_offsets | array of integers | The position or index of each token in the completion output. |
| token_logprobs | array of float | Selected logprobs from the dictionaries in the top_logprobs array. |
| tokens | array of string | Selected tokens. |
| top_logprobs | array of dictionary | Array of dictionaries. In each dictionary, the key is the token and the value is its probability. |

Example

The following is an example response:

{
    "id": "12345678-1234-1234-1234-abcdefghijkl",
    "object": "chat.completion",
    "created": 2012359,
    "model": "",
    "choices": [
        {
            "index": 0,
            "finish_reason": "stop",
            "message": {
                "role": "assistant",
                "content": "Sure, I\'d be happy to help! The translation of ""I love programming"" from English to Italian is:\n\n""Amo la programmazione.""\n\nHere\'s a breakdown of the translation:\n\n* ""I love"" in English becomes ""Amo"" in Italian.\n* ""programming"" in English becomes ""la programmazione"" in Italian.\n\nI hope that helps! Let me know if you have any other sentences you\'d like me to translate."
            }
        }
    ],
    "usage": {
        "prompt_tokens": 10,
        "total_tokens": 40,
        "completion_tokens": 30
    }
}

Deploy Meta Llama models to managed compute

Apart from deploying with the pay-as-you-go managed service, you can also deploy Meta Llama models to managed compute in AI Studio. When deployed to managed compute, you can select all the details about the infrastructure running the model, including the virtual machines to use and the number of instances to handle the load you're expecting. Models deployed to managed compute consume quota from your subscription. All the models in the Llama family can be deployed to managed compute.

Follow these steps to deploy a model such as Llama-2-7b-chat to a real-time endpoint in Azure AI Studio.

  1. Choose the model you want to deploy from the Azure AI Studio model catalog.

    Alternatively, you can initiate deployment by starting from your project in AI Studio. Select your project and then select Deployments > + Create.

  2. On the model's Details page, select Deploy next to the View license button.

    A screenshot showing how to deploy a model with the real-time endpoint option.

  3. On the Deploy with Azure AI Content Safety (preview) page, select Skip Azure AI Content Safety so that you can continue to deploy the model using the UI.

    Tip

    In general, we recommend that you select Enable Azure AI Content Safety (Recommended) when you deploy a Llama model. This deployment option is currently supported only through the Python SDK, and it happens in a notebook.

  4. Select Proceed.

  5. Select the project where you want to create a deployment.

    Tip

    If you don't have enough quota available in the selected project, you can use the option I want to use shared quota and I acknowledge that this endpoint will be deleted in 168 hours.

  6. Select the Virtual machine and the Instance count that you want to assign to the deployment.

  7. Select if you want to create this deployment as part of a new endpoint or an existing one. Endpoints can host multiple deployments while keeping resource configuration exclusive for each of them. Deployments under the same endpoint share the endpoint URI and its access keys.

  8. Indicate if you want to enable Inferencing data collection (preview).

  9. Select Deploy. After a few moments, the endpoint's Details page opens up.

  10. Wait for the endpoint creation and deployment to finish. This step can take a few minutes.

  11. Select the Consume tab of the deployment to obtain code samples that can be used to consume the deployed model in your application.

Consume Llama 2 models deployed to managed compute

For reference about how to invoke Llama models deployed to managed compute, see the model's card in the Azure AI Studio model catalog. Each model's card has an overview page that includes a description of the model, samples for code-based inferencing, fine-tuning, and model evaluation.
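
The Consume tab is the authoritative source for the request format of a managed compute deployment. As a heavily hedged sketch, a call typically posts JSON to the endpoint's scoring URI with a bearer key, along the lines shown below; the scoring URI, key, and payload shape are hypothetical placeholders, and the actual schema for your model is shown on the model card and the Consume tab.

```python
import requests

# Hypothetical placeholders: take the real scoring URI, key, and request schema
# from the deployment's Consume tab in Azure AI Studio.
SCORING_URI = "https://<your-endpoint>.<region>.inference.ml.azure.com/score"
API_KEY = "<your-endpoint-key>"

# Assumed payload shape for illustration only; the real schema depends on the model.
payload = {
    "input_data": {
        "input_string": [{"role": "user", "content": "What's the distance to the moon?"}],
        "parameters": {"max_new_tokens": 256},
    }
}

response = requests.post(
    SCORING_URI,
    headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
    json=payload,
)
response.raise_for_status()
print(response.json())
```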

Cost and quotas

Cost and quota considerations for Llama models deployed as a service

Llama models deployed as a service are offered by Meta through the Azure Marketplace and integrated with Azure AI Studio for use. You can find the Azure Marketplace pricing when deploying or fine-tuning the models.

Each time a project subscribes to a given offer from the Azure Marketplace, a new resource is created to track the costs associated with its consumption. The same resource is used to track costs associated with inference and fine-tuning; however, multiple meters are available to track each scenario independently.

For more information on how to track costs, see monitor costs for models offered through the Azure Marketplace.

A screenshot showing different resources corresponding to different model offers and their associated meters.

Quota is managed per deployment. Each deployment has a rate limit of 200,000 tokens per minute and 1,000 API requests per minute. However, we currently limit one deployment per model per project. Contact Microsoft Azure Support if the current rate limits aren't sufficient for your scenarios.
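
Given these per-deployment rate limits, a client typically needs to back off when it's throttled. The sketch below retries on HTTP 429 responses as an illustration; the status code check and exponential backoff are assumptions about standard throttling behavior, not a documented contract.

```python
import time

import requests

# Placeholders: replace with your deployment's Target URI and key.
DEPLOYMENT_URI = "https://<your-endpoint>"
API_KEY = "<your-api-key>"


def post_with_retry(path: str, body: dict, max_retries: int = 5) -> dict:
    """POST to the deployment, backing off when the service throttles the request."""
    for attempt in range(max_retries):
        response = requests.post(
            f"{DEPLOYMENT_URI}{path}",
            headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
            json=body,
        )
        if response.status_code == 429:  # Throttled: wait and retry with exponential backoff.
            time.sleep(2 ** attempt)
            continue
        response.raise_for_status()
        return response.json()
    raise RuntimeError("Request kept being throttled; consider reducing request volume.")


result = post_with_retry(
    "/v1/chat/completions",
    {"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 32},
)
print(result["choices"][0]["message"]["content"])
```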

Cost and quota considerations for Llama models deployed as managed compute

For deployment and inferencing of Llama models with managed compute, you consume virtual machine (VM) core quota that is assigned to your subscription on a per-region basis. When you sign up for Azure AI Studio, you receive a default VM quota for several VM families available in the region. You can continue to create deployments until you reach your quota limit. Once you reach this limit, you can request a quota increase.

Content filtering

Models deployed as a serverless API with pay-as-you-go are protected by Azure AI Content Safety. When deployed to managed compute, you can opt out of this capability. With Azure AI Content Safety enabled, both the prompt and completion pass through an ensemble of classification models aimed at detecting and preventing the output of harmful content. The content filtering system detects and takes action on specific categories of potentially harmful content in both input prompts and output completions. Learn more about Azure AI Content Safety.
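
Because the documented finish_reason values include content_filter, an application can check that field to detect filtered completions. A minimal sketch follows, using abbreviated example data in place of a live chat response.

```python
# Abbreviated example data standing in for a parsed chat response.
completion = {
    "choices": [
        {
            "index": 0,
            "finish_reason": "content_filter",
            "message": {"role": "assistant", "content": ""},
        }
    ]
}

for choice in completion["choices"]:
    if choice["finish_reason"] == "content_filter":
        # The content filtering system moderated this completion.
        print(f"Choice {choice['index']} was filtered by the content filtering system.")
    else:
        print(choice["message"]["content"])
```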

Next steps