How to deploy Meta Llama models with Azure AI Studio
Important
Some of the features described in this article might only be available in preview. This preview is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.
In this article, you learn about the Meta Llama models. You also learn how to use Azure AI Studio to deploy models from this set either to serverless APIs with pay-as-you-go billing or to managed compute.
Important
Read more about the announcement of Meta Llama 3 models now available in the Azure AI model catalog in the Microsoft Tech Community Blog and the Meta Announcement Blog.
Meta Llama 3 models and tools are a collection of pretrained and fine-tuned generative text models ranging in scale from 8 billion to 70 billion parameters. The model family also includes fine-tuned versions optimized for dialogue use cases with reinforcement learning from human feedback (RLHF), called Meta-Llama-3-8B-Instruct and Meta-Llama-3-70B-Instruct. See the following GitHub samples to explore integrations with LangChain, LiteLLM, OpenAI and the Azure API.
Deploy Meta Llama models as a serverless API
Certain models in the model catalog can be deployed as a serverless API with pay-as-you-go, providing a way to consume them as an API without hosting them on your subscription while keeping the enterprise security and compliance organizations need. This deployment option doesn't require quota from your subscription.
Meta Llama 3 models are deployed as a serverless API with pay-as-you-go billing through Microsoft Azure Marketplace, and they might have additional terms of use and pricing.
Azure Marketplace model offerings
The following models are available in Azure Marketplace for Llama 3 when deployed as a service with pay-as-you-go billing:

- Meta-Llama-3-8B
- Meta-Llama-3-8B-Instruct
- Meta-Llama-3-70B
- Meta-Llama-3-70B-Instruct
If you need to deploy a different model, deploy it to managed compute instead.
Prerequisites
An Azure subscription with a valid payment method. Free or trial Azure subscriptions won't work. If you don't have an Azure subscription, create a paid Azure account to begin.
An AI Studio hub. The serverless API model deployment offering for Meta Llama 3 is only available with hubs created in these regions:
- East US
- East US 2
- North Central US
- South Central US
- West US
- West US 3
- Sweden Central
For a list of regions that are available for each of the models supporting serverless API endpoint deployments, see Region availability for models in serverless API endpoints.
An AI Studio project in Azure AI Studio.
Azure role-based access control (Azure RBAC) is used to grant access to operations in Azure AI Studio. To perform the steps in this article, your user account must be assigned the Owner or Contributor role for the Azure subscription. Alternatively, your account can be assigned a custom role that has the following permissions:
On the Azure subscription—to subscribe the AI Studio project to the Azure Marketplace offering, once for each project, per offering:
Microsoft.MarketplaceOrdering/agreements/offers/plans/read
Microsoft.MarketplaceOrdering/agreements/offers/plans/sign/action
Microsoft.MarketplaceOrdering/offerTypes/publishers/offers/plans/agreements/read
Microsoft.Marketplace/offerTypes/publishers/offers/plans/agreements/read
Microsoft.SaaS/register/action
On the resource group—to create and use the SaaS resource:
Microsoft.SaaS/resources/read
Microsoft.SaaS/resources/write
On the AI Studio project—to deploy endpoints (the Azure AI Developer role contains these permissions already):
Microsoft.MachineLearningServices/workspaces/marketplaceModelSubscriptions/*
Microsoft.MachineLearningServices/workspaces/serverlessEndpoints/*
For more information on permissions, see Role-based access control in Azure AI Studio.
Create a new deployment
To create a deployment:
Sign in to Azure AI Studio.
Choose the model you want to deploy from the Azure AI Studio model catalog.
Alternatively, you can initiate deployment by starting from your project in AI Studio. Select a project and then select Deployments > + Create.
On the model's Details page, select Deploy and then select Serverless API with Azure AI Content Safety.
Select the project in which you want to deploy your models. To use the pay-as-you-go model deployment offering, your workspace must belong to the East US 2 or Sweden Central region.
On the deployment wizard, select the link to Azure Marketplace Terms to learn more about the terms of use. You can also select the Marketplace offer details tab to learn about pricing for the selected model.
If this is your first time deploying the model in the project, you have to subscribe your project for the particular offering (for example, Meta-Llama-3-70B) from Azure Marketplace. This step requires that your account has the Azure subscription permissions and resource group permissions listed in the prerequisites. Each project has its own subscription to the particular Azure Marketplace offering, which allows you to control and monitor spending. Select Subscribe and Deploy.
Note
Subscribing a project to a particular Azure Marketplace offering (in this case, Meta-Llama-3-70B) requires that your account has Contributor or Owner access at the subscription level where the project is created. Alternatively, your user account can be assigned a custom role that has the Azure subscription permissions and resource group permissions listed in the prerequisites.
Once you subscribe the project to a particular Azure Marketplace offering, subsequent deployments of the same offering in the same project don't require subscribing again. Therefore, you don't need the subscription-level permissions for subsequent deployments. If this scenario applies to you, select Continue to deploy.
Give the deployment a name. This name becomes part of the deployment API URL. This URL must be unique in each Azure region.
Select Deploy. Wait until the deployment is ready and you're redirected to the Deployments page.
Select Open in playground to start interacting with the model.
You can return to the Deployments page, select the deployment, and note the endpoint's Target URL and the Secret Key, which you can use to call the deployment and generate completions.
You can always find the endpoint's details, URL, and access keys by navigating to the project page and selecting Deployments from the left menu.
To learn about billing for Meta Llama models deployed with pay-as-you-go, see Cost and quota considerations for Llama 3 models deployed as a service.
Consume Meta Llama models as a service
Models deployed as a service can be consumed using either the chat or the completions API, depending on the type of model you deployed.
Select your project or hub and then select Deployments from the left menu.
Find and select the deployment you created.
Select Open in playground.
Select View code and copy the Endpoint URL and the Key value.
Make an API request based on the type of model you deployed.
- For completions models, such as Meta-Llama-3-8B, use the /completions API.
- For chat models, such as Meta-Llama-3-8B-Instruct, use the /chat/completions API.

For more information on using the APIs, see the reference section.
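As an illustration of the chat route, the following minimal Python sketch calls a chat model deployed as a serverless API by using the requests package. The endpoint URL and key are the Target URL and Secret Key noted on the Deployments page; the values shown are placeholders, and this snippet isn't part of the official samples.

```python
import requests

# Replace these placeholders with the Target URL and Secret Key
# shown for your deployment on the Deployments page in Azure AI Studio.
ENDPOINT_URL = "https://<your-endpoint-target-url>"
API_KEY = "<your-secret-key>"

payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What's the distance to the moon?"},
    ],
    "max_tokens": 256,
    "temperature": 0.8,
}

response = requests.post(
    f"{ENDPOINT_URL}/v1/chat/completions",
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
    json=payload,
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```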
Reference for Meta Llama models deployed as a service
Llama models accept both the Azure AI Model Inference API on the route /chat/completions and a Llama Chat API on /v1/chat/completions. In the same way, text completions can be generated by using the Azure AI Model Inference API on the route /completions or a Llama Completions API on /v1/completions.

The Azure AI Model Inference API schema can be found in the reference for Chat Completions article, and an OpenAPI specification can be obtained from the endpoint itself.
Completions API
Use the method POST to send the request to the /v1/completions route:
Request
POST /v1/completions HTTP/1.1
Host: <DEPLOYMENT_URI>
Authorization: Bearer <TOKEN>
Content-type: application/json
Request schema
Payload is a JSON formatted string containing the following parameters:
Key | Type | Default | Description |
---|---|---|---|
prompt | string | No default. This value must be specified. | The prompt to send to the model. |
stream | boolean | False | Streaming allows the generated tokens to be sent as data-only server-sent events whenever they become available. |
max_tokens | integer | 16 | The maximum number of tokens to generate in the completion. The token count of your prompt plus max_tokens can't exceed the model's context length. |
top_p | float | 1 | An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering top_p or temperature, but not both. |
temperature | float | 1 | The sampling temperature to use, between 0 and 2. Higher values mean the model samples more broadly across the distribution of tokens. Zero means greedy sampling. We recommend altering this or top_p, but not both. |
n | integer | 1 | How many completions to generate for each prompt. Note: Because this parameter generates many completions, it can quickly consume your token quota. |
stop | array | null | String or a list of strings containing the words where the API stops generating further tokens. The returned text won't contain the stop sequence. |
best_of | integer | 1 | Generates best_of completions server-side and returns the "best" (the one with the lowest log probability per token). Results can't be streamed. When used with n, best_of controls the number of candidate completions and n specifies how many to return; best_of must be greater than n. Note: Because this parameter generates many completions, it can quickly consume your token quota. |
logprobs | integer | null | The number of most likely tokens to return log probabilities for, along with the chosen tokens. For example, if logprobs is 10, the API returns a list of the 10 most likely tokens. The API always returns the logprob of the sampled token, so there might be up to logprobs+1 elements in the response. |
presence_penalty | float | null | Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. |
ignore_eos | boolean | True | Whether to ignore the EOS token and continue generating tokens after the EOS token is generated. |
use_beam_search | boolean | False | Whether to use beam search instead of sampling. In that case, best_of must be greater than 1 and temperature must be 0. |
stop_token_ids | array | null | List of IDs for tokens that, when generated, stop further token generation. The returned output contains the stop tokens unless the stop tokens are special tokens. |
skip_special_tokens | boolean | null | Whether to skip special tokens in the output. |
Example
Body
{
"prompt": "What's the distance to the moon?",
"temperature": 0.8,
"max_tokens": 512
}
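As a hedged illustration (not part of the original article), the same body can be posted to the route shown in the Request section by using the Python requests package. The host and token are placeholders that map to your deployment's URI and key.

```python
import requests

body = {
    "prompt": "What's the distance to the moon?",
    "temperature": 0.8,
    "max_tokens": 512,
}

# <DEPLOYMENT_URI> and <TOKEN> are placeholders for your endpoint URI and key.
resp = requests.post(
    "https://<DEPLOYMENT_URI>/v1/completions",
    headers={"Authorization": "Bearer <TOKEN>", "Content-Type": "application/json"},
    json=body,
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```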
Response schema
The response payload is a dictionary with the following fields.
Key | Type | Description |
---|---|---|
id | string | A unique identifier for the completion. |
choices | array | The list of completion choices the model generated for the input prompt. |
created | integer | The Unix timestamp (in seconds) of when the completion was created. |
model | string | The model_id used for completion. |
object | string | The object type, which is always text_completion. |
usage | object | Usage statistics for the completion request. |
Tip
In streaming mode, finish_reason is null for every chunk of the response except the last one, and the stream is terminated by a [DONE] payload.
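The following Python sketch shows one way to read such a stream, assuming the endpoint emits data-only server-sent events as described above (the exact event framing can vary, and the host and token are placeholders):

```python
import json
import requests

body = {"prompt": "What's the distance to the moon?", "max_tokens": 256, "stream": True}

with requests.post(
    "https://<DEPLOYMENT_URI>/v1/completions",
    headers={"Authorization": "Bearer <TOKEN>", "Content-Type": "application/json"},
    json=body,
    stream=True,
    timeout=60,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line:
            continue
        data = line.decode("utf-8").removeprefix("data: ")
        if data == "[DONE]":
            break  # the last event carries the [DONE] payload
        chunk = json.loads(data)
        # Each chunk contains a partial completion in choices[0].text
        print(chunk["choices"][0]["text"], end="", flush=True)
```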
The choices object is a dictionary with the following fields.

Key | Type | Description |
---|---|---|
index | integer | Choice index. When best_of > 1, the index in this array might not be in order and might not be 0 to n-1. |
text | string | Completion result. |
finish_reason | string | The reason the model stopped generating tokens: - stop: the model hit a natural stop point or a provided stop sequence. - length: the maximum number of tokens was reached. - content_filter: when RAI moderates and CMP forces moderation. - content_filter_error: an error occurred during moderation and a decision couldn't be made on the response. - null: the API response is still in progress or incomplete. |
logprobs | object | The log probabilities of the generated tokens in the output text. |
The usage object is a dictionary with the following fields.

Key | Type | Value |
---|---|---|
prompt_tokens | integer | Number of tokens in the prompt. |
completion_tokens | integer | Number of tokens generated in the completion. |
total_tokens | integer | Total tokens. |
The logprobs object is a dictionary with the following fields:

Key | Type | Value |
---|---|---|
text_offsets | array of integers | The position or index of each token in the completion output. |
token_logprobs | array of float | Selected logprobs from the dictionary in the top_logprobs array. |
tokens | array of string | Selected tokens. |
top_logprobs | array of dictionary | Array of dictionaries. In each dictionary, the key is the token and the value is its probability. |
Example
{
"id": "12345678-1234-1234-1234-abcdefghijkl",
"object": "text_completion",
"created": 217877,
"choices": [
{
"index": 0,
"text": "The Moon is an average of 238,855 miles away from Earth, which is about 30 Earths away.",
"logprobs": null,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 7,
"total_tokens": 23,
"completion_tokens": 16
}
}
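As a small, hedged illustration of reading the documented fields, the following Python snippet parses the example response above (pasted as a literal string) and extracts the completion text and usage counters:

```python
import json

# The example response from above, pasted as a literal JSON string.
completion = json.loads("""
{
  "id": "12345678-1234-1234-1234-abcdefghijkl",
  "object": "text_completion",
  "created": 217877,
  "choices": [
    {"index": 0,
     "text": "The Moon is an average of 238,855 miles away from Earth, which is about 30 Earths away.",
     "logprobs": null,
     "finish_reason": "stop"}
  ],
  "usage": {"prompt_tokens": 7, "total_tokens": 23, "completion_tokens": 16}
}
""")

print(completion["choices"][0]["text"])
print(completion["choices"][0]["finish_reason"])   # "stop"
print(completion["usage"]["total_tokens"])         # 23
```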
Chat API
Use the method POST to send the request to the /v1/chat/completions route:
Request
POST /v1/chat/completions HTTP/1.1
Host: <DEPLOYMENT_URI>
Authorization: Bearer <TOKEN>
Content-type: application/json
Request schema
Payload is a JSON formatted string containing the following parameters:
Key | Type | Default | Description |
---|---|---|---|
messages | array | No default. This value must be specified. | The message or history of messages to use to prompt the model. |
stream | boolean | False | Streaming allows the generated tokens to be sent as data-only server-sent events whenever they become available. |
max_tokens | integer | 16 | The maximum number of tokens to generate in the completion. The token count of your prompt plus max_tokens can't exceed the model's context length. |
top_p | float | 1 | An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering top_p or temperature, but not both. |
temperature | float | 1 | The sampling temperature to use, between 0 and 2. Higher values mean the model samples more broadly across the distribution of tokens. Zero means greedy sampling. We recommend altering this or top_p, but not both. |
n | integer | 1 | How many completions to generate for each prompt. Note: Because this parameter generates many completions, it can quickly consume your token quota. |
stop | array | null | String or a list of strings containing the words where the API stops generating further tokens. The returned text won't contain the stop sequence. |
best_of | integer | 1 | Generates best_of completions server-side and returns the "best" (the one with the lowest log probability per token). Results can't be streamed. When used with n, best_of controls the number of candidate completions and n specifies how many to return; best_of must be greater than n. Note: Because this parameter generates many completions, it can quickly consume your token quota. |
logprobs | integer | null | The number of most likely tokens to return log probabilities for, along with the chosen tokens. For example, if logprobs is 10, the API returns a list of the 10 most likely tokens. The API always returns the logprob of the sampled token, so there might be up to logprobs+1 elements in the response. |
presence_penalty | float | null | Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. |
ignore_eos | boolean | True | Whether to ignore the EOS token and continue generating tokens after the EOS token is generated. |
use_beam_search | boolean | False | Whether to use beam search instead of sampling. In that case, best_of must be greater than 1 and temperature must be 0. |
stop_token_ids | array | null | List of IDs for tokens that, when generated, stop further token generation. The returned output contains the stop tokens unless the stop tokens are special tokens. |
skip_special_tokens | boolean | null | Whether to skip special tokens in the output. |
The messages object has the following fields:

Key | Type | Value |
---|---|---|
content | string | The contents of the message. Content is required for all messages. |
role | string | The role of the message's author. One of system, user, or assistant. |
Example
Body
{
    "messages":
    [
        {
            "role": "system",
            "content": "You are a helpful assistant that translates English to Italian."
        },
        {
            "role": "user",
            "content": "Translate the following sentence from English to Italian: I love programming."
        }
    ],
    "temperature": 0.8,
    "max_tokens": 512
}
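Because the messages field carries the conversation history, a multi-turn exchange is built by appending each assistant reply and the next user message before calling the route again. The following Python sketch (placeholders for the host and token; not part of the official samples, and it reads the message key as shown in the example response later in this article) illustrates the pattern:

```python
import requests

URL = "https://<DEPLOYMENT_URI>/v1/chat/completions"
HEADERS = {"Authorization": "Bearer <TOKEN>", "Content-Type": "application/json"}

messages = [
    {"role": "system", "content": "You are a helpful assistant that translates English to Italian."},
    {"role": "user", "content": "Translate the following sentence from English to Italian: I love programming."},
]

# First turn
reply = requests.post(URL, headers=HEADERS, json={"messages": messages, "max_tokens": 512}, timeout=60)
reply.raise_for_status()
assistant_message = reply.json()["choices"][0]["message"]

# Append the assistant's answer and ask a follow-up, preserving the history
messages.append(assistant_message)
messages.append({"role": "user", "content": "Now translate the same sentence to French."})
followup = requests.post(URL, headers=HEADERS, json={"messages": messages, "max_tokens": 512}, timeout=60)
followup.raise_for_status()
print(followup.json()["choices"][0]["message"]["content"])
```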
Response schema
The response payload is a dictionary with the following fields.
Key | Type | Description |
---|---|---|
id | string | A unique identifier for the completion. |
choices | array | The list of completion choices the model generated for the input messages. |
created | integer | The Unix timestamp (in seconds) of when the completion was created. |
model | string | The model_id used for completion. |
object | string | The object type, which is always chat.completion. |
usage | object | Usage statistics for the completion request. |
Tip
In streaming mode, finish_reason is null for every chunk of the response except the last one, and the stream is terminated by a [DONE] payload. In each choices object, the messages key is replaced by delta.
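A compact Python sketch of accumulating a streamed chat reply follows, assuming the same data-only server-sent events format described above (host and token are placeholders; event framing can vary):

```python
import json
import requests

body = {
    "messages": [{"role": "user", "content": "What's the distance to the moon?"}],
    "max_tokens": 256,
    "stream": True,
}

answer = ""
with requests.post(
    "https://<DEPLOYMENT_URI>/v1/chat/completions",
    headers={"Authorization": "Bearer <TOKEN>", "Content-Type": "application/json"},
    json=body,
    stream=True,
    timeout=60,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line:
            continue
        data = line.decode("utf-8").removeprefix("data: ")
        if data == "[DONE]":
            break
        # In streaming mode each choices object carries a "delta" instead of a message.
        delta = json.loads(data)["choices"][0].get("delta", {})
        answer += delta.get("content", "") or ""
print(answer)
```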
The choices object is a dictionary with the following fields.

Key | Type | Description |
---|---|---|
index | integer | Choice index. When best_of > 1, the index in this array might not be in order and might not be 0 to n-1. |
messages or delta | string | Chat completion result in the messages object. When streaming mode is used, the delta key is used. |
finish_reason | string | The reason the model stopped generating tokens: - stop: the model hit a natural stop point or a provided stop sequence. - length: the maximum number of tokens was reached. - content_filter: when RAI moderates and CMP forces moderation. - content_filter_error: an error occurred during moderation and a decision couldn't be made on the response. - null: the API response is still in progress or incomplete. |
logprobs | object | The log probabilities of the generated tokens in the output text. |
The usage object is a dictionary with the following fields.

Key | Type | Value |
---|---|---|
prompt_tokens | integer | Number of tokens in the prompt. |
completion_tokens | integer | Number of tokens generated in the completion. |
total_tokens | integer | Total tokens. |
The logprobs object is a dictionary with the following fields:

Key | Type | Value |
---|---|---|
text_offsets | array of integers | The position or index of each token in the completion output. |
token_logprobs | array of float | Selected logprobs from the dictionary in the top_logprobs array. |
tokens | array of string | Selected tokens. |
top_logprobs | array of dictionary | Array of dictionaries. In each dictionary, the key is the token and the value is its probability. |
Example
The following is an example response:
{
"id": "12345678-1234-1234-1234-abcdefghijkl",
"object": "chat.completion",
"created": 2012359,
"model": "",
"choices": [
{
"index": 0,
"finish_reason": "stop",
"message": {
"role": "assistant",
"content": "Sure, I\'d be happy to help! The translation of ""I love programming"" from English to Italian is:\n\n""Amo la programmazione.""\n\nHere\'s a breakdown of the translation:\n\n* ""I love"" in English becomes ""Amo"" in Italian.\n* ""programming"" in English becomes ""la programmazione"" in Italian.\n\nI hope that helps! Let me know if you have any other sentences you\'d like me to translate."
}
}
],
"usage": {
"prompt_tokens": 10,
"total_tokens": 40,
"completion_tokens": 30
}
}
Deploy Meta Llama models to managed compute
Apart from deploying with the pay-as-you-go managed service, you can also deploy Meta Llama models to managed compute in AI Studio. When deployed to managed compute, you can select all the details about the infrastructure running the model, including the virtual machines to use and the number of instances to handle the load you're expecting. Models deployed to managed compute consume quota from your subscription. All the models in the Llama family can be deployed to managed compute.
Follow these steps to deploy a model such as Llama-2-7b-chat to a real-time endpoint in Azure AI Studio.
Choose the model you want to deploy from the Azure AI Studio model catalog.
Alternatively, you can initiate deployment by starting from your project in AI Studio. Select your project and then select Deployments > + Create.
On the model's Details page, select Deploy next to the View license button.
On the Deploy with Azure AI Content Safety (preview) page, select Skip Azure AI Content Safety so that you can continue to deploy the model using the UI.
Tip
In general, we recommend that you select Enable Azure AI Content Safety (Recommended) for deployment of the Llama model. This deployment option is currently only supported using the Python SDK and it happens in a notebook.
Select Proceed.
Select the project where you want to create a deployment.
Tip
If you don't have enough quota available in the selected project, you can use the option I want to use shared quota and I acknowledge that this endpoint will be deleted in 168 hours.
Select the Virtual machine and the Instance count that you want to assign to the deployment.
Select if you want to create this deployment as part of a new endpoint or an existing one. Endpoints can host multiple deployments while keeping resource configuration exclusive for each of them. Deployments under the same endpoint share the endpoint URI and its access keys.
Indicate if you want to enable Inferencing data collection (preview).
Select Deploy. After a few moments, the endpoint's Details page opens up.
Wait for the endpoint creation and deployment to finish. This step can take a few minutes.
Select the Consume tab of the deployment to obtain code samples that can be used to consume the deployed model in your application.
Consume Llama 2 models deployed to managed compute
For reference about how to invoke Llama models deployed to managed compute, see the model's card in the Azure AI Studio model catalog. Each model's card has an overview page that includes a description of the model, samples for code-based inferencing, fine-tuning, and model evaluation.
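The request format for a managed compute deployment comes from the model card's Consume tab. As a generic, hedged sketch only, the Azure Machine Learning Python SDK (azure-ai-ml) can invoke a managed online endpoint with a JSON request file; the names below are placeholders, and the contents of sample_request.json should follow the schema shown on the model card.

```python
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# Placeholders: use your own subscription, resource group, and project (workspace) names.
ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<SUBSCRIPTION_ID>",
    resource_group_name="<RESOURCE_GROUP>",
    workspace_name="<AI_STUDIO_PROJECT_NAME>",
)

# sample_request.json must match the schema shown on the model card's Consume tab.
result = ml_client.online_endpoints.invoke(
    endpoint_name="<ENDPOINT_NAME>",
    deployment_name="<DEPLOYMENT_NAME>",
    request_file="sample_request.json",
)
print(result)
```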
More inference examples
Package | Sample Notebook |
---|---|
OpenAI SDK (experimental) | openaisdk.ipynb |
LangChain | langchain.ipynb |
WebRequests | webrequests.ipynb |
LiteLLM SDK | litellm.ipynb |
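The notebooks above are the authoritative samples. As a rough sketch only, and assuming the serverless endpoint accepts OpenAI-style requests on its /v1 route (an assumption, since OpenAI SDK support is marked experimental), the OpenAI Python client can be pointed at the endpoint like this:

```python
from openai import OpenAI

# Assumption: the serverless endpoint accepts OpenAI-compatible requests on /v1.
# Replace the placeholders with the Target URL and Key from your deployment.
client = OpenAI(base_url="https://<DEPLOYMENT_URI>/v1", api_key="<KEY>")

response = client.chat.completions.create(
    model="<model-name>",  # check the openaisdk.ipynb sample for the expected value
    messages=[{"role": "user", "content": "What's the distance to the moon?"}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```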
Cost and quotas
Cost and quota considerations for Llama models deployed as a service
Llama models deployed as a service are offered by Meta through the Azure Marketplace and integrated with Azure AI Studio for use. You can find the Azure Marketplace pricing when deploying or fine-tuning the models.
Each time a project subscribes to a given offer from the Azure Marketplace, a new resource is created to track the costs associated with its consumption. The same resource is used to track costs associated with inference and fine-tuning; however, multiple meters are available to track each scenario independently.
For more information on how to track costs, see Monitor costs for models offered through Azure Marketplace.
Quota is managed per deployment. Each deployment has a rate limit of 200,000 tokens per minute and 1,000 API requests per minute. However, there's currently a limit of one deployment per model per project. Contact Microsoft Azure Support if the current rate limits aren't sufficient for your scenarios.
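If your client occasionally hits these limits, a simple client-side backoff can smooth things out. The following is a generic sketch, not an official pattern from this article; it retries on HTTP 429 responses and honors a Retry-After header when one is returned.

```python
import time
import requests

def post_with_backoff(url, headers, body, max_retries=5):
    """Retry a request when the endpoint answers 429 (rate limited)."""
    for attempt in range(max_retries):
        resp = requests.post(url, headers=headers, json=body, timeout=60)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()
        # Honor Retry-After when present; otherwise back off exponentially.
        wait = float(resp.headers.get("Retry-After", 2 ** attempt))
        time.sleep(wait)
    raise RuntimeError("Rate limited after retries; consider lowering request volume.")
```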
Cost and quota considerations for Llama models deployed as managed compute
For deployment and inferencing of Llama models with managed compute, you consume virtual machine (VM) core quota that is assigned to your subscription on a per-region basis. When you sign up for Azure AI Studio, you receive a default VM quota for several VM families available in the region. You can continue to create deployments until you reach your quota limit. Once you reach this limit, you can request a quota increase.
Content filtering
Models deployed as a serverless API with pay-as-you-go are protected by Azure AI Content Safety. When deployed to managed compute, you can opt out of this capability. With Azure AI Content Safety enabled, both the prompt and completion pass through an ensemble of classification models aimed at detecting and preventing the output of harmful content. The content filtering system detects and takes action on specific categories of potentially harmful content in both input prompts and output completions. Learn more about Azure AI Content Safety.
Next steps