Editja

Ixxerja permezz ta’


How to deploy Phi-3 family of small language models with Azure AI Studio

Important

Some of the features described in this article might only be available in preview. This preview is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

In this article, you learn about the Phi-3 family of small language models (SLMs). You also learn to use Azure AI Studio to deploy models from this family as serverless APIs with pay-as-you-go token-based billing.

The Phi-3 family of SLMs is a collection of instruction-tuned generative text models. Phi-3 models are the most capable and cost-effective small language models (SLMs) available, outperforming models of the same size and next size up across various language, reasoning, coding, and math benchmarks.

Phi-3 family of models

Phi-3 Mini is a 3.8B parameters, lightweight, state-of-the-art open model. Phi-3-Mini was trained with Phi-3 datasets that include both synthetic data and the filtered, publicly-available websites data, with a focus on high quality and reasoning-dense properties.

The model belongs to the Phi-3 model family, and the Mini version comes in two variants, 4K and 128K, which denote the context length (in tokens) that each model variant can support.

The model underwent a rigorous enhancement process, incorporating both supervised fine-tuning and direct preference optimization to ensure precise instruction adherence and robust safety measures. When assessed against benchmarks that test common sense, language understanding, math, code, long context and logical reasoning, Phi-3-Mini-4K-Instruct and Phi-3-Mini-128K-Instruct showcased a robust and state-of-the-art performance among models with less than 13 billion parameters.

Deploy Phi-3 models as serverless APIs

Certain models in the model catalog can be deployed as a serverless API with pay-as-you-go billing. This kind of deployment provides a way to consume models as an API without hosting them on your subscription, while keeping the enterprise security and compliance that organizations need. This deployment option doesn't require quota from your subscription.

Prerequisites

  • An Azure subscription with a valid payment method. Free or trial Azure subscriptions won't work. If you don't have an Azure subscription, create a paid Azure account to begin.

  • An Azure AI Studio hub. The serverless API model deployment offering for Phi-3 is only available with hubs created in these regions:

    • East US 2
    • Sweden Central

    For a list of regions that are available for each of the models supporting serverless API endpoint deployments, see Region availability for models in serverless API endpoints.

  • An Azure AI Studio project.

  • Azure role-based access controls (Azure RBAC) are used to grant access to operations in Azure AI Studio. To perform the steps in this article, your user account must be assigned the Azure AI Developer role on the resource group. For more information on permissions, see Role-based access control in Azure AI Studio.

Create a new deployment

To create a deployment:

  1. Sign in to Azure AI Studio.

  2. Select Model catalog from the left sidebar.

  3. Search for and select the model you want to deploy, for example Phi-3-mini-4k-Instruct, to open its Details page.

  4. Select Deploy.

  5. Choose the option Serverless API to open a serverless API deployment window for the model.

  6. Alternatively, you can initiate a deployment by starting from your project in AI Studio.

    1. From the left sidebar of your project, select Components > Deployments.
    2. Select + Create deployment.
    3. Search for and select Phi-3-mini-4k-Instruct to open the model's Details page.
    4. Select Confirm, and choose the option Serverless API to open a serverless API deployment window for the model.
  7. Select the project in which you want to deploy your model. To deploy the Phi-3 model, your project must belong to one of the regions listed in the prerequisites section.

  8. Select the Pricing and terms tab to learn about pricing for the selected model.

  9. Give the deployment a name. This name becomes part of the deployment API URL. This URL must be unique in each Azure region.

  10. Select Deploy. Wait until the deployment is ready and you're redirected to the Deployments page. This step requires that your account has the Azure AI Developer role permissions on the Resource Group, as listed in the prerequisites.

  11. Select Open in playground to start interacting with the model.

  12. Return to the Deployments page, select the deployment, and note the endpoint's Target URL and the Secret Key, which you can use to call the deployment and generate completions. For more information on using the APIs, see Reference: Chat Completions.

  13. You can always find the endpoint's details, URL, and access keys by navigating to your Project overview page. Then, from the left sidebar of your project, select Components > Deployments.

Consume Phi-3 models as a service

Models deployed as serverless APIs can be consumed using the chat API, depending on the type of model you deployed.

  1. From your Project overview page, go to the left sidebar and select Components > Deployments.

  2. Find and select the deployment you created.

  3. Copy the Target URL and the Key value.

  4. Make an API request using the /v1/chat/completions API using <target_url>/v1/chat/completions. For more information on using the APIs, see the Reference: Chat Completions.

Cost and quotas

Cost and quota considerations for Phi-3 models deployed as serverless APIs

You can find the pricing information on the Pricing and terms tab of the deployment wizard when deploying the model.

Quota is managed per deployment. Each deployment has a rate limit of 200,000 tokens per minute and 1,000 API requests per minute. However, we currently limit one deployment per model per project. Contact Microsoft Azure Support if the current rate limits aren't sufficient for your scenarios.