How to deploy a Serverless endpoint on a finetuned model?

Sebastian Buzdugan 40 Reputation points
2024-12-16T08:09:41.3233333+00:00

Hello,

I’ve fine-tuned a model (Phi 3.5 MoE) using Azure Machine Learning. I want to deploy it as a serverless endpoint.

  1. Is serverless deployment currently supported for fine-tuned models in Azure ML?
  2. If not, what type of compute system is recommended for deploying real-time endpoints?
  3. I’ve tried various CPU configurations, but deployments have failed. Is there a list of compatible compute options for endpoints that I can refer to?

Attached is a screenshot of my registered model details for reference.

Thank you!


Accepted answer
  1. santoshkc 13,680 Reputation points Microsoft External Staff
    2024-12-16T11:12:33.6866667+00:00

    Hi @Sebastian Buzdugan,

    Thank you for reaching out to Microsoft Q&A forum!

    Is serverless deployment currently supported for fine-tuned models in Azure ML?

Currently, serverless deployment is not supported for custom fine-tuned models in Azure ML. Serverless API deployment is available only for specific models in the model catalog, such as the Meta Llama models and Phi-3.5 models, which do not require a subscription to the model offering.

    If not, what type of compute system is recommended for deploying real-time endpoints?

    For fine-tuned models like Phi 3.5 MoE, you would need to use managed compute for deploying a real-time API endpoint. Managed compute is ideal for handling real-time predictions, as it provides flexibility for running custom models with various compute configurations.
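    As a rough sketch of what a managed online deployment looks like in CLI v2 YAML (the endpoint name, model reference, and instance type below are placeholders you would replace with your own registered model and a GPU SKU available in your quota/region; this is illustrative, not a tested configuration):

    ```yaml
    # --- endpoint.yml (managed online endpoint; name is a placeholder) ---
    $schema: https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.schema.json
    name: phi35-moe-endpoint
    auth_mode: key

    # --- deployment.yml (separate file; values are placeholders) ---
    # $schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
    # name: blue
    # endpoint_name: phi35-moe-endpoint
    # model: azureml:my-finetuned-phi35-moe:1    # your registered model reference
    # instance_type: Standard_NC24ads_A100_v4    # example GPU SKU; pick per quota/region
    # instance_count: 1
    ```

    You would then create them with `az ml online-endpoint create -f endpoint.yml` followed by `az ml online-deployment create -f deployment.yml --all-traffic`.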

    I’ve tried various CPU configurations, but deployments have failed. Is there a list of compatible compute options for endpoints that I can refer to?

    If your workload involves working with large datasets in memory (for example, in-memory analytics, databases, or applications that require large memory configurations), it's essential to choose VMs that provide a higher memory-to-CPU ratio. This will prevent performance bottlenecks that can occur if the application runs out of available memory during execution.

    For memory-heavy applications, Azure offers VMs that are optimized for large memory workloads, ensuring smooth performance even for the most demanding tasks, such as high-performance databases.
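    To illustrate the memory-to-vCPU point, here is a small self-contained Python sketch. Note the SKU table is a hand-picked illustration with published vCPU/memory figures, not an authoritative list of sizes supported for online endpoints:

    ```python
    # Illustrative only: a few example VM sizes with (vCPU, memory in GiB).
    # This is NOT the authoritative list of SKUs supported for endpoints.
    SKUS = {
        "Standard_DS3_v2": (4, 14),   # general purpose, 3.5 GiB/vCPU
        "Standard_E4s_v3": (4, 32),   # memory optimized, 8 GiB/vCPU
        "Standard_F8s_v2": (8, 16),   # compute optimized, 2 GiB/vCPU
        "Standard_E8s_v3": (8, 64),   # memory optimized, 8 GiB/vCPU
    }

    def memory_heavy(skus, min_ratio=4.0):
        """Return SKU names whose memory-to-vCPU ratio is at least min_ratio GiB per vCPU."""
        return sorted(
            name for name, (vcpu, mem_gib) in skus.items()
            if mem_gib / vcpu >= min_ratio
        )

    # The memory-optimized E-series sizes pass the 4 GiB/vCPU threshold.
    print(memory_heavy(SKUS))  # ['Standard_E4s_v3', 'Standard_E8s_v3']
    ```

    The same filtering idea applies when you scan Azure's published VM size tables for a SKU that fits an in-memory workload.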

    For more info, please refer to:

    Deploy models as serverless API endpoints.

    Models in serverless API endpoints.

    I hope this helps. Do let us know if you have any further queries.


    If this answers your query, do click Accept Answer and Yes for was this answer helpful.


1 additional answer

  1. Azar 26,820 Reputation points MVP
    2024-12-16T10:32:11.9033333+00:00

    Hi there Sebastian Buzdugan

    Thanks for using the Q&A platform.

    I believe that Azure Machine Learning doesn't currently support serverless deployment for fine-tuned models like Phi 3.5 MoE due to their resource needs. For real-time endpoints, I suggest GPU-based instances such as the Standard_NC or Standard_ND series for optimal performance, as CPU configurations may lack the required capacity.

    Supported VM Sizes for Endpoints

    If this helps, kindly accept the answer. Thanks much.

