Azure ML Online Endpoint Deployment Failing for HuggingFace Models - ResourceOperationFailure

Sourour JEBALI 10 Reputation points
2025-02-10T12:06:11.6233333+00:00

Hi Azure Community,

I'm encountering persistent failures when trying to deploy HuggingFace models as online endpoints in Azure Machine Learning. The deployment fails during the provisioning stage with a ResourceOperationFailure.

Error message: ResourceOperationFailure: Internal error with reference to troubleshooting guide at https://aka.ms/oe-tsg#error-intern...

Steps I've tried:

  1. Deploying through both the Azure Portal and the SDK
  2. Using a standard compute SKU (Standard_F4s_v2)

What could be causing this internal ResourceOperationFailure? It happens with every HuggingFace model I try, and the logs show no error message.

(Screenshot attached: 2025-02-10 13-02-59)

1 answer

  1. Amira Bedhiafi 29,946 Reputation points
    2025-02-10T19:31:12.28+00:00

    I'm not an expert on this, but I suspect your issue is related to either your subscription or the SKU.

    Have you verified that your Azure subscription has sufficient quota for the resources you're trying to deploy? Sometimes the failure is caused by exceeding the quota allocated for a particular SKU.
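To make the quota check concrete, here is a rough Python sketch of the arithmetic. It assumes the rule from the Azure ML quota documentation that, for some VM SKUs, online deployments reserve an extra 20% of instances for upgrades; whether that reservation applies to Standard_F4s_v2 is an assumption you should confirm for your subscription. The function names and the example quota of 24 cores are hypothetical, and the only SKU fact used is that Standard_F4s_v2 has 4 vCPUs.

```python
def required_cores(instance_count, cores_per_instance, reserve_percent=20):
    """Cores needed for a deployment, including an upgrade reserve.

    reserve_percent=20 mirrors the documented reservation on some SKUs;
    pass 0 if your SKU is exempt (assumption: check the quota docs).
    """
    # Integer ceiling of instance_count * (1 + reserve_percent / 100).
    scaled = instance_count * (100 + reserve_percent)
    instances_with_reserve = -(-scaled // 100)
    return instances_with_reserve * cores_per_instance


def fits_quota(instance_count, cores_per_instance, available_cores,
               reserve_percent=20):
    """True if the available core quota covers the deployment."""
    needed = required_cores(instance_count, cores_per_instance, reserve_percent)
    return needed <= available_cores


if __name__ == "__main__":
    F4S_V2_CORES = 4  # Standard_F4s_v2 has 4 vCPUs
    # Hypothetical subscription with 24 cores of remaining quota.
    for count in (1, 5, 10):
        needed = required_cores(count, F4S_V2_CORES)
        print(count, "instance(s) need", needed, "cores; fits:",
              fits_quota(count, F4S_V2_CORES, available_cores=24))
```

If the computed requirement exceeds the remaining quota for the VM family, requesting a quota increase or lowering the instance count would be the first thing to try.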

    Try following this troubleshooting guide: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-troubleshoot-online-endpoints?view=azureml-api-2&tabs=cli#error-intern...
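Since the question mentions that no log error shows up, it may help to pull the logs from both deployment containers explicitly. The sketch below assumes the Azure ML Python SDK v2, where an authenticated `MLClient` exposes `online_deployments.get_logs(...)` with a `container_type` argument; the helper name `collect_deployment_logs` and the endpoint and deployment names are hypothetical. The client is passed in as a parameter, so the snippet itself has no imports.

```python
def collect_deployment_logs(ml_client, endpoint_name, deployment_name, lines=200):
    """Fetch recent logs from both containers of an online deployment.

    ml_client is assumed to be an azure.ai.ml.MLClient that is already
    authenticated against the workspace (an assumption, not shown here).
    Returns a dict mapping container type to its log text.
    """
    logs = {}
    # The storage-initializer container downloads the model and code;
    # the inference-server container runs the scoring server.
    for container in ("storage-initializer", "inference-server"):
        try:
            logs[container] = ml_client.online_deployments.get_logs(
                name=deployment_name,
                endpoint_name=endpoint_name,
                lines=lines,
                container_type=container,
            )
        except Exception as exc:  # keep going so one failure does not hide the other
            logs[container] = f"<could not fetch logs: {exc}>"
    return logs


# Example call (hypothetical names, requires a real MLClient):
# logs = collect_deployment_logs(ml_client, "my-hf-endpoint", "blue")
# for container, text in logs.items():
#     print(f"--- {container} ---\n{text}")
```

If both containers return nothing, the failure probably happened before any container started, which would point back at provisioning issues such as quota or SKU availability rather than the model code.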

    Also check that the Hugging Face model you're trying to deploy is compatible with the selected SKU (Standard_F4s_v2). Some models require more memory or compute than this SKU provides.
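As a rough way to sanity-check the memory side of this, the Python sketch below estimates the size of a model's weights from its parameter count and compares it against the SKU's memory. Standard_F4s_v2 provides 8 GiB of RAM; the 1.5x overhead factor for the runtime, tokenizer, and activations is purely an assumption, and the function names and example parameter counts are illustrative rather than taken from any specific model.

```python
# Bytes used per parameter for common weight dtypes.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def estimated_weight_gib(num_params, dtype="float32"):
    """Approximate size of the model weights in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


def likely_fits(num_params, sku_memory_gib, dtype="float32", overhead_factor=1.5):
    """Heuristic: weights times an assumed overhead must fit in SKU memory.

    overhead_factor is a guess, not a measured value; real usage depends
    on the serving stack, batch size, and sequence length.
    """
    return estimated_weight_gib(num_params, dtype) * overhead_factor <= sku_memory_gib


if __name__ == "__main__":
    F4S_V2_MEMORY_GIB = 8  # Standard_F4s_v2 has 8 GiB of RAM
    # Illustrative sizes: a ~110M-parameter model and a ~7B-parameter model.
    for label, params, dtype in (("~110M params", 110_000_000, "float32"),
                                 ("~7B params", 7_000_000_000, "float16")):
        size = estimated_weight_gib(params, dtype)
        print(f"{label} ({dtype}): {size:.2f} GiB of weights, likely fits: "
              f"{likely_fits(params, F4S_V2_MEMORY_GIB, dtype)}")
```

This is only a lower bound on what the container needs. Since the failure reportedly occurs with every HuggingFace model, including presumably small ones, model size alone may not explain it.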
