Azure online endpoint scaling takes a long time

Question

Tran Hong Thu (DPS.VI.DTS) 40

I have a project that runs an AI model with an online endpoint as the backend service. The endpoint is configured (set manually in the portal) to autoscale based on the number of requests.

I expected the endpoint to scale up within 1 to 2 minutes, like other services such as a virtual machine scale set.

But with the ML online endpoint, scaling takes much longer, about 12-18 minutes.

Do you have suggestions for speeding up the scaling time?

Tran Hong Thu (DPS.VI.DTS) 40 Reputation points

2025-02-13T09:57:55.4866667+00:00

Thanks @Saideep Anchuri for replying.

Because of requirements from our provider and the business, we cannot switch to another service such as AKS, a VM scale set, or a batch endpoint (the response needs to be near real time).

My source code and AI model are very simple. I previously deployed them on a VM scale set, where scaling took only 1-2 minutes, and on AKS, where it took about 5 minutes.

Do you have any configuration suggestions or tips and tricks to reduce the scaling time to the 5 minutes you mentioned?
Saideep Anchuri 9,500 Reputation points Moderator

2025-02-13T10:42:00.4066667+00:00

Hi Tran Hong Thu (DPS.VI.DTS)

You can adjust the Cool down (minutes) setting, which can be reduced to 1 minute and governs scale-out, and the Duration setting (minimum 5 minutes of metrics monitoring). You can also experiment with other metrics, such as CPU utilization and connections per second, to adjust your scale-out time. To select endpoint-based metrics, choose the custom autoscale option and then click "Add a rule".
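As a rough sketch of the same kind of rule outside the portal, the Azure CLI commands below create an autoscale setting and one scale-out rule with a 5-minute evaluation window and a 1-minute cooldown. The resource group, autoscale setting name, deployment resource ID, instance counts, and the 70% CPU threshold are illustrative assumptions, not values from this thread. The script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only. Names marked "hypothetical" are placeholders, not values from this thread.
set -eu

RESOURCE_GROUP="${RESOURCE_GROUP:-my-resource-group}"      # hypothetical
AUTOSCALE_NAME="${AUTOSCALE_NAME:-my-endpoint-autoscale}"  # hypothetical
DEPLOYMENT_ID="${DEPLOYMENT_ID:-DEPLOYMENT_RESOURCE_ID}"   # placeholder for the online deployment's ARM resource ID

# DRY_RUN=1 (the default) prints each command instead of running it.
# The printed line is for inspection only; shell quoting is not preserved.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Autoscale setting: bounds on the instance count of the deployment.
create_autoscale_setting() {
  run az monitor autoscale create \
    --resource-group "$RESOURCE_GROUP" \
    --name "$AUTOSCALE_NAME" \
    --resource "$DEPLOYMENT_ID" \
    --min-count 1 --max-count 5 --count 1
}

# Scale-out rule: add 1 instance when average CPU exceeds 70% over 5 minutes,
# then wait 1 minute (cooldown) before another scale action.
create_scale_out_rule() {
  run az monitor autoscale rule create \
    --resource-group "$RESOURCE_GROUP" \
    --autoscale-name "$AUTOSCALE_NAME" \
    --condition "CpuUtilizationPercentage > 70 avg 5m" \
    --scale out 1 \
    --cooldown 1
}

create_autoscale_setting
create_scale_out_rule
```

Note that these settings only shorten the time until a scale-out is triggered; they do not change how long a new instance takes to provision once it has been requested.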

Kindly refer to the screenshot below:

Thank You.
Tran Hong Thu (DPS.VI.DTS) 40 Reputation points

2025-02-14T06:37:45.7533333+00:00

Hi @Saideep Anchuri

My scaling configuration is based on the number of active messages in the Service Bus topic, as shown in the images below. Scaling up always takes about 15 minutes, and I don't think the rule configuration affects that time much.

As far as I know, when an endpoint is deployed successfully, a Docker image is built and each endpoint instance runs as a Docker container. Scaling up should therefore just mean starting another container, which shouldn't take that much time.
SriLakshmi C 6,250 Reputation points Microsoft External Staff Moderator

2025-02-14T10:12:48.3733333+00:00

Hi Tran Hong Thu (DPS.VI.DTS),

Endpoint scaling time can vary depending on the region. If you have selected a high-demand region like East US 2, East US, or Sweden Central, scaling may take longer. Additionally, network configurations, such as outbound rules in the Network Security Group (NSG) and whitelisting Azure URLs for WebSocket communication, can impact scaling performance. To address regional demand issues, consider testing with different SKU compute options.

Kindly refer to this Basic configuration.

Thank you!
Tran Hong Thu (DPS.VI.DTS) 40 Reputation points

2025-02-17T01:01:56.74+00:00

Hi SriLakshmi C, Saideep Anchuri, thanks for your help.

I'm deploying my services in the "North Central US" region, all in a private network, and we use the default NSG.

In any case, can you point me to the Microsoft documentation that details how online endpoint scaling works? I need to explain to my provider how scaling differs between the online endpoint and other services such as VM scale sets and AKS, and to answer their question of why it is so slow.

Thank you.
Manas Mohanty 6,040 Reputation points Microsoft External Staff Moderator

2025-02-17T03:37:40.17+00:00

Hi Tran Hong Thu (DPS.VI.DTS)

Sorry for the delay in response.

Below is documentation on Auto scaling endpoints.

Autoscale of online endpoints

Thank you
Tran Hong Thu (DPS.VI.DTS) 40 Reputation points

2025-02-17T03:49:22.5833333+00:00

Thanks, Manas Mohanty.

The document you provided is more of a how-to guide. It is not detailed enough to serve as a basis for answering my question or to help me understand the scaling behavior.

Accepted answer

Answer 1

Hi Tran Hong Thu (DPS.VI.DTS)

Welcome to Microsoft Q&A Forum, thank you for posting your query here!

Scaling times for Azure Machine Learning online endpoints can vary due to several factors, including the complexity of the model, the size of the resources being allocated, and the current load on the system. Scaling an online endpoint takes at least 5 minutes. For faster scaling, you can use a compute cluster for a batch endpoint, which allows for a scale-down time of less than a minute. Alternatively, you can use an AKS (Azure Kubernetes Service) cluster, which automatically adjusts based on incoming traffic, offering a more reactive scaling solution.

To increase nodes based on workload, enable the cluster autoscaler with the Azure CLI:

az aks update \
  --resource-group <yourResourceGroup> \
  --name <yourAKSCluster> \
  --enable-cluster-autoscaler \
  --min-count <minNodeCount> \
  --max-count <maxNodeCount>

Kindly refer to the links below:

kubernetes-online-endpoints

how-to-attach-kubernetes

machine-learning-reference
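For the AKS route, the cluster has to be attached to the Azure Machine Learning workspace as a Kubernetes compute target before an online endpoint can be deployed to it. A minimal sketch of that step follows, assuming the `ml` CLI extension is installed and the Azure Machine Learning extension is already deployed on the cluster. The resource group, workspace, compute name, namespace, and cluster resource ID are hypothetical placeholders. The script only prints the command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only. All names below are hypothetical placeholders, not values from this thread.
set -eu

RESOURCE_GROUP="${RESOURCE_GROUP:-my-resource-group}"  # hypothetical
WORKSPACE="${WORKSPACE:-my-workspace}"                 # hypothetical
COMPUTE_NAME="${COMPUTE_NAME:-k8s-compute}"            # hypothetical
NAMESPACE="${NAMESPACE:-azureml-workloads}"            # hypothetical Kubernetes namespace
CLUSTER_ID="${CLUSTER_ID:-AKS_CLUSTER_RESOURCE_ID}"    # placeholder for the AKS cluster's ARM resource ID

# DRY_RUN=1 (the default) prints the command instead of running it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Attach the AKS cluster to the workspace as a Kubernetes compute target.
attach_cluster() {
  run az ml compute attach \
    --resource-group "$RESOURCE_GROUP" \
    --workspace-name "$WORKSPACE" \
    --type Kubernetes \
    --name "$COMPUTE_NAME" \
    --resource-id "$CLUSTER_ID" \
    --identity-type SystemAssigned \
    --namespace "$NAMESPACE"
}

attach_cluster
```

Once attached, the compute name can be referenced as the target of a Kubernetes online endpoint, and node scaling is then handled by the cluster autoscaler enabled above.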

Thank You.
