@Suresh Bettadapur I don't think the number of nodes is increased as part of autoscaling for an AKS deployment of an Azure ML model. Please refer to the note here in the documentation.
The azureml-fe component scales the number of replicas for the model within the physical cluster boundaries. I think in your case the Azure ML router, azureml-fe, might not have started.
Have you tried checking whether autoscaling works when you deploy to a new cluster created from the Azure ML portal, using the advanced network settings that let you select the virtual network, instead of attaching an existing AKS cluster?
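To illustrate what "scaling replicas within the cluster boundaries" means, here is a rough sketch of the utilization rule as I understand it from the Azure ML (SDK v1) AKS deployment documentation: utilization is the number of busy replicas divided by the total number of replicas, and more replicas are requested when it exceeds the target (70% by default). The function names below are hypothetical and are not part of any Azure API. This is only a model of the decision, not the azureml-fe implementation.

```python
from math import ceil


def required_replicas(target_rps, request_time_ms,
                      max_requests_per_container=1, target_utilization=0.7):
    """Estimate replicas needed for a given load (hypothetical helper).

    Follows the sizing formula from the Azure ML AKS deployment docs:
    concurrent requests = target RPS * request time in seconds.
    """
    concurrent_requests = target_rps * request_time_ms / 1000
    return ceil(concurrent_requests /
                (max_requests_per_container * target_utilization))


def should_scale_out(busy_replicas, total_replicas, target_utilization=0.7):
    """Return True when current utilization exceeds the target (hypothetical helper)."""
    utilization = busy_replicas / total_replicas
    return utilization > target_utilization


if __name__ == "__main__":
    # 100 requests/second at 500 ms each, one request per container
    print(required_replicas(100, 500))
    # One busy replica out of one, with a 1% target utilization
    print(should_scale_out(1, 1, target_utilization=0.01))
```

Note that this only covers replicas (pods). Adding nodes is a separate concern that this rule does not address.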
Autoscaling issue with AKS attached as ML inference cluster
We have an AKS cluster defined inside a VNet. This AKS cluster is used as an inference cluster for Azure ML, and models are deployed to AKS from ML. For this reason, the AKS cluster is created with the "loadBalancer" outbound type, which creates a load balancer with a public IP [a requirement of ML].
As this is inside a VNet, the public IP is not routable, so to access the scoring endpoints deployed on AKS we have created an NGINX Ingress controller with an internal IP.
Now everything is working fine, but pods (and nodes) aren't autoscaling. We have enabled the cluster autoscaler and, per MS advice, have not enabled HPA.
What could be the reason? Can you please advise?
Azure Machine Learning
Azure Kubernetes Service
3 answers
-
romungi-MSFT 49,096 Reputation points Microsoft Employee Moderator
2022-05-09T08:22:14.26+00:00
Suresh Bettadapur 101 Reputation points
2022-05-09T08:59:44.887+00:00 Thanks for the response
I can see the azureml-fe service in the default namespace when I run the command below:
kubectl get svc -A
default azureml-fe LoadBalancer <Cluster-IP> <External-IP>
Pod autoscaling is also not happening.
The ML service and the AKS cluster are created using Terraform scripts. The existing AKS cluster is attached as the inference cluster.
-
Anonymous
2023-11-30T14:10:00.9566667+00:00 Exactly the same issue for me. I even went as far as setting the threshold to 1%, and still no additional pods get created. Very frustrating implementation, and the documentation is inadequate.