Autoscaling issue with AKS attached as ML inference cluster

Suresh Bettadapur 71 Reputation points
2022-05-06T14:54:37.503+00:00

We have an AKS cluster defined inside a VNet. This AKS cluster is used as an inference cluster for Azure ML, and models are deployed to it from ML. For this reason, the AKS cluster is created with the "loadBalancer" outbound type, which creates a load balancer with a public IP [a requirement of ML].

Because the cluster is inside a VNet, the public IP is not routable, so to access the scoring endpoints deployed on AKS we created an NGINX Ingress controller with an internal IP.

Now everything is working fine, but pods (and nodes) aren't autoscaling. We have enabled the cluster autoscaler and, per MS advice, have not enabled HPA.

What could be the reason? Can you please advise?

Azure Machine Learning
An Azure machine learning service for building and deploying models.
Azure Kubernetes Service (AKS)
An Azure service that provides serverless Kubernetes, an integrated continuous integration and continuous delivery experience, and enterprise-grade security and governance.

3 answers

  1. romungi-MSFT 43,681 Reputation points Microsoft Employee
    2022-05-09T08:22:14.26+00:00

    @Suresh Bettadapur I think the number of nodes is not increased as part of autoscaling for an AKS deployment of an Azure ML model. Please refer to the note here in the documentation.
    The azureml-fe component scales the number of replicas for the model within the physical cluster boundaries. I think in your case the Azure ML router, azureml-fe, might not have started.
    Have you checked whether autoscaling works when you deploy to a new cluster created from the Azure ML portal, using the advanced network settings that let you select the virtual network, instead of attaching an existing AKS cluster?
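
Editor's sketch: since azureml-fe scales replicas only within the existing cluster, it can help to estimate how many replicas a given load would need and compare that with what the current nodes can hold. The arithmetic below follows the replica calculation as we understand it from the Azure ML documentation on AKS deployments; the function name `required_replicas` and the sample load figures are hypothetical, chosen only for illustration, and this is not a statement of how azureml-fe is implemented.

```python
from math import ceil


def required_replicas(target_rps, request_time_s, max_requests_per_container, target_utilization):
    """Estimate container replicas needed for a target load (illustrative helper)."""
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    if max_requests_per_container <= 0:
        raise ValueError("max_requests_per_container must be positive")
    # Requests in flight at once, inflated so that replicas run at the
    # target utilization rather than at 100%.
    concurrent_requests = target_rps * request_time_s / target_utilization
    return ceil(concurrent_requests / max_requests_per_container)


if __name__ == "__main__":
    # Assumed example load: 20 requests/s, 10 s per request,
    # 1 concurrent request per container, 70% target utilization.
    print(required_replicas(20, 10, 1, 0.7))
```

If the estimate is larger than the number of replicas the current nodes can schedule, replica scaling alone cannot satisfy it, which is consistent with the point above about physical cluster boundaries.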

  2. Suresh Bettadapur 71 Reputation points
    2022-05-09T08:59:44.887+00:00

    Thanks for the response

    I can see the azureml-fe service in the default namespace when I run the command below:

    kubectl get svc -A

    default azureml-fe LoadBalancer <Cluster-IP> <External-IP>

    Pod autoscaling is also not happening.

    The ML service and the AKS cluster are created using Terraform scripts. The existing AKS cluster is attached as the inference cluster.

  3. Dean Martin (ZA-DUR) 0 Reputation points
    2023-11-30T14:10:00.9566667+00:00

    Exactly the same issue for me. I even went as far as setting the threshold to 1%, and still no additional pods get created. Very frustrating implementation, and the documentation is inadequate.
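
Editor's sketch: to make the 1% threshold experiment concrete, the snippet below models the scaling rule as we read it in the Azure ML documentation, where utilization is the number of busy replicas divided by the total number of replicas and is compared against the target utilization. The function `scale_decision` and its return labels are hypothetical; the real azureml-fe logic may differ in details such as refresh intervals and replica limits.

```python
def scale_decision(busy_replicas, total_replicas, target_utilization):
    """Return 'scale_up', 'scale_down', or 'hold' under the assumed rule."""
    if total_replicas <= 0:
        raise ValueError("total_replicas must be positive")
    # Utilization: fraction of replicas currently processing a request.
    utilization = busy_replicas / total_replicas
    if utilization > target_utilization:
        return "scale_up"
    if utilization < target_utilization:
        return "scale_down"
    return "hold"


if __name__ == "__main__":
    # With a 1% target, a single busy replica already exceeds the threshold.
    print(scale_decision(busy_replicas=1, total_replicas=1, target_utilization=0.01))
    # With no busy replicas, utilization is 0 and nothing triggers a scale-up.
    print(scale_decision(busy_replicas=0, total_replicas=3, target_utilization=0.01))
```

Under this assumed rule, a 1% target should trigger a scale-up as soon as any replica is busy, so seeing no new pods at that setting suggests that the autoscaler is not observing the traffic or is not enabled for the deployment. That is an inference from the model above, not a confirmed diagnosis.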