Faster autoscaling for GPU nodes in AKS

David Liderman 1 Reputation point
2022-11-07T15:22:14.387+00:00

Hello.
Given: an AKS cluster with a GPU node pool, with autoscaling enabled (minimum 1 node, maximum 10).
Deployment: 1 pod, strategy: rolling update, node requirement: GPU.
What happens: when a new deployment rolls out, its pod stays in the Pending state for about 10 minutes.

kubectl get configmap -n kube-system cluster-autoscaler-status -o yaml shows:

apiVersion: v1
data:
  status: |+
    Cluster-autoscaler status at 2022-11-07 15:00:07.602507854 +0000 UTC:
    Cluster-wide:
      Health:      Healthy (ready=6 unready=0 notStarted=0 longNotStarted=0 registered=6 longUnregistered=0)
                   LastProbeTime:      2022-11-07 15:00:07.591651854 +0000 UTC m=+96513.429961395
                   LastTransitionTime: 2022-11-06 12:11:44.655026033 +0000 UTC m=+10.493335474
      ScaleUp:     InProgress (ready=6 registered=6)
                   LastProbeTime:      2022-11-07 15:00:07.591651854 +0000 UTC m=+96513.429961395
                   LastTransitionTime: 2022-11-07 14:53:43.908385909 +0000 UTC m=+96129.746695450
      ScaleDown:   NoCandidates (candidates=0)
                   LastProbeTime:      2022-11-07 15:00:07.591651854 +0000 UTC m=+96513.429961395
                   LastTransitionTime: 2022-11-06 12:11:44.655026033 +0000 UTC m=+10.493335474

    NodeGroups:
      Name:        aks-gpunp-14929123-vmss
      Health:      Healthy (ready=1 unready=0 notStarted=0 longNotStarted=0 registered=1 longUnregistered=0 cloudProviderTarget=2 (minSize=1, maxSize=3))
                   LastProbeTime:      2022-11-07 15:00:07.591651854 +0000 UTC m=+96513.429961395
                   LastTransitionTime: 2022-11-06 12:11:44.655026033 +0000 UTC m=+10.493335474
      ScaleUp:     InProgress (ready=1 cloudProviderTarget=2)
                   LastProbeTime:      2022-11-07 15:00:07.591651854 +0000 UTC m=+96513.429961395
                   LastTransitionTime: 2022-11-07 14:53:43.908385909 +0000 UTC m=+96129.746695450
      ScaleDown:   NoCandidates (candidates=0)
                   LastProbeTime:      2022-11-07 15:00:07.591651854 +0000 UTC m=+96513.429961395
                   LastTransitionTime: 2022-11-06 12:11:44.655026033 +0000 UTC m=+10.493335474
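
Reading the timestamps in this status: the node group's ScaleUp entered InProgress at 14:53:43 and the status itself was written at 15:00:07, so at the moment of this snapshot the scale-up had already been running for a little over six minutes. A minimal sketch of that arithmetic, assuming GNU date (the -d flag is not available in BSD/macOS date) and with the timestamps truncated to whole seconds:

```shell
# Timestamps copied from the status output, truncated to whole seconds.
scale_up_started="2022-11-07 14:53:43 UTC"
status_written="2022-11-07 15:00:07 UTC"

# Convert both to epoch seconds (GNU date) and subtract.
start_s=$(date -u -d "$scale_up_started" +%s)
end_s=$(date -u -d "$status_written" +%s)
elapsed=$((end_s - start_s))

echo "scale-up in progress for ${elapsed}s ($((elapsed / 60))m $((elapsed % 60))s)"
```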

A new node becomes available only after 10-11 minutes.

How can this be improved?

Thanks.

Azure Kubernetes Service

1 answer

  1. shiva patpi 13,366 Reputation points Microsoft Employee Moderator
    2022-11-07T19:19:03.06+00:00

    Hello @David Liderman ,
    Can you try setting the node pool's scale-down mode to Deallocate instead of Delete (which is the default option)?
    You can find more details here: https://learn.microsoft.com/en-us/azure/aks/scale-down-mode
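
    As a rough sketch of what that change could look like with the Azure CLI: the resource group, cluster, and node pool names below are placeholders (the node pool name gpunp is only a guess based on the VMSS name aks-gpunp-14929123-vmss in the question), so substitute your own. With Deallocate, scaled-down nodes are stopped rather than deleted, so a later scale-up can start an existing VM instead of provisioning a new one. Note that deallocated nodes still incur disk storage charges, and the speed-up should only apply once the pool has scaled down at least once.

    ```shell
    # Placeholder names; replace with your own values.
    RESOURCE_GROUP="myResourceGroup"
    CLUSTER_NAME="myAKSCluster"
    NODEPOOL_NAME="gpunp"   # assumed from the VMSS name in the question

    # Switch the existing node pool from the default Delete to Deallocate.
    az aks nodepool update \
      --resource-group "$RESOURCE_GROUP" \
      --cluster-name "$CLUSTER_NAME" \
      --name "$NODEPOOL_NAME" \
      --scale-down-mode Deallocate
    ```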

    Regards,
    Shiva.
