Faster autoscaling for gpu nodes in AKS

David Liderman 1

Hello.
Given: an AKS cluster with a GPU node pool, configured for autoscaling with a minimum of 1 node and a maximum of 10.
Deployment: 1 pod, strategy: rolling update, node requirement: GPU.
What happens: when a new deployment rolls out, its pod stays in the Pending state for about 10 minutes.

kubectl get configmap -n kube-system cluster-autoscaler-status -o yaml shows:

apiVersion: v1
data:
  status: |+
    Cluster-autoscaler status at 2022-11-07 15:00:07.602507854 +0000 UTC:
    Cluster-wide:
      Health:      Healthy (ready=6 unready=0 notStarted=0 longNotStarted=0 registered=6 longUnregistered=0)
                   LastProbeTime:      2022-11-07 15:00:07.591651854 +0000 UTC m=+96513.429961395
                   LastTransitionTime: 2022-11-06 12:11:44.655026033 +0000 UTC m=+10.493335474
      ScaleUp:     InProgress (ready=6 registered=6)
                   LastProbeTime:      2022-11-07 15:00:07.591651854 +0000 UTC m=+96513.429961395
                   LastTransitionTime: 2022-11-07 14:53:43.908385909 +0000 UTC m=+96129.746695450
      ScaleDown:   NoCandidates (candidates=0)
                   LastProbeTime:      2022-11-07 15:00:07.591651854 +0000 UTC m=+96513.429961395
                   LastTransitionTime: 2022-11-06 12:11:44.655026033 +0000 UTC m=+10.493335474

    NodeGroups:
      Name:        aks-gpunp-14929123-vmss
      Health:      Healthy (ready=1 unready=0 notStarted=0 longNotStarted=0 registered=1 longUnregistered=0 cloudProviderTarget=2 (minSize=1, maxSize=3))
                   LastProbeTime:      2022-11-07 15:00:07.591651854 +0000 UTC m=+96513.429961395
                   LastTransitionTime: 2022-11-06 12:11:44.655026033 +0000 UTC m=+10.493335474
      ScaleUp:     InProgress (ready=1 cloudProviderTarget=2)
                   LastProbeTime:      2022-11-07 15:00:07.591651854 +0000 UTC m=+96513.429961395
                   LastTransitionTime: 2022-11-07 14:53:43.908385909 +0000 UTC m=+96129.746695450
      ScaleDown:   NoCandidates (candidates=0)
                   LastProbeTime:      2022-11-07 15:00:07.591651854 +0000 UTC m=+96513.429961395
                   LastTransitionTime: 2022-11-06 12:11:44.655026033 +0000 UTC m=+10.493335474

A new node becomes available only after 10-11 minutes.

How can this be improved?

Thanks.

shiva patpi 13,366 Reputation points Microsoft Employee Moderator

2022-11-07T19:18:50.757+00:00

Hello @David Liderman,
Can you try the AKS scale-down mode feature, setting it to Deallocate instead of Delete (which is the default)?
You can find more details here: https://learn.microsoft.com/en-us/azure/aks/scale-down-mode

Regards,
Shiva.
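
For readers who want to try the suggestion above, here is a minimal sketch of switching an existing node pool to the Deallocate scale-down mode with the Azure CLI. The resource group and cluster names are placeholders, and the pool name gpunp is only inferred from the node group name aks-gpunp-14929123-vmss in the question, so all three are assumptions to replace. The script prints the command by default and only runs it when DRY_RUN is set to an empty value.

```shell
#!/bin/sh
# Placeholder values (assumptions, not taken from the thread): replace with your own.
RESOURCE_GROUP="myResourceGroup"
CLUSTER_NAME="myAKSCluster"
# Assumed pool name, inferred from the node group "aks-gpunp-14929123-vmss".
NODEPOOL_NAME="gpunp"

# Dry run unless the caller explicitly sets DRY_RUN to an empty string.
DRY_RUN="${DRY_RUN-1}"

set_deallocate_mode() {
  # With DRY_RUN non-empty, "echo" is prepended and the command is only printed.
  ${DRY_RUN:+echo} az aks nodepool update \
    --resource-group "$RESOURCE_GROUP" \
    --cluster-name "$CLUSTER_NAME" \
    --name "$NODEPOOL_NAME" \
    --scale-down-mode Deallocate
}

set_deallocate_mode
```

To apply the change for real, run the script as DRY_RUN= sh script.sh (the file name is arbitrary). As far as the linked documentation describes, this mode only helps when the autoscaler restarts a previously deallocated node; a node that has never existed still has to be provisioned from scratch, and deallocated nodes may continue to incur disk storage costs. The linked page lists the restrictions on supported node pool types, which are worth checking first.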
