Understanding AKS CrashLoopBackOff

Scott Mallory 1 Reputation point
2021-03-24T00:58:38.423+00:00

My team is running a Kubernetes cluster and I have been struggling to understand how to maximize the number of pods we can run per node. When I set the CPU and memory requests and limits so that we can have 2 pods per node, things run pretty well. When I update it to have 3 pods per node, pods start erroring (CrashLoopBackOff) about 10% of the time. With 4 pods per node, it's about 25% of the time. Unfortunately, I have not found any of the error messaging useful: the Events of a failing pod just say "Back-off restarting failed container." My assumption is that when I increase the pod count, the pods are reaching the maximum CPU available per node, but playing around with the numbers and limits is not working as I had hoped.

Is there any way to see an actual error report like "pod requests X cpu greater than available Y cpu" so that we can understand the actual issue? I have searched online and there are not many answers that explain what a CrashLoopBackOff actually means or how to debug the underlying error, so any advice or avenues to look for solutions are greatly appreciated.

We are using D4s_v3 nodes (4 vCPUs, 16 GiB memory).
I understand that some resources are reserved (https://learn.microsoft.com/en-us/azure/aks/concepts-clusters-workloads)
Is there a way to see why a pod failed and if it was due to cpu or memory shortage?

Azure Kubernetes Service (AKS)

1 answer

  1. vipullag-MSFT 23,856 Reputation points Microsoft Employee
    2021-03-24T11:47:32.013+00:00

    @Scott Mallory

    CrashLoopBackOff means a container in the pod has exited unexpectedly or with a non-zero exit code, and Kubernetes is backing off before restarting it again.

    There are a couple of ways to check this. I would recommend going through the links below and getting the logs for the pod using kubectl logs (see the example after the links).

    Debug Pods and ReplicationControllers

    Determine the Reason for Pod Failure
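
    As a minimal sketch (pod and namespace names are placeholders), pulling the logs of both the current and the previously crashed container instance usually reveals the actual error:

    ```shell
    # Logs from the container that is currently running (or just crashed)
    kubectl logs <pod-name> -n <namespace>

    # Logs from the previous, crashed instance of the container
    kubectl logs <pod-name> -n <namespace> --previous

    # Reason for the last termination (e.g. Error, OOMKilled)
    kubectl get pod <pod-name> -n <namespace> \
      -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
    ```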

    Use kubectl describe to get more data on the pod failure. If the issue is still not obvious, let a single pod run on a node and see how much of the node's resources it actually utilizes. There are other options as well, such as reading the kubelet logs. This is the baseline I would suggest to narrow down the issue.
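
    For example (names are placeholders), kubectl describe on the pod shows the last termination reason, such as OOMKilled when the container exceeds its memory limit, and kubectl describe on the node shows how much CPU and memory the scheduled pods have requested versus what is allocatable:

    ```shell
    # Events, restart count, and the reason for the last termination
    # (look for "Last State: Terminated" and its "Reason" / "Exit Code")
    kubectl describe pod <pod-name> -n <namespace>

    # Allocatable CPU/memory on the node and what the pods on it have requested
    kubectl describe node <node-name>

    # Current usage per pod (requires metrics-server, which AKS deploys by default)
    kubectl top pods -n <namespace>
    ```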

    A pod can also end up in CrashLoopBackOff if its container completes and the pod is configured to keep restarting (restartPolicy: Always), even with exit code 0. A good example is deploying a busybox image without any arguments: it starts, executes, finishes, and is then restarted over and over.
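
    A minimal way to reproduce this (the pod name is arbitrary):

    ```shell
    # busybox runs its default shell, exits immediately, and the default
    # restartPolicy (Always) keeps restarting it, so the pod ends up in
    # CrashLoopBackOff even though nothing is actually wrong
    kubectl run busybox-test --image=busybox

    # Giving the container something long-running to do avoids the loop
    # (delete the first pod before re-creating it with the same name)
    kubectl run busybox-test --image=busybox -- sleep 3600
    ```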

    (Screenshots attached: 81172-aks1.png, 81192-aks2.png)

    Hope this helps.

    Please 'Accept as answer' if the provided information is helpful, so that it can help others in the community looking for help on similar topics.
