Memory & CPU Utilization drastically different for AKS

Abdul Aziz 0 Reputation points
2024-06-13T15:21:56.5566667+00:00

I am planning to use Descheduler in my AKS deployment to balance memory consumption across the AKS nodes. The current output of kubectl top nodes is:

NAME                                CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
aks-nodepool1-53884836-vmss000000   198m         10%    12317Mi         97%
aks-nodepool1-53884836-vmss000001   189m         9%     12952Mi         102%      
aks-nodepool1-53884836-vmss000002   213m         11%    12747Mi         101%      
aks-nodepool1-53884836-vmss000003   135m         7%     5970Mi          47%    

However, when I tried different scenarios with the Descheduler, I got the following output:

I0612 13:51:55.145678       1 nodeutilization.go:204] "Node is underutilized" node="aks-nodepool1-53884836-vmss000003" usage={"cpu":"810m","memory":"476Mi","pods":"27"} usagePercentage={"cpu":42.63,"memory":3.78,"pods":10.8}
I0612 13:51:55.145712       1 nodeutilization.go:204] "Node is underutilized" node="aks-nodepool1-53884836-vmss000000" usage={"cpu":"582m","memory":"501Mi","pods":"54"} usagePercentage={"cpu":30.63,"memory":3.98,"pods":21.6}
I0612 13:51:55.145725       1 nodeutilization.go:204] "Node is underutilized" node="aks-nodepool1-53884836-vmss000001" usage={"cpu":"950m","memory":"596Mi","pods":"61"} usagePercentage={"cpu":50,"memory":4.74,"pods":24.4}
I0612 13:51:55.145743       1 nodeutilization.go:210] "Node is appropriately utilized" node="aks-nodepool1-53884836-vmss000002" usage={"cpu":"962m","memory":"647Mi","pods":"56"} usagePercentage={"cpu":50.63,"memory":5.14,"pods":22.4}

As you can see, the utilization seen by the Descheduler is drastically different from what top is reporting, especially memory, which the Descheduler shows at no more than 5% on any node while top reports 47% and above.

When I describe a node, I see the requests and limits of all the custom pods reported as 0:

  Namespace                   Name                                                               CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                                               ------------  ----------  ---------------  -------------  ---
  default                     alerts-667b7bc-88djq                                                  0 (0%)        0 (0%)      0 (0%)           0 (0%)         32d
  default                     alerts-ag-5544b98c45-xjnss                                            0 (0%)        0 (0%)      0 (0%)           0 (0%)         32d
  default                     api-6db9645d8b-p6jqm                                                  0 (0%)        0 (0%)      0 (0%)           0 (0%)         32d
  default                     authentication-766496cf6b-js77v                                       0 (0%)        0 (0%)      0 (0%)           0 (0%)         32d
  default                     authentication-ag-585fdf767nsp6                                       0 (0%)        0 (0%)      0 (0%)           0 (0%)         32d
  default                     authentication-validator-76b444457c-6x66x                             0 (0%)        0 (0%)      0 (0%)           0 (0%)         32d
  default                     authorization-checker-5789b576ff-wssl2                                0 (0%)        0 (0%)      0 (0%)           0 (0%)         32d
  default                     authorization-78f759f849-xmk2p                                        0 (0%)        0 (0%)      0 (0%)           0 (0%)         32d
  default                     backups-agent-68f47f764c-vpmlh                                        0 (0%)        0 (0%)      0 (0%)           0 (0%)         32d
  default                     backups-f7d6c765d-qsl8v                                               0 (0%)        0 (0%)      0 (0%)           0 (0%)         32d

....

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                582m (30%)  4647m (244%)
  memory             501Mi (3%)  3657Mi (29%)
  ephemeral-storage  0 (0%)      0 (0%)
  hugepages-1Gi      0 (0%)      0 (0%)
  hugepages-2Mi      0 (0%)      0 (0%)

The allocated resources from describe node seem to agree with what the Descheduler is seeing, given that every custom pod is shown at 0%. However, running top for one of those pods, e.g. kubectl top pods alerts-667b7bc-88djq, gives:

NAME                            CPU(cores)   MEMORY(bytes)   
alerts-667b7bc-88djq             2m           108Mi           

PodMetrics agrees with this; kubectl describe PodMetrics alerts-667b7bc-88djq gives:

API Version:  metrics.k8s.io/v1beta1
Containers:
  Name:  alerts
  Usage:
    Cpu:     1268872n
    Memory:  111272Ki
Kind:        PodMetrics

Any help understanding what is going on here would be appreciated. Why does describe node fail to register any resource utilization (with the Descheduler subsequently reporting the same), while top nodes presents a totally different picture?

Azure Kubernetes Service (AKS)

1 answer

  1. Sina Salam 6,341 Reputation points
    2024-06-13T20:34:41.1533333+00:00

    Hello Abdul Aziz,

    Welcome to the Microsoft Q&A and thank you for posting your questions here.

    Problem

    I understand that you are seeing a discrepancy between the memory usage reported by kubectl top nodes and the utilization figures used by the Descheduler in your AKS (Azure Kubernetes Service) deployment.

    Solution

    Having analyzed the output and information you provided, there are two causes:

    1. The discrepancy comes from what each tool measures: the Descheduler's LowNodeUtilization strategy computes node utilization from the pods' resource requests (the same figures shown under Allocated resources in kubectl describe node), while kubectl top nodes reports actual usage collected by the metrics server (see the policy sketch right after this list).
    2. Your custom pods have no resource requests and limits set (they all show 0 (0%)), so the Descheduler sees almost no requested memory on the nodes even though actual consumption is high.
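
    As an illustration of the first point, here is a minimal sketch of a LowNodeUtilization policy (descheduler/v1alpha2 API); the 20/50 threshold values are placeholders, not a recommendation for your cluster. The percentages the Descheduler logs and compares against these thresholds are computed from the sum of pod requests divided by node allocatable, not from live usage:

      apiVersion: "descheduler/v1alpha2"
      kind: "DeschedulerPolicy"
      profiles:
        - name: default
          pluginConfig:
            - name: "LowNodeUtilization"
              args:
                # a node below ALL of these request-based percentages counts as underutilized
                thresholds:
                  cpu: 20
                  memory: 20
                  pods: 20
                # a node above ANY of these is a candidate to have pods evicted from it
                targetThresholds:
                  cpu: 50
                  memory: 50
                  pods: 50
          plugins:
            balance:
              enabled:
                - "LowNodeUtilization"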

    For the issues listed above, you can do the following:

    STAGE 1

    • Ensure that the metrics server is correctly reporting resource usage:
      kubectl get apiservice v1beta1.metrics.k8s.io -o yaml
      kubectl logs -n kube-system <metrics-server-pod>

    • Compare the metrics reported by kubectl top nodes, kubectl top pods, and the Descheduler to identify where they diverge (a per-node breakdown of requests is sketched after these commands):
      kubectl top nodes
      kubectl top pods
      kubectl describe nodes
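
    To see exactly which numbers the Descheduler is summing for a node, you can list the requests of every pod scheduled on it. A sketch, using one of the node names from your output:

      # list CPU/memory requests of all pods on a specific node
      kubectl get pods --all-namespaces \
        --field-selector spec.nodeName=aks-nodepool1-53884836-vmss000000 \
        -o custom-columns='NAMESPACE:.metadata.namespace,POD:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory'

    Pods with no requests set will show <none> in the last two columns; those are the ones the Descheduler effectively counts as zero.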
    

    STAGE 2

    • Ensure that all pods have appropriate resource requests and limits set; this lets the Descheduler make accurate decisions. For example:
      apiVersion: v1
      kind: Pod
      metadata:
        name: example-pod
      spec:
        containers:
        - name: example-container
          image: nginx
          resources:
            requests:
              memory: "128Mi"
              cpu: "500m"
            limits:
              memory: "256Mi"
              cpu: "1000m"
    

    Then, you will apply the changes to the cluster:

      kubectl apply -f example-pod.yaml
    

    Remember: example-pod.yaml refers to the manifest shown above.
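
    For workloads that already run as Deployments (which the generated pod names in your list suggest), it is usually more practical to set requests and limits on the Deployment spec than on individual pods. A sketch, assuming the pod alerts-667b7bc-88djq belongs to a Deployment named alerts and using placeholder values you would tune for your workload:

      # hypothetical example: add requests/limits to an existing Deployment named "alerts"
      kubectl set resources deployment alerts \
        --requests=cpu=100m,memory=256Mi \
        --limits=cpu=500m,memory=512Mi

    The Deployment controller will roll the pods so the new requests take effect.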

    Finally

    After setting the resource requests and limits, monitor the metrics to confirm that the different views now line up:

       kubectl top nodes
       kubectl top pods --all-namespaces
       kubectl describe nodes
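
    Once requests are in place, the Allocated resources section of kubectl describe node (and therefore the Descheduler's percentages) should move much closer to what top reports, although they will never match exactly because requests are declarations rather than measurements. A quick check against one of your nodes, for example:

      # show only the Allocated resources summary for a single node
      kubectl describe node aks-nodepool1-53884836-vmss000000 | grep -A 9 "Allocated resources:"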
    

    Therefore, by making sure that the metrics server is functioning correctly, setting appropriate resource requests and limits, and monitoring the metrics consistently, you will be able to align the resource usage reported by kubectl top nodes, the Descheduler, and kubectl describe node. That alignment is what lets the Descheduler balance your nodes as intended.

    References

    For more detailed instructions and sources for the above solutions, use the following links:

    • Resource Requests and Limits (accessed 6/13/2024)

    • Kubernetes Metrics Server (accessed 6/13/2024)

    • Azure Kubernetes Service (AKS) Documentation (accessed 6/13/2024)

    • Kubernetes Descheduler (accessed 6/13/2024)

    • Setting Resource Requests and Limits (accessed 6/13/2024)

    • Monitoring and Troubleshooting Metrics Server (accessed 6/13/2024)


    I hope this is helpful! Do not hesitate to let me know if you have any other questions.

    ** Please don't forget to close up the thread here by upvoting and accept it as an answer if it is helpful ** so that others in the community facing similar issues can easily find the solution.

    Best Regards,

    Sina Salam
