AKS Arc telemetry pod consumes too much memory and CPU

Symptoms

The akshci-telemetry pod in an AKS Arc cluster can, over time, consume excessive CPU and memory. If metrics are enabled, you can check its CPU and memory usage with the following kubectl command:

kubectl -n kube-system top pod -l app=akshci-telemetry

You might see output similar to the following:

NAME                              CPU(cores)   MEMORY(bytes)
akshci-telemetry-5df56fd5-rjqk4   996m         152Mi

Mitigation

To mitigate this issue, set default resource limits for pods in the kube-system namespace by using a LimitRange.

Important notes

  • Verify whether any pods in the kube-system namespace require more memory than the default limit. If so, you might need to adjust the limits; the sketch after this list shows one way to review the current limits.
  • The LimitRange is applied to the namespace; in this case, the kube-system namespace. The default resource limits also apply to new pods that don't specify their own limits.
  • Existing pods, including those that already have resource limits, aren't affected.
  • New pods that don't specify their own resource limits are constrained by the limits set in the next section.
  • After you set the resource limits and delete the telemetry pod, the new pod might eventually hit the memory limit and generate OOM (Out-Of-Memory) errors. This is a temporary mitigation.
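
Before you apply the LimitRange, you can review which kube-system pods already declare limits. The following is a minimal sketch; it assumes a working kubeconfig for the cluster (for example, the one retrieved by the script in the next section). Containers that show <none> pick up the new defaults only when their pods are recreated:

# List each kube-system pod with its containers' CPU and memory limits
kubectl -n kube-system get pods -o custom-columns=NAME:.metadata.name,CPU_LIMIT:.spec.containers[*].resources.limits.cpu,MEM_LIMIT:.spec.containers[*].resources.limits.memory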

To set the resource limits, run the following script. Although the script uses az aksarc get-credentials, you can instead use az connectedk8s proxy to get a proxy kubeconfig and access the Kubernetes cluster, as sketched below.
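
A minimal sketch of the proxy alternative (the kubeconfig file name here is illustrative). Note that az connectedk8s proxy keeps running in the foreground, so run the later kubectl commands from a second terminal against this kubeconfig:

# Alternative: open a proxy channel to the Arc-connected cluster and write a proxy kubeconfig
az connectedk8s proxy -n $cluster_name -g $resource_group -f "./kubeconfig-proxy-$cluster_name"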

Define the LimitRange YAML to set default CPU and memory limits

# Set the $cluster_name and $resource_group of the aksarc cluster
$cluster_name = ""
$resource_group = ""

# Connect to the aksarc cluster
az aksarc get-credentials -n $cluster_name -g $resource_group --admin -f "./kubeconfig-$cluster_name"

$limitRangeYaml = @'
apiVersion: v1
kind: LimitRange
metadata:
  name: cpu-mem-resource-constraint
  namespace: kube-system
spec:
  limits:
  - default: # this section defines default limits for containers that haven't specified any limits
      cpu: 250m
      memory: 250Mi
    defaultRequest: # this section defines default requests for containers that haven't specified any requests
      cpu: 10m
      memory: 20Mi
    type: Container
'@

$limitRangeYaml | kubectl apply --kubeconfig "./kubeconfig-$cluster_name" -f -
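
# Optional: confirm that the LimitRange was created in the kube-system namespace
kubectl get limitrange cpu-mem-resource-constraint -n kube-system --kubeconfig "./kubeconfig-$cluster_name"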

# Confirm the current telemetry pod, then delete it so its controller recreates it with the new default limits
kubectl get pods -l app=akshci-telemetry -n kube-system --kubeconfig "./kubeconfig-$cluster_name"
kubectl delete pods -l app=akshci-telemetry -n kube-system --kubeconfig "./kubeconfig-$cluster_name"

# Wait briefly, then verify that the replacement pod is running
Start-Sleep -Seconds 5
kubectl get pods -l app=akshci-telemetry -n kube-system --kubeconfig "./kubeconfig-$cluster_name"

Validate that the resource limits were applied correctly

  1. Check the resource limits in the pod's YAML configuration:

    kubectl get pods -l app=akshci-telemetry -n kube-system --kubeconfig "./kubeconfig-$cluster_name" -o yaml
    
  2. In the output, verify that the resources section includes the limits:

    resources:
      limits:
        cpu: 250m
        memory: 250Mi
      requests:
        cpu: 10m
        memory: 20Mi
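
Because the mitigation caps the telemetry container's memory at 250Mi, the replacement pod might later be OOM-killed, as noted earlier. As a sketch, you can spot-check for OOM restarts with the same kubeconfig:

# Look for OOMKilled terminations and restart counts in the pod description
kubectl describe pods -l app=akshci-telemetry -n kube-system --kubeconfig "./kubeconfig-$cluster_name" | Select-String -Pattern "OOMKilled", "Restart Count"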
    

Next steps

Known issues in AKS enabled by Azure Arc