az aks command invoke fails intermittently (gateway error or failed to run command)

Maxim de Bie 10 Reputation points
2023-05-17T14:34:31.9166667+00:00

In our private AKS cluster, az aks command invoke frequently fails with one of the following errors:

"statusMessage": "{\"code\":\"KubernetesOperationError\",\"message\":\"Failed to run command in managed cluster due to kubernetes failure. details: exec command on init-command failed with exitCode 1, stdOut:, stdErr:\"}",

or

"statusMessage": "{\"error\":{\"code\":\"GatewayTimeout\",\"message\":\"The gateway did not receive a response from 'Microsoft.ContainerService' within the specified time period.\"}}"

It seems the pods in the aks-command namespace are stuck in Init:Error because the init-command container does not progress beyond:

copying default SA
wait for AAD token
Setting up watches.
Watches established.

It fails about 75% of the time, and it seems to fail more often for operations other than get or describe. Any ideas how to fix this?
It does not appear to be a resource issue, as the cluster is comfortably sized:


  Resource           Requests    Limits
  --------           --------    ------
  cpu                630m (8%)   1850m (23%)
  memory             960Mi (3%)  4036Mi (14%)
  ephemeral-storage  0 (0%)      0 (0%)
  hugepages-1Gi      0 (0%)      0 (0%)
  hugepages-2Mi      0 (0%)      0 (0%)

The problem occurs on both macOS (x86) and Windows machines using Azure CLI version 2.48.1.

Azure Kubernetes Service (AKS)

1 answer

  1. vipullag-MSFT 25,866 Reputation points
    2023-05-18T04:48:09.7566667+00:00

    Hello Maxim de Bie

    Welcome to Microsoft Q&A Platform, thanks for posting your query here.

    Based on the provided information, it seems that there might be an issue with the az aks command invoke command in your private AKS cluster. The error messages suggest that the command is failing due to a Kubernetes failure or a gateway timeout.

    One possible reason for this issue could be that the pods in the aks-command namespace are stuck in the Init:Error state. This could be due to an issue with the init-command container, which is responsible for setting up the environment for the command execution.

    To troubleshoot this issue, you can try the following steps:

    Check the logs of the init-command container in the aks-command namespace to see if there are any error messages that might indicate the cause of the issue.

    kubectl logs -n aks-command <pod-name> -c init-command
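
Because the command pods are short-lived, it can be easier to collect the init-command logs from every pod in the namespace in one pass. The following is a minimal sketch, assuming kubectl can reach the cluster; the function name dump_init_logs and the KUBECTL override variable are illustrative and not part of any tool.

```shell
# Hypothetical helper: print the init-command logs of every pod in a namespace.
# KUBECTL can be overridden (for example, with a wrapper or a test stub).
KUBECTL="${KUBECTL:-kubectl}"

dump_init_logs() {
  ns="${1:-aks-command}"
  # "-o name" prints one "pod/<name>" entry per line
  for pod in $("$KUBECTL" get pods -n "$ns" -o name); do
    echo "== $pod"
    # Show the last 50 log lines of the init container
    "$KUBECTL" logs -n "$ns" "$pod" -c init-command --tail=50 \
      || echo "(no logs available for $pod)"
  done
}
```

Calling dump_init_logs with no argument uses the aks-command namespace; pass a different namespace as the first argument if needed.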
    

    Check whether the Kubernetes API server or the gateway is causing the timeouts. The following command reports the status of control plane components (note that componentstatuses has been deprecated since Kubernetes 1.19, so its output may be incomplete):

    kubectl get componentstatuses
    

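    You can also check the kube-apiserver logs for error messages that might indicate the cause. On AKS the control plane is managed by Azure, so kube-apiserver does not run as a pod in the kube-system namespace; to read its logs, enable the kube-apiserver category in the cluster's diagnostic settings and query the configured destination (for example, a Log Analytics workspace).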
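
For the API server health check above, another option is the API server's own readiness endpoint. This is a small sketch, assuming kubectl can reach the cluster; the function name apiserver_ready and the KUBECTL override variable are illustrative.

```shell
# KUBECTL can be overridden (for example, with a wrapper or a test stub).
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical helper: query the API server's /readyz endpoint in verbose
# mode, which lists the individual readiness checks and their results.
apiserver_ready() {
  "$KUBECTL" get --raw='/readyz?verbose'
}
```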

    Check if there are any issues with the network connectivity between your development machine and the AKS cluster. You can try running the az aks command invoke command from a different network or machine to see if the issue persists.
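
Since the failure is intermittent, a single run from another network or machine proves little; comparing failure rates over several runs is more informative. The sketch below is a generic helper (failure_rate is a made-up name) that runs any command a number of times and counts non-zero exit codes. It assumes az exits non-zero when the invocation fails; if it does not in your CLI version, inspect the command output instead.

```shell
# Hypothetical helper: run a command N times and report how often it failed.
# Usage: failure_rate <runs> <command> [args...]
# Example (placeholders must be filled in):
#   failure_rate 10 az aks command invoke --resource-group <resource-group-name> \
#     --name <cluster-name> --command "kubectl get pods -n kube-system"
failure_rate() {
  runs="$1"
  shift
  failed=0
  i=0
  while [ "$i" -lt "$runs" ]; do
    # Discard output; only the exit status is counted
    "$@" >/dev/null 2>&1 || failed=$((failed + 1))
    i=$((i + 1))
  done
  echo "$failed/$runs failed"
}
```

Running the same call from each network or machine gives directly comparable numbers.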

    Try upgrading the AKS cluster to the latest version to see if the issue is resolved. You can use the following command to upgrade the cluster:

    az aks upgrade --resource-group <resource-group-name> --name <cluster-name> --kubernetes-version <version>
    

    Note that upgrading the cluster might cause some downtime, so make sure to plan accordingly.
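
Before upgrading, it helps to know the cluster's current version and which target versions are available. A minimal sketch using az aks show and az aks get-upgrades; the function names and the AZ override variable are illustrative.

```shell
# AZ can be overridden (for example, with a wrapper or a test stub).
AZ="${AZ:-az}"

# Print the cluster's current Kubernetes version.
# Usage: current_version <resource-group-name> <cluster-name>
current_version() {
  "$AZ" aks show --resource-group "$1" --name "$2" \
    --query kubernetesVersion --output tsv
}

# List the versions the cluster can be upgraded to.
# Usage: available_upgrades <resource-group-name> <cluster-name>
available_upgrades() {
  "$AZ" aks get-upgrades --resource-group "$1" --name "$2" --output table
}
```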

    Hope this helps.
    If the suggested response helped you resolve your issue, please 'Accept as answer', so that it can help others in the community looking for help on similar topics.
