System pool in AKS is stuck in Updating state

Rune Gulbrandsen 30 Reputation points
2023-10-26T09:23:23.1966667+00:00

After yesterday's mishaps, described in https://learn.microsoft.com/en-us/answers/questions/1403535/how-to-fix-an-aks-cluster-that-is-in-state-rotatin, we discovered today that one of our clusters didn't recover well.

As far as I can see, we're experiencing two different issues:

  1. The system pool is for some reason stuck in Updating state (our user pool is running normally).
  2. The azureKeyvaultSecretsProvider is suddenly missing its identity.

It's not clear whether these two issues are connected to each other or to yesterday's issue, but both arose after it, and both affect us heavily.
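Both symptoms can be inspected from the Azure CLI. The following is a minimal sketch, assuming a logged-in `az` CLI; the function names and the resource group, cluster, and pool arguments are hypothetical placeholders, and the JMESPath path for the add-on identity is taken from the `azureKeyvaultSecretsProvider` add-on profile as it usually appears in `az aks show` output.

```python
import json
import subprocess


def build_check_commands(resource_group, cluster, pool):
    """Return az CLI argument lists for the two checks (nothing is executed here)."""
    pool_state = [
        "az", "aks", "nodepool", "show",
        "--resource-group", resource_group,
        "--cluster-name", cluster,
        "--name", pool,
        "--query", "provisioningState",
        "--output", "json",
    ]
    addon_identity = [
        "az", "aks", "show",
        "--resource-group", resource_group,
        "--name", cluster,
        "--query", "addonProfiles.azureKeyvaultSecretsProvider.identity",
        "--output", "json",
    ]
    return {"pool_state": pool_state, "addon_identity": addon_identity}


def run_checks(resource_group, cluster, pool):
    """Run both queries and return the parsed results; requires a logged-in az CLI."""
    results = {}
    for name, cmd in build_check_commands(resource_group, cluster, pool).items():
        completed = subprocess.run(cmd, capture_output=True, text=True, check=True)
        # az may print nothing when the queried field is absent; treat that as null.
        results[name] = json.loads(completed.stdout.strip() or "null")
    return results
```

With this sketch, a `pool_state` of `"Updating"` would correspond to issue 1 and an `addon_identity` of `None` to issue 2.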

Issue 1 prevents us from doing ANYTHING with the cluster configuration (except creating a new pool), so we decided to create a new system pool and then drain the old one. This was effectively blocked by issue 2: an NGINX instance running on our system nodes retrieves certificates from a key vault using the Azure Key Vault provider for Secrets Store CSI Driver, which depends on this setting.

This is the error we are getting from the secret store provider:

failed to process mount request" err="failed to get objectType:secret, objectName:certificate, objectVersion:: azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://<keyvault>.vault.azure.net/secrets/certificate/?api-version=2016-10-01: StatusCode=400 -- Original Error: adal: Refresh request failed. Status Code = '400'. Response body: {"error":"invalid_request","error_description":"Identity not found"} Endpoint http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&client_id=<clientid>&resource=https%3A%2F%2Fvault.azure.net
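The error shows the provider asking the instance metadata endpoint for a token on behalf of a specific client ID. The request can be reproduced by hand from a node (or a pod that can reach the endpoint) to confirm that the identity is no longer assigned. This is a minimal sketch, not the provider's own code; the function names are hypothetical, and the URL and parameters are copied from the error message above.

```python
import urllib.error
import urllib.parse
import urllib.request

# Endpoint and api-version as they appear in the error message.
IMDS_TOKEN_URL = "http://169.254.169.254/metadata/identity/oauth2/token"


def build_imds_token_request(client_id, resource="https://vault.azure.net"):
    """Build the same token request that appears in the error message."""
    query = urllib.parse.urlencode({
        "api-version": "2018-02-01",
        "client_id": client_id,
        "resource": resource,
    })
    # The metadata endpoint requires the "Metadata: true" header.
    return urllib.request.Request(f"{IMDS_TOKEN_URL}?{query}", headers={"Metadata": "true"})


def probe_identity(client_id, timeout=5):
    """Return (status, detail); only meaningful when run on an Azure VM or node."""
    try:
        with urllib.request.urlopen(build_imds_token_request(client_id), timeout=timeout) as resp:
            return resp.status, "token issued"
    except urllib.error.HTTPError as err:
        # A 400 with "Identity not found" would match the error reported here.
        return err.code, err.read().decode("utf-8", errors="replace")
```

Calling `probe_identity("<clientid>")` on a node in the affected pool should show whether the identity is attached to the underlying scale set.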

So the questions are:

  1. How can we get the system pool out of the stuck state?
  2. How can we get the identity back into the azureKeyvaultSecretsProvider?
Azure Kubernetes Service (AKS)

2 answers

  1. Rune Gulbrandsen 30 Reputation points
    2023-10-28T08:57:33.17+00:00

    I'll answer my own question for future reference.

    We didn't find out why the system pool was stuck in the Updating state, nor why the azureKeyvaultSecretsProvider:identity configuration was gone. To fix the issue, we did the following:

    1. Removed the affinity and toleration that put our NGINX ingress onto the system pool and redeployed it to the user pools.
    2. Created a new system pool.
    3. Cordoned and drained the old system pool nodes.
    4. Deleted the old system pool nodes.
    5. Restarted the AKS cluster.

    And then suddenly everything was back in order and we re-added the affinity and toleration to our NGINX ingress to move that back to the system pool.
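Steps 2 to 5 above can be sketched as a command plan. The answer does not list the exact commands that were used, so this is an illustration under stated assumptions: the pool is selected with the `agentpool` node label, "deleted the old system pool nodes" is taken to mean deleting the old node pool, and "restarted" is taken to mean a stop followed by a start. The function name and all arguments are hypothetical. Step 1 (removing the affinity and toleration) is a manifest change and is not covered.

```python
def build_recovery_plan(resource_group, cluster, old_pool, new_pool, node_count=3):
    """Return the az/kubectl commands for steps 2-5 as argument lists (nothing is run)."""
    # Assumption: AKS nodes carry an "agentpool=<pool name>" label.
    selector = f"agentpool={old_pool}"
    return [
        # Step 2: create a new system pool.
        ["az", "aks", "nodepool", "add",
         "--resource-group", resource_group, "--cluster-name", cluster,
         "--name", new_pool, "--mode", "System", "--node-count", str(node_count)],
        # Step 3: cordon, then drain, the old system pool nodes.
        ["kubectl", "cordon", "--selector", selector],
        ["kubectl", "drain", "--selector", selector,
         "--ignore-daemonsets", "--delete-emptydir-data"],
        # Step 4: delete the old pool (assumed meaning of "deleted the nodes").
        ["az", "aks", "nodepool", "delete",
         "--resource-group", resource_group, "--cluster-name", cluster,
         "--name", old_pool],
        # Step 5: restart the cluster (assumed to be stop + start).
        ["az", "aks", "stop", "--resource-group", resource_group, "--name", cluster],
        ["az", "aks", "start", "--resource-group", resource_group, "--name", cluster],
    ]


if __name__ == "__main__":
    # Print the plan for review instead of executing it.
    for cmd in build_recovery_plan("my-rg", "my-cluster", "oldsys", "newsys"):
        print(" ".join(cmd))
```

Printing the plan rather than executing it keeps the sketch safe to run anywhere; each command can then be reviewed and run manually in order.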


  2. KarishmaTiwari-MSFT 16,482 Reputation points Microsoft Employee
    2023-11-01T22:04:05.6966667+00:00

    @Rune Gulbrandsen

    I'm glad that you were able to resolve your issue and thank you for posting your solution so that others experiencing the same thing can easily reference this!

    Since the Microsoft Q&A community has a policy that "The question author cannot accept their own answer. They can only accept answers by others", I'll repost your solution in case you'd like to "Accept" the answer. Accepted answers show up at the top, resulting in improved discoverability for others.

    Issue: System pool in AKS is stuck in Updating state

    Solution: Customer shared:

    1. Removed the affinity and toleration that put our NGINX ingress into system pool and redeployed it to the user pools.
    2. Created a new system pool.
    3. Cordoned and drained the old system pool nodes.
    4. Deleted the old system pool nodes.
    5. Restarted AKS.

    And then suddenly everything was back in order, and we re-added the affinity and toleration to our NGINX ingress to move that back to the system pool.


    If your issue remains unresolved or you have further questions, please let us know in the comments how we can assist. We are here to help, and we greatly value your feedback.
