After yesterday's mishaps, described in https://learn.microsoft.com/en-us/answers/questions/1403535/how-to-fix-an-aks-cluster-that-is-in-state-rotatin, we discovered today that one of our clusters didn't recover that well.
As far as I can see, we are experiencing two different issues:
- The system pool is for some reason stuck in Updating state (our user pool is running normally).
- The azureKeyvaultSecretsProvider is suddenly missing its identity.
Whether these two are connected to each other, or to yesterday's issue, is not clear, but both arose after yesterday's issues and both affect us heavily.
Issue 1 prevents us from doing ANYTHING with the cluster configuration (except creating a new pool), so we decided to create a new system pool and then drain the old one. This was effectively blocked by issue 2: we have an NGINX instance running on our system nodes that retrieves certificates from a key vault using the Azure Key Vault provider for Secrets Store, which depends on this setting.
This is the error we are getting from the secret store provider:
failed to process mount request" err="failed to get objectType:secret, objectName:certificate, objectVersion:: azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://<keyvault>.vault.azure.net/secrets/certificate/?api-version=2016-10-01: StatusCode=400 -- Original Error: adal: Refresh request failed. Status Code = '400'. Response body: {"error":"invalid_request","error_description":"Identity not found"} Endpoint http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&client_id=<clientid>&resource=https%3A%2F%2Fvault.azure.net
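To check whether the node itself can still obtain a token for that identity, independent of the secret store provider, the same metadata endpoint from the error can be queried directly. The following is only a rough sketch, assuming it runs on an affected node (or in a pod with access to 169.254.169.254); the function names are made up for illustration, and the client ID has to be filled in with the value from the error message.

```python
import json
import urllib.error
import urllib.parse
import urllib.request
from typing import Tuple

# Instance metadata endpoint, as it appears in the error message above.
IMDS_TOKEN_ENDPOINT = "http://169.254.169.254/metadata/identity/oauth2/token"


def build_token_url(client_id: str, resource: str = "https://vault.azure.net") -> str:
    """Build the same token request URL that the provider logs in its error."""
    query = urllib.parse.urlencode({
        "api-version": "2018-02-01",
        "client_id": client_id,
        "resource": resource,
    })
    return f"{IMDS_TOKEN_ENDPOINT}?{query}"


def check_identity(client_id: str, timeout: float = 5.0) -> Tuple[int, str]:
    """Request a token and return (HTTP status, short description).

    A 400 with "Identity not found" would suggest the identity is not
    assigned to the node's VM scale set.
    """
    request = urllib.request.Request(
        build_token_url(client_id),
        headers={"Metadata": "true"},  # the metadata service requires this header
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, "token issued"
    except urllib.error.HTTPError as err:
        body = err.read().decode("utf-8", errors="replace")
        try:
            parsed = json.loads(body)
        except json.JSONDecodeError:
            return err.code, body
        if isinstance(parsed, dict):
            return err.code, str(parsed.get("error_description", body))
        return err.code, body
```

Calling check_identity("<clientid>") on a system node and on a user node would show whether the identity is missing on both pools or only on the one stuck in Updating.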
So the questions are:
- How can we get the system pool out of the stuck state?
- How can we get the identity back into the azureKeyvaultSecretsProvider?