(2) AKS clusters front-ended by Azure Traffic Manager to quickly and easily recover from a Kubernetes upgrade gone bad

Question

Hello,

I'm looking for feedback on a strategy of running (2) AKS clusters (in different regions) front-ended by Azure Traffic Manager, as a way to quickly and easily recover from a Kubernetes upgrade that causes unanticipated issues (even if it was tested without problems in a lower/test environment, for example). We saw an unanticipated issue in our lower/test AKS environment after an upgrade, so there's always the worry that it could happen in production, regardless of what we've tested in the lower/test environment.

The strategy would be:

Normal operations: Cluster-1 and Cluster-2 both get traffic from Traffic Manager.

Kubernetes upgrade day:

1. After upgrading the AKS version on Cluster-1, disable traffic to Cluster-2 with Traffic Manager.
2. Quickly test the application(s) running on Cluster-1 and run other tests on Cluster-1 health.
3. If there is a problem, enable traffic to Cluster-2 and disable traffic to Cluster-1.
4. Work to resolve the issues with Cluster-1. Enable it in Traffic Manager again when resolved.
5. Repeat the above process with Cluster-2 (Kubernetes upgrade).

I wanted to go with a solution like the above because it would address both Disaster Recovery (having clusters in 2 different regions) and allow for quick recovery for issues inadvertently caused by our own maintenance (Kubernetes upgrade, application deployment gone bad, other maintenance/reconfig task that we may do that affects application functionality). A big advantage of this would be that in the event of a problem, a single person could go into Traffic Manager and disable the cluster having issues vs. a more involved process (trying to immediately back out changes or debug problems while customers are being affected). The downside of the solution is that we'd have (2) separate clusters that we'd have to maintain independently for config changes/app-deployments/etc. Thank you in advance for any feedback.
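
As an illustration of the "single person disables the cluster" step described above, here is a minimal sketch that only builds the Azure CLI command for switching one Traffic Manager endpoint on or off. The resource group, profile, and endpoint names are hypothetical, and the `azureEndpoints` type is an assumption that each cluster's ingress is registered as an Azure endpoint; adjust it if your endpoints are external or nested.

```python
from typing import List


def endpoint_status_command(
    resource_group: str, profile: str, endpoint: str, enabled: bool
) -> List[str]:
    """Build (but do not run) an Azure CLI call that enables or disables one endpoint."""
    return [
        "az", "network", "traffic-manager", "endpoint", "update",
        "--resource-group", resource_group,
        "--profile-name", profile,
        "--name", endpoint,
        # Assumption: the cluster ingress is registered as an Azure endpoint.
        "--type", "azureEndpoints",
        "--endpoint-status", "Enabled" if enabled else "Disabled",
    ]


# Hypothetical names, for illustration only.
cmd = endpoint_status_command("rg-aks-prod", "tm-aks-prod", "cluster-1", enabled=False)
print(" ".join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` on a machine where the Azure CLI is installed and logged in. Keep in mind that Traffic Manager works at the DNS level, so clients may keep reaching a disabled endpoint until their cached DNS answer expires.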

Accepted Answer

Hello Max!

I understand you are looking for feedback on the scenario above. That said, my answer is an opinion; it doesn't mean it is the right or wrong way to go.

The strategy I would follow is to disable the traffic sent to Cluster-1 before upgrading it (basically send traffic only to Cluster-2) and then start upgrading Cluster-1. Test that everything is fine on Cluster-1 after the upgrade and, if it is, switch the traffic to Cluster-1, disable the traffic sent to Cluster-2, and perform the upgrade on Cluster-2. After you confirm everything is fine with Cluster-2, you can distribute the traffic to both clusters again.
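
The sequence above can be sketched as a small simulation. This is only an illustration of the ordering (drain, upgrade, test, re-enable), not a real Azure integration: `rolling_upgrade`, `upgrade`, and `healthy` are hypothetical names, and the two callables stand in for whatever you actually use to upgrade a cluster and verify it.

```python
from typing import Callable, Dict, List


def rolling_upgrade(
    clusters: List[str],
    upgrade: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> Dict[str, bool]:
    """Upgrade clusters one at a time, draining each before touching it.

    Returns the final traffic state: cluster name -> receiving traffic.
    """
    enabled = {name: True for name in clusters}
    for name in clusters:
        # Never drain a cluster unless another one is still serving traffic.
        if not any(on for other, on in enabled.items() if other != name):
            raise RuntimeError(f"no other cluster can take traffic for {name}")
        enabled[name] = False  # drain first, so users never hit a half-upgraded cluster
        upgrade(name)
        if not healthy(name):
            # Leave the failed cluster drained and stop; the rest stay untouched.
            break
        enabled[name] = True  # upgraded and verified: send traffic to it again
    return enabled


# Example run: the upgrade of cluster-2 is simulated as failing its checks.
upgraded: List[str] = []
state = rolling_upgrade(
    ["cluster-1", "cluster-2"],
    upgrade=upgraded.append,
    healthy=lambda name: name != "cluster-2",
)
print(state)  # {'cluster-1': True, 'cluster-2': False}
```

In this run Cluster-1 is upgraded and re-enabled, while Cluster-2 fails its checks and stays out of rotation, so all traffic keeps going to the cluster that is known to be good.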

Regarding your statement "We saw an unanticipated issue in our lower/test AKS environment after an upgrade, so there's always the worry that it could happen in production, regardless of what we've tested in the lower/test environment.": it would be important to understand what the issue was. If it was an issue that you could have anticipated, try to understand it and make sure you avoid it next time. I recommend taking a look at this link (https://learn.microsoft.com/en-us/troubleshoot/azure/azure-kubernetes/welcome-azure-kubernetes), because it covers the most common scenarios in which an upgrade can fail. If the issue was caused by AKS as a platform, it would be good to open a support case to report it and get an RCA, so that the AKS team is aware of it and can fix it.

Regarding the downside of having to maintain both clusters separately: it may not cover all your scenarios, but I would recommend taking Azure Kubernetes Fleet Manager into consideration. Please note this is in preview.

Reference links:
https://learn.microsoft.com/en-us/azure/kubernetes-fleet/overview
https://learn.microsoft.com/en-us/azure/kubernetes-fleet/quickstart-create-fleet-and-members

I hope this is helpful. If any clarification is needed, let me know and I will do my best to answer.

Please "Accept as Answer" and Upvote if it helped, so that it can help others in the community looking for help on similar topics.

Thank you!

