Uptime per proposed HA-Solution

Question

Based on this article: https://learn.microsoft.com/en-us/azure/architecture/reference-architectures/dmz/nva-ha#changing-pip-udr I was wondering if it wouldn't make sense to improve the documentation with the following information:

I'm missing insights into uptime per solution. I tried looking it up and calculating it for myself and came to these numbers:

single VM: 99.9%
VM in availability zone: 99.99%
vnet, route table and public IP: 100% (really? could not find anything else)
load balancer: 99.99%
route server: 99.95%

So a setup like:
load balancer (lb) --> NVA (single) --> lb --> endpoint would be 99.99%*99.9%*99.99% or 99.88% uptime --> about 53.6 minutes downtime/month
lb --> NVA (av zone) --> lb --> endpoint would be 99.99%*99.99%*99.99% or 99.97% uptime --> 13.39 minutes downtime/month
lb --> NVA (av zone) --> route server --> endpoint would be 99.99%*99.99%*99.95% or 99.93% uptime --> 31.25 minutes downtime/month
NVA (av zone) with pip+udr changes --> 99.99% uptime --> 4.46 minutes downtime/month
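
The chain arithmetic above can be recomputed with a short Python sketch. It assumes the components fail independently and that a month has 31 days (44,640 minutes), which is the month length that reproduces the 4.46-minute figure for 99.99%. The function and variable names are made up for illustration, and the availability values are simply the ones listed in this question.

```python
from math import prod

# Assumption: 31-day month, chosen because it reproduces 4.46 min for 99.99%.
MINUTES_PER_MONTH = 31 * 24 * 60  # 44,640

def serial_availability(*components: float) -> float:
    """Availability of a chain where every component must be up.

    Assumes the components fail independently of each other.
    """
    return prod(components)

def downtime_minutes(availability: float, minutes: int = MINUTES_PER_MONTH) -> float:
    """Expected downtime per month for a given availability fraction."""
    return (1.0 - availability) * minutes

# Availability figures as listed in the question (not official composite SLAs).
LB, VM_SINGLE, VM_ZONE, ROUTE_SERVER = 0.9999, 0.999, 0.9999, 0.9995

setups = {
    "lb -> NVA (single) -> lb": (LB, VM_SINGLE, LB),
    "lb -> NVA (av zone) -> lb": (LB, VM_ZONE, LB),
    "lb -> NVA (av zone) -> route server": (LB, VM_ZONE, ROUTE_SERVER),
    "NVA (av zone) with pip+udr": (VM_ZONE,),
}

for name, parts in setups.items():
    a = serial_availability(*parts)
    print(f"{name}: {a:.2%} uptime, {downtime_minutes(a):.2f} min/month")
```

Small differences from hand-calculated values come from rounding the percentage before converting it to minutes.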

Now it becomes especially interesting, because the pip+udr solution may of course have the highest convergence time (which in turn depends heavily on whether we are talking about UDR convergence, which is quite fast, or PIP convergence, which takes longer). Some tests I did ended up at around 1 minute for a pip + udr failover (but this is not representative: too little data).

So if I now assume (just for simplicity) that lb convergence happens immediately, compare the two solutions with the highest availability (i.e., lb --> NVA (av zone) --> lb and NVA (av zone) with pip+udr), and ignore NVA av zone failures (as they may happen in both setups with equal likelihood), what remains are these numbers:
8.92 minutes downtime per month for the lb --> NVA --> lb setup (from the two load balancers)
1 minute of downtime times the number of expected failovers per month for the NVA pip+udr setup

So my conclusion would be that the NVA pip+udr setup provides the highest uptime (as long as we expect fewer than 9 failovers per month).
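
The break-even point can be sketched as follows. This is only an illustration under the same simplifications as above: instant lb convergence, independent failures, a 31-day month, and the roughly 1-minute failover time from the small test, which is not a representative measurement. The function name is hypothetical.

```python
# Assumption: 31-day month, consistent with the downtime figures in this post.
MINUTES_PER_MONTH = 31 * 24 * 60

def break_even_failovers(lb_availability: float, lb_count: int,
                         failover_minutes: float) -> float:
    """Failovers per month at which pip+udr downtime equals the downtime
    contributed by the load balancers in the lb --> NVA --> lb setup."""
    # Downtime of lb_count load balancers in series (all must be up).
    lb_downtime = (1.0 - lb_availability ** lb_count) * MINUTES_PER_MONTH
    return lb_downtime / failover_minutes

# Two load balancers at 99.99% each, about 1 minute per pip+udr failover.
n = break_even_failovers(0.9999, 2, 1.0)
print(f"break-even at about {n:.1f} failovers per month")
```

If the real failover time is longer than 1 minute, the break-even count drops proportionally.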

So my first question would be: Are those calculations correct or did I miss something?

My second question would be: wouldn't it make sense to add those numbers to the different solutions?

P.S.:

Original question/feedback posted in GitHub Issues: https://github.com/MicrosoftDocs/architecture-center/issues/4325 --> I still think it's an improvement request for the documentation, but I was told to post it here because it was not considered one.

Answer

Hello, @MENNEL Andreas !

Shouldn't the uptime SLA for my VM solution be higher?

This was one of the first things I noticed as well when I was looking at VM uptime SLAs. Doing the math yourself, you would end up with higher availability numbers than you would see in the virtual machines SLA (for example, two or more instances in the same Availability Set have an SLA of 99.95%, but two or more instances in different Availability Zones have an SLA of 99.99%).
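
As a rough illustration of that gap, here is a minimal Python sketch of the naive "do the math yourself" calculation for a redundant pair. It assumes fully independent failures and uses the 99.9% single-VM figure from the question. The function name is made up for this example, and the result is not an SLA.

```python
def parallel_availability(single: float, count: int) -> float:
    """Naive availability of `count` redundant instances where one is enough.

    Assumes failures are completely independent, which real deployments
    do not guarantee.
    """
    return 1.0 - (1.0 - single) ** count

naive_pair = parallel_availability(0.999, 2)  # two VMs at 99.9% each
print(f"naive pair:                 {naive_pair:.4%}")
print(f"SLA, availability zones:    {0.9999:.4%}")
print(f"SLA, same availability set: {0.9995:.4%}")
```

The naive figure comes out higher than both published SLA values, which is the discrepancy described above.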

So my first question would be: Are those calculations correct or did I miss something?

One of the key difficulties in determining uptime is the nature of downtime. There are things that would contribute to downtime for 2 VMs (or other resources) in one scenario but not in another. For example, a pair of VMs might be in different update domains. Individually, the VMs take a small hit on availability due to updates; however, as a pair they should theoretically never be down for updates at the same time. Compare those two VMs, which are never scheduled for update downtime at the same time, with two VMs where you are just rolling the dice that they won't update at the same time, based purely on chance, using the availability of a single VM.

Additionally, you have the possibility of Availability Set failures that wouldn't affect VMs spread out across Availability Zones, and countless other scenarios. All of these overlapping availability-impacting events occur with different scopes and different frequencies, which makes what seems like straightforward math much more complicated.
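
A toy model can make the two effects above concrete. The sketch below is only an illustration: the function name and every number in it are hypothetical and not taken from any Azure SLA. It compares a pair of VMs whose individual downtime is coordinated (never overlapping, as with update domains) against a pair whose downtime overlaps purely by chance, and then adds a shared fault that takes both down together.

```python
def pair_unavailability(p_individual: float, q_shared: float,
                        coordinated: bool) -> float:
    """Fraction of time both VMs of a pair are down at once.

    p_individual: fraction of time one VM is down on its own (e.g. updates)
    q_shared:     fraction of time a shared fault takes both VMs down
    coordinated:  True if individual downtimes never overlap
    """
    # Uncoordinated downtime overlaps by chance with probability p * p.
    overlap = 0.0 if coordinated else p_individual ** 2
    return q_shared + (1.0 - q_shared) * overlap

# Hypothetical inputs: 0.1% individual downtime, 0.01% shared-fault downtime.
for coordinated in (True, False):
    for q in (0.0, 0.0001):
        u = pair_unavailability(0.001, q, coordinated)
        print(f"coordinated={coordinated}, shared={q}: pair down {u:.6%}")
```

With these made-up inputs the shared fault dominates the result, which is one reason a simple product of individual availabilities does not match what an SLA can promise.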

For a rough estimation of uptime though, I would follow a similar approach and there are several blogs on the topic if you are interested:

My second question would be: wouldn't it make sense to add those numbers to the different solutions?

The second limitation in the documentation when it comes to mentioning uptime is that there are contractual and legally binding consequences when talking about uptime. You'll notice that all of the SLA information is on Azure legal pages and the documentation is bound to what is listed there. From an official standpoint, this is what all SLAs are based on:

Service Level Agreements (SLA) for Online Services

I hope this has been helpful! Your feedback is important so please take a moment to accept answers.

If you still have questions, please let us know what is needed in the comments so the question can be answered. Thank you for helping to improve Microsoft Q&A!
