Warnings sent to customers when Azure is about to be updated
I sometimes get customers asking me about the warnings they’ll get when updates are rolled out across Azure. Well – at the bottom of this post is an example email sent to me. Notice the emphasis placed on:
- Putting multiple VMs into availability sets
- Creating multiple instances of each role in Cloud Services
Igal Figlin from Microsoft did some background research into this and found that 40% of deployments are not in availability sets (it might be higher, I can’t remember exactly – you can watch the video here). Have a read of the email below and you’ll start to realise how much risk you are exposing yourself to if you don’t use multiple VMs in availability sets.
When you put VMs into availability sets they are also distributed across up to 5 update domains. When Microsoft updates Azure, they’ll walk from one update domain to the next. You can see what they are saying in this email – they’ll leave at least 30 minutes between updating each update domain. Let’s say you have 2 machines in an availability set. They’ll be spread across 2 fault domains and 2 update domains. That means if an infrastructure fault occurs (say a power or network segment failure), only one of your VMs will be affected. It also means if Microsoft has to do an update, it will take only one of your machines out of the configuration at a time.
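To make the spreading and the update walk concrete, here is a minimal sketch in Python. It assumes a simple round-robin placement (instance number modulo the domain count), which is my assumption for illustration and not a description of the real Azure allocator. The names `place` and `update_walk` are hypothetical, and the 30-minute gap is treated as the minimum interval and ignores how long each update actually takes.

```python
# Hypothetical round-robin placement; the real Azure allocator is not described here.
UPDATE_DOMAINS = 5  # "up to 5 update domains"
FAULT_DOMAINS = 2   # two fault domains, as in the two-VM example


def place(instance_count, update_domains=UPDATE_DOMAINS, fault_domains=FAULT_DOMAINS):
    """Return a list of (instance, fault_domain, update_domain) tuples."""
    return [(i, i % fault_domains, i % update_domains) for i in range(instance_count)]


def update_walk(placement, gap_minutes=30):
    """Yield (minutes_from_start, update_domain, instances_rebooted) in walk order.

    gap_minutes is the minimum interval between update domains; the time
    spent updating each domain is not modelled.
    """
    domains = sorted({ud for _, _, ud in placement})
    for step, ud in enumerate(domains):
        rebooted = [i for i, _, d in placement if d == ud]
        yield step * gap_minutes, ud, rebooted


if __name__ == "__main__":
    layout = place(2)
    print(layout)
    for minute, ud, rebooted in update_walk(layout):
        print(f"t+{minute} min: update domain {ud}, rebooting instances {rebooted}")
```

With two instances, each one ends up in its own fault domain and its own update domain, so the walk only ever reboots one of them at a time.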
If you want to be super-cautious, you could protect against the scenario where, while Microsoft is walking the update domains in your availability set, you also get an infrastructure failure – that could take out a further machine. The table below shows how that can happen with three instances.
| | Update Domain 0 | Update Domain 1 | Update Domain 2 |
| Fault Domain 0 | Instance 0 | | Instance 2 |
| Fault Domain 1 | | Instance 1 | |
Imagine the update process has finished the instance in Update Domain 0, has walked on to Update Domain 1 and is in the middle of updating that instance. Instance 1 is now offline. At the same time a power failure hits the rack in Fault Domain 0. That takes Instance 0 and Instance 2 offline as well. You’d now have an availability set with no running machines.
You can counter this by adding a fourth VM to the availability set. Because only one Update Domain in an availability set can ever be undergoing an update at a time, you are protected. Let’s say you are in the middle of updating one of the services yourself. Your update will be stalled, the Microsoft update will complete and then your update will continue. In other words, updates to update domains are applied one at a time, never in parallel. And if you are in the middle of updating one Update Domain, Microsoft won’t start simultaneously updating a different Update Domain. So the layout in the following table always leaves at least one instance running, even when an Update Domain operation and a Fault Domain failure coincide.
| | Update Domain 0 | Update Domain 1 | Update Domain 2 | Update Domain 3 |
| Fault Domain 0 | Instance 0 | | Instance 2 | |
| Fault Domain 1 | | Instance 1 | | Instance 3 |
The failure of any Fault Domain will take out 2 instances and a simultaneous update can take out only one Update Domain. This means a maximum of 3 instances can be offline because of simultaneous Update Domain/Fault Domain operations. That would leave you with one running instance.
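The worst case for both layouts can be checked by brute force. The sketch below is only an illustration: it takes the instance positions from the two tables above, tries every combination of one failed Fault Domain and one Update Domain being updated, and reports the smallest number of instances left running. The function name `worst_case_survivors` and the dictionary layout are my own, chosen for the example.

```python
from itertools import product


def worst_case_survivors(placement):
    """placement maps instance name -> (fault_domain, update_domain).

    Returns the minimum number of instances still running when one
    fault domain fails while one update domain is being updated.
    """
    fault_domains = {fd for fd, _ in placement.values()}
    update_domains = {ud for _, ud in placement.values()}
    worst = len(placement)
    for failed_fd, updating_ud in product(fault_domains, update_domains):
        survivors = [
            name
            for name, (fd, ud) in placement.items()
            if fd != failed_fd and ud != updating_ud
        ]
        worst = min(worst, len(survivors))
    return worst


# Positions copied from the three-instance and four-instance tables.
three = {"Instance 0": (0, 0), "Instance 1": (1, 1), "Instance 2": (0, 2)}
four = {**three, "Instance 3": (1, 3)}

print(worst_case_survivors(three))  # 0: the availability set can go fully offline
print(worst_case_survivors(four))   # 1: one instance always survives
```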
You’d have to be very unlucky for an infrastructure failure to occur while an update is going on. The availability SLA takes the above scenarios into consideration – you only have to have 2 instances in your availability set to enjoy the uptime guarantee. If you are unlucky enough to suffer a double problem and the availability drops below the guarantee then Microsoft compensates you.
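To get a feel for what the guarantee means in practice, here is a rough calculation using the 99.95 percent figure quoted in the email below. The 30-day month is my assumption for the arithmetic; the actual SLA terms define the measurement period, so treat this as a back-of-the-envelope sketch. The function name is hypothetical.

```python
def downtime_budget_minutes(sla_percent, days=30):
    """Minutes of downtime allowed in a period of `days` days at the given SLA.

    The 30-day default is an assumption, not taken from the SLA text.
    """
    total_minutes = days * 24 * 60
    return total_minutes * (100 - sla_percent) / 100


print(round(downtime_budget_minutes(99.95), 1))  # 21.6 minutes in a 30-day month
```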
I made a post about Update Domains and Fault Domains a couple of weeks ago. Interesting stuff if you’re going to take the Azure Infrastructure exam.
Anyway – here’s the email:
---- cut here -----
Upcoming maintenance will affect deployments of Azure Virtual Machines in availability sets and Cloud Services.
As part of our ongoing commitment to performance, reliability, and security, we sometimes perform maintenance operations in our Azure regions and datacenters. We want to notify you of upcoming maintenance operations that will impact Virtual Machines in an availability set and Cloud Services.
Note: Currently, we’re only able to provide 2 days' advance notice for updates that impact Virtual Machines in availability sets and Cloud Services. We’re working to provide more advance notice in the future.
The following are the planned start times for infrastructure-as-a-service (IaaS) and platform-as-a-service (PaaS) maintenance operations, provided in both Coordinated Universal Time (UTC) and United States Pacific Daylight Time (PDT). Impacted deployments are listed at the bottom of this email.
Microsoft Azure Virtual Machines (IaaS)
Maintenance operations are split between virtual machines (VMs) that are and are not in an availability set. This maintenance will impact VMs in an availability set. VM deployments referenced below will reboot during this maintenance operation, but temporary storage disk contents will be retained. We expect the update to finish within 48 hours of the start time.
Note: If you have a single VM in an availability set, it will still be impacted by this maintenance operation. In addition, all VMs in the same availability set are not taken down at the same time—these VMs are spread across five update domains. Only VMs in the same update domain for the availability set may be rebooted at the same time, and there will be at least a 30-minute interval between processing each update domain. VMs that are in different availability sets may be taken down at the same time.
For more information, please visit the availability sets documentation webpage. If you’re not already, we recommend using availability sets in your architecture to ensure higher availability of your service. You can read our multiple instances service level agreement (SLA) commitment for Virtual Machines. To learn more about our planned maintenance, please visit the Planned maintenance for Azure virtual machines documentation webpage. If you have questions, please visit the Azure Virtual Machines forums.
To ensure higher availability, the maintenance is scheduled in region pairs. To help determine whether the reboot you observed on your VM is due to a planned maintenance event, please visit the Viewing VM Reboot Logs blog post.
Microsoft Azure Cloud Services (PaaS)
All Cloud Services running web and/or worker roles referenced below will experience downtime during this maintenance. Cloud Services with two or more role instances in different upgrade domains will have external connectivity at least 99.95 percent of the time. Please note that the SLA guaranteeing service availability only applies to services that are deployed with more than one instance per role. Azure updates one upgrade domain at a time. For more information about distribution of roles across upgrade domains and the update process, please visit the Update an Azure Service webpage. If you have questions, please visit the Azure Cloud Services forums.
Please note that email addresses provided for any of the following account roles also received this communication: account and service administrators, and co-administrators.
Thank you,
Your Azure Team
Have fun – Planky == @plankytronixx
Comments
Anonymous
May 29, 2015
Why do the PDT and UTC times not match?
Anonymous
May 29, 2015
Are you getting mixed up with PST? UTC is 8 hours ahead of PST. So 08:00 + 8 hours = 16:00. But at the time the email was sent we were in PDT. UTC is 7 hours ahead of PDT. So 08:00 + 7 hours = 15:00. They do match.
Anonymous
May 31, 2015
Why has Microsoft decided to move this to a weekday update from their previous weekend updates? As well, not everything can be put in an availability set, especially since it is very difficult to set up things like SQL Availability Groups in Azure. And you can forget about other types of clusters. I don't quite see the logic.
Anonymous
June 01, 2015
Their maintenance strategy is pretty useless for a large number of use-cases. e.g. how can you build a VDI platform on Azure which requires persistent connections to Single VMs? Availability Sets do nothing to alleviate the issue. Why release the Citrix Netscaler on the Marketplace? The nature of the Netscaler requires persistent connections for user sessions. Azure maintenance windows cross over into business hours in many regions, not terribly useful. AWS is vastly better in this regard - they give you control of rebooting your VM for maintenance/updates.