Focus on... Azure Planned Maintenance!
"Azure periodically performs updates to improve the #reliability, #performance, and #security of the host infrastructure for virtual machines. All you need to get up to speed, in one post!"
During a recent customer workshop, as we explored and started to map out their cloud journey, we had a really good discussion about how Microsoft manages the underlying fabric of the Azure platform, compared with how the same work would be done in a typical on-premises environment. One of the advantages customers gain by moving to Azure is that they no longer need to manage, patch, or update the physical infrastructure, nor maintain the components (space, power, servers, storage, network, etc.) typically associated with a data centre environment. Within Azure this maintenance still needs to occur; however, it is managed by Microsoft.
Over the past year, a number of announcements have increased the level of transparency to customers about how these updates are performed, allowing our customers to better manage the availability of their core services and workloads. I wanted to take the opportunity to collate as many of the currently available resources as possible into this single 'Focus on...' post, so that anyone can quickly skill up on how to take advantage of these new capabilities, such as the Planned Maintenance and Scheduled Events features, within your own Azure deployments.
> Bookmark this short URL! https://aka.ms/focuson/apm > Last Updated: 23rd January 2018 (periodically updated as a reference / index to relevant resources)
>> Introducing... Azure Planned Maintenance!
Azure periodically performs updates to improve the reliability, performance, and security of the host infrastructure for virtual machines. These updates range from patching software components in the hosting environment (like operating system, hypervisor, and various agents deployed on the host), upgrading networking components, to hardware decommissioning. The majority of these updates are performed without any impact to the hosted virtual machines. However, there are cases where updates do have an impact:
- If the maintenance does not require a reboot, Azure uses in-place migration to pause the VM while the host is updated;
- If maintenance requires a reboot, you get a notice of when the maintenance is planned. In these cases, you'll also be given a time window where you can start the maintenance yourself, at a time that works for you.
There have been a number of improvements to the planned maintenance experience in Azure, including better visibility and control of maintenance events that impact virtual machine availability - this introductory video covers how to create alerts, discover which virtual machines are scheduled for maintenance, and proactively start the maintenance using the Azure portal, REST API, Azure PowerShell, or Azure CLI.
Video: Virtual Machine Planned Maintenance
During a communicated window, customers can choose to start maintenance on their virtual machines. If you do not utilize the window, the virtual machines will be rebooted automatically during a scheduled maintenance window (which is visible to you). Starting the maintenance will result in the VM being redeployed to an already-updated host. While doing so, the content of the local (temporary) drive will be lost.
Native cloud applications running in a cloud service, availability set, or virtual machine scale set are resilient to planned maintenance, since only a single update domain is impacted at any given time.
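To illustrate why only part of an application is offline at any one time, here is a minimal Python sketch that spreads a set of VMs across update domains and then walks through maintenance one domain at a time. The round-robin placement, the VM names, and the function names are assumptions made for illustration only; they are not Azure's actual placement logic.

```python
from collections import defaultdict


def assign_update_domains(vm_names, update_domain_count):
    """Spread VMs across update domains round-robin (an assumed placement)."""
    domains = defaultdict(list)
    for index, name in enumerate(vm_names):
        domains[index % update_domain_count].append(name)
    return dict(domains)


def maintenance_waves(domains):
    """Yield (update_domain, vms_offline), one update domain at a time."""
    for domain in sorted(domains):
        yield domain, domains[domain]


if __name__ == "__main__":
    # Hypothetical VM names; seven VMs over five update domains.
    vms = [f"web-{i}" for i in range(7)]
    layout = assign_update_domains(vms, update_domain_count=5)
    for domain, offline in maintenance_waves(layout):
        still_serving = len(vms) - len(offline)
        print(f"UD {domain}: offline={offline} still serving={still_serving}")
```

With this layout, no more than two of the seven VMs are offline during any single wave, which is the property that keeps a multi-instance application available through the maintenance.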
You may want to use proactive-redeploy in the following cases:
- Your application runs on a single virtual machine and you need to apply maintenance during off-hours;
- You need to coordinate the time of the maintenance as part of your SLA;
- You need more than 30 minutes between each VM restart even within an availability set;
- You wish to take down the entire application (multiple tiers, multiple update domains) in order to complete the maintenance faster.
Scheduled Events is one of the subservices under the Azure Metadata Service that surfaces information regarding upcoming events (for example, a reboot). Scheduled Events gives your application sufficient time to perform preventive tasks to minimize the effect of such events. Being part of the Azure Metadata Service, scheduled events are surfaced using a REST endpoint from within the VM. The information is available via a non-routable IP address, so it is not exposed outside the VM.
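As a rough sketch of what querying this endpoint from inside a VM might look like, the Python below builds the request and filters the response for events that would restart a given VM. The api-version value, the sample document, and the helper names are assumptions for illustration; check the Scheduled Events documentation listed later in this post for the values that apply to your deployment.

```python
import json
import urllib.request

# Non-routable address of the Azure Instance Metadata Service.
# The api-version below is an assumption; use the one given in the docs.
SCHEDULED_EVENTS_URL = (
    "http://169.254.169.254/metadata/scheduledevents?api-version=2017-08-01"
)

# Hypothetical response, shaped like the documented payload, for offline use.
SAMPLE_DOCUMENT = {
    "DocumentIncarnation": 1,
    "Events": [
        {"EventId": "hypothetical-id-1", "EventType": "Reboot",
         "ResourceType": "VirtualMachine", "Resources": ["web-0"],
         "EventStatus": "Scheduled"},
        {"EventId": "hypothetical-id-2", "EventType": "Freeze",
         "ResourceType": "VirtualMachine", "Resources": ["web-0"],
         "EventStatus": "Scheduled"},
    ],
}


def fetch_scheduled_events(url=SCHEDULED_EVENTS_URL, timeout=5):
    """Query the endpoint; only works from inside an Azure VM."""
    request = urllib.request.Request(url, headers={"Metadata": "true"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def disruptive_events(document, vm_name):
    """Return Reboot or Redeploy events that list vm_name as a resource."""
    return [
        event
        for event in document.get("Events", [])
        if event.get("EventType") in ("Reboot", "Redeploy")
        and vm_name in event.get("Resources", [])
    ]


if __name__ == "__main__":
    # Uses the sample so the sketch also runs outside Azure.
    for event in disruptive_events(SAMPLE_DOCUMENT, "web-0"):
        print(event["EventId"], event["EventType"], event["EventStatus"])
```

An application would typically poll `fetch_scheduled_events()` on a timer and, when `disruptive_events()` returns anything for its own VM, drain connections or checkpoint state before the event starts.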
Video: Using Azure Scheduled Events to Prepare for VM Maintenance
>> Documentation
For over 18 months now, docs.microsoft.com has been running as our new unified technical documentation experience; to learn more check out our blog post: /en-us/teamblog/introducing-docs-microsoft-com. For additional documentation on Microsoft products or services, please visit MSDN (https://msdn.microsoft.com/) or TechNet (https://technet.microsoft.com/).
Planned Maintenance Documentation /
There are a number of useful articles to be aware of, dependent on the operating system of your virtual machine, as there will be some specific differences in how you can query the metadata service for upcoming scheduled events.
For Windows Virtual Machines:
- Manage the availability of Windows virtual machines in Azure /en-gb/azure/virtual-machines/windows/manage-availability
- Planned maintenance for virtual machines in Azure /en-gb/azure/virtual-machines/windows/maintenance-and-updates
- Handling planned maintenance notifications for Windows virtual machines /en-us/azure/virtual-machines/windows/maintenance-notifications
- Azure Instance Metadata service /en-us/azure/virtual-machines/windows/instance-metadata-service
- Azure Metadata Service: Scheduled Events (Preview) for Windows VMs /en-us/azure/virtual-machines/windows/scheduled-events
For Linux Virtual Machines:
- Manage the availability of Linux virtual machines /en-gb/azure/virtual-machines/linux/manage-availability
- Planned maintenance for Linux virtual machines /en-gb/azure/virtual-machines/linux/maintenance-and-updates
- Handling planned maintenance notifications for Linux virtual machines /en-us/azure/virtual-machines/linux/maintenance-notifications
- Azure Instance Metadata service /en-us/azure/virtual-machines/linux/instance-metadata-service
- Azure Metadata Service: Scheduled Events (preview) for Linux VMs /en-us/azure/virtual-machines/linux/scheduled-events
An industry-wide, hardware-based security vulnerability was disclosed on 3rd January 2018. Additional guidance was published to docs.microsoft.com covering background and frequently asked questions related to the Azure Planned Maintenance activities:
- Guidance for mitigating speculative execution side-channel vulnerabilities in Azure (added: 23/01/2018) /en-us/azure/virtual-machines/windows/mitigate-se (Linux)
- Accelerated maintenance frequently asked questions (FAQs) (added: 23/01/2018) /en-us/azure/virtual-machines/windows/accelerated-maintenance (Linux)
Azure Architecture Center /en-us/azure/architecture/
The Azure Architecture Center is the official centre for guidance, blueprints, patterns, and best practices for building solutions with Microsoft Azure, curated by the Microsoft patterns & practices team. Specifically in the context of mitigating the potential impact of maintenance events, applications should look to take advantage of high availability options, such as availability sets and availability zones (in preview at the time of writing):
- Design for self-healing /en-us/azure/architecture/guide/design-principles/self-healing
"In a distributed system, failures happen. Hardware can fail. The network can have transient failures. Rarely, an entire service or region may experience a disruption, but even those must be planned for."
- Make all things redundant /en-us/azure/architecture/guide/design-principles/redundancy
"A resilient application routes around failure. Identify the critical paths in your application. Is there redundancy at each point in the path? If a subsystem fails, will the application fail over to something else?"
There are also a number of Cloud Design Patterns covering availability and resiliency which, where possible, should be architected into your application. Availability defines the proportion of time that the system is functional and working. It will be affected by system errors, infrastructure problems, malicious attacks, and system load. It is usually measured as a percentage of uptime. Cloud applications typically provide users with a service level agreement (SLA), which means that applications must be designed and implemented in a way that maximizes availability.
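Since availability is usually expressed as a percentage of uptime, it can help to translate that percentage into minutes. The short Python sketch below does the arithmetic; the percentages and the 30-day period are illustrative assumptions, not figures from any specific Azure SLA.

```python
def allowed_downtime_minutes(uptime_percent, period_days=30):
    """Minutes of downtime permitted in the period at the given uptime %."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100.0 - uptime_percent) / 100.0


def measured_availability(total_minutes, downtime_minutes):
    """Availability as a percentage of the total period."""
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


if __name__ == "__main__":
    # Illustrative targets only.
    for target in (99.0, 99.9, 99.95, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% uptime allows about {minutes:.1f} minutes per 30 days")
```

Seen this way, a single unplanned reboot of a lone VM can consume a large share of a tight downtime budget, which is why the redundancy patterns listed here matter.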
- Availability - Health Endpoint Monitoring /en-us/azure/architecture/patterns/health-endpoint-monitoring
"Implement functional checks in an application that external tools can access through exposed endpoints at regular intervals."
- Availability - Queue-Based Load Leveling /en-us/azure/architecture/patterns/queue-based-load-leveling
"Use a queue that acts as a buffer between a task and a service that it invokes in order to smooth intermittent heavy loads."
- Availability - Throttling /en-us/azure/architecture/patterns/throttling
"Control the consumption of resources used by an instance of an application, an individual tenant, or an entire service."
Resiliency is the ability of a system to gracefully handle and recover from failures. The nature of cloud hosting, where applications are often multi-tenant, use shared platform services, compete for resources and bandwidth, communicate over the Internet, and run on commodity hardware means there is an increased likelihood that both transient and more permanent faults will arise. Detecting failures, and recovering quickly and efficiently, is necessary to maintain resiliency.
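Transient faults of the kind described above are commonly handled by retrying the failed call after a short, growing delay. The Python below is a minimal sketch of that idea using exponential backoff with a little jitter; the `TransientError` class, the default delays, and the injectable `sleep` parameter are all assumptions made for illustration rather than part of any Azure SDK.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def retry(operation, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call operation(); on TransientError wait, then try again.

    Gives up and re-raises after `attempts` failed calls.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == attempts:
                raise
            # Delay doubles each time; jitter avoids synchronized retries.
            delay = base_delay * (2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 10))


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky():
        # Fails twice, then succeeds, to mimic a brief outage.
        state["calls"] += 1
        if state["calls"] < 3:
            raise TransientError("temporarily unavailable")
        return "ok"

    print(retry(flaky, base_delay=0.01))
```

Passing `sleep` as a parameter keeps the sketch easy to test without real waiting; in production code you would also cap the maximum delay and only retry operations that are safe to repeat.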
- Resiliency - Bulkhead /en-us/azure/architecture/patterns/bulkhead
"Isolate elements of an application into pools so that if one fails, the others will continue to function."
- Resiliency - Circuit Breaker /en-us/azure/architecture/patterns/circuit-breaker
"Handle faults that might take a variable amount of time to fix when connecting to a remote service or resource."
- Resiliency - Compensating Transaction /en-us/azure/architecture/patterns/compensating-transaction
"Undo the work performed by a series of steps, which together define an eventually consistent operation."
- Resiliency - Health Endpoint Monitoring /en-us/azure/architecture/patterns/health-endpoint-monitoring
"Implement functional checks in an application that external tools can access through exposed endpoints at regular intervals."
- Resiliency - Leader Election /en-us/azure/architecture/patterns/leader-election
"Coordinate the actions performed by a collection of collaborating task instances in a distributed application by electing one instance as the leader that assumes responsibility for managing the other instances."
- Resiliency - Queue-Based Load Leveling /en-us/azure/architecture/patterns/queue-based-load-leveling
"Use a queue that acts as a buffer between a task and a service that it invokes in order to smooth intermittent heavy loads."
- Resiliency - Retry /en-us/azure/architecture/patterns/retry
"Enable an application to handle anticipated, temporary failures when it tries to connect to a service or network resource by transparently retrying an operation that's previously failed."
- Resiliency - Scheduler Agent Supervisor /en-us/azure/architecture/patterns/scheduler-agent-supervisor
"Coordinate a set of actions across a distributed set of services and other remote resources."
>> Updates & Roadmap
As the Azure platform continues to evolve, be aware of these sites so you can subscribe to the latest updates and feature releases.
Azure Blog https://azure.microsoft.com/en-us/blog/
Hear from Azure experts and developers about the latest information, insights, announcements, and news in the Microsoft Azure blog.
- Reacting to maintenance events... before they happen (3rd May 2017)
https://azure.microsoft.com/en-us/blog/get-started-with-scheduled-events/
"What if you could learn about upcoming events which may impact the availability of your VM and plan accordingly? Well, with Azure Scheduled Events you can."
- A new Planned Maintenance experience for your virtual machines (25th September 2017)
https://azure.microsoft.com/en-us/blog/a-new-planned-maintenance-experience-for-your-virtual-machines/
"We’re excited to announce the availability of a new planned maintenance experience in Azure, providing you more control, better communication, and better visibility."
- Get notified when Azure service incidents impact your resources (27th November 2017)
https://azure.microsoft.com/en-gb/blog/get-notified-when-azure-service-incidents-impact-your-resources/
"When an Azure service incident affects you, we know that it is critical that you are equipped with all the information necessary to mitigate any potential impact. The goal for the Azure Service Health preview is to provide timely and personalized information when needed, but how can you be sure that you are made aware of these issues?"
- Securing Azure customers from CPU vulnerability (3rd January 2018) (added: 23/01/2018)
https://azure.microsoft.com/en-gb/blog/securing-azure-customers-from-cpu-vulnerability/
"An industry-wide, hardware-based security vulnerability was disclosed today. Keeping customers secure is always our top priority and we are taking active steps to ensure that no Azure customer is exposed to these vulnerabilities. At the time of this blog post, Microsoft has not received any information to indicate that these vulnerabilities have been used to attack Azure customers."
Azure Updates Blog https://azure.microsoft.com/en-us/updates/
In addition to the Azure Blog, further detail on all Azure updates is available on the Azure Updates Blog.
- Preview: Azure Service Health (10th July 2017)
https://azure.microsoft.com/en-us/updates/azure-service-health-preview/
"Azure Service Health Preview provides guidance and support when issues in Azure services affect you. It provides timely and personalized information about the impact of service issues and helps you prepare for upcoming planned maintenance."
Azure Roadmap https://azure.microsoft.com/en-us/roadmap/
As Azure continues to grow, you will want to stay informed. The product roadmap is the place to find out what’s new and what’s coming next. Let us know what you think by providing feedback and voting on items. You can also subscribe to notifications, so you’ll always be in the know.
- Azure Service Health - In Preview (10th July 2017)
https://azure.microsoft.com/en-gb/roadmap/azure-service-health/
- Azure Availability Zones - In Preview (22nd September 2017)
https://azure.microsoft.com/en-gb/roadmap/azure-availability-zones/
- In-Virtual Machine Scheduled Events - In Preview (23rd October 2017)
https://azure.microsoft.com/en-us/roadmap/in-vm-scheduled-events/
>> Podcasts
Listening to podcasts can be a great way to keep up to date, especially when you're out and about, for example in the car on the way to work. While much of the Channel 9 content is also available in audio format, there are a small number of podcasts that have touched on Planned Maintenance in the past.
Microsoft Cloud Show https://www.microsoftcloudshow.com/
Whether you are new to the cloud, old hat or just starting to consider what the cloud can do for you this podcast is the place to find all the latest and greatest news and information on what's going on in the cloud universe. Join long time Microsoft aficionados and SharePoint experts Andrew Connell and Chris Johnson as they dissect the noise and distil it down, read between the lines and offer expert opinion on what is really going on. Just the information … no marketing … no BS, just two dudes telling you how they see it.
- Episode 206 | Latest News from Azure, Office 365 Plus Epic Container News (1st August 2017)
https://www.microsoftcloudshow.com/podcast/Episodes/206-latest-news-from-azure-office-365-plus-epic-container-news
"In this 206th episode, AC and CJ cover some big news from Azure related to containers and the CNCF. They also look at other Azure, Office 365 and GCP cloud news."
>> Presentations
Throughout the year, Microsoft hosts a number of public events allowing both in-person and online attendance; common to all is on-demand access to recordings of most, if not all, of the sessions presented. These are often given by the engineering teams working closely on the Azure platform itself, or by experienced architects working deep in the field implementing Azure services to solve customers' business challenges.
Ignite 2017 - 25th to 29th September 2017 https://myignite.microsoft.com/videos/
Microsoft Ignite brings together the best of previously individual conferences (Microsoft Management Summit; Microsoft Exchange, SharePoint, Lync, Project, and TechEd conferences) into a single annual event, last held 25th to 29th September 2017. It showcases the company’s enterprise products and services while providing incredibly valuable IT training. It also provides plentiful opportunities for IT professionals to get together for collaboration and networking.
- The new planned maintenance experience in Microsoft Azure (Level 300; 4th October 2017)
https://myignite.microsoft.com/videos/57085
Slides: https://view.officeapps.live.com/op/embed.aspx?src=https%3A%2F%2F8gportalvhdsf9v440s15hrt.blob.core.windows.net%2Fignite2017%2Fsession-presentations%2FTHR3043.PPTX
"While we perform most of the hosting environment maintenance without any impact to virtual machines in Azure, there are rare cases where we end up impacting our customers and rebooting their VMs. In this session we describe the brand new experience around VM restarting maintenance operations and guide you on how to better prepare for the next wave of planned maintenance in Azure. Stop by to learn how to set alerts for planned maintenance, discover the scope and timeline relevant to your VMs, control the exact time of the maintenance, and proactively react from within the VM to any VM impacting events."
Tuesdays with Corey
https://channel9.msdn.com/Shows/Tuesdays-With-Corey
Corey Sanders answers your questions about Microsoft Azure - Virtual Machines, Web Sites, Mobile Services, Dev/Test etc. If you have a question, Corey will find the answer!
- Tuesdays with Corey: Redeploy Button and RHEL Support links (15th March 2016)
https://channel9.msdn.com/Shows/Tuesdays-With-Corey/Tuesdays-with-Corey-Redeploy-Button-and-RHEL-Support-links
"Corey Sanders, Director of Program Management on the Microsoft Azure Compute team talks about the ever expanding troubleshooting blade - now with a new "Redeploy Button" and RedHat Enterprise Linux support links."
- Tuesdays with Corey: The Magic of Scheduled Maintenance (13th June 2017)
https://channel9.msdn.com/Shows/Tuesdays-With-Corey/Tuesdays-with-Corey-The-Magic-of-Scheduled-Maintenance
"Corey Sanders, Director of Program Management on the Microsoft Azure Compute team recaps some of the cool technologies and announcements recently discussed at Microsoft Build. In this episode Corey talks with Ziv Rafalovich - a Senior PM on the Azure Compute team. Ziv shows off our new capabilities for in-VM awareness of scheduled maintenance and other cool capabilities announced at Build."
- Tuesdays with Corey - go TEST Scheduled Maintenance (19th September 2017)
https://channel9.msdn.com/Shows/Tuesdays-With-Corey/Tuesdays-with-Corey-go-TEST-Scheduled-Maintenance
"Corey Sanders, Director of Program Management on the Microsoft Azure Compute team sat down with Ziv Rafalovich - a Senior PM on the Azure Compute team. Ziv shows off our new capabilities around notification and scheduling maintenance on your own timeline."
Azure Friday
https://channel9.msdn.com/Shows/Azure-Friday
Join Scott Hanselman as he engages one-on-one with the engineers who build the services that power Microsoft Azure as they demo capabilities, answer Scott's questions, and share their insights. Follow us at: friday.azure.com.
- Using Azure Scheduled Events to Prepare for VM Maintenance (27th July 2017)
https://channel9.msdn.com/Shows/Azure-Friday/Using-Azure-Scheduled-Events-to-Prepare-for-VM-Maintenance
"Eric Radzikowski joins Scott Hanselman on Azure Friday to discuss how developers can increase application availability by using Azure Scheduled Events to prepare for virtual machine maintenance."
- Azure Service Health (15th September 2017)
https://channel9.msdn.com/Shows/Azure-Friday/Azure-Service-Health
"Dushyant Gill joins Scott Hanselman to talk about Azure Service Health. When issues in Azure services affect your business-critical resources, Azure Service Health notifies you and your teams, helps you understand the impacts of the issue, and keeps you updated as the issue is resolved. It also helps you prepare for planned maintenance and changes that could affect the availability of your resources."
- Virtual Machine Planned Maintenance (18th September 2017)
https://channel9.msdn.com/Shows/Azure-Friday/Virtual-Machine-Planned-Maintenance
"Ziv Rafalovich joins Scott Hanselman to talk about improvements to the planned maintenance experience in Azure, including better visibility and control of maintenance events that impact virtual machine availability. Learn how to create alerts, discover which virtual machines are scheduled for maintenance, and proactively start the maintenance using the Azure portal, REST API, Azure PowerShell, or Azure CLI."
Microsoft Azure on YouTube https://www.youtube.com/channel/UC0m-80FnNY2Qb7obvTL_2fA
Supporting videos and material are also posted independently to YouTube.
- Azure Service Health - Planned Maintenance (3rd November 2017)
https://www.youtube.com/watch?v=vgYqm-Y74y8
"Stay informed and prepare for maintenance activities in Azure to minimize their impact on business critical applications."
>> Code Samples
Various sample and introductory code snippets are available to help you take advantage of the Planned Maintenance and Scheduled Events functionality.
Azure Code Samples https://azure.microsoft.com/en-us/resources/samples/
Learn to interact with Azure services through code. A number of code samples are published via the Azure Code Samples library.
- Collecting Scheduled Events with Event Hub (3rd May 2017)
https://azure.microsoft.com/en-us/resources/samples/virtual-machines-python-scheduled-events-central-logging/
"The sample project demonstrates how to monitor upcoming events on multiple Virtual Machines by forwarding them to a single Event Hub."
All Azure Code samples are available via GitHub.
- GitHub - Azure Samples https://github.com/Azure-Samples/
"Microsoft Azure code samples and examples in .NET, Java, Python, Node.js, PHP and Ruby"
Additionally, the following code sample can be found directly on GitHub:
- Azure Metadata Service: Scheduled Events Samples https://github.com/Azure-Samples/virtual-machines-scheduled-events-discover-endpoint-for-non-vnet-vm
"Scheduled Events is one of the subservices under the Azure Metadata Service. It is responsible for surfacing information regarding upcoming events (for example, reboot) so your application can prepare for them and limit disruption."
Azure Quickstart Templates https://azure.microsoft.com/en-us/resources/templates/
Deploy Azure resources through the Azure Resource Manager with community contributed templates to get more done. Deploy, learn, fork and contribute back. With Resource Manager, you can create a template (in JSON format) that defines the infrastructure and configuration of your Azure solution. By using a template, you can repeatedly deploy your solution throughout its lifecycle and have confidence your resources are deployed in a consistent state.
- Create an Availability Set with 3 Fault Domains https://azure.microsoft.com/en-gb/resources/templates/101-availability-set-create-3fds-20uds/
"This template creates an Availability Set with 3 Fault Domains"
All Azure Quickstart Templates are available via GitHub.
- GitHub - Azure Quickstart Templates https://github.com/Azure/azure-quickstart-templates
>> Community
There are a large number of users of Azure out in the community, with many taking the time to document and share their experiences of using the Azure services. I've included a selection of individuals and articles here, but please let me know if you've found and can recommend other good resources.
Bob Rouderbush https://roudybob.blog/
As a Cloud Solution Architect here at Microsoft, Bob Rouderbush maintains a personal blog on roudybob.blog.
- Azure Planned Maintenance Experience (25th September 2017)
https://roudybob.blog/2017/09/25/azure-planned-maintenance-experience/
"The Azure Team recently announced the availability of the new Planned Maintenance experience."
Daniel Petri https://www.petri.com/
Launched by Daniel Petri in 1999, The Petri IT Knowledgebase has served as a leading content and community resource for IT professionals and system administrators for more than 15 years.
- Planned Maintenance For Azure Virtual Machines (30th October 2017)
https://www.petri.com/planned-maintenance-azure-virtual-machines
"Microsoft is adding a new feature that allows you to control the forced outages that occur to virtual machines when patches are delivered to Azure’s compute hosts."
CUGC - Citrix User Group Community https://www.mycugc.org/
For the users, by the users, CUGC are dedicated to helping members and businesses excel. Members are technology professionals interested in maximizing the value of Citrix and partner products.
- NetScaler HA on Microsoft Azure “Planned Maintenance” (19th October 2017)
https://www.mycugc.org/blog/netscaler-ha-on-microsoft-azure-planned-maintenance
"Citrix NetScaler High Availability on Microsoft Azure has never been an easy subject, especially after Microsoft supported multi IP/NICs on Azure Virtual Machines a couple of months ago. The debate still rages on today about how NetScaler HA should be configured, nevertheless, a recent announcement by Microsoft on a New Planned Maintenance Experience for Azure Virtual Machines could change all that. Let's discuss NetScaler HA options on Azure before diving into the New Planned Maintenance Experience 'Proactive-Redeploy.'"
Bert Wolters https://www.azureman.com/
The personal blog of Bert Wolters, MVP, currently working as a Technical Consultant at inspark in The Netherlands.
- Azure Redeploy Feature is Released (17th March 2016)
https://www.azureman.com/infrastructure/azure-redeploy-feature-is-released/
"Today, when troubleshooting a failed deployment through a template, I stumbled upon a feature that (in my opinion) deserves a little more love than it has had before."