Windows Azure Service Disruption on February 29, 2012: Post-Mortem

On February 29th, Windows Azure experienced a service disruption that impacted the Service Management interface and some customers of the Windows Azure Compute service. The issues were triaged and resolved by the early morning of March 1st. At that time, Bill Laing, the Corporate VP of Server & Cloud at Microsoft, promised that a root cause analysis would be performed and shared within 10 days. On the evening of Friday, March 9th, Bill published the details of what happened on the Windows Azure Team Blog:

Summary of Windows Azure Service Disruption on Feb 29th, 2012

As both an evangelist for AND an impacted user of the Windows Azure platform, I found it a very interesting read, and I encourage you to take some time to read it. As initially reported on the 29th, the root cause was a software bug involving an incorrect time calculation for the leap year. A top-level summary from the post-mortem:

Windows Azure comprises many different services, including Compute, Storage, Networking and higher-level services like Service Bus and SQL Azure. This partial service outage impacted Windows Azure Compute and dependent services: Access Control Service (ACS), Windows Azure Service Bus, SQL Azure Portal, and Data Sync Services. It did not impact Windows Azure Storage or SQL Azure.

While the trigger for this incident was a specific software bug, Windows Azure consists of many components and there were other interactions with normal operations that complicated this disruption. There were two phases to this incident. The first phase was focused on the detection, response and fix of the initial software bug. The second phase was focused on the handful of clusters that were impacted due to unanticipated interactions with our normal servicing operations that were underway.
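The post-mortem does not reproduce the offending code, but the general class of leap-year calculation error is easy to illustrate. Below is a minimal, hypothetical Python sketch (my own illustration, not the actual Azure implementation) showing how naively adding one to the year of a date breaks on February 29, along with one defensive alternative:

```python
from datetime import date

def naive_one_year_later(start: date) -> date:
    # Naive calculation: add one to the year and keep month/day unchanged.
    # This raises ValueError on Feb 29, because Feb 29 of the next year
    # (e.g. Feb 29, 2013) does not exist.
    return start.replace(year=start.year + 1)

def safe_one_year_later(start: date) -> date:
    # One defensive alternative: fall back to Feb 28 when the same
    # month/day does not exist in the following year.
    try:
        return start.replace(year=start.year + 1)
    except ValueError:
        return start.replace(year=start.year + 1, day=28)

leap_day = date(2012, 2, 29)
print(safe_one_year_later(leap_day))         # 2013-02-28

try:
    naive_one_year_later(leap_day)
except ValueError as error:
    print("naive calculation failed:", error)  # day is out of range for month
```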

Of note in the report:

  • The Compute service disruption did not impact all hosted services in Windows Azure.
  • The Service Management API was disabled worldwide to help isolate the disruption and prevent user triggered actions from impacting more hosted services.
  • While new hosted services could not be deployed, most existing hosted services continued to run during the disruption.
  • “Due to the extraordinary nature of this event, we have decided to provide a 33% credit to all customers of Windows Azure Compute, Access Control, Service Bus and Caching for the entire affected billing month(s) for these services, regardless of whether their service was impacted.”

As noted in the post-mortem, there are several lessons the Windows Azure team has learned from this incident that will be used to prevent future disruptions of this kind. 

What can users of Windows Azure (and other cloud services) take away from this?  When Amazon Web Services experienced an extended outage in April 2011, I wrote that one of the keys to the cloud is to design for failure.  Many of the things I wrote and referenced in that post last year are just as relevant today.  An excerpt from that post:

It (the cloud) truly is a whole new hosting AND programming paradigm that requires thought into how you design your system from the get go. Yes, you can “migrate” things to the cloud and they may work (very well might I add!). But if you really want to take advantage of what the cloud has to offer, then you need to design for it.

What does that mean? Well, some of the promises of the cloud are scalability, high availability, and elasticity. But those things don’t come for free. How to achieve those things is beyond the scope of this blog post. I will say here that designing your system to achieve those things in the cloud is an emerging skill set which developers & IT pros would be smart to pick up on. One key skill is the ability to design your system for failure. This is critical for high availability in light of last week’s (AWS) outage.
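To make “design for failure” a bit more concrete, here is a minimal, hypothetical sketch (the function name and parameters are mine, not from either post) of one common tactic: retrying transient failures with exponential backoff and jitter, so a brief disruption in a dependent service does not immediately become a hard failure in your application:

```python
import random
import time

def call_with_retries(operation, max_attempts=5, base_delay=0.5):
    # Retry a flaky callable (e.g. a call to a cloud API during a partial
    # outage) with exponential backoff instead of failing on the first error.
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # Back off exponentially, with jitter so many clients retrying
            # at once do not hammer a recovering service in lockstep.
            delay = base_delay * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5)
            time.sleep(delay)

# Example usage (fetch_status_from_service is a hypothetical flaky call):
# result = call_with_retries(lambda: fetch_status_from_service())
```

Retries are only one small piece of designing for failure; the same mindset extends to spreading deployments across fault domains, queuing work so it can be replayed later, and degrading gracefully when a dependency is unavailable.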

I appreciate the transparency of the Windows Azure team for sharing the details of this disruption.  There is much to learn in this emerging business of the cloud.