Keeping the Messenger Service Running—On a Massive Scale

A couple of weeks ago I posted an interesting article about the Hotmail architecture. This week a new article has been posted explaining how the Messenger cloud is kept up and running.

An extract:

In building the Messenger service we've always focused on these core principles:

  • Scale - making it easy to support more customers using a set pattern
  • Reliability - making the system redundant where needed
  • Efficiency - delivering the best service for the least cost

Back in the day

The Messenger service in its early days was built on a Unix-variant system. Each time we had to upgrade the Messenger servers, we had to bring the service down, install the upgrade, and then bring the service back up. Of course, we would do this at around midnight Pacific Standard Time. As Messenger became a global service, this became untenable, as it was right in the middle of the day for our customers in Europe and Asia. We also had our share of issues, and we learned as we went along.

I remember one upgrade, 6 or 7 years ago, when bringing the service back up caused issues for our customers for much longer than we would have liked. As you can imagine, this was not a good day for the team or for our customers. But we learned from our mistakes. At this critical juncture in the evolution of Messenger, we added a new core principle to our earlier list: what we call "no cloud down."

No cloud down

No cloud down basically means that the "cloud" servers (where information about your IM connections is stored) are never all down at the same time, so your service is never interrupted. To help us achieve this goal, first we moved all Messenger activity to Windows-based servers. We kept cascading failures from affecting the system as a whole by making various parts of the service redundant. And as with the Hotmail backend architecture, we made it easy to build more capacity by using "clusters" of servers that can be deployed in a single data center or across multiple data centers to service all the traffic. We also made the Messenger client more resilient to network-related issues.
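To make the redundancy idea a bit more concrete, here is a minimal sketch of how a client might fail over between redundant cluster endpoints instead of depending on a single server. This is not the actual Messenger implementation; the hostnames, the port, and the connect_with_failover helper are all hypothetical, and the article does not describe the real failover logic.

```python
import socket
import time

# Hypothetical cluster endpoints. 1863 was the historical MSN Messenger port,
# used here purely as a placeholder; the hosts are made up.
CLUSTER_ENDPOINTS = [
    ("cluster1.example.net", 1863),
    ("cluster2.example.net", 1863),
    ("cluster3.example.net", 1863),
]

def connect_with_failover(endpoints, attempts_per_endpoint=2, backoff_seconds=1.0):
    """Try each redundant endpoint in turn, backing off between retries,
    so one unreachable cluster does not interrupt the client."""
    for host, port in endpoints:
        for attempt in range(attempts_per_endpoint):
            try:
                # Connected to a healthy cluster; hand the socket back to the caller.
                return socket.create_connection((host, port), timeout=5)
            except OSError:
                # Network error on this endpoint: wait, then retry or move on.
                time.sleep(backoff_seconds * (attempt + 1))
    raise ConnectionError("all clusters unreachable")

if __name__ == "__main__":
    try:
        conn = connect_with_failover(CLUSTER_ENDPOINTS)
        print("connected to", conn.getpeername())
        conn.close()
    except ConnectionError as exc:
        print("failed:", exc)
```

The point of the sketch is simply that with more than one cluster able to serve a customer, any single cluster can be taken down for an upgrade while clients quietly reconnect elsewhere, which is the "no cloud down" property the post describes.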

Full article at https://windowsteamblog.com/blogs/windowslive/archive/2010/03/02/keeping-the-messenger-service-running-on-a-massive-scale.aspx