Introducing the Microsoft.com Engineering Operations Team
The Microsoft.com Engineering Operations (MSCOM Ops) team consists of systems engineers who design, architect, deploy, manage, and sustain highly available, scalable, and secure online infrastructures based on Internet Information Services (IIS), SQL Server, and other Microsoft technologies. The major properties we support are www.microsoft.com, Microsoft® Update, Microsoft Download Center, MSDN®, and TechNet.
The mission of the MSCOM Ops team is to achieve the highest availability on the Internet while showcasing Microsoft technologies. Another key deliverable is very early adoption of Microsoft products, which allows us to provide valuable feedback to product groups while sharing best practices with Microsoft customers. For example, we rolled Microsoft® Windows Server™ 2008 and Internet Information Services (IIS) version 7.0 into production in April 2006. This was more than ten months before the official launch date of IIS 7.0, and nearly two years before the official launch date of Windows Server 2008. During this dogfooding period we logged more than 50 bugs that helped the product team improve the final product.
MSCOM Ops supports servers in seven Internet data centers. We do not manage the physical data center environment, but we have administrative control and responsibility for the care and feeding of the server infrastructure that begins as soon as the servers are racked and stacked. This infrastructure consists of thousands of servers, databases, and Web applications that we must keep available for content providers from around the company. These content providers self-publish to these Internet-facing sites from around the world twenty-four hours a day.
The Microsoft corporate Web site, www.microsoft.com, is one of the largest and most heavily visited sites on the Internet. The site averages more than 56 million unique visits in the U.S. and 250 million unique visits worldwide each month. This traffic generates 70 million page views daily, averages 15,000 connection requests per second, and maintains an average of 35,000 concurrent connections across a total of 80 Web servers.
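To put these figures in perspective, the following minimal sketch converts them into site-wide and per-server averages. It assumes that load is spread evenly across the 80 Web servers, which is a simplification made only for illustration; the function and key names are hypothetical.

```python
# Traffic figures quoted in the text for www.microsoft.com.
PAGE_VIEWS_PER_DAY = 70_000_000
CONNECTION_REQUESTS_PER_SECOND = 15_000
CONCURRENT_CONNECTIONS = 35_000
WEB_SERVERS = 80

SECONDS_PER_DAY = 24 * 60 * 60


def average_load():
    """Return average load figures.

    Assumption (not from the text): traffic is distributed evenly
    across all Web servers and evenly over the day.
    """
    return {
        "site_page_views_per_second": PAGE_VIEWS_PER_DAY / SECONDS_PER_DAY,
        "requests_per_second_per_server": CONNECTION_REQUESTS_PER_SECOND / WEB_SERVERS,
        "concurrent_connections_per_server": CONCURRENT_CONNECTIONS / WEB_SERVERS,
    }


if __name__ == "__main__":
    for name, value in average_load().items():
        print(f"{name}: {value:,.1f}")
```

Under the even-distribution assumption, this works out to roughly 810 page views per second for the site as a whole, and about 188 connection requests per second and 438 concurrent connections per server. Real traffic is bursty, so peak values would be higher than these averages.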
Microsoft.com also leverages Content Delivery Network (CDN) partners to extend its reach and improve performance by globally load balancing and edge caching selected static content. For example, we cache image files with our CDN partners to improve download time, and ultimately, to enhance the experience of users when they browse the www.microsoft.com Web site.
In addition to supporting more than two thousand production servers, the MSCOM Ops team supports hundreds of non-production servers in environments such as pre-production, staging, and labs, as well as dozens of infrastructure servers that support the site.
During the past five years, Microsoft.com has achieved one of the highest rankings on the Internet in terms of site availability as measured by Keynote™, a worldwide leader in e-business performance management services.
www.microsoft.com Infrastructure Architecture
Figure 1 and Figure 2 provide a high-level view of the physical architecture of www.microsoft.com and similar properties.
Figure 1 illustrates the top level of the physical architecture stack, from the Internet to the hardware load balancers. At the access layer, router access control lists (ACLs) are locked down so that only ports 80 and 443 are enabled. If additional services (for example, FTP or SMTP) require other ports to be opened, those services and ports are segmented onto separate, isolated LANs.
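The port restriction described above can be illustrated with a minimal sketch. This is not router configuration and does not represent the actual ACLs in use; the `Packet` type and `is_permitted` function are hypothetical names introduced only to show the allow-list logic.

```python
from dataclasses import dataclass

# Ports named in the text: 80 (HTTP) and 443 (HTTPS).
ALLOWED_TCP_PORTS = frozenset({80, 443})


@dataclass(frozen=True)
class Packet:
    """Hypothetical, simplified view of an inbound packet."""
    protocol: str
    dst_port: int


def is_permitted(packet: Packet, allowed_ports=ALLOWED_TCP_PORTS) -> bool:
    """Allow only TCP traffic to an allow-listed port; deny everything else."""
    return packet.protocol.lower() == "tcp" and packet.dst_port in allowed_ports


if __name__ == "__main__":
    for pkt in (Packet("tcp", 80), Packet("tcp", 443), Packet("tcp", 21), Packet("tcp", 25)):
        verdict = "permit" if is_permitted(pkt) else "deny"
        print(f"{pkt.protocol}/{pkt.dst_port}: {verdict}")
```

In this sketch, FTP (port 21) and SMTP (port 25) are denied by default, which mirrors the idea that such services belong on separate, isolated LANs with their own rules.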
Our CDN partners are an important component in delivering content. They have a worldwide infrastructure that we leverage to provide the site with edge caching of static content (for example, .gif, .jpeg, and .css files), health checking (verifying that each of the clusters is able to accept traffic), and global load balancing. By caching selected static content, we are also able to provide a measure of geo-targeting for that content.
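As a rough illustration of two of these functions, the sketch below decides whether a URL points to static content of the kinds named above, and filters a set of clusters down to those that pass a health check. It does not describe how any CDN partner actually implements these features; the function names, cluster names, and health-status mapping are assumptions made for the example.

```python
from pathlib import PurePosixPath
from typing import Dict, List
from urllib.parse import urlsplit

# File types the text gives as examples of edge-cached static content.
STATIC_EXTENSIONS = frozenset({".gif", ".jpeg", ".css"})


def is_edge_cacheable(url: str) -> bool:
    """Return True if the URL path ends in one of the static extensions."""
    path = urlsplit(url).path
    return PurePosixPath(path).suffix.lower() in STATIC_EXTENSIONS


def clusters_accepting_traffic(health: Dict[str, bool]) -> List[str]:
    """Return the names of clusters whose last health check passed.

    `health` maps a hypothetical cluster name to its health-check result.
    """
    return sorted(name for name, healthy in health.items() if healthy)


if __name__ == "__main__":
    print(is_edge_cacheable("https://www.microsoft.com/images/logo.gif?v=2"))
    print(is_edge_cacheable("https://www.microsoft.com/default.aspx"))
    print(clusters_accepting_traffic({"cluster-a": True, "cluster-b": False}))
```

A global load balancer would then direct requests only to the clusters returned by the health filter, while requests for cacheable content could be served from the edge.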
Figure 1. Top of the Physical Architecture Stack
Figure 2 illustrates the physical architecture from the data center level down. Although this illustration generally represents www.microsoft.com, some of its elements are common to several other properties as well. A property is defined as the front end, back end, and network that host all the infrastructure and code that make up a Web entity.
One of our key design considerations is the cookie cutter approach to configuration. The cookie cutter approach means that we try to configure each property in a data center as identically as possible. MSDN, TechNet, Microsoft Update, the Microsoft Download Center, and www.microsoft.com are examples of separate properties. While these properties have many infrastructure similarities, each fulfills a specific business need. The cookie cutter approach gives us great agility because we can quickly repurpose a LAN or cluster, if necessary, to address changing business needs. For example, we can quickly repurpose a www.microsoft.com cluster into an MSDN cluster if the need arises.
Figure 2. Physical Architecture: Data Center Stack
Windows Update Infrastructure Architecture
One of the responsibilities of the Microsoft.com Engineering Operations Team is to manage the infrastructure that supports the Windows Update and Microsoft Update services, whose growing client bases currently number in the hundreds of millions. The Windows Update site provides critical updates, security fixes, software downloads, and device drivers for Windows operating systems. Microsoft Update is the service that brings you all the features and benefits of Windows Update plus downloads for other Microsoft applications, including Microsoft Office. We also support automatic update, a major feature of the update services, which enables your PC to automatically check for important updates and download (and possibly install) them for you.
Windows Update is a very high-volume site, with several hundred million clients worldwide, 350 million unique scans per day, 60,000 ASP.NET requests per second, and 1.5 million concurrent connections. Microsoft releases security patches and other updates on the second Tuesday of each month. During “Patch Tuesday”, egress can exceed 500 gigabits per second through the Microsoft and CDN partner networks.
The Ops Team, in partnership with the Windows Update development team, leverages a key construct called a scalable unit. A scalable unit is defined as a specified number of front-end Web servers and back-end SQL servers that have been carefully benchmarked to provide a known performance metric. Scalable units are an integral part of the architectural design of Microsoft Update. Knowing the capacity of a single scalable unit simplifies the process of scaling out the infrastructure to meet projected traffic needs. Figure 3 illustrates the infrastructure of Microsoft Update.
Figure 3. Microsoft Update Infrastructure Architecture
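The way a known per-unit capacity simplifies scale-out planning can be shown with a short sketch. The capacity of a real scalable unit is not stated in this document, so the 5,000 requests-per-second figure below is purely hypothetical, as are the function name and the optional headroom parameter.

```python
import math


def scalable_units_needed(projected_peak_rps, unit_capacity_rps, headroom=0.0):
    """Return how many scalable units cover the projected peak load.

    projected_peak_rps: projected peak requests per second.
    unit_capacity_rps:  benchmarked capacity of one scalable unit
                        (hypothetical value in the example below).
    headroom:           extra fraction of capacity to reserve, e.g. 0.25.
    """
    if unit_capacity_rps <= 0:
        raise ValueError("unit_capacity_rps must be positive")
    if projected_peak_rps < 0 or headroom < 0:
        raise ValueError("projected_peak_rps and headroom must be non-negative")
    return math.ceil(projected_peak_rps * (1 + headroom) / unit_capacity_rps)


if __name__ == "__main__":
    # 60,000 requests per second is the figure quoted for Windows Update;
    # 5,000 requests per second per unit is an assumed benchmark result.
    print(scalable_units_needed(60_000, 5_000))
    print(scalable_units_needed(60_000, 5_000, headroom=0.25))
```

Because each unit has been benchmarked as a whole, front-end and back-end servers together, planning reduces to a single division rather than separate sizing exercises for each tier.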
MSDN and TechNet Infrastructure Architecture
The Microsoft Developer Network (MSDN) is the Microsoft site where developers can find information about key topics such as XML, Windows, Microsoft Visual Studio™, Microsoft .NET, and Microsoft Silverlight™. One of its key features is the MSDN subscription service. This service provides a family of quarterly Microsoft subscription packages of software and documentation that offer up-to-date resources for the development, implementation, and maintenance of applications and services based on Microsoft products and technologies. The MSDN subscription service also provides links to news, technical articles, training, events, and community resources, and includes forums, blogs, chats, and webcasts.
TechNet is the sister site of MSDN, with the target audience consisting of IT professionals instead of developers. Both sites have very similar architectures with the key differences being the target audience and the content that they deliver. TechNet focuses on providing resource kits, service packs, knowledge base articles, deployment guides, and training materials to IT professionals.
MSDN and TechNet share the same infrastructure, and the Ops Team supports them as a single entity. Because both sites are heavily back-end dependent, SQL Server replication scenarios are of paramount importance. Almost all content is dynamically rendered from the SQL Server back end using XSL transforms, and the results are then cached. A key challenge that our SQL Server systems engineers had to solve was how to replicate data most efficiently among geographically dispersed data centers. A primary consideration was replicating large binary objects. We optimized the replication process by carefully testing and tuning the OLE DB stream thresholds. This scenario is a good example of one of the many engineering problems that MSCOM Ops faces on a daily basis.
Summary
The Microsoft.com Operations team consists of the following sub-teams: Engineering, Service Management, Support, Hosting and Debugging. This document introduced the Microsoft.com Engineering Operations team, and described our mission and the environment that we support. It also described the services and infrastructure of some of the larger sites that we run and some of the challenges that we encounter.