How Microsoft operates reliable systems with DevOps

Article
11/28/2022

Microsoft has been operating complex online platforms since the earliest days of the commercial internet. Along the way, we've evolved a substantial set of practices to keep systems available, healthy, and secure. These practices are part of a larger initiative to maintain and improve a live site culture.

Live site culture

Live site culture is the focus of an organization to prioritize the experience and reliability of the live site over everything else. After all, customers can move across service providers fairly easily nowadays with the cloud and internet-based services, greatly amplifying the importance of customer trust. The live site must always be available and perform as-promised to customers.

There are various factors that contribute to a successful live site culture.

Diagram of Microsoft's live site culture.

Live site first

Putting the live site experience first is integral to a successful platform. Teams can't put all of their focus on new, shiny features and disregard the avenue in which those features are presented to users. We rely on safe deployment practices that help ensure that our customers enjoy uninterrupted platform access. This can get especially complicated when it comes to releasing versioned service updates with no downtime.

Control exposure through feature flags

As we deploy out through our rings and stages, controlling exposure with feature flags, we occasionally discover an issue in production. Despite all our automation and reviews, things sometimes still happen. As they say, there's no place like production!

Usually, health monitoring and telemetry alert us when something isn't right. A developer can create a branch off main, make a fix, and pull request it into main. Keeping the same general workflow means that developers don't have to context-switch or learn a different process for a different code change.

To address a hotfix deployment, one more step is required, which is to cherry-pick the change into the release branch. We run a hotfix deployment out of the current release branch each weekday morning, though we can also do this on demand for urgent fixes. The fix actually hits production out of the release branch first. But because we develop in main first, we know it won't regress the next sprint when a new release branch is created from main.

Releases of on-premises products are largely the same, though without the deployment rings and stages. Also, because we do more manual testing on different configurations and data shapes, there's a longer tail between cutting the release branch and putting the product in the hands of customers.

Security should be taken personally

The focus is to make vulnerabilities real and personal. This ensures that people really care. We also make extensive use of war games to find and address security risks throughout the system, whether in code or not. When the red team can show that they got into code by turning a dialog box upside down, it really motivates the code owner to address the issue and make sure it doesn't happen again anywhere else. That sort of competition is much more real and personal than a static analysis warning about a potential XSS risk. We create this kind of culture and dynamic through war games and other security exercises. People take pride in hacking into each other's code or being able to block the attempts. This instills a secure code culture.

We can't plan for every attack vector, but what we can do is assume that there's going to be a breach, and plan how fast we can react to that breach. A lot of the security work has been around that for our teams.

Finally, humans make mistakes. They sometimes get lazy and do things like store passwords on file shares. We can tell them not to and we can send them to security training and we can do all sorts of other things. Most people learn, but it only takes one person to break the system. You can have all sorts of lists of best practices but unless you're making that real, you have to assume that people are going to make mistakes. This requires a certain level of oversight to ensure critical processes are being followed.

Engineering is more than an ops partner

We learned early on to make the live site an important part of the engineering team's responsibilities. That was huge for us because, in the past, one person could go deploy something, leave for the weekend, and return Monday to find 900 customer issues that the customer support and ops teams were dealing with all weekend. It's important that engineering pay the price for live site issues. Otherwise there's no incentive to build systems that avoid those problems. When you get called at 2 A.M. to fix something you broke, you remember.

As we evolved this responsibility, Live site is the most important thing that we do became the whole team's mantra. It's the customer experience they have right now and it's not just a tax. It's actually something people count on from us and we take pride in it. It needs to be a differentiating feature of our product.

Production telemetry is the heartbeat of your service

In order to survive in the fast-paced world where virtually anything can go wrong, we need great alerting systems. Unactionable alerts, redundant alerts, or overwhelming alert volumes make you ignore all the alerts. It's easy to create way too many alerts, so the process really distills down to a simple question: Is this alert actionable? This ensures we're engaging on the right customer issues and handling them as quickly as possible.

As the engineering team zeroed in on actionable alerts, they noticed that a lot of problems that come up, especially in the middle of the night, tend to have similar fixes, at least temporarily. This resulted in a focus on systems that were better at failing over and self-healing. Now the issues happen, raise alerts, and then fix themselves well enough for the engineering team to wait until morning to fix. This wouldn't have happened if the engineering team just pushed out bits that kept other people up at night. Now they work to balance these improvements as a part of not just feature velocity, but engineering improvement velocity.

Summary

Adopting a live site culture has impacted the way Microsoft builds and delivers software. By making engineering teams a key part of security and operations, the quality of our code and end-user experience have improved drastically. Being a full participant in operations has made engineering a key stakeholder, resulting in systems that are designed for better operations.