Operational excellence design principles

The principles of operational excellence are a series of considerations that can help achieve superior operational practices.

To achieve a higher competency in operations, consider and improve how software is:

  • Developed
  • Deployed
  • Operated
  • Maintained

Equally important, provide a team culture, which includes:

  • Experimentation and growth
  • Solutions for rationalizing the current state of operations
  • Incident response plans

To assess your workload using the tenets found in the Azure Well-Architected Framework, reference the Microsoft Azure Well-Architected Review.

The following design principles provide:

  • Context for questions
  • Why a certain aspect is important
  • How an aspect is applicable to Operational excellence

These critical design principles are used as lenses to assess the Operational excellence of an application deployed on Azure. These lenses provide a framework for the application assessment questions.

Optimize build and release processes

Embrace software engineering disciplines across your entire environment, which include the following disciplines:

  • Provision with Infrastructure as Code
  • Build and release with continuous integration and continuous delivery (CI/CD) pipelines
  • Use automated testing methods
  • Avoid configuration drift through configuration as code

This approach ensures the creation and management of environments throughout the software development lifecycle. It enables:

  • Consistency
  • Repetition
  • Early detection of issues

Understand operational health

Implement systems and processes to monitor all aspects of your workload. Including:

  • Build and release processes
  • Infrastructure health
  • Application health

Robust monitoring ensures the observability of a workload and allows you to correlate events and take proactive mitigating issues.

In addition, customer data is critical to understanding the health of a workload and whether the service is meeting the business goals.

Important

The Health Modeling section of the Azure Mission-Critical framework contains further in-depth guidance and examples on how to build a health model for a given workload.

Rehearse recovery and practice failure

Rehearse recovery and practice failure using the following methods:

  • Run disaster recovery (DR) drills on a regular cadence.
  • Use chaos engineering practices to identify and remediate weak points in application reliability.
  • Rehearse failure to validate the effectiveness of recovery processes and ensure teams are familiar with their responsibilities.
  • Document past failures and automate their remediation where possible.

Embrace continuous operational improvement

Teams that embrace continuous operational improvement continuously evaluate and refine operational procedures and tasks. They strive to reduce complexity and ambiguity whenever possible.

Adopting a continuous improvement culture helps organizations:

  • Evolve processes over time.
  • Optimize inefficiencies and associated processes.
  • Learn from failures.
  • Continuously evaluate new opportunities.

Use loosely coupled architecture

Use modern architecture patterns such as:

  • microservices
  • loosely coupled
  • serverless

and pair this with cloud design patterns such as:

  • Circuit breakers
  • Load-Leveling
  • Throttling

and advanced deployment strategies like:

  • Canary
  • Blue-green
  • Staggered

to enable teams to build and deploy services independently and minimize the impact if there is a service failure.

This principle also extends to procedural decoupling. Teams will be able to take full advantage of their loosely coupled architecture if they do not have to depend on partner teams to support, approve, or operate their workloads.

Next step