Overview of the reliability pillar

The reliability pillar of Well-Architected for Industry focuses on keeping workloads resilient so that they continue to operate through disruptions and failures. This includes minimizing downtime, reducing the impact of disruptions on users, and ensuring business continuity.

The reliability pillar includes the following key areas:

  1. Foundations: Design a strong foundation that supports high availability and fault tolerance. Includes the following actions:

    • Use multiple availability zones
    • Implement network and data redundancy
    • Design for failure
  2. Change management: Implement change management practices to minimize the risk of disruptions caused by changes to the system. Includes the following actions:

    • Use automation to deploy changes
    • Implement testing and validation processes
    • Use canary or blue-green deployment techniques
  3. Failure management: Design for failure and implement processes to detect, respond to, and recover from failures. Includes the following actions:

    • Implement automated monitoring and alerting
    • Use fault isolation techniques
    • Implement disaster recovery and business continuity plans
  4. Scalability: Design solutions that can scale to meet changing demands. Includes the following actions:

    • Use autoscaling
    • Design for elasticity
    • Implement capacity planning and management processes
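
As one illustration of the "design for failure" action listed above, the following minimal Python sketch retries an operation that fails transiently, using exponential backoff with jitter. It is an assumption-laden example, not part of any Microsoft SDK: the names `TransientError` and `call_with_retry`, and the default attempt count and delays, are hypothetical and chosen only for illustration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_retry(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying transient failures with backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the failure to the caller.
            # Double the delay ceiling on each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: wait a random time in [0, ceiling] so that many
            # clients do not retry in lockstep after a shared outage.
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")
```

A caller would wrap a dependency call, for example `call_with_retry(lambda: fetch_record(record_id))`, where `fetch_record` is a hypothetical function that raises `TransientError` on timeouts or throttling. Only errors known to be transient should be retried; retrying permanent failures adds load without improving reliability.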

Industry cloud solutions are built on top of Azure, Power Platform, Microsoft 365, and Dynamics 365. The division of responsibility for the reliability pillar varies across infrastructure as a service (IaaS), platform as a service (PaaS), and software as a service (SaaS) components. The following table summarizes the division of responsibility:

| Type of service | Microsoft responsibility | Customer responsibility | Example components used in Microsoft Cloud industry solutions |
| --- | --- | --- | --- |
| On-premises | N/A | Responsible for the whole stack. | On-premises data gateway |
| IaaS | Ensure the availability and reliability of the underlying infrastructure, such as physical servers, storage, and networking components. | Configure and deploy applications to maximize their reliability and availability. This includes architecting applications to handle failure scenarios, for example by using load balancing and autoscaling features. | Azure Virtual Network (VNet), Azure Virtual Machines (VMs) |
| PaaS | Ensure the reliability of the platform, including the runtime environment and associated services, such as databases and messaging systems. | Configure and deploy applications to maximize their reliability and availability, for example by using load balancing and failover mechanisms. | Power Platform, Azure Health Data Services, Azure Storage Services, Azure Analytics Services, Azure Logic Apps, Azure Kubernetes Service (AKS) |
| SaaS | Ensure the reliability of the entire software application and associated services. This includes ensuring that the application is available and responsive to users, and that the processed data is stored securely and reliably. | Configure user accounts and access controls so that users can access the application as needed, and report any issues to Microsoft promptly. | Dynamics 365, Microsoft 365 |

See also