Overview of the reliability pillar

The reliability pillar of Well-Architected for Industry focuses on ensuring that workloads perform consistently at an acceptable service level in accordance with business continuity requirements. It encompasses minimizing downtime, reducing the impact of disruptions on users, and restoring normal operations when disruptions occur.

Two key approaches to achieving reliability in a workload are:

Resiliency: The ability to withstand and continue operating when things go wrong, such as temporary errors, infrastructure outages, or unexpected spikes in demand. Resiliency helps you to avoid disruptions.
Recoverability: The ability to restore normal operations after a disruption. If a disruption does occur, recoverability helps you return to a reliable state.

Reliability also incorporates other elements of your solution design, including how you deploy changes safely.
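
As an illustration of resiliency against temporary errors, the following is a minimal sketch of retrying an operation with exponential backoff and jitter. It is not taken from any Microsoft SDK: `TransientError`, `call_with_retry`, and the default values for `max_attempts`, `base_delay`, and `max_delay` are all hypothetical names and numbers chosen for this example.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying, such as a timeout."""


def call_with_retry(operation, max_attempts=4, base_delay=0.5,
                    max_delay=8.0, sleep=time.sleep):
    """Call operation(), retrying on TransientError with backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # The upper bound doubles on each attempt and is capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # "Full jitter": wait a random time in [0, delay] so that many
            # clients do not all retry at the same moment.
            sleep(random.uniform(0, delay))
```

Only errors explicitly marked as transient are retried; any other exception propagates immediately, so a permanent failure is not hidden behind repeated attempts. The `sleep` parameter exists only so the waiting behavior can be replaced in tests.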

The reliability pillar includes the following key areas:

Foundations: Design a strong foundation that supports resiliency through high availability and fault tolerance. Includes the following actions:
- Use multiple availability zones
- Implement network and data redundancy
- Design for failure
Change management: Implement change management practices to minimize the risk of disruptions caused by changes to the system. Includes the following actions:
- Use automation to deploy changes
- Implement testing and validation processes
- Use canary or blue-green deployment techniques
Failure management: Design for failure and implement processes to detect, respond to, and recover from failures. This area encompasses both resiliency and recoverability. Includes the following actions:
- Implement automated monitoring and alerting
- Use fault isolation techniques
- Implement disaster recovery and business continuity plans
Scalability: Design solutions that can scale to meet changing demands and avoid downtime due to high load. Includes the following actions:
- Use autoscaling
- Design for elasticity
- Implement capacity planning and management processes
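
One common fault isolation technique from the failure management area is a circuit breaker, which stops calling a failing dependency for a while so that the failure does not spread to callers. The sketch below is a simplified, single-threaded illustration under stated assumptions: `CircuitBreaker`, `CircuitOpenError`, and the `failure_threshold` and `reset_timeout` defaults are hypothetical and do not correspond to any specific Azure or Power Platform API.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Circuit is open: fail fast instead of calling the dependency.
                raise CircuitOpenError("dependency is isolated; failing fast")
            # Timeout elapsed: fall through and allow a trial call (half-open).
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if (self.opened_at is not None
                    or self.failures >= self.failure_threshold):
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each request to the dependency in `breaker.call(...)` and treat `CircuitOpenError` as a signal to use a fallback, such as cached data or a degraded response. A production implementation would also need thread safety and monitoring of state changes, which this sketch omits.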

Industry cloud solutions are built on top of Azure, Power Platform, Microsoft 365, and Dynamics 365. The division of responsibility for the reliability pillar varies across infrastructure as a service (IaaS), platform as a service (PaaS), and software as a service (SaaS) components. The following table summarizes the division of responsibility:

Type of service	Microsoft responsibility	Customer responsibility	Some components used in Microsoft Cloud industry solutions
On-premises	N/A	Responsible for the whole stack.	On-premises data gateway
IaaS	Ensure the availability and reliability of the underlying infrastructure, such as physical servers, storage, and networking components.	Configure and deploy applications to maximize their reliability and availability. This includes ensuring that applications are architected to handle failure scenarios, for example by using load balancing and autoscaling features.	Azure Virtual Network (VNet), Azure Virtual Machines (VMs)
PaaS	Ensure the reliability of the platform, including the runtime environment and associated services, such as databases and messaging systems.	Configure and deploy applications to maximize their reliability and availability, for example by using load balancing and failover mechanisms.	Power Platform, Azure Health Data Services, Azure Storage Services, Azure Analytics Services, Azure Logic Apps, Azure Kubernetes Service (AKS)
SaaS	Ensure the reliability of the entire software application and associated services. This includes ensuring that the application is available and responsive to users, and that the processed data is stored securely and reliably.	Configure user accounts and access controls so that users can access the application as needed, and report any issues to Microsoft promptly.	Dynamics 365, Microsoft 365

Last updated on 2026-02-19
