Reliability patterns

Availability

Availability is measured as a percentage of uptime and describes the proportion of time that a system is functional and working. It is affected by system errors, infrastructure problems, malicious attacks, and system load. Cloud applications typically provide users with a service-level agreement (SLA), so applications must be designed and implemented to maximize availability.
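
To make the percentage concrete, the following minimal sketch converts an availability target into the downtime it permits over a period. The function name `allowed_downtime_minutes` and the 30-day default period are assumptions chosen for illustration; they are not taken from any particular SLA.

```python
def allowed_downtime_minutes(availability_percent: float, period_days: float = 30.0) -> float:
    """Return the downtime budget, in minutes, implied by an availability target."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    # The unavailable fraction of the period is the downtime budget.
    return total_minutes * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        budget = allowed_downtime_minutes(target)
        print(f"{target}% availability allows about {budget:.1f} minutes of downtime per 30 days")
```

Each additional "nine" cuts the downtime budget by a factor of ten, which is why higher targets usually call for redundancy patterns rather than faster manual recovery.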

| Pattern | Summary |
| --- | --- |
| Deployment Stamps | Deploy multiple independent copies of application components, including data stores. |
| Geode | Deploy backend services into a set of geographical nodes, each of which can service any client request in any region. |
| Health Endpoint Monitoring | Implement functional checks in an application that external tools can access through exposed endpoints at regular intervals. |
| Queue-Based Load Leveling | Use a queue that acts as a buffer between a task and a service that it invokes to smooth intermittent heavy loads. |
| Throttling | Control the consumption of resources by an instance of an application, an individual tenant, or an entire service. |
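
As one possible illustration of the Throttling pattern, the sketch below uses a token bucket to cap how many requests a single tenant can make. The token bucket is a technique chosen here for illustration, not something the pattern prescribes, and the `TokenBucket` class, its parameters, and the injectable `clock` are all hypothetical names.

```python
import time


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # Injectable so the limiter can be tested without sleeping.
        self.tokens = capacity      # Start with a full bucket.
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True if the request may proceed, False if it should be throttled."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # The caller decides whether to reject or delay the request.


# One bucket per tenant keeps a noisy tenant from consuming everyone's share.
buckets: dict[str, TokenBucket] = {}


def allow_request(tenant_id: str) -> bool:
    bucket = buckets.setdefault(tenant_id, TokenBucket(rate=5.0, capacity=10.0))
    return bucket.allow()
```

Keeping a separate bucket per tenant is what scopes the limit to "an individual tenant"; a single shared bucket would instead throttle the entire service.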

To mitigate availability risks from malicious distributed denial-of-service (DDoS) attacks, implement the native Azure DDoS protection service or a third-party capability.

High availability

Azure infrastructure is composed of geographies, regions, and availability zones. These divisions limit the blast radius of a failure and therefore limit the potential effect on customer applications and data. The Azure availability zones construct was developed to provide a software and networking solution that protects against datacenter failures and increases availability. A high availability architecture balances high resilience, low latency, and cost.

| Pattern | Summary |
| --- | --- |
| Deployment Stamps | Deploy multiple independent copies of application components, including data stores. |
| Geode | Deploy backend services into a set of geographical nodes. Each node can service any client request in any region. |
| Health Endpoint Monitoring | Implement functional checks in an application that external tools can access through exposed endpoints at regular intervals. |
| Bulkhead | Isolate elements of an application into pools. If one element fails, the others continue to function. |
| Circuit Breaker | Handle faults that might take a variable amount of time to fix when connecting to a remote service or resource. |
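
The Circuit Breaker entry above can be sketched roughly as follows. This is a simplified, single-threaded illustration under stated assumptions: the `CircuitBreaker` and `CircuitOpenError` names, the failure threshold, and the reset timeout are all hypothetical, and a production implementation would also need thread safety and per-dependency state.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock        # Injectable so tests do not need to sleep.
        self.failures = 0
        self.opened_at = None     # None means the circuit is closed.

    def call(self, operation, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast instead of waiting on a dependency that is likely down.
                raise CircuitOpenError("circuit is open; failing fast")
            # Timeout elapsed: half-open, so let this one trial call through.
        try:
            result = operation(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # Open the circuit, or re-open after a failed trial.
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Failing fast while the circuit is open gives the remote service time to recover and keeps callers from tying up threads and connections on requests that are unlikely to succeed.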

Resiliency

Resiliency is the ability of a system to gracefully handle and recover from failures, both inadvertent and malicious.

In cloud hosting, applications are often multi-tenant, use shared platform services, compete for resources and bandwidth, communicate over the internet, and run on commodity hardware. As a result, both transient and permanent faults are more likely to arise. The connected nature of the internet and the rise in sophistication and volume of attacks also increase the likelihood of a security disruption.

To maintain resiliency, a system must detect failures and recover from them quickly and efficiently.

| Pattern | Summary |
| --- | --- |
| Bulkhead | Isolate elements of an application into pools. If one element fails, the others continue to function. |
| Circuit Breaker | Handle faults that might take a variable amount of time to fix when connecting to a remote service or resource. |
| Compensating Transaction | Undo the work performed by a series of steps, which together define an eventually consistent operation. |
| Health Endpoint Monitoring | Implement functional checks in an application that external tools can access through exposed endpoints at regular intervals. |
| Leader Election | Coordinate the actions performed by a collection of collaborating task instances in a distributed application by electing one instance as the leader. The leader assumes responsibility for managing the other instances. |
| Queue-Based Load Leveling | Use a queue that acts as a buffer between a task and a service that it invokes. This queue smooths intermittent heavy loads. |
| Retry | Enable an application to handle anticipated, temporary failures when it tries to connect to a service or network resource by transparently retrying an operation that previously failed. |
| Scheduler Agent Supervisor | Coordinate a set of actions across a distributed set of services and other remote resources. |
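
As a rough illustration of the Retry pattern listed above, the sketch below retries an operation with exponential backoff and jitter. The backoff strategy is one common choice rather than something the pattern mandates, and the `retry` function, the `TransientError` marker, and the default delays are hypothetical names and values.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are expected to be temporary."""


def retry(operation, max_attempts: int = 4, base_delay: float = 0.5,
          max_delay: float = 8.0, sleep=time.sleep):
    """Call `operation` until it succeeds or `max_attempts` is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the failure to the caller.
            # Exponential backoff capped at max_delay, with full jitter so that
            # many clients do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
```

Only failures marked as transient are retried here; any other exception propagates immediately, because retrying a permanent fault only adds load and delay. For faults that may take longer to clear, a retry policy is often combined with a circuit breaker.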