Resiliency design patterns
This unit focuses on resiliency design best practices for application deployments with varied resiliency requirements. Typically, customers classify their applications into categories or tiers based on their resiliency requirements. In this unit, we take an example application from each category and discuss how you can design that application to be resilient to various types of failures. The applications we address are:
- Tier 4 application. 99% application SLA, 24-hour RPO, 72-hour RTO
- Tier 3 application. 99.95% application SLA, 4-hour RPO, 8-hour RTO
- Tier 2 application. 99.99% application SLA, 30-minute RPO, 4-hour RTO
- Tier 1 application. 99.99% application SLA, 5-minute RPO, 1-hour RTO
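
To make the SLA figures in this list more concrete, the following minimal sketch converts an availability percentage into a yearly downtime budget. The function name and the 365-day year are assumptions made for illustration; they aren't part of any Azure tooling.

```python
# Illustrative only: converts an availability SLA into a downtime budget.
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_hours(sla_percent: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the maximum downtime (in hours) that still meets the SLA."""
    return period_hours * (1 - sla_percent / 100)


# SLA targets taken from the tier list in this unit.
TIERS = {
    "Tier 4": 99.0,
    "Tier 3": 99.95,
    "Tier 2": 99.99,
    "Tier 1": 99.99,
}

for name, sla in TIERS.items():
    hours = allowed_downtime_hours(sla)
    print(f"{name}: {sla}% allows about {hours:.2f} hours of downtime per year")
```

Under these assumptions, a 99% SLA leaves roughly 87.6 hours of downtime per year, while a 99.99% SLA leaves less than one hour.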
Tier 4 application design pattern
Tier 4 applications have less stringent availability requirements. The example application has a 99% application SLA, 24-hour RPO, and 72-hour RTO. Tier 4 applications can be internal applications, such as tooling applications, build servers, and project document share websites.
A multi-tier web application in this category can be deployed within an Azure region as a single-instance VM for each tier. If you want an explicit SLA guarantee at the VM level, you can use premium storage for your VMs. If the application is relatively important within this category, use premium storage for the database VM after weighing the benefit against the cost of premium storage. The databases can be backed up using any backup software. For example, use Azure Backup to configure database backups for SQL Server and SAP HANA running on Azure VMs. You can also keep precreated Azure Resource Manager (ARM) templates for your VMs so that you can redeploy them if there's an issue with the single-instance VM in any tier in that region. For the database VM, you can use a database backup copy to recreate the databases.
You can use Azure Backup on your single-instance VMs to protect your data, and test the backups using the restore feature of Azure Backup. If there's any data-level or VM-level corruption, you can recover the file, folder, disk, or VM using the restore capabilities of Azure Backup. With Azure Backup, you can also protect application data stored in Azure Blobs, Azure Disks, and Azure File Shares, which provides recovery from data corruption or loss.
For disaster recovery in a regional failure scenario, you can consider replicating only the database VM with Azure Site Recovery. The web and app tier VMs can be redeployed in another Azure region using ARM templates if they're stateless, or they can be recovered from the backup copy. Note that the recovery time could be high for this approach. If you can accept a higher cost, replicate the VMs across all tiers to another region. Azure Site Recovery doesn't require running additional VM instances in the disaster recovery region. The VMs are created only when the user performs a failover operation.
You can monitor the health of the web application by using an automation script that periodically checks if the website endpoint is reachable. Create a custom endpoint that reports on the overall health of the application. The endpoint should return an HTTP error code if any critical dependency is unhealthy or unreachable. Don't report errors for noncritical services, however.
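The custom health endpoint described above could look like the following minimal sketch, built only on the Python standard library. The dependency names, the `/health` path, the port, and the placeholder probes are all assumptions for illustration; replace them with checks that match your application.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple

# Hypothetical dependency checks; replace the lambdas with real probes
# (for example, a database ping or a queue connection test).
CRITICAL_CHECKS: Dict[str, Callable[[], bool]] = {"database": lambda: True}
NONCRITICAL_CHECKS: Dict[str, Callable[[], bool]] = {"recommendations": lambda: True}


def run_check(check: Callable[[], bool]) -> bool:
    """Run one probe; treat an exception as an unhealthy dependency."""
    try:
        return bool(check())
    except Exception:
        return False


def evaluate_health(critical, noncritical) -> Tuple[int, dict]:
    """Return (HTTP status, report). Only critical failures produce an error code."""
    critical_results = {name: run_check(c) for name, c in critical.items()}
    noncritical_results = {name: run_check(c) for name, c in noncritical.items()}
    status = 200 if all(critical_results.values()) else 503
    return status, {"critical": critical_results, "noncritical": noncritical_results}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        status, report = evaluate_health(CRITICAL_CHECKS, NONCRITICAL_CHECKS)
        body = json.dumps(report).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Start the health endpoint; call this from your own entry point."""
    HTTPServer(("", port), HealthHandler).serve_forever()
```

Noncritical results are still included in the response body so that operators can see them, but they never change the status code.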
You can have a precreated script to monitor the health metrics of the VMs to see if there's any issue with a VM. You can also troubleshoot issues by checking the VM health metrics in the Azure portal, and check component health such as CPU, memory, and disk to see if there are any potential load issues. If you consistently see issues with components such as CPU or RAM, consider moving the VM to a larger size or scaling out the application by deploying more VMs at each tier.
You can deploy application software updates on the VMs using automation scripts during a weekend maintenance window. Make sure you have an automation script to roll back if any issues occur during the deployment process.
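One way to structure such an update script is to wrap the deployment, a verification step, and the rollback in a single function. The sketch below is a generic pattern, not an Azure API; the `deploy`, `verify`, and `rollback` callables are hypothetical placeholders for your own automation steps.

```python
from typing import Callable


def deploy_with_rollback(
    deploy: Callable[[], None],
    verify: Callable[[], bool],
    rollback: Callable[[], None],
) -> bool:
    """Run deploy, then verify; roll back if either step fails.

    Returns True when the new version is live and verified,
    False when the rollback was executed.
    """
    try:
        deploy()
        if verify():
            return True
    except Exception as error:
        # A failure in deploy() or verify() is treated like a failed verification.
        print(f"Deployment failed: {error}")
    rollback()
    return False
```

An exception raised by the rollback step itself is deliberately not caught here, so that a failed rollback surfaces to the operator instead of being hidden.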
The following table provides a resiliency strategy for a Tier 4 application.
| Failure type | Resiliency strategy |
|---|---|
| Hardware failure | - Use ready-to-use templates to deploy another instance using the backup copies (if required). - Test your templates by deploying VMs into a test subnet or a test virtual network. |
| Datacenter failure | - Use ready-to-use templates to deploy another instance using the backup copies (if required) in another zone. - Test your templates by deploying VMs into a test subnet or a test virtual network in another availability zone. |
| Regional failure | - Use Azure Site Recovery to replicate the database VM. - Test the disaster recovery using Test failover and Azure Site Recovery recovery plans. - Perform disaster recovery failover in the event of an extended outage in source region. - Azure Backup's Cross Region Restore (CRR) lets you restore Azure VMs in a secondary paired region. You can restore your data in the secondary region anytime, during partial or full outages, or at the time you choose. |
| Heavy load | - Use monitoring tools to identify any load surges on the VM. - Scale up by increasing the size of the VM, or scale out by adding more instances. |
| Accidental data deletion or corruption | - Use Azure Backup to back up the Azure virtual machines, to protect the SQL Server and SAP databases running on Azure VMs, and to back up the data stored in Azure Disks, Azure Blobs, and Azure File Shares. Restore from these backups after data loss or corruption. |
| Application deployment failure | - Use automation scripts to deploy updates. If there’s an issue observed during the update process or after the update, roll back to the previous version with an automated script. - Also, use Azure Backup's on-demand backup of protected resources before an application deployment or upgrade activity. Use it to quickly restore to the previous known state, in case of deployment failure. |
Tier 3 application design pattern
Tier 3 applications require a high application SLA and are critical to the business, but can tolerate some downtime. This example Tier 3 application has a 99.95% application SLA, 4-hour RPO, and 8-hour RTO requirement. Tier 3 applications can be internal applications such as expense management or travel management applications, which have some impact if they're down for a few hours but don't cause a significant loss of revenue. Lower revenue-generating, customer-facing applications can also be part of this category.
You can build redundancy for applications in this category by deploying at least two VMs at each tier as part of an availability set. Availability sets place the VMs in different fault domains, so hardware failures such as a cluster or rack failure don't impact the end application. If you keep two or more VMs in an availability set, you get 99.95% availability for each tier. This helps bring the overall composite SLA of the application to around 99.9%. For the database VMs, you can use built-in synchronous replication to get high availability and avoid data loss. For example, you can use SQL Server Always On availability groups with synchronous replication for the SQL databases.
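As a rough illustration of how per-tier availability combines, the sketch below multiplies the availability of tiers that must all be up for a request to succeed. This serial-dependency model and the tier counts are assumptions; the text doesn't state how many tiers the example application has, and load balancers or other components would add further factors.

```python
from math import prod


def composite_sla(tier_slas_percent):
    """Composite availability (in percent) of tiers that all must be up."""
    return 100 * prod(sla / 100 for sla in tier_slas_percent)


# Assumption: each tier runs in an availability set with a 99.95% SLA.
two_tiers = composite_sla([99.95, 99.95])
three_tiers = composite_sla([99.95, 99.95, 99.95])

print(f"Two tiers in series:   {two_tiers:.3f}%")
print(f"Three tiers in series: {three_tiers:.3f}%")
```

With these inputs, two tiers in series come to about 99.90% and three tiers to about 99.85%, which is why the composite figure is lower than the SLA of any single tier.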
Use load balancers between each tier so that traffic can be load balanced as well as routed to the healthy VM instances. If there's an issue with one of the VMs in a tier, the application continues to work without any impact.
Azure Backup helps you restore Azure virtual machines quickly from recent backups by using the instant restore feature, which eliminates the need to transfer the backup from vault storage to the customer's subscription and thereby lowers RTO. It also offers several restore options to fit different needs, such as creating a new virtual machine, restoring the disks, or restoring a specific file or folder. By configuring backups on Azure Blobs, Azure Disks, and Azure File Shares, you can also restore data that your application stores on various Azure storage solutions. Databases are critical to many applications. With Azure Backup, you get point-in-time recovery of SQL Server databases (including support for backing up SQL Server Always On availability groups) and SAP HANA databases running on Azure virtual machines, with an RPO as low as 15 minutes.
For disaster recovery in a regional failure scenario, you can consider replicating only the database VM with Azure Site Recovery. The web and app tier VMs can be redeployed in another Azure region using ARM templates if they're stateless, or they can be recovered from the backup copy. Note that the recovery time could be high for this approach. If you can accept a higher cost, it's recommended to replicate the VMs across all tiers to another region. Azure Site Recovery doesn't require running additional VM instances in the disaster recovery region. The VMs are created only when the user performs a failover operation.
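To relate a backup schedule to an RPO target, the following sketch treats the worst-case data loss from backups alone as the interval between backups. This is a simplification (it ignores the time a backup takes to complete), and the 15-minute interval is an assumption based on the lowest RPO cited above.

```python
from datetime import timedelta


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """Worst-case data loss from backups alone equals the interval between them."""
    return backup_interval <= rpo


# RPO targets from the tier definitions in this unit.
RPO_TARGETS = {
    "Tier 4": timedelta(hours=24),
    "Tier 3": timedelta(hours=4),
    "Tier 2": timedelta(minutes=30),
    "Tier 1": timedelta(minutes=5),
}

# Assumption: log backups run every 15 minutes.
log_backup_interval = timedelta(minutes=15)

for tier, rpo in RPO_TARGETS.items():
    verdict = "met" if meets_rpo(log_backup_interval, rpo) else "not met by backups alone"
    print(f"{tier}: RPO of {rpo} is {verdict}")
```

Under this assumption, a 5-minute RPO can't be met by backups alone, which is consistent with the Tier 1 pattern relying on native replication such as SQL Always On.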
You can monitor the health of the web application by using an automation script that periodically checks if the website endpoint is reachable. Create a custom endpoint that reports on the overall health of the application. The endpoint should return an HTTP error code if any critical dependency is unhealthy or unreachable.
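A monitoring script of this kind can be sketched with the Python standard library. The function names and the rule of alerting after three consecutive failed probes are assumptions made for illustration; tune the threshold and the scheduling to your own environment.

```python
import urllib.error
import urllib.request


def endpoint_is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True only when the endpoint answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # HTTPError (4xx/5xx) is a subclass of URLError; timeouts and
        # connection failures surface as URLError or OSError.
        return False


def should_alert(results, threshold: int = 3) -> bool:
    """Alert when the most recent `threshold` probes all failed."""
    recent = results[-threshold:]
    return len(recent) == threshold and not any(recent)
```

A scheduler (for example, a cron job) could call `endpoint_is_healthy` periodically, append each result to a list, and raise an alert when `should_alert` returns `True`. Requiring several consecutive failures avoids alerting on a single transient error.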
You can set up a precreated script to monitor the health metrics of the VMs to see if there's any issue with a VM. You can also troubleshoot issues by checking the VM health metrics in the Azure portal, and check component health such as CPU, memory, and disk to see if there are any potential load issues. If you consistently see issues with components such as CPU or RAM, consider moving the VM to a larger size or scaling out the application by deploying more VMs at each tier. You should also monitor advanced metrics for VM health and activities such as database failover if you're using synchronous replication.
You can deploy application software updates on the VMs using automation scripts during a weekend maintenance window. Ensure you have an automation script to roll back if any issues occur during the deployment process.
The following table provides a resiliency strategy for a Tier 3 application.
| Failure type | Resiliency strategy |
|---|---|
| Hardware failure | - Build redundancy by deploying two or more instances in an availability set within a datacenter. |
| Datacenter failure | - Use ready-to-use templates to deploy another instance using the backup copies (if required) in another availability zone. - Test your templates by deploying VMs into a test subnet or a test virtual network in another zone. |
| Regional failure | - Use Azure Site Recovery to replicate the database VM. - Test the disaster recovery using Test failover and Azure Site Recovery recovery plans. - Perform disaster recovery failover in the event of an extended outage in source region. - Azure Backup's Cross Region Restore (CRR) lets you restore Azure VMs in a secondary paired region. You can restore your data in the secondary region anytime, during partial or full outages, or at the time you choose. |
| Heavy load | - Use monitoring tools to identify any load surges on the VM. - Scale up by increasing the size of the VM, or scale out by adding more instances. |
| Accidental data deletion or corruption | - Use Azure Backup to back up the Azure virtual machines, to protect the SQL Server and SAP databases running on Azure VMs, and to back up the data stored in Azure Disks, Azure Blobs, and Azure File Shares. Restore from these backups after data loss or corruption. |
| Application deployment failure | - Use safe deployment practices to roll out updates to a minimal set of customers before deploying them widely. - Use automation scripts to deploy updates with automatic rollback capability built in, in case there’s an issue with the update deployment. - Configure alerts to send alarms or notifications if an issue occurs after an update deployment. If so, have the automated rollback script ready to execute. - Additionally, use Azure Backup's on-demand backup feature to back up protected resources before an application deployment or upgrade activity. Use that backup to quickly restore to the previous known state in case of deployment failure. |
Tier 2 application design pattern
Tier 2 applications are business critical and can have a significant impact on revenue if there's downtime. The sample application has a 99.99% application SLA, 30-minute RPO, and 4-hour RTO. Examples include external customer-facing e-commerce websites, content streaming platforms, and financial transaction handling applications. The application should be highly available and resilient to all component failures.
Factors to consider in application design for Tier 2 applications when deployed in Azure include:
| Failure type | Resiliency strategy |
|---|---|
| Hardware failure | - Build redundancy by deploying two or more instances across availability zones within a region. |
| Datacenter failure | - Build redundancy by deploying two or more instances across availability zones within a region. |
| Regional failure | - Use Azure Site Recovery to replicate the database VM. - Test the disaster recovery using Test failover and Azure Site Recovery recovery plans. - Perform disaster recovery failover in the event of an extended outage in source region. - Azure Backup's Cross Region Restore (CRR) lets you restore Azure VMs in a secondary paired region. You can restore your data in the secondary region anytime, during partial or full outages, or at the time you choose. |
| Heavy load | - Provision enough capacity into the application. - Use tools to monitor the load and add more instances automatically using scripts if the threshold is hit (say, 70%). |
| Accidental data deletion or corruption | - Use Azure Backup to back up the Azure Virtual Machine, protect the SQL Server and SAP Databases running on Azure VM and to back up the data stored in Azure Disks, Azure Blobs, and Azure File Shares and restore them during data loss or corruption. |
| Application deployment failure | - Use safe deployment practices to roll out the updates to a minimal set of customers before deploying them widely. - Use automation scripts to deploy updates with the automatic rollback capability built in if there's an issue with the update deployment. - Configure alerts to send alarms/notifications if an issue occurs after an update deployment. If so, have the automated rollback script ready to execute. - Additionally, use Azure Backup's on-demand backup feature to take backup of protected resources before an application deployment or upgrade activity. Use the backup to restore quickly to the previous known state, in case of deployment failure. |
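
The heavy-load strategy in the table above mentions adding instances automatically when a utilization threshold (say, 70%) is hit. The following minimal sketch shows one way a script might decide how many instances to run. The function name and the assumption that load spreads evenly across instances are illustrative only; in practice the metric would come from your monitoring tool.

```python
from math import ceil


def instances_needed(current_instances: int, avg_utilization: float, target: float = 70.0) -> int:
    """Smallest instance count that brings average utilization to the target or below.

    Assumes load spreads evenly across instances, which is a simplification.
    """
    if avg_utilization <= target:
        # Below the threshold: keep the current capacity.
        return current_instances
    total_load = current_instances * avg_utilization
    return ceil(total_load / target)


# Example: two instances averaging 90% CPU exceed the 70% threshold.
print(instances_needed(current_instances=2, avg_utilization=90.0))
```

This sketch only scales out. A production script would typically also define a scale-in rule and a cool-down period to avoid oscillating between instance counts.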
Tier 1 application design pattern
Tier 1 applications are business and mission critical, like Tier 2 applications, but they have stricter requirements for data loss and recovery time. In this example, a Tier 1 application is assigned a 99.99% application SLA, 5-minute RPO, and 1-hour RTO. With this type of application, more than a few minutes of data loss can significantly impact revenue and the business. Customer-facing applications such as order processing systems and banking applications fall into this category.
Factors to consider in application design for Tier 1 applications when deployed in Azure include:
| Failure type | Resiliency strategy |
|---|---|
| Hardware failure | - Build redundancy by deploying two or more instances across availability zones within a region. |
| Datacenter failure | - Build redundancy by deploying two or more instances across availability zones within a region. |
| Regional failure | - Use Azure Site Recovery to replicate all the VMs in web tier and middle tier. - Use native replication technologies such as SQL Always On. - Test the disaster recovery of the complete application including SQL Always On failover using Azure Site Recovery recovery plans and Test failover capabilities. - Perform disaster recovery failover in the event of an extended outage in source region. |
| Heavy load | - Provision enough capacity into the application. - Use tools to monitor the load and add more instances automatically using scripts if the threshold is hit (say, 70%). |
| Accidental data deletion or corruption | - Use Azure Backup to back up the Azure virtual machines, to protect the SQL Server and SAP databases running on Azure VMs, and to back up the data stored in Azure Disks, Azure Blobs, and Azure File Shares. Restore from these backups after data loss or corruption. |
| Application deployment failure | - Use safe deployment practices to roll out updates to a minimal set of customers before deploying them widely. - Use automation scripts to deploy updates with automatic rollback capability built in, in case there’s an issue with the update deployment. - Configure alerts to send alarms if an issue occurs after an update deployment. If any occur, have the automated rollback script ready to execute. - Additionally, use Azure Backup's on-demand backup feature to back up protected resources before an application deployment or upgrade activity. Use that backup to quickly restore to the previous known state in case of deployment failure. |