Set up disaster recovery at scale for VMware VMs/physical servers
This article describes how to set up disaster recovery to Azure for large numbers (> 1000) of on-premises VMware VMs or physical servers in your production environment, using the Azure Site Recovery service.
Define your BCDR strategy
As part of your business continuity and disaster recovery (BCDR) strategy, you define recovery point objectives (RPOs) and recovery time objectives (RTOs) for your business apps and workloads. RTO measures the duration of time and service level within which a business app or process must be restored and available, in order to avoid continuity issues.
Site Recovery provides continuous replication for VMware VMs and physical servers, and an SLA for RTO.
As you plan for large-scale disaster recovery for VMware VMs and figure out the Azure resources you need, you can specify an RTO value that will be used for capacity calculations.
Best practices
The following are general best practices for large-scale disaster recovery. They're discussed in more detail in the next sections of the document.
Identify target requirements: Estimate capacity and resource needs in Azure before you set up disaster recovery.
Plan for Site Recovery components: Figure out what Site Recovery components (configuration server, process servers) you need to meet your estimated capacity.
Set up one or more scale-out process servers: Don't use the process server that's running by default on the configuration server.
Run the latest updates: The Site Recovery team releases new versions of Site Recovery components on a regular basis, and you should make sure you're running the latest versions. To help with that, track what's new for updates, and enable and install updates as they release.
Monitor proactively: As you get disaster recovery up and running, you should proactively monitor the status and health of replicated machines, and infrastructure resources.
Disaster recovery drills: You should run disaster recovery drills on a regular basis. Drills don't affect your production environment, but they do help ensure that failover to Azure will work as expected when needed.
Gather capacity planning information
Gather information about your on-premises environment, to help assess and estimate your target (Azure) capacity needs.
For VMware, run the Deployment Planner for VMware VMs to do this.
For physical servers, gather the information manually.
Run the Deployment Planner for VMware VMs
The Deployment Planner helps you to gather information about your VMware on-premises environment.
Run the Deployment Planner during a period that represents typical churn for your VMs. This will generate more accurate estimates and recommendations.
We recommend that you run the Deployment Planner on the configuration server machine, since the Planner calculates throughput from the server on which it's running. Learn more about measuring throughput.
If you don't yet have a configuration server setup:
By default, the tool is configured to profile and generate reports for up to 1000 VMs. You can change this limit by increasing the MaxVMsSupported key value in the ASRDeploymentPlanner.exe.config file.
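As a minimal sketch, the limit could be raised with a short script instead of by hand. This assumes the file uses the common .NET layout, with entries of the form `<add key="..." value="..."/>`. That layout, the helper name `set_max_vms`, and the value 2000 are assumptions for illustration, so check your own copy of the file before relying on it.

```python
import xml.etree.ElementTree as ET


def set_max_vms(config_xml: str, new_limit: int) -> str:
    """Return config_xml with the MaxVMsSupported value set to new_limit."""
    root = ET.fromstring(config_xml)
    # Assumption: the key is stored as <add key="MaxVMsSupported" value="..."/>.
    for entry in root.iter("add"):
        if entry.get("key") == "MaxVMsSupported":
            entry.set("value", str(new_limit))
            return ET.tostring(root, encoding="unicode")
    raise KeyError("MaxVMsSupported key not found")


# Hypothetical, cut-down config content used only to demonstrate the helper.
sample = (
    "<configuration><appSettings>"
    '<add key="MaxVMsSupported" value="1000" />'
    "</appSettings></configuration>"
)
updated = set_max_vms(sample, 2000)
print(updated)
```

ElementTree drops XML comments and the XML declaration when it re-serializes, so keep a backup of the original file if you adapt this to edit it in place.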
Plan target (Azure) requirements and capacity
Using your gathered estimations and recommendations, you can plan for target resources and capacity. If you ran the Deployment Planner for VMware VMs, you can use a number of the report recommendations to help you.
Compatible VMs: Use this number to identify the number of VMs that are ready for disaster recovery to Azure. Recommendations about network bandwidth and Azure cores are based on this number.
Required network bandwidth: Note the bandwidth you need for delta replication of compatible VMs.
When you run the Planner you specify the desired RPO in minutes. The recommendations show you the bandwidth needed to meet that RPO 100% and 90% of the time.
The network bandwidth recommendations take into account the bandwidth needed for total number of configuration servers and process servers recommended in the Planner.
Required Azure cores: Note the number of cores you need in the target Azure region, based on the number of compatible VMs. If you don't have enough cores, at failover Site Recovery won't be able to create the required Azure VMs.
Recommended VM batch size: The recommended batch size is based on the ability to finish initial replication for the batch within 72 hours by default, while meeting an RPO of 100%. The hour value can be modified.
You can use these recommendations to plan for Azure resources, network bandwidth, and VM batching.
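To get a feel for how batch size relates to the 72-hour initial replication window, a rough back-of-the-envelope check might look like the following. It's a simplified sketch, not the Planner's calculation. It ignores compression, ongoing churn, and protocol overhead, and the function name and sample figures are hypothetical.

```python
def initial_replication_hours(total_data_gb: float, bandwidth_mbps: float) -> float:
    """Estimate hours needed to copy total_data_gb over a bandwidth_mbps link."""
    megabits = total_data_gb * 8 * 1000  # decimal units: 1 GB = 8000 megabits
    seconds = megabits / bandwidth_mbps
    return seconds / 3600


# Hypothetical batch: 10,000 GB of disk data over a 500 Mbps link.
hours = initial_replication_hours(10_000, 500)
fits_window = hours <= 72  # 72 hours is the default window mentioned above
print(round(hours, 1), fits_window)
```

If the estimate exceeds the window, you can reduce the batch size or increase bandwidth. The Deployment Planner report remains the reference for actual batch sizing.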
Plan Azure subscriptions and quotas
Make sure that the quotas available in the target subscription are sufficient to handle failover.
Task: Check cores
Details: If cores in the available quota don't equal or exceed the total target count at the time of failover, failovers will fail.
Action: For VMware VMs, check that you have enough cores in the target subscription to meet the Deployment Planner core recommendation. For physical servers, check that Azure cores meet your manual estimations. To check quotas, in the Azure portal > Subscription, click Usage + quotas.
Task: Check failover limits
Details: The number of failovers mustn't exceed Site Recovery failover limits.
Action: If failovers exceed the limits, you can add subscriptions and fail over to multiple subscriptions, or increase the quota for a subscription.
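The core check can be expressed as simple arithmetic. The sketch below assumes you've read the quota limit and current usage from the Usage + quotas page yourself. The function name and the sample numbers are hypothetical.

```python
def core_shortfall(quota_limit: int, cores_in_use: int, cores_required: int) -> int:
    """Return how many cores are missing for failover (0 means quota is sufficient)."""
    available = quota_limit - cores_in_use
    return max(0, cores_required - available)


# Hypothetical subscription: 350-core quota, 100 cores already in use.
print(core_shortfall(350, 100, 200))  # enough headroom
print(core_shortfall(350, 100, 300))  # short by some cores
```

A nonzero result indicates that you'd need to request a quota increase or split the failover across subscriptions before failing over.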
Failover limits
The limits indicate the number of failovers that are supported by Site Recovery within one hour, assuming three disks per machine.
What does comply mean? To start an Azure VM, Azure requires some drivers to be in boot start state, and services like DHCP to be set to start automatically.
Machines that comply will already have these settings in place.
For machines running Windows, you can proactively check compliance, and make them compliant if needed. Learn more.
Linux machines are only brought into compliance at the time of failover.
Azure VM limits (managed disk failover), by whether the machine complies with Azure:
Yes: 2000
No: 1000
Limits assume that minimal other jobs are in progress in the target region for the subscription.
Some Azure regions are smaller, and might have slightly lower limits.
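As a rough planning aid, you could estimate how many subscriptions are needed to fail over a mixed set of machines within one hour. The sketch below uses the limits of 2000 (machines that comply) and 1000 (machines that don't comply). Treating the two limits as shares of a single hourly budget is an assumption made here for illustration, and not something Site Recovery documents. The function name is hypothetical.

```python
import math

LIMIT_COMPLIANT = 2000      # failovers per hour for machines that comply
LIMIT_NON_COMPLIANT = 1000  # failovers per hour for machines that don't comply


def subscriptions_needed(compliant: int, non_compliant: int) -> int:
    """Estimate subscriptions needed to fail over all machines within one hour."""
    # Assumption: each machine consumes a fraction of one subscription's hourly limit.
    load = compliant / LIMIT_COMPLIANT + non_compliant / LIMIT_NON_COMPLIANT
    return max(1, math.ceil(load))


print(subscriptions_needed(1500, 800))
```

Because smaller regions and concurrent jobs can lower the real limits, treat the result as a lower bound and not a guarantee.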
Plan infrastructure and VM connectivity
After failover to Azure you need your workloads to operate as they did on-premises, and to enable users to access workloads running on the Azure VMs.
Learn more about failing over your Active Directory or DNS on-premises infrastructure to Azure.
Learn more about preparing to connect to Azure VMs after failover.
Plan for source capacity and requirements
It's important that you have sufficient configuration servers and scale-out process servers to meet capacity requirements. As you begin your large-scale deployment, start off with a single configuration server, and a single scale-out process server. As you reach the prescribed limits, add additional servers.
Note
For VMware VMs, the Deployment Planner makes some recommendations about the configuration and process servers you need. We recommend that you use the tables included in the following procedures, instead of following the Deployment Planner recommendation.
Set up a configuration server
Configuration server capacity is affected by the number of machines replicating, and not by data churn rate. To figure out whether you need additional configuration servers, use these defined VM limits.
CPU: 8 vCPUs (2 sockets * 4 cores @ 2.5 GHz)
Memory: 16 GB
Cache disk: 600 GB
Replicated machine limit: Up to 550 machines. Assumes that each machine has three disks of 100 GB each.
These limits are based on a configuration server setup using an OVF template.
The limits assume that you're not using the process server that's running by default on the configuration server.
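Given the limit of 550 replicated machines per configuration server, the number of servers for a deployment can be estimated as follows. This is a minimal sketch with a hypothetical function name. It inherits the assumption of three 100-GB disks per machine, so adjust the limit if your machines differ significantly.

```python
import math

MACHINES_PER_CONFIG_SERVER = 550  # limit for the 8 vCPU / 16 GB / 600 GB configuration


def config_servers_needed(replicated_machines: int) -> int:
    """Estimate the number of configuration servers for a given machine count."""
    if replicated_machines <= 0:
        return 0
    return math.ceil(replicated_machines / MACHINES_PER_CONFIG_SERVER)


print(config_servers_needed(1200))
```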
If you need to add a new configuration server, follow these instructions:
Set up a configuration server manually for physical servers, or for VMware deployments that can't use an OVF template.
As you set up a configuration server, note that:
Consider the subscription and vault in which the configuration server resides, since these shouldn't be changed after setup. If you do need to change the vault, you have to disassociate the configuration server from the vault and reregister it. This stops replication of VMs in the vault.
If you want to set up a configuration server with multiple network adapters, you should do this during setup. You can't do this after registering the configuration server in the vault.
Set up a process server
Process server capacity is affected by data churn rates, and not by the number of machines enabled for replication.
For large deployments you should always have at least one scale-out process server.
To figure out whether you need additional servers, use the following table.
We recommend that you add a server with the highest spec.
For physical machines, we recommend you identify batches based on machines that have a similar size and amount of data, and on available network throughput. The aim is to batch machines that are likely to finish their initial replication in around the same amount of time.
If disk churn for a machine is high, or exceeds the limits in the Deployment Planner, you can move non-critical files you don't need to replicate (such as log dumps or temp files) off the machine. For VMware VMs, you can move these files to a separate disk, and then exclude that disk from replication.
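The batching guidance for physical machines (group machines of similar size so that initial replication finishes at around the same time) can be sketched as a sort followed by chunking. This is one simple approach under stated assumptions, not a Site Recovery feature. The machine names, data sizes, and batch size are hypothetical, and network throughput isn't modeled.

```python
from typing import Dict, List


def batch_by_size(machines: Dict[str, float], batch_size: int) -> List[List[str]]:
    """Group machine names into batches of similar data size (in GB)."""
    # Sorting by size places machines with similar amounts of data next to each other.
    ordered = sorted(machines, key=lambda name: machines[name])
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


# Hypothetical inventory: machine name -> data size in GB.
inventory = {"db1": 900, "web1": 100, "web2": 120, "db2": 850}
batches = batch_by_size(inventory, 2)
print(batches)
```

In this example the small web servers end up in one batch and the large database servers in another, so neither batch waits on a single much larger machine.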
Add Azure Automation runbook scripts to recovery plans, to automate any manual tasks on Azure. Typical tasks include configuring load balancers and updating DNS. Learn more about runbooks.
Before failover, prepare Windows machines so that they comply with the Azure environment. Failover limits are higher for machines that comply. Learn more.