Site Recovery - Azure Regions - Failback

Question


Nathan Farrar 67

I'm finding that failover (and test failover) in Azure Site Recovery makes total sense and works well, but failback is a convoluted process that makes Site Recovery difficult to work with. I'm looking for input on this.

What I'd expect to be able to do is use a recovery plan to fail over a group of VMs into a separate region. I'd also expect to be able to fail those resources back simply, but that doesn't appear to be the case at all. Test failover is a useful tool, but it can cause issues and has limitations. Testing a single server isn't very useful, and to get a true test you'd need to make sure DNS is working and on-prem sites can access resources in Azure. Test failovers keep both machines running, which can cause DNS issues and prevents a true test of a DR scenario. It's a theoretical test IMO, not a true test (and doesn't pass as a true test for most companies).

Here is how I understand the only way to do a true failover and failback:

Assumptions - two regions with non-overlapping VNETs, an existing domain controller in both and VPN access to both.

1. Fail over VM(s) to the secondary location (the source VM is shut down, and a new VM is brought online and registers with the local DC for DNS).
2. Confirm everything is working for the application and remote access via VPN is good. - failover works well -
3. Commit failover - unable to change recovery point after committing -
4. Now the VM is running in the backup/recovery region.
5. Must now delete source VM resources <- this is a pain point
6. Re-protect VM(s) running in the backup/recovery region <- VMs will be replicated back to the original source region; must wait for replication to sync

-- "Failback" Process --

1. Fail over VM(s) again to the original source region
2. Confirm everything is working again
3. Delete VM resources in backup/recovery region.
4. Re-protect VM(s) now running in the original source region - VMs will be replicated to the original backup/recovery region
5. Delete VM resources in backup/recovery region.

Depending on the size and number of VMs, testing failover could be a very involved event with significant risk of issues due to all the steps. Recovery plans would seem to be a good solution, but they do not survive the "failback" process and are essentially useless after they fail over VMs: they only work in one direction and seem to lose track of what's going on after one use.

I'm using this Microsoft guide:
https://learn.microsoft.com/en-us/azure/site-recovery/azure-to-azure-tutorial-failover-failback

Seems Site Recovery is a half-baked solution for BC/DR, at least in this use case. I'm sure I'll be told to use the test failover option, but that won't work as a true test.

Thoughts? Is there a failback process I'm missing? I'm sure I can script some automation but I want to exhaust native built-in processes first.

Thanks!

Nathan Farrar 67 Reputation points

2021-08-10T14:42:15.34+00:00

Thank you for the response. This cleanup of resources isn't clearly outlined in the Microsoft document; perhaps it is wrong or needs a note about this topic. It states in step 10 of failover:

"Site Recovery doesn't clean up the source VM after failover. You need to do that manually." Perhaps it should note (* Do not manually clean up source VM if you plan to fail resources back) ?
https://learn.microsoft.com/en-us/azure/site-recovery/azure-to-azure-tutorial-failover-failback

I understand now that it is not really a "failback"; it is more of a rebuilding of the recovery in the opposite direction. Once you 're-protect' a resource, you lose any previous sync points, because a re-protect also initiates a commit (locks in the current recovery point). To return the resource to the original location it needs to be re-protected, which really just starts the whole sync process all over again.

It makes sense. It also looks like the process does delete resources as failover/failback occurs. Previously I did end up with a stopped machine and a running machine with the same name in different regions. Using recovery plans also seems to cause a cleanup of resources. When I fail over an individual VM, it does leave resources remaining. That is an interesting detail that isn't documented, or at least I haven't found it yet.

Thanks for your help, I'll dig into automating this a bit.
SadiqhAhmed-MSFT 49,331 Reputation points Microsoft Employee Moderator

2021-08-10T19:26:41.95+00:00

@Anonymous Thanks for the feedback. We will review the documentation with the content owner and update as appropriate.

Thanks once again for your valuable feedback! :)

Accepted answer

Answer 1

Hello @Anonymous - Thank you for reaching out!

#5. Must now delete source VM resources <- this is a pain point

This is not needed. If you leave the source VM in a shut-down state, it can be reused as the failed-back VM, saving you replication dollars, because we identify and replicate back only the changes.

-- "Failback" Process --

#3. Delete VM resources in backup/recovery region.

This is an unnecessary step. When you click on 'Re-protect' after failback, ASR cleans up the DR region for you.

#5. Delete VM resources in backup/recovery region.

You should not delete VM resources in the DR region after you re-protect from the primary region to the DR region, as your failover will fail if you do so.
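To make the corrected sequence easier to follow, here is a minimal sketch of the round trip as a small state machine. It is a toy model only: it does not call Azure or any Site Recovery API, and the class `ReplicatedVm`, the `State` names, the sync labels, and the region names `region-a` / `region-b` are all hypothetical names chosen for illustration. It encodes just the points made above: no manual deletes are needed, a shut-down copy left in the other region lets re-protect replicate only the changes, and deleting the target-side resources after re-protect makes the next failover fail.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    PROTECTED = "protected"      # replicating from the active region to the standby region
    FAILED_OVER = "failed over"  # running in the former standby region, not yet committed
    COMMITTED = "committed"      # failover committed; the recovery point is fixed


@dataclass
class ReplicatedVm:
    """Hypothetical model of one replicated VM; not an Azure SDK type."""
    name: str
    active_region: str
    standby_region: str
    state: State = State.PROTECTED
    standby_copy_exists: bool = True  # replica or shut-down VM in the standby region

    def failover(self) -> None:
        if self.state is not State.PROTECTED:
            raise RuntimeError(f"{self.name}: re-protect before failing over again")
        if not self.standby_copy_exists:
            raise RuntimeError(f"{self.name}: target resources were deleted, failover fails")
        # The regions swap roles; the old source VM stays behind, shut down.
        self.active_region, self.standby_region = self.standby_region, self.active_region
        self.state = State.FAILED_OVER

    def commit(self) -> None:
        if self.state is not State.FAILED_OVER:
            raise RuntimeError(f"{self.name}: nothing to commit")
        self.state = State.COMMITTED

    def reprotect(self) -> str:
        if self.state is not State.COMMITTED:
            raise RuntimeError(f"{self.name}: commit the failover before re-protecting")
        # Assumption taken from the answer: a leftover copy means only changes are sent.
        sync = "changes only" if self.standby_copy_exists else "full copy"
        self.standby_copy_exists = True
        self.state = State.PROTECTED
        return sync

    def delete_standby_resources(self) -> None:
        # The manual cleanup step that the answer says to skip.
        self.standby_copy_exists = False


vm = ReplicatedVm("app01", active_region="region-a", standby_region="region-b")

# Failover: region-a -> region-b, then re-protect without deleting anything.
vm.failover()
vm.commit()
first_sync = vm.reprotect()

# "Failback" is the same three operations in the opposite direction.
vm.failover()
vm.commit()
second_sync = vm.reprotect()

print(vm.active_region, vm.state.value, first_sync, second_sync)
```

In this model a full round trip is the same three operations (fail over, commit, re-protect) run twice, which is how I read the guidance above. Calling `delete_standby_resources()` while the VM is protected makes the next `failover()` raise, mirroring the warning in #5.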

Depending on the size of the VMs and number of VMs, testing failover could be a very involved event with significant risk of issues due to all the steps. Recovery plans would seem to be a good solution but they do not survive the "failback" process and are essentially useless after they failover VMs, they only do one direction and seem to simply lose track of what's going on after one use.

Recovery Plans do allow you to re-protect and fail back VMs to your source region. It is a bi-directional tool.

Hope this answers your questions!

If the response helped, do "Accept Answer" and up-vote it.
