I'm finding that failover (and test failover) in Azure Site Recovery makes total sense and works well, but failback is a convoluted process that really makes Site Recovery difficult to work with. Looking for input on this.
What I'd expect to be able to do is use a recovery plan to fail over a group of VMs into a separate region. I'd also expect to be able to fail those resources back simply, but that doesn't appear to be the case at all. Test failover is a useful tool but can cause issues and has limitations. Testing a single server isn't very useful, and to get a true test you'd need to make sure DNS is working and on-prem sites can access resources in Azure. Test failovers keep both machines running, which can cause DNS issues and prevents truly testing a DR scenario. It's a theoretical test IMO, not a true test (and doesn't pass as a true test for most companies).
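For the DNS part of a true test, one check that can be scripted is whether a VM's hostname resolves to an address inside the recovery region's VNet after failover. This is only a rough sketch using the Python standard library. The function names, and any hostname and CIDR range you pass in, are placeholders I made up for illustration, not anything Site Recovery provides.

```python
import ipaddress
import socket


def address_in_range(address: str, cidr: str) -> bool:
    """Return True if the IP address falls inside the given CIDR range."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(cidr)


def resolves_into(hostname: str, cidr: str) -> bool:
    """Resolve hostname via the system resolver and check that every
    IPv4 address returned sits inside the expected VNet range.

    Raises socket.gaierror if the name does not resolve at all.
    """
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for IPv4 the sockaddr is (address, port).
    addresses = {info[4][0] for info in infos}
    return bool(addresses) and all(address_in_range(a, cidr) for a in addresses)
```

Run from an on-prem machine over the VPN, something like `resolves_into("app01.example.internal", "10.2.0.0/16")` (both values hypothetical) would show whether clients are being pointed at the failed-over VM or still at the source.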
Here is how I understand the only way to do a true failover and failback:
Assumptions - two regions with non-overlapping VNETs, an existing domain controller in both and VPN access to both.
- Fail over VM(s) to secondary location (source VM is shut down, and a new VM is brought online and registers with the local DC for DNS).
- Confirm everything is working for application and remote access via VPN is good. - failover works well -
- Commit failover - unable to change recovery point after committing -
- Now VM is running in backup/recovery region.
- Must now delete source VM resources <- this is a pain point
- Re-protect VM(s) running in backup/recovery region <- VMs will be replicated back to original source region, must wait for replication to sync
-- "Failback" Process --
- Failover VM(s) again to original source Region
- Confirm everything is working again
- Delete VM resources in backup/recovery region.
- Re-protect VM(s) now running in the original source region - VMs will be replicated to the original backup/recovery region
- Delete VM resources in backup/recovery region.
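To make the ordering of the steps above easier to reason about (and as a starting point if I end up scripting it), here is a minimal sketch of the round trip as a small state machine. It is pure Python and does not call Azure at all. The state names, action names, VM name, and region labels are all my own placeholders, and the real calls would have to be filled in where each action is applied.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    PROTECTED = "protected"      # replicating to the standby region
    FAILED_OVER = "failed_over"  # running in the other region, not yet committed
    COMMITTED = "committed"      # failover committed, old VM resources still exist
    CLEANED = "cleaned"          # old VM resources deleted, not yet re-protected


# Allowed (state, action) pairs, following the order of the steps listed above.
TRANSITIONS = {
    (State.PROTECTED, "failover"): State.FAILED_OVER,
    (State.FAILED_OVER, "commit"): State.COMMITTED,
    (State.COMMITTED, "cleanup"): State.CLEANED,
    (State.CLEANED, "reprotect"): State.PROTECTED,
}

ONE_DIRECTION = ["failover", "commit", "cleanup", "reprotect"]


@dataclass
class ReplicatedVm:
    name: str
    active_region: str
    standby_region: str
    state: State = State.PROTECTED
    log: list = field(default_factory=list)

    def apply(self, action: str) -> None:
        """Apply one step, rejecting anything that is out of order."""
        key = (self.state, action)
        if key not in TRANSITIONS:
            raise ValueError(f"{action!r} is not allowed while {self.state.value}")
        if action == "failover":
            # The VM now runs in what used to be the standby region.
            self.active_region, self.standby_region = (
                self.standby_region,
                self.active_region,
            )
        self.state = TRANSITIONS[key]
        self.log.append(f"{action}: active={self.active_region}")


def round_trip(vm: ReplicatedVm) -> ReplicatedVm:
    """Failover followed by "failback": the same four steps, run twice."""
    for action in ONE_DIRECTION * 2:
        vm.apply(action)
    return vm
```

Calling `round_trip(ReplicatedVm("app01", "region-a", "region-b"))` walks through eight steps and ends with the VM protected and active in region-a again, which is the point: "failback" is just a second full failover in the opposite direction.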
Depending on the size and number of VMs, testing failover could be a very involved event with significant risk of issues due to all the steps. Recovery plans would seem to be a good solution, but they do not survive the "failback" process and are essentially useless after they fail over VMs. They only work in one direction and seem to simply lose track of what's going on after one use.
I'm using this Microsoft guide:
https://learn.microsoft.com/en-us/azure/site-recovery/azure-to-azure-tutorial-failover-failback
Seems Site Recovery is a half-baked solution for BC/DR, at least in this use case. I'm sure I'll be told to use the test failover option, but that won't work as a true test.
Thoughts? Is there a failback process I'm missing? I'm sure I can script some automation but I want to exhaust native built-in processes first.
Thanks!