Azure Database for MySQL | Fail-back to primary region

Question

Let us say we have a general purpose Azure Database for MySQL instance in a primary region (let's say AU East) with geo-replication set up on backups, and the region went down. This resulted in the customer executing their failover process, which amongst other things would spin up an Azure Database for MySQL instance in the secondary region (playing along with my example, AU Southeast) from the geo-replicated backup.

How does failback to primary work in this scenario? After a few hours, AU East comes back online and the customer wants to execute their failback to primary process. How do they ensure that the transactions and other changes made to the database while it was being served from AU Southeast get “synced” to the AU East instance before the failback is executed? Is any data loss to be expected and, if yes, what are the influencing parameters and best practices to minimise it?

(I have posted this against the docs page but thought it would be useful to also post here as other customers may have similar experiences / scenarios)

Accepted Answer

Hi @SayanGhosh - With the fail-back to the primary region (after the disruptive event), the customer would essentially reverse any manual or scripted fail-over steps to restore the secondary database to the primary. The amount of data lost depends on the amount of time/steps taken to fail over to the secondary. The document Understand business continuity in Azure Database for MySQL (link) describes 3 key indicators for estimating the recovery time and how long the solution in question can tolerate an outage with the business continuity method being discussed here.

The data potentially missing from the restored database would be from the period between the last backup taken before the outage and the time the solution went down. As an example, a backup may have been taken 8 hours prior to the outage and the solution may have been down for 30 minutes as a result of the outage (30 minutes to perform a geo-restore of the database backup taken 8 hours prior). In this case, you are missing roughly 8.5 hours of solution activity/transactions in the secondary server instance.
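To make the arithmetic in the example above explicit, here is a minimal Python sketch. The function name `recovery_gap` and the timestamps are hypothetical and only illustrate the calculation; they are not part of any Azure tooling. It keeps the two components apart: the window between the last backup and the outage, and the time spent performing the geo-restore.

```python
from datetime import datetime, timedelta


def recovery_gap(last_backup, outage_start, service_restored):
    """Return (backup_to_outage, downtime, total_gap) as timedeltas."""
    # Changes made after the last backup are not in the restored database.
    backup_to_outage = outage_start - last_backup
    # Time the solution was unavailable while the geo-restore ran.
    downtime = service_restored - outage_start
    return backup_to_outage, downtime, backup_to_outage + downtime


# Hypothetical timestamps matching the example: backup 8 hours before the
# outage, service back 30 minutes after it.
outage = datetime(2020, 1, 1, 12, 0)
loss, down, total = recovery_gap(
    last_backup=outage - timedelta(hours=8),
    outage_start=outage,
    service_restored=outage + timedelta(minutes=30),
)
print(loss, down, total)
```

The total of the two components is the roughly 8.5 hours quoted above.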

If very little to no data loss is a requirement of the solution, please consider using Cross-region read replicas (link). This is a near real-time capability where secondary servers (read replicas) exist in a paired region; a replica is manually switched to a master node, and the secondary can then support primary solution workloads. The geo-restore from geo-replicated backups option is suitable if an hour of data loss is permissible.

Please let me know if you have additional questions.

Regards,
Mike

Answer

(This really is a follow-up question / ideas to the answer above as opposed to another answer, posted because comments have a character limit.)

Hi @Zagato36 -

Many thanks for your detailed response, it is sincerely appreciated. I have read through your response a few times and based on that my takeaway is the following -

Scenario 1

The initial primary instance (let's call it mysql-primary-a) has geo-redundancy enabled, but no cross-region replicas.

The primary region fails and goes down. As part of the customer's DR plan, they provision a secondary MySQL instance (let's call it mysql-secondary) in the secondary region from a geo-redundant backup. The recovery point will be the time of the last backup, which for general purpose is guaranteed to be less than an hour ago. The time to recovery from the initiation of the DR failover (discounting any manual delays or errors) is basically the time taken to spin up mysql-secondary and restore it from the geo-redundant backup.

Then, once the customer knows the primary region is back up and is failing back other workloads, such as ASR VMs, there is essentially no clean approach (short of a manual dump / restore) to go back to mysql-primary-a and retain the data written while the "DR mode" was working. So a cleaner option would be to treat mysql-primary-a as cattle, not a pet. We would be better off going to mysql-secondary and creating a cross-region read replica, mysql-primary-b, in the primary region. Then we just reverse the roles so that mysql-primary-b becomes the master and mysql-secondary the subordinate instance, which can then be deleted to save cost. This way, the data loss can be minimised while failing back.
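As an illustration of what "minimised" could mean for the role reversal, here is a small Python sketch of a pre-promotion check. It rests on my own assumptions: because the replication is asynchronous, the replica can trail its source, so the check only passes once writes to mysql-secondary have been stopped and the measured lag has drained. `ReplicaState` and `safe_to_promote` are hypothetical names, not an Azure API, and the lag value would have to come from whatever replication lag measurement is available.

```python
from dataclasses import dataclass


@dataclass
class ReplicaState:
    """Hypothetical snapshot of the replica (mysql-primary-b) before promotion."""
    writes_stopped_on_source: bool   # application writes to mysql-secondary paused
    replication_lag_seconds: float   # measured lag of the replica behind its source


def safe_to_promote(state: ReplicaState, max_lag_seconds: float = 0.0) -> bool:
    """Return True only when promoting the replica should not drop changes."""
    # If the source still accepts writes, new changes can arrive after the check.
    if not state.writes_stopped_on_source:
        return False
    # Any remaining lag means changes exist on the source but not on the replica.
    return state.replication_lag_seconds <= max_lag_seconds


print(safe_to_promote(ReplicaState(True, 0.0)))    # lag drained, writes stopped
print(safe_to_promote(ReplicaState(True, 12.5)))   # replica still catching up
print(safe_to_promote(ReplicaState(False, 0.0)))   # source still taking writes
```

With this kind of gate, the failback costs a short write pause rather than lost transactions, under the assumptions stated above.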

Scenario 2

Basically, we keep mysql-secondary as a cross-region replica at all times, so that we save not only the time to provision the instance but also the data loss between the last geo-replicated backup and the outage (the data loss of less than one hour). The other bits remain much the same.

Could you kindly confirm if the interpretation reflects your message above and how the service operates?

What I also could not find is any article on cross-region read replicas that outlines to what quantifiable / SLA-measured level the asynchronous binary log replication method improves the RPO, which you have alluded to. I would understand if this is not guaranteed but is expected to improve the RPO purely based on the architectural parameters, but it would be good to have all the available information to accurately bring all options to the customer, as you'd understand. Anything available on this in terms of documentation will be appreciated.
