Azure Database for MySQL | Recovery time objective guarantees with GRS backup replication

Sayan Ghosh 306 Reputation points Microsoft Employee

This question is loosely related to a previous question about RPO; this one is about RTO. We want to understand whether there are any RTO guarantees when we use the GRS backup of Azure Database for MySQL to restore into a secondary region.

Let me elaborate. In general, spinning up a new service takes a few minutes (in our case, 16 minutes), which is quite acceptable in our scenario. However, our concerns are:

  • Our secondary Azure region pair is not under unusual load when we test.
  • In the event of a primary region failure, that will change: everyone in our primary region will try to bring up their apps and services in our secondary, which is the designated region pair for the primary.
  • Thus, we expect capacity constraints to come into play in the paired region.
  • The question is: does this service provide any RTO SLA for an Azure MySQL database that has GRS backup enabled?
  • A good reference / model where this is available is Azure Site Recovery. As per the SLA, "For each Protected Instance configured for Azure-to-Azure Failover, we guarantee a two-hour Recovery Time Objective." - which to me reads like "if you use ASR, you have a guaranteed 2-hour RTO, so ASR-protected VMs / images are guaranteed to be provisioned within that window".

Any pointers will be great!

Azure Database for MySQL
An Azure managed MySQL database service for app development and deployment.

Accepted answer
  1. Sayan Ghosh 306 Reputation points Microsoft Employee

    ** This is a comment; since I cannot seem to post one against the answer using Edge, I had to post it as an answer. **

    Hi @Navtej Singh Saini - thanks for your response. After reviewing it, I understand you are referring to the scenario where the customer keeps a read replica of MySQL in the secondary region and promotes it to primary in the event of a disaster. I understand that capacity will not be a concern when this option is chosen, as the replica is already provisioned. It is definitely a consideration we have in mind. However, the client sees this as a premium option, i.e. a hot DR where the secondary is always provisioned and they pay for both compute and storage.

    The question is more about the other option, i.e. a MySQL database provisioned in the primary with only GRS backup (no replicas in the secondary). In that case we pay only for storage in the secondary region, and therefore need to provision a secondary instance when disaster strikes the primary. We understand that the time to provision a MySQL instance and restore the GRS backup to it is part of the RTO - that's all good. What we are after is whether there are any capacity guarantees along the same lines as ASR.
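
    To make the concern concrete, here is a rough sketch of how the geo-restore RTO could be budgeted. Only the 16-minute provisioning time comes from our tests; the cut-over time and the contention multiplier are purely hypothetical placeholders, and the 120-minute target is borrowed from the ASR SLA for comparison only, not something the MySQL service promises.

```python
from dataclasses import dataclass


@dataclass
class GeoRestorePlan:
    provision_minutes: float   # measured under normal load (16 in our tests)
    cutover_minutes: float     # hypothetical: connection-string switch, validation
    contention_factor: float   # hypothetical slowdown in a congested paired region

    def estimated_rto_minutes(self) -> float:
        # Provisioning is the part exposed to regional capacity pressure,
        # so only that term is scaled by the contention factor.
        return self.provision_minutes * self.contention_factor + self.cutover_minutes


def meets_target(plan: GeoRestorePlan, target_minutes: float) -> bool:
    return plan.estimated_rto_minutes() <= target_minutes


TARGET_MINUTES = 120  # comparison figure taken from the ASR SLA, not a MySQL guarantee

normal = GeoRestorePlan(provision_minutes=16, cutover_minutes=10, contention_factor=1.0)
stressed = GeoRestorePlan(provision_minutes=16, cutover_minutes=10, contention_factor=8.0)

print(normal.estimated_rto_minutes(), meets_target(normal, TARGET_MINUTES))
print(stressed.estimated_rto_minutes(), meets_target(stressed, TARGET_MINUTES))
```

    Without a capacity guarantee, the contention factor is unbounded, which is exactly why the estimate cannot be turned into a commitment.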

    So essentially we are trying to understand the trade-offs; after all, architecture is a business of trade-offs :). If there really are no additional guarantees of capacity in the secondary during an outage of the primary (which is something ASR seems to provide), we will probably go with the read replica, and the additional cost will be well worth it. But we do want to ensure we've understood our options before making that decision. Any further clarification will be sincerely appreciated.

1 additional answer

  1. Navtej Singh Saini 4,171 Reputation points

    @Sayan Ghosh The Recovery Time Objective (RTO), as explained here, is the maximum acceptable time before the application fully recovers after the disruptive event.

    As this article on failover explains, the RTO will depend on how much time you take to complete the failover.

    Now we come to the capacity constraints, where the recommendation for the replica pair is as follows.

    The recommendation is to give the replica server resources equal to or greater than those of the primary, so that the scenario you describe will not happen. Additionally, if the resources are not matched, there would be replica lag as well.

    Hope this helps. Please elaborate on your concern so we can discuss it further.

    Navtej S
