Azure SQL Failover Group and Geo Replication Usage

Question

Azure SQL Failover Group and Geo Replication Usage

Oleksandr Tedikov 26

Hello,

I have the following application and database configuration:

4 Regions - Region A (main), Region B, Region C, Region D (close to Region A)
3 Web APIs - Region A, Region B, Region C
4 Azure SQL Servers - Region A, Region B, Region C, Region D
4 Azure SQL Databases - Region A, Region B, Region C, Region D
1 Azure Front Door - Routes Requests to Geographically Closest Web API

I want to achieve High Availability with near Zero Down Time and near Zero Data Loss

Failover Group between SQL Server in Region A (SQL Database Region A (Primary)) and SQL Server in Region D (SQL Database Region D (Secondary))
Geo-Replication between SQL Database Region A (Primary) and SQL Database Region B (Secondary) and SQL Database Region C (Secondary)

Web API A has both Read-Write and Read-Only Connections to Failover Group Read-Write Listener Endpoint and Failover Group Read Listener Endpoint
Web API B has both Read-Write and Read-Only Connections to Failover Group Read-Write Listener Endpoint and SQL Database Region B Endpoint
Web API C has both Read-Write and Read-Only Connections to Failover Group Read-Write Listener Endpoint and SQL Database Region C Endpoint

When either Web API Region A, B or C fails, Azure Front Door will not route Request to Web API until Web API gets back.
When either SQL Database Region B or C fails, Web API Region B or C will fail, so Azure Front Door will not route Request to Web API B or C until SQL Database Region B or C gets back.
When SQL Database Region A fails, SQL Database Region D will take over as part of Failover Group until SQL Database Region A gets back.

Is this application and database configuration sufficient?

Also one more question, in case SQL Database Region A fails, and SQL Database Region D will take over as part of Failover Group how SQL Database Region B and SQL Database Region C could still be replicated, since they were configured as Replicas from SQL Database Region A, not SQL Database Region D?

Thank you

Anurag Sharma 17,631 Reputation points

2021-07-30T06:27:22.097+00:00

Hi @Oleksandr Tedikov , just wanted to follow up on the issue to check if any more details needed. If answer helped you can mark it 'Accepted'
Oleksandr Tedikov 26 Reputation points

2021-08-15T18:04:07.277+00:00
Thank you @AnuragSharma-MSFT,

Please clarify the following:

SQL Database Region A fails, and comes back in 30 minutes.

In thise case SQL Database Region D does not take over?

Web API A, B, C Requests which Write to SQL Database Region A are failing for 30 minutes?

If it is not appropriate behaviour for us to have 30 minutes Failed Requests, should we as soon as we want make Manual Failover through Portal, so that SQL Database Region D takes over?

SQL Database Region A fails, and does not come back in 1 hour.

SQL Database Region D does takes over?

Web API A, B, C Requests which Write to SQL Database Region A then Write to SQL Database Region D instead?

If it is not appropriate behaviour for us to have 1 hour Failed Requests, should we as soon as we want make Manual Failover through Portal, so that SQL Database Region D takes over?

My main question is how to react to SQL Database Region A Failure, so to have always Web API A, B, C Requests Write to either SQL Database Region A or SQL Database Region D, and business not to wait up to 1 hour until either SQL Database Region A comes back or SQL Database Region D takes over.

Thank you

Accepted answer

3 additional answers

Your answer

Anurag Sharma 17,631 Reputation points

2021-07-30T06:27:22.097+00:00

Hi @Oleksandr Tedikov , just wanted to follow up on the issue to check if any more details needed. If answer helped you can mark it 'Accepted'
Oleksandr Tedikov 26 Reputation points

2021-08-15T18:04:07.277+00:00

Thank you @AnuragSharma-MSFT,

Please clarify the following:

SQL Database Region A fails, and comes back in 30 minutes.

In thise case SQL Database Region D does not take over?

Web API A, B, C Requests which Write to SQL Database Region A are failing for 30 minutes?

If it is not appropriate behaviour for us to have 30 minutes Failed Requests, should we as soon as we want make Manual Failover through Portal, so that SQL Database Region D takes over?

SQL Database Region A fails, and does not come back in 1 hour.

SQL Database Region D does takes over?

Web API A, B, C Requests which Write to SQL Database Region A then Write to SQL Database Region D instead?

If it is not appropriate behaviour for us to have 1 hour Failed Requests, should we as soon as we want make Manual Failover through Portal, so that SQL Database Region D takes over?

My main question is how to react to SQL Database Region A Failure, so to have always Web API A, B, C Requests Write to either SQL Database Region A or SQL Database Region D, and business not to wait up to 1 hour until either SQL Database Region A comes back or SQL Database Region D takes over.

Thank you

Answer 1

Hi @Oleksandr Tedikov , thanks for the reply.

(Posting as answer as it was crossing the characters limit)

Please find the responses highlighted below:

SQL Database Region A fails, and comes back in 30 minutes.

In thise case SQL Database Region D does not take over? : SQL Database Region D will not take over automatically but if you do manual failover then switch happens
Web API A, B, C Requests which Write to SQL Database Region A are failing for 30 minutes?: Yes till the time switch/mitigation happens they will fail
If it is not appropriate behaviour for us to have 30 minutes Failed Requests, should we as soon as we want make Manual Failover through Portal, so that SQL Database Region D takes over?: Yes RTO of manual failover is 30 seconds, so it will be very quick

SQL Database Region A fails, and does not come back in 1 hour.

SQL Database Region D does takes over?: Yes after grace period, switch will happen
Web API A, B, C Requests which Write to SQL Database Region A then Write to SQL Database Region D instead?: Yes after failover, all of them write to Database D
If it is not appropriate behaviour for us to have 1 hour Failed Requests, should we as soon as we want make Manual Failover through Portal, so that SQL Database Region D takes over?: Yes RTO of manual failover is 30 seconds, so it will be very quick

My main question is how to react to SQL Database Region A Failure, so to have always Web API A, B, C Requests Write to either SQL Database Region A or SQL Database Region D, and business not to wait up to 1 hour until either SQL Database Region A comes back or SQL Database Region D takes over.

To minimize the failover time, fastest way is to do manual failover of database.

Please let us know if this helps or we can discuss further.

----------

If answer helps, please mark it 'Accept Answer'

Anurag Sharma 17,631 Reputation points

2021-08-20T06:09:55.327+00:00

Hi @Oleksandr Tedikov , just wanted to check if you need any more details.
Oleksandr Tedikov 26 Reputation points

2021-08-21T18:27:33.073+00:00

@AnuragSharma-MSFT Thank you much for all clarifications. I am fine for now :)

Answer 2

Hi @Oleksandr Tedikov , welcome to Microsoft Q&A forum.

The plan that your provided looks really good and should cover for almost all the scenarios. However you must be aware about the recovery time objective(RTO) and recovery point objective(RPO) for the business continuity. Both RPO and RTO for these recovery options have some threshold as mentioned in the below article.

Recover a database to the existing server

Also, the data sync for failover and geo-replication is asynchronous replication. In case of planned failover, there is no data loss but in case of unplanned failover, there could be data loss. I would suggest you to go through the below article which has ways to minimize the data loss:

Use auto-failover groups to enable transparent and coordinated failover of multiple databases

Regarding your last query, yes after failover Server D becomes the primary server and both the Server B and C (which were geo-replicas to Server A) will become replica of Server D now. If you want to test this, you can try this by manually failing the Server A and you will observe both Servers B and C now will point to Server D.

Please let me know if this helps or else we can discuss further.

----------

If answer helps, please mark it 'Accept Answer'

Oleksandr Tedikov 26 Reputation points

2021-08-02T06:47:18.757+00:00

Thank you @AnuragSharma-MSFT,

Few following questions:

Since RTO for Auto-Failover Group is 1h and RPO is 5s, what does it practically mean in the following case:

Web API A, B, C receive ~1-5 Web API Requests per second. All these Web API Requests can either Write or Read to SQL Database.
In normal case Web API A, B, C Write to SQL Database Region A, and Read from SQL Database Region A, B, C.

In case SQL Database Region A fails, what will happen and when? When SQL Database Region D take over as part of Failover Group, in up to 1 hour according to RTO?
Does this mean that Web API A, B, C Requests which Write to SQL Database Region A will be failing for up to 1 hour?
If so, should we consider Manual Database Failover as soon as we understand that SQL Database Region A failed? Is there any reliable Health Check we could use?

Regarding "If you want to test this, you can try this by manually failing the Server A and you will observe both Servers B and C now will point to Server D." - What do you mean under manually failing the Server A? Is it Failover or Forced Failover buttons on SQL Server Failover Groups within Azure Portal?

Thank you

Answer 3

Hi @Oleksandr Tedikov , Sorry for it. I think response got missed I posted the response on very next day. Re-posting it again below. Please let me know if this helps.

(Re-posting as answer as it was crossing the characters limit)
In case of outage or failure of Primary region, failover occurs only when the grace period is expired. "By default, a failover group is configured with an automatic failover policy. Azure triggers failover after the failure is detected and the grace period has expired. The system must verify that the outage cannot be mitigated by the built-in high availability infrastructure due to the scale of the impact. If you want to control the failover workflow from the application or manually, you can turn off automatic failover."

This means system still waits for the primary region to get available because of high availability infra. Now this outage verification involves human action hence it needs sometime to come back. But for almost all cases it comes back earlier than an hour. If it does not then failover to another region happens which is 1 hour as RTO.

If needed, we can switch off auto failover to save this grace period time and do a manual failover as it would take few minutes and switch the primary and secondary.

Now to answer your query, yes it could take 1 hour before Server D takes over in case of auto-failover. If we switch the auto-failover off, we can set the alerts to check service health but as it involves manual failover there is a risk factor.

Through portal we can do a manual failover. Forced failover could result in data loss as it triggers instantly while failover would make sure there is no data loss.

Answer 4

Thank you @AnuragSharma-MSFT,

Please clarify the following:

SQL Database Region A fails, and comes back in 30 minutes.

In thise case SQL Database Region D does not take over?
Web API A, B, C Requests which Write to SQL Database Region A are failing for 30 minutes?
If it is not appropriate behaviour for us to have 30 minutes Failed Requests, should we as soon as we want make Manual Failover through Portal, so that SQL Database Region D takes over?

SQL Database Region A fails, and does not come back in 1 hour.

SQL Database Region D does takes over?
Web API A, B, C Requests which Write to SQL Database Region A then Write to SQL Database Region D instead?
If it is not appropriate behaviour for us to have 1 hour Failed Requests, should we as soon as we want make Manual Failover through Portal, so that SQL Database Region D takes over?

My main question is how to react to SQL Database Region A Failure, so to have always Web API A, B, C Requests Write to either SQL Database Region A or SQL Database Region D, and business not to wait up to 1 hour until either SQL Database Region A comes back or SQL Database Region D takes over.

Thank you

Share via

Azure SQL Failover Group and Geo Replication Usage

3 additional answers

Your answer