Azure SQL Failover Group and Geo Replication Usage

Oleksandr Tedikov 26 Reputation points
2021-07-25T13:02:12.62+00:00

Hello,

I have the following application and database configuration:

4 Regions - Region A (main), Region B, Region C, Region D (close to Region A)
3 Web APIs - Region A, Region B, Region C
4 Azure SQL Servers - Region A, Region B, Region C, Region D
4 Azure SQL Databases - Region A, Region B, Region C, Region D
1 Azure Front Door - Routes Requests to Geographically Closest Web API

I want to achieve High Availability with near Zero Down Time and near Zero Data Loss

Failover Group between SQL Server in Region A (SQL Database Region A (Primary)) and SQL Server in Region D (SQL Database Region D (Secondary))
Geo-Replication between SQL Database Region A (Primary) and SQL Database Region B (Secondary) and SQL Database Region C (Secondary)

Web API A has both Read-Write and Read-Only Connections to Failover Group Read-Write Listener Endpoint and Failover Group Read Listener Endpoint
Web API B has both Read-Write and Read-Only Connections to Failover Group Read-Write Listener Endpoint and SQL Database Region B Endpoint
Web API C has both Read-Write and Read-Only Connections to Failover Group Read-Write Listener Endpoint and SQL Database Region C Endpoint

When either Web API Region A, B or C fails, Azure Front Door will not route Request to Web API until Web API gets back.
When either SQL Database Region B or C fails, Web API Region B or C will fail, so Azure Front Door will not route Request to Web API B or C until SQL Database Region B or C gets back.
When SQL Database Region A fails, SQL Database Region D will take over as part of Failover Group until SQL Database Region A gets back.

Is this application and database configuration sufficient?

Also one more question, in case SQL Database Region A fails, and SQL Database Region D will take over as part of Failover Group how SQL Database Region B and SQL Database Region C could still be replicated, since they were configured as Replicas from SQL Database Region A, not SQL Database Region D?

Thank you

Azure SQL Database
{count} vote

Accepted answer
  1. Anurag Sharma 17,631 Reputation points
    2021-08-16T13:30:26.673+00:00

    Hi @Oleksandr Tedikov , thanks for the reply.

    (Posting as answer as it was crossing the characters limit)

    Please find the responses highlighted below:

    SQL Database Region A fails, and comes back in 30 minutes.

    • In thise case SQL Database Region D does not take over? : SQL Database Region D will not take over automatically but if you do manual failover then switch happens
    • Web API A, B, C Requests which Write to SQL Database Region A are failing for 30 minutes?: Yes till the time switch/mitigation happens they will fail
    • If it is not appropriate behaviour for us to have 30 minutes Failed Requests, should we as soon as we want make Manual Failover through Portal, so that SQL Database Region D takes over?: Yes RTO of manual failover is 30 seconds, so it will be very quick

    SQL Database Region A fails, and does not come back in 1 hour.

    • SQL Database Region D does takes over?: Yes after grace period, switch will happen
    • Web API A, B, C Requests which Write to SQL Database Region A then Write to SQL Database Region D instead?: Yes after failover, all of them write to Database D
    • If it is not appropriate behaviour for us to have 1 hour Failed Requests, should we as soon as we want make Manual Failover through Portal, so that SQL Database Region D takes over?: Yes RTO of manual failover is 30 seconds, so it will be very quick

    My main question is how to react to SQL Database Region A Failure, so to have always Web API A, B, C Requests Write to either SQL Database Region A or SQL Database Region D, and business not to wait up to 1 hour until either SQL Database Region A comes back or SQL Database Region D takes over.

    To minimize the failover time, fastest way is to do manual failover of database.

    Please let us know if this helps or we can discuss further.

    ----------

    If answer helps, please mark it 'Accept Answer'


3 additional answers

Sort by: Most helpful
  1. Anurag Sharma 17,631 Reputation points
    2021-07-27T11:09:18.9+00:00

    Hi @Oleksandr Tedikov , welcome to Microsoft Q&A forum.

    The plan that your provided looks really good and should cover for almost all the scenarios. However you must be aware about the recovery time objective(RTO) and recovery point objective(RPO) for the business continuity. Both RPO and RTO for these recovery options have some threshold as mentioned in the below article.

    Recover a database to the existing server

    118247-image.png

    Also, the data sync for failover and geo-replication is asynchronous replication. In case of planned failover, there is no data loss but in case of unplanned failover, there could be data loss. I would suggest you to go through the below article which has ways to minimize the data loss:

    Use auto-failover groups to enable transparent and coordinated failover of multiple databases

    Regarding your last query, yes after failover Server D becomes the primary server and both the Server B and C (which were geo-replicas to Server A) will become replica of Server D now. If you want to test this, you can try this by manually failing the Server A and you will observe both Servers B and C now will point to Server D.

    Please let me know if this helps or else we can discuss further.

    ----------

    If answer helps, please mark it 'Accept Answer'


  2. Anurag Sharma 17,631 Reputation points
    2021-08-09T07:13:15.717+00:00

    Hi @Oleksandr Tedikov , Sorry for it. I think response got missed I posted the response on very next day. Re-posting it again below. Please let me know if this helps.

    (Re-posting as answer as it was crossing the characters limit)
    In case of outage or failure of Primary region, failover occurs only when the grace period is expired. "By default, a failover group is configured with an automatic failover policy. Azure triggers failover after the failure is detected and the grace period has expired. The system must verify that the outage cannot be mitigated by the built-in high availability infrastructure due to the scale of the impact. If you want to control the failover workflow from the application or manually, you can turn off automatic failover."

    This means system still waits for the primary region to get available because of high availability infra. Now this outage verification involves human action hence it needs sometime to come back. But for almost all cases it comes back earlier than an hour. If it does not then failover to another region happens which is 1 hour as RTO.

    If needed, we can switch off auto failover to save this grace period time and do a manual failover as it would take few minutes and switch the primary and secondary.

    Now to answer your query, yes it could take 1 hour before Server D takes over in case of auto-failover. If we switch the auto-failover off, we can set the alerts to check service health but as it involves manual failover there is a risk factor.

    Through portal we can do a manual failover. Forced failover could result in data loss as it triggers instantly while failover would make sure there is no data loss.

    0 comments No comments

  3. Oleksandr Tedikov 26 Reputation points
    2021-08-15T18:06:18.36+00:00

    Thank you @AnuragSharma-MSFT,

    Please clarify the following:

    SQL Database Region A fails, and comes back in 30 minutes.

    • In thise case SQL Database Region D does not take over?
    • Web API A, B, C Requests which Write to SQL Database Region A are failing for 30 minutes?
    • If it is not appropriate behaviour for us to have 30 minutes Failed Requests, should we as soon as we want make Manual Failover through Portal, so that SQL Database Region D takes over?

    SQL Database Region A fails, and does not come back in 1 hour.

    • SQL Database Region D does takes over?
    • Web API A, B, C Requests which Write to SQL Database Region A then Write to SQL Database Region D instead?
    • If it is not appropriate behaviour for us to have 1 hour Failed Requests, should we as soon as we want make Manual Failover through Portal, so that SQL Database Region D takes over?

    My main question is how to react to SQL Database Region A Failure, so to have always Web API A, B, C Requests Write to either SQL Database Region A or SQL Database Region D, and business not to wait up to 1 hour until either SQL Database Region A comes back or SQL Database Region D takes over.

    Thank you

    0 comments No comments

Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.