Large Transaction Interrupted by Failover, Secondary Database Reports REVERTING
If a large transaction in an availability database on the primary replica is interrupted by a failover of the availability group, then after the failover the database on the new secondary (the old primary replica) may report a NOT SYNCHRONIZING or REVERTING state for a long period of time. For example, consider a primary replica hosted on SQLNODE1 and a secondary replica on SQLNODE2, and a failover that occurs while a large ALTER INDEX transaction is running.
After failover has occurred, the secondary is reported as RECOVERING with a REVERTING synchronization state, and access to the readable database during this time will fail with message 922:
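For example, a simple read against the readable secondary during this window fails (an illustrative query; any read access to the database behaves the same way):

-- On SQLNODE1 (the new secondary), while agdb is still recovering
USE agdb;
GO
SELECT TOP (1) name FROM sys.objects;
-- Fails with error 922: the database is being recovered, and access
-- must wait until recovery is finished.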
The large transaction rollback and secondary recovery events must complete before the readable secondary resumes synchronizing and receiving changes from the new primary replica. The following three 'phases' can be tracked to monitor the secondary database as it progresses toward a SYNCHRONIZING/SYNCHRONIZED state.
Roll back the large transaction
In our example, the ALTER INDEX transaction must roll back. Even though failover has succeeded, the large transaction is still rolling back on the new secondary (SQLNODE1). Progress can be monitored in the SQL Server error log on the secondary (SQLNODE1) and looks like the following; the message 'Remote harden of transaction 'ALTER INDEX'...failed' marks the completion of the rollback:
2014-12-10 07:33:47.70 spid39s Nonqualified transactions are being rolled back in database agdb for an AlwaysOn Availability Groups state change. Estimated rollback completion: 0%. This is an informational message only. No user action is required.
...
2014-12-10 07:35:12.77 spid39s Nonqualified transactions are being rolled back in database agdb for an AlwaysOn Availability Groups state change. Estimated rollback completion: 0%. This is an informational message only. No user action is required.
2014-12-10 07:35:14.77 spid39s Nonqualified transactions are being rolled back in database agdb for an AlwaysOn Availability Groups state change. Estimated rollback completion: 0%. This is an informational message only. No user action is required.
2014-12-10 07:35:15.64 spid52 Remote harden of transaction 'ALTER INDEX' (ID 0x000000000000447e 0000:000a40f4) started at Dec 10 2014 7:27AM in database 'agdb' at LSN (464:78600:3) failed.
2014-12-10 07:35:15.64 spid39s Nonqualified transactions are being rolled back in database agdb for an AlwaysOn Availability Groups state change. Estimated rollback completion: 100%. This is an informational message only. No user action is required.
2014-12-10 07:35:15.96 spid36s The state of the local availability replica in availability group 'ag' has changed from 'RESOLVING_NORMAL' to 'SECONDARY_NORMAL'. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error. For more information, see the availability group dashboard, SQL Server error log, Windows Server Failover Cluster management console or Windows Server Failover Cluster log.
The recovery state during the rollback of the large transaction can also be observed by querying sys.dm_hadr_database_replica_states on the secondary.
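A query along the following lines (the column list is illustrative, not exhaustive) shows the database and synchronization state on SQLNODE1:

-- Run on the secondary (SQLNODE1) to observe database and synchronization state
SELECT DB_NAME(drs.database_id) AS database_name,
       drs.database_state_desc,
       drs.synchronization_state_desc,
       drs.log_send_queue_size,
       drs.redo_queue_size
FROM sys.dm_hadr_database_replica_states AS drs
WHERE drs.is_local = 1;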
Transition database to SECONDARY and recover database
Once the transaction has rolled back, the availability database transitions to the SECONDARY role and database recovery begins; because of the large interrupted transaction, recovery itself takes time:
2014-12-10 07:35:15.97 spid39s The availability group database "agdb" is changing roles from "RESOLVING" to "SECONDARY" because the mirroring session or availability group failed over due to role synchronization. This is an informational message only. No user action is required.
2014-12-10 07:35:15.97 spid23s Nonqualified transactions are being rolled back in database agdb for an AlwaysOn Availability Groups state change. Estimated rollback completion: 100%. This is an informational message only. No user action is required.
2014-12-10 07:35:15.97 spid34s A connection for availability group 'ag' from availability replica 'SQLNODE1' with id [203436EF-1D14-4F94-A9B4-8F3F25042328] to 'SQLNODE2' with id [DF629A4A-7CF9-4C5E-BCF3-3F592766F61E] has been successfully established. This is an informational message only. No user action is required.
2014-12-10 07:35:18.34 spid23s Starting up database 'agdb'.
2014-12-10 07:35:41.22 spid23s Recovery of database 'agdb' (16) is 0% complete (approximately 50 seconds remain). Phase 1 of 3. This is an informational message only. No user action is required.
...
2014-12-10 07:35:42.82 spid23s Recovery of database 'agdb' (16) is 3% complete (approximately 46 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.
2014-12-10 07:35:44.83 spid23s Recovery of database 'agdb' (16) is 7% complete (approximately 44 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.
...
2014-12-10 07:36:11.61 spid23s Recovery of database 'agdb' (16) is 28% complete (approximately 76 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.
...
2014-12-10 07:39:00.90 spid23s Recovery of database 'agdb' (16) is 99% complete (approximately 1 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.
2014-12-10 07:39:00.96 spid23s AlwaysOn Availability Groups connection with primary database established for secondary database 'agdb' on the availability replica with Replica ID: {df629a4a-7cf9-4c5e-bcf3-3f592766f61e}. This is an informational message only. No user action is required.
2014-12-10 07:39:00.96 spid23s The recovery LSN (464:71760:1) was identified for the database with ID 16. This is an informational message only. No user action is required.
2014-12-10 07:45:16.78 spid23s Error: 35278, Severity: 17, State: 1.
2014-12-10 07:45:16.78 spid23s Availability database 'agdb', which is in the secondary role, is being restarted to resynchronize with the current primary database. This is an informational message only. No user action is required.
The same sys.dm_hadr_database_replica_states query can be used to observe the recovery state while the database is recovering on the secondary.
In Performance Monitor, these first two phases (transaction rollback and database recovery) can be tracked with the 'SQL Server: Database Replica: Recovery Queue' counter on the new primary, SQLNODE2.
Secondary Database in REVERTING state
After the rollback of the large transaction and the recovery of the availability database at the secondary replica, the availability database at SQLNODE1 is still neither accessible nor receiving new changes from the primary. At this point, querying sys.dm_hadr_database_replica_states on SQLNODE1 reports the synchronization state description as 'REVERTING.' REVERTING is part of the UNDO phase of recovery, during which the secondary must actively request and receive pages from the primary before it can complete.
The REVERTING phase can take a long time to complete. The good news is that its progress can be tracked in Performance Monitor using the 'SQL Server: Database Replica: Log remaining for undo' counter on the secondary (SQLNODE1). Once the transaction rollback and database recovery complete (tracked by 'SQL Server: Database Replica: Recovery Queue' on the primary), 'Log remaining for undo' grows and then begins to drain as the undo work progresses.
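If Performance Monitor is not convenient, the same counters can be polled from T-SQL through sys.dm_os_performance_counters (a sketch; the object name filter below assumes a default instance, and the instance_name filter relies on the Database Replica counters being exposed per database):

-- Poll the AlwaysOn Database Replica counters for agdb.
-- 'Recovery Queue' is most useful on the primary (SQLNODE2);
-- 'Log remaining for undo' is most useful on the secondary (SQLNODE1).
SELECT object_name, counter_name, instance_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Database Replica%'
  AND counter_name IN ('Recovery Queue', 'Log remaining for undo')
  AND instance_name = 'agdb';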
In conclusion, secondary replica synchronization can be delayed following the interruption of a large transaction at the primary replica. By using the SQL Server error log and Performance Monitor, we can monitor progress and gain some assurance that the secondary is moving toward completion, so that it is once again providing redundancy for the primary replica and read access for reports, backups, and so on.