Problems with AlwaysOn, HADR_LOGPROGRESS_SYNC waits and missing log block

Question

Hi Folks!

Need help from you !

We have some issues with AlwaysOn synchronisation on high load system and we can it repeat on PROD and LT environmets(not depend from hardware).

Configuration:

SQL Seerver 2019 EE CU24

Prod- 4 nodes (2 synchronous mode replica, 2 async RO replicas) , 2 Tb memory, 160 cores, We use 2 async replicases for RO queris, 1 sync node for active failover(without RO queries)

test Lt env - 3 nodes (sync not async it is not matter), 2 Tb, 160 cores

Availability Group include 3 Dbs, main db has size about 20 Tb, other dbs has small size about 100-400 Gb

type of load -OLTP.

Problems and symproms:

In generally, all is works good without problems, but in high load moment(usually above 100000 transaction per sec on Primary\secondary node, we have issues with alwaysOn)

At the one moment starting grow in Recovery Queue on secondary nodes,

all queries has wait HADR_LOGPROGRESS_SYNC ,

afret 5-7 minutes secondary replca has error in sql log:

AlwaysOn on Availabilty Groups transport has detected a missing log block foo availabilty database ....Log scan will be restarted- sometime has one records, mostly a lot of rows of this error

after 1-3 seconds next message in the sql log- AlwaysOn Availabilty Group connection with primary database established....

It is happened with one replica at one moment or with two replicas in moment, but the replica without RO queries(active reserve) continue working without problems.

Additionally

during missing block error starting work Flow Control mechanism on primary node

Growing recovery Queue on secondary node perf counter

The db has delay about 10-20 seconds

when load begins to decrease (transaction per seconda below 100 000)the system start work stability , without problems and waits.

Usually, the problem reproduce whe transaction above 100000 per second on primary and secondary node.

Duration wait time of HADR_LOGPROGRESS_SYNC sometimes reaches up to 120 ms

CPU - on primary about 40% , on secondary 60-80%, sometimes about 90% during the error missing block.

Have any idea what it is?

Any advice or help will be appreciated.

Answer

Hi @Viacheslav Nikonorov ,

Thank you for reaching out and welcome to Microsoft Q&A.

The error message:

AlwaysOn on Availabilty Groups transport has detected a missing log block foo availabilty database .... Log scan will be restarted- sometime has one records.

I learned that This issue is fixed in the Cumulative Updates 4 for SQL Server 2019, but there may have been a defect that caused this condition to occur unexpectedly.
For databases utilizing Availability Groups. The wait HADR_LOGPROGRESS_SYNC accumulates when replicas are processing log records and SQL Server needs to wait for replicas to synchronize internally. This is a common wait; small values are acceptable but excessive values should be investigated.
You can check the SQL Server error log for more information and examine wait statistics on the instance for other Availability Group issues. Investigate accordingly. In addition, here is an official document about how to Troubleshoot High HADR_SYNC_COMMIT wait type with Always On Availability Groups, you can learn more about the wait statistic and follow the steps in this article. Hope this can help you resolve the issue.

Feel free to share your issue here if you have any confused.

Best regards,

Lucy Chen

If the answer is the right solution, please click "Accept Answer" and kindly upvote it. If you have extra questions about this answer, please click "Comment".

Note: Please follow the steps in our Documentation to enable e-mail notifications if you want to receive the related email notification for this thread.

https://docs.microsoft.com/en-us/answers/support/email-notifications

Answer

What you are experiencing is most likely due to some issues on the secondary in hardening the log block. In order for the AG to ack the write (sync commit), it needs to write to the place that keeps track of the last hardened lsn. So with high contention that you mentioned it could be likely that there is enough activity that log blocks are missing and need to be rescanned just because of the activity.

You are probably also experiencing HADR_SYNC_COMMIT waits along with this, which is the wait that indicates that you are having problems. The PROGRESS waits are associated with this as well.

You may want to check the Network to ensure that you are not maxing out the bandwidth on that network when these high numbers of transactions start. Also consider faster storage on the replica to facilitate faster writes to durable media.

The redo queue will obviously be large if you are putting that much data through the engine and replicas, so that is a different issue and really is a time based thing. Faster storage helps to get the writes to happen faster, but for the error you are getting I would first check the network.

It also could be a regression bug from the CU 4 in SQL 2019 - https://support.microsoft.com/en-gb/topic/kb4541309-fix-missing-log-block-may-occur-when-you-use-always-on-availability-group-in-sql-server-69d89ad7-ba12-a438-ea63-61159b3e30c5

References:

https://learn.microsoft.com/en-us/archive/blogs/psssql/lesson-learned-from-an-availability-group-performance-case

https://learn.microsoft.com/en-us/archive/blogs/psssql/troubleshooting-data-movement-latency-between-synchronous-commit-always-on-availability-groups

https://learn.microsoft.com/en-us/archive/blogs/sql_server_team/troubleshooting-high-hadr_sync_commit-wait-type-with-always-on-availability-groups

https://learn.microsoft.com/en-us/archive/blogs/psssql/alwayson-hadron-learning-series-hadr_sync_commit-vs-writelog-wait

https://learn.microsoft.com/en-us/archive/blogs/saponsqlserver/sql-server-2012-alwayson-part-12-performance-aspects-and-performance-monitoring-ii

Answer

Hi @Viacheslav Nikonorov ,

Thanks for your kindly feedback.

I saw the article KB4541309, errors are same but we have CU24.

Yes, I noticed you have CU24, so I think there may have been a defect that caused this condition to occur unexpectedly.

Probably , need a deep analysis.

I agree with you. Did you check out this article I provided last answer? You should compare and troubleshoot the issue according to the data in this known thread.

Note: If none of the resources has performance issue, please involve Microsoft Customer Support Team for further investigation.

Best regards,

Lucy Chen

If the answer is the right solution, please click "Accept Answer" and kindly upvote it. If you have extra questions about this answer, please click "Comment".

Note: Please follow the steps in our Documentation to enable e-mail notifications if you want to receive the related email notification for this thread.

https://docs.microsoft.com/en-us/answers/support/email-notifications

Share via

Problems with AlwaysOn, HADR_LOGPROGRESS_SYNC waits and missing log block

3 answers