Why ALWAYSON AG goes to 'NOT SYNCRONIZING' when db are busy

Jiong AN 146 Reputation points
2020-10-27T05:12:05.563+00:00

Everytime my alwayson db is busy, ALL the secondary dbs goes into 'NO SYNCRONIZING' mode. and stays until the server goes quiet again.

From some times ago, this is no longer the case. it stays in 'NO SYNCRONIZING' unless i remote desktop to the db server and open 'Failover Cluster Manager', the app opened and it took about 1 mins for the cluster info, nodes to show up. Then I checked the AG again, dbs start to go back to 'SYNCRONIZED'.

I know MS had a fix about 'NO SYNCRONIZING' at

https://support.microsoft.com/en-au/help/3034679/kb3034679-fix-alwayson-availability-groups-are-reported-as-not-synchro

but I have already patched my 2016 enterprise to the latest CU2, and still having problems.

I have checked error logs, it shows timeout happened.

e.g.
A connection timeout has occurred on a previously established connection to availability replica 'server1' with id [*******]. Either a networking or a firewall issue exists or the availability replica has transitioned to the resolving role.
35206

and

Always On Availability Groups connection with primary database terminated for secondary database 'db2' on the availability replica 'server1' with Replica ID: {******}. This is an informational message only. No user action is required.
35267

....

Strange thing is: it ONLY happens when db is busy. when db is not busy, nothing happens. So the although the error seems like a network issue but in fact is not.

My Questions:

  1. Why does db go to 'NO SYNCRONIZING' and cluster connection timeout when db is busy?
  2. when system is quiet, it stays as 'NO SYNCRONIZING' now, to fix it, i needs to open 'FAILOVER CLUSTER MANAGER', why is it a fix. does it restart some services or rejoined node to cluster? (since I can't see anything is the system logs and cluster logs also shows nothing). If so, can I run a powerscript command to do the same as open app.
SQL Server
SQL Server
A family of Microsoft relational database management and analysis systems for e-commerce, line-of-business, and data warehousing solutions.
13,664 questions
0 comments No comments
{count} votes

Accepted answer
  1. m 4,271 Reputation points
    2020-10-27T07:34:43.183+00:00
    1. Why does db go to 'NO SYNCRONIZING' and cluster connection timeout when db is busy?

    You have already said db is busy, when db is busy, you can think that there are limited resources but for busy db, the process and
    its threads need more time to wait, to recovering and restoring.

    Availability database status:

    Not synchronizing

    If the primary database is in this state, it means that the database is not ready to synchronize its transaction log with the corresponding secondary database.

    If the secondary database is in this state, the database may

    (1) Log synchronization is no longer performed due to connection problems or redo failure,
    (2) The log synchronization with the main database is "suspended",
    (3) Due to role switching, it is in the middle of transition.

    You can check as that run cmd, and then ping listener's ip -t, then failover, there you can see one timeout, then connected, means that it is in switching time.
    35245-20201027ping-ip-t1.jpg

    1. when system is quiet, it stays as 'NO SYNCRONIZING' now, to fix it, i needs to open 'FAILOVER CLUSTER MANAGER', why is it a fix. does it restart some services or rejoined node to cluster? (since I can't see anything is the system logs and cluster logs also shows nothing). If so, can I run a powerscript command to do the same as open app.

    Why do you use latest SP and CU to have a test. Check why your db is always busy, you can classify them in different dbs to sharing the load.

    Note:

    A connection timeout has occurred on a previously established connection to availability replica 'server1' with id [***]. Either a networking or a firewall issue exists or >the availability replica has transitioned to the resolving role.35206

    Note This problem might occur only on very powerful computers and when SQL Server is very busy. For example, in one scenario, this problem occurred on a very busy system with 24 cores.

    Always On Availability Groups connection with primary database terminated for secondary database 'db2' on the availability replica 'server1' with Replica ID: {**}. This is >an informational message only. No user action is required.35267

    This issue occurs because of a deadlock condition between a redo thread and a switch-role thread.

    If your verison is meet the fixed versions, always have these issues, I  recommend you to submit your issue to the Microsoft feedback at this link https://feedback.azure.com/forums/908035-sql-server . This site will serve as a connecting point between you and Microsoft, and ultimately the large community for you and Microsoft to interact with. Your feedback enables Microsoft to offer the best software and deliver superior services, meanwhile you can learn more about and contribute to the exciting projects on Microsoft feedback.

    Or

    If the issue still exists after you have tested, please don't hesitate to ask professional engineers for help, and they will deal with your problem separately and confidentially: https://support.microsoft.com/en-us/assistedsupportproducts

    BR,
    Mia


4 additional answers

Sort by: Most helpful
  1. Jiong AN 146 Reputation points
    2020-10-27T08:54:53.46+00:00

    Hi Mia, thank you very much for your detail analytical explanation:

    In fact my link between primary and secondary has some network delays.
    35295-networkdelay.jpg

    The 'no syncronizing' is happened on secondary.

    Both primary and secondary are VM with 32Gb memory , 16 Cores
    Microsoft SQL Server 2016 (SP2-CU14) (KB4564903) - 13.0.5830.85 (X64) Enterprise (64-bit) on Windows Server 2016 Standard 10.0 <X64> (Build 14393: ) (Hypervisor)
    yes, they are powerful VMs.

    Why my db busy? I see a lot of 'HADR_SYNC_COMMIT' and 'HADR_GROUP_COMMIT'. this could be contributed from the network latency. when a INSERT, DELETE,UPDATE process spend a lot of time waiting in 'HADR_SYNC_COMMIT', it caused a chain reactions for others to wait.

    Before I submit a ticket to MS feedback, i would like to get some info from you see if this is relating to something simple. If not, i will certainly submit it to MS feedback.

    0 comments No comments

  2. m 4,271 Reputation points
    2020-10-27T11:05:17.8+00:00

    Hi @Jiong AN ,

    Why my db busy? I see a lot of 'HADR_SYNC_COMMIT' and 'HADR_GROUP_COMMIT'. this could be contributed from the network latency. when a INSERT, DELETE,UPDATE process spend a lot of time waiting in 'HADR_SYNC_COMMIT', it caused a chain reactions for others to wait.

    Yes. You are right.
    So what you need to do is to upgrade your network.
    Similar case: hadrsyncommit-wait-types

    And more information: sys-dm-os-wait-stats-transact-sql

    HADR_GROUP_COMMIT: Transaction commit processing is waiting to allow a group commit so that multiple commit log records can be put into a single log block. This wait is an expected condition that optimizes the log I/O, capture, and send operations.

    HADR_SYNC_COMMIT: Waiting for transaction commit processing for the synchronized secondary databases to harden the log. This wait is also reflected by the Transaction Delay performance counter. This wait type is expected for synchronized availability groups and indicates the time to send, write, and acknowledge log to the secondary databases.

    BR,
    Mia


    If the answer is helpful, please click "Accept Answer" and upvote it.


  3. m 4,271 Reputation points
    2020-10-27T11:26:06.527+00:00

    Hi @Jiong AN ,

    And if there is one big transaction on your primary is running, you can consider optimize the t_sql or execute them in batch.

    BR,
    Mia


    If the answer is helpful, please click "Accept Answer" and upvote it.

    0 comments No comments

  4. Jiong AN 146 Reputation points
    2020-10-28T02:53:19.227+00:00

    hi Mia
    Thanks again, the transaction was relatively simple INSERT or DELETE, UPDATEs, i have also checked the table rows also small, few hundred rows. Its just a 'HADR_SYNC_COMMIT' that takes time, I have also run query to show total number of waits vs wait type. The HARD*_COMMIT type waits is about more than 50% of total wait time. At times, i took db out of the AG, it just takes queries a lot faster just by itself without Alwayson.

    0 comments No comments

Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.