salves asked prmanhas-MSFT commented

Timeout errors for asynchronous secondary replica


We set up an Always On cluster a few months ago and everything has been working very well.

We ran failover simulations for both the synchronous and the asynchronous replicas, and everything went fine.

However, after setting up some alarms, we have observed recurring error events at different times of the day, including early morning and night.

We did some research, and the root cause is usually either a network problem or a backup routine that freezes the SQL Server processes while it runs.

In our case, the event occurs at different times and has become more frequent, and it affects only the server in Azure (remote), so everything points to something specific to that server.

The two servers connected in synchronous mode on the same network do not generate connection error events; the error is generated only for the server in Azure.

The remote connection to the Azure server has a latency of 10 ms, because we have a direct 1 Gbps link between the datacenter and Azure; in practice it behaves like a LAN.

We replicate approximately 15 databases, totaling approximately 2 TB, between the servers.

We have been monitoring the connection from the on-premises servers to the Azure server to check whether there is also a connectivity failure at the times we receive alarms, but so far we have not identified any.

What surprises us is that the event occurs for the server that is configured as asynchronous because of the multi-subnet network.

I also read that this can happen when many log transactions are queued on the primary side, which delays replication. But since we replicate the same databases to the on-premises server in synchronous mode, it should happen there too, as it does for the server in Azure. So I understand that it is something specific to the Azure server.

We have identified that the failure occurs for all AGs hosted on the on-premises servers, but only for the replica on the Azure server, which is on another network.

I would like to hear from the community: would increasing the session timeout help, even though no connectivity failure or increased latency has been identified?

Or should we look at some other information to understand the root cause?




There is no single root cause for these types of issues. They tend to take a long time to diagnose because the data collection is on the large side: you need network traces from the primary and from the disconnecting secondary, perfmon from both, XE and trace data, and DMV output to truly find the root of the issue. Since one of the replicas is in Azure, I'd check your dashboard to make sure that you aren't hitting disk, CPU, or any other type of throttling at the individual resource level or at the VM level. Once that's ruled out, you'll want to start collecting large amounts of data and narrow it down until the root cause is found. These cases are generally non-trivial.


@salves Any update on the issue?

Could you please "Accept as Answer" if it helped, so that it can help others in the community looking for help on similar topics?



1 Answer

AnshulFarkya-7260 answered

Hello, one possible cause of this type of issue is Azure VM Backup or Azure Site Recovery (ASR) configured for the VM hosting SQL Server in Azure.

If you have Azure VM backups or ASR configured for the server hosting SQL Server, please make sure the SQL VSS Writer service is disabled. The SQL VSS Writer freezes I/O for a brief period while an application-consistent VM backup is taken in Azure. You will still be able to get file-consistent VM backups after the SQL VSS Writer service is disabled.
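
Before disabling anything, it may help to check whether the timeout alarms actually coincide with the VM backup or ASR snapshot windows. Below is a minimal sketch in Python, assuming you have exported the alarm timestamps and the backup job start/end times yourself. The function name, the sample times, and the 60-second tolerance are all hypothetical and are not taken from the system described in the question.

```python
from datetime import datetime, timedelta


def alarms_during_backups(alarms, backup_windows, tolerance=timedelta(seconds=60)):
    """Return the alarms that fall inside any backup window, padded by tolerance."""
    hits = []
    for alarm in alarms:
        for start, end in backup_windows:
            if start - tolerance <= alarm <= end + tolerance:
                hits.append(alarm)
                break  # count each alarm only once
    return hits


# Hypothetical sample data for illustration only.
alarms = [
    datetime(2021, 3, 1, 2, 5, 30),
    datetime(2021, 3, 1, 14, 20, 0),
]
backup_windows = [
    (datetime(2021, 3, 1, 2, 0, 0), datetime(2021, 3, 1, 2, 10, 0)),
]

matched = alarms_during_backups(alarms, backup_windows)
print(f"{len(matched)} of {len(alarms)} alarms overlap a backup window")
```

If most alarms line up with the backup windows, the VSS freeze is a likely suspect. If they do not, the cause is probably elsewhere, for example in throttling or the network.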

Best Regards,
