We set up an Always On cluster a few months ago, and everything has been working very well.
We ran failover simulations for both the synchronous replica and the asynchronous replica, and everything was fine.
But after setting up some alarms, we are seeing recurring error events at different times of the day, including at dawn and at night.
We did some research, and the root cause is usually either a network problem or a backup routine that freezes the SQL processes while it runs.
In our case, since the event occurs at different times and now happens more consistently, and only with the server that is in Azure (remote), everything points to something specific to that server.
The two servers connected in synchronous mode on the same network do not generate connection error events; the error is generated only for the server that is in Azure.
The remote connection to the Azure server has a latency of 10 ms because we have a direct 1 Gbps link between the datacenter and Azure, so it is practically a LAN.
We replicate approximately 15 databases, totaling approximately 2 TB, between the servers.
We have been monitoring the connection from the on-premises servers to the Azure server to check whether there is also a connectivity failure at the times we receive alarms, but so far we have not identified any failure.
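For anyone who wants to reproduce this kind of check, a probe along the following lines is one way to do it. This is only a minimal sketch, not our exact tooling: the host name `azure-sql-replica.example.local` is hypothetical, and port 5022 is an assumption (it is the port commonly used for the database mirroring endpoint, but the endpoint may be configured differently).

```python
import socket
import time
from typing import Optional


def probe_tcp(host: str, port: int, timeout_s: float = 2.0) -> Optional[float]:
    """Return the TCP connect time in milliseconds, or None if the connect fails."""
    start = time.monotonic()
    try:
        # create_connection raises OSError (including timeouts) on failure
        with socket.create_connection((host, port), timeout=timeout_s):
            return (time.monotonic() - start) * 1000.0
    except OSError:
        return None


def log_probes(host: str, port: int, count: int, interval_s: float = 1.0) -> list:
    """Probe `count` times and return (unix_time, latency_ms or None) samples."""
    samples = []
    for _ in range(count):
        samples.append((time.time(), probe_tcp(host, port)))
        time.sleep(interval_s)
    return samples


if __name__ == "__main__":
    # Hypothetical host; 5022 is assumed to be the AG endpoint port.
    for ts, latency in log_probes("azure-sql-replica.example.local", 5022, count=5):
        status = "FAILED" if latency is None else f"{latency:.1f} ms"
        print(f"{time.strftime('%H:%M:%S', time.localtime(ts))} {status}")
```

Running it with a one-second interval gives samples that can be lined up against the alarm timestamps. A TCP connect test only shows that the port is reachable, so it would not catch a stall inside SQL Server itself.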
What surprises us is that the event occurs for a server that is configured as asynchronous (because of the multi-subnet network).
I also read that this can happen when a large queue of log transactions builds up on the primary side, which causes delays in replication. But we replicate the same databases to the on-premises server in synchronous mode, so it should happen there too, as it does for the server that is in Azure. So I understand that it is something specific to the Azure server.
We have identified that the failure occurs for all AGs hosted on the on-premises servers, but only in relation to the Azure server, which is on another network.
I would like to hear from you all whether increasing the session timeout would help, even though no connectivity failure has been identified. Or could increased latency be the cause?
Or should we look at some other information to understand the root cause?
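To make the session timeout question concrete, here is a rough sketch of how connectivity samples could be compared against the timeout. It assumes samples of the form (unix time in seconds, latency in ms or None for a failed probe) and assumes the replica still uses the default session timeout of 10 seconds. The function names are hypothetical.

```python
from typing import List, Optional, Tuple

# (unix_time_s, latency_ms), with latency_ms = None when the probe failed
Sample = Tuple[float, Optional[float]]


def longest_gap_s(samples: List[Sample]) -> float:
    """Return the longest time, in seconds, between two consecutive successful probes."""
    longest = 0.0
    last_ok: Optional[float] = None
    for ts, latency in samples:
        if latency is None:
            continue  # failed probe: the gap keeps growing until the next success
        if last_ok is not None:
            longest = max(longest, ts - last_ok)
        last_ok = ts
    return longest


def could_explain_timeout(samples: List[Sample], session_timeout_s: float = 10.0) -> bool:
    """True if some gap without a successful probe is at least the session timeout."""
    return longest_gap_s(samples) >= session_timeout_s


if __name__ == "__main__":
    # Made-up samples for illustration only: two failed probes between t=0 and t=12.
    demo = [(0.0, 10.2), (1.0, None), (2.0, None), (12.0, 11.0), (13.0, 10.1)]
    print(longest_gap_s(demo), could_explain_timeout(demo))
```

If the longest gap stays well below the session timeout while the alarms keep firing, that would suggest the network path is not the cause and that raising the timeout would only hide the symptom.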