SQL Server AlwaysOn: Lost of heartbeat and connection with secondary replica

Question

Virtual Enviroment description:

WSFC composed by two nodes with Windows server 2016 Standard
SQL Server AlwaysOn with synchronous replica and automatic failover
SQL Server 2014 (SP3-CU2) 12.0.6214.1
Following the cluster thresholds:
CrossSubnetDelay : 4000
CrossSubnetThreshold : 40
PlumbAllCrossSubnetRoutes : 0
SameSubnetDelay : 2000
SameSubnetThreshold : 20
AG properties:
LeaseTimeout: 20000
FailureConditionLevel: 3
HealthCheckTimeout: 30000
VerboseLogging: 0
Issue:

RHS.exe process on primary node lost heartbeat with availability group and it initiated the failover. Immediately after the lost of heartbeat, primary node loses the connection with the secondary node and automatic failover fail. Shortly after, hearbeat with primary node works again and primary node takes over the resources again.

Log Details

In chronological order:

Primary node randomly shows signs of disconnection with the secondary replica already in the days leading up to failover:
...AlwaysOn Availability Groups connection with secondary database terminated for primary database 'nameofDB'..
The primary node storage randomly shows signs of distress already in the days leading up to failover and just before failover:
...Long Sync IO: ... IOs in nonpreemptive mode longer than 1000ms
FlushCache cleaned up 146070 bufs with 20273 writes in 80387ms...
SQL Server has encountered 4 occourence(s) of I/O request taking longer than 15 seconds to complete...
RHS.exe lost heartbeat with AG and WSFC send failover request to AG:
[hadrag] Failure detected, diagnostics heartbeat is lost
The local replica of availability group ... is preparing to transition to the resolving role..
Shortly after the failover request, the primary replica loses connection with the secondary:
...AlwaysOn Availability Groups connection with secondary database terminated for primary database 'nameofDB'..
The listener stops working and there is a checkpoint failure on a specific database:
One or more recovery units belonging to database .. failed to generate a checkpoint
Eleven seconds after the attempted failover, heartbeat works again and primary replica
takes over the resources

Questions

Million dollar questions:

why heartbeat was lost?
why connection with secondary replica was lost ?
Any workaround ? increase lease timeout, performance of storage or RAM and CPU can help?

I understand that it is difficult to answer but it can help to only have a strategy or hypotheses that can help me find the problem

Answer

I answered a question about is+it+safe+to+run+a+windows+failover+cluster+public+and+heartbeat+on+a+single+nic.

Is this Vmware ? Are you running snapshot backups on VMware. Please see below links. This is knows issue and you must rule out any VMware error before moving to network part

Nodes being removed from Failover Cluster membership on VMWare ESX?.

Large packet loss at the guest operating system level on the VMXNET3 vNIC in ESXi

Troubleshooting Event ID 1135

Noticed any Network congestion

To start with please understand that "Heartbeat communication is used for the Health monitoring between the nodes to detect node failures. Heartbeat packets are Lightweight (134 bytes) in nature and sensitive to latency. If the cluster heartbeats are delayed by a Saturated NIC, blocked due to firewalls, etc, it could cause the cluster node to be removed from Cluster membership". By default your WFC connection will fail when 5 pings are lost (1 ping per second for a total of 5 seconds).

In your case you have set SameSubnetThreshold=20 and SameSubnetDelay=2 which means The heartbeat will NOT give up unless 20 pinges each send after 2 sec fail to get any response from the servers. Which mean the heartbeat would wait 40 seconds before initiating failover.

why heartbeat was lost?

Because for 40 seconds the ping did not responded or packet was lost due to network congestion. This forced WSFC to initiate a failover. Its quite possible that network is so much congested than even if it is online and connected the congestion is causing delay or there is packed lost.

why connection with secondary replica was lost ?

Answer is same as above, but it seems line you are using same NIC for both public and private communication.

Any workaround ? increase lease timeout, performance of storage or RAM and CPU can help?

Network is AG's Achilles Heel. If you have poor badnwidth or choked network you will face issues with AG not matter you how much you ramp up hardware. A workaround I see is separating cluster heartbeat on private network and one more NIC card. The beauty of heartbeat is if private is down SQL Server will use public network to establish connection for heartbeat. Please take advise from your network team how to go with this, My network knowledge is limited.

The thing is you are having choked or bad network it seems you must focus on resolving this.

SQL Server has encountered 4 occourence(s) of I/O request taking longer than 15 seconds to complete...

This is other thing which can add problems, your storage is slow. You need to upgrade to fast storage.

SQL Server AlwaysOn: Lost of heartbeat and connection with secondary replica

Virtual Enviroment description:

Issue:

Log Details

Questions

1 answer