SQL Server AlwaysOn: Loss of heartbeat and connection with secondary replica

Question

Gioele 1

Virtual environment description:

WSFC composed by two nodes with Windows server 2016 Standard
SQL Server AlwaysOn with synchronous replica and automatic failover
SQL Server 2014 (SP3-CU2) 12.0.6214.1
The cluster thresholds are as follows:
CrossSubnetDelay : 4000
CrossSubnetThreshold : 40
PlumbAllCrossSubnetRoutes : 0
SameSubnetDelay : 2000
SameSubnetThreshold : 20
AG properties:
LeaseTimeout: 20000
FailureConditionLevel: 3
HealthCheckTimeout: 30000
VerboseLogging: 0
Issue:

The RHS.exe process on the primary node lost its heartbeat with the availability group and initiated a failover. Immediately after the loss of heartbeat, the primary node lost its connection with the secondary node and the automatic failover failed. Shortly after, the heartbeat with the primary node worked again and the primary node took over the resources again.

Log Details

In chronological order:

In the days leading up to the failover, the primary node already shows intermittent disconnections from the secondary replica:
...AlwaysOn Availability Groups connection with secondary database terminated for primary database 'nameofDB'..
The primary node's storage also shows intermittent signs of distress in the days leading up to the failover and just before it:
...Long Sync IO: ... IOs in nonpreemptive mode longer than 1000ms
FlushCache cleaned up 146070 bufs with 20273 writes in 80387ms...
SQL Server has encountered 4 occurrence(s) of I/O request taking longer than 15 seconds to complete...
RHS.exe loses its heartbeat with the AG and WSFC sends a failover request to the AG:
[hadrag] Failure detected, diagnostics heartbeat is lost
The local replica of availability group ... is preparing to transition to the resolving role..
Shortly after the failover request, the primary replica loses connection with the secondary:
...AlwaysOn Availability Groups connection with secondary database terminated for primary database 'nameofDB'..
The listener stops working and there is a checkpoint failure on a specific database:
One or more recovery units belonging to database .. failed to generate a checkpoint
Eleven seconds after the attempted failover, the heartbeat works again and the primary replica takes over the resources
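
To see how often the storage warnings in the list above recur in the days before the failover, the error log can be scanned for them. The following is only a minimal Python sketch: the function name `count_slow_io` and the sample lines are illustrative assumptions (the sample reuses the messages quoted in this post), and the pattern is deliberately loose because the exact message wording can vary.

```python
import re

# Loose pattern for the "I/O request(s) taking longer than N seconds" warning.
# Assumption: the wording follows the message quoted in this post.
SLOW_IO = re.compile(
    r"encountered (\d+) occ\w+\(s\) of I/O requests? taking longer than (\d+) seconds"
)


def count_slow_io(lines):
    """Return the total number of slow I/O occurrences reported in the given log lines."""
    total = 0
    for line in lines:
        match = SLOW_IO.search(line)
        if match:
            total += int(match.group(1))
    return total


# Hypothetical sample built from the messages quoted in the question.
sample = [
    "SQL Server has encountered 4 occurrence(s) of I/O request taking longer than 15 seconds to complete...",
    "[hadrag] Failure detected, diagnostics heartbeat is lost",
]

print(count_slow_io(sample))
```

Running the same count per day over the exported error log would show whether the slow I/O warnings cluster around the times of the disconnections.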

Questions

Million dollar questions:

Why was the heartbeat lost?
Why was the connection with the secondary replica lost?
Is there any workaround? Can increasing the lease timeout, or improving storage performance, RAM or CPU, help?

I understand that this is difficult to answer, but even a strategy or some hypotheses would help me find the problem.

Cris Zhan-MSFT 6,661 Reputation points

2020-11-30T02:22:24.603+00:00

Hi,
Just following up. Is there any update on this case?
Did the following answers help you solve the problem? If the answer is helpful, please click "Accept Answer" and upvote it.

Answer 1

I answered a related question: Is it safe to run a Windows failover cluster public and heartbeat on a single NIC?

Is this VMware? Are you running snapshot backups on VMware? Please see the links below. This is a known issue, and you must rule out any VMware error before moving on to the network part.

Nodes being removed from Failover Cluster membership on VMware ESX?

Large packet loss at the guest operating system level on the VMXNET3 vNIC in ESXi

Troubleshooting Event ID 1135

Have you noticed any network congestion?

To start with, please understand that "Heartbeat communication is used for the Health monitoring between the nodes to detect node failures. Heartbeat packets are Lightweight (134 bytes) in nature and sensitive to latency. If the cluster heartbeats are delayed by a Saturated NIC, blocked due to firewalls, etc, it could cause the cluster node to be removed from Cluster membership". By default, on Windows Server 2012 R2 and earlier, a WSFC node is considered down when 5 consecutive heartbeats are lost (1 heartbeat per second, for a total of 5 seconds); Windows Server 2016 raised the same-subnet default threshold to 10 heartbeats.

In your case you have set SameSubnetThreshold=20 and SameSubnetDelay=2000 (2 seconds), which means the heartbeat will NOT give up until 20 consecutive heartbeats, sent 2 seconds apart, fail to get a response from the other node. In other words, the heartbeat would wait 40 seconds before initiating failover.
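
As a quick sanity check of that arithmetic, here is a minimal Python sketch that multiplies the delay by the threshold for the values posted in the question. The function name `heartbeat_tolerance_seconds` is only illustrative; the lease and health check timeouts are printed next to the result purely for comparison.

```python
def heartbeat_tolerance_seconds(delay_ms: int, threshold: int) -> float:
    """Seconds of consecutive missed heartbeats tolerated before the node is considered down."""
    return delay_ms * threshold / 1000.0


# Values taken from the question.
same_subnet = heartbeat_tolerance_seconds(delay_ms=2000, threshold=20)
cross_subnet = heartbeat_tolerance_seconds(delay_ms=4000, threshold=40)

lease_timeout_s = 20000 / 1000.0         # LeaseTimeout from the AG properties
health_check_timeout_s = 30000 / 1000.0  # HealthCheckTimeout from the AG properties

print(f"Same-subnet tolerance:  {same_subnet:.0f} s")
print(f"Cross-subnet tolerance: {cross_subnet:.0f} s")
print(f"Lease timeout:          {lease_timeout_s:.0f} s")
print(f"Health check timeout:   {health_check_timeout_s:.0f} s")
```

With these settings the same-subnet window works out to 40 seconds and the cross-subnet window to 160 seconds, while the lease timeout (20 seconds) and the health check timeout (30 seconds) are both shorter than either window.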

Why was the heartbeat lost?

Because for 40 seconds the heartbeat got no response, or packets were lost due to network congestion. This forced WSFC to initiate a failover. It is quite possible that the network is so congested that, even though it is online and connected, the congestion is causing delays or packet loss.

Why was the connection with the secondary replica lost?

The answer is the same as above, but it seems like you are using the same NIC for both public and private communication.

Is there any workaround? Can increasing the lease timeout, or improving storage performance, RAM or CPU, help?

The network is the AG's Achilles heel. If you have poor bandwidth or a choked network, you will face issues with the AG no matter how much you ramp up the hardware. A workaround I see is separating the cluster heartbeat onto a private network on one more NIC. The beauty of the heartbeat is that if the private network is down, the cluster will use the public network to establish the heartbeat connection. Please take advice from your network team on how to go about this; my network knowledge is limited.

The thing is, you seem to have a choked or bad network, and you must focus on resolving this.

SQL Server has encountered 4 occurrence(s) of I/O request taking longer than 15 seconds to complete...

This is another thing that can add to the problems: your storage is slow. You need to upgrade to faster storage.

Gioele 1 Reputation point

2020-11-25T22:09:11.097+00:00

Yes, it's VMware. I'll check the configuration shown in the links, thank you, but why are you sure it's a network problem? RHS.exe is a process on the primary node that "pings" the SQL Server AG (the lease mechanism), isn't it? Could it be a resource-related issue, since SQL Server did not respond to the ping for 40 seconds?
The lease timeout in my case happened locally on the primary node (as the error said), so I think it doesn't use a network connection.
My cluster nodes lose the connection with the secondary replica too, but maybe these are different issues with separate root causes?
Shashank Singh 6,251 Reputation points

2020-11-26T08:14:28.63+00:00

From what you have posted it seems network related, unless you find an issue in VMware. The lease mechanism is limited to communication between the cluster resource DLL and the SQL Server instance; in an AG it is a kind of heartbeat created for better monitoring of the AG. Where is your quorum located, and what kind of quorum do you have?
Gioele 1 Reputation point

2020-11-26T14:51:20.91+00:00

I configured a File Share Witness, and the quorum is located on the primary node (the node that also hosts the AG). The file share witness is a domain controller that resides on the same subnet, in the same datacenter and on the same VMware infrastructure.
