SQL Server AlwaysOn: Lost of heartbeat and connection with secondary replica

Gioele 1 Reputation point
2020-11-25T10:35:30.567+00:00

Virtual Enviroment description:

  • WSFC composed by two nodes with Windows server 2016 Standard
  • SQL Server AlwaysOn with synchronous replica and automatic failover
  • SQL Server 2014 (SP3-CU2) 12.0.6214.1
  • Following the cluster thresholds:
    CrossSubnetDelay : 4000
    CrossSubnetThreshold : 40
    PlumbAllCrossSubnetRoutes : 0
    SameSubnetDelay : 2000
    SameSubnetThreshold : 20
  • AG properties:
    LeaseTimeout: 20000
    FailureConditionLevel: 3
    HealthCheckTimeout: 30000
    VerboseLogging: 0

    Issue:

RHS.exe process on primary node lost heartbeat with availability group and it initiated the failover. Immediately after the lost of heartbeat, primary node loses the connection with the secondary node and automatic failover fail. Shortly after, hearbeat with primary node works again and primary node takes over the resources again.

Log Details

In chronological order:

  1. Primary node randomly shows signs of disconnection with the secondary replica already in the days leading up to failover:
    ...AlwaysOn Availability Groups connection with secondary database terminated for primary database 'nameofDB'..
  2. The primary node storage randomly shows signs of distress already in the days leading up to failover and just before failover:
    ...Long Sync IO: ... IOs in nonpreemptive mode longer than 1000ms
    FlushCache cleaned up 146070 bufs with 20273 writes in 80387ms...
    SQL Server has encountered 4 occourence(s) of I/O request taking longer than 15 seconds to complete...
  3. RHS.exe lost heartbeat with AG and WSFC send failover request to AG:
    [hadrag] Failure detected, diagnostics heartbeat is lost
    The local replica of availability group ... is preparing to transition to the resolving role..
  4. Shortly after the failover request, the primary replica loses connection with the secondary:
    ...AlwaysOn Availability Groups connection with secondary database terminated for primary database 'nameofDB'..
  5. The listener stops working and there is a checkpoint failure on a specific database:
    One or more recovery units belonging to database .. failed to generate a checkpoint
  6. Eleven seconds after the attempted failover, heartbeat works again and primary replica
    takes over the resources

Questions

Million dollar questions:

  1. why heartbeat was lost?
  2. why connection with secondary replica was lost ?
  3. Any workaround ? increase lease timeout, performance of storage or RAM and CPU can help?

I understand that it is difficult to answer but it can help to only have a strategy or hypotheses that can help me find the problem

SQL Server
SQL Server
A family of Microsoft relational database management and analysis systems for e-commerce, line-of-business, and data warehousing solutions.
12,814 questions
Windows Server Clustering
Windows Server Clustering
Windows Server: A family of Microsoft server operating systems that support enterprise-level management, data storage, applications, and communications.Clustering: The grouping of multiple servers in a way that allows them to appear to be a single unit to client computers on a network. Clustering is a means of increasing network capacity, providing live backup in case one of the servers fails, and improving data security.
962 questions
Windows Server Storage
Windows Server Storage
Windows Server: A family of Microsoft server operating systems that support enterprise-level management, data storage, applications, and communications.Storage: The hardware and software system used to retain data for subsequent retrieval.
632 questions
{count} votes

1 answer

Sort by: Most helpful
  1. Shashank Singh 6,246 Reputation points
    2020-11-25T12:18:01.777+00:00

    I answered a question about is+it+safe+to+run+a+windows+failover+cluster+public+and+heartbeat+on+a+single+nic.

    Is this Vmware ? Are you running snapshot backups on VMware. Please see below links. This is knows issue and you must rule out any VMware error before moving to network part

    Nodes being removed from Failover Cluster membership on VMWare ESX?.

    Large packet loss at the guest operating system level on the VMXNET3 vNIC in ESXi

    Troubleshooting Event ID 1135

    Noticed any Network congestion

    To start with please understand that "Heartbeat communication is used for the Health monitoring between the nodes to detect node failures. Heartbeat packets are Lightweight (134 bytes) in nature and sensitive to latency. If the cluster heartbeats are delayed by a Saturated NIC, blocked due to firewalls, etc, it could cause the cluster node to be removed from Cluster membership". By default your WFC connection will fail when 5 pings are lost (1 ping per second for a total of 5 seconds).

    In your case you have set SameSubnetThreshold=20 and SameSubnetDelay=2 which means The heartbeat will NOT give up unless 20 pinges each send after 2 sec fail to get any response from the servers. Which mean the heartbeat would wait 40 seconds before initiating failover.

    why heartbeat was lost?

    Because for 40 seconds the ping did not responded or packet was lost due to network congestion. This forced WSFC to initiate a failover. Its quite possible that network is so much congested than even if it is online and connected the congestion is causing delay or there is packed lost.

    why connection with secondary replica was lost ?

    Answer is same as above, but it seems line you are using same NIC for both public and private communication.

    Any workaround ? increase lease timeout, performance of storage or RAM and CPU can help?

    Network is AG's Achilles Heel. If you have poor badnwidth or choked network you will face issues with AG not matter you how much you ramp up hardware. A workaround I see is separating cluster heartbeat on private network and one more NIC card. The beauty of heartbeat is if private is down SQL Server will use public network to establish connection for heartbeat. Please take advise from your network team how to go with this, My network knowledge is limited.

    The thing is you are having choked or bad network it seems you must focus on resolving this.

    SQL Server has encountered 4 occourence(s) of I/O request taking longer than 15 seconds to complete...

    This is other thing which can add problems, your storage is slow. You need to upgrade to fast storage.

    1 person found this answer helpful.