I answered a related question about whether it is safe to run a Windows failover cluster's public and heartbeat traffic on a single NIC.
Is this running on VMware? Are you taking snapshot backups on VMware? Please see the links below. This is a known issue, and you must rule out any VMware problem before moving on to the network side.
Nodes being removed from Failover Cluster membership on VMWare ESX?.
Large packet loss at the guest operating system level on the VMXNET3 vNIC in ESXi
Have you noticed any network congestion?
To start with, please understand that "Heartbeat communication is used for the Health monitoring between the nodes to detect node failures. Heartbeat packets are Lightweight (134 bytes) in nature and sensitive to latency. If the cluster heartbeats are delayed by a Saturated NIC, blocked due to firewalls, etc, it could cause the cluster node to be removed from Cluster membership."
By default, WSFC treats the connection as failed when 5 heartbeats are lost (1 heartbeat per second, for a total of 5 seconds).
In your case you have set SameSubnetThreshold=20 and SameSubnetDelay=2, which means the heartbeat will NOT give up until 20 heartbeats, sent 2 seconds apart, fail to get any response from the other node. In other words, the cluster waits 40 seconds before initiating a failover.
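The arithmetic behind those figures can be sketched in a few lines of Python. This is only an illustration: the function name `heartbeat_tolerance_seconds` is made up for this example, and the delay is taken in seconds to match the description above (the cluster property itself is stored in milliseconds).

```python
def heartbeat_tolerance_seconds(threshold: int, delay_seconds: float) -> float:
    """Rough time the cluster tolerates missed heartbeats before acting.

    threshold     -- number of missed heartbeats allowed (SameSubnetThreshold)
    delay_seconds -- interval between heartbeats in seconds (SameSubnetDelay)
    """
    return threshold * delay_seconds


# Defaults as described above: 5 heartbeats, 1 second apart.
default_tolerance = heartbeat_tolerance_seconds(5, 1)

# Settings from the question: 20 heartbeats, 2 seconds apart.
current_tolerance = heartbeat_tolerance_seconds(20, 2)

print(default_tolerance, current_tolerance)
```

With the settings from the question, the tolerance is eight times longer than with the defaults, so a failover here points to a sustained problem rather than a brief blip.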
Why was the heartbeat lost?
Because for 40 seconds the heartbeat got no response, or the packets were lost due to network congestion. This forced WSFC to initiate a failover. It is quite possible that the network is so congested that, even though it is online and connected, the congestion is causing delays or packet loss.
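As a rough mental model of why only a sustained outage leads to this, the sketch below treats the threshold as a count of consecutive missed heartbeats that resets whenever a reply arrives. That reset behaviour is a simplifying assumption for illustration, and `first_removal_index` is a hypothetical helper, not part of any cluster API.

```python
def first_removal_index(heartbeats, threshold):
    """Return the index of the heartbeat at which `threshold` consecutive
    misses is reached, or None if that never happens.

    heartbeats -- iterable of booleans: True = reply received, False = missed
    threshold  -- allowed consecutive misses (assumed model of SameSubnetThreshold)
    """
    missed_in_a_row = 0
    for i, received in enumerate(heartbeats):
        if received:
            # Assumption: any successful heartbeat resets the counter.
            missed_in_a_row = 0
        else:
            missed_in_a_row += 1
            if missed_in_a_row >= threshold:
                return i
    return None


# Intermittent loss: 19 misses, then one reply, repeated three times.
flaky = ([False] * 19 + [True]) * 3

# Sustained outage: 5 good heartbeats followed by 20 misses in a row.
outage = [True] * 5 + [False] * 20

print(first_removal_index(flaky, 20))   # None
print(first_removal_index(outage, 20))  # 24
```

Under this model, even heavy intermittent loss does not reach the threshold, while an unbroken gap of 20 heartbeats does.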
Why was the connection with the secondary replica lost?
The answer is the same as above, but it seems like you are using the same NIC for both public and private communication.
Any workaround? Can increasing the lease timeout, or the performance of storage, RAM and CPU, help?
The network is the AG's Achilles heel. If you have poor bandwidth or a choked network, you will face issues with the AG no matter how much you ramp up the hardware. One workaround I see is to move the cluster heartbeat onto a private network on an additional NIC. The beauty of the heartbeat is that if the private network is down, the cluster will use the public network for heartbeat communication. Please take advice from your network team on how to go about this; my network knowledge is limited.
The main point is that you appear to have a choked or bad network, and you must focus on resolving that.
SQL Server has encountered 4 occurrence(s) of I/O request taking longer than 15 seconds to complete...
This is another thing that can add to the problem: your storage is slow. You need to move to faster storage.