Could I please clarify a few things about failover cluster behaviour which I cannot glean from the docs?
On a Windows Server 2019 failover cluster, we have the following default settings:
SameSubnetDelay : 1000
SameSubnetThreshold : 10
CrossSubnetDelay : 1000
CrossSubnetThreshold : 20
ResiliencyLevel : 2
ResiliencyPeriod : 240
These settings are fairly well documented, and I take the first four to control:
- How frequently heartbeats are sent
and
- How many consecutive heartbeats can be missed.
I am assuming this effectively controls how long a node can exhibit a problem before it is deemed down?
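To make my assumption concrete, here is a rough Python sketch of the arithmetic as I understand it. It assumes the detection window is simply the delay multiplied by the threshold, with the delay in milliseconds; the function name is my own and is not part of any cluster API.

```python
# Default values copied from the cluster properties listed above.
settings = {
    "SameSubnetDelay": 1000,      # milliseconds between heartbeats
    "SameSubnetThreshold": 10,    # consecutive missed heartbeats tolerated
    "CrossSubnetDelay": 1000,     # milliseconds between heartbeats
    "CrossSubnetThreshold": 20,   # consecutive missed heartbeats tolerated
}


def detection_window_seconds(delay_ms: int, threshold: int) -> float:
    """Approximate silence tolerated before a node is deemed down.

    Assumption: window = delay x threshold (hypothetical helper, not a cluster API).
    """
    return delay_ms * threshold / 1000


same_subnet = detection_window_seconds(
    settings["SameSubnetDelay"], settings["SameSubnetThreshold"]
)
cross_subnet = detection_window_seconds(
    settings["CrossSubnetDelay"], settings["CrossSubnetThreshold"]
)

print(f"Same subnet:  {same_subnet} s")   # 10.0 s
print(f"Cross subnet: {cross_subnet} s")  # 20.0 s
```

If that reading is right, the defaults give roughly 10 seconds of tolerance on the same subnet and 20 seconds across subnets.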
The last two settings control the resiliency behaviour when the clustered roles are compute based (i.e. virtual machines)? With ResiliencyLevel 2, if a node is running compute roles and these roles go unmonitored, the roles will either continue to run or enter a paused-critical state, depending on whether the storage is block or SMB and whether the storage network is entirely disrupted (but, crucially, they will not fail over while within the ResiliencyPeriod)?
If anyone ever finds this post, some of what I mention is documented here: https://techcommunity.microsoft.com/t5/failover-clustering/tuning-failover-cluster-network-thresholds/ba-p/371834
And here: https://techcommunity.microsoft.com/t5/failover-clustering/virtual-machine-compute-resiliency-in-windows-server-2016/ba-p/372027#:~:text=The%20node%20is%20no%20longer%20allowed%20to%20join%20the%20cluster,and%20the%20overall%20cluster%20health&text=No%20more%20than%2025%25%20of,quarantined%20at%20any%20given%20time
Can I please return to first principles quickly and check that my assessment of how this works is correct?
Moving on from this, can we consider a second scenario?
Let's say, for the sake of argument, we have a number of clustered Hyper-V nodes that all connect to a single switch, or to a stack with a single control plane. Suppose this switch disappears for a period of time because it crashes or is rebooted. The cluster loses quorum. What appears to happen on quorum loss is a total failure of everything, even if the network disruption is transitory (say 30 seconds). According to the documentation, there have been advances in failover cluster design to make clusters more tolerant of network disruption, but is a network black hole event impossible to ride out without disruption?
I think a quicker way of asking this question is: if there is a total loss of cluster network comms and quorum is lost, what is the expected behaviour? Are all nodes supposed to go isolated for a time, or something else? Also, are there any settings controlling this?
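For what it's worth, this is my rough mental model of why quorum is lost in that scenario, sketched in Python. It assumes plain node majority only and deliberately ignores witnesses and dynamic quorum, which a real cluster would factor in; the node count and function name are hypothetical.

```python
def has_quorum(votes_visible: int, total_votes: int) -> bool:
    """Simple majority check: a partition needs more than half of all votes.

    Assumption: node majority only, no witness, no dynamic quorum adjustment.
    """
    return votes_visible > total_votes / 2


total_nodes = 4  # hypothetical cluster size

# With the only switch down, each node can see just its own vote.
isolated = [has_quorum(1, total_nodes) for _ in range(total_nodes)]

print(isolated)                    # no node holds a majority on its own
print(has_quorum(3, total_nodes))  # a partition of 3 out of 4 would keep quorum
```

Under those assumptions, no node can form a majority during the outage, which matches what I am seeing, but I do not know whether the real behaviour is meant to be more forgiving than this.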
Many thanks.