Hyper-V Compute Failover Behavior Clarification

hobgadling 1 Reputation point
2020-11-18T20:35:30.2+00:00

Could I please clarify a few things about failover cluster behaviour which I cannot glean from the docs?

On a 2019 failover cluster, we have the following default settings:

SameSubnetDelay : 1000

SameSubnetThreshold : 10

CrossSubnetDelay : 1000

CrossSubnetThreshold : 20

ResiliencyLevel : 2

ResiliencyPeriod : 240

These settings are fairly well documented and I take the first four to be things which control:

  1. The frequency at which heartbeats are sent

and

  2. The tolerance for how many heartbeats can be missed in a row.

I am assuming this effectively controls how long a node can exhibit a problem before it is deemed down?
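To make that assumption concrete, here is a minimal Python sketch of the arithmetic. It assumes the delay values are in milliseconds and that a node is deemed down after roughly delay × threshold of continuously missed heartbeats; the function name is made up for illustration and is not part of any cluster API.

```python
# Values copied from the cluster settings listed above.
# Assumption: delays are in milliseconds, thresholds are heartbeat counts.
SETTINGS = {
    "SameSubnetDelay": 1000,
    "SameSubnetThreshold": 10,
    "CrossSubnetDelay": 1000,
    "CrossSubnetThreshold": 20,
}


def detection_window_seconds(delay_ms: int, threshold: int) -> float:
    """Approximate time of continuously missed heartbeats before a node is deemed down."""
    return delay_ms * threshold / 1000.0


same_subnet = detection_window_seconds(
    SETTINGS["SameSubnetDelay"], SETTINGS["SameSubnetThreshold"]
)
cross_subnet = detection_window_seconds(
    SETTINGS["CrossSubnetDelay"], SETTINGS["CrossSubnetThreshold"]
)

print(f"Same subnet:  about {same_subnet:.0f} s")   # about 10 s
print(f"Cross subnet: about {cross_subnet:.0f} s")  # about 20 s
```

If this reading is right, the defaults give a window of about 10 seconds between nodes on the same subnet and about 20 seconds across subnets.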

The last two settings control the resiliency behaviour when the clustered roles are compute-based (i.e. virtual machines)? With ResiliencyLevel 2, if a node is running compute roles and those roles go unmonitored, they will either continue to run or enter a paused-critical state, depending on whether the storage is block or SMB and whether the storage network is entirely disrupted (but, crucially, they will not fail over within the ResiliencyPeriod)?
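As a rough sketch of the reading above (not a description of how the cluster service is actually implemented), the failover decision could be modelled like this in Python. The function name is hypothetical, the 240-second period is an assumed value for ResiliencyPeriod, and whether the boundary is inclusive is a guess.

```python
def vms_fail_over(seconds_isolated: float, resiliency_period_s: float) -> bool:
    """True once the node has been isolated for longer than the resiliency period."""
    return seconds_isolated > resiliency_period_s


# Assumption: ResiliencyPeriod is 240 seconds.
RESILIENCY_PERIOD_S = 240

for outage in (30, 239, 300):
    if vms_fail_over(outage, RESILIENCY_PERIOD_S):
        action = "fail over to another node"
    else:
        action = "stay on the isolated node"
    print(f"{outage:>3} s isolated -> VMs {action}")
```

Under this model, a 30-second isolation leaves the VMs where they are, and only an isolation longer than the period triggers a failover.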

If anyone ever finds this post, some of the stuff I mention is documented here: https://techcommunity.microsoft.com/t5/failover-clustering/tuning-failover-cluster-network-thresholds/ba-p/371834

And here: https://techcommunity.microsoft.com/t5/failover-clustering/virtual-machine-compute-resiliency-in-windows-server-2016/ba-p/372027#:~:text=The%20node%20is%20no%20longer%20allowed%20to%20join%20the%20cluster,and%20the%20overall%20cluster%20health&text=No%20more%20than%2025%25%20of,quarantined%20at%20any%20given%20time

Can I please return to first principles quickly and check that my assessment of how this works is correct?

Moving on from this, can we consider a second scenario?

Let's say, for the sake of argument, that we have a number of clustered Hyper-V nodes connected to a single switch, or to a stack with a single control plane. This switch disappears for a period of time, say because it crashes or is rebooted, and the cluster loses quorum. What appears to happen on quorum loss is a total failure of everything, even if the network disruption is transitory (let's say 30 seconds). According to the documentation, there have been advances in failover cluster design to make clusters more tolerant of network disruption, but is a network black hole event impossible to ride out without disruption?

I think a quicker way of asking this question is: if there is a total loss of cluster network comms and quorum is lost, what is the expected behaviour? Are all nodes supposed to go isolated for a time, or something else? Also, are there any settings controlling this?
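For context on why the switch outage costs quorum, here is a small Python sketch of a plain majority vote. The cluster size (four nodes plus one witness, one vote each) is hypothetical, the witness is assumed to be unreachable along with everything else, and dynamic quorum, which adjusts votes as members leave one at a time, is deliberately not modelled.

```python
def has_quorum(votes_reachable: int, total_votes: int) -> bool:
    """A partition keeps quorum only while it holds a strict majority of the votes."""
    return 2 * votes_reachable > total_votes


# Hypothetical cluster: 4 nodes plus 1 witness, one vote each.
NODES = 4
TOTAL_VOTES = NODES + 1

# Switch down: every node can reach only itself, so each partition has 1 vote.
isolated = has_quorum(1, TOTAL_VOTES)

# Switch up: all nodes and the witness can reach each other.
healthy = has_quorum(NODES + 1, TOTAL_VOTES)

print(f"Single isolated node has quorum: {isolated}")  # False
print(f"Fully connected cluster has quorum: {healthy}")  # True
```

With every node cut off at the same moment, no partition holds more than one vote, so under a simple majority rule none of them can keep quorum.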

Many thanks.

Hyper-V
Windows Server Clustering

3 answers

  1. Xiaowei He 9,906 Reputation points
    2020-11-19T06:31:49.317+00:00

    Hi,

    Based on my understanding, you would like to know how the cluster behaves if the network between the cluster nodes and the witness is interrupted. If I have misunderstood, please feel free to correct me.

    1. If the cluster has two nodes and a witness, and the cluster network goes down so that the nodes cannot communicate with each other or with the witness at the same time, then the cluster may go down.
    2. If the network outage is very short, so that the number of heartbeat packets lost during the outage does not reach the heartbeat threshold, then the cluster nodes will not go down. Only the witness will be inaccessible, and it will come back when the network recovers; we will not notice anything at the cluster level.
    3. If the network goes down first between the nodes and the witness, and only later between the cluster nodes, so that the failures do not happen at the same time, then the cluster will stay online with the last man standing, i.e. with one node.

    Thanks for your time!
    Best Regards,
    Anne


  2. hobgadling 1 Reputation point
    2020-11-19T08:05:39.097+00:00

    Thank you very much for getting back to me.

    So in your first example, if we lose the two nodes and the disk witness, the cluster is instantly down and the cluster service goes down the second it loses quorum? Is there any setting that can control this? In my experience it seems to happen in 10 seconds, and I wonder if this can be tuned?

    Many thanks


  3. Xiaowei He 9,906 Reputation points
    2020-11-23T08:04:50.153+00:00

    Hi,

    "So in your first example if we lose the two nodes and the disk witness, the cluster is instantly down and the cluster service down the second it loses quorum?"

    It seems the only settings available are those that configure the heartbeat frequency and the tolerance for how many heartbeats can be missed between the cluster nodes.

    As for the heartbeat check between the cluster nodes and the quorum witness: if we use a file share witness, as far as I know it also uses UDP 3343, but it does not seem to use the heartbeat threshold settings. From the cluster logs, it seems the check between the nodes and the witness is retried 30 times, but there is no detailed information about the frequency, and I did not find any information about how to change it.

    So, I'm afraid we are unable to control the check threshold between the nodes and the witness.

    Thanks for your time!
    Best Regards,
    Anne

