2 Node Cluster with FSW Stops When Secondary Reboots

James Allcock 0 Reputation points
2023-01-16T14:27:17.38+00:00

We've had Windows clusters configured on various versions of Windows Server for a few years to support SQL Availability Groups and things have generally worked well. Recently, I've seen quite a few clusters stop when the secondary server (for SQL) reboots. I'll give details of the most recent one that happened over the weekend.

There are 2 Windows Server 2016 servers running SQL 2016 and a file share witness on Windows Server 2019. The SQL AG is asynchronous with manual failover. I never want the secondary to take over unless I do it manually. The primary server (Node 2) and FSW are at 1 site, the secondary (Node 1) is at another.

An issue with a VMware host caused the secondary server to reboot on Saturday night. Although the cluster should still have had quorum, the cluster service on the primary stopped. I don't have much experience with troubleshooting clusters, but used Get-ClusterLog to generate a detailed log. In there, I found this:

00001910.00000954::2023/01/14-23:26:37.287 INFO [RES] File Share Witness <File Share Witness>: Read 88 bytes from the witness file share.

00001910.00000954::2023/01/14-23:26:37.288 INFO [RES] File Share Witness <File Share Witness>: Releasing temporary lock on witness file share.

00000bdc.00000718::2023/01/14-23:26:37.288 INFO [QUORUM] Node 2: received request for quorum witness info. Replying with paxos tag 120:120:30042

00000bdc.00001b24::2023/01/14-23:26:37.289 ERR Quorum witness has better epoch than local node, this node must have been on the losing side of arbitration!

00000bdc.00001b24::2023/01/14-23:26:37.289 ERR [GUM] Node 2: failed to update epoch after one of the nodes went down. error 995

00000bdc.00001b24::2023/01/14-23:26:37.289 ERR mscs::GumAgent::ChangeEpoch: (995)' because of 'Quorum witness has better epoch than local node ABORT!'

00000bdc.00001b24::2023/01/14-23:26:37.289 ERR Changing Epoch failed (status = 1359)

00000bdc.00001b24::2023/01/14-23:26:37.289 INFO [NETFT] Cluster Service preterminate succeeded.
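
For anyone wanting to pull just the error entries out of a large Get-ClusterLog file, here is a minimal Python sketch. The line layout (process ID, thread ID, timestamp, level, message) is an assumption inferred only from the excerpt above, and the helper names `parse_line` and `errors_only` are made up for illustration.

```python
import re

# Assumed line shape, inferred only from the excerpt above:
# <pid>.<tid>::<yyyy/mm/dd-hh:mm:ss.mmm> <LEVEL> <message>
LINE_RE = re.compile(
    r"^(?P<pid>[0-9a-f]{8})\.(?P<tid>[0-9a-f]{8})::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+(?P<msg>.*)$"
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def errors_only(lines):
    """Keep only the parsed entries whose level is ERR."""
    parsed = (parse_line(line) for line in lines)
    return [entry for entry in parsed if entry and entry["level"] == "ERR"]


# Four lines copied from the excerpt above.
SAMPLE = [
    "00000bdc.00000718::2023/01/14-23:26:37.288 INFO [QUORUM] Node 2: received request for quorum witness info. Replying with paxos tag 120:120:30042",
    "00000bdc.00001b24::2023/01/14-23:26:37.289 ERR Quorum witness has better epoch than local node, this node must have been on the losing side of arbitration!",
    "00000bdc.00001b24::2023/01/14-23:26:37.289 ERR Changing Epoch failed (status = 1359)",
    "00000bdc.00001b24::2023/01/14-23:26:37.289 INFO [NETFT] Cluster Service preterminate succeeded.",
]

for entry in errors_only(SAMPLE):
    print(entry["ts"], entry["msg"])
```

Lines that do not fit the assumed shape are skipped rather than raising an error, so the filter should tolerate blank lines and wrapped messages.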

Event logs on the primary server include event ID 1073 with this message:

The Cluster service was halted to prevent an inconsistency within the failover cluster. The error code was '1359'.

It looks like it then stops the cluster service, which takes the SQL AG down as well. The secondary server rebooted quickly and everything came back within a minute, but I'm concerned that this is happening more frequently.

Secondary servers are patched on a fairly open automated schedule, so we do nothing beyond relying on the primary server and FSW to keep quorum. For the primary servers, we fail everything over to the secondary before patching in a maintenance window, then fail back to the primary at the end of the window. Recently, several secondary servers rebooting for patching have caused the same issue.

Today I've manually shut down the secondary server and left it off for 10 minutes. The cluster behaved as expected, and the primary continued with no interruptions. I've not been able to investigate while a cluster is in a failed state and can't replicate the problem, so I'm not sure how to proceed.


1 answer

  1. Edwin M Sarmiento 6 Reputation points
    2023-01-16T17:52:52.87+00:00

    Without going into the internals of how the cluster service works and the relevance of the paxos tag: the reboot of your secondary VMware server (taking down Node 1 in the process) caused the cluster to lose quorum, despite still having 2 out of 3 votes (Node 2 and the FSW). This is by design. That vote count gives a false sense of security, which is frustrating, but once you understand the internals, it really is by design.

    The cluster determines quorum by having a majority of votes. The votes are cast by the voting members that are still available. Note that dynamic quorum and dynamic witness don't work in a scenario with 3 voting members (2 nodes + witness). They only work when you have more than 3 voting members.
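
To make the vote arithmetic concrete, here is a minimal Python sketch of a plain majority check. It is only the counting rule, not how the cluster service actually arbitrates, and the function name `has_majority` is made up for illustration.

```python
def has_majority(votes_online, total_votes):
    """Return True when strictly more than half of all votes are online."""
    return votes_online > total_votes // 2


# Node 1, Node 2 and the FSW hold one vote each (3 votes in total).
TOTAL_VOTES = 3

# Node 1 reboots: Node 2 and the FSW are still reachable.
print(has_majority(2, TOTAL_VOTES))

# Node 1 and the FSW are both unreachable: Node 2 is alone.
print(has_majority(1, TOTAL_VOTES))
```

On the counting rule alone, Node 2 plus the FSW is still a majority, which is why the stoppage has to be explained by something other than the raw vote count.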

    Depending on how far away those voting members are from each other, there could be delays in updating the paxos tag. The paxos tag is a way of guaranteeing the consistency of the cluster configuration across all the cluster nodes in order to prevent split brain. The paxos tag gets updated any time a cluster configuration change is made. This can be as simple as a node being removed from the voting members or as complex as changing the AG listener to have multiple IP addresses. I refer to it as the paxos tag, although the cluster log uses the word epoch. Take a look at this entry in the cluster log.

    00000bdc.00001b24::2023/01/14-23:26:37.289 ERR Quorum witness has better epoch than local node, this node must have been on the losing side of arbitration!

    Note that the paxos tag should be the same across all nodes in the cluster in order to maintain consistency. In this case, it isn't (the FSW has a more recent paxos tag than Node 2). I could be wrong, but my guess is that the secondary node (Node 1) was able to update the paxos tag on the FSW but not on the primary node (Node 2). So, when the primary node lost connectivity with the secondary node, it checked the FSW and saw that the paxos tag was inconsistent. It tried to update it but failed.
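
As a rough illustration of that comparison, here is a Python sketch that treats a paxos tag as three colon-separated integers and compares them field by field. Both the meaning of the fields and the ordering rule are assumptions made for the example, not documented cluster behavior, and the witness value shown is hypothetical (the log only reports Node 2's tag).

```python
def parse_tag(text):
    """Split a tag such as '120:120:30042' into a tuple of integers."""
    return tuple(int(part) for part in text.split(":"))


def witness_is_ahead(local_tag, witness_tag):
    """Assumed rule: the witness is ahead if its tuple compares greater."""
    return parse_tag(witness_tag) > parse_tag(local_tag)


local = "120:120:30042"    # reported by Node 2 in the cluster log
witness = "121:121:30043"  # hypothetical value, not taken from the log

if witness_is_ahead(local, witness):
    print("witness has a newer tag than the local node")
else:
    print("local node is at least as current as the witness")
```

Under this assumed ordering, a node that finds the witness ahead of it cannot be sure it holds the latest configuration, which matches the "better epoch" wording in the log entry.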

    00000bdc.00001b24::2023/01/14-23:26:37.289 ERR [GUM] Node 2: failed to update epoch after one of the nodes went down. error 995

    00000bdc.00001b24::2023/01/14-23:26:37.289 ERR Changing Epoch failed (status = 1359)

    Then, it stopped the cluster service:

    00000bdc.00001b24::2023/01/14-23:26:37.289 INFO [NETFT] Cluster Service preterminate succeeded.

    I call this the cluster's self-preservation mechanism in action: "I can't make a decision!!! I'll just shut down so I don't have to make a decision!!!"

    Here's reference documentation if you want to understand the internals of the paxos tag. Ignore the version of Windows Server. It's the same algorithm in newer versions.

    >Today I've manually shut down the secondary server and left it off for 10 minutes. The cluster behaved as expected, the primary continued with no interruptions.

    Remember the false sense of security of having 2 out of 3 members, thinking you have quorum? This has the same risk the moment you lose the FSW.

    Action Item: IGNORE

    You don't have high availability anyway. Your AG is in async mode, so there's no way for automatic failover to ever happen. Improve your patching process and treat this as expected behavior. Just make sure you are still meeting your recovery objectives (RPO/RTO).

    You might be tempted to dig deeper, as there could be something going on with your NTP, network, VMware hosts, VM guests, AD, etc. But since high availability isn't a priority based on your setup, it's not worth the time investment to investigate further.