Windows 2016 Failover Cluster Node Down

Question

Hyper-V Cluster Windows Server 2016

Nodes: 2

We performed a firmware update on the core network switches at a location with a Cisco UCS chassis. The chassis has 2 blades running Windows Server 2016 as a two-node Hyper-V cluster. Prior to the firmware update, all roles were moved to Node 01, and Node 02 was paused with its roles drained for maintenance. The firmware update on the stacked switches completed without issues, but afterwards the cluster showed as offline. Right-clicking the cluster and bringing it online in Failover Cluster Manager works, but Node 02 is down and there is no option to resume the node. The cluster event logs show Event ID 5120, Event ID 1135, and Event ID 1177.

The Live Migration NICs show as Unavailable on the node that is not active in the cluster. Both nodes can ping each other on the Live Migration VLAN IPs as well as the Management IPs. Cluster validation reports an issue with port 3343 on the Live Migration network, but telnet from each node to the other on port 3343 works. Connectivity between the two nodes appears to be restored; both nodes can ping each other on all addresses.
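
The node and interface states described above can also be read from PowerShell, which is sometimes easier to compare across nodes than the Failover Cluster Manager view. The following is a minimal sketch, assuming the FailoverClusters module is available on the node; the name node02 is a placeholder for the peer node's real name or Live Migration IP.

```powershell
# Run from an elevated PowerShell session on either cluster node.
Import-Module FailoverClusters

# Membership state of each node (for example Up, Down, Paused).
Get-ClusterNode | Format-Table Name, State

# State of each cluster network and of each node's interface on it.
Get-ClusterNetwork | Format-Table Name, State, Role
Get-ClusterNetworkInterface | Format-Table Node, Network, Name, State

# TCP reachability of the cluster port on the peer node.
# 'node02' is a placeholder; substitute the peer's name or Live Migration IP.
Test-NetConnection -ComputerName 'node02' -Port 3343
```

Like telnet, Test-NetConnection checks TCP only. Cluster heartbeats also use UDP port 3343, so a successful TCP test does not by itself show that heartbeat traffic is getting through.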

Answer

Hi,

Please first install the latest NIC driver and check whether any heavy task was running when the issue happened.

If anti-virus software is installed, exclude the following items from scanning:

C:\Windows\Cluster

C:\ClusterStorage

Clussvc.exe

If the issue still occurs, you can increase the heartbeat tolerance. Open Windows PowerShell as administrator and run the commands below.

(Get-Cluster).SameSubnetDelay = 2000
(Get-Cluster).SameSubnetThreshold = 10
(Get-Cluster).CrossSubnetDelay = 4000
(Get-Cluster).CrossSubnetThreshold = 20
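
Before changing these values, it can help to record the current ones so the change can be reverted later. A minimal sketch, assuming it is run in an elevated session on a cluster node:

```powershell
# List the current heartbeat delay and threshold settings for the cluster.
Get-Cluster | Format-List *subnet*
```

The delay values are the interval between heartbeats in milliseconds; the thresholds are the number of missed heartbeats tolerated before a node is considered down.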

Best Regards,

Ian Xue


Answer

Hello there,

Event ID 1135 indicates that one or more cluster nodes were removed from the active failover cluster membership. It may be accompanied by the following symptoms:

Cluster nodes being removed from active Failover Cluster membership: Having a problem with nodes being removed from active Failover Cluster membership

Event ID 1069: Clustered Service or Application Availability

Event ID 1177 for quorum loss: Quorum and Connectivity Needed for Quorum

Event ID 1006 for Cluster service halted: Cluster Service Startup

This article helps you diagnose and resolve Event ID 1135, which may be logged during the startup of the Cluster service in Failover Clustering environment.

https://learn.microsoft.com/en-us/windows-server/troubleshoot/troubleshooting-cluster-event-id-1135
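
To see when these events were logged on a node, the System event log can be filtered by ID. This is a minimal sketch, assuming the events were written by the Microsoft-Windows-FailoverClustering provider and that the IDs of interest are the ones reported in the question (5120, 1135, 1177):

```powershell
# Query the System log for the cluster events mentioned in this thread.
$filter = @{
    LogName      = 'System'
    ProviderName = 'Microsoft-Windows-FailoverClustering'
    Id           = 1135, 1177, 5120
}

# Get-WinEvent reports an error when nothing matches, so suppress that case.
Get-WinEvent -FilterHashtable $filter -MaxEvents 50 -ErrorAction SilentlyContinue |
    Sort-Object TimeCreated |
    Format-Table TimeCreated, Id, LevelDisplayName, Message -Wrap
```

Running the same query on both nodes and comparing timestamps against the switch firmware update window shows whether the membership loss lines up with the network disruption.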

Hope this resolves your query!


Answer

The issue occurred when a firmware update was applied to the core switch that the UCS chassis is connected to, causing a network disruption for the cluster. According to the cluster log, Node 1 and Node 2 are attempting to communicate, but Node 1 still holds a stale route from before the network update and closes the connection. Running Validate Cluster with the Network test shows the Live Migration network interfaces as unreachable on each node, but telnet and ping show that connectivity is OK.

Is there a way to clear the stale route without shutting down the cluster and rebooting Node 1 and Node 2?

00001208.00004338::2023/05/23-11:50:46.669 INFO [VER] Got new TCP connection. Exchanging version data.

00001208.00004338::2023/05/23-11:50:46.669 INFO [VER] Checking version compatibility for node node02 id 2 with following versions: highest [Major 9 Minor 1 Upgrade 8 ClusterVersion 0x00090008], lowest [Major 9 Minor 1 Upgrade 8 ClusterVersion 0x00090008].

00001208.00004338::2023/05/23-11:50:46.669 INFO [VER] Version check passed: node and cluster highest supported versions match. Other node only supports highest level, so joining in uplevel mode.

00001208.00004338::2023/05/23-11:50:46.670 INFO [SV] Negotiating message security level.

00001208.00004338::2023/05/23-11:50:46.670 INFO [SV] Already protecting connection with message security level 'Sign'.

00001208.00004338::2023/05/23-11:50:46.670 INFO [FTI] Got new raw TCP/IP connection.

00001208.00004338::2023/05/23-11:50:46.670 INFO [FTI][Initiator] This node (1) is initiator

00001208.00004338::2023/05/23-11:50:46.670 WARN [FTI][Initiator] Ignoring duplicate connection: stale route not yet cleaned up

00001208.00004338::2023/05/23-11:50:46.670 INFO [CHANNEL 10.100.10.5:~60724~] graceful close, status (of previous failure, may not indicate problem) (0)

00001208.00004338::2023/05/23-11:50:46.670 INFO [CORE] Node 1: Clearing cookie efe55ffe-9b1e-4783-a345-7972a322fb77

00001208.00004338::2023/05/23-11:50:46.670 INFO [CORE] Node 1: Cookie Cache f0204db4-91b5-4421-abf4-7e43c17e79ab [Node02]

00001208.00004338::2023/05/23-11:50:46.670 WARN mscs::ListenerWorker::operator (): GracefulClose(1226)' because of 'channel to remote endpoint 10.100.10.5:~60724~ is closed'
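
Lines like the ones above can be pulled out of a freshly generated cluster log rather than read by hand. This is a minimal sketch, assuming a scratch folder C:\Temp\ClusterLogs (a hypothetical path) and a 30-minute window; adjust both to suit.

```powershell
# Folder for the generated logs; the path is an assumption, any writable folder works.
$dest = 'C:\Temp\ClusterLogs'
New-Item -ItemType Directory -Path $dest -Force | Out-Null

# Generate a log for each node covering the last 30 minutes, in local time.
Get-ClusterLog -Destination $dest -TimeSpan 30 -UseLocalTime | Out-Null

# Show each "stale route" warning with three lines of context on either side.
Get-ChildItem -Path $dest -Filter '*.log' |
    Select-String -Pattern 'stale route' -SimpleMatch -Context 3, 3
```

Comparing the matches from both nodes shows whether the warning is still being logged on each new connection attempt or has stopped.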
