Windows 2016 Failover Cluster Node Down

robaw1700 1 Reputation point
2023-05-22T12:10:31.4366667+00:00

Hyper-V Cluster Windows Server 2016

Nodes: 2

 

We performed a firmware update on the core network switches at a location that has a Cisco UCS chassis with two blades running a two-node Windows Server 2016 Hyper-V cluster. Before the update, all roles were moved to Node 01, and Node 02 was paused with its roles drained for maintenance. The firmware update on the stacked switches completed without issues, but afterwards the cluster showed as offline. Right-clicking the cluster and bringing it online works, but Node 2 is down and Failover Cluster Manager offers no option to resume the node. The cluster event logs show Event ID 5120, Event ID 1135, and Event ID 1177.

 

The Live Migration NICs show as Unavailable on the node that is not active in the cluster. Both nodes can ping each other on the Live Migration VLAN IPs as well as the management IPs. Cluster validation reports an issue with port 3343 on the Live Migration network, but telnet from each node to the other on port 3343 works. Connectivity between the two nodes seems to be restored; both nodes can ping each other on all addresses.
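
The checks described above could be repeated from PowerShell roughly as follows. This is only a sketch: `node02` is a placeholder for the other node's name or Live Migration IP, and the FailoverClusters module is assumed to be available in an elevated session. Note that `Test-NetConnection`, like telnet, only tests TCP, while the cluster service also uses UDP port 3343 for heartbeats, so a successful TCP test does not prove that heartbeat traffic is passing.

```powershell
# TCP reachability of the cluster port on the other node (placeholder name).
Test-NetConnection -ComputerName node02 -Port 3343

# State of each node and each cluster network interface as the cluster sees it.
Get-ClusterNode | Format-Table Name, State
Get-ClusterNetworkInterface | Format-Table Name, Node, Network, State
```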

Hyper-V
A Windows technology providing a hypervisor-based virtualization solution enabling customers to consolidate workloads onto a single server.
Windows Server Clustering
Windows Server: A family of Microsoft server operating systems that support enterprise-level management, data storage, applications, and communications. Clustering: The grouping of multiple servers in a way that allows them to appear to be a single unit to client computers on a network. Clustering is a means of increasing network capacity, providing live backup in case one of the servers fails, and improving data security.

3 answers

  1. Ian Xue (Shanghai Wicresoft Co., Ltd.) 29,486 Reputation points Microsoft Vendor
    2023-05-23T09:46:13.13+00:00

    Hi,

    Please first install the latest NIC driver and check whether any heavy task was running when the issue happened.

    If anti-virus software is installed, exclude the following items from scanning:

    C:\Windows\Cluster

    C:\ClusterStorage

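    Clussvc.exe

    If the issue still occurs, you can increase the heartbeat tolerance. Open Windows PowerShell as an administrator and run the commands below.

    # Heartbeat interval (ms) and missed-heartbeat threshold, same subnet
    (Get-Cluster).SameSubnetDelay = 2000
    (Get-Cluster).SameSubnetThreshold = 10
    # Heartbeat interval (ms) and missed-heartbeat threshold, across subnets
    (Get-Cluster).CrossSubnetDelay = 4000
    (Get-Cluster).CrossSubnetThreshold = 20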
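
    Before changing these values, it can help to record what the cluster currently uses so the change can be reverted. A minimal sketch, assuming the FailoverClusters PowerShell module is available on the node and the session is elevated; the defaults differ between Windows Server versions, so no expected output is shown:

    ```powershell
    # Show the current heartbeat settings. Delays are in milliseconds;
    # thresholds are the number of missed heartbeats tolerated.
    Get-Cluster | Format-List SameSubnetDelay, SameSubnetThreshold, CrossSubnetDelay, CrossSubnetThreshold
    ```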
    

    Best Regards,

    Ian Xue


  2. Limitless Technology 43,926 Reputation points
    2023-05-23T11:47:38.98+00:00

    Hello there,

    Event ID 1135 indicates that one or more Cluster nodes were removed from the active failover cluster membership. It may be accompanied by the following symptoms:

    Nodes being removed from active Failover Cluster membership: Having a problem with nodes being removed from active Failover Cluster membership

    Event ID 1069: Event ID 1069 - Clustered Service or Application Availability

    Event ID 1177 for quorum loss: Event ID 1177 - Quorum and Connectivity Needed for Quorum

    Event ID 1006 for Cluster service halted: Event ID 1006 - Cluster Service Startup

    The following article helps you diagnose and resolve Event ID 1135, which may be logged during the startup of the Cluster service in a Failover Clustering environment.

    https://learn.microsoft.com/en-us/windows-server/troubleshoot/troubleshooting-cluster-event-id-1135

    Hope this resolves your query.


  3. robaw1700 1 Reputation point
    2023-05-23T12:14:59.5633333+00:00

    The issue occurred when a firmware update was applied to the core switch that the UCS chassis is connected to, causing a network disruption for the cluster. The cluster log shows that Node 1 and Node 2 are attempting to communicate, but Node 1 has a stale route left over from the network update and closes the connection. Running Validate Cluster with the Network tests shows the Live Migration network interfaces as unreachable on each node, but telnet and ping show that connectivity is OK.

    Is there a way to clear the Stale Route without shutting down the cluster and rebooting Node 1 and Node 2?

    00001208.00004338::2023/05/23-11:50:46.669 INFO [VER] Got new TCP connection. Exchanging version data.

    00001208.00004338::2023/05/23-11:50:46.669 INFO [VER] Checking version compatibility for node node02 id 2 with following versions: highest [Major 9 Minor 1 Upgrade 8 ClusterVersion 0x00090008], lowest [Major 9 Minor 1 Upgrade 8 ClusterVersion 0x00090008].

    00001208.00004338::2023/05/23-11:50:46.669 INFO [VER] Version check passed: node and cluster highest supported versions match. Other node only supports highest level, so joining in uplevel mode.

    00001208.00004338::2023/05/23-11:50:46.670 INFO [SV] Negotiating message security level.

    00001208.00004338::2023/05/23-11:50:46.670 INFO [SV] Already protecting connection with message security level 'Sign'.

    00001208.00004338::2023/05/23-11:50:46.670 INFO [FTI] Got new raw TCP/IP connection.

    00001208.00004338::2023/05/23-11:50:46.670 INFO [FTI][Initiator] This node (1) is initiator

    00001208.00004338::2023/05/23-11:50:46.670 WARN [FTI][Initiator] Ignoring duplicate connection: stale route not yet cleaned up

    00001208.00004338::2023/05/23-11:50:46.670 INFO [CHANNEL 10.100.10.5:~60724~] graceful close, status (of previous failure, may not indicate problem) (0)

    00001208.00004338::2023/05/23-11:50:46.670 INFO [CORE] Node 1: Clearing cookie efe55ffe-9b1e-4783-a345-7972a322fb77

    00001208.00004338::2023/05/23-11:50:46.670 INFO [CORE] Node 1: Cookie Cache f0204db4-91b5-4421-abf4-7e43c17e79ab [Node02]

    00001208.00004338::2023/05/23-11:50:46.670 WARN mscs::ListenerWorker::operator (): GracefulClose(1226)' because of 'channel to remote endpoint 10.100.10.5:~60724~ is closed'

    0 comments No comments