Hyper-V cluster nightmare: 252, 5120 events and more

vafran 121 Reputation points
2021-02-13T20:35:55.177+00:00

Hi all,

We are having a nightmare issue, occurring more or less once per month.

Suddenly one or more nodes of our Windows Server 2016 Hyper-V cluster stop responding completely, so we must reboot them. Last time 4 out of 8 nodes went down, first two and then another two.

All I can see are Event ID 252 entries for the VMSwitches and Event ID 5120 entries for lost CSVs.
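In case it helps anyone reproduce the triage, the two event IDs can be pulled from every node with something like the sketch below (the node names are placeholders, and the vSwitch channel name may differ slightly by OS build):

```powershell
# Hypothetical node names; replace with your own.
$nodes = 'HV01','HV02'

Invoke-Command -ComputerName $nodes -ScriptBlock {
    # Hyper-V vSwitch 252 warnings (low-resource packet queueing)
    Get-WinEvent -FilterHashtable @{
        LogName = 'Microsoft-Windows-Hyper-V-VmSwitch-Operational'
        Id      = 252
    } -ErrorAction SilentlyContinue |
        Select-Object MachineName, TimeCreated, Message

    # Failover Clustering 5120 events (CSV entered a paused/lost state)
    Get-WinEvent -FilterHashtable @{
        LogName      = 'System'
        Id           = 5120
        ProviderName = 'Microsoft-Windows-FailoverClustering'
    } -ErrorAction SilentlyContinue |
        Select-Object MachineName, TimeCreated, Message
}
```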

The 252 events look crazy, or maybe I do not understand them, but the amount of memory they report is very high. For example:

Memory allocated for packets in a vRss queue (on CPU 0) on switch 8893CCCF-B197-4A55-A3D6-7350D9D44731 (Friendly Name: LAN) due to low resource on the physical NIC has increased to 66049MB. Packets will be dropped once queue size reaches 512MB.

The cluster has 8 nodes connected to storage on a Fibre Channel SAN. We opened cases with Microsoft, the storage vendor, and the switch vendor, but never found anything.
I cannot figure out whether the root cause is CSVs being lost, with the redirected CSV traffic over Ethernet then collapsing the network cards on the failed nodes, or whether the network cards themselves are hitting a bottleneck.

Network is like below:

1 Gbps NICs
A team of 3 NICs for management, the vSwitch, and cluster traffic.
A team of 2 NICs for the backup network and live migration (DPM uses this one).
1 NIC for the heartbeat (HB) network, second in the live migration priority order.
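For anyone comparing against their own setup, the cluster network roles/metrics and the live migration network order can be checked from any node; a hedged sketch (module and parameter availability assumed for Windows Server 2016):

```powershell
# Lower Metric = preferred for cluster/CSV redirected traffic.
# Role: 1 = cluster only, 3 = cluster and client.
Get-ClusterNetwork | Format-Table Name, Role, Metric, AutoMetric

# Live migration network preference order (FailoverClusters module).
Get-ClusterResourceType -Name 'Virtual Machine' |
    Get-ClusterParameter -Name MigrationNetworkOrder
```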

I do NOT have VMQ enabled, but I do have RSS enabled on the hosts.
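To confirm the VMQ/RSS state on each host at a glance, something like this should do it (a quick sketch; on 1 GbE adapters VMQ is commonly left disabled, and RSS only applies to NICs not bound to a vSwitch):

```powershell
# Per-adapter VMQ state (expect Enabled = False here, per the setup above).
Get-NetAdapterVmq | Format-Table Name, Enabled, BaseProcessorNumber, MaxProcessors

# Per-adapter RSS state and processor profile.
Get-NetAdapterRss | Format-Table Name, Enabled, Profile, MaxProcessors
```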

We also noticed that pings are dropped more often than they should be, both between nodes and between VMs on different nodes.

Get-NetAdapterRss

Name : vEthernet (BCK)
InterfaceDescription : Hyper-V Virtual Ethernet Adapter #2
Enabled : True
NumberOfReceiveQueues :
Profile :
BaseProcessor: [Group:Number] : :
MaxProcessor: [Group:Number] : :
MaxProcessors :
RssProcessorArray: [Group:Number/NUMA Distance] :
IndirectionTable: [Group:Number] :

Name : vEthernet (LAN)
InterfaceDescription : Hyper-V Virtual Ethernet Adapter
Enabled : True
NumberOfReceiveQueues :
Profile :
BaseProcessor: [Group:Number] : :
MaxProcessor: [Group:Number] : :
MaxProcessors :
RssProcessorArray: [Group:Number/NUMA Distance] :
IndirectionTable: [Group:Number] :

Name : LAN 2
InterfaceDescription : Intel(R) Gigabit 4P I350-t rNDC #4
Enabled : True
NumberOfReceiveQueues : 2
Profile : Closest
BaseProcessor: [Group:Number] : :0
MaxProcessor: [Group:Number] : :
MaxProcessors : 8
RssProcessorArray: [Group:Number/NUMA Distance] :
IndirectionTable: [Group:Number] :

Name : HB
InterfaceDescription : Intel(R) Gigabit 4P I350-t rNDC #2
Enabled : True
NumberOfReceiveQueues : 2
Profile : Closest
BaseProcessor: [Group:Number] : :0
MaxProcessor: [Group:Number] : :
MaxProcessors : 8
RssProcessorArray: [Group:Number/NUMA Distance] :
IndirectionTable: [Group:Number] :

Name : BCK1
InterfaceDescription : Intel(R) Gigabit 4P I350-t rNDC
Enabled : True
NumberOfReceiveQueues : 2
Profile : Closest
BaseProcessor: [Group:Number] : :0
MaxProcessor: [Group:Number] : :
MaxProcessors : 8
RssProcessorArray: [Group:Number/NUMA Distance] :
IndirectionTable: [Group:Number] :

Name : LAN 1
InterfaceDescription : Intel(R) Gigabit 4P I350-t rNDC #3
Enabled : True
NumberOfReceiveQueues : 2
Profile : Closest
BaseProcessor: [Group:Number] : :0
MaxProcessor: [Group:Number] : :
MaxProcessors : 8
RssProcessorArray: [Group:Number/NUMA Distance] :
IndirectionTable: [Group:Number] :

Name : LAN
InterfaceDescription : Microsoft Network Adapter Multiplexor
Enabled : True
NumberOfReceiveQueues :
Profile :
BaseProcessor: [Group:Number] : :
MaxProcessor: [Group:Number] : :
MaxProcessors :
RssProcessorArray: [Group:Number/NUMA Distance] :
IndirectionTable: [Group:Number] :

Name : SLOT 3 Puerto 4
InterfaceDescription : Broadcom NetXtreme Gigabit Ethernet #4
Enabled : True
NumberOfReceiveQueues : 1
Profile :
BaseProcessor: [Group:Number] : :0
MaxProcessor: [Group:Number] : :
MaxProcessors : 16
RssProcessorArray: [Group:Number/NUMA Distance] :
IndirectionTable: [Group:Number] :

Name : SLOT 3 Puerto 3
InterfaceDescription : Broadcom NetXtreme Gigabit Ethernet
Enabled : True
NumberOfReceiveQueues : 1
Profile :
BaseProcessor: [Group:Number] : :0
MaxProcessor: [Group:Number] : :
MaxProcessors : 16
RssProcessorArray: [Group:Number/NUMA Distance] :
IndirectionTable: [Group:Number] :

Name : BCK2
InterfaceDescription : Broadcom NetXtreme Gigabit Ethernet #3
Enabled : True
NumberOfReceiveQueues : 1
Profile :
BaseProcessor: [Group:Number] : :0
MaxProcessor: [Group:Number] : :
MaxProcessors : 16
RssProcessorArray: [Group:Number/NUMA Distance] :
IndirectionTable: [Group:Number] :

Name : LAN3
InterfaceDescription : Broadcom NetXtreme Gigabit Ethernet #2
Enabled : True
NumberOfReceiveQueues : 1
Profile :
BaseProcessor: [Group:Number] : :0
MaxProcessor: [Group:Number] : :
MaxProcessors : 16
RssProcessorArray: [Group:Number/NUMA Distance] :
IndirectionTable: [Group:Number] :


Accepted answer
  1. vafran 121 Reputation points
    2021-04-18T15:58:04.883+00:00

I finally found the culprit: the Exchange CAS servers on NLB were set to Multicast IGMP, but the switches were not enabled for IGMP. As soon as IGMP was enabled on the switches everything started to run smoothly; in fact, performance improved a lot in general.
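For anyone hitting the same thing: the NLB operation mode can be verified from one of the NLB hosts with the sketch below (assumes the NLB feature's PowerShell module is installed; property names may vary slightly by module version). IGMP snooping must then be enabled on the physical switches, with syntax that varies by vendor.

```powershell
# On one of the NLB (Exchange CAS) hosts.
Import-Module NetworkLoadBalancingClusters

# OperationMode should read Unicast, Multicast, or IGMP Multicast.
Get-NlbCluster | Format-List Name, OperationMode
```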


2 additional answers

  1. vafran 121 Reputation points
    2021-02-13T21:03:19.663+00:00

    Actually, it cannot be the shared team, because on the node that failed yesterday the cluster network is not on the same team as the virtual switch.

    Unfortunately not all nodes are identical; the newer ones have more NICs, and on those we separated the vSwitch into a dedicated team with no cluster traffic, yet we are still having this issue.

    Can anyone make sense of this? Does my attempt to explain the issue make sense?

    Thank you.


  2. Xiaowei He 9,876 Reputation points
    2021-02-17T03:09:55.167+00:00

    Hi,

    According to your description, the issue seems to be related to the cluster network. I would suggest you try disabling RSS and then restarting the cluster; check whether the issue still exists without RSS.

    Besides, which network cards are in the nodes? We can check whether there is any known issue with them. Please also update the network card drivers and install Windows updates on the nodes.
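    The RSS test above can be sketched roughly as below; a hedged outline, not a definitive procedure. Test on one drained node first, since this affects live traffic.

    ```powershell
    # Drain the roles off this node before changing anything.
    Suspend-ClusterNode -Name $env:COMPUTERNAME -Drain

    # Disable RSS on all adapters on this node.
    Disable-NetAdapterRss -Name '*'

    # Restart the cluster service and bring the node back.
    Restart-Service -Name ClusSvc
    Resume-ClusterNode -Name $env:COMPUTERNAME
    ```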

    Thanks for your time!
    Best Regards,
    Anne
