Hyper-V Cluster Issues - Event ID 252 & 10400

Monarch 66 Reputation points
2020-09-25T15:16:09.997+00:00

I have a 4 node Hyper-V Cluster running on Windows Server 2016 Datacenter. It is connected to a Tegile/Tintri T4700 storage array via Fibre Channel utilizing Cluster Shared Volumes (CSVs).

NIC resets have been occurring for as far back as I can trace, each generating one of the following warnings in the System Event Log. The resets have occurred both during general operations and during Hyper-V Live Migrations of VMs. The last one occurred on 5/17/2020 at 8:57am during a Live Migration of VMs to the node on which the NIC reset occurred.

The network interface "HPE Ethernet 10Gb 2-port 560FLR-SFP+ Adapter #2" has begun resetting. There will be a momentary disruption in network connectivity while the hardware resets.
Reason: The network driver did not respond to an OID request in a timely fashion.

The network interface "HPE Ethernet 10Gb 2-port 562FLR-SFP+ Adapter" has begun resetting. There will be a momentary disruption in network connectivity while the hardware resets.
Reason: The network driver detected that its hardware has stopped responding to commands.

Starting in August of 2019, a NIC reset would also cause issues, detailed below, on the cluster node where it occurred.

In my research I found this TechNet forum posting discussing the NIC resetting issue, specifically the section about 1/3 of the way down:
https://social.technet.microsoft.com/Forums/en-US/7b95bc5b-02d1-4dbb-a341-0517ae30cd9e/vms-will-get-stuck-stopping-and-unable-to-migrate-servers-from-that-host?forum=winserverhyperv
“I had a ticket lodged with Microsoft support. While they didn't fix the issue, I ended up finding the root cause. One of the SFP+ Adapters was generating a 10400 NDIS event stating that the driver detected that the hardware wasn't responding to instructions, so Windows would then reset the adapter. The Adapter was part of a NIC team which was then used for a vswitch in Hyper-V. For some reason when the adapter gets reset, it generates an error with the vswitch which then seems to completely break the VMMS service.
Microsoft has offered no explanation as to why this happens. The point of NIC teaming is so that if one adapter drops, everything can keep working. We ended up updating drivers, and I logged a call with the OEM to get firmware and other updates done. All we can do now is cross our fingers that it doesn't error again.”

This is what is happening here. The VMMS service is so broken that I have to shut down every VM on the node with the issue. I then try to restart the node, but it gets stuck shutting down and I have to force a power off; when it resets, the VMs move to a different node and start back up. Not good.

I have also received Event ID 252 in the System Event Log: “Memory allocated for packets in a vRss queue (on CPU 28) on switch C0978781-75EF-47B4-B9BC-6463064735A0 (Friendly Name: Team_Trunked) due to low resource on the physical NIC has increased to 256MB. Packets will be dropped once queue size reaches 512MB.” This event has preceded some of the NIC resets.

I continued to have 252 events and 10400 NIC resets, particularly during live migrations, after switching to a converged networking model. I decided to move the live migration traffic to a separate team of NICs in an attempt to keep live migrations from putting Hyper-V into an unusable state. NIC resets during live migration stopped after I made the change in May. My HPE engineer also recommended setting the "Maximum Number of RSS Queues" to a higher number to help alleviate the 252 events.

From 5/18/20 - 8/23/20 I had zero issues and thought I finally was in the clear. Wrong!

On 8/24 at 2:55 PM on Node 4, one of the 10Gb NICs in the team for the Hyper-V-VmSwitch reset (10400 event). No issues occurred with Hyper-V or the cluster because it was only 1 NIC of the team. One thing to note is that this was the 1st day of classes for the fall semester on our campus.

As you can see from the list of events below, I continued to have some 252 and 10400 events, but they did not break the Hyper-V Virtual Machine Management service until 9/23/20. On that day 2 nodes of the cluster, Virtualsrv3 and Virtualsrv4, experienced NIC resets on both 10Gb NICs of the NIC team used by the Hyper-V-VmSwitch.

I have had support cases open with Microsoft and HPE, but no one has been able to explain why this continues to happen. Microsoft said to increase the "Receive Buffers" NIC setting from 512 to 2048, but that did not help either.
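
For anyone wanting to check the same two driver settings on their own nodes, the following is a minimal sketch rather than a verified procedure. The adapter name `NIC1` is a placeholder, and the display names are taken from the text above; they are driver-specific and may differ on other adapters.

```powershell
# 'NIC1' is a placeholder; list real adapter names with Get-NetAdapter.
$adapter = 'NIC1'

# Show the current values of the two driver settings named in the post.
# Display names are driver-specific, so these may not match on other NICs.
Get-NetAdapterAdvancedProperty -Name $adapter |
    Where-Object DisplayName -in 'Receive Buffers', 'Maximum Number of RSS Queues' |
    Format-Table Name, DisplayName, DisplayValue -AutoSize

# Show the RSS configuration as Windows sees it.
Get-NetAdapterRss -Name $adapter

# To change a value (this can briefly restart the adapter), something like:
# Set-NetAdapterAdvancedProperty -Name $adapter -DisplayName 'Receive Buffers' -DisplayValue '2048'
```

The Set line is left commented out on purpose, since changing an advanced property can interrupt traffic on that adapter.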

I also had a TechNet forum post going for some time on this as well:
https://social.technet.microsoft.com/Forums/en-US/ad05bf98-2a2f-423f-83a6-284b5fd1265e/cluster-node-event-252-cluster-service-crashed?forum=winserverhyperv

If anyone has had this issue and found an answer to it please let me know. Thank you.

8/28/2020
Virtualsrv4 – 252 (256MB) – 4:52:04 PM CDT
Virtualsrv1 – 252 (256MB) – 4:52:04 PM
Virtualsrv3 – 252 (256MB) – 4:52:05 PM
Virtualsrv1 – 252 (512MB) – 4:52:07 PM
Virtualsrv4 – 252 (512MB) – 4:52:08 PM
Virtualsrv4 – 10400 - 4:52:13 PM (Team_Trunked)
Virtualsrv3 – 10400 - 4:52:14 PM (Team_Trunked)
Virtualsrv1 – 10400 - 4:52:17 PM (Team-LM)
Only 1 NIC reset per team, so no cluster issues came of it
There were no Live Migrations going on at the time of these events
Veeam backups were occurring at this time

8/31/2020
Virtualsrv1 – 252 (256MB) – 2:51:10 PM
Virtualsrv2 – 252 (256MB) – 2:51:10 PM
No Veeam backups were occurring at this time

9/9/2020
Virtualsrv1 – 252 (256MB) – 7:13:54 PM
Virtualsrv2 – 252 (256MB) – 7:13:54 PM
Virtualsrv3 – 252 (256MB) – 7:13:55 PM
Virtualsrv4 – 252 (256MB) – 7:13:54 PM
No Veeam backups were occurring at this time

9/16/2020
Virtualsrv1 – 252 (256MB) – 4:55:05 PM
Virtualsrv2 – 252 (256MB) – 4:55:05 PM
Virtualsrv3 – 252 (256MB) – 4:55:05 PM
Veeam backups started at 4:30pm

9/23/2020
Virtualsrv1 – 252 (256MB) – 7:03:31 PM (CPU 12)
Virtualsrv1 – 252 (512MB) – 7:03:34 PM (CPU 12)
Virtualsrv1 – 252 (256MB) – 7:04:23 PM (CPU 26)
Virtualsrv1 – 252 (512MB) – 7:04:26 PM (CPU 26)
Virtualsrv1 – 10400 - 7:04:33 PM (Team-LM)
Virtualsrv2 – 252 (256MB) – 7:03:31 PM (CPU 86)
Virtualsrv2 – 252 (512MB) – 7:03:33 PM (CPU 86)
Virtualsrv2 – 252 (256MB) – 7:04:23 PM (CPU 80)
Virtualsrv2 – 252 (512MB) – 7:04:26 PM (CPU 80)
Virtualsrv3 – 252 (256MB) – 7:03:39 PM (CPU 2)
Virtualsrv3 – 10400 - 7:03:40 PM (Team-Trunked) different nics
Virtualsrv3 – 10400 - 7:03:41 PM (Team-Trunked) different nics
Virtualsrv3 – 252 (512MB) – 7:03:52 PM (CPU 2)
Virtualsrv3 – 252 (256MB) – 7:04:24 PM (CPU 54)
Virtualsrv3 – 252 (512MB) – 7:04:27 PM (CPU 54)
Virtualsrv3 – 252 (256MB) – 7:04:33 PM (CPU 12)
Virtualsrv3 – 252 (512MB) – 7:04:36 PM (CPU 12)
Virtualsrv3 – 10400 - 7:04:41 PM (Team-Trunked) same nic
Virtualsrv3 – 10400 - 7:05:05 PM (Team-Trunked) same nic
Virtualsrv4 – 252 (256MB) – 7:03:31 PM (CPU 126)
Virtualsrv4 – 252 (512MB) – 7:03:34 PM (CPU 126)
Virtualsrv4 – 10400 - 7:03:37 PM (Team-Trunked) different nics
Virtualsrv4 – 10400 - 7:03:39 PM (Team-Trunked) different nics
Virtualsrv4 – 252 (256MB) – 7:04:24 PM (CPU 126)
Virtualsrv4 – 252 (512MB) – 7:04:27 PM (CPU 126)
Veeam backups had just started at 7:00pm
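
A timeline like the one above can be pulled from all nodes at once instead of being read out of each System log by hand. This is a rough sketch under some assumptions: the node names are the ones used in this post, the start date is an arbitrary example, and remote event log access must be allowed on each node.

```powershell
# Node names come from the post; the start date is an arbitrary example.
$nodes  = 'Virtualsrv1', 'Virtualsrv2', 'Virtualsrv3', 'Virtualsrv4'
$filter = @{ LogName = 'System'; Id = 252, 10400; StartTime = (Get-Date '2020-08-24') }

$events = foreach ($node in $nodes) {
    # SilentlyContinue suppresses the error raised when a node has no
    # matching events (note that it also hides connection errors).
    Get-WinEvent -ComputerName $node -FilterHashtable $filter -ErrorAction SilentlyContinue
}

# Merge all nodes into one chronological list.
$events |
    Sort-Object TimeCreated |
    Format-Table TimeCreated, MachineName, Id, ProviderName -AutoSize
```

Sorting by TimeCreated across nodes makes it easier to see whether the 252 warnings consistently precede the 10400 resets.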


1 answer

  1. Paul B 5 Reputation points
    2023-03-09T15:16:30.11+00:00

    We had this problem too, with X710 cards in PowerEdge 650s on Windows 2019 connected to Cisco Nexus 9k.

    Our fix was to disable VMQ on each 10G card.

    Dell's advice also included disabling PROSet, which we didn't have. The full advice from Dell is below:
    To implement the workaround, run the following commands on each cluster node (as admin):

    1. "C:\Program Files\Intel\Umb\Winx64\PROSETDX\DxSetup.exe" DMIX=0 /qn
    2. Disable-NetAdapterQos -Name <adapter name>
    3. Disable-NetAdapterVmq -Name <adapter name>

    The first command disables Intel DMIX, also known as Intel PROSet. The network driver continues to function; only the PROSet feature is disabled. Due to its parameters, this command will return an error if run from a PowerShell prompt but will run correctly from a command prompt.

    The second and third commands should be run multiple times, with the -Name parameter referencing each management adapter in turn.
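
    As a rough sketch of running steps 2 and 3 against several adapters in turn: the adapter names below are placeholders and are not part of Dell's advice, so substitute the names reported by Get-NetAdapter on each node.

    ```powershell
    # Hypothetical adapter names; list the real ones with Get-NetAdapter.
    $adapterNames = @('NIC1', 'NIC2')

    foreach ($name in $adapterNames) {
        # Steps 2 and 3 from the advice above, applied to one adapter at a time.
        Disable-NetAdapterQos -Name $name
        Disable-NetAdapterVmq -Name $name
    }

    # Confirm the result: the Enabled column should read False for each adapter.
    Get-NetAdapterVmq -Name $adapterNames | Format-Table Name, Enabled
    Get-NetAdapterQos -Name $adapterNames | Format-Table Name, Enabled
    ```

    Run this from an elevated PowerShell session on each cluster node.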
