Failover cluster: Connecting VLANs across switches causes FC CSV to fail. Why?

Question

I have a rather strange problem.

Situation:
I have a 5-node Server 2016 cluster for Hyper-V, built from 5 old standalone HP servers. Storage is a Fibre Channel (FC) connected 3PAR array. Networking is provided by 10 Gb HP TOR switches. We recently bought a batch of new Dell blade servers to replace the aging HP servers, and I have installed Server 2019 on them. The plan is to add them to the existing cluster. The Dell servers have their own 25 Gb Ethernet Dell blade switches, and FC switches as well.
I have VLANs for cluster comms and live migration on both the HP and Dell switches. Right now these 2 VLANs are not set to traverse the 40+40 Gb Bridge-Aggregation/Port-Channel connection between the two switches.

Problem:
The moment I set these two VLANs to traverse the aggregated connection, the cluster starts having problems. Failover Cluster Manager (FCM) complains of losing connectivity to the CSVs, and VMs go down.

Interesting points to note:

> All my CSVs are connected over Fiber, so it should have nothing to do with the network.

Although I have heard that formatting CSVs as ReFS can cause trouble, because ReFS-formatted CSVs always run in redirected mode, which I guess means the network is involved. The thing is, only one of my 4 CSVs is the old ReFS style (which I am getting rid of soon). All the other CSVs are NTFS formatted, but they still fail.

> I am NOT adding Dell servers as new nodes to the cluster.

At least not yet. I merely have to connect the network between the HP and Dell switches to make the CSVs fail. The CSVs fail even if I keep the new Dell servers powered down. That means something is wrong with the networking. Loop?

> Networking seems to work properly

When I connect the VLANs, I can ping perfectly fine between the HP and Dell servers, over the 2 VLANs. If I remove the config (to traverse) I can no longer ping...which is expected.

> I ran the cluster validation wizard while this was happening, but it did not find anything obvious.

I did have one failure: "List cluster resources". The report basically says it was not able to get the configuration of some VMs. Here is the exact error message:

An error occurred while executing the test.
The operation has failed. An error occurred while retrieving the private properties for the resource 'Virtual Machine XXXXXX'.
A cluster resource failed

I am guessing this is because the CSVs are not accessible by the cluster, so it cannot read the VMs' configuration.

Any tips/ideas on why this is happening?

Thanks,
Raj

Answer

Hi,

Based on your description, firstly, I need to clarify two things:

1) If possible, do not add servers running a different OS version from your original nodes. That is to say, install Server 2016 instead of 2019. Mixing server versions in the same cluster may cause problems.

2) If your storage configuration has been the same since you initially built your cluster, I think your issue is not related to the ReFS/NTFS formatting.

Since your issue occurred when you set these two VLANs to traverse the aggregated connection, I think it may be caused by the network configuration, such as NIC settings. Even if you can ping between these servers, you can still encounter NIC issues. Did your validation report contain any other warnings? In particular, did it contain any NIC-related errors, critical events, or warnings?

Besides, you can check the System event log for any relevant entries logged while you changed the VLAN settings and ran cluster validation: %SystemRoot%\System32\winevt.

Thanks for your support!

BR,
Joan

If the Answer is helpful, please click "Accept Answer" and upvote it.

Note: Please follow the steps in our documentation to enable e-mail notifications if you want to receive the related email notification for this thread.

Answer

Hi,

Thanks for your reply!

Event 5120 indicates that Cluster Shared Volumes (CSV) observed an error and attempted to recover. When CSV recovery does not succeed, Event 5142 is logged to the System event log. Many things can lead to Event 5120, not just a network configuration failure. I think you should first check whether your nodes are working well, especially the one that your problematic volume4 is stored on. Here is an article:
https://learn.microsoft.com/en-us/troubleshoot/windows-server/backup-and-storage/event-5120-5142-access-clusterstorage-folder
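To narrow things down, it may help to pull only the 5120 and 5142 entries out of an exported copy of the System log and see which node and time they cluster around. Below is a minimal Python sketch of that idea. It assumes the log has been exported to CSV text with an "Event ID" column; the column names, the `Node` field, and the sample rows are hypothetical placeholders, not taken from your environment.

```python
import csv
import io

# Event IDs discussed above: 5120 (CSV error, recovery attempted)
# and 5142 (CSV recovery did not succeed).
CSV_EVENT_IDS = {5120, 5142}


def filter_csv_events(rows, id_column="Event ID", wanted=CSV_EVENT_IDS):
    """Return only the rows whose event ID is in `wanted`.

    `rows` is any iterable of dicts; `id_column` is an assumed column name.
    Rows with a missing or non-numeric ID are skipped.
    """
    matches = []
    for row in rows:
        raw = (row.get(id_column) or "").strip()
        if raw.isdigit() and int(raw) in wanted:
            matches.append(row)
    return matches


def load_rows(text):
    """Parse CSV text (with a header line) into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


# Hypothetical sample standing in for a CSV export of the System log.
SAMPLE = """Date and Time,Source,Event ID,Node
2021-01-01 10:00:00,Microsoft-Windows-FailoverClustering,5120,NODE1
2021-01-01 10:00:05,Microsoft-Windows-FailoverClustering,1135,NODE2
2021-01-01 10:00:09,Microsoft-Windows-FailoverClustering,5142,NODE1
"""

if __name__ == "__main__":
    for event in filter_csv_events(load_rows(SAMPLE)):
        print(event["Date and Time"], event["Event ID"], event["Node"])
```

If the matching entries all fall right after the VLAN change, that would at least support the timing you described.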

You should also check whether you have any unused but active network adapters.

I also found an article that discusses the possible cases in which recovery does not succeed and Events 5120 and 5142 occur. You can compare your situation against the cases listed there:
https://argonsys.com/microsoft-cloud/library/troubleshooting-cluster-shared-volume-recovery-failure-system-event-5142/

(Please note: Information posted in the given link is hosted by a third party. Microsoft does not guarantee the accuracy and effectiveness of information.)

Actually, as you said yourself, contacting Microsoft tech support may be the quickest way forward. You don't have to worry that our engineers will reject your case because they think it is a network issue. When we scope your issue, we capture your network packets and analyze them carefully to determine whether the issue is network related. We will also call you and work on your computer remotely in real time to troubleshoot. If the issue turns out not to be network related partway through troubleshooting, we will transfer it to the relevant pods, or open a task to work with them, so that we can solve it together. In short, your issue will be handled carefully and efficiently by our engineers.

In addition, if the issue is proven to be a system flaw, the consulting fee will be refunded. You can find the phone number for your region from the link below.

Global Customer Service phone numbers:

https://support.microsoft.com/en-us/help/13948/global-customer-service-phone-numbers

You can just call the phone number corresponding to your location.

Thanks for your support and understanding! And if you think my work is helpful, I would appreciate it if you could support my work by clicking Accept Answer.

BR,
Joan
