Based on your description, firstly, I need to clarify two things:
1) If possible, avoid adding servers that run a different Windows Server version from your existing nodes. That is to say, install Server 2016 instead of 2019. Mixing server versions in the same cluster may cause problems.
2) If your storage configuration has not changed since you initially built the cluster, I think your issue is not related to ReFS/NTFS modes.
Since your issue occurred when you set these two VLANs to traverse the aggregated connection, I think it may be caused by the network configuration, such as NIC settings. Even if the servers can ping each other, you can still encounter NIC issues. Did your validation report any other warnings? In particular, did it report any NIC-related errors, critical events, or warnings?
Besides, you can check the System event log for any relevant entries logged during your VLAN change and cluster validation: %SystemRoot%\System32\winevt.
Thanks for your support!
Well, even after troubleshooting the network switches and making the recommended changes, the problem still remains. In fact, we had a serious outage this morning because of this issue.
The networking guys found that the LACP period was specified as short. That was causing LACP timeouts. They suggested we remove that setting from the LAG. We did, and things seemed to be stable (after adding the two VLANs to the LAG). However, overnight, FCM started losing its connection to the CSVs, just like it was doing earlier.
For now, I have reverted the network change (removed the VLANs from the LAG), and things seem to be back to normal.
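For reference, here is a rough Python sketch of why a short LACP period is less forgiving. The timer values are the defaults from IEEE 802.1AX (1 s / 3 s for short, 30 s / 90 s for long); I have not confirmed what our switches actually use, and the names `LACP_TIMERS` and `partner_expires` are just made up for this illustration.

```python
# Default LACP timers from IEEE 802.1AX, in seconds.
# Assumption: the switches follow the standard values; vendors may differ.
LACP_TIMERS = {
    "short": {"pdu_interval": 1, "timeout": 3},
    "long": {"pdu_interval": 30, "timeout": 90},
}


def partner_expires(mode, gap_seconds):
    """Return True if a gap without LACPDUs is long enough to expire the partner."""
    return gap_seconds >= LACP_TIMERS[mode]["timeout"]


# A 5-second stall in LACPDUs is enough to expire a link in short mode,
# but not in long mode.
print(partner_expires("short", 5))  # True
print(partner_expires("long", 5))   # False
```

So with the short period, any stall of a few seconds on the LAG can drop the link, while the long period tolerates much longer gaps.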
This is the exact error I am seeing:
Event ID 5120
Cluster Shared Volume 'Volume4' ('CSV4 SSD deduped') has entered a paused state because of 'STATUS_CONNECTION_DISCONNECTED(c000020c)'. All I/O will temporarily be queued until a path to the volume is reestablished.
(same error for all other CSVs we have)
I do not know what to do. I could call Microsoft tech support, but I am afraid they will say it is a networking issue, because this happens when I add the VLAN to the LAG.
Where do I even start?
Here is another error I see in conjunction with that one:
Event ID 5142
Cluster Shared Volume 'Volume4' ('CSV4 SSD 100k deduped') is no longer accessible from this cluster node because of error '(1460)'. Please troubleshoot this node's connectivity to the storage device and network connectivity.
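In case it helps anyone compare these events across all the CSVs, below is a minimal Python sketch that pulls the volume name and error code out of messages like the two above. The lookup table only covers the two codes seen in this thread: c000020c is named in the event itself, and 1460 is the Win32 code ERROR_TIMEOUT. The function name `parse_csv_event` and the regular expressions are my own, written against the message text as pasted here, so treat them as assumptions rather than a documented format.

```python
import re

# Codes seen in this thread. 1460 is the Win32 error ERROR_TIMEOUT;
# c000020c is the NTSTATUS value named in the 5120 message itself.
KNOWN_CODES = {
    "c000020c": "STATUS_CONNECTION_DISCONNECTED",
    "1460": "ERROR_TIMEOUT",
}

# Matches: Cluster Shared Volume 'Volume4' ('CSV4 SSD deduped')
VOLUME_RE = re.compile(r"Cluster Shared Volume '([^']+)' \('([^']+)'\)")
# Matches: 'STATUS_CONNECTION_DISCONNECTED(c000020c)' or '(1460)'
CODE_RE = re.compile(r"'(\w*)\((\w+)\)'")


def parse_csv_event(message):
    """Extract volume, label, code, and a readable meaning from a CSV event message."""
    vol = VOLUME_RE.search(message)
    code = CODE_RE.search(message)
    if not vol or not code:
        return None
    raw = code.group(2).lower()
    return {
        "volume": vol.group(1),
        "label": vol.group(2),
        "code": raw,
        # Prefer the name printed in the message; fall back to the lookup table.
        "meaning": code.group(1) or KNOWN_CODES.get(raw, "unknown"),
    }


event_5142 = (
    "Cluster Shared Volume 'Volume4' ('CSV4 SSD 100k deduped') is no longer "
    "accessible from this cluster node because of error '(1460)'."
)
print(parse_csv_event(event_5142))
```

Error 1460 being a timeout fits with the 5120 disconnect: in both cases the node lost its path to the volume for too long, rather than the volume itself reporting a fault.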