Some points to note here.
- Patch the server with all Windows OS Updates and restart it.
- Temporarily disable the antivirus on both servers and run the validation again.
The following threads discuss the same issue. Try the troubleshooting steps in them and see whether they resolve it.
Cluster Network Validation - fail UDP port 3343
S2D Cluster Validation Fails Firewall and UDP Port 3343
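If you want to check UDP reachability between the nodes independently of the validation report, a minimal sketch along the following lines may help. It is not what cluster validation itself does. It assumes you run a small echo listener on one node and the probe on the other, and the function names (`serve_echo_once`, `probe_udp`, `loopback_self_test`) are made up for this example. On a node where the Cluster service is already running, port 3343 may already be in use, so the listener may need a different port there.

```python
import socket
import threading

CLUSTER_PORT = 3343  # UDP port named in the validation error


def serve_echo_once(sock: socket.socket) -> None:
    """Answer a single datagram by echoing it back to the sender."""
    data, addr = sock.recvfrom(1024)
    sock.sendto(data, addr)


def probe_udp(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if an echo listener at host:port answers within timeout."""
    payload = b"udp-probe"
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        try:
            s.sendto(payload, (host, port))
            reply, _ = s.recvfrom(1024)
        except OSError:
            # Covers timeouts and ICMP "port unreachable" style errors.
            return False
        return reply == payload


def loopback_self_test() -> bool:
    """Run listener and probe on 127.0.0.1 using an ephemeral port."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    port = listener.getsockname()[1]
    worker = threading.Thread(target=serve_echo_once, args=(listener,), daemon=True)
    worker.start()
    ok = probe_udp("127.0.0.1", port)
    worker.join(timeout=2.0)
    listener.close()
    return ok


if __name__ == "__main__":
    print("loopback self-test passed:", loopback_self_test())
```

Between two real nodes you would bind the listener to the storage or management IP on one node and call `probe_udp(<other node IP>, CLUSTER_PORT)` from the other. A `False` result only tells you that no echo came back. It does not say whether a firewall, antivirus filter, or VLAN mismatch dropped the datagram.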
--If the reply is helpful, please Upvote and Accept it as an answer--
The Intel NICs are X710.
I'll try and get a screenshot when I'm next in the office, but it just listed all the Mellanox NICs against each other with a port 3343 error.
There is no vSwitch on the Mellanox NICs, only on the Intels.
Ping works fine between nodes, and I can ping the IPs assigned to the Mellanox NICs.
Weird thing: I've since created the cluster, which worked great. Now that the cluster is built, I've run the cluster tests again and they came back error free.
It's only the test cluster report from before the cluster was created that fails.
The switch ports for the Mellanox NICs have VLAN tagging, but the VLAN ID isn't set on the adapters in Windows. They are set to VLAN 0; if I change that to match the VLAN tag of the switch port, it breaks comms.
I've tested live migration over the Mellanox NICs, which works fine, which makes me think the earlier error was a red herring....
Sorry I missed this reply last week!
So with the Intel X710 NICs, I've seen this issue a few times. Are you using untagged VLANs on the Intel NICs for management traffic? If you change to tagged VLANs for management as well, the error generally goes away.
As for the Mellanox NICs, in order for RoCE to work correctly, you need to configure VLANs, PFC, ETS and DCB correctly. If you are unable to set the NICs on the host to match the VLAN on the switch without it breaking, then the switches aren't configured correctly. The switch port needs to be configured as a trunk port with the storage VLANs tagged, and then the NICs on the host need to be tagged (VLANID set) with the matching storage VLAN.
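To make the tagging rule concrete, here is a deliberately simplified model of how a switch port classifies incoming frames. It is a sketch, not a description of any particular switch: the `SwitchPort` class, the `ingress_vlan` function, and the VLAN number 711 are all hypothetical, and real switches have more options (for example, some also accept frames tagged with the native VLAN).

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

UNTAGGED = 0  # a host adapter set to VLAN 0 sends frames without a VLAN ID


@dataclass(frozen=True)
class SwitchPort:
    tagged_vlans: FrozenSet[int]   # VLANs the port accepts with an 802.1Q tag
    native_vlan: Optional[int]     # VLAN given to untagged frames (None = dropped)


def ingress_vlan(port: SwitchPort, host_vlan_id: int) -> Optional[int]:
    """Return the VLAN the frame lands in, or None if the port drops it."""
    if host_vlan_id == UNTAGGED:
        return port.native_vlan
    if host_vlan_id in port.tagged_vlans:
        return host_vlan_id
    return None


STORAGE_VLAN = 711  # hypothetical storage VLAN, for illustration only

# Port set up as described above: storage VLAN carried tagged on a trunk.
trunk_tagged = SwitchPort(tagged_vlans=frozenset({STORAGE_VLAN}), native_vlan=1)

# Port where the storage VLAN is only the untagged (native/access) VLAN.
untagged_member = SwitchPort(tagged_vlans=frozenset(), native_vlan=STORAGE_VLAN)

if __name__ == "__main__":
    for name, port in (("trunk_tagged", trunk_tagged),
                       ("untagged_member", untagged_member)):
        for vid in (UNTAGGED, STORAGE_VLAN):
            print(name, "host VLAN", vid, "->", ingress_vlan(port, vid))
```

In this model, `untagged_member` shows the pattern where VLAN 0 on the host works and setting the matching VLAN ID breaks communication. That would be one possible explanation if a port behaves that way, not a diagnosis. It also matters for RoCE because the PFC priority value travels in the 802.1Q tag, so untagged storage traffic may not carry it.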
Live migration won't fail if RoCE is misconfigured, but the misconfiguration will increase CPU usage. It will also hurt the latency and storage performance of S2D, and it can cause timeouts that affect reliability during patching operations.
Thanks again for the reply. I went ahead and built the cluster, and since then there hasn't been a single error or warning in the reports.
The test-rdma scripts and the various PS commands querying flow control and the like all come back good.
My only slight concern is the VLAN ID on the storage NICs. The switch ports are in a trunk with the VLAN tagged, but if I match the ID in Windows, comms fail. If I leave them alone, everything works and the NICs can only communicate with each other, not the rest of the network, so it looks good. I've noticed that our other HV cluster with an iSCSI SAN is configured the same way.
I'm unsure why those port errors were coming up in the reports before, but the servers had only just been built, so maybe some Windows updates fixed it.