Storage Spaces Direct Cluster (Validation fails on port 3343 over the Mellanox NICs)

Question

Glen Harrison 6 Reputation points

Hi Everyone,

I am building a 4-node Storage Spaces Direct cluster running Windows Server 2022.

Each node has two dual-port NICs: an Intel 10 GbE and a Mellanox 100 GbE.

When running the cluster validation test, is it normal to see errors on the Mellanox NICs for port 3343?

My config is:

Intel nic0 and nic1 attached to a SET vSwitch (vNICs for management, cluster, live migration)

Mellanox nic0 and nic1 for storage

The report is all green except for this one error. It's not the firewall, as I've checked that the ports are allowed.

Thanks!!

Ben Thomas 6 Reputation points MVP

2022-01-22T19:48:15.347+00:00

Hey,

What model are the Intel NICs?
Can you provide a screenshot of the validation error?

The Mellanox NICs are not configured in any SET team, correct? Have you made sure they're tagged for the right VLANs, and that Network QoS is configured properly for RoCE on both the host NICs and the switch interfaces they're connected to?

Are there any issues with pinging the various Mellanox interfaces from each host?

Glen Harrison 6 Reputation points

2022-01-23T13:45:23.4+00:00

The Intel NICs are X710.

I'll try to get a screenshot when I'm next in the office, but it just listed every pairing of Mellanox NICs between the nodes with a port 3343 error.
There is no vSwitch on the Mellanox NICs, only on the Intels.
Ping works fine between nodes, and I can ping the IPs assigned to the Mellanox NICs.

Weird thing: I've since created the cluster, which worked great. Now that the cluster is built, I've rerun the cluster tests and they came back error free?

It's only the validation report run before the cluster is created that fails.

The switch ports for the Mellanox NICs have VLAN tagging, but the VLAN ID is not set on the adapters in Windows. They are set to VLAN 0; if I change that to match the VLAN tag of the switch port, it breaks comms.

I've tested live migration over the Mellanox NICs and it works fine, which makes me think the earlier error was a red herring....

Ben Thomas 6 Reputation points MVP

2022-01-30T22:05:45.047+00:00

Sorry I missed this reply last week!

So with the Intel X710 NICs, I've seen this issue a few times. Are you using untagged VLANs on the Intel NICs for management traffic? If you change to tagged VLANs for management as well, the error generally goes away.

As for the Mellanox NICs, for RoCE to work correctly you need to configure VLANs, PFC, ETS and DCB correctly. If setting the VLAN ID on the host NICs to match the switch breaks connectivity, then the switches aren't configured correctly. The switch port needs to be configured as a trunk port with the storage VLANs tagged, and the NICs on the host then need to be tagged (VLANID set) with the matching storage VLAN.
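
One reason the host-side VLAN ID matters here: PFC acts on the 3-bit priority (PCP) carried inside the 802.1Q VLAN tag, so a frame sent without a tag has no place to carry that priority at layer 2. The sketch below, in Python purely for illustration, packs and unpacks that tag to show where the priority and VLAN ID sit. Priority 3 and VLAN 711 are assumed example values and do not come from this thread; the function names are hypothetical.

```python
import struct

TPID_8021Q = 0x8100  # EtherType value that marks an 802.1Q tag


def build_vlan_tag(pcp: int, vid: int, dei: int = 0) -> bytes:
    """Return the 4-byte 802.1Q tag: TPID followed by the TCI (PCP | DEI | VID)."""
    if not 0 <= pcp <= 7:
        raise ValueError("PCP (priority) must be 0-7")
    if dei not in (0, 1):
        raise ValueError("DEI must be 0 or 1")
    if not 0 <= vid <= 4094:
        raise ValueError("VID must be 0-4094 (4095 is reserved)")
    # TCI layout: 3 bits PCP, 1 bit DEI, 12 bits VLAN ID
    tci = (pcp << 13) | (dei << 12) | vid
    return struct.pack("!HH", TPID_8021Q, tci)


def parse_vlan_tag(tag: bytes) -> dict:
    """Split a 4-byte 802.1Q tag back into its PCP, DEI and VID fields."""
    tpid, tci = struct.unpack("!HH", tag)
    if tpid != TPID_8021Q:
        raise ValueError("not an 802.1Q tag")
    return {"pcp": tci >> 13, "dei": (tci >> 12) & 1, "vid": tci & 0x0FFF}


if __name__ == "__main__":
    # Assumed example: storage traffic marked priority 3 on VLAN 711
    tag = build_vlan_tag(pcp=3, vid=711)
    print(tag.hex(), parse_vlan_tag(tag))
```

In 802.1Q, a VID of 0 means the tag carries only a priority and no VLAN membership, which is why the VLAN ID on the adapter and the tagging on the switch port have to be considered together.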

Live migration won't fail if RoCE is misconfigured, but the misconfiguration will increase CPU usage. It will also hurt the latency and storage performance of S2D, and can cause timeouts that impact reliability during patching operations.

Glen Harrison 6 Reputation points

2022-01-31T09:18:17.72+00:00

Thanks again for the reply. I went ahead and built the cluster, and since then there hasn't been a single error or warning in the reports.

The Test-RDMA scripts and the various PS commands querying flow control and the like all come back good.

My only slight concern is the VLAN ID on the storage NICs. The switch ports are in a trunk with the VLAN tagged, but if I match the ID in Windows, comms fail. If I leave them alone, everything works and the NICs can only communicate with each other, not the rest of the network, so it looks good. I've noticed that our other Hyper-V cluster with an iSCSI SAN is configured the same way.

I'm unsure why those port errors were coming up in the reports before, but the servers had only just been built, so maybe some Windows updates fixed it.

2 answers

Answer 1

Hi there,

Some points to note here.

Patch each server with all Windows OS updates and restart it.
Try disabling the antivirus on the servers and run the validation again.

The threads below discuss the same issue; their troubleshooting steps may help you sort it out.

Cluster Network Validation - fail UDP port 3343
https://learn.microsoft.com/en-us/answers/questions/249241/cluster-network-validation-fail-udp-port-3343.html

S2D Cluster Validation Fails Firewall and UDP Port 3343
https://social.technet.microsoft.com/Forums/office/en-US/c3e15170-2a83-48a8-b671-efc2a9afe4cf/s2d-cluster-validation-fails-firewall-and-udp-port-3343?forum=winserverfiles

Answer 2

@Glen Harrison any update here? I ran into the same issue with a newly deployed Server 2022 cluster (4x Dell AX740xd, SMB traffic via QLogic QL41262 over Cisco N9K-C93180YC-EX switches).

When we started patching last week (for the April patch day), it took 7 minutes after the first node had rebooted until it rejoined the cluster. Failover Cluster Manager throws the error

Cluster node 'HYPERVISOR01' failed to join the cluster because it could not communicate over the network with any other node in the cluster. Verify network connectivity and configuration of any network firewalls.

followed by

Cluster failed to start. The latest copy of cluster configuration data was not available within the set of nodes attempting to start the cluster. Changes to the cluster occurred while the set of nodes were not in membership and as a result were not able to receive configuration data updates. . Votes required to start cluster: 2 Votes available: 0 Nodes with votes: HYPERVISOR02 HYPERVISOR03 HYPERVISOR04 Guidance: Attempt to start the cluster service on all nodes in the cluster so that nodes with the latest copy of the cluster configuration data can first form the cluster. The cluster will be able to start and the nodes will automatically obtain the updated cluster configuration data. If there are no nodes available with the latest copy of the cluster configuration data, run the 'Start-ClusterNode -FQ' Windows PowerShell cmdlet. Using the ForceQuorum (FQ) parameter will start the cluster service and mark this node's copy of the cluster configuration data to be authoritative. Forcing quorum on a node with an outdated copy of the cluster database may result in cluster configuration changes that occurred while the node was not participating in the cluster to be lost.

In my host-based firewall logs, I noticed:

DROP TCP 10.100.0.10 192.168.100.10 51199 3343 0 - 0 0 0 - - - SEND 13592

which is weird, because 10.100.0.10 is my management IP and 192.168.100.10 is on the SMB-A network. Why does the management network try to communicate with the SMB-A network, which is unrouted?!

Cheers
Miranda
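
To make the dropped-packet line above easier to read, here is a minimal Python sketch that splits it into named fields and reports which network each address belongs to. The field order is assumed to follow the `#Fields:` header of the Windows Firewall log (pfirewall.log) with the leading date and time columns left out, as in the quoted line. The /24 prefix lengths and the function names are assumptions for illustration; only the two IP addresses and the network names come from the post.

```python
import ipaddress

# Assumed field order: pfirewall.log "#Fields:" header without date and time
FIELDS = [
    "action", "protocol", "src_ip", "dst_ip", "src_port", "dst_port",
    "size", "tcpflags", "tcpsyn", "tcpack", "tcpwin",
    "icmptype", "icmpcode", "info", "path", "pid",
]

# Assumed subnets: the post gives the addresses, not the prefix lengths
NETWORKS = {
    "management": "10.100.0.0/24",
    "SMB-A": "192.168.100.0/24",
}


def parse_line(line: str) -> dict:
    """Map one whitespace-separated firewall log line onto FIELDS."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))


def network_of(ip: str, networks: dict):
    """Return the name of the first network containing ip, or None."""
    addr = ipaddress.ip_address(ip)
    for name, cidr in networks.items():
        if addr in ipaddress.ip_network(cidr):
            return name
    return None


if __name__ == "__main__":
    line = "DROP TCP 10.100.0.10 192.168.100.10 51199 3343 0 - 0 0 0 - - - SEND 13592"
    entry = parse_line(line)
    src = network_of(entry["src_ip"], NETWORKS)
    dst = network_of(entry["dst_ip"], NETWORKS)
    print(f"{entry['action']} {entry['protocol']} {src} -> {dst} "
          f"port {entry['dst_port']} ({entry['path']}, pid {entry['pid']})")
```

Read this way, the entry is an outbound (SEND) TCP connection to port 3343 that the local firewall dropped, with a source address on the management network and a destination on SMB-A. The pid column identifies the sending process on the host, which may help narrow down what initiated it.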
