Brand-new cluster experiences catastrophic failures when the primary MC-LAG peer (Dell VLT) reloads
This may be outside the scope of this support forum, but I'm hoping someone here has experience with a cluster attached to Dell switches configured in a VLT domain...
We have a four-node Microsoft Failover Cluster, with each server equipped with a pair of NICs configured in a Switch Embedded Team (SET). Each NIC in the team is cabled to one of the two peers in the VLT domain, a single link to each peer. The VLT domain connects to an “access” switch for client traffic via a VLT port-channel running LACP.
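For context, the SET vSwitch on each node was created along these lines (the vSwitch and adapter names here are placeholders; the rest follows Microsoft's SET documentation):

```powershell
# Create the Switch Embedded Team across the two physical NICs,
# one cabled to each VLT peer. SET is switch-independent by design,
# so there is no LACP between the servers and the switches.
New-VMSwitch -Name "SETswitch" `
    -NetAdapterName "pNIC1", "pNIC2" `
    -EnableEmbeddedTeaming $true `
    -AllowManagementOS $true

# Hyper-V Port load balancing, per the SET guidance
Set-VMSwitchTeam -Name "SETswitch" -LoadBalancingAlgorithm HyperVPort
```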
We have followed best practices and the official documentation to ensure that SET and VLT are configured correctly. However, during fault/failure simulations we consistently observe catastrophic outages on the cluster, but only when the tests are run against the primary VLT peer. Symptoms include nodes being dropped from the cluster; VMs failing, crashing, or entering a paused state; and Cluster Shared Volumes (CSVs) disconnecting.
For example, either of the following will put the cluster into a failed state and cost us network connectivity for an unacceptably long time:
- Reloading the primary VLT peer, whether by pulling power or by issuing the `reload` command
- Administratively shutting down all of the primary peer's ports at once: the server-facing ports, the VLT port-channel uplink, and the VLTi (roughly as sketched below)
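For the second scenario, the commands on the primary looked something like this (interface numbers and ranges are placeholders for ours):

```
configure terminal
! server-facing ports
interface range ethernet1/1/1-1/1/4
 shutdown
! uplink to the access switch
interface port-channel 10
 shutdown
! VLTi members
interface range ethernet1/1/31-1/1/32
 shutdown
```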
In contrast, failing the individual links to the servers results in graceful failover. Killing the VLTi from the secondary VLT peer also fails over gracefully, and so does reloading the secondary peer outright.
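For reference, the core of our VLT domain config on each peer looks roughly like this (we're running OS10; the domain ID, IPs, MAC, and interface names below are placeholders):

```
vlt-domain 1
 backup destination 10.0.0.2
 discovery-interface ethernet1/1/31
 discovery-interface ethernet1/1/32
 peer-routing
 primary-priority 4096
 vlt-mac 44:38:39:ff:00:01
!
! uplink to the access switch, defined on both peers
interface port-channel 10
 description uplink-to-access
 switchport mode trunk
 vlt-port-channel 10
```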
We expect each peer to handle failures the same way, but they clearly do not. We have a feeling this is a switch issue, but we're not certain of that; it could also be a configuration problem with our cluster's networking. We’re out of ideas… and almost out of drywall to bang our heads against. Any assistance would be greatly appreciated.