How to fix a 2 node S2D cluster when one node had multiple drives replaced

Dave Baddorf 20 Reputation points
2024-03-11T18:14:52.8766667+00:00

Hello!

I have a customer with a 2-node Microsoft Hyper-V S2D cluster running Windows Server 2022. The storage pool spans both servers (Microsoft's S2D mirroring) with twelve 2 TB SSDs per server.

One server had three SSD drives fail all at once (I'm not really sure how this happened). Let's call this failed Hyper-V server V1, while the fully working server is V2. V2 is currently running 25 VMs with the storage pool in a degraded state.

The question is once these SSD drives on V1 are replaced, how do I proceed?

  1. Can I repair the cluster volumes on V1? What process would I follow to do this with multiple drive failures? Is it OK to boot V1 on the network (where it can connect with V2) even though multiple SSD drives have been replaced?
  2. Do I need to remove V1 from the cluster? Is that even possible on a two-node cluster?
  3. Do I need to destroy the cluster and rebuild from the ground up? I really don't want to do this because the working V2 has 25 running VMs and their data.

Basically, I'd like the safest option for bringing V1 back into operation in the cluster without having to take down V2, or at least a way to restore the VM data on the cluster volumes.

If any further information is needed, I'll be glad to try to get it to you.

Any direction or insight on how to proceed would be greatly appreciated!

Thanks so much, Dave

Windows Server Clustering

Accepted answer
Ian Xue (Shanghai Wicresoft Co., Ltd.) 29,891 Reputation points Microsoft Vendor
    2024-03-15T10:04:22.13+00:00

    Hi Dave,

    Hope you're doing well.

    Please refer to the following steps to replace the drives:

    1. Check the status of the storage subsystem and the storage jobs with PowerShell:

       Get-StorageSubSystem Cluster* | Get-StorageJob

    2. Review the physical disk footprint:

       Get-PhysicalDisk | ft DeviceId,FriendlyName,SerialNumber,UniqueId,*Status,*Foot*,Usage,PhysicalLocation

    3. Retire the failed disk (one specific disk):

       Get-PhysicalDisk -SerialNumber XXXXXXXXX | Set-PhysicalDisk -Usage Retired

    4. Check the footprint of that disk (by SerialNumber) and monitor until it drops to zero:

       Get-PhysicalDisk -SerialNumber XXXXXXXXX | ft DeviceId,FriendlyName,SerialNumber,UniqueId,*Status,*Foot*,Usage,PhysicalLocation

    5. Remove the retired disk from the storage pool:

       $FailedDisk = Get-PhysicalDisk -SerialNumber XXXXXXXXX

       $Pool = $FailedDisk | Get-StoragePool -IsPrimordial:$false

       Remove-PhysicalDisk -StoragePool $Pool -PhysicalDisks $FailedDisk

    6. Turn on the LED of the physical disk so it is easier to identify in the server rack:

       Get-PhysicalDisk -SerialNumber XXXXXXXXX | Enable-PhysicalDiskIdentification

    7. Physically remove the failed disk from the server chassis.

    8. Physically install the new disk into the server chassis.

    9. Turn the LED off again, piping in the disk for that slot (for example, by the new disk's serial number):

       Get-PhysicalDisk -SerialNumber YYYYYYYYY | Disable-PhysicalDiskIdentification

    10. Add the new physical disk to the pool:

        $disk = Get-PhysicalDisk | Where-Object CanPool -eq $true

        Get-StoragePool S2D* | Add-PhysicalDisk -PhysicalDisks $disk

    11. List the virtual disks and repair them (the repair job may also start on its own):

        Get-VirtualDisk

        Repair-VirtualDisk -FriendlyName <VirtualDiskName>

    12. Wait until the repair completes:

        Get-StorageJob

    13. Once the repair storage jobs are complete, repeat the steps above for any other capacity disks that need to be replaced (for several failed disks at once, see the sketch below).
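    Since three drives failed at the same time, the retire/remove part of these steps can also be wrapped in a loop. This is only a sketch, not an official procedure: the serial numbers are placeholders for your actual failed drives, it assumes the pool uses the default friendly name starting with "S2D", and S2D will usually claim the new drives automatically, so the manual Add-PhysicalDisk may not be needed.

       # Placeholder serial numbers of the failed drives - replace with your own
       $FailedSerials = 'SERIAL1','SERIAL2','SERIAL3'

       foreach ($Serial in $FailedSerials) {
           # Retire the failed disk so Storage Spaces stops using it
           $FailedDisk = Get-PhysicalDisk -SerialNumber $Serial
           $FailedDisk | Set-PhysicalDisk -Usage Retired

           # Remove it from the (non-primordial) pool it belongs to
           $Pool = $FailedDisk | Get-StoragePool -IsPrimordial:$false
           Remove-PhysicalDisk -StoragePool $Pool -PhysicalDisks $FailedDisk -Confirm:$false
       }

       # After the replacement drives are installed: add anything poolable
       # (skip this if the cluster has already claimed the new drives on its own)
       $NewDisks = Get-PhysicalDisk | Where-Object CanPool -eq $true
       if ($NewDisks) {
           Get-StoragePool S2D* | Add-PhysicalDisk -PhysicalDisks $NewDisks
       }

       # Kick off repairs and watch the rebuild until the jobs finish
       Get-VirtualDisk | Repair-VirtualDisk
       Get-StorageJob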

    Best Regards,

    Ian Xue


    If the Answer is helpful, please click "Accept Answer" and upvote it.


3 additional answers

  1. Net Runner 505 Reputation points
    2024-03-12T15:38:26.6433333+00:00

    Hi,

    1. As far as I remember, you cannot repair the cluster volumes on an isolated V1 host as long as it has no connection to the healthy V2, because on a 2-node Storage Spaces Direct cluster there is no local redundancy within a single node; there is nothing for V1 to repair from on its own (see the quick check after this list). The data has to be resynchronized from V2, so booting V1 and letting it reconnect to V2 is the way to proceed.
    2. No, you don't need to remove V1 from the cluster (removing a node is possible even on a two-node cluster, but it isn't necessary here).
    3. There is no need to destroy the cluster.
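    To see why a single node cannot self-heal, you can inspect the resiliency layout of the volumes. This is a read-only check, and the exact values depend on how the volumes were created, but a two-node S2D two-way mirror typically shows two data copies spread across storage scale units, i.e. one copy per server and no second copy inside the same node:

       # Run on either node: shows how many data copies exist and across which fault domain they are spread
       Get-VirtualDisk |
           Format-Table FriendlyName, ResiliencySettingName, NumberOfDataCopies, FaultDomainAwareness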

    As you have already mentioned, the safest option is to bring back V1. Before letting it connect with V2, make sure you have removed the failed disks from the S2D pool (that does not happen automatically) and that the new disks are added to the S2D pool as well.
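    Before reconnecting V1, a few read-only checks from the healthy node (V2) can confirm that the failed disks are really out of the pool and that the replacement disks are visible. This is a sketch that assumes the default pool friendly name starting with "S2D"; none of these commands change anything:

       # Pool and volume health
       Get-StoragePool S2D* | Format-Table FriendlyName, HealthStatus, OperationalStatus
       Get-VirtualDisk | Format-Table FriendlyName, HealthStatus, OperationalStatus

       # Disks still marked Retired or Lost Communication have not been removed from the pool yet
       Get-PhysicalDisk | Format-Table FriendlyName, SerialNumber, HealthStatus, OperationalStatus, Usage

       # Replacement disks that are not yet in the pool show CanPool = True
       Get-PhysicalDisk | Where-Object CanPool -eq $true

       # Resync/rebuild jobs once V1 is back online
       Get-StorageJob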

    Make sure you have backups of your virtual machines currently located on V2 since Storage Spaces Direct 2-node clusters are known to sometimes replicate problems to the healthy node. You have to be quick since, as I mentioned above, your V2 is running without local redundancy. If even a single disk fails, you will lose all the data and will have to recreate the whole cluster and pool from scratch and restore virtual machines from backups.

    Local reconstruction (redundancy within each node) is one of the reasons I prefer using Virtual SAN https://www.starwindsoftware.com/vsan software instead of Storage Spaces Direct for smaller 2/3-node clusters.

    I wish you the best of luck with reviving your cluster!


  2. Alex Bykovskyi 1,831 Reputation points
    2024-03-12T19:41:44.34+00:00

    Thanks for mentioning StarWind.

    Hey,

    StarWind VSAN works great in 2- or 3-node configurations. It can be deployed on top of either hardware or software RAID, depending on your needs, while VSAN replicates the data across the nodes. Check this for more information: https://www.starwindsoftware.com/storage-spaces-direct

    Cheers,

    Alex Bykovskyi

    StarWind Software

    Note: Posts are provided “AS IS” without warranty of any kind, either expressed or implied, including but not limited to the implied warranties of merchantability and/or fitness for a particular purpose.


  3. Dave Baddorf 20 Reputation points
    2024-03-29T02:23:04.43+00:00

    I was able to get the cluster operational again even with three failed disks. First, on the working server, I ran "Get-PhysicalDisk -SerialNumber <SN> | Set-PhysicalDisk -Usage Retired" for each of the three failed drives (Get-PhysicalDisk didn't return anything on the bad server while it was disconnected from the good server). Then I brought up the failed server and let it connect to the working cluster node. I believe it automatically added the good drives and removed the failed drives (which had already been physically pulled); at least, I got errors when I tried to remove the failed drives from the storage pool manually (Remove-PhysicalDisk -StoragePool $pool -PhysicalDisks $FailedDisk) and add the replacement drives (Get-StoragePool S2D | Add-PhysicalDisk -PhysicalDisks $disk), which suggests the pool had already handled them. In hindsight, I should have just watched "Get-StorageJob" after bringing both systems back together before running any manual commands. Thanks for everyone's help!
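    For anyone hitting the same situation later, a minimal way to watch the resync after the repaired node rejoins is simply to poll the storage jobs. This is a sketch using only the in-box Storage cmdlets and assuming the default "S2D*" pool name:

       # Run from either node once both are back in the cluster.
       # Poll the rebuild/resync jobs every 30 seconds until none are running.
       do {
           $jobs = Get-StorageJob
           $jobs | Format-Table Name, JobState, PercentComplete, BytesProcessed, BytesTotal
           Start-Sleep -Seconds 30
       } while ($jobs | Where-Object JobState -eq 'Running')

       # Confirm everything is healthy afterwards
       Get-VirtualDisk | Format-Table FriendlyName, HealthStatus, OperationalStatus
       Get-StoragePool S2D* | Format-Table FriendlyName, HealthStatus, OperationalStatus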
