How to fix a 2 node S2D cluster when one node had multiple drives replaced

Dave Baddorf 20 Reputation points
2024-03-11T18:14:52.8766667+00:00

Hello!

I have a customer with a 2-node Microsoft Hyper-V S2D cluster running Windows Server 2022. The storage pool spans both servers (Microsoft's S2D mirroring) with twelve 2 TB SSDs per server.

One server had three SSD drives fail all at once (I'm not really sure how this happened). Let's call this failed Hyper-V server V1, while the fully working server is V2. V2 is currently running 25 VMs with the storage pool in a degraded state.

The question is once these SSD drives on V1 are replaced, how do I proceed?

  1. Can I repair the cluster volumes on V1? What process would I follow to do this with multiple drive failures? Is it OK to boot V1 on the network (where it can connect with V2) even though multiple SSD drives have been replaced?
  2. Do I need to remove V1 from the cluster? Is that even possible on a two-node cluster?
  3. Do I need to destroy the cluster and rebuild from the ground up? I really don't want to do this because the working V2 has 25 running VMs and their data.

Basically, I'd like the safest option for bringing V1 back into operation in the cluster without having to take down V2, or at least a way to restore the VM data on the cluster volumes.

If any further information is needed, I'll be glad to try to get it to you.

Any direction or insight on how to proceed would be greatly appreciated!

Thanks so much, Dave

Windows Server Clustering

Accepted answer
Ian Xue (Shanghai Wicresoft Co., Ltd.) 29,891 Reputation points Microsoft Vendor
    2024-03-15T10:04:22.13+00:00

    Hi Dave,

    Hope you're doing well.

    Please refer to the following steps to replace the drives:

    1. Check the status of the storage subsystem and the storage jobs with PowerShell:

       Get-StorageSubSystem Cluster* | Get-StorageJob

    2. Review the physical disk footprint:

       Get-PhysicalDisk | ft DeviceId,FriendlyName,SerialNumber,UniqueId,*Status,*Foot*,Usage,PhysicalLocation

    3. Retire the failed disk (one specific disk):

       Get-PhysicalDisk -SerialNumber XXXXXXXXX | Set-PhysicalDisk -Usage Retired

    4. Check the footprint of that disk (by SerialNumber) and monitor until it drops to zero:

       Get-PhysicalDisk -SerialNumber XXXXXXXXX | ft DeviceId,FriendlyName,SerialNumber,UniqueId,*Status,*Foot*,Usage,PhysicalLocation

    5. Remove the retired disk from the storage pool:

       $FailedDisk = Get-PhysicalDisk -SerialNumber XXXXXXXXX

       $Pool = $FailedDisk | Get-StoragePool -IsPrimordial:$false

       Remove-PhysicalDisk -StoragePool $Pool -PhysicalDisks $FailedDisk

    6. Turn on the LED of the physical disk so it is easier to identify in the server rack:

       Get-PhysicalDisk -SerialNumber XXXXXXXXX | Enable-PhysicalDiskIdentification

    7. Physically remove the failed disk from the server chassis.

    8. Physically install the new disk into the server chassis.

    9. Turn the LED off again, piping in the disk for that slot (for example, by the new disk's serial number):

       Get-PhysicalDisk -SerialNumber YYYYYYYYY | Disable-PhysicalDiskIdentification

    10. Add the new physical disk to the pool:

        $disk = Get-PhysicalDisk | Where-Object CanPool -eq $true

        Get-StoragePool S2D* | Add-PhysicalDisk -PhysicalDisks $disk

    11. List the virtual disks and repair them (the repair job may also start on its own):

        Get-VirtualDisk

        Repair-VirtualDisk -FriendlyName <VirtualDiskName>

    12. Wait until the repair completes:

        Get-StorageJob

    13. Once the repair storage jobs are complete, repeat the steps above for any other capacity disks that need to be replaced (for several failed disks at once, see the sketch below).
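    Since three drives failed at the same time, the retire/remove part of these steps can also be wrapped in a loop. This is only a sketch, not an official procedure: the serial numbers are placeholders for your actual failed drives, it assumes the pool uses the default friendly name starting with "S2D", and S2D will usually claim the new drives automatically, so the manual Add-PhysicalDisk may not be needed.

       # Placeholder serial numbers of the failed drives - replace with your own
       $FailedSerials = 'SERIAL1','SERIAL2','SERIAL3'

       foreach ($Serial in $FailedSerials) {
           # Retire the failed disk so Storage Spaces stops using it
           $FailedDisk = Get-PhysicalDisk -SerialNumber $Serial
           $FailedDisk | Set-PhysicalDisk -Usage Retired

           # Remove it from the (non-primordial) pool it belongs to
           $Pool = $FailedDisk | Get-StoragePool -IsPrimordial:$false
           Remove-PhysicalDisk -StoragePool $Pool -PhysicalDisks $FailedDisk -Confirm:$false
       }

       # After the replacement drives are installed: add anything poolable
       # (skip this if the cluster has already claimed the new drives on its own)
       $NewDisks = Get-PhysicalDisk | Where-Object CanPool -eq $true
       if ($NewDisks) {
           Get-StoragePool S2D* | Add-PhysicalDisk -PhysicalDisks $NewDisks
       }

       # Kick off repairs and watch the rebuild until the jobs finish
       Get-VirtualDisk | Repair-VirtualDisk
       Get-StorageJob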

    Best Regards,

    Ian Xue


    If the Answer is helpful, please click "Accept Answer" and upvote it.


3 additional answers

  1. Net Runner 505 Reputation points
    2024-03-12T15:38:26.6433333+00:00

    Hi,

    1. As far as I remember, you cannot repair the cluster volumes on an isolated V1 host as long as it has no connection to the healthy V2, because on a 2-node Storage Spaces Direct cluster there is no local redundancy within a single node; there is nothing for V1 to repair from on its own (see the quick check after this list). The data has to be resynchronized from V2, so booting V1 and letting it reconnect to V2 is the way to proceed.
    2. No, you don't need to remove V1 from the cluster (removing a node is possible even on a two-node cluster, but it isn't necessary here).
    3. There is no need to destroy the cluster.
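    To see why a single node cannot self-heal, you can inspect the resiliency layout of the volumes. This is a read-only check, and the exact values depend on how the volumes were created, but a two-node S2D two-way mirror typically shows two data copies spread across storage scale units, i.e. one copy per server and no second copy inside the same node:

       # Run on either node: shows how many data copies exist and across which fault domain they are spread
       Get-VirtualDisk |
           Format-Table FriendlyName, ResiliencySettingName, NumberOfDataCopies, FaultDomainAwareness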

    As you have already mentioned, the safest option is to bring back V1. Before letting it connect with V2, make sure you have removed the failed disks from the S2D pool (that does not happen automatically) and that the new disks are added to the S2D pool as well.
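    Before reconnecting V1, a few read-only checks from the healthy node (V2) can confirm that the failed disks are really out of the pool and that the replacement disks are visible. This is a sketch that assumes the default pool friendly name starting with "S2D"; none of these commands change anything:

       # Pool and volume health
       Get-StoragePool S2D* | Format-Table FriendlyName, HealthStatus, OperationalStatus
       Get-VirtualDisk | Format-Table FriendlyName, HealthStatus, OperationalStatus

       # Disks still marked Retired or Lost Communication have not been removed from the pool yet
       Get-PhysicalDisk | Format-Table FriendlyName, SerialNumber, HealthStatus, OperationalStatus, Usage

       # Replacement disks that are not yet in the pool show CanPool = True
       Get-PhysicalDisk | Where-Object CanPool -eq $true

       # Resync/rebuild jobs once V1 is back online
       Get-StorageJob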

    Make sure you have backups of your virtual machines currently located on V2 since Storage Spaces Direct 2-node clusters are known to sometimes replicate problems to the healthy node. You have to be quick since, as I mentioned above, your V2 is running without local redundancy. If even a single disk fails, you will lose all the data and will have to recreate the whole cluster and pool from scratch and restore virtual machines from backups.

    Local reconstruction (redundancy within each node) is one of the reasons I prefer using Virtual SAN https://www.starwindsoftware.com/vsan software instead of Storage Spaces Direct for smaller 2/3-node clusters.

    I wish you the best of luck with reviving your cluster!


  2. Alex Bykovskyi 1,831 Reputation points
    2024-03-12T19:41:44.34+00:00

    Thanks for mentioning StarWind.

    Hey,

    StarWind VSAN works great in 2- or 3-node configurations. It can be deployed on top of either hardware or software RAID, depending on your needs, while VSAN replicates the data across the nodes. Check this for more information: https://www.starwindsoftware.com/storage-spaces-direct

    Cheers,

    Alex Bykovskyi

    StarWind Software

    Note: Posts are provided “AS IS” without warranty of any kind, either expressed or implied, including but not limited to the implied warranties of merchantability and/or fitness for a particular purpose.


  3. Dave Baddorf 20 Reputation points
    2024-03-29T02:23:04.43+00:00

    I was able to get the cluster operational again even with three failed disks. First, on the working server, I ran "Get-PhysicalDisk -SerialNumber <SN> | Set-PhysicalDisk -Usage Retired" for each of the three failed drives (Get-PhysicalDisk didn't return anything on the bad server while it was disconnected from the good server). Then I brought up the failed server and let it connect to the working cluster node. I believe it automatically added the good drives and removed the failed drives (which had already been physically pulled); at least, I got errors when I tried to remove the failed drives from the storage pool manually (Remove-PhysicalDisk -StoragePool $pool -PhysicalDisks $FailedDisk) and add the replacement drives (Get-StoragePool S2D | Add-PhysicalDisk -PhysicalDisks $disk), which suggests the pool had already handled them. In hindsight, I should have just watched "Get-StorageJob" after bringing both systems back together before running any manual commands. Thanks for everyone's help!
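    For anyone hitting the same situation later, a minimal way to watch the resync after the repaired node rejoins is simply to poll the storage jobs. This is a sketch using only the in-box Storage cmdlets and assuming the default "S2D*" pool name:

       # Run from either node once both are back in the cluster.
       # Poll the rebuild/resync jobs every 30 seconds until none are running.
       do {
           $jobs = Get-StorageJob
           $jobs | Format-Table Name, JobState, PercentComplete, BytesProcessed, BytesTotal
           Start-Sleep -Seconds 30
       } while ($jobs | Where-Object JobState -eq 'Running')

       # Confirm everything is healthy afterwards
       Get-VirtualDisk | Format-Table FriendlyName, HealthStatus, OperationalStatus
       Get-StoragePool S2D* | Format-Table FriendlyName, HealthStatus, OperationalStatus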
