Storage Spaces Direct single point of failure in 1 SSD

Richard Willkomm 1 Reputation point
2020-10-20T08:20:03.42+00:00

Last weekend we had a serious issue with one of our Windows Server 2019 Hyper-V HCI clusters.
It turned out a single SSD cache disk caused the entire cluster to basically grind to a halt. S2D is built with fault tolerance in mind: disks can fail, even multiple; nodes can fail; networks can fail. It should be able to take hits and keep running. Well, it turns out one hit is enough if it lands in the right place.

Here's what we have

  • 6-node Dell Windows Server 2019 HCI cluster running Hyper-V and S2D
  • Each node with a mix of SSDs (cache) and HDDs (capacity): 5 SSD + 15 HDD
  • 2x 10 Gbit dedicated RDMA network for S2D and 2x 10 Gbit dedicated for VMs and management

What happened?
During planned maintenance and the installation of Windows updates, one of the nodes failed after its reboot. Or rather, one of the SSDs in this node failed. This SSD showed huge latency (1000 ms) in Windows Admin Center from the moment the node was rebooted; we only discovered this in WAC after a while. This overloaded the entire storage layer of that node, and subsequently it impacted the entire S2D pool and cluster. The pool was online, including the virtual disks, but they also showed latency in the 500-750 ms range, where they are usually below 1 ms. The reboot of each node always triggers S2D repair jobs (expected behavior), but these had trouble finishing, again because of the huge latency.
Network issues were ruled out. We had perfect connectivity and the 10G ports were at 20% usage at most, which is very low compared to normal.
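
For reference, the same symptoms can be checked from PowerShell on a cluster node, roughly like this (a sketch using the standard Storage cmdlets; output will obviously differ per environment):

# Rough sketch: repair-job progress, virtual-disk health, and per-drive
# worst-case latency counters, using the standard Storage module cmdlets.

# Repair/resync jobs and how far along they are
Get-StorageJob | Format-Table Name, JobState, PercentComplete, BytesProcessed, BytesTotal

# Virtual disk health at a glance
Get-VirtualDisk | Format-Table FriendlyName, HealthStatus, OperationalStatus

# Worst-case latency per physical disk, highest first, to spot the bad SSD
Get-PhysicalDisk | Get-StorageReliabilityCounter |
    Sort-Object ReadLatencyMax -Descending |
    Format-Table DeviceId, ReadLatencyMax, WriteLatencyMax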

We first tried to retire the failed SSD (via a PowerShell command), but this did not help. In the end we had to physically pull the SSD from the server to resolve the issue. The high latency was gone immediately, VMs came back online, and the S2D jobs finished quickly.

How can a single SSD cause an entire cluster to fail?
An issue with a single disk I get, and I can see that same issue impacting an entire node too. But why does it impact the entire cluster? Is the only option here to retire the disk (which we did), or retire (power off) the node (which can be done remotely)?

Anyone run into similar issues?
Thanks in advance

Tags: Windows Server, Hyper-V, Windows Server Storage

8 answers

  1. Steven Ekren 166 Reputation points
    2020-11-10T23:54:31.233+00:00

    @Richard Willkomm

    Good questions. We are continually analyzing how to treat disks that aren't failed but aren't working as expected. As capacity disks they can have an effect on the system, but as cache disks the effect is greater. We did create the outlier detection to help make these "marginal disks" easier to detect.
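
    As a rough illustration (just the standard cmdlets, not an official tool), the current health faults and any drives reporting an unhealthy state can be pulled like this; whether a given marginal drive actually surfaces here depends on that detection:

    # Rough illustration: what the health service is currently flagging.
    # Run from any node of the cluster.

    # Current health faults for the clustered storage subsystem
    Get-StorageSubSystem Cluster* | Debug-StorageSubSystem

    # Any physical disks not reporting as healthy
    Get-PhysicalDisk | Where-Object HealthStatus -ne 'Healthy' |
        Format-Table FriendlyName, SerialNumber, Usage, HealthStatus, OperationalStatus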

    The next thing is what can the system do about them? At first blush the answer is, "Get rid of them!" However, this is fraught with peril. If more than one disk is having a problem and you get rid of them, you can lose data, because they may hold the only copies of specific pieces of data... and that's not good.

    What if you get rid of one, and right after that another has a problem? Do you get rid of that one too? The same data issue as just described applies, so that is not necessarily a good thing. Sometimes a slow disk with good data is better than just cutting disks out of the system. We are working on smart ways to take all of this into account in our storage health automation, but we are being cautious to ensure we are not putting any data at more risk.

    With regard to retiring, part of that process is to take the data on that disk and move it somewhere else (if there is spare capacity to do so). If the disk is slow, that takes time.
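
    For context, the retire-and-evacuate flow looks roughly like this (a sketch; the serial number is a placeholder, and this is essentially the command you already ran plus the repair and monitoring steps):

    # Sketch: retire a drive, rebuild its data elsewhere, and watch progress.
    $serial = '<failed-serial>'   # placeholder for the drive's serial number

    # Mark the drive as retired so new allocations avoid it
    Get-PhysicalDisk -SerialNumber $serial | Set-PhysicalDisk -Usage Retired

    # Rebuild the virtual disks onto the remaining drives
    Get-VirtualDisk | Repair-VirtualDisk -AsJob

    # Watch the evacuation/repair progress
    Get-StorageJob | Format-Table Name, JobState, PercentComplete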

    Removing it from the pool is another option, but the success is somewhat dependent on the state of the disk itself. Since the disk is not acting as expected, anything you do with it through software is to some degree unpredictable.

    The last resort is physically removing it, which is what you did.

    One last thing, which may not be directly relevant to your situation but is relevant to the marginal-disk discussion: we have seen disks that are only marginally responsive. The customer removes them and sends them back to the seller, which tests them, and they seem fine. There have been devices out there that only become marginal under certain I/O patterns or at certain levels of I/O stress (and S2D can throw I/O at a device faster than most anything else out there). I'm not aware of any device that keeps performance data, or errors based on performance, on the disk itself, so the disk manufacturers have no evidence that the disk was behaving poorly.

    Sorry for your frustration and thank you for the description and feedback. We continue to strive to improve.

    I hope this helps,
    Steven Ekren
    Senior Program Manager
    Windows Server and Azure Stack HCI
    Microsoft

    2 people found this answer helpful.

  2. TimCerling(ret) 1,156 Reputation points
    2020-10-20T14:12:05.117+00:00

    Sounds like an issue best handled by opening a support case. It appears you have resolved the problem, but opening a case may uncover something unique in your environment (if you still have the logs from that time period) or may help others in the future if a root cause can be determined.


  3. Xiaowei He 9,876 Reputation points
    2020-10-26T02:00:52.117+00:00

    Hi,

    Agreed. Due to the complexity, it's recommended to open a case with MS to troubleshoot the S2D issue:

    https://support.microsoft.com/en-us/gp/customer-service-phone-numbers

    Thanks for your time!
    Best Regards,
    Anne



  4. Darryl van der Peijl 56 Reputation points MVP
    2020-10-27T08:02:56.193+00:00

    Hi Richard,

    Because data is spread across all nodes, the cluster can slow down if one node, or a component in a node, has issues.
    See this blog for how this works: https://techcommunity.microsoft.com/t5/storage-at-microsoft/deep-dive-the-storage-pool-in-storage-spaces-direct/ba-p/425959

    If a cache device fails, the capacity disks bound to it will move to another cache device and all is well.
    If a cache device has higher latency than the other cache devices, the "Drive latency outlier detection" will detect this, and it will show in Windows Admin Center or through PowerShell.
    If a cache device suddenly starts showing higher latency, the outlier detection may not have noticed it yet, and the device will cause the issues you have seen.
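
    As a quick sanity check, you can list which drives are acting as cache versus capacity and their current health; S2D's auto-configured cache drives report Usage = Journal (just a sketch):

    # Sketch: cache (Journal) vs. capacity drives in the pool and their health.
    Get-PhysicalDisk |
        Sort-Object Usage, FriendlyName |
        Format-Table FriendlyName, SerialNumber, MediaType, Usage, HealthStatus, OperationalStatus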

    Hope this helps.


  5. Richard Willkomm 1 Reputation point
    2020-11-10T14:36:10.593+00:00

    Hi Darryl and others,

    Thank you for your answer. I was about to contact you directly when I noticed your reply. I've been busy with other cluster issues the
    last two weeks. Life for an S2D admin can be a hassle sometimes ;)

    Reading your answer, things become clearer. But two things do pop up.

    The "outlier detection" is detection only? It reports in PowerShell and WAC only? It's not smart in the way that, upon detection,
    it will automatically remove the bad drives from the pool? You would need to build something like that yourself (see the rough
    sketch below)?
    (Perhaps this is explained in your link, which I still need to read through.)
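
    What I have in mind would be alert-only, roughly like the following. The threshold is made up and assumes the latency counters
    report milliseconds, so treat it as a sketch rather than something production-ready:

    # Sketch of a do-it-yourself watchdog: alert only, no automatic removal.
    # Threshold and interval are illustrative; the counters are worst-case values.
    $thresholdMs = 100
    while ($true) {
        $suspects = Get-PhysicalDisk | Get-StorageReliabilityCounter |
            Where-Object { $_.ReadLatencyMax -gt $thresholdMs -or $_.WriteLatencyMax -gt $thresholdMs }
        foreach ($s in $suspects) {
            Write-Warning "DeviceId $($s.DeviceId): read max $($s.ReadLatencyMax) ms, write max $($s.WriteLatencyMax) ms"
        }
        Start-Sleep -Seconds 300
    }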

    The other thing is how to get the failed drive to retire. We tried that, but it had no effect. Pulling it physically was the only way to do
    it, apart from shutting down the node holding the failed disk, which wouldn't have been quicker and would have been more impactful.

    I retired it using this:

    Get-PhysicalDisk -SerialNumber <failed-serial> | Set-PhysicalDisk -Usage Retired

    Removing it from the pool would probably need this, but I did not try that.

    $Badboy = Get-PhysicalDisk -SerialNumber <failed-serial>
    Get-StoragePool s2 | Remove-PhysicalDisk -PhysicalDisks $Badboy

    Whether this would have worked remains the question. Or is there another command to get rid of the disk?

    Greetz
    Richard
