Storage Spaces just goes offline under high IO

Linwood.F 116 Reputation points
2023-03-03T16:09:42.89+00:00

I have Windows 10 Pro in use on an image processing workstation, with a fairly beefy system:

AMD 3970X (32 core)

128 GB RAM

NVMe system disk

6 x 2TB SSD drives (mix of Samsung 860 EVO and 850 EVO)

Gigabyte TRX40 Designare

The 6 SSDs are in a storage space with a two-way mirror providing 5.4TB of space. The system is fully updated, and there is no third-party antivirus, anti-malware or other software that might intercept file IO (Defender is on, however).

Generally everything works great, EXCEPT when I leave a high disk IO program running for a few hours (it does a LOT of IO as well as CPU processing): periodically the storage space drive will go offline.

While I get a pile of event log messages at that point, they all seem to be after-effects: failed writes to the logical volume. When I look at the storage space status (before recovery), it shows all 6 drives as "OK", and if I bring the drive back online manually, all works fine. The closest I can see to a relevant error is in StorageSpaces-Driver, where I get an event 312 that simply says:

Virtual disk {1d789716-4224-40f7-9452-b2b3a0bd4634} has failed a write operation to all its copies.

Unfortunately, because they are in a storage space, I cannot run Magician or a similar program to see the current health status of the individual drives, so I am depending on the Storage Spaces "OK" to say that they are, well, OK.

I feel like this is some sort of resource exhaustion issue caused by a very high rate of IO from many, many threads (this is an astronomical image stacking program called PixInsight). I guess it could be a hardware failure, but there is no disk corruption afterwards, and in the past (though not yet today) I have run a complete scan and the whole logical volume was readable. But there is no indication in the event log of resource exhaustion.

I would appreciate any advice as to how to debug this issue. This is not easy to reproduce; it happens maybe 10% of the time when running for many hours like this. It never happens in lighter use. And it always works fine if brought back online (though I tend to reboot afterwards just in case).

Thanks in advance, Linwood.

PS. If it helps, here is the definition of the storage pool:

ObjectId                          : {1}\\LEF\root/Microsoft/Windows/Storage/Providers_v2\SPACES_StoragePool.ObjectId="{0e1c5b08-7d79-11eb-ba71-806e6f6e6963}:SP:{76aca0ee-237f-4828-8d34-4353537397b6}"
PassThroughClass                  :
PassThroughIds                    :
PassThroughNamespace              :
PassThroughServer                 :
UniqueId                          : {76aca0ee-237f-4828-8d34-4353537397b6}
AllocatedSize                     : 11880416411648
ClearOnDeallocate                 : False
EnclosureAwareDefault             : False
FaultDomainAwarenessDefault       : PhysicalDisk
FriendlyName                      : Pool
HealthStatus                      : Healthy
IsClustered                       : False
IsPowerProtected                  : False
IsPrimordial                      : False
IsReadOnly                        : False
LogicalSectorSize                 : 4096
MediaTypeDefault                  : Unspecified
Name                              :
OperationalStatus                 : OK
OtherOperationalStatusDescription :
OtherUsageDescription             :
PhysicalSectorSize                : 4096
ProvisioningTypeDefault           : Fixed
ReadOnlyReason                    : None
RepairPolicy                      : Parallel
ResiliencySettingNameDefault      : Mirror
RetireMissingPhysicalDisks        : Auto
Size                              : 11997756260352
SupportedProvisioningTypes        : {Thin, Fixed}
SupportsDeduplication             : False
ThinProvisioningAlertThresholds   : {70}
Usage                             : Other
Version                           : Windows Server vNext
WriteCacheSizeDefault             : Auto
WriteCacheSizeMax                 : 18446744073709551614
WriteCacheSizeMin                 : 0
PSComputerName                    :
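
As a rough sanity check on the pool definition above, the following minimal Python sketch converts the reported `Size` and `AllocatedSize` values to binary terabytes and halves the allocated figure to approximate the usable capacity of a two-way mirror. The helper names (`to_tib`, `usable_tib`) are made up for illustration, and the halving assumes that every byte is stored exactly twice and ignores any metadata overhead.

```python
# Byte counts copied from the pool definition above.
POOL_SIZE = 11997756260352       # "Size"
ALLOCATED_SIZE = 11880416411648  # "AllocatedSize"

TIB = 2**40        # bytes per binary terabyte (TiB)
MIRROR_COPIES = 2  # assumption: two-way mirror, as described in the question


def to_tib(num_bytes: int) -> float:
    """Convert a byte count to binary terabytes (TiB)."""
    return num_bytes / TIB


def usable_tib(allocated_bytes: int, copies: int = MIRROR_COPIES) -> float:
    """Approximate usable capacity when each byte is stored `copies` times."""
    return to_tib(allocated_bytes) / copies


if __name__ == "__main__":
    print(f"Pool size:       {to_tib(POOL_SIZE):.2f} TiB")
    print(f"Allocated:       {to_tib(ALLOCATED_SIZE):.2f} TiB")
    print(f"Usable (mirror): {usable_tib(ALLOCATED_SIZE):.2f} TiB")
```

Under those assumptions the usable figure comes out at roughly 5.4 TiB, which agrees with the 5.4TB reported for the storage space, so the pool appears to be fully allocated to the mirrored virtual disk.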

Windows 10

1 answer

  1. Limitless Technology 44,121 Reputation points
    2023-03-07T08:38:32.1766667+00:00

    Hello there,

    Ensure you have installed the latest monthly update, and then run the Cluster Validation tool. You can run it from Failover Cluster Manager or PowerShell with the Test-Cluster cmdlet. Usually, the root cause of this issue is that you are using non-SES compliant hardware.

    The threads below discuss similar issues; you can try the troubleshooting steps from them and see whether they help you resolve the issue. https://social.technet.microsoft.com/Forums/ie/en-US/4fc1fb86-61fa-4976-8b3f-9e314586fef8/storage-spaces-direct-cluster-virtual-disk-goes-offline-when-rebooting-a-node?forum=winserverClustering

    https://social.technet.microsoft.com/Forums/en-US/7d97edef-c4d9-47a7-ae55-488a8a394483/storage-space-offline-due-to-critical-write-failures-add-drives?forum=win10itprogeneral

    Hope this resolves your query!

    --If the reply is helpful, please Upvote and Accept it as an answer--