I am running Windows 10 Pro on an image processing workstation with a fairly beefy system:
AMD 3970X (32 core)
128 GB RAM
NVMe system disk
6 x 2TB SSD drives (mix of Samsung 860 EVO and 850 EVO)
Gigabyte TRX40 Designare
The 6 SSDs are in a storage space with a two-way mirror providing 5.4TB of space. The system is fully updated, and there is no 3rd party antivirus, anti-malware or other software that might intercept file IO (Defender is on, however).
Generally everything works great, EXCEPT when I leave a high disk IO program running for a few hours (it does a LOT of IO as well as CPU processing): periodically the storage space drive will go offline.
While I get a pile of event log messages at that point, they all seem to be after-effects: failed writes to the logical volume. When I look at the storage space status (before recovery), it shows all 6 drives as "OK", and if I bring the drive back online manually all works fine. The closest thing I can see to a relevant error is in StorageSpaces-Driver, where I get an event 312 that simply says:
Virtual disk {1d789716-4224-40f7-9452-b2b3a0bd4634} has failed a write operation to all its copies.
Unfortunately, because they are in a storage space, I cannot run Magician or a similar program to see the current health status of the individual drives, so I am depending on the storage spaces "OK" to say that they are, well, OK.
I feel like this is some sort of resource exhaustion issue with very high rate IO from many, many threads (this is an astronomical image stacking program called PixInsight). I guess it could be a hardware failure, but there is no disk corruption afterwards, and in the past (though not today yet) I have run a complete scan and the whole logical volume is readable. But there is no indication in the event log of resource exhaustion.
I would appreciate any advice as to how to debug this issue. This is not easy to reproduce; it happens maybe 10% of the time when running for many hours like this. It never happens in lighter use. And it always works fine if brought back online (though I tend to reboot afterwards just in case).
Thanks in advance, Linwood.
PS. If it helps, here is the definition of the storage pool:
ObjectId : {1}\\LEF\root/Microsoft/Windows/Storage/Providers_v2\SPACES_StoragePool.ObjectId="{0e1c5b08-7d79-11eb-ba71-806e6f6e6963}:SP:{76aca0ee-237f-4828-8d34-4353537397b6}"
PassThroughClass :
PassThroughIds :
PassThroughNamespace :
PassThroughServer :
UniqueId : {76aca0ee-237f-4828-8d34-4353537397b6}
AllocatedSize : 11880416411648
ClearOnDeallocate : False
EnclosureAwareDefault : False
FaultDomainAwarenessDefault : PhysicalDisk
FriendlyName : Pool
HealthStatus : Healthy
IsClustered : False
IsPowerProtected : False
IsPrimordial : False
IsReadOnly : False
LogicalSectorSize : 4096
MediaTypeDefault : Unspecified
Name :
OperationalStatus : OK
OtherOperationalStatusDescription :
OtherUsageDescription :
PhysicalSectorSize : 4096
ProvisioningTypeDefault : Fixed
ReadOnlyReason : None
RepairPolicy : Parallel
ResiliencySettingNameDefault : Mirror
RetireMissingPhysicalDisks : Auto
Size : 11997756260352
SupportedProvisioningTypes : {Thin, Fixed}
SupportsDeduplication : False
ThinProvisioningAlertThresholds : {70}
Usage : Other
Version : Windows Server vNext
WriteCacheSizeDefault : Auto
WriteCacheSizeMax : 18446744073709551614
WriteCacheSizeMin : 0
PSComputerName :
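
As a sanity check that the pool numbers above match the 5.4TB I quoted, here is the arithmetic as a small Python sketch. It assumes that AllocatedSize counts both mirror copies and that the "TB" Windows shows is really TiB (2^40 bytes); both are my assumptions, not something the output states. The variable names are just mine for illustration.

```python
# Values copied from the storage pool definition above (in bytes).
ALLOCATED_SIZE = 11880416411648   # AllocatedSize
POOL_SIZE = 11997756260352        # Size
MIRROR_COPIES = 2                 # assumption: two-way mirror stores each block twice

TIB = 2 ** 40  # assumption: Windows "TB" is really TiB


def to_tib(n_bytes):
    """Convert a byte count to TiB."""
    return n_bytes / TIB


# Usable capacity is the allocated space divided by the number of copies.
usable_bytes = ALLOCATED_SIZE // MIRROR_COPIES
# Space in the pool that is not allocated to the virtual disk.
unallocated_bytes = POOL_SIZE - ALLOCATED_SIZE

print(f"raw pool size:      {to_tib(POOL_SIZE):.2f} TiB")
print(f"allocated (raw):    {to_tib(ALLOCATED_SIZE):.2f} TiB")
print(f"usable (mirrored):  {to_tib(usable_bytes):.2f} TiB")
print(f"unallocated (raw):  {unallocated_bytes / 2 ** 30:.1f} GiB")
```

Under those assumptions the usable figure rounds to 5.4, so the pool is almost fully allocated to the one fixed-provisioned mirror space.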