First of all, we have similar problems on 2 clusters with the same hardware and the same OS (Windows Server 2016) in 2 different locations.
The setup:
The JBOD is equipped with a mix of SSD and HDD drives. A single storage pool contains all JBOD disks. Several storage spaces (virtual disks) have been created and made highly available as Cluster Shared Volumes. The cluster holds a SOFS role, and its shares are presented to the Hyper-V hosts. The shares store the VM configurations and the VMs' virtual disks. All VM virtual disks are fixed size; no dynamically expanding disks are in use.
The entire setup ran fine while the cluster nodes were still on Windows Server 2012 R2. We re-installed (not upgraded) each node with Windows Server 2016 using the rolling cluster upgrade procedure. As part of the procedure, all involved HPE hardware components also had their firmware upgraded using the HPE-supplied Smart Update Manager for Windows Server 2016. All Windows updates up to January 2021 are installed.
The cluster validation test did not report a single warning.
After the upgrade we ran into the following problem:
When we created a VM virtual disk with a size of 100 GB, the cluster node holding the CSV we were writing to crashed with a BSOD. We then found that we can reproduce the issue without triggering the BSOD by creating a VM virtual disk of only 1 GB. When doing so, we receive several of the following event log entries:
Event ID 153 Disk
The IO operation at logical block address 6c1a8 for Disk 3 was retried.
About 30 events of this kind are logged per second, and the disk they refer to is different each time. The affected disks are not all in the same JBOD, so we do not believe a single JBOD is at fault. Next, we checked the health counters of each physical disk in the JBODs; they are all good and no errors are logged. Furthermore, the virtual disk (storage space) is put into a degraded state and a storage job is kicked off. Again, when a .vhdx file is created with a size > 5 GB, the SOFS cluster node crashes with a BSOD.
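With this many Event ID 153 entries it can help to tally them per disk instead of reading them one by one. Below is a minimal Python sketch, assuming the event messages have already been exported as plain text lines. The function names are made up for illustration, the 512-byte logical sector size is an assumption that has to be checked against the actual disks, and only the first sample line is the real message quoted above; the other two are invented to show the counting.

```python
import re
from collections import Counter

# Matches the Event ID 153 message text quoted above.
# An optional "0x" prefix in front of the address is tolerated.
EVENT_153 = re.compile(
    r"The IO operation at logical block address (?:0x)?(?P<lba>[0-9a-fA-F]+) "
    r"for Disk (?P<disk>\d+) was retried\."
)

SECTOR_BYTES = 512  # assumption: 512-byte logical sectors


def parse_event_153(message):
    """Return (disk_number, lba, byte_offset), or None if the text does not match."""
    match = EVENT_153.search(message)
    if match is None:
        return None
    lba = int(match.group("lba"), 16)  # the address is logged in hexadecimal
    return int(match.group("disk")), lba, lba * SECTOR_BYTES


def retries_per_disk(messages):
    """Count retried IOs per disk number across many event messages."""
    counts = Counter()
    for message in messages:
        parsed = parse_event_153(message)
        if parsed is not None:
            counts[parsed[0]] += 1
    return counts


sample = [
    "The IO operation at logical block address 6c1a8 for Disk 3 was retried.",
    # The next two lines are invented for illustration only.
    "The IO operation at logical block address 1f00 for Disk 7 was retried.",
    "The IO operation at logical block address 6c1b0 for Disk 3 was retried.",
]

print(parse_event_153(sample[0]))
print(retries_per_disk(sample))
```

If the retries spread evenly over all disks, as described above, that points away from a single bad drive and toward something the disks have in common.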
We have verified Storage Spaces compatibility on windowsservercatalog.com. The following components are certified for use with Storage Spaces on Windows Server 2012 R2, 2016 and 2019:
We also operate a similar setup, just with JBODs and HBAs from different vendors:
This environment was also upgraded from Windows Server 2012 R2 to Windows Server 2016, but it operates just as expected, without any problems.
Suspected problem:
There seems to be a problem in the interaction between MPIO in Windows Server 2016 (Microsoft), the drivers for the H241 (Microsemi) and/or the HBA's firmware.
HPE support asked us to contact the software vendor for Storage Spaces, which is Microsoft, and pointed out that HPE does not list Windows Server 2016 as a supported OS for the H241. They did indeed present a document that does not list Windows Server 2016. Why, then, would the H241 be certified for use with Windows Server 2016 and 2019 on windowsservercatalog.com?
The picture outlines the redundant cabling...
Hi,
From your description, it seems to be related to the HPE H241 HBAs.
Here is the doc about event 153:
Interpreting Event 153 Errors
Since no other error is recorded and the disks are healthy, it is hard for forum members to find the root cause. For more in-depth help, I suggest opening a support case via the following link so that a dedicated Support Professional can assist you more efficiently.
https://support.microsoft.com/en-us/gp/customer-service-phone-numbers
Thanks for your time!
Best Regards,
Mico Mi
Hi Mico Mi,
you are absolutely right and I agree with you.
We are running two support cases, one with Microsoft and another one with HPE.
The fact that the disk health counters show no errors, yet we lose the storage spaces under IO load, indicated to us that the problem occurs somewhere on the path down to the disks. That removing the redundant path stabilizes the environment pointed to a problem somewhere in the area of HBA and MPIO. Seeing a similar setup with HBAs and JBODs from different vendors working as expected supports this idea.
We would have expected both companies to put their heads together for testing, or at least to offer a way out of this terrible situation, in the worst case even a recommendation for different HBAs for this combination, but that doesn't seem to be the case.
In the meantime we have tried all firmware versions for the HBA as well as every driver we could get. We even tried a different type of HBA, an LSI SAS 9300-8e, but that one would not connect to the JBOD. We have tried different MPIO policies without success.
At some point we had an MVP specializing in Storage Spaces check our environment and perform some tests, but that also ended with the conclusion that support from either MS or HPE is needed.
The Microsoft support case has already been running for more than 2 weeks without any progress at all. The ticket is being pushed back and forth between subcontracted companies of MS Customer Support.
HPE only pointed out that our cabling scheme is not supported for the D3700 JBODs (they refer to daisy chaining), and furthermore that the H241 HBA does not list Windows Server 2016 as a supported OS. Yet this adapter is certified for use with Storage Spaces on Windows Server 2016 per windowsservercatalog.com.
As a customer with a Storage Spaces certified setup, one feels quite alone in this situation.
We would really appreciate it if anyone with similar issues or good ideas could share troubleshooting steps we haven't tried yet while we continue to wait for support.
kind regards and many thanks,
Christian
Hi there, sorry for the late update, but the situation keeps us quite busy…
As mentioned before, we are running support cases with HPE and Microsoft on this one, the solution is yet to come…
HPE supported us by sending an alternative to the H241 HBA, the HPE Smart Array P441. Unfortunately, this device uses the same driver in the same version, as well as the same hardware. The hardware exchange was not successful: we still lose the virtual disks as soon as there is a bit of disk IO, for instance while creating a .vhdx file on the Cluster Shared Volume. To rule out interference from the Windows MPIO feature, we uninstalled it. There is no need for it right now since we are not using redundant cabling. Uninstalling the MPIO component did not fix our issue either.
HPE still keeps objecting that the cabling scheme is not a supported scenario. Luckily, with a bit of organizing, we were able to set up the following scenario: