First of all, we have similar problems on 2 clusters with the same hardware and the same OS (Windows Server 2016) in 2 different locations.
- 2x HPE DL 380 Gen9
- 3x HBA H241 per host (HPE FW: 7.00, driver 126.96.36.199)
- 3x HPE D3700 JBOD
MPIO is configured for SAS support and uses a FOO policy. Each server has a redundant path to each JBOD with a MiniSAS HD cable.
The JBODs are equipped with a mix of SSD and HDD drives. A single Storage Pool contains all JBOD disks. Several Storage Spaces (vDisks) have been created and made highly available as Cluster Shared Volumes. The cluster holds a SOFS role, and its shares are presented to the Hyper-V hosts. The shares store VM configs and VM virtual disks. All VM virtual disks are fixed size; no dynamically expanding disks are in use.
The entire setup was running fine while the OS of the cluster nodes was still Windows Server 2012 R2. We re-installed (not upgraded) each node with Windows Server 2016 using the rolling cluster upgrade procedure. As part of the procedure, all involved HPE hardware components also had their firmware upgraded with the HPE-delivered Smart Update Manager for Windows Server 2016. All Windows Updates up to Jan 2021 are installed.
The cluster validation test reported no warnings at all.
After the upgrade we ran into the following problem:
When we created a VM virtual disk with a size of 100 GB, the cluster node holding the CSV we were writing to crashed with a BSOD. We then figured out that we can reproduce the issue without triggering the BSOD by creating a VM virtual disk with a size of only 1 GB. When doing so, we receive several of the following event log entries:
Event ID 153 Disk
The IO operation at logical block address 6c1a8 for Disk 3 was retried.
Please help me get rid of this warning; when the warning occurs, my mailbox gets disconnected.
About 30 events of this kind are logged per second, and the disk they refer to is always different. The affected disks are not all in the same JBOD, so we do not suspect a single faulty JBOD. Next we checked the health counters of each physical disk in the JBODs: they are all good and no errors are logged. Furthermore, the virtual disk (Storage Spaces) is put into a degraded state and a storage job is kicked off. Again, when a .vhdx file is created with a size > 5 GB, the SOFS cluster node crashes with a BSOD.
We have verified Storage Spaces compatibility on windowsservercatalog.com. The following components are certified for use with Storage Spaces on Windows Server 2012 R2, 2016 and 2019:
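To back up the claim that the retries are spread across disks and JBODs, the Event ID 153 messages could be tallied per disk. The following is only a minimal Python sketch, assuming the event messages have been exported as plain text lines in the format quoted above. The function names and the disk-to-JBOD mapping are hypothetical and would have to be filled in from the actual enclosure layout.

```python
import re
from collections import Counter

# Matches the Event ID 153 message text quoted above.
# The logical block address is printed in hexadecimal.
EVENT_153 = re.compile(
    r"The IO operation at logical block address (?P<lba>[0-9a-fA-F]+) "
    r"for Disk (?P<disk>\d+) was retried\."
)


def parse_event_153(message):
    """Return (disk_number, lba) for one message, or None if it does not match."""
    m = EVENT_153.search(message)
    if m is None:
        return None
    return int(m.group("disk")), int(m.group("lba"), 16)


def retries_per_disk(messages):
    """Count retried I/O events per disk number."""
    counts = Counter()
    for msg in messages:
        parsed = parse_event_153(msg)
        if parsed is not None:
            counts[parsed[0]] += 1
    return counts


def retries_per_enclosure(counts, disk_to_enclosure):
    """Roll per-disk counts up to enclosures.

    disk_to_enclosure is a caller-supplied (hypothetical) mapping of
    disk number -> JBOD name; unmapped disks are counted as "unknown".
    """
    per_enclosure = Counter()
    for disk, n in counts.items():
        per_enclosure[disk_to_enclosure.get(disk, "unknown")] += n
    return per_enclosure
```

If the per-enclosure counts come out roughly even, that would support the assumption that no single JBOD is at fault. A strong skew toward one enclosure would point the other way.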
- DL 380 Gen 9
Finally, we figured out that we cannot reproduce the issue after removing the redundant SAS connection to the JBODs. In other words, without path redundancy to the storage there are no problems.
We are also operating a similar setup, just with JBODs and HBAs from different vendors:
- 2x HPE DL 380 Gen9
- 3x LSI SAS 9300-8e
- 3x DataOn DNS 2608 JBOD
This environment was also upgraded from Windows Server 2012 R2 to Windows Server 2016, but it operates as expected, without any problems.
There seems to be a problem in the interaction between MPIO in Windows Server 2016 (Microsoft), the drivers for the H241 (Microsemi) and/or the HBA firmware.
HPE support asked us to contact the software vendor for Storage Spaces, which is Microsoft, and pointed out that HPE does not list Windows Server 2016 as supported for the H241. They did indeed present a document that does not list Windows Server 2016. Why, then, would the H241 be certified for use with Windows Server 2016 and 2019 on windowsservercatalog.com?