Windows Server 2016 Storage Spaces MPIO issue with HPE H241 HBAs?

Question

Pauly, Christian 11

First of all, we have similar problems on 2 clusters with the same hardware and the same OS (Windows Server 2016) in 2 different locations.

The setup:

2x HPE DL 380 Gen9
3x HBA H241 per host (HPE FW: 7.00, driver 106.26.0.64)
3x HPE D3700 JBOD

MPIO is configured for SAS support and uses the Fail Over Only (FOO) policy. Each server has redundant paths to each JBOD via MiniSAS HD cables.

The JBODs are equipped with a mix of SSD and HDD drives. A single storage pool contains all JBOD disks; several Storage Spaces (vDisks) have been created and made highly available as Cluster Shared Volumes. The cluster holds a SOFS role, and its shares are presented to the Hyper-V hosts. The shares store VM configurations and VM virtual disks. All VM virtual disks are fixed size; no dynamically expanding disks are in use.

The entire setup was running fine while the OS of the cluster nodes was still Windows Server 2012 R2. We reinstalled (not upgraded) each node with Windows Server 2016 using the rolling cluster upgrade procedure. As part of the procedure, all involved HPE hardware components also had their firmware upgraded using the HPE Smart Update Manager for Windows Server 2016. All Windows updates up to Jan 2021 are installed.

The cluster validation test did not report a single warning.

After the upgrade we ran into the following problem:

When creating a VM virtual disk with a size of 100 GB, the cluster node holding the CSV we were writing to crashed with a BSOD. We then found that we can reproduce the issue without triggering the BSOD by creating a VM virtual disk of only 1 GB. When doing so, we receive several of the following event log entries:

Event ID 153 Disk
The IO operation at logical block address 6c1a8 for Disk 3 was retried.

About 30 events of this kind are logged per second, and the disk they refer to is always different. The affected disks are not all in the same JBOD, so we do not suspect a single faulty JBOD. Next, we checked the health counters of each physical disk in the JBODs; they are all good and no errors are logged. Furthermore, the virtual disk (Storage Space) is put into a degraded state and a storage job is kicked off. Again, when a .vhdx file is created with a size > 5 GB, the SOFS cluster node crashes with a BSOD.
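
To check whether the retries really spread across disks and JBODs, the Event 153 messages can be tallied per disk once exported from the System log. The following is only a minimal Python sketch: the pattern is modeled on the single message quoted above (other builds or locales may word it differently), the function name `summarize_event_153` is hypothetical, and the 512-byte sector size is an assumption.

```python
import re
from collections import Counter

# Pattern modeled on the one message quoted in this thread; an optional
# "0x" prefix is tolerated. Other wordings are not handled (assumption).
EVENT_153 = re.compile(
    r"logical block address (?:0x)?(?P<lba>[0-9a-fA-F]+) for Disk (?P<disk>\d+)"
)


def summarize_event_153(messages, sector_bytes=512):
    """Count retried IOs per disk and collect the byte offsets involved.

    sector_bytes=512 is an assumption; use the real logical sector size.
    """
    per_disk = Counter()
    offsets = {}
    for text in messages:
        match = EVENT_153.search(text)
        if match is None:
            continue  # not an Event 153 retry message
        disk = int(match.group("disk"))
        lba = int(match.group("lba"), 16)  # the LBA is printed in hex
        per_disk[disk] += 1
        offsets.setdefault(disk, []).append(lba * sector_bytes)
    return per_disk, offsets
```

Feeding it the exported message texts gives a per-disk retry count that can be compared against the disk-to-JBOD mapping. If the counts cluster on one enclosure or one path, that would point away from the HBA or driver.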

We have verified Storage Spaces compatibility on windowsservercatalog.com; the following components are certified for use with Storage Spaces on Windows Server 2012 R2, 2016 and 2019:

H241
D3700
DL 380 Gen 9

Finally, we found that we cannot reproduce the issue after removing the redundant SAS connection to the JBODs. In other words, without path redundancy to the storage there are no problems.

We are operating a similar setup, just with JBODs and HBAs from different vendors:

2x HPE DL 380 Gen9
3x LSI SAS 9300-8e
3x DataOn DNS 2608 JBOD

This environment was also upgraded from Windows Server 2012 R2 to Windows Server 2016, but it operates as expected, without any problems.

Suspected problem:

There seems to be a problem between MPIO in Windows Server 2016 (Microsoft), the drivers for the H241 (Microsemi) and/or the HBA firmware.

HPE support asked us to contact the software vendor for Storage Spaces, which is Microsoft, and pointed out that HPE does not list Windows Server 2016 as a supported OS for the H241. They did present a document that does not list Windows Server 2016. Why, then, would the H241 be certified for use with Windows Server 2016 and 2019 on windowsservercatalog.com?

4 answers

Answer 1

Pauly, Christian 11

The picture outlines the redundant cabling...

Answer 2

Hi,
From your description, it seems to be related to the HPE H241 HBAs.
Here is the doc about event 153:
Interpreting Event 153 Errors
Since no other errors are recorded and the disks are healthy, it is hard for forum members to find the root cause. For better help, I suggest opening a case via the following link so that a dedicated Support Professional can assist you more efficiently.
https://support.microsoft.com/en-us/gp/customer-service-phone-numbers

Thanks for your time!
Best Regards,
Mico Mi

Answer 3

Hi Mico Mi,

you are absolutely right and I'm with you.

We are running two support cases, one with Microsoft and another one with HPE.
The fact that the disk health counters show no errors while we lose the Storage Spaces under IO load indicates to us that the problem occurs somewhere on the way down to the disks. That removing the redundant path stabilizes the environment points to a problem in the area of the HBA and MPIO. A similar setup with HBAs and JBODs from different vendors working as expected supports this idea.
We would have expected both companies to put their heads together for testing, or at least to come up with a way out of this terrible situation, in the worst case even a recommendation for different HBAs for this combination, but that doesn’t seem to be the case.
In the meantime we have tried all firmware versions for the HBA as well as every driver we could get. We even tried a different type of HBA, the LSI SAS 9300-8e, but it wouldn’t connect to the JBOD. We’ve also tried different MPIO policies, without success.
At some point we had an MVP specializing in Storage Spaces check our environment and perform some tests, but the conclusion was that we need support from either MS or HPE.
The Microsoft support case has been open for more than 2 weeks without any progress at all; the ticket is being pushed back and forth between subcontracted companies of MS Customer Support.
HPE only pointed out that our cabling scheme is not supported for the D3700 JBODs (they refer to daisy chaining) and that the H241 HBA does not list Windows Server 2016 as a supported OS – but this adapter is certified for use with Storage Spaces on Windows Server 2016 per windowsservercatalog.com.
As a customer with a Storage Spaces certified setup, one feels quite alone in this situation.
We would really appreciate it if anyone with similar issues or good ideas could share troubleshooting steps we haven’t tried yet while we continue to wait for support.

kind regards and many thanks,
Christian

Mico Mi 1,936 Reputation points

2021-02-18T02:14:45.84+00:00

Hi,
If you make any progress, please feel free to share it here.
And I also hope someone with similar issues can share their advice and solutions here.
Best Regards,
Mico Mi

Answer 4

Hi there, sorry for the late update, but the situation keeps us quite busy…
As mentioned before, we are running support cases with HPE and Microsoft on this one; the solution is yet to come…
HPE supported us by sending an alternative to the H241 HBA, the HPE Smart Array P441. Unfortunately, this device uses the same driver in the same version and is based on the same hardware. The hardware exchange was not successful: we are still losing the virtual disks as soon as there is a bit of disk IO, for instance while creating a .vhdx file on the Cluster Shared Volume. To rule out interference from the Windows MPIO feature, we have uninstalled it; there is no need for it right now since we are not using redundant cabling. Uninstalling the MPIO component did not fix our issue either.
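
Since the failure reportedly appears with any sustained disk IO, a load generator that does not depend on Hyper-V can make the reproduction easier to hand to support. The sketch below is in Python and rests on assumptions: it does not create a valid .vhdx, it only approximates the sequential write of a fixed-size file, and the function name `write_fixed_size_file` and the 4 MiB chunk size are hypothetical choices.

```python
import os
import time


def write_fixed_size_file(path, size_bytes, chunk_bytes=4 * 1024 * 1024):
    """Write size_bytes of zeros to path in chunks and flush them to disk.

    Returns the average throughput in bytes per second.
    """
    chunk = bytes(chunk_bytes)  # zero-filled buffer, reused for every write
    written = 0
    start = time.monotonic()
    with open(path, "wb") as handle:
        while written < size_bytes:
            step = min(chunk_bytes, size_bytes - written)
            handle.write(chunk[:step])
            written += step
        handle.flush()
        os.fsync(handle.fileno())  # push the data out of the OS cache
    elapsed = max(time.monotonic() - start, 1e-9)  # avoid division by zero
    return written / elapsed
```

Pointing `path` at a file on the affected Cluster Shared Volume and starting with a small size (the thread reports that 1 GB already produces Event 153 without a BSOD) would keep the test controlled; larger sizes risk crashing the node.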
HPE still keeps complaining that the cabling scheme is not a supported scenario. Luckily, with a bit of organizing, we were able to set up the following scenario:

1x DL 380 Gen 9 Windows Server 2016 (latest updates)
1x H241 HBA (FW 7.00)
1x D3700 JBOD (FW 7.00)

We have created the simplest possible scenario:
A single server connected to a single JBOD with a single miniSAS HD cable – no MPIO feature installed. All components are running the latest firmware (including disks) and drivers.

With Storage Spaces we have created a single virtual disk and made it highly available as a Cluster Shared Volume (to mimic the troubled cluster mentioned above). We are able to reproduce the error on this system as well. That makes 3 systems with similar JBODs and HBAs experiencing the same trouble. HPE is still unsure whether this JBOD, connected with a single miniSAS HD cable, is a supported scenario.
We first encountered this problem on 27 Jan 2021, right after the upgrade, and contacted HPE right away. Unfortunately, we are not authorized to send log files of any kind.
Right now I don’t see how this could be anything other than a driver or firmware issue (or both). We have eliminated as many potential obstacles as possible between the OS and the storage. We are waiting for HPE staff to analyze the system, but so far they have failed to find the right people to deal with Storage Spaces or Scale-Out File Server.
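
The elimination reasoning in this thread can be written down as a small table of tested configurations, which helps show what the tests so far can and cannot separate. This is only a Python sketch: the attribute labels are shorthand invented here, the `hba-family` tag encodes the statement that the P441 uses the same driver as the H241, and the early single-path observation from the question is left out because Answer 4 reports failures without redundant cabling.

```python
from itertools import combinations

# (attributes of a tested setup, did it fail?) as reported in this thread.
# Labels are illustrative shorthand, not vendor terminology.
OBSERVATIONS = [
    ({"os:2012R2", "hba-family:hpe-smart", "hba:H241", "jbod:D3700", "mpio:on"}, False),
    ({"os:2016", "hba-family:hpe-smart", "hba:H241", "jbod:D3700", "mpio:on"}, True),
    ({"os:2016", "hba:LSI-9300-8e", "jbod:DataOn-2608"}, False),
    ({"os:2016", "hba-family:hpe-smart", "hba:P441", "jbod:D3700", "mpio:off"}, True),
    ({"os:2016", "hba-family:hpe-smart", "hba:H241", "jbod:D3700", "mpio:off"}, True),
]


def suspects(observations, max_size=3):
    """Smallest attribute combinations seen in every failing setup
    and in no working setup."""
    failing = [attrs for attrs, failed in observations if failed]
    passing = [attrs for attrs, failed in observations if not failed]
    if not failing:
        return []
    common = set.intersection(*failing)  # present in every failing setup
    found = []
    for size in range(1, max_size + 1):
        for combo in combinations(sorted(common), size):
            combo_set = set(combo)
            if any(combo_set <= ok for ok in passing):
                continue  # the same combination also works somewhere
            if any(set(smaller) <= combo_set for smaller in found):
                continue  # already covered by a smaller suspect
            found.append(combo)
    return found
```

With the observations as encoded, `suspects(OBSERVATIONS)` returns two pairs: the HPE HBA family under Windows Server 2016, and the D3700 under Windows Server 2016. If that encoding is fair, the tests so far do not separate the HBA or driver from the JBOD itself, since the only non-HPE HBA tried would not connect to the D3700.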
Does anyone have a good idea how to narrow this down further? Just to be clear once more: this is not S2D (Storage Spaces Direct), this is a Scale-Out File Server on Windows Server 2016, normally connected redundantly to 3 JBODs in a 3-way mirror configuration. I’ll also provide a sketch of our simplified scenario; maybe this helps understanding…
Best regards,
Christian
