Fileserver Failover Cluster Draining Node takes forever

Question

Hello,

We have a 2-node failover cluster on Windows Server 2019 with S2D, and only the 'File Server' role is installed. The role hosts a single SMB file share.

When pausing one node using the 'pause -> drain roles' command, clients lose connectivity to the SMB share for a very long time (5 min).

In the cluster log I can see this event as the cluster prepares to move the disk to the other node:

00002234.00003344::2020/08/31-07:40:36.695 INFO [RES] Physical Disk : Sending FSCTL_DISMOUNT_VOLUME ...

However, it never succeeds. The cluster share is offline at this point because the cluster IP and cluster name are already down.

About 3.5 minutes later, a timeout is reported:

00002234.00002258::2020/08/31-07:44:05.234 WARN [RHS - Timeout] Resource 'Cluster Virtual Disk (FileServerVirtualDisk)' has not responded to the call OFFLINERESOURCE:4. The timeout to respond has been exceeded by 16 milliseconds, taking recovery actions.

Four minutes after that, another timeout is logged:

00002234.00003104::2020/08/31-07:48:05.221 ERR [RES] Physical Disk : Terminate: Terminate thread timed out, attempting to stop pool defense and detach space for device number 7.

After that, failover finally takes place. That is the expected behaviour once a timeout has occurred; however, a timeout should never happen in the first place.

When I tried the switchover while there was no load on the file share, everything worked as expected, e.g.
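The gaps between the three log entries quoted above can be checked with a short PowerShell sketch. The timestamps are copied from the log lines; the format string and the variable names are assumptions made for illustration.

```powershell
# Timestamp layout as it appears in the quoted log lines (assumed format string).
$format  = 'yyyy/MM/dd-HH:mm:ss.fff'
$culture = [System.Globalization.CultureInfo]::InvariantCulture

# Timestamps copied from the three log entries above.
$dismount  = [datetime]::ParseExact('2020/08/31-07:40:36.695', $format, $culture)  # FSCTL_DISMOUNT_VOLUME sent
$timeout   = [datetime]::ParseExact('2020/08/31-07:44:05.234', $format, $culture)  # RHS timeout warning
$terminate = [datetime]::ParseExact('2020/08/31-07:48:05.221', $format, $culture)  # Terminate thread timed out

# Elapsed time between the entries.
$gapToTimeout   = ($timeout - $dismount).TotalSeconds     # about 208.5 s, roughly 3.5 minutes
$gapToTerminate = ($terminate - $timeout).TotalSeconds    # about 240 s, roughly 4 minutes
$totalMinutes   = ($terminate - $dismount).TotalMinutes   # about 7.5 minutes in total

"Dismount -> RHS timeout:     $gapToTimeout s"
"RHS timeout -> terminate:    $gapToTerminate s"
"Dismount -> terminate total: $totalMinutes min"
```

Taken together, the two timeouts account for about 7.5 minutes between the dismount request and the terminate error.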

00002450.00000af4::2020/08/26-08:03:30.144 INFO [RES] Physical Disk : Sending IOCTL_VOLUME_OFFLINE ...
00002450.00000af4::2020/08/26-08:03:30.145 INFO [RES] Physical Disk : Offlining disk ...

The next action happens only 1 ms after the IOCTL_VOLUME_OFFLINE command.

Obviously, something is preventing IOCTL_VOLUME_OFFLINE from finishing under load.

I watched Task Manager -> Memory during that operation and could see the amount of "Cached" memory decreasing while waiting for the IOCTL_VOLUME_OFFLINE command.

How can I make sure my File Server failover cluster role can actually fail over without losing connectivity for more than 5 minutes?

/Klaus

Answer

Have you run the cluster validation wizard against your configuration to check that the cluster sees all components without errors or warnings?

Answer

Hi,

According to your description and based on my experience, this may be an RHS deadlock that occurs when you drain the node:

"RHS will sit there waiting for the resource to respond to an IsAlive call, and eventually it will give up and need to take recovery action. By default RHS will wait for 5 minutes for the resource to respond to an entry point call to it. This is configurable with the resource DeadlockTimeout common property.

To modify the DeadlockTimeout property of an individual resource, you can use the following PowerShell cmdlet command:

(Get-ClusterResource "Resource Name").DeadlockTimeout = 300000"

You may check the following article for detailed information about RHS deadlock:

https://techcommunity.microsoft.com/t5/failover-clustering/understanding-how-failover-clustering-recovers-from-unresponsive/ba-p/371847#:~:text=This%20is%20usually%20associated%20with,the%20RHS%20process%20to%20terminate.
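As a minimal sketch of the DeadlockTimeout change quoted above, the following checks the current value before setting a new one. It assumes the FailoverClusters module is available on the node and uses the resource name from the RHS timeout message in the question; the variable name is only illustrative.

```powershell
# Assumes this runs on a cluster node with the FailoverClusters module installed.
Import-Module FailoverClusters

# Resource name as it appears in the RHS timeout message in the question.
$resource = Get-ClusterResource -Name 'Cluster Virtual Disk (FileServerVirtualDisk)'

# Show the current value (in milliseconds) before changing anything.
$resource | Format-List Name, State, DeadlockTimeout

# DeadlockTimeout is given in milliseconds; 300000 ms is the value from the quoted example.
$resource.DeadlockTimeout = 300000
```

Note that 300000 ms is 5 minutes, which matches the default described in the quote, so a different value would be needed to actually change the behaviour.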

If the reply could be of help, please help to accept it as an answer, thanks for your cooperation!
Thanks for your time!
Best Regards,
Anne
