Fileserver Failover Cluster Draining Node takes forever

Klaus Koehler 1 Reputation point
2020-08-31T08:57:55.773+00:00

Hello,

we have a 2-node Failover Cluster on Windows Server 2019 S2D with only the 'File Server' role installed. The Role hosts a single SMB file share.

When I pause one node using the 'pause -> drain roles' command, clients lose connectivity to the SMB share for a very long time (about 5 minutes).

In the cluster log I can see this event as the cluster prepares to move the disk to the other node:

00002234.00003344::2020/08/31-07:40:36.695 INFO [RES] Physical Disk <Cluster Virtual Disk (FileServerVirtualDisk)>: Sending FSCTL_DISMOUNT_VOLUME ...

However, the dismount never succeeds. The cluster share is already offline at this point, because the cluster IP and cluster name are already down.

About 3.5 minutes later, a timeout is reported:

00002234.00002258::2020/08/31-07:44:05.234 WARN [RHS - Timeout] Resource 'Cluster Virtual Disk (FileServerVirtualDisk)' has not responded to the call OFFLINERESOURCE:4. The timeout to respond has been exceeded by 16 milliseconds, taking recovery actions.

Another 4 minutes later, a second timeout is logged:

00002234.00003104::2020/08/31-07:48:05.221 ERR [RES] Physical Disk <Cluster Virtual Disk (FileServerVirtualDisk)>: Terminate: Terminate thread timed out, attempting to stop pool defense and detach space for device number 7.

After that, the failover finally takes place. That's expected behaviour once a timeout has occurred; however, a timeout should never happen in the first place.
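The gaps between the three log entries quoted above can be checked directly from their timestamps. This is a minimal Python sketch, assuming the `YYYY/MM/DD-HH:MM:SS.mmm` timestamp format seen in those lines; the helper names `parse_ts` and `gaps` and the event labels are made up for illustration and are not part of any cluster tooling.

```python
from datetime import datetime

# Timestamp layout as it appears in the quoted cluster log lines (assumed).
FMT = "%Y/%m/%d-%H:%M:%S.%f"

def parse_ts(line: str) -> datetime:
    """Return the timestamp of a line shaped like '<pid>.<tid>::<timestamp> LEVEL ...'."""
    stamp = line.split("::", 1)[1].split(" ", 1)[0]
    return datetime.strptime(stamp, FMT)

# Line prefixes copied from the log excerpts above; the labels are illustrative.
EVENTS = [
    ("dismount sent", "00002234.00003344::2020/08/31-07:40:36.695 INFO"),
    ("offline timeout", "00002234.00002258::2020/08/31-07:44:05.234 WARN"),
    ("terminate timeout", "00002234.00003104::2020/08/31-07:48:05.221 ERR"),
]

def gaps(events):
    """Return (from_label, to_label, seconds) for each pair of consecutive events."""
    times = [(label, parse_ts(line)) for label, line in events]
    return [
        (a_label, b_label, (b_time - a_time).total_seconds())
        for (a_label, a_time), (b_label, b_time) in zip(times, times[1:])
    ]

if __name__ == "__main__":
    for start, end, seconds in gaps(EVENTS):
        print(f"{start} -> {end}: {seconds:.3f} s")
```

The two gaps come out at roughly 3.5 and 4 minutes, so about 7.5 minutes pass between the dismount request and the terminate timeout.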

When I tried the switchover while there was no load on the file share, everything worked as expected, e.g.

00002450.00000af4::2020/08/26-08:03:30.144 INFO [RES] Physical Disk <Cluster Virtual Disk (FileServerVirtualDisk)>: Sending IOCTL_VOLUME_OFFLINE ...
00002450.00000af4::2020/08/26-08:03:30.145 INFO [RES] Physical Disk <Cluster Virtual Disk (FileServerVirtualDisk)>: Offlining disk ...

The next action follows only 1 ms after the IOCTL_VOLUME_OFFLINE command.

Obviously, something is preventing IOCTL_VOLUME_OFFLINE from finishing under load.

I watched Task Manager -> Memory during the operation and could see the amount of "Cached" memory decreasing while the cluster was waiting for the IOCTL_VOLUME_OFFLINE command.

How can I make sure my File Server Failover cluster role can actually fail over without losing connectivity for >5 minutes?

/Klaus

Windows Server 2019
Windows Server Clustering

2 answers

  1. TimCerling(ret) 1,156 Reputation points
    2020-08-31T12:34:37.32+00:00

    Have you run the cluster validation wizard against your configuration to confirm that the cluster sees all components without errors or warnings?


  2. Xiaowei He 9,871 Reputation points
    2020-09-01T06:35:58.107+00:00

    Hi,

    Based on your description and my experience, you may be running into an RHS deadlock when you drain the node:

    "RHS will sit there waiting for the resource to respond to an IsAlive call, and eventually it will give up and need to take recovery action. By default RHS will wait for 5 minutes for the resource to respond to an entry point call to it. This is configurable with the resource DeadlockTimeout common property.

    To modify the DeadlockTimeout property of an individual resource, you can use the following PowerShell cmdlet command:

    (Get-ClusterResource "Resource Name").DeadlockTimeout = 300000"

    You may check the following article for detailed information about RHS deadlock:

    https://techcommunity.microsoft.com/t5/failover-clustering/understanding-how-failover-clustering-recovers-from-unresponsive/ba-p/371847#:~:text=This%20is%20usually%20associated%20with,the%20RHS%20process%20to%20terminate.


    If this reply helps, please accept it as an answer.
    Thanks for your time!
    Best Regards,
    Anne