2-Node Failover Cluster fails to failover during graceful shutdown/restart (Error 1168)

Question

Nithin Kanyan Kandi 5

I am experiencing an issue with a 2-node Windows Failover Cluster where automatic failover works perfectly during hard failures, but completely fails during a graceful OS shutdown or restart.

Here are the details of my environment and the specific symptoms:

Environment Details:

OS: Windows Server 2025 Standard (version 24H2)

Cluster Setup: 2-Node Cluster with a Disk Witness

Workload: SQL Server Always On Availability Groups (AG)

Storage: Dell ME Series SAN via iSCSI

Multipathing: Dell-specific DSM

The Issue:

The cluster successfully handles unexpected failures, but when I attempt a standard restart or shutdown of the active node, the cluster fails to move the resources to the secondary node. The disk resource seems to hang or fail during the termination process.

Testing Matrix:

Hard Power Pull (Active Node): Failover is SUCCESSFUL.

Network Disconnection (Active Node): Failover is SUCCESSFUL.

Manual Resource Move (via FCM): Move is SUCCESSFUL.

Graceful Restart / Shutdown (Active Node): Fails to move resources to the other node.

Error Logs:

In the Cluster Events / System Event logs, I am seeing the following error during the shutdown process:

"Cluster physical disk resource encountered attempting to terminate, error code 1168"

Troubleshooting Performed:

Validated the cluster configuration (Cluster Validation passes).

Verified that manual drain/pause of the node works correctly.

Checked that the Dell ME SAN firmware and drivers are up to date.

Answer 1

Marcin Rabiniak 0

I got the same error. Did you eventually find a fix? During my diagnostics I am also finding errors like the one below:

30952 Warning Microsoft-Windows-SMBClient Microsoft-Windows-SMBClient/Operational

The SMB redirector did not select the connection initiated with the following parameters:

Server name: fe80::a325:be9e:25f0:8a05%13

IP Address: x.x.x.x:445

Transport: TCPIP

Instance Name:\Device\SmbCsv

Port Origin: The port was selected from the global registry settings.

The failure status associated with this decision: The network path cannot be located.

Nithin Kanyan Kandi 5 Reputation points

2026-04-23T05:39:41.8866667+00:00

Hello,

The problem occurred only during a reboot, when the cluster failed. It turned out to be a Windows issue. I resolved it by enabling a fix for Windows Server 2025 that was included in the August patch but is turned off by default; you can activate it through an override on the machine.

Please verify if the August 2025 Cumulative Update or a later version for Windows Server 2025 is installed:

PowerShell

Get-HotFix -Id KB5063878
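
Keep in mind that Get-HotFix -Id matches only that exact KB number, so a node that already has a later cumulative update may not return it. A minimal sketch for listing the most recently installed updates instead (the count of 5 is an arbitrary choice):

```powershell
# List the five most recently installed updates on this node.
# InstalledOn may be empty for some entries.
Get-HotFix |
    Sort-Object -Property InstalledOn -Descending |
    Select-Object -First 5 -Property HotFixID, Description, InstalledOn
```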

Edit the registry key; create the key if it does not exist.

To Enable the fix:

Windows Command Prompt

reg add HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Policies\Microsoft\FeatureManagement\Overrides /v 2005146767 /t REG_DWORD /d 1 /f

To Disable the fix:

Windows Command Prompt

reg add HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Policies\Microsoft\FeatureManagement\Overrides /v 2005146767 /t REG_DWORD /d 0 /f

To return the fix to its default state:

Windows Command Prompt

reg delete HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Policies\Microsoft\FeatureManagement\Overrides /v 2005146767 /f

Restart nodes after registry change.
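
To confirm the override state before and after the change, something like the following could be run on each node. This is only a sketch: it reads the same value name used in the reg commands above and assumes nothing about other values under that key.

```powershell
$path = 'HKLM:\SYSTEM\CurrentControlSet\Policies\Microsoft\FeatureManagement\Overrides'
$name = '2005146767'

# Returns $null if the key or the value is absent (the default state).
$item = Get-ItemProperty -Path $path -Name $name -ErrorAction SilentlyContinue

if ($null -eq $item) {
    "Override $name is not set (default state)."
} else {
    "Override $name is set to $($item.$name)."
}
```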

Answer 2

Me 0

I have a similar issue with a new setup: a 2-node Windows Server 2025 cluster with Dell ME5 shared storage. When I add a quorum witness disk, the cluster goes down and I can't fix it; I have to recreate the cluster. Have you found anything yet?

Answer 3

Hi Kandi,

The extra message makes the picture clearer: this is not just an AG failover problem. It is a real quorum-loss event during a planned shutdown. Microsoft describes quorum as the majority of votes the cluster needs to stay online, and says the cluster stops if it drops below that majority. In a 2-node cluster with a witness, the witness is part of that vote, and the cluster should be able to survive the failure of a single node.

Because hard failure, network loss, and manual move all work, but a graceful shutdown does not, the likely weak point is the ordered shutdown path: the node is losing quorum while Cluster Service is trying to hand off votes/resources. A disk witness is a shared disk used in quorum voting, so if that witness or its storage path becomes unavailable during shutdown, quorum can drop and the cluster service will stop. That fits your “witness disk failover / 1168 / lost quorum” pattern much better than a pure SQL AG issue.
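
To make the vote arithmetic concrete, here is a simplified sketch. Test-Quorum is a hypothetical helper, not a cluster cmdlet, and it counts static votes only; it ignores the dynamic quorum adjustments a real cluster makes.

```powershell
# Hypothetical helper: static majority check, no dynamic quorum.
function Test-Quorum {
    param(
        [int]$TotalVotes,   # all configured votes (nodes + witness)
        [int]$OnlineVotes   # votes still reachable
    )
    $majority = [math]::Floor($TotalVotes / 2) + 1
    return $OnlineVotes -ge $majority
}

# 2 nodes + disk witness = 3 votes, so the majority is 2.
Test-Quorum -TotalVotes 3 -OnlineVotes 2   # one node down, witness online: True
Test-Quorum -TotalVotes 3 -OnlineVotes 1   # node down and witness offline: False
```

The second case is the one that would match a "lost quorum" shutdown: the departing node and the witness drop out at the same time, leaving the survivor with one vote out of three.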

The next thing I want you to try: before you reboot or shut down the active node, move the core cluster group to the other node. Microsoft describes moving a clustered role as an appropriate step for routine maintenance.

Move-ClusterGroup "Cluster Group" -Node <name-node>

To find the value for <name-node>, run:

Get-ClusterNode | Format-Table Name, State

For the AG itself, a planned manual failover is only supported when both replicas are synchronous and synchronized.

If that still fails, the disk witness specifically may be the problem. Temporarily test with a file share witness or cloud witness, both of which Microsoft supports as quorum options.

Next, I want you to capture the output of Get-ClusterQuorum and the cluster log during the shutdown attempt. If the witness vote disappears or the disk witness goes offline first, that will confirm the root cause.
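
A rough sketch of that capture, run from the node that stays up. The destination folder and the 15-minute window are placeholder choices, and Get-ClusterLog may not be able to collect from a node whose Cluster service has already stopped.

```powershell
# Current quorum configuration and witness resource.
Get-ClusterQuorum | Format-List *

# Per-node vote state: NodeWeight is the configured vote,
# DynamicWeight is the vote currently assigned by dynamic quorum.
Get-ClusterNode | Format-Table Name, State, NodeWeight, DynamicWeight

# Whether the witness currently holds a vote (1) or not (0).
(Get-Cluster).WitnessDynamicWeight

# After the shutdown attempt, dump the last 15 minutes of the cluster log.
New-Item -ItemType Directory -Path C:\Temp\ClusterLogs -Force | Out-Null
Get-ClusterLog -Destination C:\Temp\ClusterLogs -TimeSpan 15 -UseLocalTime
```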

Nithin Kanyan Kandi 5 Reputation points

2026-04-23T05:43:10.9633333+00:00

Hi,

Thank you for your support. The issue was with Windows Server 2025. Microsoft released a patch for it, but it's disabled by default. Once it is enabled, the issue is resolved.

Answer 4

Hi Kandi,

This looks more like a storage/offline path problem than a quorum problem. In WSFC, a planned failover starts with the cluster taking the primary replica offline, and the cluster service waits for resources to shut down gracefully before it escalates to termination; PendingTimeout controls how long it waits. Error 1168 is ERROR_NOT_FOUND ("Element not found"). https://learn.microsoft.com/en-us/sql/database-engine/availability-groups/windows/failover-and-failover-modes-always-on-availability-groups?view=sql-server-ver17

Because hard failure, network loss, and manual move all work, but graceful shutdown or restart does not, my read is that the weak spot is the orderly shutdown of the disk witness or shared-disk path: stale persistent reservations, a third-party DSM/MPIO interaction, or storage latency that only appears when Windows is shutting down the storage stack. Microsoft's MPIO guidance specifically calls out third-party DSMs, path loss, and outdated drivers, and the disk troubleshooting guide recommends checking the system logs, setting Set-ClusterLog -Level 5, and investigating persistent reservation issues. https://learn.microsoft.com/en-us/troubleshoot/windows-server/backup-and-storage/windows-server-mpio-troubleshooting
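
As a sketch of that log collection (the one-hour window is an arbitrary choice; adjust it to cover the shutdown attempt):

```powershell
# Raise cluster log verbosity to the most detailed level.
Set-ClusterLog -Level 5

# Pull recent failover clustering events from the System log on this node.
Get-WinEvent -FilterHashtable @{
    LogName      = 'System'
    ProviderName = 'Microsoft-Windows-FailoverClustering'
    StartTime    = (Get-Date).AddHours(-1)
} | Format-Table TimeCreated, Id, LevelDisplayName, Message -Wrap
```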

For the next test, I would drain the node before the shutdown: Suspend-ClusterNode <active-node> -Drain -ForceDrain. Microsoft documents that -Drain moves workloads gracefully, and -ForceDrain stops anything that cannot be moved safely. For AGs, a planned failover is only supported when both replicas are synchronous and synchronized, so draining first or failing over the AG explicitly before rebooting is the safer maintenance path.
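
A possible maintenance sequence along those lines, as a sketch only. NODE1 (the node to reboot) and NODE2 (the node that stays up) are placeholder names.

```powershell
# Drain all roles off the node to be rebooted and wait for the drain to finish.
Suspend-ClusterNode -Name NODE1 -Drain -TargetNode NODE2 -Wait

# Confirm nothing is still owned by NODE1, including the core "Cluster Group".
Get-ClusterGroup | Where-Object { $_.OwnerNode.Name -eq 'NODE1' }

# If the list is empty, reboot NODE1. Once it is back online, resume it:
Resume-ClusterNode -Name NODE1 -Failback Immediate
```

If the second command still lists groups on NODE1, that is worth capturing before the reboot, since it shows which resource refused to move.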

If the logs keep pointing at the disk witness, a useful isolation step is to temporarily switch quorum to a file share witness and see whether graceful shutdown starts working; Microsoft supports file share witness on Windows Server 2025, and notes that disk witness and file share witness are standard quorum options for a 2-node cluster. If a persistent reservation (PR) issue remains, the storage vendor should inspect the LUN and its persistent reservations, and Microsoft notes Clear-ClusterDiskReservation as a possible cleanup step. https://learn.microsoft.com/en-us/windows-server/failover-clustering/file-share-witness?tabs=domain-joined-witness
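
For the witness swap, a sketch might look like this. The share path \\fileserver\witness and the disk name 'Cluster Disk 1' are hypothetical; substitute your own.

```powershell
# Record the current witness so it can be restored later.
Get-ClusterQuorum

# Temporarily switch to a file share witness.
Set-ClusterQuorum -FileShareWitness \\fileserver\witness

# ...repeat the graceful restart test here...

# Switch back to the disk witness afterwards.
Set-ClusterQuorum -DiskWitness 'Cluster Disk 1'
```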

If you paste the relevant cluster log lines around the 1168 event, I can help pinpoint whether it is the AG role, the disk witness, or the iSCSI/DSM layer that is stalling.

Nithin Kanyan Kandi 5 Reputation points

2026-04-12T13:20:54.0766667+00:00

The following errors are logged along with the 1168 event:

The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

Cluster Service has terminated due to a fatal error. Error Message: lost quorum, Error Code: 3735605
