
2-Node Failover Cluster fails to failover during graceful shutdown/restart (Error 1168)

Nithin Kanyan Kandi 5 Reputation points
2026-04-12T08:00:42.2966667+00:00

I am experiencing an issue with a 2-node Windows Failover Cluster where automatic failover works perfectly during hard failures, but completely fails during a graceful OS shutdown or restart.

Here are the details of my environment and the specific symptoms:

Environment Details:

OS: Windows Server 2025 Standard (version 24H2)

Cluster Setup: 2-Node Cluster with a Disk Witness

Workload: SQL Server Always On Availability Groups (AG)

Storage: Dell ME Series SAN via iSCSI

Multipathing: Dell specific DSM

The Issue:

The cluster successfully handles unexpected failures, but when I attempt a standard restart or shutdown of the active node, the cluster fails to move the resources to the secondary node. The disk resource seems to hang or fail during the termination process.

Testing Matrix:

Hard Power Pull (Active Node): Failover is SUCCESSFUL.

Network Disconnection (Active Node): Failover is SUCCESSFUL.

Manual Resource Move (via FCM): Move is SUCCESSFUL.

Graceful Restart / Shutdown (Active Node): Fails to move resources to the other node.

Error Logs:

In the Cluster Events / System Event logs, I am seeing the following error during the shutdown process:

"Cluster physical disk resource encountered attempting to terminate, error code 1168"

Troubleshooting Performed:

Validated the cluster configuration (Cluster Validation passes).

Verified that manual drain/pause of the node works correctly.

Checked that the Dell ME SAN firmware and drivers are up to date.

Windows for business | Windows Server | Storage high availability | Clustering and high availability

4 answers

  1. Marcin Rabiniak 0 Reputation points
    2026-04-22T22:06:14.8+00:00

    I got the same error. Did you eventually find a fix? During my diagnostics I am also finding errors like the one below:

    30952 Warning Microsoft-Windows-SMBClient Microsoft-Windows-SMBClient/Operational

    The SMB redirector did not select the connection initiated with the following parameters:

    Server name: fe80::a325:be9e:25f0:8a05%13

    IP Address: x.x.x.x:445

    Transport: TCPIP

    Instance Name:\Device\SmbCsv

    Port Origin: The port was selected from the global registry settings.

    The failure status associated with this decision: The network path cannot be located.



  2. Me 0 Reputation points
    2026-04-20T15:48:29.5+00:00

    I have a similar issue with a new setup: a 2-node Windows Server 2025 cluster with Dell ME5 shared storage. When I add a quorum witness disk, the cluster goes down and I can't fix it; I have to recreate the cluster. Have you found anything yet?


  3. Tan Vu 2,655 Reputation points Independent Advisor
    2026-04-13T16:16:15.6866667+00:00

    Hi Kandi,

    The extra message makes the picture clearer: this is not just an AG failover problem. It is a real quorum-loss event during a planned shutdown. Microsoft describes quorum as the majority of votes the cluster needs to stay online, and says the cluster stops if it drops below that majority. In a 2-node cluster with a witness, the witness is part of that vote, and either node should be able to survive a single node failure.

    Because hard failure, network loss, and manual move all work, but a graceful shutdown does not, the likely weak point is the ordered shutdown path: the node is losing quorum while Cluster Service is trying to hand off votes/resources. A disk witness is a shared disk used in quorum voting, so if that witness or its storage path becomes unavailable during shutdown, quorum can drop and the cluster service will stop. That fits your “witness disk failover / 1168 / lost quorum” pattern much better than a pure SQL AG issue.

    The next thing I'd like you to try:

    Move-ClusterGroup "Cluster Group" -Node <name-node>
    

    Run this before you reboot or shut down the active node; Microsoft describes moving a clustered role as an appropriate step for routine maintenance.

    To find the value to use for <name-node>, run:

    Get-ClusterNode | Format-Table Name, State
    
    

    For the AG itself, a planned manual failover is only supported when both replicas are synchronous and synchronized.

    If that still fails, the disk witness specifically may be the problem. As a temporary test, switch to a file share witness or cloud witness, both of which Microsoft supports as quorum options.
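
    A minimal sketch of that witness swap, assuming a file share at `\\fs01\ClusterWitness` and a witness disk resource named "Cluster Disk 1". Both names are hypothetical placeholders, not values from this thread; substitute a share the cluster can write to and the real disk resource name.

    ```powershell
    # Record the current quorum configuration so it can be restored afterwards.
    Get-ClusterQuorum | Format-List *

    # Temporarily replace the disk witness with a file share witness.
    # '\\fs01\ClusterWitness' is a placeholder share.
    Set-ClusterQuorum -FileShareWitness '\\fs01\ClusterWitness'

    # ... repeat the graceful restart test here ...

    # Switch back to the disk witness once the test is done.
    # 'Cluster Disk 1' is an assumed resource name.
    Set-ClusterQuorum -DiskWitness 'Cluster Disk 1'
    ```

    If the graceful restart succeeds with the file share witness in place, that points at the witness disk or its storage path rather than at the AG.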

    Next, please capture the output of Get-ClusterQuorum and the cluster log during the shutdown attempt. If the witness vote disappears or the disk witness goes offline first, that will confirm the root cause.
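
    One possible way to snapshot the vote state, sketched under the assumption that it is run from the node that stays up, once before the shutdown and again while the other node is going down:

    ```powershell
    # Current witness resource and quorum configuration.
    Get-ClusterQuorum | Format-List *

    # Per-node votes: NodeWeight is the configured vote,
    # DynamicWeight is the vote currently in effect.
    Get-ClusterNode | Format-Table Name, State, NodeWeight, DynamicWeight

    # 1 means the witness currently holds a vote, 0 means it does not.
    (Get-Cluster).WitnessDynamicWeight
    ```

    Comparing the two snapshots should show whether the witness vote or a node vote changes before the cluster service stops.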



  4. Tan Vu 2,655 Reputation points Independent Advisor
    2026-04-12T09:38:57.4133333+00:00

    Hi Kandi,

    This looks more like a storage/offline path problem than a quorum problem. In WSFC, a planned failover starts with the cluster taking the primary replica offline, and the cluster service waits for resources to shut down gracefully before it escalates to termination; PendingTimeout controls how long it waits. Error 1168 is ERROR_NOT_FOUND ("Element not found"). https://learn.microsoft.com/en-us/sql/database-engine/availability-groups/windows/failover-and-failover-modes-always-on-availability-groups?view=sql-server-ver17

    Because hard failure, network loss, and manual move all work, but graceful shutdown or restart does not, my read is that the weak spot is the orderly shutdown of the disk witness or shared-disk path: stale persistent reservations, a third-party DSM/MPIO interaction, or storage latency that only appears when Windows is shutting down the storage stack. Microsoft’s MPIO guidance specifically calls out third-party DSMs, path loss, and outdated drivers, and the disk troubleshooting guide recommends checking the system logs, running Set-ClusterLog -Level 5, and investigating persistent reservation issues. https://learn.microsoft.com/en-us/troubleshoot/windows-server/backup-and-storage/windows-server-mpio-troubleshooting?
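
    A rough sketch of those checks, assuming the FailoverClusters module is available and the MPIO feature (which provides mpclaim.exe) is installed on the node:

    ```powershell
    # Raise cluster log verbosity (range 0-5, default 3).
    # Set it back to 3 once testing is finished.
    Set-ClusterLog -Level 5

    # Show how long each Physical Disk resource may sit in a pending
    # state before the cluster escalates (PendingTimeout, in milliseconds).
    Get-ClusterResource |
        Where-Object { $_.ResourceType -eq 'Physical Disk' } |
        Format-Table Name, State, OwnerGroup, OwnerNode, PendingTimeout

    # List the disks claimed by MPIO and the DSM that owns each one.
    mpclaim.exe -s -d
    ```

    The mpclaim output is a quick way to confirm whether the disks are claimed by the Dell DSM or by the Microsoft DSM.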

    For the next test, I would drain the node before shutting it down: Suspend-ClusterNode <active-node> -Drain -ForceDrain. Microsoft documents that -Drain moves workloads gracefully, and -ForceDrain stops anything that cannot be moved safely. For AGs, a planned failover is only supported when both replicas are synchronous and synchronized, so draining first or failing over the AG explicitly before rebooting is the safer maintenance path.
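
    As an illustration only, a drain-then-restart sequence might look like the sketch below. `NODE1` is a placeholder for the node being restarted, and the restart line is left commented out so nothing reboots until the checks have been reviewed.

    ```powershell
    # NODE1 is a placeholder for the node about to be restarted.
    $node = 'NODE1'

    # Pause the node and move its roles away; -Wait blocks until the drain finishes.
    Suspend-ClusterNode -Name $node -Drain -Wait

    # Confirm the node is Paused and no longer owns any groups.
    Get-ClusterNode -Name $node | Format-Table Name, State
    Get-ClusterGroup | Format-Table Name, OwnerNode, State

    # Restart-Computer -ComputerName $node -Force   # run once the checks above look right

    # After the node is back online, resume it without moving roles back.
    Resume-ClusterNode -Name $node -Failback NoFailback
    ```

    If the restart succeeds after an explicit drain but still fails without one, that narrows the problem to the shutdown-initiated drain rather than the storage path in general.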

    If the logs keep pointing at the disk witness, a useful isolation step is to temporarily switch quorum to a file share witness and see whether graceful shutdown starts working; Microsoft supports file share witness on Windows Server 2025, and notes that disk witness and file share witness are standard quorum options for a 2-node cluster. If the PR issue remains, the storage vendor should inspect the LUN and persistent reservations, and Microsoft notes Clear-ClusterDiskReservation as a possible cleanup step. https://learn.microsoft.com/en-us/windows-server/failover-clustering/file-share-witness?tabs=domain-joined-witness

    If you paste the relevant cluster log lines around the 1168 event, I can help pinpoint whether it is the AG role, the disk witness, or the iSCSI/DSM layer that is stalling.
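
    One way to pull those lines, sketched with a placeholder output folder (`C:\Temp\ClusterLogs` is an assumption; any writable path works):

    ```powershell
    # Output folder is a placeholder; it is created if it does not exist.
    $dest = 'C:\Temp\ClusterLogs'
    New-Item -ItemType Directory -Path $dest -Force | Out-Null

    # Dump the last 30 minutes of the cluster log from every node, in local time.
    Get-ClusterLog -Destination $dest -TimeSpan 30 -UseLocalTime

    # Show each line mentioning 1168 with five lines of context on either side.
    Select-String -Path (Join-Path $dest '*.log') -Pattern '\b1168\b' -Context 5,5
    ```

    Run it on a surviving node shortly after the failed shutdown so the 30-minute window still covers the event.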


