Why would WSFC try to restart a failed SQL Server role on the same node?

Hamilton, Chuck 1 Reputation point
2022-09-14T14:27:43.587+00:00

We have a 2-node active/passive cluster hosting a single SQL Server role. The Windows version is Server 2012 R2 and the SQL Server version is 2016 SP2 CU4.

SQL Server crashed hard on cluster node A without recording anything related to the crash in the error log. There was an empty minidump file from the time of the crash and a .txt file that reported a "bugcheck" and a session with a non-yielding scheduler, but it didn't contain much else. No stack dump, nothing.

The cluster service, however, tried to restart the SQL role a couple of times on the same node it crashed on, and only after about 15 minutes did it finally fail over to node B. Why would it take so long for node B to take over the role? I've worked with a lot of clusters, and normally the role fails over to the passive node immediately.
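
How many times WSFC retries a failed resource on the same node before moving the role is controlled by the resource's restart policy and the role's failover settings, so those values are worth checking. Below is a minimal sketch for reading them with the FailoverClusters PowerShell module; the resource name "SQL Server" and role name "SQL Server (MSSQLSERVER)" are placeholders for whatever your cluster actually uses.

```powershell
# Sketch: read the policies that control same-node restarts vs. failover.
# Resource and role names below are placeholders -- substitute your own.
Import-Module FailoverClusters

# Resource-level policy: restarts attempted on the current node within
# RestartPeriod (milliseconds) before the failure is escalated to the role.
Get-ClusterResource -Name "SQL Server" |
    Format-List Name, RestartAction, RestartThreshold, RestartPeriod, RestartDelay

# Role-level policy: how many failures within FailoverPeriod (hours) are
# allowed before the role is left in a failed state instead of moving again.
Get-ClusterGroup -Name "SQL Server (MSSQLSERVER)" |
    Format-List Name, FailoverThreshold, FailoverPeriod, AutoFailbackType
```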

There's nothing in any of the logs (SQL, cluster, Windows event, etc.) that indicates why it kept trying to restart on the same node or why the second node took so long to take over the role.
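
If the cluster log needs to be regenerated so it covers the incident window, something like the following should work; the destination folder and 24-hour time span are only examples.

```powershell
# Sketch: dump the cluster log from every node for the last 24 hours (1440 minutes),
# in local time so entries line up with the SQL Server error log timestamps.
# The destination folder is an example path.
Get-ClusterLog -Destination "C:\Temp\ClusterLogs" -TimeSpan 1440 -UseLocalTime
```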


7 answers

  1. CathyJi-MSFT 22,401 Reputation points Microsoft External Staff
    2022-09-16T06:54:12.463+00:00

    Hi @Hamilton, Chuck,

    If you set the value of Period for restarts (mm:ss) to 5 minutes, how long does it take before the role fails over to the other node?

    Did you find any useful information from your cluster log?
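
    For reference, the "Period for restarts (mm:ss)" box in Failover Cluster Manager maps to the RestartPeriod property (in milliseconds) on the SQL Server cluster resource, and "Maximum restarts in the specified period" maps to RestartThreshold. A rough sketch of reading and changing them from PowerShell is below; the resource name "SQL Server" is a placeholder.

    ```powershell
    # Sketch: the Failover Cluster Manager "Policies" tab values live as
    # properties on the cluster resource. "SQL Server" is a placeholder name.
    $res = Get-ClusterResource -Name "SQL Server"

    # Current values; RestartPeriod and RestartDelay are in milliseconds.
    $res | Format-List RestartThreshold, RestartPeriod, RestartDelay

    # Set the restart period to 5 minutes (300000 ms); the resource can stay online.
    $res.RestartPeriod = 300000
    ```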


  2. Hamilton, Chuck 1 Reputation point
    2022-09-16T12:01:30.51+00:00

    There's info in the cluster log but it doesn't answer the question of why it took so long for the role to fail over. I can see it attempting to restart at least once on the original node but failing to come completely online. There's nothing in the log about it actually failing over, but I can see from the SQL error log that it started up on the passive node about 14 minutes after the initial failure.

    The only other thing in the cluster log is the Availability Group's role failing and also not coming back online. I don't think that should take down the production SQL instance, though. The AG is not set for automatic failover; it uses async commit and is used for DR replication to another data center. I'm pretty sure I've seen those AG roles go offline before, and all it did was break the replication until the role came back online.

    If I set the period to 5 minutes, I'll have to wait for another failure before I can tell whether it had any effect. This is a critical production system and I can't just cause a failure to test it.
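
    One way to get more out of the cluster log without waiting for another failure is to filter the already-generated <node>_cluster.log files down to the lines where the cluster records the SQL Server resource's state changes and restart decisions. A sketch, assuming the logs were dumped to a folder such as C:\Temp\ClusterLogs and that the resource is named "SQL Server":

    ```powershell
    # Sketch: pull the SQL Server resource's state transitions and restart/failover
    # decisions out of the generated cluster log files. The path, file pattern,
    # and resource name are assumptions -- adjust to match your environment.
    Select-String -Path "C:\Temp\ClusterLogs\*_cluster.log" `
        -Pattern 'SQL Server.*(failed|restart|OnlinePending|Offline|moving)' |
        Select-Object Filename, LineNumber, Line -First 100
    ```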

