Why would WSFC try to restart a failed SQL Server role on the same node?

Hamilton, Chuck 1 Reputation point
2022-09-14T14:27:43.587+00:00

We have a 2-node active/passive cluster hosting a single SQL Server role. The Windows version is 2012 R2. The SQL version is 2016 SP2 CU4.

SQL crashed hard on cluster node A without recording anything related to the crash in the error log. There was an empty minidump file from the time it crashed and a txt file that reported a "bugcheck" and a session with a non-yielding scheduler, but it didn't contain much else. No stack dump, nothing.

The cluster service however tried to restart the SQL role a couple of times on the same original node that it crashed on. Then after about 15 minutes it finally failed over to node B. Why would it take so long for node B to take over the role? I've worked with a lot of clusters and normally it would fail over to the passive node immediately.

There's nothing in any of the logs (SQL, cluster, Windows event, etc.) that indicates any reason why it was trying to restart on the same node or why the 2nd node took so long to take over the role.

Windows for business | Windows Server | Storage high availability | Clustering and high availability
SQL Server | Other

7 answers

  1. Sean Gallardy - MSFT 1,901 Reputation points Microsoft Employee
    2022-09-14T17:58:48.183+00:00

    > The cluster service however tried to restart the SQL role a couple of times on the same original node that it crashed on.

    Unless you've gone in and changed the cluster settings, this is the default behavior.

    > Then after about 15 minutes it finally failed over to node B. Why would it take so long for node B to take over the role?

    You'll have to look at the cluster log to determine exactly why, but it's most likely the timing based on your cluster configuration along with the set (or default, most likely) values for the resource and role.


  2. Hamilton, Chuck 1 Reputation point
    2022-09-14T19:02:33.977+00:00

    When did that become the default, and where do I change that behavior? If it fails on node A, I want it to fail over immediately to node B and never try to restart on A. This caused a 15 minute outage and is not the behavior I am used to. There's a setting in the properties of the role on the Failover tab that might be what you're talking about, but the description does not make clear whether it controls restarts, failovers, or both.

    241110-image.png


  3. Hamilton, Chuck 1 Reputation point
    2022-09-14T19:21:01.347+00:00

    I think I found what you are talking about on the properties of the SQL instance, Policies tab.
    241201-image.png

    According to the cluster logs it tried to restart the instance twice (not once) on the original node, and then did not fail over - ever. It stayed down for 15 minutes and then restarted on the original node. That's not my understanding of how it's supposed to work. If I'm reading it correctly, it should try to restart once and then fail over all resources in the role.


  4. CathyJi-MSFT 22,401 Reputation points Microsoft External Staff
    2022-09-15T07:17:24.443+00:00

    Hi @Hamilton, Chuck ,

    >If I'm reading it correctly it should try to restart once and then fail over all resources in the role.

    Period for restarts (mm:ss): Specifies the length of the period (minutes and seconds) during which the Cluster service counts the number of times that a resource has been restarted. You have it set to 15 minutes.

    241353-capture1.png

    > If it fails on node A, I want it to fail over immediately to node B and never try to restart on A. This caused a 15 minute outage and is not the behavior I am used to.

    If so, you can choose the "If resource fails, do not restart" option. Or you can reduce the period for restarts.


  5. Hamilton, Chuck 1 Reputation point
    2022-09-15T13:45:32.683+00:00

    So then, based on the image above, if SQL crashes, it will try to restart 1 time on the original node. If it fails to restart, it should "fail over all resources in this role"? And it should not wait for 15 minutes before failing it over, correct?

    If so it still leaves me wondering why it delayed for so long. We were down for nearly 15 minutes before it failed over. SQL crashed at 19:00:10. At 19:03:50 the cluster "failed to bring the role completely online". It then did nothing until 19:14:24 when it was started on the 2nd node. Why such a long delay?
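
The reading of the Policies tab described in the last post (restart once on the original node, then fail over all resources in the role) can be written down as a small decision model. This is only a toy sketch of that reading, not the Cluster service's actual logic: the class, field, and function names are hypothetical, and real clusters have further resource-level and role-level settings that are not modeled here. Note that the thread reports two restart attempts, which this simple model does not reproduce.

```python
from dataclasses import dataclass


@dataclass
class RestartPolicy:
    # Hypothetical field names that mirror the settings discussed in the thread.
    max_restarts: int = 1                   # "try to restart 1 time"
    period_seconds: float = 15 * 60         # "Period for restarts (mm:ss)" = 15:00
    fail_over_when_exhausted: bool = True   # "fail over all resources in this role"


def decide(policy, failure_times):
    """Return the action chosen for each failure time (seconds, ascending)."""
    actions = []
    restarts = []  # times of restarts that still count toward the limit
    for t in failure_times:
        # Restarts older than the counting period no longer count.
        restarts = [r for r in restarts if t - r < policy.period_seconds]
        if len(restarts) < policy.max_restarts:
            restarts.append(t)
            actions.append("restart on current node")
        elif policy.fail_over_when_exhausted:
            actions.append("fail over role")
        else:
            actions.append("leave failed")
    return actions


# Sample input only: a failure at t=0 and a second one 220 s later
# (the 3 min 40 s gap between 19:00:10 and 19:03:50 in the thread).
print(decide(RestartPolicy(), [0, 220]))
# ['restart on current node', 'fail over role']
```

Under this reading, a second failure inside the 15-minute period leads straight to a failover, with no built-in wait. That is the expectation stated in the question; the thread does not establish which setting produced the delay that was actually observed.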

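The gaps between the three timestamps quoted in the last post can be worked out as follows. This is plain arithmetic on the reported times, assuming all three fall on the same day and come from the same clock. The 15-minute value is the "Period for restarts" setting mentioned in the thread; comparing against it is for reference only and does not show that the setting caused the delay.

```python
from datetime import datetime, timedelta

FMT = "%H:%M:%S"

# Timestamps as reported in the thread.
crash = datetime.strptime("19:00:10", FMT)
online_failed = datetime.strptime("19:03:50", FMT)
started_on_b = datetime.strptime("19:14:24", FMT)

# "Period for restarts" value mentioned in the thread (15 minutes).
restart_period = timedelta(minutes=15)


def gaps(crash, online_failed, started_on_b):
    """Return the three intervals of interest as timedeltas."""
    return {
        "crash_to_online_failure": online_failed - crash,
        "idle_before_failover": started_on_b - online_failed,
        "total_outage": started_on_b - crash,
    }


intervals = gaps(crash, online_failed, started_on_b)
for name, delta in intervals.items():
    print(f"{name}: {delta}")
# crash_to_online_failure: 0:03:40
# idle_before_failover: 0:10:34
# total_outage: 0:14:14

print(intervals["total_outage"] < restart_period)  # True
```

The role was back on the second node 14 minutes 14 seconds after the crash, of which 10 minutes 34 seconds passed with no recorded cluster activity.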
