Why would WSFC try to restart a failed SQL Server role on the same node?

Question

Hamilton, Chuck 1

We have a 2-node active/passive cluster hosting a single SQL Server role. The Windows version is 2012 R2. The SQL Server version is 2016 SP2 CU4.

SQL Server crashed hard on cluster node A without recording anything related to the crash in the error log. There was an empty minidump file from the time of the crash and a .txt file that reported a "bugcheck" and a session with a non-yielding scheduler, but it didn't contain much else. No stack dump, nothing.

The cluster service however tried to restart the SQL role a couple of times on the same original node that it crashed on. Then after about 15 minutes it finally failed over to node B. Why would it take so long for node B to take over the role? I've worked with a lot of clusters and normally it would fail over to the passive node immediately.

There's nothing in any of the logs (SQL Server, cluster, Windows event, etc.) that indicates why it was trying to restart on the same node or why the 2nd node took so long to take over the role.

7 answers

Answer 1

> The cluster service however tried to restart the SQL role a couple of times on the same original node that it crashed on.

Unless you've gone in and changed the cluster settings, this is the default behavior.

> Then after about 15 minutes it finally failed over to node B. Why would it take so long for node B to take over the role?

You'll have to look at the cluster log to determine exactly why, but it's most likely the timing based on your cluster configuration along with the set (or default, most likely) values for the resource and role.

Answer 2

Hamilton, Chuck 1

When did that become the default and where do I change that behavior? If it fails on node A, I want it to fail over immediately to node B and never try to restart on A. This caused a 15 minute outage and is not the behavior I am used to. There's a setting in the properties of the role on the failover tab that might be what you're talking about but the description is not clear if it controls restarts, failovers, or both.

Answer 3

Hamilton, Chuck 1

I think I found what you are talking about on the properties of the SQL instance, Policies tab.

According to the cluster logs, it tried to restart the instance twice (not once) on the original node, and then did not fail over, ever. It stayed down for 15 minutes and then restarted on the original node. That's not my understanding of how it's supposed to work. If I'm reading it correctly, it should try to restart once and then fail over all resources in the role.

Answer 4

Hi @Hamilton, Chuck ,

>If I'm reading it correctly it should try to restart once and then fail over all resources in the role.

Period for restarts (mm:ss): Specifies the length of the period (minutes and seconds) during which the Cluster service counts the number of times that a resource has been restarted. You have it set to 15 minutes.

> If it fails on node A, I want it to fail over immediately to node B and never try to restart on A. This caused a 15 minute outage and is not the behavior I am used to.

If so, you can choose the "If resource fails, do not restart" option, or you can reduce the period for restarts.
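As a rough illustration of how a restart count over a period could drive the decision, here is a minimal Python sketch. It is a simplified model built only from the setting descriptions in this thread, not the Cluster service's actual logic. The function name `next_action`, its parameters, and the returned labels are all hypothetical, and it ignores other settings such as the role-level Failover tab.

```python
def next_action(restart_times, now, max_restarts=1, period_seconds=15 * 60,
                failover_when_exhausted=True):
    """Simplified, hypothetical model of a restart policy.

    restart_times: times (in seconds) of earlier restart attempts on this node.
    now: time (in seconds) of the current failure.
    """
    # Only restarts that fall inside the counting period are held against the resource.
    recent = [t for t in restart_times if now - t < period_seconds]
    if len(recent) < max_restarts:
        return "restart"
    # Restart budget for this period is used up.
    return "failover" if failover_when_exhausted else "wait"


# First failure: no restarts counted yet, so try a restart on the same node.
print(next_action([], now=0))
# Second failure 220 s later: one restart already counted within the 15-minute period.
print(next_action([0], now=220))
```

In this model, shortening `period_seconds` only changes how quickly earlier restarts stop counting. What the real cluster did, and why, still has to be confirmed from the cluster log.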

Answer 5

Hamilton, Chuck 1

So then, based on the image above, if SQL crashes, it will try to restart 1 time on the original node. If it fails to restart, it should "fail over all resources in this role"? And it should not wait for 15 minutes before failing over, correct?

If so it still leaves me wondering why it delayed for so long. We were down for nearly 15 minutes before it failed over. SQL crashed at 19:00:10. At 19:03:50 the cluster "failed to bring the role completely online". It then did nothing until 19:14:24 when it was started on the 2nd node. Why such a long delay?
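To put numbers on that timeline, here is a small Python sketch that computes the gaps between the three timestamps quoted above. It assumes all three events happened on the same day. The event labels are paraphrased from this post, and the variable names are only illustrative.

```python
from datetime import datetime

# Timestamps as reported in this post (assumed to be on the same day).
events = [
    ("SQL crashed", "19:00:10"),
    ("cluster failed to bring the role completely online", "19:03:50"),
    ("role started on the 2nd node", "19:14:24"),
]


def gaps_in_seconds(events):
    """Return the number of seconds between each consecutive pair of events."""
    times = [datetime.strptime(stamp, "%H:%M:%S") for _, stamp in events]
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(times, times[1:])]


step_gaps = gaps_in_seconds(events)
total = sum(step_gaps)

for (label, _), gap in zip(events[1:], step_gaps):
    print(f"{gap // 60} min {gap % 60} s until: {label}")
print(f"total outage: {total // 60} min {total % 60} s")
```

The unexplained part is the second gap, 10 min 34 s of apparent inactivity, out of a total outage of 14 min 14 s.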

Sean Gallardy - MSFT 1,901 Reputation points Microsoft Employee

2022-09-15T15:18:47.463+00:00

> Why such a long delay?

You'll have to look at the cluster log to find out.
