Why would WSFC try to restart a failed SQL Server role on the same node?

Hamilton, Chuck 1 Reputation point
2022-09-14T14:27:43.587+00:00

We have a 2-node active/passive cluster hosting a single SQL Server role. The Windows version is Server 2012 R2 and the SQL Server version is 2016 SP2 CU4.

SQL Server crashed hard on cluster node A without recording anything related to the crash in the error log. There was an empty minidump file from the time of the crash and a .txt file that reported a "bugcheck" and a session with a non-yielding scheduler, but it didn't contain much else. No stack dump, nothing.

The cluster service, however, tried to restart the SQL role a couple of times on the same node it crashed on, and only after about 15 minutes did it finally fail over to node B. Why would it take so long for node B to take over the role? I've worked with a lot of clusters, and normally the role fails over to the passive node immediately.
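
How many times WSFC retries a failed resource on the same node before moving the role is controlled by the resource's restart policy and the role's failover settings, so those values are worth checking. Below is a minimal sketch for reading them with the FailoverClusters PowerShell module; the resource name "SQL Server" and role name "SQL Server (MSSQLSERVER)" are placeholders for whatever your cluster actually uses.

```powershell
# Sketch: read the policies that control same-node restarts vs. failover.
# Resource and role names below are placeholders -- substitute your own.
Import-Module FailoverClusters

# Resource-level policy: restarts attempted on the current node within
# RestartPeriod (milliseconds) before the failure is escalated to the role.
Get-ClusterResource -Name "SQL Server" |
    Format-List Name, RestartAction, RestartThreshold, RestartPeriod, RestartDelay

# Role-level policy: how many failures within FailoverPeriod (hours) are
# allowed before the role is left in a failed state instead of moving again.
Get-ClusterGroup -Name "SQL Server (MSSQLSERVER)" |
    Format-List Name, FailoverThreshold, FailoverPeriod, AutoFailbackType
```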

There's nothing in any of the logs (SQL, cluster, Windows event, etc.) that indicates why it kept trying to restart on the same node or why the second node took so long to take over the role.
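
If the cluster log needs to be regenerated so it covers the incident window, something like the following should work; the destination folder and 24-hour time span are only examples.

```powershell
# Sketch: dump the cluster log from every node for the last 24 hours (1440 minutes),
# in local time so entries line up with the SQL Server error log timestamps.
# The destination folder is an example path.
Get-ClusterLog -Destination "C:\Temp\ClusterLogs" -TimeSpan 1440 -UseLocalTime
```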


7 answers

  1. CathyJi-MSFT 22,401 Reputation points Microsoft External Staff
    2022-09-16T06:54:12.463+00:00

    Hi @Hamilton, Chuck,

    If you set the value of Period for restarts (mm:ss) to 5 minutes, how long does it take before the role fails over to the other node?

    Did you find any useful information from your cluster log?
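
    For reference, the "Period for restarts (mm:ss)" box in Failover Cluster Manager maps to the RestartPeriod property (in milliseconds) on the SQL Server cluster resource, and "Maximum restarts in the specified period" maps to RestartThreshold. A rough sketch of reading and changing them from PowerShell is below; the resource name "SQL Server" is a placeholder.

    ```powershell
    # Sketch: the Failover Cluster Manager "Policies" tab values live as
    # properties on the cluster resource. "SQL Server" is a placeholder name.
    $res = Get-ClusterResource -Name "SQL Server"

    # Current values; RestartPeriod and RestartDelay are in milliseconds.
    $res | Format-List RestartThreshold, RestartPeriod, RestartDelay

    # Set the restart period to 5 minutes (300000 ms); the resource can stay online.
    $res.RestartPeriod = 300000
    ```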


  2. Hamilton, Chuck 1 Reputation point
    2022-09-16T12:01:30.51+00:00

    There's info in the cluster log but it doesn't answer the question of why it took so long for the role to fail over. I can see it attempting to restart at least once on the original node but failing to come completely online. There's nothing in the log about it actually failing over, but I can see from the SQL error log that it started up on the passive node about 14 minutes after the initial failure.

    The only other thing in the cluster log is the Availability Group's role failing and also not coming back online. I don't think that should take down the production SQL instance, though. The AG is not set for automatic failover; it uses async commit and is used for DR replication to another data center. I'm pretty sure I've seen those AG roles go offline before, and all it did was break the replication until the role came back online.

    If I set the period to 5 minutes, I'll have to wait for another failure before I can tell whether it had any effect. This is a critical production system and I can't just cause a failure to test it.
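
    One way to get more out of the cluster log without waiting for another failure is to filter the already-generated <node>_cluster.log files down to the lines where the cluster records the SQL Server resource's state changes and restart decisions. A sketch, assuming the logs were dumped to a folder such as C:\Temp\ClusterLogs and that the resource is named "SQL Server":

    ```powershell
    # Sketch: pull the SQL Server resource's state transitions and restart/failover
    # decisions out of the generated cluster log files. The path, file pattern,
    # and resource name are assumptions -- adjust to match your environment.
    Select-String -Path "C:\Temp\ClusterLogs\*_cluster.log" `
        -Pattern 'SQL Server.*(failed|restart|OnlinePending|Offline|moving)' |
        Select-Object Filename, LineNumber, Line -First 100
    ```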

