We have a 2-node active/passive cluster hosting a single SQL Server role. The Windows version is 2012 R2. The SQL version is 2016 SP2 CU4.
SQL crashed hard on cluster node A without recording anything related to the crash in the error log. There was an empty minidump file from the time of the crash and a txt file that reported a "bugcheck" and a session with a non-yielding scheduler, but it didn't contain much else. No stack dump, nothing.
The cluster service, however, tried to restart the SQL role a couple of times on the same node it crashed on. Then, after about 15 minutes, it finally failed over to node B. Why would it take so long for node B to take over the role? I've worked with a lot of clusters, and normally the role would fail over to the passive node immediately.
There's nothing in any of the logs (SQL, cluster, Windows event, etc.) that indicates any reason why it kept trying to restart on the same node or why the second node took so long to take over the role.
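For anyone wanting to dig further, the detailed cluster text log can be regenerated with PowerShell. This is only a sketch; the destination path and timespan are examples, not necessarily what I used:

    Import-Module FailoverClusters

    # Dump the cluster log from every node for the last 60 minutes,
    # in local time, to C:\Temp\<node>_cluster.log
    Get-ClusterLog -Destination "C:\Temp" -TimeSpan 60 -UseLocalTime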
Hi @Hamilton, Chuck,
If you set the value of Period for restarts (mm:ss) to 5 minutes, how long does it take before the role fails over to the other node?
Did you find any useful information in your cluster log?
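You can also check the current policy values without waiting for another failure. Something like the sketch below should show the restart policy on the SQL resource and the failover policy on its role; the resource and group names ("SQL Server" and "SQL Server (MSSQLSERVER)") are just the defaults, so substitute yours:

    Import-Module FailoverClusters

    # Resource-level policy: how many in-place restarts are attempted on the
    # same node (RestartThreshold) within RestartPeriod (milliseconds) before
    # the resource is treated as failed.
    Get-ClusterResource "SQL Server" | Format-List Name, RestartAction, RestartDelay, RestartPeriod, RestartThreshold, RetryPeriodOnFailure

    # Group-level policy: how many failures are allowed within FailoverPeriod
    # (hours) before the role is left in a failed state instead of moving.
    Get-ClusterGroup "SQL Server (MSSQLSERVER)" | Format-List Name, FailoverThreshold, FailoverPeriod, AutoFailbackType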
There's info in the cluster log, but it doesn't answer the question of why it took so long for the role to fail over. I can see it attempting to restart at least once on the original node but failing to come completely online. There's nothing in the log about it actually failing over, but I can see from the SQL error log that it started up on the passive node about 14 minutes after the initial failure.
The only other thing in the cluster log is the Availability Group role failing and not coming back online either. I don't think that should take down the production SQL instance, though. The AG is not set for automatic failover; it uses async commit and is only there for DR replication to another data center. I'm pretty sure I've seen those AG roles go offline before, and all it did was break the replication until it came back online.
If I set the period to 5 minutes, I'll have to wait for another failure before I can tell whether it had any effect. This is a critical production system, and I can't just cause a failure to test it.
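If we do end up changing it, my understanding is the value can be set directly on the resource without an outage, roughly like this (RestartPeriod is in milliseconds, and "SQL Server" is the default resource name, not necessarily ours):

    Import-Module FailoverClusters

    # "Period for restarts (mm:ss)" in the GUI maps to RestartPeriod (ms)
    $res = Get-ClusterResource "SQL Server"
    $res.RestartPeriod = 300000   # 5 minutes
    $res | Format-List Name, RestartPeriod, RestartThreshold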