How to minimize AG failovers

Question

How to minimize AG failovers

SQLRocker 126

We have SQL AG in async mode so automatic failovers don't happen, but sometimes there are issues on the primary (like high CPU) which cause a failover attempt (from PRIMARY_NORMAL to RESOLVING_NORMAL to PRIMARY_PENDING , before coming back to PRIMARY_NORMAL), it causes a blip of around a min or so.

Before we were having the above issues quite often so I made the following changes so that the failovers might be minimum:

Changed FAILURE_CONDITION_LEVEL to 1.
Changed HEALTH_CHECK_TIMEOUT to 90000 (90secs)
Changed 'LeaseTimeout' to 100000 (from cluadmin, i think this is the max number it could take)

Above worked fine for a couple of yrs, but today we had a another blip of around a min, a CPU spike was seen and AG went to RESOLVING_NORMAL - PRIMARY_PENDING state and took around min to come back online.

I was wondering if anyone else had had to do the above things to minimize the failovers, I know there are underlying issues on the primary which should be looked into, but for now I am looking to see if there are any other steps to be done to minimize failover attempts by AG.

Extra info: sql error log will have 'connection timeout' error & cluster log will have 'missed two-fifth consecutive heartbeats' msgs.

Thanks.

Vahid Ghafarpour 23,385 Reputation points Volunteer Moderator

2023-06-05T23:56:49.3566667+00:00

Minimizing failover attempts should not be a long-term solution. It is essential to address the underlying issues causing the failovers to ensure the stability and optimal performance of your SQL AG environment.

2 answers

Your answer

Vahid Ghafarpour 23,385 Reputation points Volunteer Moderator

2023-06-05T23:56:49.3566667+00:00

Minimizing failover attempts should not be a long-term solution. It is essential to address the underlying issues causing the failovers to ensure the stability and optimal performance of your SQL AG environment.

Answer 1

Investigate the underlying performance issues: As you mentioned, there are underlying issues on the primary that need to be addressed. Identify the cause of the high CPU spikes and take appropriate measures to optimize the performance of your primary server. This could involve analyzing query performance, indexing, resource contention, or any other factors contributing to the CPU spikes.

Adjust the FAILURE_CONDITION_LEVEL setting: You have already set FAILURE_CONDITION_LEVEL to 1, which means that any issue affecting the availability of the primary replica triggers a failover. Consider reviewing this setting and evaluate if setting it to a higher value (e.g., 2) would be more appropriate for your scenario. A higher value allows for more tolerance towards transient issues, reducing unnecessary failover attempts.

Review AG health checks and timeouts: Validate that the HEALTH_CHECK_TIMEOUT value of 90 seconds is suitable for your environment. Depending on the size and complexity of your AG, you may need to adjust this value. Consider tuning the timeout to a level that allows enough time for health checks to complete without unnecessarily triggering failovers.

Monitor network connectivity and heartbeat settings: Ensure that network connectivity between your AG replicas is stable and reliable. Frequent network interruptions or issues can trigger failover attempts. Also, review the heartbeat settings within the AG configuration. Adjust the missed heartbeat thresholds if needed, considering the network stability and latency between the replicas.

Analyze cluster logs and SQL Server error logs: Continuously monitor cluster logs and SQL Server error logs to gather more detailed information about the failover events. Analyzing these logs can provide insights into the root causes of the failovers and help identify any patterns or recurring issues that need to be addressed.

Answer 2

FAILURE_CONDITION_LEVEL : 1 = least failovers, 5 = most , so you are incorrect with you saying making it 2.
"Adjust the missed heartbeat thresholds if needed" -> what exactly do you mean, what are the exact steps to increase this threshold?

Normally I won't look at changing these default values at all. They just work.

But we faced some frequent issues before, lot of t/s was done and I had to make the changes, which seem to be working alright. but yeah , definitely not the first things to do (to anyone who is reading), troubleshoot the underlying issues and these should be almost the last things to look into if nothing else works.

Pretty much HEALTH_CHECK_TIMEOUT is the only one out of the 3 which I think that can be increased in my case (I already have it on 90 from default 30sec), rest 2 are already at their max wrt to minimizing failovers (FAILURE_CONDITION_LEVEL & LeaseTimeout).

My goal with this post is to learn if there are any other parameters that can be changed for minimizing failovers except for the 3 that I mention, if anyone knows of any parameters , please let me know, thanks.

Share via

How to minimize AG failovers

2 answers

Your answer