Investigate the underlying performance issues: As you mentioned, there are underlying issues on the primary that need to be addressed. Identify the cause of the high CPU spikes and take appropriate measures to optimize the performance of your primary server. This could involve analyzing query performance, indexing, resource contention, or any other factors contributing to the CPU spikes.
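Review the FAILURE_CONDITION_LEVEL setting: You have already set FAILURE_CONDITION_LEVEL to 1, which is the least restrictive level: automatic failover is initiated only when the SQL Server service is down or the availability group's lease with the WSFC cluster expires. Higher levels add failure conditions rather than tolerance (level 2 adds an unresponsive instance that exceeds HEALTH_CHECK_TIMEOUT, and the default, level 3, adds critical internal server errors), so raising the value would make failovers more likely, not less. If failovers still occur at level 1, a likely trigger is lease expiry, which can happen when sustained CPU pressure on the primary delays lease renewal.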
Review AG health checks and timeouts: Validate that the HEALTH_CHECK_TIMEOUT value of 90 seconds is suitable for your environment. The setting is specified in milliseconds (90000 for 90 seconds; the default is 30000) and controls how long the cluster waits for sp_server_diagnostics to return health information before treating the instance as slow or hung. Tune it so that health checks have time to complete under load without masking a genuinely unresponsive instance.
Monitor network connectivity and heartbeat settings: Ensure that network connectivity between your AG replicas is stable and reliable, since frequent network interruptions can trigger failover attempts. Also review the heartbeat settings. The missed-heartbeat thresholds (SameSubnetThreshold and CrossSubnetThreshold, with their matching delay values) are configured on the WSFC cluster rather than in the AG itself; the AG-level counterpart is each replica's SESSION_TIMEOUT. Adjust these if needed, considering the network stability and latency between the replicas.
Analyze cluster logs and SQL Server error logs: Continuously monitor cluster logs and SQL Server error logs to gather more detailed information about the failover events. Analyzing these logs can provide insights into the root causes of the failovers and help identify any patterns or recurring issues that need to be addressed.
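The checks above can be sketched in T-SQL. This is a minimal sketch under stated assumptions, not a definitive procedure: the availability group name [MyAG] is a hypothetical placeholder, the ALTER statements are assumed to be run on the primary replica, and the values shown simply mirror the ones already mentioned (level 1 and 90 seconds).

```sql
-- 1. List the cached statements with the highest cumulative CPU time.
SELECT TOP (10)
       qs.total_worker_time / 1000 AS total_cpu_ms,  -- total_worker_time is in microseconds
       qs.execution_count,
       st.text AS batch_text                         -- text of the batch containing the statement
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_worker_time DESC;

-- 2. Inspect the current failover policy of each availability group.
SELECT name,
       failure_condition_level,
       health_check_timeout                          -- milliseconds
FROM sys.availability_groups;

-- 3. Inspect per-replica session timeouts (seconds) and failover modes.
SELECT replica_server_name,
       session_timeout,
       availability_mode_desc,
       failover_mode_desc
FROM sys.availability_replicas;

-- 4. Apply the settings discussed above. [MyAG] is a hypothetical name.
ALTER AVAILABILITY GROUP [MyAG] SET (FAILURE_CONDITION_LEVEL = 1);
ALTER AVAILABILITY GROUP [MyAG] SET (HEALTH_CHECK_TIMEOUT = 90000);  -- 90 seconds

-- 5. Search the current SQL Server error log for lease-related messages.
EXEC sys.sp_readerrorlog 0, 1, N'lease';
```

Running steps 2 and 3 before and after step 4 confirms that the change took effect. The search string in step 5 is only one possible filter; other terms such as the availability group name may be more useful depending on what the logs contain.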