Application lost connection to DB on 2017 Always On cluster
We have 3-node SQL 2017 Always On cluster with 2 x HA and 1 x DR. We experienced an application outage reported around 1:00 AM possibly because the listener was not available. When we troubleshoot at around 7:00AM, the symptom is that we were not able to connect to the affected SQL instance using SSMS on primary node #1. However, we were able to connect to the instance on secondary node #2.
On node #2, we can see that one of the application databases was in synchronizing state, whereas the other databases used by this app were synchronized. I don't know if it matters, but each of the app databases is dedicated to a different client location.
We resolve the application outage by restarting SQL service on the secondary HA node #2. Then the app could connect and we could open the instance in SSMS on the primary.
In the AlwaysON_health extended event on node #1, we see some event records at 11:00 PM and the detail says:
Event: hadr_db_partner_set_sync_state
commit_policy: WaitForHarden
commit_policy_target: DoNothing
Sync_state: NOT
Then 20 minutes later:'
Event: hadr_db_partner_set_sync_state
commit_policy: WaitForHarden
commit_policy_target: WaitForHarden
Sync_state: LOG
Can anyone enlighten me on what's going on in this scenario and the root cause? Why would we not be able to open the instance in SSMS on node one when only one of the app DBs was not synchronizing? How to monitor and alert if this happens in the future?