I ran into an issue with our newly set up AOAG. About a week after setting it up, on 04/24 at 2:16 AM, the AOAG went into a hung state for a few minutes.
Our most important database is "Main_DB". Within 2-3 minutes, all the databases came back online (synchronized state) except "Main_DB".
We ended up removing this database from the AOAG to bring it back online, and later added it back.
The sequence of configuration changes and events is as follows.
· Original setup: a 2-node SQL FCI cluster (active/passive) without AOAG. It ran fine for 4 years.
· On 16 Apr we added 2 more nodes and set up the AOAG across 3 nodes: node1 (old node) as primary, node3 (new 3rd node), and node4 (new 4th node). The databases synced up. We removed the old node2 (passive node). The AOAG settings LeaseTimeout, HealthCheckTimeout, and FailureConditionLevel were left at their defaults.
· On 22 Apr we removed node1 from the AOAG, leaving a 2-node AOAG with node3 as primary.
· So we now have node3 as the primary replica and node4 as the secondary replica. The old SQL instance role in the cluster on node1 was offline, and its SQL Server resource in the cluster was offline as well.
· The AOAG role was online and working fine until 04/24.
· All the databases were online and working fine until 04/24.
· On 24 Apr around 2:26 AM the AOAG hung. All the databases came back online except Main_DB (our important, highly transactional database). We ended up removing this database from the AOAG (around 2:46 AM) to bring it back online.
· On 04/27 we added Main_DB back to the AOAG after changing LeaseTimeout to 90000, HealthCheckTimeout to 120000, and FailureConditionLevel to 1, and we removed the old SQL Server role that was offline.
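For reference, here is a minimal sketch of how the two T-SQL-settable values from the 04/27 change can be applied and checked. The availability group name [AG_Name] is a placeholder, not our real AG name. The lease timeout is a cluster resource property rather than an ALTER AVAILABILITY GROUP option, so it is not shown here.

```sql
-- Run on the primary replica. [AG_Name] is a placeholder name (assumption).

-- Health check timeout, in milliseconds.
ALTER AVAILABILITY GROUP [AG_Name] SET (HEALTH_CHECK_TIMEOUT = 120000);

-- Failure condition level 1: automatic failover is triggered only when the
-- SQL Server service is down or the lease expires.
ALTER AVAILABILITY GROUP [AG_Name] SET (FAILURE_CONDITION_LEVEL = 1);

-- Confirm the values currently in effect.
SELECT name, health_check_timeout, failure_condition_level
FROM sys.availability_groups;
```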
It seems to be working fine now; however, I was not able to work out what happened on 04/24 at 2:16 AM. I'm attaching the SQL dump and SQL error log for those days, with unrelated data removed. I'm a bit anxious that what happened on 4/24 may happen again.
Any help is appreciated!