Case of the unfailback-able DAG
Good afternoon all, hoping you are having a good weekend! I wish I could say I had a bunch of sleep last night but I can’t. One of my customers reached out to me in frustration at 1 am. They were in the middle of doing some initial Disaster Recovery testing for their new 2010 Exchange environment. For some background info this is a very large company with a mix of 2003, 2007 & 2010. They just added the 24 Exchange 2010 servers into the mix. They have a small pilot set of users on the 2010 side (around 250 users of their 100k +) The servers are split between two primary datacenters. Half of the 2010 servers in each. There are two Database Availability Groups. Each with 12 servers a piece, 6 per datacenter.
Now in this test they were using Datacenter Activation Coordination Mode (DAC) for both DAGs. In their test window they were able to successfully fail over from the primary datacenter both DAGs. They were able to test all parts from the secondary site (Mail flow , OWA access, ActiveSync, etc..) Now when their phase of the test needed them to restore messaging services back to the primary datacenter is where they started to experience issues. One of the two DAGs worked fine when they rejoined the nodes from the primary site. Now the other DAG had some issues when failing back. They were properly trying to execute the Start-DatabaseAvailabilityGroup cmdlet with the –ActiveDirectorySite parameter pointing to their primary AD site. This was failing stating that one or more of the nodes were already in the cluster! They confirmed this using the Get-DatabaseAvailabilityGroup cmdlet and looking for the “StartedMailboxServers” and StoppedMailboxServers” attribute of the DAG. Now remember these attributes are NOT indicative of the actual state of the server. The servers could be completely turned OFF but still appear in the “StartedMailboxServers” list. These attributes are what is used to figure out quorum as well as making database mounting decisions when DAC mode is enabled. So one of the servers that should have been evicted and appearing in the “StoppedMailboxServers” list was still in the started servers list.
This got my client wondering why. They also noted that the Cluster Services were disabled and stopped on the evicted nodes. Now they errantly set the service to automatic and tried to start the service. This should not be done since if the Start-DatabaseAvailabilityGroup cmdlet works successfully it will do this for you. Let the PowerShell commandlets do their job. When they set the service to automatic again, The evicted nodes were randomly then showing in the “StartedMailboxServers” list, even though the service wasn’t even running. This was merely adding to the confusion. So we set the affected 6 modes back to disabled. This again showed the correct started and stopped servers list once more. Now to figure out why it was failing we looked at the Failover Cluster Manager administrative console on each node to verify if the nodes that should still be listed in the cluster, in fact still were. We found on the one node that it still was listing itself as a down member of the cluster. All other nodes did in fact show just the proper nodes from the secondary datacenter. Now Exchange needs to match what the cluster thinks for it’s state and who’s in and who’s out. Since this wasn’t coalescing the Start-DatabaseAvailabilityGroup cmdlet was failing. Now being a member of the premier field engineering group I have access to internal knowledge bases and cases. What I did next likely shouldn’t be done without guidance from Microsoft Premier Support Services or PFE’s. Normally when managing DAG’s and their membership the EMS and EMC should always be used. Making changes in the FCM console is not recommended for most cases. Since the one server and only the one server was incorrectly reporting cluster membership we used the FCM to manually evict the node to align what exchange thought of membership to match the FCM. Once we did this and re-ran the Start-DatabaseAvailabilityGroup cmdlet it re-added the previously evicted nodes (including the troublesome one) back into the DAG. Not only did the cmdlet complete successfully, the FCM console now showed all 12 servers as members and being UP in status.
Finally to ensure that the DAG was fully functional they queried for all database copies and reported on each’s replication status and database stated. All showed mounted or healthy! At this point they were to run the Move-ActiveMailboxDatabase cmdlet to shift the active Database copies back to their primary datacenter! They also could have used the RedistributeActiveDatabases.PS1 script included with Exchange 2010 outlined at the end of this TechNet article on Managing Mailbox Database copies