How to Troubleshoot Cluster Continuous Replication Issues
Microsoft Exchange Server 2007 will reach end of support on April 11, 2017. To stay supported, you will need to upgrade. For more information, see Resources to help you upgrade your Office 2007 servers and clients.
Applies to: Exchange Server 2007, Exchange Server 2007 SP1, Exchange Server 2007 SP2, Exchange Server 2007 SP3
This topic discusses troubleshooting issues related to cluster continuous replication (CCR). For more information about tools that may assist in troubleshooting CCR issues, see Tools for Troubleshooting Issues with High Availability Deployments.
The procedures in this topic address the following issues in a CCR environment:
Get-StorageGroupCopyStatus reports that the database is “Failed” and is not seeded.
Get-StorageGroupCopyStatus reports that the database is “Failed”. The FailedMessage value indicates the storage group copy has diverged.
Get-StorageGroupCopyStatus reports that the database is “Failed”. The FailedMessage value provides specific information about the source of the failure.
Alerts, performance counters, or Get-StorageGroupCopyStatus indicate that copy or replay queues are backed up for a storage group copy.
Get-StorageGroupCopyStatus reports a stale time for LastInspectedLogTime.
Failover or Move-ClusteredMailboxServer succeeds, but databases do not mount.
Failover succeeds, but some databases do not automatically or manually mount. Alternatively, Get-ClusteredMailboxServerStatus reports one or more failed databases.
A database fails to mount at startup in a CCR environment.
MSExchangeRepl event 2073 is logged alerting that the Microsoft Exchange Replication Service is unable to find a directory.
Move-ClusteredMailboxServer does not initiate a scheduled outage due to a replication issue.
Replication does not resynchronize after a failover on one or more storage groups.
Seeding is failing.
When failures other than those listed here occur, review the event log on both nodes to determine the cause and the recovery actions that must be taken. After you have identified the time at which the failure occurred, other event logs may help you better understand the problem. If this information is insufficient, use the time of the failure to narrow the window that you review in cluster.log. The cluster log provides trace-level information about actions taken by the cluster management system.
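As a starting point for the event log review, the following Exchange Management Shell sketch lists replication events from the Application log on the local node, limited to a window around the failure. The failure time and the 30-minute window are placeholder assumptions. Run it on each node in turn.

```powershell
# Assumed time of the failure; replace with the time you identified.
$failure = Get-Date "2009-01-15 14:30"
$window  = New-TimeSpan -Minutes 30

# Read recent Application log entries written by the replication service
# and keep those within 30 minutes before or after the failure.
Get-EventLog -LogName Application -Newest 2000 |
    Where-Object {
        $_.Source -eq "MSExchangeRepl" -and
        $_.TimeGenerated -ge ($failure - $window) -and
        $_.TimeGenerated -le ($failure + $window)
    } |
    Format-List TimeGenerated, EntryType, EventID, Message
```

The same window can then be used to locate the relevant section of cluster.log.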
Before You Begin
To perform this procedure, the account you use must be delegated the Exchange Server Administrator role and local Administrators group for the target server. For more information about permissions, delegating roles, and the rights that are required to administer Microsoft Exchange Server 2007, see Permission Considerations.
Procedure
Get-StorageGroupCopyStatus reports that the database is “Failed” and is not seeded
Possible Causes A configuration problem, or the replication copy does not have a valid baseline database. This issue could be caused by not seeding the storage group copy when the passive node was added.
Resolution
Verify that storage for the copy is properly configured and operational. If you find and correct an error, you can trigger a new check of the copy by suspending and resuming the storage group copy.
Verify that the storage group and database paths are correctly configured relative to the storage on the passive server. You can do this by using the Get-StorageGroup cmdlet in the Exchange Management Shell.
Use the Update-StorageGroupCopy cmdlet to seed the storage group copy.
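The path check and the suspend and resume cycle described above might look like the following in the Exchange Management Shell. This is a minimal sketch: "CMS1\SG1" is a placeholder identity (clustered mailbox server name and storage group name), and the example assumes the storage group holds a mailbox database.

```powershell
# Show the configured log and system paths for the storage group,
# and the database file path, so they can be compared with the
# storage that is actually present on the passive node.
Get-StorageGroup -Identity "CMS1\SG1" | Format-List Name, LogFolderPath, SystemFolderPath
Get-MailboxDatabase -StorageGroup "CMS1\SG1" | Format-List Name, EdbFilePath

# After correcting a storage problem, suspend and resume the copy
# to trigger a new check, then review the resulting status.
Suspend-StorageGroupCopy -Identity "CMS1\SG1" -SuspendComment "Rechecking storage"
Resume-StorageGroupCopy -Identity "CMS1\SG1"
Get-StorageGroupCopyStatus -Identity "CMS1\SG1" | Format-List Name, SummaryCopyStatus, FailedMessage
```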
Get-StorageGroupCopyStatus reports that the database is "Failed", and the FailedMessage value indicates the storage group copy has diverged
Possible Causes This condition occurs after a failover in which enough logs were lost that the database on the previously active server cannot be resynchronized with the current active database without a full reseed. This situation cannot occur in LCR.
Resolution Use the Update-StorageGroupCopy cmdlet to seed the storage group copy.
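A reseed of a diverged copy could be sketched as follows. The identity "CMS1\SG1" is a placeholder, and the example assumes the commands are run on the passive node and that the existing passive files can be discarded.

```powershell
# Suspend replication for the storage group copy before reseeding.
Suspend-StorageGroupCopy -Identity "CMS1\SG1" -SuspendComment "Reseeding diverged copy"

# Reseed the passive copy. -DeleteExistingFiles removes the existing
# passive database and log files before the new seed is taken.
Update-StorageGroupCopy -Identity "CMS1\SG1" -DeleteExistingFiles

# Confirm the result. If the copy is still reported as suspended,
# resume it with Resume-StorageGroupCopy.
Get-StorageGroupCopyStatus -Identity "CMS1\SG1" | Format-List Name, SummaryCopyStatus, FailedMessage
```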
Get-StorageGroupCopyStatus reports that the database is “Failed” and the FailedMessage value provides specific information about the source of the failure
Possible Causes Many potential causes could result in a storage group copy being determined as failed. The previous cases, not being seeded and diverged, are two examples. The FailedMessage value specifically identifies the detected problem.
Resolution Run the Get-StorageGroupCopyStatus cmdlet to obtain the complete FailedMessage value, which identifies the specific problem that was detected. Analyze the information provided by the FailedMessage value, and resolve the reported condition. If the reported condition is a corrupted or missing log, try to find a non-corrupted log with the correct generation number. If the correct log cannot be found, use the Update-StorageGroupCopy cmdlet to reseed. If the message implies the logs on the source are not available, remove the share on the source’s log directory and restart the replication service on that node.
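For example, the following sketch lists the full FailedMessage for every failed copy on a clustered mailbox server, and shows how the replication service could be restarted after the share has been removed. "CMS1" is a placeholder server name.

```powershell
# List only the failed copies, with the complete failure text.
Get-StorageGroupCopyStatus -Server "CMS1" |
    Where-Object { $_.SummaryCopyStatus -eq "Failed" } |
    Format-List Name, SummaryCopyStatus, FailedMessage

# On the source node, after removing the share on the log directory,
# restart the Microsoft Exchange Replication Service.
Restart-Service MSExchangeRepl
```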
Alerts, performance counters, or Get-StorageGroupCopyStatus indicate that copy or replay queues are backed up for a storage group copy
Possible Causes A backlog of log copying or replay could indicate either a problem or a transitional situation in a recovery process. A transitional situation occurs when a previously offline passive node is brought online, or a storage group copy is recently resumed after it has been suspended for a significant period. Stopping the Microsoft Exchange Replication Service on the passive node has a similar effect to suspending all storage group copies on the node. If the situation isn't transitional, it could be caused by one of the following:
Configuration issue.
Suspended storage group copy.
Microsoft Exchange Replication Service is stopped.
Storage has failed or is offline.
Passive node is offline.
Resolution Determine whether there is an actual problem or a transitional situation:
Determine if the Microsoft Exchange Replication Service is running on both nodes. You can do this by using the Services snap-in. If the service is stopped on either node, then you must start it.
Run the Get-StorageGroupCopyStatus cmdlet in the Exchange Management Shell, pipe the output to fl (Format-List), and determine whether the passive copy is suspended. If it is suspended, verify that the files of the passive copy are present, and then resume the storage group copy by using the Resume-StorageGroupCopy cmdlet.
Run the Get-StorageGroupCopyStatus cmdlet, piping the output to fl, and determine whether the copy is “Healthy”. If the copy is “Failed”, review the list of status fields to determine the corrective action that is necessary.
Watch the replication performance counters over a period of several minutes to determine whether progress is being made. Specifically, look at the replay generation number and the inspection generation number. If the copy queue length keeps increasing, but the replay queue length is short or decreasing, there may be an issue with the network file share on the active server or with the active server itself. Verify that the active storage group copy’s log directory has a network file share defined on it by using the "net share" command, by using Windows Explorer, or by using the Computer Management snap-in. The share name is based on the GUID of the storage group, which you can determine by running the Get-StorageGroup cmdlet in the Exchange Management Shell and piping the output to fl.
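The checks above can be sketched in the Exchange Management Shell as follows. The identity "CMS1\SG1", the five samples, and the 60-second interval are illustrative assumptions, and sampling the status cmdlet is used here as a simple stand-in for watching the performance counters.

```powershell
# On each node: confirm that the replication service is running.
Get-Service MSExchangeRepl

# Sample the queue lengths once a minute for five minutes to see
# whether the backlog is shrinking (transitional) or growing.
1..5 | ForEach-Object {
    Get-StorageGroupCopyStatus -Identity "CMS1\SG1" |
        Format-List Name, SummaryCopyStatus, CopyQueueLength, ReplayQueueLength, LastInspectedLogTime
    Start-Sleep -Seconds 60
}

# On the active node: list the file shares and the storage group GUID
# so that the log directory share can be matched to the storage group.
net share
Get-StorageGroup -Identity "CMS1\SG1" | Format-List Name, Guid
```

A copy queue that shrinks between samples suggests a transitional backlog. A copy queue that keeps growing while the replay queue stays short points toward the share or the active server.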
Get-StorageGroupCopyStatus reports an old time for LastInspectedLogTime
Possible Causes There are three possible causes of this symptom:
The active storage group copy’s database is dismounted.
The active storage group copy is mounted, but it is not changing at a significant rate. Therefore, logs are not being produced by the active storage group copy.
The Microsoft Exchange Replication Service is not running on the passive node.
Resolution Determine which of the three causes is occurring by doing the following:
Determine whether the database is dismounted by using the Exchange Management Console or by running the Get-MailboxDatabase cmdlet with the Status parameter in the Exchange Management Shell. If it is dismounted, mount it. The LastInspectedLogTime value will not change until the database is mounted and changes to the database (for example, activity within the database) generate new logs.
Verify that the Microsoft Exchange Replication Service is running on the passive node. If the service is stopped, you must start it.
After verifying that the database is mounted, check to see whether the database is generating logs. Look in the active database’s log directory and identify the log file with the highest generation number. Check the timestamp on that log; it should match the value in LastInspectedLogTime.
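One way to compare the newest log file with LastInspectedLogTime is sketched below. It assumes the commands are run on the active node, that "CMS1\SG1" is replaced with a real identity, and that closed log files are named with the storage group's log prefix followed by eight hexadecimal digits, so that sorting by name orders them by generation.

```powershell
# Confirm that the databases on the clustered mailbox server are mounted.
Get-MailboxDatabase -Server "CMS1" -Status | Format-Table Name, Mounted

# Find the closed log file with the highest generation number.
$sg = Get-StorageGroup -Identity "CMS1\SG1"
$pattern = "^" + $sg.LogFilePrefix + "[0-9A-Fa-f]{8}\.log$"
Get-ChildItem -Path "$($sg.LogFolderPath)" |
    Where-Object { $_.Name -match $pattern } |
    Sort-Object Name |
    Select-Object -Last 1 |
    Format-List Name, LastWriteTime

# Compare that timestamp with the value reported for the copy.
Get-StorageGroupCopyStatus -Identity "CMS1\SG1" | Format-List Name, LastInspectedLogTime
```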
Failover or Move-ClusteredMailboxServer succeeds, but databases do not mount
Possible Causes The typical cause of this problem is that the Cluster service account does not have the authority required to mount the database. Alternatively, failover resulted in more lost logs than permitted by the automatic mounting configuration settings. The other typical cause in a failover case is that the passive copies were not healthy at the time of the failure.
Resolution Permission issues with the Cluster service account typically occur during setup. If the databases do not mount at the end of setup, it usually indicates that the Cluster service account has not been granted the appropriate permissions. To resolve this, grant the appropriate permissions to the Cluster service account and then perform an orderly shutdown and restart of the entire cluster. You can do this by (1) taking the clustered mailbox server offline; (2) shutting down the passive node; (3) shutting down the active node; (4) starting the active node; (5) starting the passive node; and (6) bringing the clustered mailbox server online.
Review the event log to determine whether the failover lost more logs than permitted by the automatic mounting configuration settings. After you have determined the status of the storage group copy’s database, you can explicitly mount it by running the Restore-StorageGroupCopy cmdlet in the Exchange Management Shell. Finally, run the Get-StorageGroupCopyStatus cmdlet and look at the SummaryCopyStatus value to identify whether there are issues with the previously active copy that prevent it from mounting. If there are any issues, review the event log to identify the cause and then take steps to resolve it.
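The cmdlet portions of both resolutions could be sketched as follows. "CMS1", "SG1", and "DB1" are placeholder names, the stop reason is illustrative, and the node shutdown and startup steps happen outside the shell.

```powershell
# Orderly restart, steps (1) and (6): take the clustered mailbox
# server offline, then bring it back online after both nodes have
# been shut down and restarted in the order described above.
Stop-ClusteredMailboxServer -Identity "CMS1" -StopReason "Cluster service account permission change"
Start-ClusteredMailboxServer -Identity "CMS1"

# After a failover that lost more logs than automatic mounting allows:
# make the copy mountable, mount the database, and check the copy status.
Restore-StorageGroupCopy -Identity "CMS1\SG1"
Mount-Database -Identity "CMS1\SG1\DB1"
Get-StorageGroupCopyStatus -Identity "CMS1\SG1" | Format-List Name, SummaryCopyStatus, FailedMessage
```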
Failover succeeds, but some databases do not automatically or manually mount. Alternatively, Get-ClusteredMailboxServerStatus reports one or more failed databases
Possible Causes A recent failover resulted in more lost logs than permitted by the automatic mounting configuration settings. The other typical cause in a failover case is that the passive copy was not healthy at the time of the failure.
Note
Databases may be briefly marked failed or offline during a scheduled or unscheduled outage. This state is transitional, and it occurs while the replication service is trying to make a final copy of any available logs.
Resolution Review the event log to determine why the database failed to mount. The database may fail to mount because of corruption in the logs or database files. If the events indicate corruption, restore access to the database by moving the clustered mailbox server to the other node. After you have determined the status of the storage group copy’s database, you can explicitly mount it by running the Restore-StorageGroupCopy cmdlet in the Exchange Management Shell. Next, run the Get-StorageGroupCopyStatus cmdlet and look at the SummaryCopyStatus value to identify whether there are issues with the previously active copy that prevent it from mounting. If the status shows that the storage group copy is too old to activate, the database can be restored when the failed node returns to service and more logs are available. The logs are copied automatically, and no action is required from you.
A database fails to mount at startup in a CCR environment
Possible Causes The database’s failure to mount could be the result of an explicit administrator action. If a database is explicitly dismounted, and then the clustered mailbox server is taken offline, the database will not be brought online at the next startup. Another possible cause could be that more than the acceptable number of logs was lost during a failover.
Resolution You can run the Get-ClusteredMailboxServerStatus cmdlet in the Exchange Management Shell to verify that the store is operational on the node. Use the Exchange Management Console or the Exchange Management Shell to attempt a mount operation of the affected database copy. For more information about mounting the database copy, see How to Mount a Database in a CCR Environment. Review the event log after the mount operation to determine if any errors were reported.
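A minimal sketch of this sequence follows, assuming a clustered mailbox server named "CMS1" and a mailbox database "CMS1\SG1\DB1" (both placeholders).

```powershell
# Verify the state of the clustered mailbox server and its resources.
Get-ClusteredMailboxServerStatus -Identity "CMS1"

# List the databases and identify any that are not mounted.
Get-MailboxDatabase -Server "CMS1" -Status | Format-Table Name, Mounted

# Attempt to mount the affected database, then review the event log.
Mount-Database -Identity "CMS1\SG1\DB1"
```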
MSExchangeRepl event 2073 is logged alerting that the Microsoft Exchange Replication Service is unable to find a specified directory
Possible Causes The Error event indicates that the Microsoft Exchange Replication Service could not create the directory that is specified by the event. The Microsoft Exchange Replication Service tries to create several required directories if they do not already exist. These include directory paths for source log files, destination log files, destination system files, and the path for the log file inspector.
The Microsoft Exchange Replication Service may be unable to create the specified directory because of a permission issue, a hardware failure, or a configuration failure.
Resolution Examine the error code returned by the event. Verify that the directory location is available, and that it can be accessed. Check the file system permissions. Make sure that the storage is configured correctly, and that the hardware is operating correctly.
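The availability and permission checks can be done from the shell on the node that logged the event. In this sketch the directory is a hypothetical example; substitute the path named in event 2073.

```powershell
# Hypothetical path; use the directory reported by the event.
$dir = "D:\ExchangeLogs\SG1"

# Returns True if the directory exists and can be reached.
Test-Path -Path $dir

# Show the owner and the access control entries on the directory.
Get-Acl -Path $dir | Format-List Owner, AccessToString
```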
Move-ClusteredMailboxServer does not initiate a scheduled outage due to a replication issue
Possible Causes The Exchange Management Shell Move-ClusteredMailboxServer cmdlet includes validation checks to prevent a scheduled outage to a passive node if replication is not completely healthy on all storage group copies. This behavior makes sure that scheduled outages are not extended for an inappropriate length of time.
Resolution Identify the specific storage groups with the problem, and correct any issue. The error message from the Move-ClusteredMailboxServer cmdlet identifies the problematic storage group copy. If you want to perform the move and bypass the validation check, make sure that only the failed storage group copy’s database is dismounted, and then retry the move operation with the IgnoreDismounted parameter. The IgnoreDismounted parameter indicates that dismounted storage groups are to be ignored for the purposes of replication health checks.
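The move with the validation check bypassed might look like the following sketch. "CMS1", "NODE2", "SG1", and "DB1" are placeholder names, and the move comment is illustrative.

```powershell
# Review replication health for every storage group copy first.
Get-StorageGroupCopyStatus -Server "CMS1" |
    Format-Table Name, SummaryCopyStatus, CopyQueueLength, ReplayQueueLength

# Dismount only the database whose storage group copy is failed.
Dismount-Database -Identity "CMS1\SG1\DB1"

# Retry the move, ignoring dismounted storage groups in the health check.
Move-ClusteredMailboxServer -Identity "CMS1" -TargetMachine "NODE2" -MoveComment "Planned maintenance" -IgnoreDismounted
```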
Replication does not resynchronize after a failover on one or more storage groups
Possible Causes The failure message returned by the Get-StorageGroupCopyStatus cmdlet indicates that the database is diverged. This situation is due to a failover when the old active server did not have enough logs replicated prior to failover.
Resolution Reseed the database by using the Update-StorageGroupCopy cmdlet in the Exchange Management Shell.
Seeding is failing
Possible Causes A backup is in progress on the active server, or there is a communication issue between the nodes.
Resolution Verify that a backup of the affected storage group copy or database is not in progress. Make sure the active node is online.
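Both checks can be made from the passive node, as in this sketch. "CMS1" and "NODE1" are placeholders for the clustered mailbox server and the active node, and the example assumes mailbox databases.

```powershell
# BackupInProgress is True while a backup of the database is running.
Get-MailboxDatabase -Server "CMS1" -Status | Format-Table Name, Mounted, BackupInProgress

# Basic reachability test of the active node.
ping NODE1
```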
For More Information
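For more information about the Exchange Management Shell cmdlets mentioned in this topic, see the cmdlet reference topics for Get-StorageGroupCopyStatus, Update-StorageGroupCopy, Resume-StorageGroupCopy, Restore-StorageGroupCopy, Get-StorageGroup, Move-ClusteredMailboxServer, and Get-ClusteredMailboxServerStatus.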
For information about troubleshooting local continuous replication, see How to Troubleshoot Local Continuous Replication Issues.