Managing Standby Continuous Replication
Microsoft Exchange Server 2007 will reach end of support on April 11, 2017. To stay supported, you will need to upgrade. For more information, see Resources to help you upgrade your Office 2007 servers and clients.
Applies to: Exchange Server 2007 SP1, Exchange Server 2007 SP2, Exchange Server 2007 SP3
In addition to the tasks for day-to-day management and administration of a Microsoft Exchange organization, there are tasks that are specific to standby continuous replication (SCR). Generally, the administrative tasks for SCR are:
Configuring disk storage for SCR and managing disk volumes.
Enabling and disabling SCR.
Monitoring replication activity.
Mounting, dismounting, creating, and removing databases.
Moving the location for storage of storage group or database files when a storage group is SCR-enabled.
Verifying the health of the SCR target.
Managing replication and replay activity.
Recovering from corruption.
These tasks are discussed in the remainder of this topic.
SCR is enabled and managed only by using the Exchange Management Shell. The Exchange Management Console cannot be used to enable or disable SCR, view SCR status, or manage any aspects of SCR.
Configuring Disk Storage for Standby Continuous Replication
SCR does not require specially configured disk storage. SCR requires storage that provides adequate capacity. Equivalent storage solutions should be configured for all SCR targets configured for the same storage group. We recommend that you follow the configuration procedures provided by your storage vendor to complete the configuration.
Managing Disk Volumes in an SCR Environment
While managing an SCR environment, it may be necessary to manage disk volumes that are connected to your Exchange server. For example, the volume may need to be temporarily detached from the system for maintenance or other reasons. If maintenance needs to be performed on the disk volume containing the active copy of the storage group, the database in the active copy of the storage group should be dismounted. If maintenance needs to be performed on the disk volumes containing the passive copy of the storage group, all input/output (I/O) to the volume should be stopped by halting replication. For more information about managing disk volumes, see How to Prepare for Disk Management Activities when Using SCR.
Enabling Standby Continuous Replication
SCR is enabled only by using the Exchange Management Shell and running either the New-StorageGroup cmdlet or the Enable-StorageGroupCopy cmdlet. Both cmdlets include some new parameters that are introduced with Microsoft Exchange Server 2007 Service Pack 1 (SP1):
-StandbyMachine This parameter is used to specify the name of the computer that will contain the SCR target. The value of this parameter is set as part of the value for the msExchStandbyCopyMachines attribute of the storage group being enabled for SCR. The msExchStandbyCopyMachines attribute is a multivalued Unicode string that is added to the Active Directory directory service schema when Exchange 2007 SP1 is introduced into the Exchange organization.
-ReplayLagTime This parameter is used to specify the amount of time that the Microsoft Exchange Replication service should wait before replaying log files that have been copied to the SCR target computer. The format for this parameter is (Days.Hours:Minutes:Seconds). The default setting for this value is 24 hours. The maximum allowable setting for this value is 7 days. The minimum allowable setting is 0 seconds, although setting this value to 0 seconds does not affect the default delay in log replay activity of 50 log files. After being set, the value for this parameter cannot be changed without disabling and then enabling SCR.
-TruncationLagTime This parameter is used to specify the amount of time that the Microsoft Exchange Replication service should wait before truncating log files that have been copied to the SCR target computer and replayed into the copy of the database. The time period begins after the log has been successfully replayed into the copy of the database. The format for this parameter is (Days.Hours:Minutes:Seconds). The maximum allowable setting for this value is 7 days. The minimum allowable setting is 0 seconds, and setting this value to 0 seconds effectively eliminates any delay in log truncation activity. After being set, the value for this parameter cannot be changed without disabling and then enabling SCR.
-SeedingPostponed This parameter can be used to skip the initial seeding of the SCR target. If this parameter is used, the administrator must manually seed the SCR target using the Update-StorageGroupCopy cmdlet. This parameter is available only with the Enable-StorageGroupCopy cmdlet. It is not available with the New-StorageGroup cmdlet because no source database exists at this point.
Important
To change the replay or truncation delay settings, you must first disable SCR and then enable SCR using the new values for these settings.
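The lag time format and limits described above can be illustrated with a short Python sketch. It is not part of Exchange: the function name parse_lag_time is hypothetical, and the sketch assumes only the full Days.Hours:Minutes:Seconds form is supplied. It checks a value against the 0-second minimum and 7-day maximum stated in this topic.

```python
import re
from datetime import timedelta

# Upper bound stated in this topic for ReplayLagTime and TruncationLagTime.
MAX_LAG = timedelta(days=7)

# Assumption: only the full Days.Hours:Minutes:Seconds form is handled here.
_LAG_PATTERN = re.compile(r"^(\d+)\.(\d{1,2}):(\d{1,2}):(\d{1,2})$")


def parse_lag_time(text: str) -> timedelta:
    """Parse 'Days.Hours:Minutes:Seconds' and check it against the 0 to 7 day range."""
    match = _LAG_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"expected Days.Hours:Minutes:Seconds, got {text!r}")
    days, hours, minutes, seconds = (int(part) for part in match.groups())
    if hours > 23 or minutes > 59 or seconds > 59:
        raise ValueError(f"hours, minutes, or seconds out of range in {text!r}")
    lag = timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)
    if lag > MAX_LAG:
        raise ValueError(f"{text!r} exceeds the 7-day maximum")
    return lag
```

For example, parse_lag_time("1.00:00:00") yields a one-day lag, which matches the default ReplayLagTime of 24 hours, and a value such as "8.00:00:00" is rejected.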
In addition to the administrator-configured delay of replay that is specified using the ReplayLagTime parameter, Exchange also prevents a fixed number of log files from being replayed on an SCR target, regardless of the value for ReplayLagTime, using the following formula:
Maximum of ("value of ReplayLagTime" or "X log files")
where X=50. This is an additional safeguard against the need to reseed a storage group in situations when an SCR source that is in a continuous replication environment, for example, local continuous replication (LCR) or cluster continuous replication (CCR), experiences a lossy failover and is brought online using the Restore-StorageGroupCopy cmdlet. Delaying replay activity on the SCR targets reduces the chance that the SCR copies need to be reseeded after a lossy failover of an SCR source, because the data loss on the SCR source leaves the two copies closer together in time.
Important
The built-in lag time of 50 log files and the value of the ReplayLagTime parameter have implications for the creation of the initial SCR target database. An SCR target database will not be created until 50 transaction log files have been replicated to the SCR target computer, and until the time period specified by ReplayLagTime (or the default ReplayLagTime of 24 hours) has elapsed.
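The combined effect of ReplayLagTime and the fixed 50-log delay can be sketched in Python as follows. This is one interpretation of the rule, not the Microsoft Exchange Replication service's actual logic. The function name and its arguments are hypothetical, and the sketch assumes that the time lag is measured from the moment a log file was copied to the SCR target.

```python
from datetime import datetime, timedelta

# Fixed number of most recent log files held back from replay (X = 50 in this topic).
HELD_BACK_LOGS = 50


def is_replay_eligible(log_generation: int,
                       log_copied_at: datetime,
                       last_copied_generation: int,
                       replay_lag: timedelta,
                       now: datetime) -> bool:
    """Return True when a copied log clears both the time lag and the 50-log lag."""
    # Time condition: the log has waited at least ReplayLagTime on the target.
    old_enough = now - log_copied_at >= replay_lag
    # Count condition: at least 50 newer logs have been copied after this one.
    far_enough_behind = last_copied_generation - log_generation >= HELD_BACK_LOGS
    return old_enough and far_enough_behind
```

Under this reading, a ReplayLagTime of 0 seconds still leaves the 50 most recently copied log files unreplayed, and a log that is 50 generations behind is still held until ReplayLagTime has elapsed.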
When you enable SCR for a storage group, a copy of the storage group (system files, log files, and database file) is automatically created and maintained on the SCR target computer, using the same paths as the storage group on the SCR source.
After SCR has been enabled, we recommend monitoring the health and status for each storage group using the Test-ReplicationHealth cmdlet. For detailed steps about how to enable SCR, see How to Enable Standby Continuous Replication for an Existing Storage Group and How to Enable Standby Continuous Replication for a New Storage Group.
SCR and Log Truncation
Because you cannot make backups of an SCR target database, SCR log truncation is not based on backup times. Instead, log truncation is determined by the checkpoint at the SCR source and the value for TruncationLagTime.
If the SCR source is a clustered mailbox server (CMS) in a CCR environment, the log truncation logic includes successfully copying and inspecting the log files by all SCR targets. This means that if an SCR target is not available, log truncation does not occur on the SCR source even if backups are taken.
In an SCR environment, an SCR target that is disabled and then enabled again may not need to be reseeded if all of the required log files are available, based on the following:
If circular logging is enabled for the storage group, log deletion will result in the enabled SCR target requiring a reseed due to gaps in the log sequence.
If a backup is taken that includes log file truncation, log deletion will result in the enabled SCR target requiring a reseed due to gaps in the log sequence.
If log files are not truncated via either of the preceding means, disabling and then enabling SCR should not require a reseed. In this case, log files at the SCR target will need to be deleted, but they will be replicated again from the SCR source.
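The underlying question in each of these cases is whether the log sequence that the SCR target needs is still complete. The following Python sketch illustrates that gap check. The function names are hypothetical, and how Exchange itself determines the required range of log generations is not described in this topic, so the range is passed in as an assumption.

```python
from typing import Iterable, List


def find_log_gaps(first_required: int,
                  last_required: int,
                  available_generations: Iterable[int]) -> List[int]:
    """Return the log generations in the required range that are no longer available."""
    available = set(available_generations)
    return [generation
            for generation in range(first_required, last_required + 1)
            if generation not in available]


def reseed_required(first_required: int,
                    last_required: int,
                    available_generations: Iterable[int]) -> bool:
    """A reseed is expected when any required log generation is missing."""
    return bool(find_log_gaps(first_required, last_required, available_generations))
```

If circular logging or a log-truncating backup has deleted generations inside the required range, find_log_gaps returns them and reseed_required returns True. If nothing was truncated, the list is empty and the logs can be replicated again from the SCR source.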
If you plan on enabling an SCR target that was previously disabled, as a best practice, we recommend not performing any log truncating operations (for example, enabling circular logging or performing log truncating backups) until the SCR target is enabled and the configuration change that required the enabling has replicated throughout Active Directory.
Disabling Standby Continuous Replication
SCR is disabled only by using the Disable-StorageGroupCopy cmdlet and the StandbyMachine parameter. When disabling SCR, it is important that you include the appropriate value for the StandbyMachine parameter. If the SCR source storage group also has LCR enabled and you do not include the StandbyMachine parameter as part of this command, LCR will be disabled for the storage group.
Disabling SCR is necessary to change the value for either the ReplayLagTime or the TruncationLagTime parameter. These values cannot be modified while SCR is enabled. Therefore, to change the replay or truncation delay settings, you must first disable SCR and then enable SCR again using the new values for these settings.
For detailed steps about how to disable SCR for a storage group, see How to Disable Standby Continuous Replication for a Storage Group.
Monitoring Replication Activity
Although SCR does not require any special monitoring, we recommend regularly monitoring each storage group to verify that it is replicating log files correctly. The Microsoft Exchange Server 2007 Management Pack for Microsoft Operations Manager 2005 includes alerts for several critical problems related to SCR environments:
Microsoft Exchange Replication service is not running. Note that the event that generates this alert does not repeatedly appear after the service is stopped, so any alert associated with it would be lost if it were cleared.
SCR target copy is in a Failed state.
SCR target copy is in a Healthy state, but it is behind in log copying.
You should investigate and resolve any of the preceding alerts generated by the Exchange 2007 Management Pack as quickly as possible.
Test-ReplicationHealth Cmdlet
Exchange 2007 SP1 introduces a new cmdlet called Test-ReplicationHealth. This cmdlet is designed for proactive monitoring of continuous replication (LCR, CCR, and SCR) and the continuous replication pipeline. The Test-ReplicationHealth cmdlet checks all aspects of replication, cluster services, storage group replication, and replay status to provide a complete overview of the replication system. Specifically, the Test-ReplicationHealth cmdlet performs the tests described in the following table.
Tests performed by the Test-ReplicationHealth cmdlet
Test | Description |
---|---|
Cluster network status | Verifies that all cluster-managed networks found on the local node are running. This test applies only to CCR environments. |
Quorum group state | Verifies that the cluster group containing the quorum resource is healthy. This test applies only to CCR environments. |
File share quorum state | Verifies that the value of the FileSharePath used by the Majority Node Set quorum with file share witness is reachable. This test applies only to CCR environments. |
Clustered mailbox server group state | Verifies that the CMS is healthy by confirming that all resources in the group are online. This test applies only to CCR environments. |
Node state | Verifies that neither of the nodes in the cluster is in a paused state. This test applies only to CCR environments. |
DNS registration status | Verifies that all cluster-managed network interfaces that have Require DNS registration to succeed set have passed Domain Name System (DNS) registration. This test applies only to CCR environments. |
Replication service status | Verifies that the Microsoft Exchange Replication service on the local node is healthy. |
Storage group copy suspended | Checks if continuous replication has been suspended for any storage groups. |
Storage group copy failed | Checks if any storage group copies are in a Failed state. |
Storage group replication queue length | Checks if any storage group has a replication copy queue length greater than best practice thresholds. Currently, these thresholds are: |
Databases dismounted after failover | Checks if any databases are dismounted or failed after a failover has occurred. This test only checks for databases that have failed as a result of a failover. |
Mounting and Dismounting Databases
It may occasionally be necessary to mount or dismount databases in an SCR environment. If the SCR source storage group or database needs reconfiguration or maintenance, you must block the services interacting with both while the activity is occurring. This could be required to perform a reconfiguration or to correct issues with the server or database. When the database is dismounted, it is inaccessible.
Moving the Location of Storage Group and Database Files
You can change the location of a database in an SCR-enabled storage group. In an SCR environment, there are two database files, one for each copy. When moving the storage group files or the database file, the locations for both copies must be changed in tandem.
Note
The complete path for the storage group files and database file must match on the SCR source and all SCR targets.
Similar procedures are used to reconfigure the location of a storage group log and system files and the location of the database files in an SCR environment. For detailed steps about how to change the location of log files and system files for an SCR-enabled storage group, see How to Move a Storage Group in a Standby Continuous Replication Environment. For detailed steps about how to change the location of database files in an SCR environment, see How to Move a Database in a Standby Continuous Replication Environment.
Important
Databases cannot be placed at the root of a volume.
Viewing Status Information
All monitoring and status reporting is performed using the Exchange Management Shell. The Exchange Management Console does not display copy status or any other information about SCR. After SCR has been enabled for a storage group, you can use the Exchange Management Shell to view the SCR-specific configuration settings for the storage group and its database.
Status Information for Standby Continuous Replication
Exchange 2007 publishes a variety of status information for SCR copies. The following table describes the status information that is available for SCR-enabled storage groups. For detailed steps that explain how to obtain status information, see How to View the Status of Standby Continuous Replication.
Note
The following table lists the properties in the order that they appear when viewing the full output of the Get-StorageGroupCopyStatus cmdlet.
Status information available for SCR-enabled storage groups
Property | Description |
---|---|
Identity | Server and name of the queried storage group. |
StorageGroupName | Name of the queried storage group. |
SummaryCopyStatus | Current overall status of the SCR copy. Possible values are: |
Failed | Verification of the database or logs identified an inconsistency that prevents replication. Alternatively, there is a configuration or access problem with the active or passive copy. Possible values are True and False. |
FailedMessage | Textual message identifies the condition that caused replication to fail. It may not be the only replication problem area. |
Seeding | Seeding in progress. Possible values are True and False. |
Suspend | Replication (and replay) halted for the passive copy. This prevents the database from advancing and logs from being copied. Possible values are True and False. |
SuspendComment | Optional administrator comment providing a reason or note as to why replication activity has halted. |
CopyQueueLength | Number of transaction log files waiting to be copied to the passive copy log file folder. A copy is not considered completed until it has been checked for corruption. |
ReplayQueueLength | Number of transaction log files that have been copied and are waiting to be replayed into the passive copy. |
LatestAvailableLogTime | Time stamp on the source storage group of the most recently detected new transaction log file. |
LastCopyNotificationedLogTime | Time associated with the last new log generated by the active storage group and known to the copy. |
LastCopiedLogTime | Time stamp on the source storage group of the last successful copy of a transaction log file. |
LastInspectedLogTime | Time stamp on the target storage group of the last successful inspection of a transaction log file. |
LastReplayedLogTime | Time stamp on the target storage group of the last successful replay of a transaction log file. |
LastLogGenerated | Last log generation number known to be generated on the active copy of the storage group. |
LastLogCopied | Last log generation number successfully copied to the passive copy log folder. |
LastLogNotified | Last log generation number generated by the active storage group and known to the copy. |
LastLogInspected | Last log generation number inspected for consistency and corruption. |
LastLogReplayed | Last log generation number successfully replayed into the passive copy of the storage group. |
LatestFullBackupTime | Time of the last full backup. |
LatestIncrementalBackupTime | Time of the last incremental backup. |
SnapshotBackup | Backup taken using legacy streaming APIs or Volume Shadow Copy Service (VSS). Possible values are True and False. |
You can quickly assess the health of an SCR copy by looking at the values for SummaryCopyStatus, CopyQueueLength, ReplayQueueLength, and LastInspectedLogTime. These properties show whether the SCR copy is functioning correctly, and whether the SCR copy is relatively current in both copying and replaying logs. If the following conditions occur, you should determine the cause and correct the problem:
The copy is spending significant time in a state that is not healthy.
The copy queue length is more than 5.
The replay queue length is more than 20.
The last inspected log time does not show a current time. There are two likely causes: either the storage group is not experiencing much change, or the Microsoft Exchange Replication service is stopped.
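The conditions above can be expressed as a simple check, sketched here in Python. The CopyStatus class and assess_copy function are hypothetical and are not an Exchange API. They assume the four property values have already been read from the Get-StorageGroupCopyStatus output. The queue limits of 5 and 20 come from this topic. The 15-minute staleness window is an illustrative assumption, because this topic does not define what counts as a current time. A single snapshot also cannot show how long a copy has been unhealthy, so the sketch only flags the current state.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

COPY_QUEUE_LIMIT = 5     # from this topic: investigate when more than 5
REPLAY_QUEUE_LIMIT = 20  # from this topic: investigate when more than 20
# Assumption: "current" is not defined in this topic; 15 minutes is illustrative.
STALE_AFTER = timedelta(minutes=15)


@dataclass
class CopyStatus:
    """Hypothetical container for values read from Get-StorageGroupCopyStatus."""
    summary_copy_status: str
    copy_queue_length: int
    replay_queue_length: int
    last_inspected_log_time: datetime


def assess_copy(status: CopyStatus, now: datetime,
                stale_after: timedelta = STALE_AFTER) -> List[str]:
    """Return a list of findings; an empty list means nothing needs investigation."""
    findings = []
    if status.summary_copy_status != "Healthy":
        findings.append(f"copy status is {status.summary_copy_status}")
    if status.copy_queue_length > COPY_QUEUE_LIMIT:
        findings.append(f"copy queue length is {status.copy_queue_length}")
    if status.replay_queue_length > REPLAY_QUEUE_LIMIT:
        findings.append(f"replay queue length is {status.replay_queue_length}")
    if now - status.last_inspected_log_time > stale_after:
        findings.append("last inspected log time is not current")
    return findings
```

Returning a list of findings rather than a single pass or fail value keeps each condition visible, which matters because a stale last inspected log time can mean that the queue lengths themselves are unreliable.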
The replay queue length and copy queue length values are available as performance counters. They are the CopyQueueLength and ReplayQueueLength performance counters under the MSExchange Replication performance object.
There are some rare scenarios where the replication status can be misleading. The following is a list of those scenarios:
A storage group that is not active (that is, not changing) can report as healthy when it might not be healthy. This situation could occur because the unhealthy condition could not be detected until a log is replayed.
During replication initialization, the replication status is being evaluated and may not be accurate. When the initialization completes, the status is updated.
The value of the LastLogGenerated field can be wrong when a database is dismounted. However, all logs with end-user content are replicated if the storage group copy is replicating.
When there are one or more missing logs in the middle of a log stream, the passive copy continues to try to recover. In doing so, the replication status switches between failed and healthy states. The replay and copy queues continue to grow.
In some very rare conditions, a log can be successfully verified but it can still fail to replay. In this situation, the system alternates between failed and healthy states as it attempts to recover. The replay and copy queues continue to grow.
Verifying the Integrity of an SCR Target
When you use SCR, we recommend that you verify the integrity of each SCR target copy periodically by running a physical consistency check against the database and transaction log files. A physical consistency check examines the transaction logs and database files for corruption. You can perform the check by using the command-line version of the Microsoft Volume Shadow Copy Service tool (VSSAdmin.exe), and the Exchange Server Database Utilities (Eseutil.exe). For detailed steps about how to use VSSAdmin and Eseutil to check the transaction logs and database files for physical corruption, see How to Verify a Standby Continuous Replication Copy.
Note
Before you run a physical consistency check against a database, you must temporarily suspend all replication activity against the storage group. You can suspend replication activity by using the Suspend-StorageGroupCopy cmdlet in the Exchange Management Shell. When the consistency check has completed, you can resume transaction log replay activity by using the Resume-StorageGroupCopy cmdlet. We recommend that you perform verification during non-production hours and minimize the amount of time that replay activity is suspended. This is because suspending the storage group copy halts all updates to the SCR copy, thus causing some content to be vulnerable to a failure.
Managing Replication and Replay
Managing log file replication and replay in an SCR environment involves the following main activities:
Halting replication to the storage group copy
Restarting replication to the storage group copy
Reseeding a storage group
Halting and Restarting Changes to a Storage Group Copy and its Database
For a variety of reasons, it may be necessary to halt and restart transaction log replication activity. Transaction log replication occurs when the Microsoft Exchange Replication service is running, a storage group has been enabled for SCR, and both the SCR source and SCR target are operational. If either the source or target becomes unavailable, you must stop replication. In addition, some administrative tasks, such as seeding, require you to suspend replication for an SCR-enabled storage group. If you need to stop all access to a target's data files, you must suspend replication.
It may occasionally be necessary to control the activities of the SCR target. This could be required to perform a reconfiguration, or to correct issues with the server or the database. Halting log replay is also required to perform a physical consistency check of the SCR target. When it is necessary to control database copy updates, replication must be halted for the SCR target. Replication may also need to be halted when the SCR target logs are being manipulated for any reason.
For more information about halting replication changes to SCR copies, see How to Suspend Changes to a Standby Continuous Replication Target. For more information about restarting replication changes to SCR copies, see How to Resume Replication to a Standby Continuous Replication Target. For more information about performing an integrity check on the passive copy's transaction logs and database file, see How to Verify a Standby Continuous Replication Copy.
Seeding and Reseeding a Storage Group Copy
Seeding and reseeding a storage group copy in an SCR environment is performed by using the Update-StorageGroupCopy cmdlet with the StandbyMachine parameter (which is a new parameter added in Exchange 2007 SP1).
For detailed steps about how to seed or reseed an SCR target, see How to Seed a Standby Continuous Replication Target.
Recovering from Corruption by Assessing Replication Status at the Time of Corruption
After a failure or corruption of a database copy, you need to assess if you want to immediately continue operation using an SCR target. SCR provides key pieces of information to aid in this decision:
Health of the copy at the time of failure
Replay and copy queues at the time of the failure
Last inspected log time at the time of the failure
The information can be obtained using the Get-StorageGroupCopyStatus cmdlet. For detailed steps about how to obtain this information, see How to View the Status of Standby Continuous Replication.
Note
The last inspected log time provides information about the most recent changes from the SCR source. This aids in detecting failures that occur when the Microsoft Exchange Replication service is not started, because the queue lengths are inaccurate when the service is stopped.
The copy queue length includes the best available information of the SCR source at the time of failure. Based on this information and your assessment of the recovery time of the failed database, you must decide if the available SCR target is to be activated:
If the replay queue length is significant, recovery might take time but it is not an indicator that significant data loss will be experienced.
If the copy queue length is significant, many logs have been lost. If the database is activated, it will be restored to a time frame of approximately the last copied log (also provided by the Get-StorageGroupCopyStatus cmdlet).
If the last inspected log time is significantly prior to the time of the failure, it is likely that the Microsoft Exchange Replication service is stopped and other queue information is inaccurate.
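The three considerations above can be combined into a rough assessment, sketched here in Python. The names are hypothetical and the sketch rests on two assumptions that are not stated in this topic: that the gap between LatestAvailableLogTime and LastCopiedLogTime approximates the window of lost changes, and that the caller chooses how far behind the failure time the last inspected log time may be before the queue information is distrusted.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class RecoveryAssessment(NamedTuple):
    queue_info_trustworthy: bool         # False suggests the Replication service was stopped
    approximate_loss_window: timedelta   # changes made after the last copied log
    logs_awaiting_replay: int            # affects recovery time, not data loss


def assess_activation(latest_available_log_time: datetime,
                      last_copied_log_time: datetime,
                      last_inspected_log_time: datetime,
                      replay_queue_length: int,
                      failure_time: datetime,
                      stale_after: timedelta) -> RecoveryAssessment:
    """Summarize what activating the SCR target is likely to cost."""
    # A last inspected log time far behind the failure makes other values suspect.
    trustworthy = failure_time - last_inspected_log_time <= stale_after
    # Activation restores the database to roughly the last copied log.
    loss_window = max(latest_available_log_time - last_copied_log_time, timedelta(0))
    return RecoveryAssessment(trustworthy, loss_window, replay_queue_length)
```

A large logs_awaiting_replay value lengthens recovery without implying data loss, while a large approximate_loss_window indicates changes that were never copied to the SCR target.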
Note
Due to the nature of SCR, as well as external latencies and communication failures, it is possible for the copy queue length to be inaccurate because the current state of the active copy is asynchronously updated. In general, the inaccuracy is limited to activities approximately a minute before and after the failure.
Note
A failed database cannot be used to seed an SCR target.