Recovery Scenarios for E2K7…..I
This is the first of a few blogs about designing for availability and resilience… WHAT FAILURES MIGHT OCCUR AND HOW DO WE CHOOSE THE RIGHT DESIGN TO PROTECT US?
In the very early stages of a messaging design, and in particular once discussions about availability and resilience begin, it is often very useful to understand the types of issues that support teams are likely to face and how your proposed design stacks up against them.
EXAMPLE DESIGN
So first I need an example design. For the purposes of this blog I am using a pretty standard Exchange 2007 design based on CCR\SCR across 2 data centres. The design is best described on Technet under ‘Site Resilience Configurations’. (See the section ‘Production (Non-Dedicated) with One Active Directory Site’: “This solution deploys redundant servers in a single Active Directory site that spans both datacenters.”)
I’m also using DPM for VSS based backups to disk, with long term backups to tape media, and there is a requirement to journal all messages to satisfy compliance regulations.
WHAT MIGHT GO WRONG?
The scenarios I’m going to base this on are as follows:
Data Centre Failure: The loss of an entire data centre
Server Hardware Failure: Component failure e.g. motherboard
Storage Failure: Loss of access to all or part of a volume\LUN, not including single disk failure
Mailbox Database Corruption (Physical): Most likely as a result of hardware failure
Mailbox Database Corruption (Logical): Data corruption, possibly as a result of a faulting application or a virus
Mailbox Deletion within Deleted Mailbox Retention period (<30 days): A result of an administrative or procedural error
Mailbox Deletion beyond Deleted Mailbox Retention period (>30 days): A result of an administrative or procedural error, or a returning employee
Email or Item Deletion (<14 days): User mistakenly deleted an item; administrator intervention required only if the item was hard deleted
Email or Item Deletion (>14 days): User mistakenly deleted an item; administrator intervention required
Identify if and when a particular email was sent\received (<30 days): Only message route required
Identify if and when a particular email was sent\received (>30 days): Only message route required
Identify if and when a particular email was sent\received (<14 days): Entire message required
Identify if and when a particular email was sent\received (>14 days): Entire message required
HOW DOES MY PROPOSED DESIGN PROTECT ME?
The following table takes the above scenarios and identifies which part of your design protects against each one. This first pass should help us to understand what might fail, what protection the design provides, the likelihood of the scenario occurring and the impact of that event.
Scenario | Mitigation | Impact (Worst case) | Estimated Recovery Time | Likelihood |
Data Centre Failure | SCR ( & redirection of network traffic\email) | Temporary loss of service to all users during presentation of SCR targets; minimal data loss | <2 hours* | Very low |
Server Hardware Failure | CCR | Temporary loss of service to all users on a single Exchange Server during cluster failover | <15 minutes | Moderate |
Storage Failure | CCR (single disk failure mitigated by RAID) | Temporary loss of service to all users on a single Exchange Server during cluster failover | <15 minutes | Moderate |
Mailbox Database Corruption (Physical) | CCR** | Temporary loss of service to all users on a single Exchange Server | <15 minutes | Low |
Mailbox Database Corruption (Logical) | DPM restore from disk | Temporary loss of service to all users on a single mailbox database | <2 hours | Low |
Mailbox Deletion within Deleted Mailbox Retention Period (<30 days) | Deleted Mailbox Retention*** | Temporary loss of service to a single user\temporary loss of all data | <15 minutes | High |
Mailbox Deletion beyond Deleted Mailbox Retention period (>30 days) | DPM restore of database from tape | n\a | <8 hours | High |
Email or Item Deletion (<14 days) | Deleted Item Retention**** | Loss of single\multiple items for a single user | <15 minutes | High |
Email or Item Deletion (>14 days) | DPM restore of database from tape | n\a | <8 hours | Moderate |
Identify if and when a particular email was sent\received (<30 days) | Message Tracking***** | n\a | <15 minutes | Low-Moderate |
Identify if and when a particular email was sent\received (>30 days) | DPM restore of single\multiple databases from tape | n\a | <2 days | Low-Moderate |
Identify if and when a particular email was sent\received (<14 days) | Message Journaling****** | n\a | <1 hour | Low-Moderate |
Identify if and when a particular email was sent\received (>14 days) | Message Journaling | n\a | <1 hour | Low-Moderate |
* Whilst invoking the SCR target is estimated to take less than 2 hours, the loss of an entire data centre means that resuming the complete service (including the redirection of Outlook clients, the internet connection, and the recovery of all ancillary services, such as an archive solution) may take more than 2 hours.
** The alternative to failing over the entire server to the CCR replica is to restore a single database from disk using DPM. This increases the impact for the users with mailboxes on the affected database but avoids any loss of service for users on the rest of the server.
*** The default Deleted Mailbox Retention period is 30 days. This is a configurable setting.
**** The default Deleted Item Retention period is 14 days. This is a configurable setting.
***** Message Tracking Logs are by default kept for 30 days. This is a configurable setting.
****** Currently it is assumed that all email is journaled and archived and retained for a period according to compliance requirements.
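As a rough illustration, the retention windows in these footnotes decide which mitigation from the table applies to a given request. The Python sketch below encodes that lookup. The function name `mitigation_for` and the request labels are hypothetical, the day limits are the defaults quoted above, and treating a request that is exactly at the limit as expired is my own assumption, since the table only covers "<" and ">".

```python
# Default windows quoted in the footnotes; all three are configurable.
DELETED_MAILBOX_RETENTION_DAYS = 30
DELETED_ITEM_RETENTION_DAYS = 14
MESSAGE_TRACKING_LOG_DAYS = 30


def mitigation_for(request: str, age_days: int) -> str:
    """Return the mitigation from the table for a request of a given age.

    `request` is one of the hypothetical labels used below. A request that
    is exactly at the limit is treated as expired (an assumption).
    """
    if request == "mailbox_deletion":
        if age_days < DELETED_MAILBOX_RETENTION_DAYS:
            return "Deleted Mailbox Retention"
        return "DPM restore of database from tape"
    if request == "item_deletion":
        if age_days < DELETED_ITEM_RETENTION_DAYS:
            return "Deleted Item Retention"
        return "DPM restore of database from tape"
    if request == "message_route":
        if age_days < MESSAGE_TRACKING_LOG_DAYS:
            return "Message Tracking"
        return "DPM restore of single\\multiple databases from tape"
    if request == "entire_message":
        # Assumes all email is journaled and archived, as in the last footnote.
        return "Message Journaling"
    raise ValueError(f"unknown request type: {request}")


if __name__ == "__main__":
    print(mitigation_for("item_deletion", 3))
    print(mitigation_for("message_route", 45))
```

If the retention periods in your environment differ from the defaults, only the three constants at the top need to change.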
So, to use an example from the table above: if an administrator was asked by a customer to identify an email that was sent or received over 30 days ago (not to provide the message itself, only to identify when it was sent and received), they would have to identify the databases where the sender and recipient mailboxes were located at the time of the message delivery, restore them and try to find that message. This is a long and laborious task which might take up to 2 days. In my example I have assumed that the likelihood of this occurring is low-moderate. This exercise should highlight the areas where your proposed design doesn’t provide the protection that your specific company requires of it.
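One way to make the gaps stand out is to score each row of the table. The sketch below is only an illustration and not part of the design: it multiplies the worst-case recovery time (in hours, taken from the table) by a numeric weight for the likelihood. The 1 to 5 weights, the shortened scenario names and the function name `rank_by_exposure` are all my own assumptions, so substitute whatever scale suits your company.

```python
# Hypothetical ordinal weights for the likelihood column (an assumption).
LIKELIHOOD_WEIGHT = {
    "Very low": 1,
    "Low": 2,
    "Low-Moderate": 3,
    "Moderate": 4,
    "High": 5,
}

# (scenario, worst-case recovery time in hours, likelihood) from the table.
SCENARIOS = [
    ("Data Centre Failure", 2.0, "Very low"),
    ("Server Hardware Failure", 0.25, "Moderate"),
    ("Storage Failure", 0.25, "Moderate"),
    ("Database Corruption (Physical)", 0.25, "Low"),
    ("Database Corruption (Logical)", 2.0, "Low"),
    ("Mailbox Deletion (<30 days)", 0.25, "High"),
    ("Mailbox Deletion (>30 days)", 8.0, "High"),
    ("Item Deletion (<14 days)", 0.25, "High"),
    ("Item Deletion (>14 days)", 8.0, "Moderate"),
    ("Message route lookup (<30 days)", 0.25, "Low-Moderate"),
    ("Message route lookup (>30 days)", 48.0, "Low-Moderate"),
    ("Entire message lookup (<14 days)", 1.0, "Low-Moderate"),
    ("Entire message lookup (>14 days)", 1.0, "Low-Moderate"),
]


def rank_by_exposure(scenarios):
    """Return (scenario, score) pairs sorted from highest to lowest score.

    score = recovery hours * likelihood weight; a crude exposure measure.
    """
    scored = [
        (name, hours * LIKELIHOOD_WEIGHT[likelihood])
        for name, hours, likelihood in scenarios
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, score in rank_by_exposure(SCENARIOS)[:3]:
        print(f"{score:6.1f}  {name}")
```

With these particular weights the message route lookup beyond 30 days comes out on top, followed by the two tape restore scenarios, which matches the example above. The score ignores impact, so treat it as a prompt for discussion rather than a verdict.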
The next blog in this series is called ‘Recovery Scenarios for E2K7…..II’ and looks at each component of the design to determine which of them brings the most value at the smallest cost so that we can make a more informed decision as to which to choose to deploy…
Anonymous
March 19, 2010
Quick question/scenario, what would you expect an RTO for an exchange environment with around 1500 users to be for say a SAN failure? Company has a SAN controller fail with no redundant controller, possible corruption of Exchange stores, should/would standard recovery take over 2 days? 1 day?