Disaster Recovery

Simplify File Recovery with Data Protection Manager

Microsoft IT Showcase and Laura Euler

 

At a Glance:

  • Using DPM for backup at Microsoft
  • The DPM architecture
  • End-user data recovery
  • Configuring backup and recovery

Most large companies use magnetic tape for backing up data. Tape is cheap and useful for long-term retention and off-site storage, but it can be inefficient and unreliable. It’s physically fragile and vulnerable to misalignments. It also takes a long time to back up data to a tape. Even when data has been backed up successfully, tapes are subject to loss, damage, or prolonged delivery times due to off-site storage.

Some industry analysts estimate that more than 40 percent of companies have had restorations fail because the data was not correctly written to tape, was corrupted, or was in other ways unusable. Verification procedures are often necessary to ensure that data has been properly backed up, but that adds more time and overhead.

Is there a better way? Hard disk space gets cheaper all the time, so it’s now more cost-effective to back up on disk and avoid tape backups altogether. Disks aren’t subject to the same pitfalls as tapes. They aren’t as fragile, they can be overwritten more easily, and they’re faster for accessing data. And management tools can make the task of keeping track of your data easier.

Microsoft® Data Protection Manager (DPM) 2006 is a server software application that optimizes disk-based backup and restoration. The DPM server routinely synchronizes with your production servers, capturing only the changes. This replication occurs at the byte level; instead of replicating an entire file when a change occurs, DPM replicates only the bytes that actually change within each file.
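To make the idea of byte-level replication concrete, here is a minimal Python sketch. It is not how the DPM agent is implemented (the agent logs changes as they are made rather than comparing file versions), and the function names `changed_ranges` and `apply_ranges` are hypothetical. It only illustrates why shipping the changed bytes is cheaper than shipping the whole file, under the simplifying assumption that the file keeps the same length.

```python
def changed_ranges(old: bytes, new: bytes) -> list:
    """Return (offset, replacement_bytes) pairs where `new` differs from `old`.

    Simplification for this sketch: both versions must have the same length.
    """
    if len(old) != len(new):
        raise ValueError("this sketch only handles same-length versions")
    ranges = []
    start = None  # offset where the current run of differing bytes began
    for i, (a, b) in enumerate(zip(old, new)):
        if a != b and start is None:
            start = i
        elif a == b and start is not None:
            ranges.append((start, new[start:i]))
            start = None
    if start is not None:  # a run of differences reached the end of the file
        ranges.append((start, new[start:]))
    return ranges


def apply_ranges(replica: bytes, ranges: list) -> bytes:
    """Patch a replica in place of resending the whole file."""
    buf = bytearray(replica)
    for offset, data in ranges:
        buf[offset:offset + len(data)] = data
    return bytes(buf)


old = b"hello world, hello disk"
new = b"hello WORLD, hello disk"
delta = changed_ranges(old, new)
print(delta)                       # only 5 of 23 bytes need to travel
print(apply_ranges(old, delta) == new)
```

In this toy case the replica is brought up to date by transferring one five-byte range instead of the full file.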

A test by the Microsoft Data Protection Services Group using a beta release demonstrated that DPM could protect a branch office with 300GB of data to back up every day through a single nightly synchronization session that lasted only 10 minutes. This session replaced the eight-hour tape backup session formerly required.

For users, DPM offers instant recoverability. If a user’s data becomes overwritten or lost, a backup copy is available without the need to request a tape. An employee can use Windows® Explorer or any Microsoft Office 2003 (or higher) application to right-click the file and initiate the restoration. If a catastrophic server failure occurs, the DPM image of the data can be set to read-only mode and shared over the network to provide users access to their data while the server is rebuilt.

Microsoft Case Study

With about 150 sites, including 115 branch offices around the world, Microsoft needed a backup and restoration solution that would enable it to centrally manage all remote locations. A big part of centrally managing support for remote locations is having the ability to restore data without requiring local personnel to pull a tape from a library and mount the media for restoration. Loading tape cannot be done from a central location. And a WAN-based solution didn’t work, either. It combined the unreliability of a tape backup with the slow transmission rates of the WAN.

Microsoft also wanted to reduce the cost of maintenance and repair support, as well as minimize backup errors. The monitoring tools used at Microsoft to verify tape-based backups identified about 16,000 errors each month for more than 5,000 servers. About 14,000 could be resolved via automation, but the remaining 2,000 had to be addressed manually. With at least 150 different error codes, resolving each one was time-consuming.

Yet another consideration was the size of the Microsoft datacenter servers. Some of these servers contain so much information that the Microsoft Data Protection Services Group found that tape was simply unable to protect all of it.

The solution was DPM. Microsoft IT configured its DPM servers to store between 14 and 21 days of backup data in the form of easily accessible snapshots—dedicating 180 terabytes of storage connected to the DPM servers to cover all 115 branch offices. Tape backups for long-term retention and off-site backup are made from the DPM servers, completely freeing the branch offices from the expense of managing tapes locally. The 12 DPM servers replaced 115 tape libraries, media servers, and associated infrastructure at these sites.

DPM Architecture

An architectural overview of a typical DPM installation can be found in Figure 1. The DPM server software is installed on a dedicated server running Windows Server™ 2003 Service Pack 1 (SP1) or Microsoft Windows Storage Server 2003 SP1. DPM also runs on Windows Server 2003 R2. The DPM setup process also installs components of SQL Server 2000 (the DPM product includes a restricted license of SQL Server for use only with DPM).

Figure 1 Typical DPM Installation

An Active Directory® service domain is required for the discovery of servers and to maintain the security settings of files and folders through access control lists. The Active Directory schema also holds the configuration settings required for the user recovery client to retrieve shadow copies from the DPM server. The DPM server must be in the same domain as the servers being protected.

The DPM agent software is installed on each protected server; it captures and logs all the changes made to the file system. These servers may run Windows 2000 (SP4 with the update rollup), Windows Server 2003, Windows Storage Server 2003, Windows Server 2003 R2, or Windows Storage Server 2003 R2. The agent is installed from within the DPM administrator console.

The clients run Windows XP and Windows Server 2003. The Previous Versions Client and Volume Shadow Copy Service (VSS) on these systems enable users to access and recover previous versions of their files. (Note that enabling end-user recoveries via DPM requires an update for computers running Windows XP SP2 and Windows Server 2003 SP1. For more information, see "An update is available to optimize the way that the Shadow Copy Client accesses shadow copies in Windows Server 2003 and in Windows XP".)

You might have noticed the final tape backup solution in Figure 1. This last step is optional, but for off-site archiving or long-term retention (especially for legal and financial purposes), tape is still the most cost-effective medium. However, for fast backups and restorations of recent data, it’s more efficient to restore directly from the DPM disk.

Synchronization is the process by which DPM transfers only the changes made on file servers to the DPM server and applies those changes to the replica of protected data. DPM updates data at the byte level within protected files. File operations such as renaming, deleting, and creating also are replicated for protected files.

DPM synchronization is asynchronous, meaning that synchronization occurs without blocking disk I/O on the protected objects. Changes are stored in the agent synchronization log and then replicated across the network to the DPM server, where they are stored in the transfer log. Data in the transfer log is later used to construct complete replicas that can be used to create a shadow copy of the protected data.
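The flow described above (changes logged on the protected server, shipped to the DPM server, then applied to the replica) can be sketched as a replay of a change log. The `Change` record and `apply_changes` function below are hypothetical illustrations, not DPM data structures, and the replica is modeled as a plain in-memory dictionary.

```python
from dataclasses import dataclass


@dataclass
class Change:
    """One logged file-system change (hypothetical record layout)."""
    op: str              # "write", "create", "delete", or "rename"
    path: str
    offset: int = 0      # used by "write"
    data: bytes = b""    # used by "write" and "create"
    new_path: str = ""   # used by "rename"


def apply_changes(replica: dict, log: list) -> None:
    """Replay a transferred change log against a replica, in order."""
    for c in log:
        if c.op == "create":
            replica[c.path] = bytearray(c.data)
        elif c.op == "write":
            buf = replica[c.path]
            end = c.offset + len(c.data)
            if end > len(buf):                 # the write grows the file
                buf.extend(b"\0" * (end - len(buf)))
            buf[c.offset:end] = c.data         # only the changed bytes are touched
        elif c.op == "delete":
            del replica[c.path]
        elif c.op == "rename":
            replica[c.new_path] = replica.pop(c.path)
        else:
            raise ValueError(f"unknown operation: {c.op}")


replica = {"report.doc": bytearray(b"abcdef")}
log = [
    Change("write", "report.doc", offset=2, data=b"XY"),
    Change("rename", "report.doc", new_path="final.doc"),
]
apply_changes(replica, log)
print(replica)
```

Because the log is replayed after the fact, the protected server never has to wait on the DPM server while it is writing, which is the sense in which synchronization is asynchronous.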

The Windows Server 2003 VSS creates consistent point-in-time copies of data known as shadow copies. After the DPM agent has synchronized the changed data with the DPM server, VSS creates shadow copies of the replicas on the DPM server. Users can then browse and recover copies of deleted or corrupted files from various points in time. The oldest shadow copies are eventually deleted: this happens as soon as the total size of all shadow copies reaches a configurable maximum or the number of shadow copies reaches 64 per volume, whichever occurs first.
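The retention rule for shadow copies can be sketched as a simple pruning loop. This is an illustration of the rule as stated in the article, not VSS code: the `prune` function and the `(label, size)` representation are assumptions, and the sketch treats both limits as caps that must not be exceeded.

```python
MAX_SHADOW_COPIES = 64  # per-volume limit cited in the article


def prune(copies: list, max_total_bytes: int) -> list:
    """Drop the oldest shadow copies until both limits are satisfied.

    `copies` is ordered oldest first; each entry is (label, size_in_bytes).
    """
    kept = list(copies)
    while kept and (
        len(kept) > MAX_SHADOW_COPIES
        or sum(size for _, size in kept) > max_total_bytes
    ):
        kept.pop(0)  # the oldest copy is always the first to go
    return kept


# 70 small copies: the count limit applies, so the 6 oldest are dropped.
by_count = prune([(f"c{i}", 1) for i in range(70)], max_total_bytes=1000)
print(len(by_count), by_count[0][0])

# 10 large copies: the size limit applies long before the count limit.
by_size = prune([(f"d{i}", 100) for i in range(10)], max_total_bytes=450)
print([label for label, _ in by_size])
```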

Prior to VSS, there was no standard way to produce uncorrupted snapshots of a volume. You had to repair corruptions due to interrupted writes using tools such as Chkdsk.exe. VSS prevents incomplete writes by enabling applications to flush partially committed data from memory.

End-User Recovery

Allowing your users to selectively restore their own data improves productivity and satisfaction. Instead of being dependent upon someone from IT to locate and mount the appropriate backup tape, the user just navigates to the DPM file share and downloads the lost information. End-user recovery also lowers operation costs since helpdesk and administrators aren’t involved. Various industry studies have found that more than 90 percent of all tape restores are for single files, and nearly the same percentage of cases are for files that are less than 14 days old. So it makes sense to let your users recover their own data.

End users can recover files by browsing through shadow copies on the DPM server, either by using Windows Explorer or by using the Recover Previous Version command on the Microsoft Office 2003 Tools menu (see Figure 2).

Figure 2 Folder Recovery

Backup Verification

Because DPM is disk based, it can take advantage of a redundant array of independent disks (RAID) for protection—unlike tape, with the potential to be a single point of failure. IT administrators can easily browse through the server running DPM to confirm that the shadow copy has been made, and they can check and verify the backup online. Being able to confirm the redundancy of data without continually restoring from tape may also benefit auditors doing compliance checks in regard to business continuity or disaster recovery planning.

DPM monitors its backup jobs to ensure that they complete without error. If an error is detected, DPM applies a two-stage error-correction process. First, DPM automatically validates the replica against the production server to ensure that the replication is consistent and has occurred as planned. Second, if inconsistencies between a data source and its replica are found, the fix-up activity resends the affected object or objects from the data source to the replica.
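The validate-then-fix-up sequence can be sketched as follows. The article does not say how DPM compares a replica with its source, so the use of SHA-256 digests here is an assumption, as are the function names `find_inconsistent` and `fix_up`. Sources and replicas are modeled as dictionaries mapping paths to file contents.

```python
import hashlib


def find_inconsistent(source: dict, replica: dict) -> list:
    """Stage 1: list source objects whose replica copy is missing or different."""
    bad = []
    for path, data in source.items():
        copy = replica.get(path)
        if copy is None or hashlib.sha256(copy).digest() != hashlib.sha256(data).digest():
            bad.append(path)
    return sorted(bad)


def fix_up(source: dict, replica: dict) -> list:
    """Stage 2: resend only the inconsistent objects from source to replica."""
    resent = find_inconsistent(source, replica)
    for path in resent:
        replica[path] = source[path]
    return resent


source = {"a.doc": b"alpha", "b.xls": b"beta", "c.ppt": b"gamma"}
replica = {"a.doc": b"alpha", "b.xls": b"BETA"}   # one stale copy, one missing
print(fix_up(source, replica))                    # objects that were resent
print(find_inconsistent(source, replica))         # nothing left to fix
```

The point of the two stages is that a detected error leads to resending only the affected objects rather than rerunning the whole backup.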

With tape-based systems, about 40 percent of the IT staff’s time is spent monitoring and correcting backup operations. DPM eliminates this chore because it automatically validates and reworks the replica to help ensure consistency with the production server.

Scheduling Backups

You can deploy DPM to perform a nightly synchronization and shadow copy; you can also schedule nearly continuous protection. The scheduling engine supports a wide range of protection options and frequencies.

In its default settings, DPM provides hourly synchronizations so that even complete server failure results in less than an hour of data loss. With shadow copy scheduled at regular points throughout the day, users can roll back to a previous copy of data that is less than two hours old—instead of being limited to the previous night’s tape. Other organizations might configure for less frequent synchronizations and perhaps only daily shadow copies. With 64 instances per volume, a once-a-day schedule would enable more than two months of recovery capability from disk.

DPM essentially eliminates having to schedule backup windows because it logs and replicates byte-level changes to the files on the production servers. This also makes backups more efficient. Because DPM captures changes as they happen instead of copying entire files, DPM backups place less load on production servers than conventional backup tools that have to copy entire files if even a single byte changes. (Note: the DPM administrator console provides a throttling feature that enables you to define the maximum amount of total bandwidth that the backup procedure can use.) Collapsing the backup window from hours to minutes also lets you use fewer servers than you’d need using slower tape backups.

DPM also provides an alternative to tape backup for servers that hold so many millions of files or terabytes of data that attempting to use a tape-based backup solution would not be practical. Full tape backups take more than 24 hours on such systems, and they are often unsuccessful due to the millions of files and folders that need to be cataloged. On these systems, even incremental backups are problematic, due simply to the sheer volume of files.

The IT department at Microsoft found that the tape-based solutions it tried with a 10-million-file server failed consistently because crawling through that directory structure took so long. They also found that as the backup length increases, so do the odds of encountering a drive failure or some other issue that causes the backup to fail. Because DPM captures the data changes as they occur, no crawling of the directory structure is needed. Using this approach, Microsoft was able to successfully protect these very large servers within the already existing backup window.

Selecting the Data

After you select the data you want to protect, you can plan how to collect the data in protection groups. A protection group is a set of volumes, folders, or shares under the same protection policy. Items included are called members. The protection policy contains both the synchronization schedule and the shadow-copy schedule.

The key consideration in creating protection groups is tolerance for data loss. Typically, data with a relatively low loss tolerance is in one protection group, data with a relatively high loss tolerance in another, and medium loss-tolerance data in another. A given protection group may contain members from different volumes and servers, although each protected object can be associated with only one protection group, and only one protection group can protect data sources on any single volume.
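The two membership rules above (each protected object belongs to only one protection group, and only one group can protect data sources on a given volume) lend themselves to a quick planning check. The sketch below is a hypothetical helper, not a DPM feature: group names, server names, and the `(server, volume, path)` member format are all invented for illustration.

```python
def validate_groups(groups: dict) -> list:
    """Return a list of rule violations for a proposed protection-group plan.

    `groups` maps a group name to a list of (server, volume, path) members.
    """
    problems = []
    member_owner = {}   # member -> first group that claimed it
    volume_owner = {}   # (server, volume) -> first group that claimed it
    for group, members in groups.items():
        for member in members:
            server, volume, _path = member
            prev = member_owner.setdefault(member, group)
            if prev != group:
                problems.append(f"{member} is in both {prev} and {group}")
            prev = volume_owner.setdefault((server, volume), group)
            if prev != group:
                problems.append(
                    f"volume {server}:{volume} is protected by both {prev} and {group}"
                )
    return problems


plan = {
    "low-loss-tolerance": [("FS01", "D:", r"\Finance"), ("FS02", "E:", r"\Legal")],
    "high-loss-tolerance": [("FS01", "D:", r"\Scratch")],  # same volume as Finance
}
for problem in validate_groups(plan):
    print(problem)
```

A group may span several volumes and servers, as the first group does here; what the check rejects is two groups reaching into the same volume.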

Although the complete data of a server can be easily restored through DPM, special steps are required to enable DPM to support recovery that includes restoring the operating system and system-state data. To accomplish this, DPM recommends that you back up the protected server using the backup feature of that server’s OS (meaning the backup utility included with Windows 2000 Server or Windows Server 2003).

You can use these backups to restore the server to a bootable state; store these backups on a volume that is added to a protection group on the DPM server. When you require a recovery, retrieve the system state data from the DPM server and write the data to restore media. You can then use the media with the appropriate backup tool to restore the system to a bootable state. Then you can restore protected data to the server using the normal DPM workflow.

Allocating Disk Space

Obviously, the amount of disk space that an organization allocates on the DPM server will directly affect how extensive its backup history will be. The set of available storage for replicas and shadow copies is known as the storage pool. DPM enables an organization to distribute a single server’s storage pool across multiple disks, adding more space to the pool when necessary.

The default DPM cumulative space allocation for a given protected volume includes the following:

  • A replica allocation of the smaller of 1.5 times the space used on the protected volume or the total volume capacity (with a minimum of 1.5GB). For a 10GB volume that is 40 percent full, this allocation equals 6GB.
  • A shadow-copy allocation of 20 percent of the size of the replica allocation (with a minimum of 550MB).
  • A transfer log allocation of 1.4 times the synchronization log space allocated on the protected server (with a minimum of 700MB).
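The default allocations listed above reduce to a few lines of arithmetic. The following Python sketch encodes them as stated in this article; the function name and the decimal unit conversion (550MB = 0.55GB, 700MB = 0.70GB) are assumptions, and the calculator that ships with DPM may round or convert differently.

```python
def default_allocation_gb(used_gb: float, capacity_gb: float, sync_log_gb: float) -> dict:
    """Default DPM space allocation for one protected volume, in GB."""
    # Replica: the smaller of 1.5 x used space or the volume capacity, at least 1.5GB.
    replica = max(min(1.5 * used_gb, capacity_gb), 1.5)
    # Shadow copies: 20 percent of the replica allocation, at least 550MB.
    shadow_copy = max(0.20 * replica, 0.55)
    # Transfer log: 1.4 x the synchronization log on the protected server, at least 700MB.
    transfer_log = max(1.4 * sync_log_gb, 0.70)
    return {
        "replica": replica,
        "shadow_copy": shadow_copy,
        "transfer_log": transfer_log,
        "total": replica + shadow_copy + transfer_log,
    }


# The article's example: a 10GB volume that is 40 percent full (4GB used).
# The 1GB synchronization log size is an invented input for illustration.
alloc = default_allocation_gb(used_gb=4.0, capacity_gb=10.0, sync_log_gb=1.0)
for name, value in alloc.items():
    print(f"{name}: {value:.2f} GB")
```

For the 10GB example the replica allocation comes out at 6GB, matching the figure in the text.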

In addition, DPM includes a calculator that can more precisely estimate the amount of space required to protect a given volume of data, and you can always manually specify space allocations either when creating the protection group or afterward.

Figure 3 shows actual DPM server utilization on the Microsoft corporate network. As you can see, the DPM server that protects the most branch offices is DPM #2, which is located in the Redmond datacenter and uses 1.55 terabytes of its allocated 3.69 terabytes of storage. The 12 datacenter DPM servers protect a total of 149 servers and hold 20.79 terabytes of data; the smallest DPM server has a 3-terabyte capacity, and the largest has a 10-terabyte capacity.

Figure 3 Group Utilization by DPM Server

Data Center  DPM Server  Branch Servers Protected  Disk Capacity (TB)  Disk Allocated (TB)  Disk Used (TB)
Dublin       DPM #1      9                         3.07                1.90                 0.97
Dublin       DPM #2      11                        6.14                4.52                 2.11
Dublin       DPM #3      13                        7.16                6.47                 3.37
Dublin       DPM #4      6                         5.11                4.90                 2.54
Dublin       DPM #5      8                         6.14                2.89                 2.03
Redmond      DPM #1      16                        4.09                3.16                 1.55
Redmond      DPM #2      22                        8.18                3.69                 1.55
Redmond      DPM #3      21                        10.23               6.19                 2.90
Redmond      DPM #4      20                        5.11                3.57                 1.56
Singapore    DPM #1      6                         3.99                2.22                 0.92
Singapore    DPM #2      6                         2.99                1.78                 0.90
Singapore    DPM #3      11                        2.99                1.62                 0.86
Totals:                  149                       63.76               41.98                20.79

Figure 4 shows the average number of files and the amount of total data stored on branch servers. The largest protected server holds more than 6 million files, and the average server has a disk capacity of 218GB.

Figure 4 Average Branch Office Server Workload

Metric                   Per Volume  Per Server  Per Domain
Count                    660         213         7
Average number of files  110,093     341,134     10,380,222
Maximum number of files  1,672,039   6,072,431   36,244,679
Average capacity in GB   70          218         6,629
Maximum capacity in GB   417         1,452       19,033

Analyzing the change workload—how much data on a protected volume or server actually changes between backups—underscores the value of DPM, which is based upon copying only data blocks that have changed. Whereas Figure 4 showed that the average protected server holds 218GB of data, Figure 5 shows that the average branch office server has a data change of only about 1.8GB per backup session.

Figure 5 Average Branch Office Change Workload per Server

Changes                  Per Volume  Per Server  Per Domain
Average change in GB     0.6         1.8         54
Longest changes in GB    31          32          201
Heavy day changes in GB  3.4         11          323
Maximum changes in GB    189         194         1,204
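Putting the per-server averages from Figures 4 and 5 side by side shows how small the change workload is relative to the data being protected. The short calculation below uses only numbers from those two figures; the function name is hypothetical.

```python
def change_fraction(change_gb: float, capacity_gb: float) -> float:
    """Fraction of a server's data that changes in one backup session."""
    return change_gb / capacity_gb


avg_capacity_gb = 218   # Figure 4, average capacity per server
avg_change_gb = 1.8     # Figure 5, average change per server
heavy_day_gb = 11       # Figure 5, heavy day changes per server

print(f"average session: {change_fraction(avg_change_gb, avg_capacity_gb):.2%}")
print(f"heavy day:       {change_fraction(heavy_day_gb, avg_capacity_gb):.2%}")
```

On an average day, less than 1 percent of a server's data has to cross the network, and even a heavy day involves only about 5 percent.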

Conclusion

DPM is a handy solution for organizations that need advanced data protection. DPM enables users to recover their own documents quickly and easily. Recoveries that once took hours, while IT staff located and mounted tapes from a library, can now be achieved in seconds. Because tape backups can easily be made from the DPM server, organizations can still use tapes for long-term and offsite storage, while eliminating the long backup windows that tape requires.

DPM is especially useful if you need to manage backup and restoration services for branch offices and other remote locations from a high-security centralized datacenter, while at the same time reducing the cost of local administration.

Best Practices for DPM

In addition to the suggestions made in this article, if you plan to deploy DPM, your organization can benefit from a number of specific best practices, including:

  • Checking deployment requirements.
  • Scheduling strategies for DPM backups.
  • Integrating with Microsoft Operations Manager (MOM).

Deployment Requirements

For an organization to deploy a DPM server, the server must be running the Standard or Enterprise Edition of Microsoft Windows Server 2003 or Windows Storage Server 2003, with SP1 or later.

In addition, your server must be a member of the same Active Directory domain as the file servers it protects and it should be an ordinary single-purpose server. The DPM server cannot be an Active Directory domain controller. Your server should be equipped with at least one logical volume (which may be made of multiple physical disks) and at least one additional unused disk.

Every file server that DPM will protect must have the DPM agent installed on it and must meet the following requirements.

First, the server must be running one of the following operating systems: Windows Server 2003 SP1 or later; Windows Storage Server 2003 SP1 or later; or Windows 2000 Server with both SP4 and the Windows 2000 update rollup installed. Second, the server must be a member of the same Active Directory domain as the DPM server that protects it.

DPM can be used to protect standalone file servers today. Protecting clustered file servers will be supported with SP1 for DPM, which is scheduled to release in 2006.

Scheduling Strategies

Scheduling DPM backups involves a tradeoff between how often you want to back up and how many days of backed-up data you want to retain, because there is a 64-image limit for data retention. Scheduling two jobs per day, for example, will allow storing about 32 days’ worth of protection. More days can be saved if data has not changed: DPM does not create a shadow copy if there have been no changes.

With 64 images available, the most common rotations for a five-day business week would provide:

  • One per business day provides more than three months of daily restores from disk.
  • Two per business day provides about six weeks, with recovery points at the beginning of each business day and at noon, resulting in a four-hour recovery point objective (RPO).
  • Three per business day (perhaps at 9:00 A.M., 12:00 P.M., and 3:00 P.M.) provides about a month of backups.
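The rotations above follow directly from dividing the 64-image limit by the number of shadow copies taken per business day. A short Python sketch of that arithmetic (the function name is hypothetical, and it assumes every scheduled shadow copy is actually created, which overstates consumption on days with no changes):

```python
MAX_IMAGES = 64  # shadow copy limit per volume


def retention(copies_per_business_day: int, business_days_per_week: int = 5):
    """Return (business days, weeks) of history that fit within the image limit."""
    business_days = MAX_IMAGES // copies_per_business_day
    weeks = business_days / business_days_per_week
    return business_days, weeks


for per_day in (1, 2, 3):
    days, weeks = retention(per_day)
    print(f"{per_day} per day: {days} business days, about {weeks:.1f} weeks")
```

The results line up with the list: roughly 13 weeks at one copy per day, a little over 6 weeks at two, and a little over 4 weeks at three.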

And when roll-back becomes the primary goal, perhaps the most common method is to synchronize hourly and initiate shadow copies at every even business hour (8:00 A.M., 10:00 A.M., 12:00 P.M., 2:00 P.M., and 4:00 P.M.). This gives two outcomes: if the primary server were to have a catastrophic failure, the DPM server would have data that is no more than one hour old, available for restoration. If data was deleted or overwritten, the DPM server would have a version no more than two hours old, available for easy and predictable restoration back to the last even hour.
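The one-hour and two-hour bounds can be checked with a small calculation. The sketch below assumes synchronization happens exactly on the hour and shadow copies at the even business hours listed above; the function names are hypothetical, and for simplicity it looks only at the current day (an incident before 8:00 A.M. would in practice fall back to the previous day's last shadow copy).

```python
from datetime import datetime
from typing import Optional

SHADOW_COPY_HOURS = (8, 10, 12, 14, 16)  # even business hours from the article


def latest_sync(incident: datetime) -> datetime:
    """Most recent hourly synchronization at or before the incident."""
    return incident.replace(minute=0, second=0, microsecond=0)


def latest_shadow_copy(incident: datetime) -> Optional[datetime]:
    """Most recent same-day shadow copy at or before the incident, if any."""
    candidates = [
        incident.replace(hour=h, minute=0, second=0, microsecond=0)
        for h in SHADOW_COPY_HOURS
    ]
    earlier = [c for c in candidates if c <= incident]
    return max(earlier) if earlier else None


incident = datetime(2006, 3, 1, 11, 45)
print("server failure, data as of:", latest_sync(incident))
print("file overwritten, roll back to:", latest_shadow_copy(incident))
```

For an incident at 11:45 A.M., the replica is current as of 11:00 A.M. and the most recent shadow copy is from 10:00 A.M., which is within the one-hour and two-hour bounds respectively.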

Integrating with MOM

DPM already provides reporting through SQL Reporting Services, which comes included with DPM, as well as notification via e-mail, alerts through the DPM interface, and Event Log and Performance Monitor statistics. But for more of an enterprise perspective, the DPM management pack for MOM 2005 allows you to centrally monitor data protection. You can check the state, health, and performance of multiple DPM servers from a MOM management server. Health and status information can be viewed and analyzed in the context of the health and status of the network and other services.

For more information about preparing to deploy DPM, see "Data Protection Manager Technical Overview".

Microsoft IT Showcase presents an inside view of the Microsoft IT process for developing, deploying, and managing Microsoft solutions—from Microsoft IT professionals to IT professionals—peer to peer. The resources they provide reveal how Microsoft uses technology to solve specific business problems. Find out more at Microsoft IT: Showcase.

Laura Euler runs the universe from an underground fortress in an undisclosed location. You can contact her at lauraeu@speakeasy.net.

© 2008 Microsoft Corporation and CMP Media, LLC. All rights reserved; reproduction in part or in whole without permission is prohibited.