Tuning replication performance in DFSR (especially on Win2008 R2)
Hi all, Ned here again. There are a number of ways that DFSR can be tuned for better performance. This article will go through these configurations and explain the caveats. Even if you cannot deploy Windows Server 2008 R2 - for the absolute best performance - you can at least remove common bottlenecks from your older environments. If you are really serious about performance in higher node count DFSR environments though, Win2008 R2’s 3rd generation DFSR is the answer.
If you’ve been following DFSR for the past few years, you already know about some improvements that were made to performance and scalability starting in Windows Server 2008:
|Windows Server 2003 R2||Windows Server 2008|
|Multiple RPC calls||RPC Async Pipes (when replicating with other servers running Windows Server 2008)|
|Synchronous inputs/outputs (I/Os)||Asynchronous I/Os|
|Buffered I/Os||Unbuffered I/Os|
|Normal Priority I/Os||Low Priority I/Os (this reduces the load on the system as a result of replication)|
|4 concurrent file downloads||16 concurrent file downloads|
But there’s more you can do, especially in 2008 R2.
All registry values are REG_DWORD (and in the explanations below, are always in decimal). All registry tuning for DFSR in Win2008 and Win2008 R2 is made here:
A restart of the DFSR service is required for the settings to take effect, but a reboot is not required. The list below is not complete, but instead covers the important values for performance. Do not assume that setting a value to the max will make it faster; some settings have a practical limitation before other bottlenecks make higher values irrelevant.
Important Note: None of these registry settings apply to Windows Server 2003 R2.
AsyncIoMaxBufferSizeBytes
Default value: 2097152
Possible values: 1048576, 2097152, 4194304, 8388608
Tested high performance value: 8388608
Set on: All DFSR nodes
RpcFileBufferSize
Default value: 262144
Possible values: 262144, 524288
Tested high performance value: 524288
Set on: All DFSR nodes
StagingThreadCount
Default value: 6
(Win2008 R2 only; cannot be changed on Win2008)
Possible values: 4-16
Tested high performance value: 8
Set on: All DFSR nodes. Setting to 16 may generate too much disk IO to be useful.
TotalCreditsMaxCount
Default value: 1024
Possible values: 256-4096
Tested high performance value: 4096
Set on: All DFSR nodes that are generally inbound replicating (so hubs if doing data collection, branches if doing data distribution, all servers if using no specific replication flow)
UpdateWorkerThreadCount
Default value: 16
Possible values (Win2008): 4-32
Possible values (Win2008 R2): 4-63*
Tested high performance value: 32
Set on: All DFSR nodes that are generally inbound replicating (so hubs if doing data collection, branches if doing data distribution, all servers if using no specific replication flow). Raising this number is only valuable when replicating in from more servers than the current value. For example, if replicating in from 32 servers, set to 32; if replicating in from 45 servers, set to 45.
*Important note: The actual top limit is 64. We have found that under certain circumstances though, setting to 64 can cause a deadlock that prevents DFSR replication altogether. If you exceed the maximum tested value of 32, set to 63 or lower. Do not set to 64 ever. The 32 max limit is recommended because we tested it carefully, and higher values were not rigorously tested. If you set this value to 64, periodically replication will stop working, the dfsrdiag replstate command hangs and does not return results, and the dfsrdiag backlog command hangs and does not return results.
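Before you push values out to a pile of servers, it can help to sanity check them against the ranges listed above. Here is a minimal Python sketch that does only that; it does not touch the registry. The function names and the `win2008_r2` flag are made up for illustration, and keeping the default of 16 when there are fewer than 16 inbound partners is my assumption, not a tested recommendation.

```python
# Documented ranges from the list above; all values are decimal REG_DWORDs.
# Each entry is either a set of allowed values or an inclusive (low, high) range.
DOCUMENTED = {
    "AsyncIoMaxBufferSizeBytes": {1048576, 2097152, 4194304, 8388608},
    "RpcFileBufferSize": {262144, 524288},
    "StagingThreadCount": (4, 16),        # Win2008 R2 only
    "TotalCreditsMaxCount": (256, 4096),
    "UpdateWorkerThreadCount": (4, 63),   # Win2008 R2; 4-32 on Win2008
}


def check_setting(name, value, win2008_r2=True):
    """Return a list of problems with a proposed value (empty list means OK)."""
    if name not in DOCUMENTED:
        return [f"{name}: not one of the values covered here"]
    problems = []
    allowed = DOCUMENTED[name]
    if name == "StagingThreadCount" and not win2008_r2:
        problems.append("StagingThreadCount cannot be changed on Win2008")
    if name == "UpdateWorkerThreadCount" and not win2008_r2:
        allowed = (4, 32)
    if isinstance(allowed, tuple):
        low, high = allowed
        if not low <= value <= high:
            problems.append(f"{name}={value} is outside {low}-{high}")
    elif value not in allowed:
        problems.append(f"{name}={value} is not one of {sorted(allowed)}")
    return problems


def suggest_update_worker_threads(inbound_partners, win2008_r2=True):
    """Match the thread count to the inbound partner count, within limits.

    Assumption: with fewer partners than the default of 16 there is nothing
    to gain, so the default is kept. 64 is never returned (deadlock risk).
    """
    cap = 63 if win2008_r2 else 32
    return max(16, min(inbound_partners, cap))


print(check_setting("UpdateWorkerThreadCount", 64))
print(suggest_update_worker_threads(45))
```

The cap of 63 mirrors the warning above about 64; anything over 32 is still outside what was rigorously tested.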
When using all the above registry tuning on Windows Server 2008 R2, testing revealed that initial sync replication time was sometimes twice as fast compared to no registry settings in place. This was using 32 servers replicating a "data collection" topology to a single hub over thirty-two non-LAN networks with 32 RG's containing unique branch office data. The slower the network, the better the relative performance averaged:
|Test||Spokes||Hubs||Topology||GB/node||Unique||RG||Tuned||Network||Time to sync|
On Windows Server 2008 the same registry values showed considerably less performance improvement; this is partly due to additional service improvements made to DFSR in Win2008 R2, especially around the Credit Manager. Just like your phone, “3G” DFSR is going to work better than older models…
Note: do not use this table to predict replication times. It is designed to show behavior trends only!
Even if you are not using Windows Server 2008 R2 there are plenty of other factors to fast replication. Some of these I’ve talked about before, some are new. All are important:
- Minimize mixing of Win2003 and Win2008/Win2008 R2 - Windows Server 2008 introduced significant DFSR changes for RPC, inbound and outbound threading, and other aspects. However, if a Win2008 server is partnered with a Win2003 server for DFSR, most of those improvements are disabled for backwards compatibility. An ideal environment is 100% Windows Server 2008 R2, but a Win2008-only environment is still a huge improvement. Windows Server 2003 should be phased out of use as quickly as possible as it has numerous "1G" design issues that were improved on with experience in later OS's. Windows Server 2008 R2 credit manager and update worker improvements are most efficient when all operating systems are homogeneous. If you are replacing Win2003 servers with a newer OS, do the hub servers first, as the increased number of files that can replicate concurrently will provide some benefits even when talking to 2003 spokes.
- Consider multiple hubs - If using a large number of branch servers in a hub-and-spoke topology, adding “subsidiary hub” servers will help reduce load on the main hubs. So for example, this configuration would cause more bottlenecking:
And this configuration would cause less bottlenecking:
- Increase staging quota - The larger the replicated folder staging quotas are for each server, the less often files must be restaged when replicating inbound changes. In a perfect world, staging quota would be configured to match the size of the data being replicated. Since this is typically impossible, it should be made as large as reasonably possible. On Win2008 and Win2008 R2 it must always be at least as large as the combined size of the N largest files in the replicated folder, where N is UpdateWorkerThreadCount + 16. Why 16? Because that is the number of outbound files that could be replicated simultaneously.
This means that by default on Win2008/Win2008 R2, quota must be as large as the 32 largest files. If UpdateWorkerThreadCount is increased to 32, it must be as large as the 48 largest files (32+16). If it is any smaller, staging can become blocked when all 32 files are being replicated inbound and 16 outbound, preventing further replication until that queue is cleared. Frequent 4202 and 4204 staging events are indications of an inappropriately configured staging quota, especially if no longer in the initial sync phase of setting up DFSR for the first time.
Source : DFSR
Category : None
Event ID : 4202
Type : Warning
The DFS Replication service has detected that the staging space in use for the replicated folder at local path c:\foo is above the high watermark. The service will attempt to delete the oldest staging files. Performance may be affected.
Source : DFSR
Category : None
Event ID : 4204
Type : Information
The DFS Replication service has successfully deleted old staging files for the replicated folder at local path c:\foo. The staging space is now below the high watermark.
If you get 4206 staging events you have really not correctly sized your staging, as you are now blocking replication behind large files.
Event Type: Warning
Event Source: DFSR
Event Category: None
Event ID: 4206
Time: 3:57:21 PM
The DFS Replication service failed to clean up old staging files for the replicated folder at local path c:\foo. The service might fail to replicate some large files and the replicated folder might get out of sync. The service will automatically retry staging space cleanup in 1 minutes. The service may start cleanup earlier if it detects some staging files have been unlocked.
If still using Win2003 R2, staging quota would need to be as large as the 9 largest files. And if using read-only replication on Windows Server 2008 R2, it needs to be at least as large as the 16 largest files, or however many UpdateWorkerThreadCount specifies; after all, a read-only replicated folder has no outbound replication.
So to recap the staging quota minimum recommendations:
- Windows Server 2003 R2: 9 largest files
- Windows Server 2008: 32 largest files (default registry)
- Windows Server 2008 R2: 32 largest files (default registry)
- Windows Server 2008 R2 Read-Only: 16 largest files
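That recap boils down to a small piece of arithmetic. Here is a minimal Python sketch of it, in case you want to fold the rule into your own sizing scripts. The function name and the OS labels ("2003R2", "2008", "2008R2") are made up for illustration; the numbers come straight from the list above.

```python
def min_staging_file_count(os_version, update_worker_threads=16, read_only=False):
    """Number of largest files the staging quota must be able to hold.

    os_version: "2003R2", "2008", or "2008R2" (labels invented for this sketch).
    update_worker_threads: the UpdateWorkerThreadCount value in effect.
    """
    if os_version == "2003R2":
        return 9  # fixed; the registry tuning does not apply to this OS
    if os_version not in ("2008", "2008R2"):
        raise ValueError(f"unknown os_version: {os_version!r}")
    if read_only:
        if os_version != "2008R2":
            raise ValueError("read-only replication requires Win2008 R2")
        return update_worker_threads  # inbound only, nothing goes outbound
    return update_worker_threads + 16  # inbound threads plus 16 outbound files


print(min_staging_file_count("2008"))
print(min_staging_file_count("2008R2", update_worker_threads=32))
print(min_staging_file_count("2008R2", read_only=True))
```

Remember these are minimums. Bigger is still better.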
If you want to find the 32 largest files in a replicated folder, here’s a sample PowerShell command:
Get-ChildItem <replicatedfolderpath> -recurse | Sort-Object length -descending | select-object -first 32 | ft name,length -wrap -auto
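Listing the largest files is only half the job; you still have to add them up. Here is a rough Python sketch that totals the N largest files under a folder and rounds up to whole megabytes. The function names are mine, and I am assuming you want the answer in MB and that 1 MB means 1024 x 1024 bytes. Treat the result as a floor, not a target.

```python
import heapq
import os


def largest_file_sizes(root, count):
    """Return the sizes in bytes of the `count` largest files under root."""
    sizes = []
    for folder, _dirs, files in os.walk(root):
        for name in files:
            try:
                sizes.append(os.path.getsize(os.path.join(folder, name)))
            except OSError:
                continue  # file vanished or is unreadable; skip it
    return heapq.nlargest(count, sizes)


def min_staging_quota_mb(root, count=32):
    """Smallest quota, in whole MB, that holds the `count` largest files."""
    total_bytes = sum(largest_file_sizes(root, count))
    return -(-total_bytes // (1024 * 1024))  # ceiling division
```

Call it with the replicated folder path and the file count that applies to your OS and registry settings, for example `min_staging_quota_mb(path, 48)` if UpdateWorkerThreadCount is 32.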
- Consider read-only - Deploy Windows Server 2008 R2 read-only replication when possible. If users are not supposed to change data, mark those replicated folders as read-only. A read-only server cannot originate data and will prevent unwanted replication or change orders from occurring outbound to other servers. Unwanted changes generate load and lead to data overwrites – which to fix you will need to replicate back out from backups, consuming time and replication resources.
- Latest QFE and SP - Always run the latest service pack for that OS, and the latest DFSR.EXE/DFSRS.EXE for that OS. There are also updates for NTFS and other components that DFSR relies on. Hotfixes have been released that remove performance bugs or make DFSR more reliable; a more reliable DFSR is naturally faster too. These are documented in KB968429 and KB958802 but the articles aren’t always perfectly up to date, so here’s a trick: If you want to find the latest DFSR service updates, use these searches and look for the highest KB number in the results:
Win2008 R2: https://www.bing.com/search?q=%22windows+server+2008+r2%22+%22dfsrs.exe%22+kbqfe+site%3Asupport.microsoft.com&go=&form=QBRE
Win2003 R2: https://www.bing.com/search?q=%22windows+server+2003+r2%22+%22dfsr.exe%22+kbqfe+site%3Asupport.microsoft.com&form=QBRE&qs=n
Remember, Win2003 mainstream support ends July 13, 2010. That’s the end of non-security updates for that OS.
People ask me all the time why I take such a hard line on DFSR hotfixes. I ask in return “Why don’t you take such a hard line?” These fixes cost us a fortune, we’re not writing them for our health. And that goes for all other components too, not just DFSR. It’s an issue intrinsic to all software. DFSR is not less reliable than many other Windows components – after all, NTFS is considered an extremely reliable file system but that hasn’t stopped it from having 168 hotfixes in its lifetime; DFSR just has a passionate group of Support Engineers and developers here at MS that want you to have the best experience.
- Consider and test anti-virus exclusions – Most anti-virus software has no concept of the data types that make up DFSR’s working files and database. Additionally, those file types are not executables and are therefore very unlikely to contain a useful malicious payload. If you are seeing slow performance within DFSR, test the following anti-virus file exclusions; if DFSR performs considerably better, contact your AV vendor for an updated version of their software and an explanation around the performance gap.
<drive>:\system volume information\DFSR\
<drive>:\system volume information\DFSR\database_<guid>\
<drive>:\system volume information\DFSR\config\
This should be validated carefully; many anti-virus products allow exclusions to be set but then do not actually abide by them. For maximum performance, you would exclude scanning any replicated files at all, but this is obviously unfeasible for most customers.
- Pre-seed the data when setting up a new replicated folder - Pre-seeding - often referred to as "pre-staging" - data on servers can lead to huge performance gains during initial sync. This is especially useful when creating new branch office servers; if they are being built in the home office, they can be quickly pre-seeded with data then sent out to the field for replication of the change delta. See the following article for pre-seeding recommendations. I have an updated version of it in the works too.
Going back to those same tests I showed earlier with 32 spokes replicating back to a single hub, note the average performance behavior when the data was perfectly pre-seeded:
|Test||Spokes||Hubs||Topology||GB/node||Unique||RG||Tuned||Staging||Net||Time to sync|
Even the 64Kbps frame relay connection was nearly as fast as the LAN! This is because no files had to be sent, only file hashes.
Note: do not use this table to predict replication times. It is designed to show behavior trends only.
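To see why perfectly pre-seeded data syncs so quickly, picture the comparison that has to happen: if the hashes match, the file stays put and only the hash crosses the wire. The Python sketch below illustrates that idea and nothing more. DFSR computes its own file hashes; I have swapped in a plain SHA-256 over file contents as a stand-in, and the function names are invented for the example, so do not use this to judge whether DFSR itself will consider two files identical.

```python
import hashlib
import os


def content_hashes(root):
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    hashes = {}
    for folder, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            digest = hashlib.sha256()
            with open(path, "rb") as fh:
                # Read in 1 MB chunks so large files do not fill memory.
                for chunk in iter(lambda: fh.read(1024 * 1024), b""):
                    digest.update(chunk)
            hashes[os.path.relpath(path, root)] = digest.hexdigest()
    return hashes


def files_needing_transfer(source_root, preseeded_root):
    """Files on the source that are missing or different on the pre-seeded copy."""
    source = content_hashes(source_root)
    target = content_hashes(preseeded_root)
    return sorted(p for p, h in source.items() if target.get(p) != h)
```

An empty list from `files_needing_transfer` is the "perfectly pre-seeded" case: nothing but hashes would need to move.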
- Go native Windows Server 2008 R2 – Not to beat a dead horse but the highest performance gains - including registry tuning and the greatly improved Credit Manager code - will be realized by using Windows Server 2008 R2. Win2003 R2 was first generation DFSR, Win2008 was second generation, and Win2008 R2 is third generation; if you are serious about performance you must get to 2008 R2.
- Use 64-bit OS with as much RAM as possible on hubs - DFSR can become bound by RAM availability on busy hub servers, especially when using the registry performance values above. There is absolutely no reason to run a 32-bit file server in this day and age, and with the coming of Windows Server 2008 R2, it’s no longer possible. For spoke servers that tend to have far less load, you can cut more corners of course; the ten-user sales team in Hicksville doesn’t need 16GB of RAM in their file server.
As a side note, customers periodically open cases to report “memory leaks” in DFSR. What we discuss is that DFSR intentionally caches as much RAM as it can get its hands on – really though, it’s the ESE (Jet) database doing this. So the idler other processes on a DFSR server are, the more memory a DFSR process will be able to gobble up. You can see the same behavior in LSASS’s database on DC’s.
- Use the fastest disk subsystem you can afford on hubs - Much of DFSR will be disk bound - especially in staging and RDC operations - so high disk throughput will dramatically lower bottlenecks; this is especially true on hub servers. As always, a disk queue length greater than 2 in PerfMon is an indication of an over-used or under-powered disk subsystem. Talk to your hardware vendors about the performance and cost differences of SATA, SCSI, and FC. Don’t forget about reliability too – I have a job here for life thanks to all the customers that use the least expensive, off-brand, no warranty, low parity, practically consumer-grade iSCSI products they can find. You get what you pay for and ultimately your users do not care about anything but their data. The OS is just a thing to make applications access files so that the business can make money. Someday the Linux desktop folks will figure this out and get some applications; then we may actually be in trouble here.
If using iSCSI, make sure you have redundant network paths to the disks, using multiple switches and NIC’s. We have had quite a few cases lately of no fault tolerance iSCSI configs that would go down for hours in the middle of DFSR updating the database and transaction logs, and the results were obviously not pretty.
- Use reliable networks - They don't necessarily have to be fast, but they do need to stay up. Many DFSR performance issues are caused by using old network card drivers, using malfunctioning "Scalable Network" (TCP offload, RSS, etc.) settings, or using defective WANs. Network card vendors release frequent driver updates to increase performance and resolve problems; just like Windows service packs, the drivers should be installed to improve reliability and performance. Companies often deploy cost saving WAN solutions (with VPN tunnels, frame relay circuits, etc.) that in the end cost the company more in lost productivity than they ever saved in monthly expense. DFSR - like all RPC applications - is sensitive to constant network instability.
- Review our performance tuning guides – For much more detail on squeezing performance out of your hardware, including network, storage, and the rest, review:
And that’s it.
- Ned “fork” Pyle