ContentSync – a content-based file copy tool

Suppose you need to sync the contents of two large folders, Source and Destination. Normally robocopy *.* Source Destination /MIR does the job. However even if the byte content of a file didn’t change, but the timestamp did, robocopy will copy the file and change the timestamp of the destination to match the source.

This is the safe and expected behavior, however there are whole classes of tools that track file modification based off the timestamp alone. For instance, MSBuild will trigger a cascading rebuild of all dependencies of a file if its timestamp has changed (even if the actual file contents is exactly the same). This is called overbuilding. A scalable build system should detect that the file contents didn’t change and avoid doing any work in this case.

Or take another example, suppose you need to upload hundreds of thousands of files to Azure using MSDeploy. If the timestamp on those files has changed, MSDeploy will upload those files even if the actual content is the same.

In general, if you’re deciding whether a file was modified based off the timestamp, you’re bound to schedule unnecessary work that could have been avoided if you checked whether the actual file bytes have changed.

Long story short, I wrote an open-source tool to sync/mirror directories based on the file content, not the timestamp:

https://github.com/KirillOsenkov/ContentSync

Usage: ContentSync.exe <Source> <Destination>

I needed this for the exact Azure website deployment scenario: I maintain https://source.roslyn.io, and we’ve recently moved from on-premises Microsoft hosting to Azure. I use https://github.com/KirillOsenkov/SourceBrowser to regenerate the website every night, and that’s 160,000 modified files totaling over 1 GB. You don’t want to upload that to Azure every night.

So I inserted an intermediate step into my deployment script – I ContentSync from the freshly built Index folder (it’s not incremental, all the files are generated from scratch every time) into a Staging folder from the last time. ContentSync doesn’t touch the files that haven’t actually changed (and that’s 99.9…%), and so MSDeploy doesn’t upload them since the timestamp is the same.

While I’m here, kudos to David Ebbo and his excellent https://github.com/davidebbo/WAWSDeploy that I use to simply deploy a folder to Azure (read more here: https://blog.davidebbo.com/2014/03/WAWSDeploy.html). It uses MSDeploy internally and hides all the details I don’t want to know about behind a super simple UX.