MSF V2 DeepDive – Batching Directory and Batch Files
This post continues the previous one explaining the new memory-based batching feature for database providers in MSF V2; please refer to MSF V2 CTP2 Deep Dive – Memory Based Batching. This is probably a good time to mention that the final version of MSF V2 has been released to the web and can be downloaded from https://msdn.microsoft.com/en-us/sync/default.aspx.
Database providers chunk up changes by spooling them to the file system. These chunks, or batch files, are later transferred to the remote provider, which applies them in the correct order. At any given point the runtime will be processing at most one batch file. Notice the emphasis on runtime: any user-specified event handler operating on the DataSet property can result in more than one batch file being live in memory.

Batches are spooled to the file system by the enumerating provider, which uses its RelationalSyncProvider.BatchingDirectory property to determine the base directory to use. For each sync the runtime creates a unique directory inside this base directory, with a name of the form Sync_XXXXXXXXXXXX. The directory name is unique to the two providers currently synchronizing and does not change for subsequent syncs. This allows the runtime to detect “failed” synchronization attempts so it can resume from the failure point. More on the “Sync Resume” feature later.

Inside that directory the runtime spools one SyncBatchHeaderFile.sync file and one or more .batch files. The .sync file contains metadata on the current sync session, including key information such as the Version, MadeWithKnowledge, and DestinationKnowledge. The .batch files contain the raw change data that needs to be applied on the destination; each is simply a binary-serialized version of the DbSyncBatchInfo type, which holds both the actual data and the metadata corresponding to that data. For faster access to the metadata, the runtime serializes the metadata separately from the data. Here is the definition of the DbSyncBatchInfo type.
public class DbSyncBatchInfo : IDisposable
{
    public DbSyncBatchInfo();
    public long DataCacheSize { get; set; }
    public DataSetSurrogate DataSetSurrogate { get; set; }
    public string Id { get; set; }
    public bool IsLastBatch { get; set; }
    public uint SequenceNumber { get; set; }
    public Version Version { get; set; }
    public void Dispose();
    protected virtual void Dispose(bool cleanup);
    public byte[] GetLearnedKnowledge();
    public void SetLearnedKnowledge(byte[] knowledgeBytes);
    public override string ToString();
}
The actual data is contained in the DataSetSurrogate property, while the remaining properties are metadata for that data. Some key metadata items are:
- Version – Batching version of the provider that generated this batch. This enables the destination runtime to apply versioning rules when consuming batches generated from an older version.
- Id – Unique id of this batch
- SequenceNumber – Used by the destination to ensure that batches are applied in the same order they were generated, even if the batch files arrive at the destination out of order.
- DataCacheSize – Represents the deserialized size of the data.
With the RTM version we exposed this type publicly. This change was made to accommodate feedback from users who wanted to override the default BinarySerializer or, in some cases, move off DataSet entirely as a data transfer medium. With this type public, users can now easily deserialize any .batch file and layer their customizations on top. Each batch file contains two binary objects: the DbSyncBatchInfo instance without the actual data, followed by the serialized DataSetSurrogate. Users customizing this part of the runtime should pay attention to this format. I have attached a simple factory class DbSyncBatchInfoFactory.cs that can serialize and deserialize any batch file.
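The real files are produced with the .NET binary serializer over DbSyncBatchInfo and DataSetSurrogate, which I won't reproduce here; the sketch below uses BinaryWriter/BinaryReader and a hypothetical BatchHeader stand-in purely to illustrate the same two-part layout idea — metadata first, so it can be read quickly on its own, with the raw data after it:

```csharp
using System.IO;

// Hypothetical stand-in for the metadata half of a batch file; the real
// format serializes a DbSyncBatchInfo (minus its data) followed by the
// DataSetSurrogate.
public class BatchHeader
{
    public string Id;
    public uint SequenceNumber;
    public bool IsLastBatch;
    public long DataCacheSize;
}

public static class BatchFileSketch
{
    // Write the metadata first, then the raw change data.
    public static void Write(string path, BatchHeader header, byte[] data)
    {
        using (var w = new BinaryWriter(File.Create(path)))
        {
            w.Write(header.Id);
            w.Write(header.SequenceNumber);
            w.Write(header.IsLastBatch);
            w.Write(header.DataCacheSize);
            w.Write(data.Length);
            w.Write(data);
        }
    }

    // Read only the metadata; the (possibly large) data payload is not read.
    public static BatchHeader ReadHeader(string path)
    {
        using (var r = new BinaryReader(File.OpenRead(path)))
        {
            return new BatchHeader
            {
                Id = r.ReadString(),
                SequenceNumber = r.ReadUInt32(),
                IsLastBatch = r.ReadBoolean(),
                DataCacheSize = r.ReadInt64(),
            };
        }
    }
}
```

This mirrors why the runtime serializes metadata separately: a consumer can inspect a batch's id, sequence number, and size without paying to deserialize the change data itself.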
Resuming Sync Session
I mentioned earlier that the runtime uses a unique, reproducible directory name for storing batch files whenever it enumerates changes for a provider. This is so the runtime can attempt to resume an earlier failed sync session. Sync sessions can be interrupted for many reasons, but the most common one users face is transient network disconnection, quite a common issue for mobile users. Previously, users had to restart a sync session from scratch whenever connectivity dropped mid-session, which meant downloading/uploading all changes again. It also meant the remote SQL Server wasted precious CPU time enumerating the same changes over and over.
To avoid re-enumerating changes, the enumerating provider checks whether any batch files exist before spawning a new “select changes” query. If any batch files for the current remote provider exist, the runtime inspects them to see if they can be reused. Several factors determine batch reusability, such as the state of the enumeration and the destination knowledge. The runtime uses the metadata from SyncBatchHeaderFile.sync to check whether the existing batch files are relevant for the current state of the destination. If they are, enumeration picks up from where the older sync left off; if a table received changes between the failed sync session and the restart, only those changes are picked up. If the runtime determines that the existing batch files are not relevant (the destination got changes from a different peer, one of the batch files is corrupt or missing, or the enumeration was incomplete the first time), all batch files are deleted and a fresh enumeration query is launched.
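The reusability decision boils down to a conjunction of checks. This is only my sketch of the logic described above — the method and parameter names are hypothetical, not Sync Framework APIs:

```csharp
public static class ResumeCheck
{
    // Returns true only when every precondition for reusing the spooled
    // batches holds; any failed check forces deletion of the old batch
    // files and a fresh enumeration query.
    public static bool CanReuseSpooledBatches(
        bool headerFileReadable,          // SyncBatchHeaderFile.sync is intact
        bool allBatchFilesPresent,        // no .batch file corrupt or missing
        bool enumerationCompleted,        // the earlier enumeration finished
        bool destinationKnowledgeMatches) // destination hasn't synced elsewhere
    {
        return headerFileReadable
            && allBatchFilesPresent
            && enumerationCompleted
            && destinationKnowledgeMatches;
    }
}
```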
This batching directory is usually cleaned up by the runtime after it successfully applies all changes. Users can override this behavior by setting RelationalSyncProvider.CleanupBatchingDirectory property to false.
Batch File Cleanup in the Mid-Tier
If you are using a mid-tier to communicate with SQL Server, you will need a background job that periodically scans the batching folder and deletes any files older than a certain age. This is necessary because the runtime cleans up batch files only on the destination provider, and only after a successful change application (to enable the sync resume feature).
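As a sketch of such a job — this is my own helper, not part of the Sync Framework, and it assumes the per-session folders follow the Sync_ naming convention described earlier:

```csharp
using System;
using System.IO;

public static class BatchDirectoryJanitor
{
    // Deletes per-session batching folders that have not been written to
    // for longer than maxAge. "baseDir" is whatever you assigned to
    // RelationalSyncProvider.BatchingDirectory on the mid-tier.
    public static int CleanStaleSessions(string baseDir, TimeSpan maxAge)
    {
        int deleted = 0;
        foreach (var dir in Directory.GetDirectories(baseDir, "Sync_*"))
        {
            if (DateTime.UtcNow - Directory.GetLastWriteTimeUtc(dir) > maxAge)
            {
                Directory.Delete(dir, recursive: true);
                deleted++;
            }
        }
        return deleted;
    }
}
```

Pick a maxAge comfortably longer than your longest expected sync-plus-retry window, so the janitor never deletes batches that a client might still resume against.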
Anti-Virus tools and Batching
One note of caution for users running real-time anti-virus scanning tools: please exclude the batching directory from real-time scans so that there is no file contention between the sync runtime and the virus scanner.
Maheshwar Jayaraman
Technorati Tags: DbSyncProvider,Sync Framework,Sync Services For ADO.NET