Replication: An error occurred while reading the .bcp data file for the 'MSmerge_rowtrack' article

Problem:

I have a random issue with a few of my Compact Edition subscribers.  The issue that occurs is that the synchronization fails to complete with the following error message:

An error occurred while reading the .bcp data file for the 'MSmerge_rowtrack' article. If the .bcp file is corrupt, you must regenerate the snapshot before initializing the Subscriber. (Source: MSSQL_REPL, Error number: MSSQL_REPL-2147199428)
Get help: https://help/MSSQL_REPL-2147199428

The merge process was unable to deliver the snapshot to the Subscriber. If using Web synchronization, the merge process may have been unable to create or write to the message file. When troubleshooting, restart the synchronization with verbose history logging and specify an output file to which to write. (Source: MSSQL_REPL, Error number: MSSQL_REPL-2147201001)
Get help: <https://help/MSSQL_REPL-2147201001>

This is on SQL Server 2005 SP3 build 9.00.4211

***Please note*** :

If you are experiencing this issue and your version of SQL Server is prior to SP3 (<9.00.4035), please upgrade your SQL Server build to build 9.00.4211 (SP3 + CU2) as you maybe running into this issue:
https://support.microsoft.com/kb/962003/

Cause:

The cause of the problem is that when the dynamic snapshot for a partition is written to the snapshot folder, it doesn't delete the existing MSmerge_xxx files. The snapshot agent rather opens the existing file, writes the new content into it over the existing content, and then saves it to disk.

This works well as long as the new content is larger than the existing content; if it is smaller, you'll effectively get the new content plus the left-over stuff from the previous version.

If you run the partition snapshot agent repeatedly, you might even end up with several corruptions, depending on the amount of current payload the file will carry. A graphical description of consecutive file versions could be:

  version1: 11111111111111111111   (payload 20 KB, total size 20 KB)
  version2: 22222222222221111111   (payload 13 KB, total size 20 KB)
  version3: 33333333222221111111   (payload 8 KB, total size 20 KB)
  version4: 44444444444221111111   (payload 11 KB, total size 20 KB)
  version5: 55555555555555555111   (payload 17 KB, total size 20 KB)

For reference:

"No key matching the described characteristics could be found within the current range"

This error message is different though the underlying cause may be the same:

    1-  On the publisher, delete rows that were previously changed
(metadata rows get moved from MSmerge_contents into MSmerge_tombstone so that the MSmerge_contents table has less rows)
    2-  Mark a subscription for reinit
    3-  Run the partition snapshot agent to generate new snapshot files.

Because MSmerge_contents has less rows, the BCP file gets corrupted.

Solution/Workaround:

Workaround 1:

Change the dynamic snapshot job, add a step to remove its folder before execution
This would be the best approach if you have a relatively fixed set of partitions, with a limited total number of partitions.
If however the number of partitions is changing a lot, or if you have a high number of different partitions, this workaround causes too much administrative overhead.

Workaround 2:

Execute the general full snapshot job daily, before a daily run of all dynamic snapshot jobs. This approach is best if you have some kind of maintenance window that allows for the execution of the jobs. You could set the job schedules accordingly and won't have to change the individual job definition.

The general snapshot will trigger the removal of all the partition snapshot folders. Execution of the general snapshot won't have much impact on the publisher database, but the creation of the partition snapshots might cause performance issues (I/O, locks, blocking). Needs to be tested accordingly during normal user application work.

Workaround 3:

Execute the general full snapshot job daily. Allow subscriptions to create a partition snapshot on demand when a new snapshot is required.
Similar to Workaround 2, but the partition snapshots are only executed when needed. This is done by the publication option:

    sp_addmergepublication @allow_subscriber_initiated_snapshot = N'true'

which you may already have enabled. Possible disadvantage is that you can't 100% predict the time when the snapshot is generated and its impact on the publisher database (I/O, locks, blocking) at that time.

Workaround 4:

Create a job that removes all dynamic snapshot folders. Execute it daily before a daily run of all dynamic snapshot jobs.
This would require some kind of coding, in order to retrieve the snapshot path names from the system tables or the snapshot folder.
Possible disadvantage is that the creation of the partition snapshots might cause performance issues (I/O, locks, blocking), might require execution during a maintenance window.