Deduplication in eDiscovery search results
This article describes how deduplication of eDiscovery search results works and explains the limitations of the deduplication algorithm.
When using eDiscovery tools to export the results of an eDiscovery search, you have the option to deduplicate the results that are exported. What does this mean? When you enable deduplication (by default, deduplication isn't enabled), only one copy of an email message is exported even though multiple instances of the same message might have been found in the mailboxes that were searched. Deduplication helps you save time by reducing the number of items that you have to review and analyze after the search results are exported. But it's important to understand how deduplication works and be aware that there are limitations to the algorithm that might cause a unique item to be marked as a duplicate during the export process.
The information in this article is applicable when exporting search results using one of the following eDiscovery tools:
- Content search in the Microsoft Purview compliance portal
- In-Place eDiscovery in Exchange Online
- The eDiscovery Center in SharePoint Online
How duplicate messages are identified
eDiscovery tools use a combination of the following email properties to determine whether a message is a duplicate:
- InternetMessageId - This property specifies the Internet message identifier of an email message, which is a globally unique identifier that refers to a specific version of a specific message. This ID is generated by the sender's email client program or host email system that sends the message. If a person sends a message to more than one recipient, the Internet message ID is the same for each instance of the message. Subsequent revisions to the original message receive a different message identifier.
- ConversationTopic - This property specifies the subject of the conversation thread of a message. The value of the ConversationTopic property is the string that describes the overall topic of the conversation. A conversation consists of an initial message and all messages sent in reply to the initial message. Messages within the same conversation have the same value for the ConversationTopic property. The value of this property is typically the Subject line from the initial message that spawned the conversation.
- BodyTagInfo - This is an internal Exchange store property. The value of this property is calculated by checking various attributes in the body of the message. This property is used to identify differences in the body of messages.
During the eDiscovery export process, these three properties are compared for every message that matches the search criteria. If all three properties are identical for two or more messages, those messages are treated as duplicates, and only one copy of the message is exported if deduplication is enabled. The message that is exported is known as the "source item". Information about duplicate messages is included in the Results.csv and Manifest.xml reports that are included with the exported search results. In the Results.csv file, a duplicate message is identified by having a value in the Duplicate to Item column. The value in this column matches the value in the Item Identity column for the message that was exported.
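The comparison described above can be sketched in a few lines of Python. This is an illustration of the idea only, not the export tool's actual implementation. The Message type and the deduplicate function are hypothetical names, BodyTagInfo is treated as an opaque string because it's an internal Exchange store property, and the choice of the first message encountered as the source item is an assumption (this article doesn't say which copy is selected).

```python
from typing import Dict, List, NamedTuple, Tuple


class Message(NamedTuple):
    """Hypothetical, simplified view of a message returned by a search."""
    item_identity: str        # identity assigned by the Exchange store
    internet_message_id: str
    conversation_topic: str
    body_tag_info: str        # internal property, modeled here as an opaque string


def deduplicate(messages: List[Message]) -> Tuple[List[Message], Dict[str, str]]:
    """Split messages into source items and a duplicate -> source identity map.

    Two messages count as duplicates only if all three properties match.
    Assumption: the first message seen with a given key is the source item.
    """
    first_seen: Dict[Tuple[str, str, str], Message] = {}
    source_items: List[Message] = []
    duplicate_to_item: Dict[str, str] = {}

    for msg in messages:
        key = (msg.internet_message_id, msg.conversation_topic, msg.body_tag_info)
        if key in first_seen:
            # Not exported, but still recorded so it can appear in the reports.
            duplicate_to_item[msg.item_identity] = first_seen[key].item_identity
        else:
            first_seen[key] = msg
            source_items.append(msg)

    return source_items, duplicate_to_item
```

In this sketch, the returned dictionary plays the role of the Duplicate to Item column: each key is a duplicate's identity, and each value is the identity of the source item that would be exported in its place.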
The following graphics show how duplicate messages are displayed in the Results.csv and Manifest.xml reports that are exported with the search results. These reports don't include the email properties previously described, which are used in the deduplication algorithm. Instead, the reports include the Item Identity property that is assigned to items by the Exchange store.
Results.csv report (viewed in Excel)
Manifest.xml report (viewed in Excel)
Additionally, other properties from duplicate messages are included in the export reports. This includes the mailbox the duplicate message is located in, whether the message was sent to a distribution group, and whether the message was Cc'd or Bcc'd to another user.
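If you want to review which items were suppressed as duplicates, you can group the rows of the Results.csv report by the Duplicate to Item column. The following Python sketch assumes the report is a standard comma-separated file whose header row contains the Item Identity and Duplicate to Item columns named in this article. The group_duplicates function and the sample rows are hypothetical, and a real report contains additional columns.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, TextIO


def group_duplicates(csv_file: TextIO) -> Dict[str, List[str]]:
    """Map each exported (source) item identity to the identities of its duplicates."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for row in csv.DictReader(csv_file):
        # A non-empty "Duplicate to Item" value marks the row as a duplicate
        # and points at the "Item Identity" of the message that was exported.
        source_identity = (row.get("Duplicate to Item") or "").strip()
        if source_identity:
            groups[source_identity].append(row["Item Identity"])
    return dict(groups)


# Invented sample data, reduced to the two columns discussed in this article.
SAMPLE = """Item Identity,Duplicate to Item
item-001,
item-002,item-001
item-003,
item-004,item-001
"""
```

Calling group_duplicates(io.StringIO(SAMPLE)) returns a dictionary in which item-001 maps to item-002 and item-004. For a real export, you would pass an open file handle for Results.csv instead of the in-memory sample.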
Limitations of the deduplication algorithm
There are some known limitations of the deduplication algorithm that might cause unique items to get marked as duplicates. It's important to understand these limitations so you can decide whether or not to use the optional deduplication feature.
There's one situation where the deduplication feature might mistakenly identify a message as a duplicate and not export it (but still cite it as a duplicate in the export reports): a message that a user edits but doesn't send. For example, let's say a user selects a message in Outlook, copies the contents of the message, and then pastes it in a new message. Then the user changes one of the copies by removing or adding an attachment, or by changing the subject line or the body itself. If these two messages match the query of an eDiscovery search, only one of the messages will be exported if deduplication is enabled when the search results are exported. So even though the original message or the copied message was changed, neither of the revised messages was sent, and therefore the values of the InternetMessageId, ConversationTopic, and BodyTagInfo properties weren't updated. But as previously explained, both messages are listed in the export reports.
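The following minimal Python sketch illustrates why this case results in a false match. It assumes, as described above, that the three properties on the unsent copy weren't updated. The dictionary layout, the dedup_key helper, and all values are hypothetical and exist only for illustration.

```python
def dedup_key(message: dict) -> tuple:
    """Build the comparison key from the three properties named in this article."""
    return (
        message["InternetMessageId"],
        message["ConversationTopic"],
        message["BodyTagInfo"],
    )


original = {
    "Subject": "Q3 forecast",
    "InternetMessageId": "<123@contoso.example>",
    "ConversationTopic": "Q3 forecast",
    "BodyTagInfo": "tag-A",  # opaque placeholder for the internal property
}

# An unsent copy whose subject line was edited. Because the copy was never
# sent, the three comparison properties keep their original values.
edited_copy = dict(original, Subject="Q3 forecast (revised)")

same_key = dedup_key(original) == dedup_key(edited_copy)
content_differs = original["Subject"] != edited_copy["Subject"]
```

Here same_key and content_differs are both True: the two messages are different to a reviewer, but they're indistinguishable to a comparison that looks only at the three properties, so only one of them would be exported.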
Unique messages can also be marked as duplicates when the Copy-on-Write page protection feature is enabled, as in the case of a mailbox being on Litigation Hold or In-Place Hold. The Copy-on-Write feature copies the original message (and saves it in the Versions folder of the user's Recoverable Items folder) before the revision to the original item is saved. In this case, the revised copy and the original message (in the Recoverable Items folder) might be considered duplicates, and therefore only one of them would be exported.
If the limitations of the deduplication algorithm might impact the quality of your search results, then you shouldn't enable deduplication when you export items. If the situations described in this section are unlikely to be a factor in your search results, and you want to reduce the number of items most likely to be duplicates, then you should consider enabling deduplication.
For more information about exporting search results, see: