Configure search and analytics settings for eDiscovery (Premium) cases
You can configure settings for each Microsoft Purview eDiscovery (Premium) case to control the following functionality:
- Near duplicates and email threading
- Themes
- Autogenerated review set query
- Ignore text
- Optical character recognition
Configure analytics settings for a case
To configure search and analytics settings for a case:
- On the eDiscovery (Premium) page, select the case.
- On the Settings tab, under Search & analytics, choose Select. The case settings page is displayed. These settings are applied to all review sets in a case.
The following sections in this article describe the analytics settings that you can configure for a case.
Near duplicates and email threading
- Near duplicates/email threading: When turned on, duplicate detection, near duplicate detection, and email threading are included as part of the workflow when you run analytics on the data in a review set.
- Document and email similarity threshold: If the similarity level for two documents is above the threshold, both documents are put in the same near duplicate set.
- Minimum/maximum number of words: These settings specify that near duplicates and email threading analysis are performed only on documents that have at least the minimum number of words and at most the maximum number of words.
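To make the interaction between the similarity threshold and the word-count limits concrete, here is a minimal sketch. It is not the algorithm eDiscovery (Premium) uses. The function names, the word-level similarity measure (Python's `difflib.SequenceMatcher`), and the greedy grouping around a first "pivot" document are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher


def word_count(text):
    """Count word-like tokens in a document."""
    return len(re.findall(r"\w+", text))


def similarity(a, b):
    """Word-level similarity in [0, 1]; an illustrative stand-in measure."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def near_duplicate_sets(docs, threshold, min_words, max_words):
    """Group documents whose similarity to a set's first document is above threshold.

    Documents outside the [min_words, max_words] range are skipped entirely,
    mirroring the minimum/maximum number of words settings.
    """
    sets = []  # each set is a list of document ids; the first id is the pivot
    for doc_id, text in docs.items():
        if not (min_words <= word_count(text) <= max_words):
            continue  # too short or too long to analyze
        for group in sets:
            # "above the threshold": treating the boundary as exclusive is an assumption
            if similarity(docs[group[0]], text) > threshold:
                group.append(doc_id)
                break
        else:
            sets.append([doc_id])  # no similar set found; start a new one
    return sets


docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the lazy cat",
    "c": "hi",
    "d": "completely different content about quarterly revenue and budget forecasts",
}
result = near_duplicate_sets(docs, threshold=0.8, min_words=3, max_words=1000)
```

In this sketch, documents "a" and "b" differ by one word and land in the same set, "c" is skipped because it falls below the minimum word count, and "d" forms its own set.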
Themes
In this section, you can set parameters for themes. For more information, see Themes.
- Themes: When turned on, themes clustering is performed as part of the workflow when you run analytics on the data in a review set.
- Maximum number of themes: Specifies the maximum number of themes that can be generated when you run analytics on the data in a review set.
- Include numbers in themes: When turned on, numbers (that identify a theme) are included when generating themes.
- Adjust maximum number of themes dynamically: In certain situations, there may not be enough documents in a review set to produce the desired number of themes. When this setting is enabled, eDiscovery (Premium) adjusts the maximum number of themes dynamically rather than attempting to enforce the maximum number of themes.
Review set query
If you select the Automatically create a For Review saved search after analytics checkbox, eDiscovery (Premium) autogenerates a review set query named For Review.
This query filters out duplicate items so that you can review only the unique items in the review set. The query is created only when you run analytics for a review set in the case. For more information about review set queries, see Query the data in a review set.
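The effect of filtering out duplicates can be pictured with a short sketch. It is illustrative only: the actual conditions in the For Review query aren't described here, and identifying duplicates by a SHA-256 hash of the item text is an assumption, as is the `unique_items` helper name.

```python
import hashlib


def unique_items(items):
    """Return the ids of items whose text has not been seen earlier in the list.

    `items` is a list of (item_id, text) pairs. Exact-text hashing is an
    assumed stand-in for whatever duplicate detection analytics performs.
    """
    seen = set()
    unique = []
    for item_id, text in items:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(item_id)
    return unique


items = [("1", "Quarterly report"), ("2", "Meeting notes"), ("3", "Quarterly report")]
kept = unique_items(items)
```

Here item "3" duplicates item "1", so only items "1" and "2" remain for review.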
Ignore text
Certain text can diminish the quality of analytics, such as lengthy disclaimers that are added to email messages regardless of their content. If you know of text that should be ignored, you can exclude it from analytics by specifying the text string and the analytics functionality (Near-duplicates, Email threading, Themes, and Relevance) that the text should be excluded for. You can also use regular expressions (RegEx) to specify ignored text.
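As a rough illustration of how ignored text might work, the sketch below removes literal strings or regular-expression matches from a document before a given analytics function sees it. The `IgnoreRule` type and `apply_ignore_rules` function are hypothetical, and the sketch uses Python's `re` syntax; the regular-expression flavor that eDiscovery (Premium) accepts may differ.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class IgnoreRule:
    """Hypothetical representation of one ignore-text entry."""
    pattern: str          # literal text or a regular expression
    is_regex: bool        # True if `pattern` should be treated as RegEx
    modules: frozenset    # analytics functions the rule applies to


def apply_ignore_rules(text, rules, module):
    """Strip ignored text from `text` for the given analytics module."""
    for rule in rules:
        if module not in rule.modules:
            continue  # rule is not enabled for this analytics function
        # Literal strings are escaped so they match character for character.
        pattern = rule.pattern if rule.is_regex else re.escape(rule.pattern)
        text = re.sub(pattern, "", text)
    return text


rules = [
    IgnoreRule("This email is confidential.", False,
               frozenset({"Near-duplicates", "Themes"})),
    IgnoreRule(r"Sent from my \w+", True,
               frozenset({"Near-duplicates", "Email threading", "Themes", "Relevance"})),
]

body = "Lunch at noon?\nThis email is confidential."
for_themes = apply_ignore_rules(body, rules, "Themes")
for_relevance = apply_ignore_rules(body, rules, "Relevance")
signature = apply_ignore_rules("See you then. Sent from my iPhone", rules, "Relevance")
```

In this sketch, the disclaimer is removed for Themes but left in place for Relevance, because the first rule isn't enabled for that function.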
Optical character recognition (OCR)
When this setting is turned on, OCR processing is run on image files in the following situations:
- When custodians and non-custodial data sources are added to a case. OCR is performed during the Advanced indexing process, which makes the text in image files searchable during a collection. OCR runs only on items that are processed during Advanced indexing. For example, if a large PDF file that is partially indexed or has other indexing errors is processed during Advanced indexing, OCR is also applied to that file. As a result, when custodians are added to a case, some email attachments might not be processed for OCR because those files aren't processed during Advanced indexing.
- When content from other data sources (that aren't associated with a custodian and added to the case in a non-custodial data source) is added to a review set.
After data is added to a review set, image text can be reviewed, searched, tagged, and analyzed. You can view the extracted text in the Text viewer of the selected image file in the review set. For more information, see: