SharePoint 2013 Search – Near Duplicates and DocumentSignature
The official Microsoft documentation on duplicates can be found here:
https://msdn.microsoft.com/en-us/library/jj687488.aspx (Customizing search results in SharePoint 2013).
Issue:
Recently, several cases have been reported regarding duplicates in SharePoint Server 2013 (SPS2013) search.
I would therefore like to take the opportunity to blog about this behavior. I hope this helps you in the field to understand the default behavior.
Symptoms
We have a few documents that have several similar, but not exactly identical, properties. When we search on one of those similar properties, only a few of the documents are shown as duplicates, not all of them.
Cause
The document stream, meaning the content text only (no titles, no filenames, no metadata, no URLs), is broken into what are known as minima. The exact number of minima is Microsoft confidential, but the approach is to break the document into same-sized chunks.
Each group of chunks is then hashed into a larger chunk known as a supershingle.
One or more supershingles are then hashed together to produce a megashingle.
The megashingle hash values are stored in the database, producing a table of hashes for each document in the search index.
At query time this table is queried, and each result in the main result set is tested for hashes that match its own set. If more than one hash is the same for two given documents, those documents are said to be near duplicates. The more matches that are found, the closer the documents are to being identical.
The key point is that we are looking for NEAR DUPLICATES, not exact duplicates.
Note: megashingle hash values = DocumentSignature
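To make the steps above more concrete, here is a minimal Python sketch of a min-hash style shingling scheme. It is only an illustration and not the actual SharePoint implementation: the hash function, the number of minima (12), the chunk window (3 words), the grouping sizes, and all function names are assumptions chosen for the example. The "more than one shared hash" threshold is taken from the description above.

```python
import hashlib
from itertools import combinations


def stable_hash(text):
    """Deterministic 64-bit hash (truncated SHA-1); a stand-in for the real hash."""
    return int.from_bytes(hashlib.sha1(text.encode("utf-8")).digest()[:8], "big")


def minima(content, num_minima=12, window=3):
    """Break the content text into same-sized word chunks and, for each of
    num_minima seeded hash functions, keep the smallest hash over all chunks."""
    words = content.lower().split()
    chunks = {
        " ".join(words[i:i + window])
        for i in range(max(1, len(words) - window + 1))
    }
    return [
        min(stable_hash(f"{seed}:{chunk}") for chunk in chunks)
        for seed in range(num_minima)
    ]


def supershingles(mins, group_size=2):
    """Hash each group of minima into one supershingle."""
    return [
        stable_hash(",".join(str(m) for m in mins[i:i + group_size]))
        for i in range(0, len(mins), group_size)
    ]


def megashingles(supers, combine=2):
    """Hash combinations of supershingles together into megashingles."""
    return {
        stable_hash(",".join(str(s) for s in combo))
        for combo in combinations(supers, combine)
    }


def document_signature(content):
    """The set of megashingle hashes for a document (hypothetical layout)."""
    return megashingles(supershingles(minima(content)))


def shared_hashes(content_a, content_b):
    """Number of megashingle hashes the two documents have in common."""
    return len(document_signature(content_a) & document_signature(content_b))


def are_near_duplicates(content_a, content_b, min_shared=2):
    """Near duplicates when more than one hash matches (min_shared=2)."""
    return shared_hashes(content_a, content_b) >= min_shared
```

Calling `shared_hashes(text_a, text_b)` on two versions of the same document shows how many hashes survive an edit. Because only the content text is hashed, changing a title or a metadata property does not change the signature in this sketch, which matches the symptom described above.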
Resolution
This is expected behavior. As a workaround, you can turn off the ‘View Duplicates’ option in the search web part.
Note:
You can reproduce the duplicate behavior described above with the SearchQueryTool (from CodePlex, https://sp2013searchtool.codeplex.com/): enter “documentsignature” in the “Select Properties” box and run your search query again.
Of course, you can check or uncheck the “Trim Duplicates” checkbox to see the differences in the Primary Result tab.
[Screenshot: SearchQueryTool, https://sp2013searchtool.codeplex.com/]
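If you prefer to run the same comparison without the tool, the query can also be sent to the SharePoint Search REST endpoint. The sketch below only builds the request URL, using the `querytext`, `trimduplicates`, and `selectproperties` parameters. The site URL and the helper name `build_search_url` are hypothetical, and authentication and sending the request are left out.

```python
from urllib.parse import quote


def build_search_url(site_url, query_text, trim_duplicates,
                     select_properties=("Title", "Path", "DocumentSignature")):
    """Build a Search REST query URL with duplicate trimming on or off.

    site_url is a placeholder for your own site; the selected properties
    mirror what you would enter in the "Select Properties" box.
    """
    props = ",".join(select_properties)
    return (
        f"{site_url.rstrip('/')}/_api/search/query"
        f"?querytext='{quote(query_text)}'"
        f"&trimduplicates={'true' if trim_duplicates else 'false'}"
        f"&selectproperties='{quote(props, safe=',')}'"
    )


# Hypothetical site URL; run the query twice and compare the result sets.
with_trimming = build_search_url("https://contoso.example/sites/search", "annual report", True)
without_trimming = build_search_url("https://contoso.example/sites/search", "annual report", False)
```

Comparing the results of the two URLs should show the same difference you see when toggling the “Trim Duplicates” checkbox in the tool.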
Thanks to my colleagues for their collaboration:
Nicolas Uthurriague
Praveen Hebbar
Thanks to the author of this article: Manlon.lam[at]microsoft.com
Comments
- Anonymous
July 02, 2014
I've seen several scenarios where a single document gets crawled twice and leads to duplicate results