Are there tools that help identify duplicate documents on a company intranet site?

2024-07-11T19:35:08.1933333+00:00

Greetings!

Is there a way or is there a tool I can use to view all duplicate documents that are being linked in an organization's intranet site?

Thank you in advance for your assistance.

SharePoint Server Management

1 answer

  1. Xyza Xue_MSFT 24,256 Reputation points Microsoft Vendor
    2024-07-12T02:12:31.02+00:00

    Hi @Slater, Jacqueline (AF/OCIO/Contractor),

    If you are using SharePoint Online, you can use PnP PowerShell to find duplicate files in a SharePoint site by comparing file hashes, and export the results to a CSV file.

    #Parameters
    $SiteURL = "https://tenant.sharepoint.com/sites/yoursite"
    $Pagesize = 2000
    $ReportOutput = "C:\Temp\Duplicates.csv"
     
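    #Connect to SharePoint Online site
    #Note: recent PnP.PowerShell releases may also require -ClientId with your own Entra ID app registration
    Connect-PnPOnline -Url $SiteURL -Interactive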
      
    #Array to store results
    $DataCollection = @()
     
    #Get all Document libraries
    $DocumentLibraries = Get-PnPList | Where-Object {$_.BaseType -eq "DocumentLibrary" -and $_.Hidden -eq $false -and $_.ItemCount -gt 0 -and $_.Title -Notin("Site Pages","Style Library", "Preservation Hold Library")}
     
    #Iterate through each document library
    ForEach($Library in $DocumentLibraries)
    {    
        #Get All documents from the library
        $global:counter = 0;
        $Documents = Get-PnPListItem -List $Library -PageSize $Pagesize -Fields ID, File_x0020_Type -ScriptBlock `
            { Param($items) $global:counter += $items.Count; Write-Progress -PercentComplete ($global:Counter / ($Library.ItemCount) * 100) -Activity `
                 "Getting Documents from Library '$($Library.Title)'" -Status "Getting Documents data $global:Counter of $($Library.ItemCount)";} | Where-Object {$_.FileSystemObjectType -eq "File"}
       
        $ItemCounter = 0
        #Iterate through each document
        Foreach($Document in $Documents)
        {
            #Get the File from Item
            $File = Get-PnPProperty -ClientObject $Document -Property File
     
            #Get the file hash: read the file content as a stream and compute its MD5 checksum
            $Bytes = $File.OpenBinaryStream()
            Invoke-PnPQuery
            $MD5 = [System.Security.Cryptography.MD5]::Create()
            $HashCode = [System.BitConverter]::ToString($MD5.ComputeHash($Bytes.Value))
            $MD5.Dispose()
      
            #Collect data
            $Data = [PSCustomObject]@{
                FileName = $File.Name
                HashCode = $HashCode
                URL      = $File.ServerRelativeUrl
                FileSize = $File.Length
            }
            $DataCollection += $Data
            $ItemCounter++
            #Use the already-loaded File properties; FileLeafRef and FileRef were not requested in -Fields
            Write-Progress -PercentComplete ($ItemCounter / ($Library.ItemCount) * 100) -Activity "Collecting data from Documents $ItemCounter of $($Library.ItemCount) from $($Library.Title)" `
                         -Status "Reading Data from Document '$($File.Name)' at '$($File.ServerRelativeUrl)'"
        }
    }
    #Get duplicate files by grouping on hash code: any group with more than one member has identical content
    $Duplicates = $DataCollection | Group-Object -Property HashCode | Where-Object {$_.Count -gt 1} | Select-Object -ExpandProperty Group
    Write-Host "Duplicate Files Based on File Hashcode:"
    $Duplicates | Format-Table -AutoSize
    #Export the duplicates results to CSV
    $Duplicates | Export-Csv -Path $ReportOutput -NoTypeInformation
    

    The result in my test: (screenshot not included)
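
    The duplicate check in the script above comes down to two steps: hash each file's content, then group on the hash. Here is a minimal sketch of just those steps that runs locally without a SharePoint connection. The file names and contents in $Samples are made-up sample data, and MD5 is used only to mirror the script above.

    ```powershell
    # Hypothetical sample files (name -> content); not taken from any real site
    $Samples = [ordered]@{
        "Policy.docx"        = "Travel policy v1"
        "Policy - Copy.docx" = "Travel policy v1"
        "Handbook.docx"      = "Employee handbook"
    }

    # Hash the content of each sample and record it next to the file name
    $MD5 = [System.Security.Cryptography.MD5]::Create()
    $Records = foreach ($Name in $Samples.Keys) {
        $Bytes = [System.Text.Encoding]::UTF8.GetBytes($Samples[$Name])
        [PSCustomObject]@{
            FileName = $Name
            HashCode = [System.BitConverter]::ToString($MD5.ComputeHash($Bytes))
        }
    }
    $MD5.Dispose()

    # Keep only the hash groups that contain more than one file
    $Duplicates = $Records | Group-Object -Property HashCode | Where-Object { $_.Count -gt 1 } | Select-Object -ExpandProperty Group
    $Duplicates | Format-Table -AutoSize
    ```

    Because the comparison is on content rather than name, the two "Policy" samples are reported as duplicates even though their file names differ, while "Handbook.docx" is left out.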

    If you are using SharePoint Server, in my testing, enabling the "Show View Duplicates link" option, or changing the query rules to show duplicates on the search center page, is a good way to identify duplicate file names.

    https://learn.microsoft.com/en-us/sharepoint/troubleshoot/search/near-duplicate-items-are-not-listed-in-search-results



