question

Ayeshjamal-4104 asked:

SharePoint Online: Find Duplicate Documents using PowerShell – File Hash Method

I am trying to find all duplicate files within one document library. For libraries larger than 5,000 items, I am not able to merge the two scripts together.

Please refer to this script for getting more than 5,000 items from a large list: https://www.sharepointdiary.com/2016/12/sharepoint-online-get-all-items-from-large-lists-powershell-csom.html

I am not able to update the following script to support a document library with more than 5,000 items. Can someone please guide me? Please refer to the following link: https://www.sharepointdiary.com/2019/04/sharepoint-online-find-duplicate-files-using-powershell.html

 #Load SharePoint CSOM Assemblies
 Add-Type -Path "C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\16\ISAPI\Microsoft.SharePoint.Client.dll"
 Add-Type -Path "C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\16\ISAPI\Microsoft.SharePoint.Client.Runtime.dll"
     
 #Parameters
 $SiteURL = "https://x.sharepoint.com"
 $ListName ="Documents"
     
 #Array to Results Data
 $DataCollection = @()
     
 #Get credentials to connect
 $Cred = Get-Credential
     
 Try {
     #Setup the Context
     $Con = New-Object Microsoft.SharePoint.Client.ClientContext($SiteURL)
     $Con.Credentials = New-Object Microsoft.SharePoint.Client.SharePointOnlineCredentials($Cred.UserName, $Cred.Password)
     
     #Get the Web
     $Web = $Con.Web
     $Con.Load($Web)
     
     #Get all List items from the library - Exclude "Folder" objects
     $List = $Con.Web.Lists.GetByTitle($ListName)
     $Query = New-Object Microsoft.SharePoint.Client.CamlQuery
     $Query.ViewXml="<View Scope='RecursiveAll'><Query><Where><Eq><FieldRef Name='FSObjType'/><Value Type='Integer'>0</Value></Eq></Where></Query></View>"
     $ListItems = $List.GetItems($Query)
     $Con.Load($ListItems)
     $Con.ExecuteQuery()
         
     $Count=1
     ForEach($Item in $ListItems)
     {
         #Get the File from Item
         $File = $Item.File
         $Con.Load($File)
         $Con.ExecuteQuery()
         Write-Progress -PercentComplete ($Count / $ListItems.Count * 100) -Activity "Processing File $count of $($ListItems.Count)" -Status "Scanning File '$($File.Name)'"
     
         #Get The File Hash
         $Bytes = $Item.file.OpenBinaryStream()
         $Con.ExecuteQuery()
         $MD5 = New-Object -TypeName System.Security.Cryptography.MD5CryptoServiceProvider
         $HashCode = [System.BitConverter]::ToString($MD5.ComputeHash($Bytes.Value))
     
         #Collect data       
            
         $Data = New-Object PSObject
         $Data | Add-Member -MemberType NoteProperty -name "File Name" -value $File.Name
         $Data | Add-Member -MemberType NoteProperty -Name "HashCode" -value $HashCode
         $Data | Add-Member -MemberType NoteProperty -Name "URL" -value $File.ServerRelativeUrl
         $DataCollection += $Data
            
           
         $Count++
     }
     #$DataCollection
     #Get Duplicate Files
     $Duplicates = $DataCollection | Group-Object -Property HashCode | Where {$_.Count -gt 1}  | Select -ExpandProperty Group
     If($Duplicates.Count -gt 1)
     {
      
         $Duplicates | Out-GridView
            
            
     }
     Else
     {
         Write-host -f Yellow "No Duplicates Found!"
     }
 }
 Catch {
     write-host -f Red "Error:" $_.Exception.Message
 }
    
    
 $Duplicates | export-csv -Path c:\tmp\so.csv -NoTypeInformation
Tags: office-sharepoint-online, windows-server-powershell

RichMatheisen-8856 answered:

Try changing the query to take smaller batches instead of trying to eat the results all at once!

 $BatchSize = 1000
 $Query.ViewXml=@"
 <View Scope='RecursiveAll'>
     <Query>
         <Where><Eq><FieldRef Name='FSObjType'/><Value Type='Integer'>0</Value></Eq></Where>
     </Query>
     <RowLimit Paged="TRUE">$BatchSize</RowLimit>
 </View>
 "@
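The `RowLimit` on its own only caps the page size; to walk the whole library the loop also has to carry the paging position forward between requests. A minimal sketch of combining that paging loop with the question's script (reusing its `$Con` and `$List` variables and the `$BatchSize` above; untested here, since it requires a live SharePoint Online tenant):

```powershell
# Sketch: page through a large library in $BatchSize chunks,
# carrying the paging position forward until the last page is read.
$Query = New-Object Microsoft.SharePoint.Client.CamlQuery
$Query.ViewXml = @"
<View Scope='RecursiveAll'>
    <Query>
        <Where><Eq><FieldRef Name='FSObjType'/><Value Type='Integer'>0</Value></Eq></Where>
    </Query>
    <RowLimit Paged='TRUE'>$BatchSize</RowLimit>
</View>
"@

$AllItems = @()
Do {
    # Each pass fetches at most $BatchSize items, staying under the 5,000 threshold
    $ListItems = $List.GetItems($Query)
    $Con.Load($ListItems)
    $Con.ExecuteQuery()

    $AllItems += @($ListItems)

    # Position of the next page; $null once the last page has been read
    $Query.ListItemCollectionPosition = $ListItems.ListItemCollectionPosition
} While ($null -ne $Query.ListItemCollectionPosition)
```

The accumulated `$AllItems` can then replace `$ListItems` in the existing `ForEach` hashing loop.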

Hello Rich,

Please note that even after updating the script to use smaller batches, I am still not able to query a document library that contains 15,000 or 20,000 items. Can you please advise?


I know next to nothing about SharePoint or CAML. But, if it's possible, I'd probably try adding the file name to your query, especially if wildcards can be used.

For example, instead of getting ALL the files, create a loop that gets all files beginning with "a", then "b" then "c" . . . then "0", then "1", etc. and ADD them to $ListItems. In the end you'd have a complete list of all files -- it'd just take a bit longer.
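If CAML filtering is an option, that idea could be sketched with a `<BeginsWith>` filter on `FileLeafRef` (the file-name field). A hypothetical, untested loop, again reusing the `$Con` and `$List` variables from the question's script:

```powershell
# Sketch of the per-prefix idea: one query per leading character,
# accumulating each batch of results into a single collection.
$AllItems = @()
ForEach ($Prefix in ([char[]]'abcdefghijklmnopqrstuvwxyz0123456789')) {
    $Query = New-Object Microsoft.SharePoint.Client.CamlQuery
    $Query.ViewXml = @"
<View Scope='RecursiveAll'>
    <Query>
        <Where>
            <And>
                <Eq><FieldRef Name='FSObjType'/><Value Type='Integer'>0</Value></Eq>
                <BeginsWith><FieldRef Name='FileLeafRef'/><Value Type='Text'>$Prefix</Value></BeginsWith>
            </And>
        </Where>
    </Query>
</View>
"@
    $Items = $List.GetItems($Query)
    $Con.Load($Items)
    $Con.ExecuteQuery()
    $AllItems += @($Items)
}
```

One caveat: if any single prefix bucket still exceeds the 5,000-item list view threshold, the same error returns, so the `RowLimit Paged="TRUE"` approach is usually the more reliable route.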


Thanks

ElsieLu-MSFT answered:

Hi @RichMatheisen-8856, welcome to the Q&A forum!

Per my test, the script mentioned above works normally in my environment; it can detect duplicate files. Are there any issues or error messages in your environment?

 #Load SharePoint CSOM Assemblies
 Add-Type -Path "C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\16\ISAPI\Microsoft.SharePoint.Client.dll"
 Add-Type -Path "C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\16\ISAPI\Microsoft.SharePoint.Client.Runtime.dll"
     
 #Parameters
 $SiteURL = "https://crescenttech.sharepoint.com"
 $ListName ="Documents"
     
 #Array to Results Data
 $DataCollection = @()
     
 #Get credentials to connect
 $Cred = Get-Credential
     
 Try {
     #Setup the Context
     $Ctx = New-Object Microsoft.SharePoint.Client.ClientContext($SiteURL)
     $Ctx.Credentials = New-Object Microsoft.SharePoint.Client.SharePointOnlineCredentials($Cred.UserName, $Cred.Password)
     
     #Get the Web
     $Web = $Ctx.Web
     $Ctx.Load($Web)
     
     #Get all List items from the library - Exclude "Folder" objects
     $List = $Ctx.Web.Lists.GetByTitle($ListName)
     $Query = New-Object Microsoft.SharePoint.Client.CamlQuery
     $Query.ViewXml="<View Scope='RecursiveAll'><Query><Where><Eq><FieldRef Name='FSObjType'/><Value Type='Integer'>0</Value></Eq></Where></Query></View>"
     $ListItems = $List.GetItems($Query)
     $Ctx.Load($ListItems)
     $Ctx.ExecuteQuery()
         
     $Count=1
     ForEach($Item in $ListItems)
     {
         #Get the File from Item
         $File = $Item.File
         $Ctx.Load($File)
         $Ctx.ExecuteQuery()
         Write-Progress -PercentComplete ($Count / $ListItems.Count * 100) -Activity "Processing File $count of $($ListItems.Count)" -Status "Scanning File '$($File.Name)'"
     
         #Get The File Hash
         $Bytes = $Item.file.OpenBinaryStream()
         $Ctx.ExecuteQuery()
         $MD5 = New-Object -TypeName System.Security.Cryptography.MD5CryptoServiceProvider
         $HashCode = [System.BitConverter]::ToString($MD5.ComputeHash($Bytes.Value))
     
         #Collect data       
         $Data = New-Object PSObject
         $Data | Add-Member -MemberType NoteProperty -name "File Name" -value $File.Name
         $Data | Add-Member -MemberType NoteProperty -Name "HashCode" -value $HashCode
         $Data | Add-Member -MemberType NoteProperty -Name "URL" -value $File.ServerRelativeUrl
         $DataCollection += $Data
     
         $Count++
     }
     #$DataCollection
     #Get Duplicate Files
     $Duplicates = $DataCollection | Group-Object -Property HashCode | Where {$_.Count -gt 1}  | Select -ExpandProperty Group
     If($Duplicates.Count -gt 1)
     {
         $Duplicates | Out-GridView
     }
     Else
     {
         Write-host -f Yellow "No Duplicates Found!"
     }
 }
 Catch {
     write-host -f Red "Error:" $_.Exception.Message
 }

Test Result: (screenshot attached: 112.png)





The original poster's problem wasn't that he couldn't find duplicates; it was that the query failed when it returned more than 5,000 results.

I suggested he limit the number of results returned at one time by asking for batches of results instead of getting them (or trying to get them) all at once. It's a common problem, not with only SharePoint, but with other products as well (Active Directory, for example).

Adding <RowLimit Paged="TRUE">$BatchSize</RowLimit> to the query seems to have resolved the situation.
