Checking for Duplicate Files in OneDrive with Python

Victor Prasad 41 Reputation points
2024-04-01T22:29:58.8333333+00:00

Hello,

Am I able to use MS Graph to programmatically check OneDrive for duplicate files without having to download them all?

I would like to hash the files to determine whether they are duplicates, then move the duplicates to the recycle bin.

  1. Can I do this with a personal M365 account (is it free)?
  2. Can I do it with an E5 license for multiple accounts?*

*For the E5 account, the idea would be to find the number of duplicate files per account and let the users know, NOT to know what the files are or view their contents. If needed, we can handle that the same way as in 1.

Thanks,

V

Microsoft Graph
A Microsoft programmability model that exposes REST APIs and client libraries to access data on Microsoft 365 services.
OneDrive
A Microsoft file hosting and synchronization service.

1 answer

  1. Andrew Geddes 741 Reputation points
    2024-04-17T11:08:20.0133333+00:00

    Victor, here is one option for implementing this. I'd like to challenge the need a bit, though, if you'll entertain me. Typically, OneDrive for Business (ODfB) accounts aren't constrained by size (unless you're talking about your top 1% of consumers). Even if they surpass the 5 TB threshold, the quota can be increased, and you're not paying for storage. It would be better to manage the information lifecycle by user type or org level than to take this approach: don't worry about the individual files, worry about how long any file should be kept.

    Anyway, that aside, here's an approach.

    To check for duplicate files in OneDrive using Python and the Microsoft Graph API, you can follow these steps:

    Register an Application: Register your application in the Azure portal to obtain the client_id, client_secret, and tenant_id. This is necessary for authenticating your application with Microsoft Graph.

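    Set Permissions: In the Azure portal, grant your application the Microsoft Graph permissions it needs. Files.Read (or Files.Read.All) is enough to list files and read their metadata; moving items to the recycle bin requires Files.ReadWrite (or Files.ReadWrite.All). Scanning other users' drives in an organization needs the .All variants and admin consent.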

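    Authenticate: Use an OAuth 2.0 authorization flow to obtain an access token from the Microsoft identity platform. A personal account requires a delegated (signed-in user) flow; an unattended scan across organizational accounts typically uses the client credentials flow.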
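As a rough illustration of the client credentials case (organizational tenants only, not personal accounts), the token request can be sketched with the standard library alone. The function names `build_token_request` and `fetch_access_token` are made up for this example, and in practice a library such as MSAL is usually the more robust choice.

```python
import json
import urllib.parse
import urllib.request

# Microsoft identity platform v2.0 token endpoint for a single tenant.
TOKEN_URL = "https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"


def build_token_request(tenant_id, client_id, client_secret):
    """Build (but do not send) the POST request for the client credentials grant."""
    form = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        # ".default" requests the application permissions already consented to.
        "scope": "https://graph.microsoft.com/.default",
    }).encode("ascii")
    return urllib.request.Request(
        TOKEN_URL.format(tenant_id=tenant_id),
        data=form,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def fetch_access_token(tenant_id, client_id, client_secret):
    """Send the token request and return the bearer token string."""
    request = build_token_request(tenant_id, client_id, client_secret)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["access_token"]
```

Keeping the request construction separate from the network call makes it easy to inspect what will be sent before any secret leaves the machine.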

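    Fetch File List: Make a GET request to the /me/drive/root/children endpoint (for the signed-in user, including personal accounts) or /users/{user-id}/drive/root/children (for other users in an E5 organization) to retrieve a list of files in OneDrive. This returns only the items directly under the root folder, so either recurse into each folder via /me/drive/items/{item-id}/children or use /me/drive/root/delta to enumerate the whole drive. In both cases, follow @odata.nextLink to page through the results.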
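A minimal sketch of the listing step, assuming the delta endpoint is used to walk the whole drive. `GRAPH`, `graph_get_json`, and `iter_files` are hypothetical names chosen for this example; the HTTP call is passed in as a function so the paging logic can be exercised without a network connection.

```python
import json
import urllib.request

GRAPH = "https://graph.microsoft.com/v1.0"


def graph_get_json(url, access_token):
    """GET a Graph URL with a bearer token and return the parsed JSON body."""
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {access_token}"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def iter_files(start_url, get_json):
    """Yield every file item reachable from start_url, page by page.

    get_json is any callable that maps a URL to a parsed response dict.
    """
    url = start_url
    while url:
        page = get_json(url)
        for item in page.get("value", []):
            # Skip folders (no "file" facet) and items reported as deleted.
            if "file" in item and "deleted" not in item:
                yield item
        # The last page carries no @odata.nextLink, which ends the loop.
        url = page.get("@odata.nextLink")
```

With a real token this could be called as `iter_files(f"{GRAPH}/me/drive/root/delta", lambda url: graph_get_json(url, token))`, or with `/users/{user-id}/drive/root/delta` for another user's drive.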

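    Hash Files: You do not need to download files to hash them. For most files, the driveItem metadata returned by Graph already includes a file.hashes facet with server-computed values such as quickXorHash, sha1Hash, or sha256Hash. Which of these are populated depends on the service (OneDrive personal versus OneDrive for Business), and some items may have none. Where no hash is available, compare size and name as a preliminary check and download only the remaining candidates to hash them locally.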

    Identify Duplicates: Group the files by hash. If two or more files have the same hash (and the same size), they are duplicates; a match on name and size alone only marks them as candidates.
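The grouping step could look roughly like the sketch below. It assumes each item is a driveItem dictionary that may carry a `file.hashes` facet; the preference order in `HASH_FIELDS` and the function names are illustrative choices, not anything Graph prescribes.

```python
from collections import defaultdict

# Hash properties to try, in order of preference (an assumption for this sketch).
HASH_FIELDS = ("quickXorHash", "sha256Hash", "sha1Hash")


def content_key(item):
    """Return (hash field, hash value, size) for an item, or None if it has no hash."""
    hashes = item.get("file", {}).get("hashes", {})
    for field in HASH_FIELDS:
        if hashes.get(field):
            return (field, hashes[field], item.get("size"))
    return None


def find_duplicate_groups(items):
    """Group items by content key and keep only groups with two or more members."""
    groups = defaultdict(list)
    for item in items:
        key = content_key(item)
        if key is not None:
            groups[key].append(item)
    return [group for group in groups.values() if len(group) > 1]


def count_redundant_copies(items):
    """Count the files that could go if one copy per group were kept."""
    return sum(len(group) - 1 for group in find_duplicate_groups(items))
```

For the E5 scenario, `count_redundant_copies` gives a per-account number without anyone opening the files. One limitation: two items are only matched if the same hash field is populated on both.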

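    Move Duplicates: To move a duplicate to the recycle bin, send DELETE /me/drive/items/{item-id} (or /users/{user-id}/drive/items/{item-id} for another user's drive). Graph moves the item to the recycle bin rather than deleting it permanently.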
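A hedged sketch of the clean-up step follows. Keeping the first item of each group is an arbitrary policy picked for illustration (you may prefer the oldest or the one in a particular folder), and `build_delete_request` and `recycle_duplicates` are hypothetical helper names. The `send` parameter defaults to `urllib.request.urlopen`, which raises an exception on an HTTP error response.

```python
import urllib.request

GRAPH = "https://graph.microsoft.com/v1.0"


def build_delete_request(item_id, access_token, drive_path="/me/drive"):
    """Build the DELETE request that sends one item to the recycle bin."""
    return urllib.request.Request(
        f"{GRAPH}{drive_path}/items/{item_id}",
        headers={"Authorization": f"Bearer {access_token}"},
        method="DELETE",
    )


def recycle_duplicates(groups, access_token,
                       send=urllib.request.urlopen, drive_path="/me/drive"):
    """Keep the first item of each group, delete the rest, return the deleted ids."""
    deleted = []
    for group in groups:
        for item in group[1:]:
            request = build_delete_request(item["id"], access_token, drive_path)
            response = send(request)
            response.close()
            deleted.append(item["id"])
    return deleted
```

Running it first with a `send` function that only logs the requests is a cheap way to do a dry run before anything is actually moved.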

    With a personal M365 account, you can use Graph Explorer to test these API calls. With E5 licenses, you can perform these actions programmatically for multiple accounts, provided you have the necessary admin consent and permissions.