How to copy files from sharepoint drive that has huge heirarcy of folders and subfolders

Rayne 20 Reputation points
2024-05-12T03:16:23.7+00:00

We want to get all files from the sharepoint into our ADLS container. In sharepoint drive there are arround 60k files but through our existing pipeline we are getting only 600 files. We have a web activity with below url:-

https://graph.microsoft.com/v1.0/drives/@{pipeline().parameters.DriveID}/items/?$filter=lastModifiedDateTime ge @{pipeline().parameters.lastUpdate}&$count=true

And we are also handling odata.nextLink efficiently. I don't know where I'm going wrong. We have huge heirarcy of folder like :- In drive we have A, B, C, D folders where A has E, F , G and again they have subfolders. So there is a huge heirarcy (max hierarchy we have is 10 level). Now please help me to find the best approach so that we can copy all the files from sharepoint without missing a single file.

Is there any other endpoint that can retrieve each and every file from that sharepoint drive without missing anything.

If there is a recursive call scenario than please share how can I implement it without missing any folder and subfolder file's

Azure Data Lake Storage
Azure Data Lake Storage
An Azure service that provides an enterprise-wide hyper-scale repository for big data analytic workloads and is integrated with Azure Blob Storage.
1,483 questions
Azure Data Factory
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.
10,833 questions
SharePoint
SharePoint
A group of Microsoft Products and technologies used for sharing and managing content, knowledge, and applications.
10,822 questions
0 comments No comments
{count} votes

2 answers

Sort by: Most helpful
  1. Amira Bedhiafi 26,261 Reputation points
    2024-05-12T14:06:27.3966667+00:00

    Based on this old thread, you can opt for Power Automate where you set the option 'Option Include Nested items Yes.

    Also, my idea is to create a recursive function to iterate through folders and subfolders (ADF pipeline with Azure Function or Logic Apps for recursion)

    The Microsoft Graph API provides the capability to list items within a drive, and you can leverage the children endpoint to recursively fetch all items.

    You can set up an AAD app and grant it the necessary permissions to access SharePoint files.

    And then to build the initial API Call, you can use the driveItem/children endpoint to get the contents of a folderhe driveItem/children endpoint to get the contents of a folder.

    Just keep in mind that if you are starting from the root, {item-id} can be replaced with root and if you are starting from the root, {item-id} can be replaced with root.

    
    https://graph.microsoft.com/v1.0/drives/{drive-id}/items/{item-id}/children
    
    

    And don't forget to follow the @odata.nextLink to handle pagination if there are more items than can be returned in a single API call.

    Now comes the part of the recursivity, for each item returned by the API:

    • If the item is a folder, recursively call the API to get its children.
    • If the item is a file, add it to the list of files to be copied.

    How to Implement it in ADF?

    • Web Activity: To call the Microsoft Graph API.
    • ForEach Activity: To iterate over each item in the response.
    • If Condition Activity: To check if the item is a folder.
    • Azure Function or Logic App: To handle recursion if the item is a folder.
    1. Sample Pseudo-Code for Azure Function (Python):
         
         import requests
         
         def fetch_files_from_folder(drive_id, folder_id, access_token):
         
         url = f"https://graph.microsoft.com/v1.0/drives/{drive_id}/items/{folder_id}/children"
         
         headers = {
         
         "Authorization": f"Bearer {access_token}"
         
         }
         
         files = []
         
         while url:
         
         response = requests.get(url, headers=headers).json()
         
         for item in response.get('value', []):
         
            if item['folder']:
         
                files.extend(fetch_files_from_folder(drive_id, item['id'], access_token))
         
            else:
         
                files.append(item)
         
         url = response.get('@odata.nextLink', None)
         
         return files
         
      
    2. Integrate with ADF Pipeline:
      • Activity 1: Get Access Token
        • Use a Web Activity to get the OAuth2 access token.
      • Activity 2: Fetch Root Folder Items
        • Use another Web Activity to call the Microsoft Graph API for the root folder.
      • Activity 3: ForEach Activity
        • Iterate over the items in the response.
      • Activity 4: Check Item Type
        • Use an If Condition Activity to check if the item is a folder.
      • Activity 5: Recursion
        • If the item is a folder, call the Azure Function to handle recursion.
    • Get Drive Root Items: https://graph.microsoft.com/v1.0/drives/{drive-id}/root/children
    • Get Folder Children: https://graph.microsoft.com/v1.0/drives/{drive-id}/items/{folder-id}/children

    Here are more links to help you :

    https://www.youtube.com/watch?v=inZz9537sXs

    https://sharepoint.stackexchange.com/questions/186781/how-to-copy-all-files-from-folders-and-subfolders-into-one-folder-using-powershe

    https://www.spguides.com/power-automate-copy-folders/


  2. Anand Prakash Yadav 7,795 Reputation points Microsoft Vendor
    2024-05-14T09:04:33.9133333+00:00

    Hello Rayne,

    Thank you for posting your query here!

    You may refer the below Logic app workflow to get the libraries and lists from SharePoint and then iterate over each library and lists. Lastly uploading the files to Blob storage.

    Source: https://stackoverflow.com/questions/78100685/copy-files-from-sharepoint-site-content-to-azure-blob-using-logic-apps

    Workflow:

    Screenshot 2024-03-11 200811

    User's image

    User's image

    User's image

    User's image

    User's image

    Code:

    {
        "definition": {
            "$schema": "https://schema.management.azure.com/providers/Microsoft.Logic/schemas/2016-06-01/workflowdefinition.json#",
            "actions": {
                "For_each": {
                    "actions": {
                        "For_each_1": {
                            "actions": {
                                "Condition": {
                                    "actions": {
                                        "Create_blob_(V2)": {
                                            "inputs": {
                                                "body": "@body('Get_file_content_using_path')",
                                                "headers": {
                                                    "ReadFileMetadataFromServer": true
                                                },
                                                "host": {
                                                    "connection": {
                                                        "referenceName": "azureblob"
                                                    }
                                                },
                                                "method": "post",
                                                "path": "/v2/datasets/@{encodeURIComponent(encodeURIComponent('AccountNameFromSettings'))}/files",
                                                "queries": {
                                                    "folderPath": "/demo-container",
                                                    "name": "@{items('For_each_1')?['{FullPath}']}",
                                                    "queryParametersSingleEncoded": true
                                                }
                                            },
                                            "runAfter": {
                                                "Get_file_content_using_path": [
                                                    "SUCCEEDED"
                                                ]
                                            },
                                            "runtimeConfiguration": {
                                                "contentTransfer": {
                                                    "transferMode": "Chunked"
                                                }
                                            },
                                            "type": "ApiConnection"
                                        },
                                        "Get_file_content_using_path": {
                                            "inputs": {
                                                "host": {
                                                    "connection": {
                                                        "referenceName": "sharepointonline"
                                                    }
                                                },
                                                "method": "get",
                                                "path": "/datasets/@{encodeURIComponent(encodeURIComponent('*************'))}/GetFileContentByPath",
                                                "queries": {
                                                    "inferContentType": true,
                                                    "path": "@{items('For_each_1')?['{FullPath}']}",
                                                    "queryParametersSingleEncoded": true
                                                }
                                            },
                                            "type": "ApiConnection"
                                        }
                                    },
                                    "else": {
                                        "actions": {}
                                    },
                                    "expression": {
                                        "and": [
                                            {
                                                "equals": [
                                                    "@items('For_each_1')?['{isFolder}']",
                                                    false
                                                ]
                                            }
                                        ]
                                    },
                                    "type": "If"
                                }
                            },
                            "foreach": "@body('Get_items_')?['value']",
                            "runAfter": {
                                "Get_items_": [
                                    "SUCCEEDED"
                                ]
                            },
                            "type": "Foreach"
                        },
                        "Get_items_": {
                            "inputs": {
                                "host": {
                                    "connection": {
                                        "referenceName": "sharepointonline"
                                    }
                                },
                                "method": "get",
                                "path": "/datasets/@{encodeURIComponent(encodeURIComponent('*************'))}/tables/@{encodeURIComponent(encodeURIComponent(items('For_each')?['DisplayName']))}/items",
                                "queries": {
                                    "viewScopeOption": "RecursiveAll"
                                }
                            },
                            "type": "ApiConnection"
                        }
                    },
                    "foreach": "@body('Get_all_lists_and_libraries')?['value']",
                    "runAfter": {
                        "Get_all_lists_and_libraries": [
                            "SUCCEEDED"
                        ]
                    },
                    "type": "foreach"
                },
                "Get_all_lists_and_libraries": {
                    "inputs": {
                        "host": {
                            "connection": {
                                "referenceName": "sharepointonline"
                            }
                        },
                        "method": "get",
                        "path": "/datasets/@{encodeURIComponent(encodeURIComponent('***********'))}/alltables"
                    },
                    "runAfter": {},
                    "type": "ApiConnection"
                }
            },
            "contentVersion": "1.0.0.0",
            "outputs": {},
            "triggers": {
                "When_a_HTTP_request_is_received": {
                    "kind": "Http",
                    "type": "Request"
                }
            }
        },
        "kind": "Stateful"
    }
    
    

Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.