What is an efficient way to filter and copy files from an Azure Blob Storage container that holds millions of files using Azure Data Factory?

Mohammad Saber 591 Reputation points
2023-12-09T06:18:49.79+00:00

I want to copy files from a container in an Azure Blob Storage account that contains around 10,000,000 CSV or ZIP files.

The filename format looks like "Energy_ReportName_Timestamp_VersionNumber.zip"; a sample filename is "Energy_Payment_20231209110007_0000000404988124.zip". The VersionNumber at the end of the filename doesn't follow a regular pattern.

I want to filter zip files for a specific ReportName and Date, and copy those files to another container.

For example, files with ReportName = "Payment" and Date = 20231209 (at any time on that date and with any VersionNumber).

Since there are millions of files in the source container, I am looking for a fast way to find the desired files and copy them to the sink container using the Copy activity in Azure Data Factory.

Please let me know if you have any ideas.

Azure Blob Storage
An Azure service that stores unstructured data in the cloud as blobs.
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

Answer accepted by question author
  1. Amira Bedhiafi 40,956 Reputation points Volunteer Moderator
    2023-12-09T14:34:39.7433333+00:00

    Start by creating the linked services in ADF for both the source and sink Azure Blob Storage accounts.

    Then create two datasets in ADF - one for the source container (Container_Source) and another for the sink container (Container_Sink). In the dataset for the source container, you can specify the path to include files starting with "Energy" and ending with ".zip".

    In the Copy Data activity, configure the source dataset to point to Container_Source. Use the wildcard file path to select files starting with "Energy" and ending with ".zip". Set the sink dataset to Container_Sink.
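    One note on why the wildcard shape matters at this scale: Blob Storage can only filter a listing server-side by name prefix, so a wildcard that starts with a fixed string such as "Energy_Payment_20231209" keeps the listing fast, while anything after the first * has to be matched against the listed names. Below is a minimal sketch of that same prefix filter using the azure-storage-blob Python SDK, only as an illustration of the concept; the connection string and container name are placeholders.

    ```python
    # Sketch only (not the ADF pipeline itself): the same prefix-based filter
    # via the azure-storage-blob SDK. Connection string and container name
    # are placeholders.
    from azure.storage.blob import ContainerClient

    source = ContainerClient.from_connection_string(
        "<source-connection-string>", container_name="container-source"
    )

    # Server-side prefix filter: only blobs whose names start with the prefix
    # are returned, so we never enumerate all ~10M blobs client-side.
    prefix = "Energy_Payment_20231209"
    matching = [
        b.name
        for b in source.list_blobs(name_starts_with=prefix)
        if b.name.endswith(".zip")
    ]
    print(f"{len(matching)} files match prefix {prefix}")
    ```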

    To implement the incremental copy logic: for the first execution, simply run the pipeline to copy all existing files that match your criteria. Then, to copy only new files, I recommend using a metadata store (such as Azure SQL Database) where you log the details of the files already copied. In each subsequent run, the pipeline can check this metadata store to determine which files are new since the last run.
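    If it helps to picture that bookkeeping, here is a minimal sketch of the log-and-skip logic in Python. It is only an illustration: sqlite3 stands in for the Azure SQL Database metadata store, and the copied_files table and its columns are invented for the example; in ADF you would typically implement the equivalent with Lookup / Stored Procedure activities against the real table.

    ```python
    # Sketch only: sqlite3 stands in for the Azure SQL metadata store,
    # and the copied_files table/columns are invented for this example.
    import sqlite3

    conn = sqlite3.connect("copy_log.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS copied_files (file_name TEXT PRIMARY KEY, copied_at TEXT)"
    )

    def new_files_only(candidate_names):
        """Return only the candidates that no previous run has logged."""
        already = {row[0] for row in conn.execute("SELECT file_name FROM copied_files")}
        return [name for name in candidate_names if name not in already]

    def mark_copied(names):
        """Log files after a successful copy so the next run skips them."""
        conn.executemany(
            "INSERT OR IGNORE INTO copied_files VALUES (?, datetime('now'))",
            [(name,) for name in names],
        )
        conn.commit()
    ```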

    To filter by the last modified date, store the timestamp of the last pipeline run and configure the copy source to pick up only files that have been modified since that timestamp. You can also parameterize the pipeline to make it more flexible and schedule it to run at your desired frequency.
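    For completeness, this is roughly what that "modified since the last run" check looks like if you do it with the Storage SDK instead of the copy source's last-modified filter. The timestamp, connection string, and container name below are placeholders you would load from your metadata store or pipeline parameters.

    ```python
    # Sketch of the "modified since last run" filter; the timestamp,
    # connection string, and container name are placeholders.
    from datetime import datetime, timezone
    from azure.storage.blob import ContainerClient

    last_run = datetime(2023, 12, 9, 0, 0, tzinfo=timezone.utc)  # would come from the metadata store

    source = ContainerClient.from_connection_string(
        "<source-connection-string>", container_name="container-source"
    )

    # last_modified is a timezone-aware datetime on each listed blob.
    recent = [
        b.name
        for b in source.list_blobs(name_starts_with="Energy_Payment_")
        if b.last_modified > last_run and b.name.endswith(".zip")
    ]
    ```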


1 additional answer

  1. Nandan Hegde 36,716 Reputation points MVP Volunteer Moderator
    2023-12-11T10:12:23.79+00:00

    Hey,

    You can use binary source and sink datasets, leveraging the concept of wildcard names in the path within the Copy activity.

    So the steps would be as below:

    1. Create binary source and sink datasets that point only up to the container level.
    2. Use a Copy activity and, in the source settings, provide the file name as a wildcard in the format below (a quick SDK cross-check of this pattern is sketched after the steps):

    *Payment_20231209*.zip
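    If you want to sanity-check which blobs that wildcard matches, or perform the same copy outside ADF, a rough equivalent with the azure-storage-blob Python SDK could look like the sketch below. The connection strings and container names are placeholders, and for a private source container the copy URL would also need a SAS token appended.

    ```python
    # Cross-check of the wildcard logic outside ADF (not a replacement for the
    # Copy activity). Connection strings and container names are placeholders.
    from fnmatch import fnmatch
    from azure.storage.blob import ContainerClient

    src = ContainerClient.from_connection_string("<src-conn>", container_name="container-source")
    dst = ContainerClient.from_connection_string("<dst-conn>", container_name="container-sink")

    pattern = "*Payment_20231209*.zip"
    # A fixed prefix narrows the server-side listing; fnmatch applies the wildcard.
    for blob in src.list_blobs(name_starts_with="Energy_Payment_20231209"):
        if fnmatch(blob.name, pattern):
            source_url = f"{src.url}/{blob.name}"  # append a SAS token if the source is private
            # Asynchronous server-side copy into the sink container.
            dst.get_blob_client(blob.name).start_copy_from_url(source_url)
    ```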

