How to skip/ignore zip files that have been unzipped successfully before?

Question

I want to use this setup in my work environment: users put zip files into Azure Blob Storage, and Azure Data Factory automatically extracts/unzips them every day at a specific time.

I followed the instructions in the following YouTube tutorial:

"How to UnZip Multiple Files which are stored on Azure Blob Storage By using Azure Data Factory"

https://www.youtube.com/watch?v=TEtpvdnULZ8

I found that these settings cause the system to extract every zip file (including zip files that have been extracted before) from directory A to directory B every time it runs.

If a zip file has already been extracted to directory B, it is extracted again and the existing files are replaced. This causes unnecessary processing time in my situation, where directory A holds 1000+ zip files and only around 10 new zip files are uploaded daily.

So I want to find a way to configure it to extract only new zip files that have not been extracted yet, while skipping/ignoring the 1000+ already extracted zip files.

Thank you very much for checking out this question, and feel free to suggest any solution you have in mind.

At the moment, I'm trying to find a way for Azure Data Factory to read the names of all folders in directory B (the destination directory) to determine which zip files have already been extracted, and then skip those zip files.

Another method I have in mind is to move all zip files out of directory A to somewhere else, to avoid extracting old files again.
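
To illustrate the first idea (comparing names), here is a minimal Python sketch of the logic only, not an Azure Data Factory configuration. It assumes each zip file is extracted into a folder in directory B named after the zip file without its extension; that naming is an assumption, so the comparison would need adjusting if the folder names keep the .zip suffix. The function name and the sample listings are hypothetical.

```python
from pathlib import PurePosixPath


def pending_zips(zip_names, extracted_folders):
    """Return the zip files whose base name has no matching folder in directory B."""
    done = set(extracted_folders)
    # Assumption: "sales_2024.zip" is extracted into a folder named "sales_2024".
    return [name for name in zip_names if PurePosixPath(name).stem not in done]


# Hypothetical listings of directory A (zip files) and directory B (folders).
zips_in_a = ["sales_2023.zip", "sales_2024.zip", "inventory.zip"]
folders_in_b = ["sales_2023", "inventory"]

print(pending_zips(zips_in_a, folders_in_b))  # ['sales_2024.zip']
```

Only the names returned by this comparison would then be passed on to the extraction step.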

Accepted Answer

Hi @Niramit Cheunprapanusorn

Thanks for using the MS Q&A portal to post your question.

As you mentioned in your last line, it is better to move the processed files to a separate folder, such as an archive folder, so that they are not reprocessed. By "move" we mean: after the zip file is extracted, use another Copy activity to copy the file to the archive folder, then add a Delete activity to delete it from the source folder.

Alternatively, you may consider using a storage event trigger to run the pipeline whenever a new zip file is added.
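
The archive approach described above (extract, then copy to archive and delete from source) can be sketched in Python as follows. This is only an illustration of the control flow: local directories stand in for the blob containers, and the function and folder names are hypothetical. In Data Factory itself the same steps would be built from activities rather than code.

```python
import shutil
import zipfile
from pathlib import Path


def extract_and_archive(inbox: Path, dest: Path, archive: Path) -> list:
    """Extract every zip in inbox into dest/<zip name>/, then move the zip to archive."""
    dest.mkdir(parents=True, exist_ok=True)
    archive.mkdir(parents=True, exist_ok=True)
    processed = []
    for zip_path in sorted(inbox.glob("*.zip")):
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(dest / zip_path.stem)
        # Move only after extraction succeeded, so a file that fails
        # stays in the inbox and is retried on the next run.
        shutil.move(str(zip_path), str(archive / zip_path.name))
        processed.append(zip_path.name)
    return processed
```

Because every processed zip leaves the source folder, each scheduled run only sees the files uploaded since the previous run, so the 1000+ older files are never scanned or extracted again.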
