How to execute a pipeline just once no matter how many blobs are created? (Azure Data Factory)

Alan Gutierrez Encizo 41 Reputation points
2022-01-06T04:19:33.663+00:00

Hello there,

I've created a pipeline that is executed by a trigger every time a blob is created. The problem is that there are scenarios where the process needs to upload multiple files at the same time; when that happens, the pipeline executes as many times as there are blobs, and the resulting data is wrong. I tried to configure a Copy Data activity in the main pipeline to copy every blob created, but since that activity runs inside the triggered pipeline, it also executes many times.

Azure Synapse Analytics
Azure Data Factory

Accepted answer
  1. MartinJaffer-MSFT 26,081 Reputation points
    2022-01-06T18:58:50.697+00:00

    Hello and welcome back @Alan Gutierrez Encizo . I think I see the problem statement you have been struggling with. I have thought of a couple alternative solutions, but let me first restate your requirements and my assumptions as I understand them. Let me know if I got it wrong.

    You have an application where a user uploads 1 or many blobs in a short period of time. All of them must be processed together as one batch, otherwise incorrect data is generated. You do not know when these uploads will happen. If you did, then you could use a scheduled trigger or a tumbling window trigger. A tumbling window trigger might still be an option depending upon edge cases.

    Since you want to copy all the blobs in the folder, and you already know which folder that is, it is not necessary to get the blob names from the blob event trigger. We also do not have to worry about excluding old blobs.

    The main issue currently is that the blob event trigger happens once for each blob. I am 70% confident that cannot be changed, so work-arounds are in order.

    So one idea I had uses two pipelines, a blob event trigger, and a scheduled trigger. The blob event trigger would run a pipeline which enables/turns on the scheduled trigger and does nothing else. The scheduled trigger runs the pipeline doing all the actual work, with one addition: at the end it disables/turns off its own scheduled trigger (see the sketch below).
    After the first new blob turns the scheduled trigger on, further new blobs have no effect because the scheduled trigger is already on.
    The scheduled trigger does introduce an edge case: when the uploads are spaced out or happen during the scheduled time. I am thinking on solutions for that edge case.
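
    Inside Data Factory, the enable/disable steps are usually Web Activities calling the management REST API's trigger start and stop endpoints with the factory's managed identity. The same two calls look like this in a minimal Python sketch using the azure-mgmt-datafactory SDK; the subscription, resource group, factory, and trigger names are placeholders, and older SDK versions expose start/stop rather than begin_start/begin_stop.

    ```python
    # Sketch only: every name below is a placeholder, not taken from the question.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient

    SUBSCRIPTION_ID = "<subscription-id>"
    RESOURCE_GROUP = "<resource-group>"
    FACTORY_NAME = "<data-factory-name>"
    TRIGGER_NAME = "BatchScheduleTrigger"  # hypothetical scheduled trigger name

    client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

    def enable_schedule_trigger():
        # Run by the small pipeline that the blob event trigger starts.
        # Per the reasoning above, starting a trigger that is already on
        # has no further effect, so extra blob events do no harm.
        client.triggers.begin_start(RESOURCE_GROUP, FACTORY_NAME, TRIGGER_NAME).result()

    def disable_schedule_trigger():
        # Run as the final step of the worker pipeline, once the batch is processed.
        client.triggers.begin_stop(RESOURCE_GROUP, FACTORY_NAME, TRIGGER_NAME).result()
    ```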

    Another idea, perhaps simpler and cleaner, is to add one step to the upload process. Artificially add one last upload that happens only after all other blobs have finished uploading. This artificial last blob is used as a signal, and you can tie your blob event trigger to this 'signal' blob alone. This way the event trigger only happens once. This solution is easier to implement than the first idea, and doesn't have any downsides beyond the work to add the extra blob.
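
    If the upload process is scriptable, that extra step is small: upload all the data files first, then write one empty marker blob, and scope the blob event trigger (for example with its "Blob path ends with" filter) so it matches only that marker name. Below is a minimal sketch using the azure-storage-blob package; the connection string, container, folder, and marker names are hypothetical.

    ```python
    # Sketch only: container, folder, and marker names are made up for illustration.
    from pathlib import Path
    from azure.storage.blob import BlobServiceClient

    CONNECTION_STRING = "<storage-connection-string>"
    CONTAINER = "uploads"
    BATCH_FOLDER = "incoming/batch-001"
    MARKER_NAME = "_BATCH_COMPLETE"  # the event trigger fires only on this suffix

    service = BlobServiceClient.from_connection_string(CONNECTION_STRING)
    container = service.get_container_client(CONTAINER)

    def upload_batch(local_files):
        # 1) Upload every data file in the batch.
        for path in local_files:
            with open(path, "rb") as data:
                container.upload_blob(name=f"{BATCH_FOLDER}/{Path(path).name}", data=data, overwrite=True)

        # 2) Only after all data blobs exist, write the empty marker blob.
        #    This single blob is the only event the pipeline's trigger reacts to.
        container.upload_blob(name=f"{BATCH_FOLDER}/{MARKER_NAME}", data=b"", overwrite=True)
    ```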

    What do you think? Either of these sound appealing?


0 additional answers
