Data Factory Trigger to Pick up only the latest Files

Imran Mondal 246 Reputation points
2021-04-04T02:16:02.68+00:00

Hi Team,

My Blob storage is partitioned by yyyy-mm-dd-hh, and every half hour a new CSV file is dumped into it. I am trying to trigger my Data Factory pipeline whenever a new file becomes available in my blob storage account.

Target - Every time my ADF pipeline is triggered, I want to load only the new files, but with my current settings it loads all the available files on every run. One option was a tumbling window trigger, but that would mean running the pipeline on a fixed interval; instead, I want the pipeline to trigger automatically whenever a new file lands in the container.

Please guide me on how I can trigger my ADF pipeline to load only the new files each time.

[GIF attachment: 84228-ezgifcom-gif-maker-2.gif]
[GIF attachment: 84284-ezgifcom-gif-maker-3.gif]

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

Accepted answer
  1. KranthiPakala-MSFT 46,422 Reputation points Microsoft Employee
    2021-04-09T23:14:12.49+00:00

    Hi @Imran Mondal ,

    As per my testing:

    Output - I want to load the files into Table storage, and if a new column appears it should be loaded as well. So I removed all of the source and target schema mappings in order to load everything. The problem is that the source file has a column named Device_ID - how can I use that column as the partition key when loading to the sink table?

    If your source columns change, you can use the setting below in your copy activity sink. It will insert new entities, including any new source columns/properties, with PartitionKey = Device_ID:

    [Screenshot: 86378-image.png]
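As a hedged illustration of what the pictured sink settings amount to (property names follow ADF's Azure Table sink; Device_ID comes from the thread, the batch size is just an example), the copy activity sink might be defined like this:

```json
{
  "sink": {
    "type": "AzureTableSink",
    "azureTablePartitionKeyName": "Device_ID",
    "azureTableInsertType": "merge",
    "writeBatchSize": 10000
  }
}
```

With "merge", new properties on incoming rows are added to existing entities rather than replacing them, which matches the "load new columns as well" requirement.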

    For the second ask:

    Second - All of the columns have data type string. How can I change the data type of just one column (the datetime column), keeping in mind that the source columns are dynamic?

    I don't think this can be done in the same copy activity. By default, a property is created with type String unless you specify a different type. To explicitly type a property, specify its data type using the appropriate OData data type in an Insert Entity or Update Entity operation. For more information, see Inserting and Updating Entities.
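As a hedged sketch (not from the original thread), a raw Insert Entity payload against the Table service types a property explicitly with an @odata.type annotation; EventTime and Reading are hypothetical property names:

```json
{
  "PartitionKey": "device-001",
  "RowKey": "2021-04-04T02-00",
  "EventTime": "2021-04-04T02:00:00Z",
  "EventTime@odata.type": "Edm.DateTime",
  "Reading": "23.5"
}
```

Properties without an annotation (like Reading here) default to Edm.String, which is why everything comes through as string unless typed explicitly.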

    Hope this answers your query.

    Note: As the original query of this thread has been answered, it is always recommended to open a new thread for new queries so that the community can easily find the helpful information. :)

    ----------

    Please don’t forget to Accept Answer and Up-Vote wherever the information provided helps you, this can be beneficial to other community members.


3 additional answers

Sort by: Most helpful
  1. Vaibhav Chaudhari 38,561 Reputation points
    2021-04-04T06:53:16.6+00:00

    Have you already checked the event-based trigger? With that you get the option to trigger the pipeline as soon as a file is created or modified:

    https://learn.microsoft.com/en-us/azure/data-factory/how-to-create-event-trigger
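    A hedged sketch of what such a storage event trigger definition might look like (the trigger name, container name, and storage account scope are placeholders, not from the thread):

```json
{
  "name": "NewCsvFileTrigger",
  "properties": {
    "type": "BlobEventsTrigger",
    "typeProperties": {
      "blobPathBeginsWith": "/mycontainer/blobs/",
      "blobPathEndsWith": ".csv",
      "ignoreEmptyBlobs": true,
      "events": ["Microsoft.Storage.BlobCreated"]
    }
  }
}
```

    The blobPathBeginsWith/blobPathEndsWith filters restrict the trigger to CSV files in the chosen container, so only new files fire pipeline runs.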

    ----------

    Please don't forget to Accept Answer and Up-vote if the response helped -- Vaibhav


  2. Nandan Hegde 29,886 Reputation points MVP
    2021-04-05T03:02:32.16+00:00

    Hey @Imran ,
    Please refer to the link below:
    https://stackoverflow.com/questions/66869516/azure-data-factory-storage-event-trigger-only-on-new-files/66869669#66869669

    An event trigger will fire the ADF pipeline whenever a new file is uploaded, and you can use the trigger's built-in parameters to identify that file:

    [Screenshot: 84335-trig.png]

    Alternatively, you can use a Get Metadata activity to list the child items and then pick the file with the latest date, so that only the latest file - the one that caused the trigger - is processed.
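    A hedged sketch of the trigger-to-pipeline parameter mapping referred to above (the parameter names SourceFolder and SourceFile are illustrative, not from the thread):

```json
{
  "pipelineReference": { "referenceName": "LoadNewCsvPipeline", "type": "PipelineReference" },
  "parameters": {
    "SourceFolder": "@triggerBody().folderPath",
    "SourceFile": "@triggerBody().fileName"
  }
}
```

    @triggerBody().folderPath and @triggerBody().fileName resolve to the specific blob that raised the event, so each run sees only that one new file.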


  3. KranthiPakala-MSFT 46,422 Reputation points Microsoft Employee
    2021-04-05T19:58:01.327+00:00

    Hi @Imran Mondal ,

    Thanks for sharing the GIFs, as they help a lot in identifying the issue.
    From the GIFs, in your copy activity source settings you are using File path type = Wildcard file path, which is why your pipeline processes all files matching *.csv.

    Since you are mapping the folder path and file name from the event trigger parameters to pipeline parameters, and then from the pipeline parameters to your dataset parameters, please make sure that you use File path type = File path in the dataset, as shown below:

    [Screenshot: 84583-image.png]

    I have also noticed that while you declared the dataset parameters and mapped the pipeline parameters to them in the copy source settings, you haven't used those dataset parameters in your dataset connection settings -> File path. Please make sure to use the dataset parameters as dynamic expressions in the dataset connection settings -> File path, as below:

    [Screenshot: 84631-image.png]
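    A hedged sketch of a delimited-text dataset consuming such parameters in its file path (the dataset, parameter, and linked service names are illustrative, not from the thread):

```json
{
  "name": "SourceCsvDataset",
  "properties": {
    "type": "DelimitedText",
    "linkedServiceName": { "referenceName": "MyBlobLinkedService", "type": "LinkedServiceReference" },
    "parameters": {
      "FolderPath": { "type": "string" },
      "FileName": { "type": "string" }
    },
    "typeProperties": {
      "location": {
        "type": "AzureBlobStorageLocation",
        "folderPath": { "value": "@dataset().FolderPath", "type": "Expression" },
        "fileName": { "value": "@dataset().FileName", "type": "Expression" }
      },
      "columnDelimiter": ",",
      "firstRowAsHeader": true
    }
  }
}
```

    With the @dataset() expressions in place, the dataset reads exactly the one blob passed down from the event trigger instead of the whole folder.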

    For more details, please refer to this GitHub issue: https://github.com/MicrosoftDocs/azure-docs/issues/42345

    Note: If the dataset is also used by other pipelines, I would recommend using a separate dataset for this pipeline.

    Hope this helps resolve your issue. Do let us know how it goes.

    Thank you

    ----------

    Please don’t forget to Accept Answer and Up-Vote wherever the information provided helps you, this can be beneficial to other community members.