Delete data between date range in Data Flow

Yash Tamakuwala
2022-01-04T02:33:48.1+00:00

I have a data flow that reads from Parquet files, applies some filtering, and then loads the result into a Delta Lake. The data flow will run multiple times, and I don't want duplicate data in my Delta Lake. To guard against this, I plan to implement a delete-insert mechanism: find the minimum and maximum date of the incoming data and delete all data in the destination (Delta) that falls within this range. Once those rows are deleted, all filtered incoming data would be inserted into the Delta Lake.
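To make the intended behavior concrete, here is a minimal Python sketch of the delete-insert idea on plain in-memory rows. It is not Data Flow or Delta code; the function name `delete_insert`, the `value` column, and the sample rows are hypothetical and only illustrate the logic described above.

```python
from datetime import date


def delete_insert(target, incoming):
    """Drop target rows inside the incoming date range, then append incoming."""
    if not incoming:
        # Nothing to load, so nothing to delete either.
        return list(target)
    lo = min(row["date"] for row in incoming)
    hi = max(row["date"] for row in incoming)
    # Keep only the rows that fall outside the inclusive [lo, hi] range.
    kept = [row for row in target if not (lo <= row["date"] <= hi)]
    return kept + list(incoming)


# Hypothetical sample data: one target row overlaps the incoming range.
target = [
    {"date": date(2021, 12, 21), "value": 1},
    {"date": date(2021, 12, 22), "value": 2},
]
incoming = [
    {"date": date(2021, 12, 22), "value": 20},
    {"date": date(2021, 12, 23), "value": 30},
]

result = delete_insert(target, incoming)
rerun = delete_insert(result, incoming)
```

Running the load a second time with the same incoming rows leaves the result unchanged, which is the property I am after.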

From the documentation, I saw that I need to add row-level policies in an Alter Row transformation to mark a particular row for deletion. I added the following Delete-If condition: between(toDate(date, 'MM/dd/yyyy'), toDate("2021-12-22T01:49:57", 'MM/dd/yyyy'), toDate("2021-12-23T01:49:57", 'MM/dd/yyyy')), where date is a column in the incoming data. I am casting it with toDate because it is a string in the incoming dataset.

However, in the data preview of the Alter Row transformation, all rows are marked for insertion and none for deletion, even though there definitely are records that fall within that range.

I suspect that either the Delete-If condition does not work the way I expect or I am doing something wrong in the data type conversion. In that case, how do I implement deletion over a date range in Data Flow with Delta as the destination?
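One way to probe the type-conversion suspicion outside Data Flow is a small Python analogue. This is only a sketch: `to_date` and `between` are hypothetical helpers, not the Data Flow functions, and the sketch assumes that a format mismatch produces a missing value and that a range check with a missing bound matches nothing. Whether Data Flow behaves the same way is an assumption to verify. The row value "12/22/2021" is made-up sample data in the 'MM/dd/yyyy' layout.

```python
from datetime import datetime


def to_date(text, fmt):
    """Parse text with fmt; return None on a mismatch (assumed null-like result)."""
    try:
        return datetime.strptime(text, fmt).date()
    except ValueError:
        return None


def between(value, lo, hi):
    """Inclusive range check that is False whenever any operand is missing."""
    if value is None or lo is None or hi is None:
        return False
    return lo <= value <= hi


# Hypothetical column value in the month/day/year layout.
row_date = to_date("12/22/2021", "%m/%d/%Y")

# Bounds parsed with the same month/day/year pattern as the column:
# the ISO-style literal does not fit that pattern, so parsing fails.
bad_lo = to_date("2021-12-22T01:49:57", "%m/%d/%Y")
bad_hi = to_date("2021-12-23T01:49:57", "%m/%d/%Y")

# Bounds parsed with a pattern that matches the ISO-style literal.
good_lo = to_date("2021-12-22T01:49:57", "%Y-%m-%dT%H:%M:%S")
good_hi = to_date("2021-12-23T01:49:57", "%Y-%m-%dT%H:%M:%S")

mismatched = between(row_date, bad_lo, bad_hi)
matched = between(row_date, good_lo, good_hi)
```

In this analogue the row is only recognized as in range when the boundary literals are parsed with a pattern that matches their actual layout, so the format strings on the two boundary values may be worth checking first.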

Azure Data Factory