Best way to find new rows in parquet file to insert into table

PeterSh 171 Reputation points
2022-03-30T03:51:33.997+00:00

I get several JSON files each day which are processed (flattened and a few calculated fields) and stored in a parquet file. There are millions of rows in the parquet file, and each update cycle adds 50-60 thousand additional rows.

One process that runs on this file keeps an SQL table up to date with the changes. The SQL data is mostly the same but has some additional data and calculations to make querying easier.

My question relates to the best way to find only new rows to add to the SQL table.

At present, I run a data flow that has both the parquet file and target table as sources, then uses the 'Exists' component to find rows that don't exist, and adds them to the table.

My concern is that this takes 40+ minutes to run. This seems to be split between around 20 minutes to read the SQL table, then 20 minutes to mode the data from a staging table into the target table.

Does anyone have any suggestions on how to do this better?

Thanks

Azure Data Factory
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.
8,505 questions
0 comments No comments
{count} votes

1 answer

Sort by: Most helpful
  1. ShaikMaheer-MSFT 35,796 Reputation points Microsoft Employee
    2022-03-31T17:04:29.853+00:00

    Hi @PeterSh ,

    Thank you for posting query in Microsoft Q&A Platform.

    To summarize your ask, its more about best way to find only new rows form parquet file to process them to SQL table. Please correct me if I am wrong.

    Since you are loading all rows in to single file, you are ending up with using source and target as source transformations and using exists transformation to filter out new rows.

    You could consider either of below approaches if that helps.

    • Save your new rows every time as a separate file with some datetime in file name. So that we can take only that file and process it in to SQL Table.
    • Consider having some flag column or date Time column which tells us that these are new rows. So that we can use source transformation for our source file and then Filter transformation to filter out new rows based on flag or datetime value.

    Hope this helps. Please let us know how it goes. Thank you.