Azure Functions vs Azure Data Factory for CSV file processing

Question


Ershad Nozari 426

We have a requirement where we receive CSV files in a blob storage container, and we have logic that matches the CSV files based on file name and on the records within the files (i.e. similar to a SQL join operation). These files are direct dumps from DB tables. For instance, for an Employee entity, we receive 2 files: one containing Employee information and another containing other Employee-related details. In the DB this would correspond to 2 tables, which we are receiving direct dumps of.
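To make the join concrete, here is a minimal Python sketch of matching two such files on an ID column. The column names (`EmployeeId`, `Name`, `Department`) and the sample rows are hypothetical, and the sketch assumes each ID appears at most once per file.

```python
import csv
import io


def read_by_id(csv_text, key="EmployeeId"):
    """Index the rows of one CSV dump by its ID column (assumed unique)."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(csv_text))}


def inner_join(left, right):
    """Merge rows that share an ID, like a SQL INNER JOIN on that ID."""
    return {k: {**left[k], **right[k]} for k in left.keys() & right.keys()}


# Hypothetical file contents standing in for the two table dumps.
employees = "EmployeeId,Name\n1,Ann\n2,Bob\n"
details = "EmployeeId,Department\n1,Finance\n2,IT\n3,HR\n"

joined = inner_join(read_by_id(employees), read_by_id(details))
```

ID 3 is dropped here because it has no matching Employee row; a left or full outer join would be a different choice, depending on how unmatched records should be treated.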

In addition, we need to take the currently received batch (again joining the files based on file name and the records they contain) and compare its content with the previous batch to calculate any deltas, i.e. which records have been Added/Updated/Deleted between batches.

We then store the outcome (delta records) in a separate storage account for further processing.

As it stands, we are performing the logic in a Function App, but we are considering doing the delta processing in Azure Data Factory instead, i.e. having ADF match the CSV files, join the records and do the batch comparison to produce the delta records.

We don’t have any control over how the source system sends us the data.

I’m looking for recommendations on the viability of using ADF (or alternatives).

Appreciate any pointers, thoughts and recommendations.

Cheers.

1 answer


Answer 1

Hello @Ershad Nozari ,
Thanks for the question and for using the MS Q&A platform.
As we understand it, the ask here is for a recommendation on joining two different datasets (in two different files), then finding the delta and processing it. Please do let us know if this is not accurate.

As I understand it, you have at least two options.
Option #1
Use a mapping data flow (MDF): MDF is used mostly for transformation. You can read the data from the two files, join them, and then use a function like CRC32 to compute a fingerprint for each row of the incoming files. Do the same on the previous batch's files, then compare the fingerprints to determine the delta.
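As a rough illustration of the fingerprint idea only, here is a short Python sketch. It uses Python's `zlib.crc32` as a stand-in and is not the mapping data flow expression itself; the column names and rows are hypothetical.

```python
import zlib


def fingerprint(row, columns):
    """CRC32 over the row's values, taken in a fixed column order."""
    # A separator keeps ("ab", "c") from hashing the same as ("a", "bc").
    payload = "\x1f".join(row[c] for c in columns)
    return zlib.crc32(payload.encode("utf-8"))


# Hypothetical joined rows for the same employee in two batches.
columns = ["EmployeeId", "Name", "Department"]
previous = {"EmployeeId": "1", "Name": "Ann", "Department": "IT"}
current = {"EmployeeId": "1", "Name": "Ann", "Department": "HR"}

# Different fingerprints for the same ID indicate an updated record.
changed = fingerprint(previous, columns) != fingerprint(current, columns)
```

The fingerprint replaces a field-by-field comparison with a single value per row, so only the ID and the fingerprint of the previous batch need to be kept. CRC32 is only 32 bits wide, so two different rows can occasionally share a fingerprint; a wider hash such as SHA-256 lowers that risk.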

Option #2
You can also use Synapse Analytics. It offers Spark pools (which run on Apache Spark), where you can read the files, join the dataframes, and use a hash function to determine the delta.

Please do let me know if you have any queries.
Thanks
Himanshu


Ershad Nozari 426 Reputation points

2022-08-19T09:14:04.403+00:00

@HimanshuSinha-msft thanks for the response. To reiterate, every time we receive a batch, which consists of 2 files, we need to match the two files based on file name and join the records inside them on some ID, just like a standard SQL join. We do the same thing with the previous batch: join the 2 files and their content. Then we perform the delta calculation.

I’m not following how the fingerprint (Option 1) and the hash (Option 2) would be used to determine the delta records. The delta calculation is based on IDs. For instance:

if an employee with ID x exists in the current batch but not the previous batch, then it’s an Add operation

if an employee with ID x does not exist in the current batch but exists in the previous batch, then it’s a Delete operation

if any of the fields for an employee with ID x are not equal between batches, then it’s an Update operation

Cheers,
Ershad
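The three ID-based rules in the comment above can be sketched in plain Python as follows. This is a minimal sketch and not the poster's Function App code: it assumes each batch has already been joined into a dictionary keyed by employee ID, and the field names and sample data are hypothetical.

```python
def compute_delta(previous, current):
    """Classify records as added, deleted or updated between two batches.

    Both arguments map an ID to its joined row (a dict of field values).
    """
    # In the current batch but not the previous one: Add.
    added = {k: current[k] for k in current.keys() - previous.keys()}
    # In the previous batch but not the current one: Delete.
    deleted = {k: previous[k] for k in previous.keys() - current.keys()}
    # In both batches with at least one differing field: Update.
    updated = {
        k: current[k]
        for k in current.keys() & previous.keys()
        if current[k] != previous[k]
    }
    return added, deleted, updated


# Hypothetical joined batches keyed by employee ID.
previous_batch = {
    "1": {"Name": "Ann", "Department": "Finance"},
    "2": {"Name": "Bob", "Department": "IT"},
}
current_batch = {
    "1": {"Name": "Ann", "Department": "Sales"},
    "3": {"Name": "Cy", "Department": "HR"},
}

added, deleted, updated = compute_delta(previous_batch, current_batch)
```

The `current[k] != previous[k]` comparison is where a row fingerprint or hash could be substituted: comparing one stored hash per ID gives the same Update decision without keeping every field of the previous batch.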
