Azure Functions vs Azure Data Factory for CSV file processing

Ershad Nozari 421 Reputation points
2022-08-18T03:11:01.52+00:00

We have a requirement where we receive CSV files in a blob storage container. From there, we have logic that matches the CSV files based on file name and on the records within the files (i.e. similar to a SQL join operation). These files are direct dumps from DB tables. For instance, for an Employee entity we receive 2 files: one containing Employee information and another containing other Employee-related details. In the DB this corresponds to 2 tables, which we receive direct dumps of.

In addition, we need to take the currently received batch (again joining the files based on file name and the records they contain) and compare its content with the previous batch to calculate any deltas, i.e. which records have been Added/Updated/Deleted between batches.

We then store the outcome (delta records) in a separate storage account for further processing.

As it stands, we perform this logic in a Function App, but we are considering doing the delta processing in Azure Data Factory instead, i.e. having ADF match the CSV files, join the records and do the batch comparison to produce the delta records.

We don’t have any control over how the source system sends us the data.

I’m looking for recommendations on the viability of using ADF (or alternatives).

Appreciate any pointers, thoughts and recommendations.

Cheers.

Azure Functions
An Azure service that provides an event-driven serverless compute platform.
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

1 answer

  1. HimanshuSinha-msft 19,381 Reputation points Microsoft Employee
    2022-08-18T22:14:05.953+00:00

    Hello @Ershad Nozari ,
    Thanks for the question and for using the MS Q&A platform.
    As we understand it, the ask here is for a recommendation on joining two different datasets (in two different files), then finding the delta and processing it. Please do let us know if that is not accurate.

    As I understand it, you have at least two options.
    Option #1
    Use mapping data flow (MDF): MDF is used mostly for transformation. You can read the data from the two files, join them and then use a function like CRC32 to compute a fingerprint of each row of the incoming files. Do the same on the previous files, then compare the fingerprints to determine the delta.
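
As a rough illustration of the join-then-fingerprint idea in Option #1, here is a minimal sketch in plain Python (not mapping data flow expressions). The column names `EmployeeId`, `Name` and `Dept`, the sample data and the helper names are all assumptions made up for the example; the real files will differ.

```python
import csv
import io
import zlib


def read_keyed(csv_text, key):
    """Parse CSV text into {key value: row dict}."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(csv_text))}


def join_and_fingerprint(employees_csv, details_csv, key="EmployeeId"):
    """Inner-join two CSV dumps on `key`; return {key: CRC32 of the joined row}."""
    employees = read_keyed(employees_csv, key)
    details = read_keyed(details_csv, key)
    fingerprints = {}
    for k in employees.keys() & details.keys():
        joined = {**employees[k], **details[k]}
        # Sort column names so the fingerprint does not depend on column order.
        canonical = "\x1f".join(f"{col}={joined[col]}" for col in sorted(joined))
        fingerprints[k] = zlib.crc32(canonical.encode("utf-8"))
    return fingerprints


# Hypothetical sample batches: only employee 2's department changes.
employees = "EmployeeId,Name\n1,Ann\n2,Bob\n"
details_previous = "EmployeeId,Dept\n1,HR\n2,IT\n"
details_current = "EmployeeId,Dept\n1,HR\n2,Ops\n"

before = join_and_fingerprint(employees, details_previous)
after = join_and_fingerprint(employees, details_current)
changed = sorted(k for k in before if before[k] != after[k])
print(changed)
```

CRC32 is only 32 bits wide, so two different rows can occasionally share a fingerprint on large tables; a wider hash reduces that risk at the cost of a longer value to store and compare.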

    Option #2
    You can also use Synapse Analytics. It offers a Spark pool (which runs on Apache Spark), where you can read the files, join the dataframes and use a hash function to determine the delta.
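
Whichever engine computes the hashes, the delta step itself comes down to comparing keyed row hashes between the previous and current batch. The sketch below shows that classification in plain Python rather than Spark code; the record layout, the choice of SHA-256 and the function names are assumptions for illustration only.

```python
import hashlib


def row_hash(row):
    """Hash a row dict in a column-order-independent way."""
    canonical = "\x1f".join(f"{col}={row[col]}" for col in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def compute_delta(previous, current):
    """Classify keys as added, updated or deleted between two batches.

    `previous` and `current` map a record key to its (already joined) row dict.
    """
    prev_hashes = {k: row_hash(r) for k, r in previous.items()}
    curr_hashes = {k: row_hash(r) for k, r in current.items()}
    shared = prev_hashes.keys() & curr_hashes.keys()
    return {
        "added": sorted(curr_hashes.keys() - prev_hashes.keys()),
        "updated": sorted(k for k in shared if prev_hashes[k] != curr_hashes[k]),
        "deleted": sorted(prev_hashes.keys() - curr_hashes.keys()),
    }


# Hypothetical joined rows from two consecutive batches.
previous = {
    "1": {"Name": "Ann", "Dept": "HR"},
    "2": {"Name": "Bob", "Dept": "IT"},
    "3": {"Name": "Cy", "Dept": "Ops"},
}
current = {
    "1": {"Name": "Ann", "Dept": "HR"},
    "2": {"Name": "Bob", "Dept": "Ops"},
    "4": {"Name": "Di", "Dept": "IT"},
}

delta = compute_delta(previous, current)
print(delta)
```

In Spark the same comparison would typically be a full outer join of the two hashed dataframes on the record key, with a null on either side marking an added or deleted record.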

    Please do let me know if you have any queries.
    Thanks
    Himanshu

