Incrementally Load Data to Azure Data Lake

Subin Pius 26 Reputation points
2021-06-07T11:44:18.787+00:00

Hi,

I understand the concept of incremental loading into a data lake, with each day's data stored as a separate file in Data Lake Storage.

My question is how to handle source records that are updated, rather than inserted, during an incremental load to Data Lake Storage.

For example, say I have a record from the requests table in an on-premises SQL Server database with the status "open".
When the ADF pipeline runs today, this record is stored in Data Lake Storage in a CSV file.
Tomorrow the status of the record changes to "pending", and when the ADF pipeline runs again, the modified record is read and copied into a new file in Data Lake Storage with the status "pending".

Now, in Data Lake Storage, I have two files containing the same request record but with different statuses.

How can I handle such scenarios in Data Lake Storage so that I end up with a single record and no duplicates?

Azure Data Lake Storage
An Azure service that provides an enterprise-wide hyper-scale repository for big data analytic workloads and is integrated with Azure Blob Storage.
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

Accepted answer
  1. Nasreen Akter 10,811 Reputation points Volunteer Moderator
    2021-06-07T16:16:16.927+00:00

    Hi @Subin Pius ,

    Thank you for using MS Q&A.

    I think you can do the following:

    Option #1: Keep an updated_time column on each record. When the consumer process picks up the data from the data lake, it sorts the records by updated_time for each recordId and processes only the latest row for that recordId.
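
As a rough illustration of option #1, here is a minimal Python sketch of what the consumer side might do. It assumes each daily CSV file carries a `recordId` column and an ISO 8601 `updated_time` column; the function name `latest_per_record` and the sample data are hypothetical, and the CSV content is passed in as strings so the sketch stays independent of any storage SDK.

```python
import csv
import io
from datetime import datetime


def latest_per_record(csv_texts, key="recordId", ts="updated_time"):
    """Return a dict with one row per key, keeping the newest row by timestamp."""
    latest = {}
    for text in csv_texts:
        for row in csv.DictReader(io.StringIO(text)):
            seen = latest.get(row[key])
            # Keep the row if it is the first for this key or newer than the stored one.
            if seen is None or (
                datetime.fromisoformat(row[ts]) > datetime.fromisoformat(seen[ts])
            ):
                latest[row[key]] = row
    return latest


# Hypothetical daily extracts: record 101 changes from open to pending on day 2.
day1 = (
    "recordId,status,updated_time\n"
    "101,open,2021-06-07T09:00:00\n"
    "102,open,2021-06-07T09:30:00\n"
)
day2 = (
    "recordId,status,updated_time\n"
    "101,pending,2021-06-08T10:15:00\n"
)

current = latest_per_record([day1, day2])
print(current["101"]["status"])  # pending
print(current["102"]["status"])  # open
```

The raw files in the lake stay untouched as a history of changes; only the consumer's view is collapsed to one row per `recordId`.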

    Option #2: If it's a full load each time, you can put a timestamp in the filename, e.g., 20210607 (that is, yyyyMMdd), or maintain a folder hierarchy for the CSV files. Then let the consumer process pick up only the latest file.
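
For option #2, the "pick up only the latest file" step could look something like the sketch below. It assumes the file names embed a date as eight digits in yyyyMMdd form; the paths, the regular expression, and the helper name `latest_file` are illustrative only, and listing the files from the data lake is left out.

```python
import re
from datetime import datetime

# Assumption: the date appears as eight consecutive digits in the file name.
DATE_IN_NAME = re.compile(r"(\d{8})")


def latest_file(paths):
    """Return the path whose file name carries the most recent yyyyMMdd date."""
    dated = []
    for path in paths:
        file_name = path.rsplit("/", 1)[-1]
        match = DATE_IN_NAME.search(file_name)
        if match is None:
            continue  # skip files without a date in the name
        try:
            stamp = datetime.strptime(match.group(1), "%Y%m%d")
        except ValueError:
            continue  # eight digits that are not a valid calendar date
        dated.append((stamp, path))
    if not dated:
        raise ValueError("no dated files found")
    return max(dated)[1]


# Hypothetical listing of a folder holding full-load extracts.
files = [
    "requests/requests_20210606.csv",
    "requests/requests_20210607.csv",
    "requests/readme.txt",
]
print(latest_file(files))  # requests/requests_20210607.csv
```

Parsing the date rather than comparing file names as plain strings keeps the choice correct even if the prefix of the file name changes between runs.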

    Hope this will help. Thanks!

    --Nasreen

