Low-Level Design of a Data Lakehouse in Azure

Relay · 180 Reputation points
2025-06-18T03:18:16+00:00

I am designing a data lake solution in Azure, as shown in the diagram below.

[Screenshot: proposed architecture diagram]

I am considering Azure Data Lake Storage for the Bronze layer. How do I automate monitoring and archiving for the Bronze layer?

How do I implement data consistency and deduplication checks?

Data ingestion from the Bronze layer is controlled by Azure Data Factory.

How should the ADF pipelines be designed to handle the volume of files posted regularly?

How should the ADF pipeline be designed to handle bad files?

Should I fail the complete batch or only the bad records, and how would I do that?

How should triggers be set up for processed and unprocessed files?

The source system, SAP CAR, provides 500 files for Germany, 300 files for the UK, and so on. How should the ADLS Gen2 folder structure be designed to distinguish these?

How do I tell, by design, whether a file from Germany or the UK has been processed or not?

Please share your thoughts.

Many Thanks


Accepted answer
Nandan Hegde · 36,151 Reputation points · MVP · Volunteer Moderator
2025-06-18T04:13:34+00:00

The architecture design differs based on the scenarios your organization needs to support.

1. What exactly do you mean by monitoring in your ADLS Gen2 account?
2. You can structure your ADLS Gen2 container as:
   - <Region> (e.g. Germany, UK)
     - Active
     - Archive

So initially, copy all raw files into the Active folder of the appropriate region, and once a file has been processed successfully, move it to that region's Archive folder, as sketched below.
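
For illustration, here is a minimal sketch of that flow using the azure-storage-file-datalake SDK: everything still sitting under `<region>/active` is treated as unprocessed, and a file is moved to `<region>/archive` once it has been processed. The account name, container name, region names and file names are placeholders, not part of the original design.

```python
# Minimal sketch, assuming a "bronze" container laid out as
# <region>/active and <region>/archive. Names are illustrative placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

ACCOUNT_URL = "https://<storage-account>.dfs.core.windows.net"  # placeholder
CONTAINER = "bronze"

service = DataLakeServiceClient(ACCOUNT_URL, credential=DefaultAzureCredential())
fs = service.get_file_system_client(CONTAINER)


def unprocessed_files(region: str) -> list[str]:
    """Everything still sitting under <region>/active is, by definition, unprocessed."""
    return [p.name for p in fs.get_paths(path=f"{region}/active") if not p.is_directory]


def archive_file(region: str, file_name: str) -> None:
    """After a file is processed successfully, move it from active to archive."""
    file_client = fs.get_file_client(f"{region}/active/{file_name}")
    # rename_file expects the new name in the form "<filesystem>/<new path>"
    file_client.rename_file(f"{CONTAINER}/{region}/archive/{file_name}")


if __name__ == "__main__":
    for f in unprocessed_files("germany"):
        print("pending:", f)
    # archive_file("germany", "sales_20250618_001.csv")  # example call
```

The same listing also answers the "processed or not" question per country: a Germany or UK file still under active has not been processed yet; once it sits under archive, it has.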

Then you can use lifecycle management in Azure Data Lake Storage Gen2 (ADLS Gen2). It allows you to define rules that automatically transition data to different access tiers (Hot, Cool, Archive) or delete data based on age or other criteria, helping to optimize storage costs and manage the data lifecycle.
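
As a sketch of what such a rule could look like, the snippet below writes a lifecycle policy targeting only the archive folders to a JSON file and shows how it could be applied with the Azure CLI. The prefixes, tiers and day thresholds are assumptions to adapt to your retention requirements.

```python
# Minimal sketch: build a lifecycle management policy for the archive folders
# and write it out as JSON. Storage account, resource group and prefixes are
# placeholders; tune the day thresholds to your own retention needs.
import json

policy = {
    "rules": [
        {
            "enabled": True,
            "name": "bronze-archive-tiering",
            "type": "Lifecycle",
            "definition": {
                "filters": {
                    "blobTypes": ["blockBlob"],
                    # Only files already moved to the archive folders.
                    "prefixMatch": ["bronze/germany/archive", "bronze/uk/archive"],
                },
                "actions": {
                    "baseBlob": {
                        "tierToCool": {"daysAfterModificationGreaterThan": 30},
                        "tierToArchive": {"daysAfterModificationGreaterThan": 90},
                        "delete": {"daysAfterModificationGreaterThan": 365},
                    }
                },
            },
        }
    ]
}

with open("policy.json", "w") as f:
    json.dump(policy, f, indent=2)

# Apply with:
#   az storage account management-policy create \
#     --account-name <storage-account> --resource-group <rg> --policy @policy.json
```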

For data validation, unfortunately ADF does not provide native support. You can integrate with an Azure Function or Databricks and leverage Great Expectations for data validations:

https://greatexpectations.io/
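
A minimal sketch of such a validation step follows, assuming the older pandas-based Great Expectations API (newer releases use a context/fluent API instead) and illustrative column names such as transaction_id and net_amount:

```python
# Minimal sketch of deduplication / consistency checks with the classic
# pandas-based Great Expectations API (pre-1.0). File path and column names
# are placeholders for your SAP CAR file layout.
import great_expectations as ge
import pandas as pd

df = pd.read_csv("germany/active/sales_20250618_001.csv")  # placeholder path
batch = ge.from_pandas(df)

# Deduplication: the business key must be unique within the file.
batch.expect_column_values_to_be_unique("transaction_id")
# Consistency: required fields present, amounts within a sane range.
batch.expect_column_values_to_not_be_null("store_id")
batch.expect_column_values_to_be_between("net_amount", min_value=0)

result = batch.validate()
if not result["success"]:
    # Route the file to a rejected/quarantine folder, or fail the pipeline,
    # depending on the batch-vs-record decision discussed below.
    raise ValueError(f"Validation failed: {result}")
```

Run this in the Databricks or Azure Function step before the Silver load, so that a failing file can be quarantined or the pipeline failed, in line with the batch-versus-record decision below.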

The Copy activity in ADF has a fault-tolerance feature that lets a job proceed even when there are issues in your files; the erroneous records are skipped and logged for review. Whether to continue or fail the whole batch, and whether there should be dependencies between regional files, depends on the business and the end goal of the data (for example, whether all regions must be reflected at the same time or each region can be processed independently).
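
For reference, the fault-tolerance settings live in the Copy activity's typeProperties; the fragment below shows them as a Python dict mirroring the pipeline JSON. "AzureDataLakeStorageLS" and the quarantine path are placeholders for your own linked service and folder.

```python
# Sketch of the fault-tolerance portion of a Copy activity, expressed as a
# Python dict that mirrors the pipeline JSON. Linked service and path are
# placeholders.
copy_activity_type_properties = {
    "source": {"type": "DelimitedTextSource"},
    "sink": {"type": "ParquetSink"},
    # Skip incompatible rows instead of failing the whole activity.
    "enableSkipIncompatibleRow": True,
    # Write the skipped rows to a quarantine location for later reprocessing.
    "redirectIncompatibleRowSettings": {
        "linkedServiceName": {
            "referenceName": "AzureDataLakeStorageLS",
            "type": "LinkedServiceReference",
        },
        "path": "quarantine/bronze",
    },
}
```

If the business decision is all-or-nothing per batch, leave this setting off so the activity fails and the whole batch can be rerun once the bad file is fixed.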

    1 person found this answer helpful.
