Do you keep historical data in the curated zone of the Azure Data Lake storage folders?

berean_100 6 Reputation points
2022-03-08T05:32:30.957+00:00

I'm storing data in ADLS zones (Raw > Staging > Curated) which will be fed to a data warehouse. Think for example of Customer data from a CRM application:

Raw\CRM\Customer\2022\03\05\raw_crm_customer_2022_03_05.csv containing entries
1, John Smith, 100
2, Mario Castillo, 200

Raw\CRM\Customer\2022\03\06\raw_crm_customer_2022_03_06.csv containing entries
2, Mario Castillo, 300  // record has been modified
3, Mary Tyler, 500      // new record on 6 March

So what should I store in the curated zone? What is the best naming convention?

Do I store only the latest modified version of each record, or do I keep historical versions as well? For example, aggregating on the last-modified record would give:

1, John Smith, 100
2, Mario Castillo, 300
3, Mary Tyler, 500

And what is the naming convention? Is it curated\CRM\Customer ?

My plan is to load historical data in the data warehouse from the curated zone.


1 answer

  1. PRADEEPCHEEKATLA 90,661 Reputation points Moderator
    2022-03-09T05:18:37.797+00:00

    Hello @berean_100 ,

    Thanks for the question and using MS Q&A platform.

    It's important to plan the structure of your data before you land it in a data lake. Planning ahead lets you apply security, partitioning, and processing effectively. The three-data-lake layout outlined here is a starting point for a data management and analytics scenario.

    The three data lake accounts should align to the typical zones within a data lake.


    Note: There should be a single container per data lake zone.


    Raw zone or data lake one

    Using the water-based analogy, think of this layer as a reservoir that stores data in its natural, original state: unfiltered and unpurified. You might choose to keep the data in its original format, such as JSON or CSV, but there are scenarios where it makes sense to store it in a compressed format such as Avro, Parquet, or Databricks Delta Lake.
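
    For the raw zone, a date-partitioned folder layout like the one in your question works well, because each day's load lands in its own immutable folder. A minimal sketch of that convention (`raw_path` is a hypothetical helper, not an Azure API):

    ```python
    from datetime import date
    from pathlib import PurePosixPath

    def raw_path(source: str, entity: str, load_date: date) -> str:
        """Build a date-partitioned path in the Raw zone, mirroring the
        convention Raw/<Source>/<Entity>/<YYYY>/<MM>/<DD>/raw_<source>_<entity>_<date>.csv."""
        folder = PurePosixPath(
            "Raw", source, entity,
            f"{load_date:%Y}", f"{load_date:%m}", f"{load_date:%d}",
        )
        filename = f"raw_{source.lower()}_{entity.lower()}_{load_date:%Y_%m_%d}.csv"
        return str(folder / filename)

    print(raw_path("CRM", "Customer", date(2022, 3, 5)))
    # Raw/CRM/Customer/2022/03/05/raw_crm_customer_2022_03_05.csv
    ```

    Partitioning by year/month/day also makes it cheap to reprocess or expire a single day's data later.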

    Enriched zone or data lake two

    The next layer can be thought of as a filtration zone that removes impurities but may also involve enrichment.
    Typical activities found in this layer are schema and data type definition, removal of unnecessary columns, and application of cleaning rules such as validation, standardization, and harmonization. Enrichment processes may also combine data sets to further improve the value of insights.

    Curated zone or data lake two

    The curated zone or data lake two is the consumption layer. It's optimized for analytics rather than data ingestion or data processing. It might store data in de-normalized data marts or star schemas.

    Data is taken from the enriched layer and transformed into high-value data products that are served to the consumers of the data, such as BI analysts and data scientists. This data has structure and can be served to consumers either as-is (for example, to data science notebooks) or through another read data store such as Azure SQL Database.
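
    For your specific question, "keep only the latest version of each record" in the curated zone can be sketched like this: fold the daily batches in date order and let later versions overwrite earlier ones per key. This is plain Python for illustration only; in practice the same logic would be a window function or a Delta Lake MERGE in Spark:

    ```python
    # Daily raw batches as (load_date, rows), in ascending date order.
    # Rows are (customer_id, name, amount), matching the question's sample data.
    daily_batches = [
        ("2022-03-05", [(1, "John Smith", 100), (2, "Mario Castillo", 200)]),
        ("2022-03-06", [(2, "Mario Castillo", 300), (3, "Mary Tyler", 500)]),
    ]

    def latest_versions(batches):
        """Return the last-seen version of each record, keyed by customer_id."""
        current = {}
        for load_date, rows in batches:
            for cid, name, amount in rows:
                current[cid] = (name, amount, load_date)  # later batches win
        return current

    for cid, (name, amount, d) in sorted(latest_versions(daily_batches).items()):
        print(cid, name, amount, d)
    ```

    If the warehouse needs history as well (e.g. slowly changing dimensions), you would instead append every version with its load date and let the warehouse pick the current row, rather than overwriting in the curated zone.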

    Workspace zone or data lake three

    Along with the data that's ingested by the data integration team from the source, the consumers of the data can also bring other useful datasets.
    In this scenario, the data platform should allocate a workspace for these consumers so they can use the curated data along with the other datasets they bring, to generate valuable insights. For example, if a data science team wants to determine the product placement strategy for a new region, they can bring other datasets such as customer demographics and usage data of similar products from that region. This high-value sales insights data can be used to analyze the product market fit and the offering strategy.

    For more details, refer to Provision three Azure Data Lake Storage Gen2 accounts for each data landing zone

    Hope this helps. Please let us know if you have any further queries.

