Do you keep historical data in the curated zone of the Azure Data Lake storage folders?

berean_100 6 Reputation points
2022-03-08T05:32:30.957+00:00

I'm storing data in ADLS zones (Raw > Staging > Curated) which will be fed to a data warehouse. Think for example of Customer data from a CRM application:

Raw\CRM\Customer\2022\03\05\raw_crm_customer_2022_03_05.csv containing entries
1, John Smith, 100
2, Mario Castillo, 200

Raw\CRM\Customer\2022\03\06\raw_crm_customer_2022_03_06.csv containing entries
2, Mario Castillo, 300  // record has been modified
3, Mary Tyler, 500      // new record on 6 March

So what should I store in the curated zone? What is the best naming convention?

Do I store only the latest modified version of each record, or do I keep historical versions as well? For example, aggregating on the last-modified record would give:

1, John Smith, 100
2, Mario Castillo, 300
3, Mary Tyler, 500

And what is the naming convention? Is it curated\CRM\Customer ?

My plan is to load historical data in the data warehouse from the curated zone.


1 answer

  1. PRADEEPCHEEKATLA 90,661 Reputation points Moderator
    2022-03-09T05:18:37.797+00:00

    Hello @berean_100 ,

    Thanks for the question and using MS Q&A platform.

    It's important to plan the structure of your data before you land it in a data lake. Planning ahead lets you apply security, partitioning, and processing effectively. The three-data-lake layout outlined here is a starting point for a data management and analytics scenario.

    The three data lake accounts should align to the typical zones within a data lake.


    Note: There should be a single container per data lake zone.


    Raw zone or data lake one

    Using the water-based analogy, think of this layer as a reservoir that stores data in its natural, original state: unfiltered and unpurified. You might choose to keep the data in its original format, such as JSON or CSV, but there are scenarios where it makes sense to store it in a compressed format such as Avro, Parquet, or Databricks Delta Lake.
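
    For the raw zone, a date-partitioned folder layout like the one in your question works well, because each day's load lands in its own immutable folder. A minimal sketch of that convention (`raw_path` is a hypothetical helper, not an Azure API):

    ```python
    from datetime import date
    from pathlib import PurePosixPath

    def raw_path(source: str, entity: str, load_date: date) -> str:
        """Build a date-partitioned path in the Raw zone, mirroring the
        convention Raw/<Source>/<Entity>/<YYYY>/<MM>/<DD>/raw_<source>_<entity>_<date>.csv."""
        folder = PurePosixPath(
            "Raw", source, entity,
            f"{load_date:%Y}", f"{load_date:%m}", f"{load_date:%d}",
        )
        filename = f"raw_{source.lower()}_{entity.lower()}_{load_date:%Y_%m_%d}.csv"
        return str(folder / filename)

    print(raw_path("CRM", "Customer", date(2022, 3, 5)))
    # Raw/CRM/Customer/2022/03/05/raw_crm_customer_2022_03_05.csv
    ```

    Partitioning by year/month/day also makes it cheap to reprocess or expire a single day's data later.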

    Enriched zone or data lake two

    The next layer can be thought of as a filtration zone that removes impurities but may also involve enrichment.
    Typical activities found in this layer are schema and data type definition, removal of unnecessary columns, and application of cleaning rules such as validation, standardization, and harmonization. Enrichment processes may also combine data sets to further improve the value of insights.

    Curated zone or data lake two

    The curated zone or data lake two is the consumption layer. It's optimized for analytics rather than data ingestion or data processing. It might store data in de-normalized data marts or star schemas.

    Data is taken from the enriched layer and transformed into high-value data products that are served to the consumers of the data, such as BI analysts and data scientists. This data has structure and can be served to consumers either as-is (for example, to data science notebooks) or through another read data store such as Azure SQL Database.
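
    For your specific question, "keep only the latest version of each record" in the curated zone can be sketched like this: fold the daily batches in date order and let later versions overwrite earlier ones per key. This is plain Python for illustration only; in practice the same logic would be a window function or a Delta Lake MERGE in Spark:

    ```python
    # Daily raw batches as (load_date, rows), in ascending date order.
    # Rows are (customer_id, name, amount), matching the question's sample data.
    daily_batches = [
        ("2022-03-05", [(1, "John Smith", 100), (2, "Mario Castillo", 200)]),
        ("2022-03-06", [(2, "Mario Castillo", 300), (3, "Mary Tyler", 500)]),
    ]

    def latest_versions(batches):
        """Return the last-seen version of each record, keyed by customer_id."""
        current = {}
        for load_date, rows in batches:
            for cid, name, amount in rows:
                current[cid] = (name, amount, load_date)  # later batches win
        return current

    for cid, (name, amount, d) in sorted(latest_versions(daily_batches).items()):
        print(cid, name, amount, d)
    ```

    If the warehouse needs history as well (e.g. slowly changing dimensions), you would instead append every version with its load date and let the warehouse pick the current row, rather than overwriting in the curated zone.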

    Workspace zone or data lake three

    Along with the data that's ingested by the data integration team from the source, the consumers of the data can also bring other useful datasets.
    In this scenario, the data platform should allocate a workspace for these consumers so they can use the curated data along with the other datasets they bring, to generate valuable insights. For example, if a data science team wants to determine the product placement strategy for a new region, they can bring other datasets such as customer demographics and usage data of similar products from that region. This high-value sales insights data can be used to analyze the product market fit and the offering strategy.

    For more details, refer to Provision three Azure Data Lake Storage Gen2 accounts for each data landing zone

    Hope this helps. Please let us know if you have any further queries.

