For the raw data layer, it is recommended to follow this structure:
adls-gen2-container/
├── raw/
│   ├── project-a/
│   ├── project-b/
│   ├── project-c/
│   ├── project-d/
│   └── project-e/
Each project folder (like project-a/) would contain the raw data files from the respective project.
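As a rough sketch, the raw paths above could be built in Python like this. It assumes the usual abfss://<container>@<account>.dfs.core.windows.net/<path> URI form; the storage account name `mystorageaccount` and the helper `layer_path` are made up for illustration.

```python
from posixpath import join

# Hypothetical names for illustration; replace with your own.
CONTAINER = "adls-gen2-container"
ACCOUNT = "mystorageaccount"


def layer_path(layer: str, *parts: str) -> str:
    """Build an abfss:// URI for a folder inside a layer (raw, stage, dw)."""
    root = f"abfss://{CONTAINER}@{ACCOUNT}.dfs.core.windows.net"
    # The trailing slash marks the result as a folder, as in the tree above.
    return join(root, layer, *parts) + "/"


raw_project_a = layer_path("raw", "project-a")
print(raw_project_a)
```

Keeping the path logic in one helper means the layer names only live in one place if you rename them later.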
The staging layer is where you'd store the data after some initial processing or cleansing, but before it's loaded into your data warehouse.
adls-gen2-container/
├── stage/
│   ├── table1/
│   ├── table2/
│   └── ...
Each table folder (like table1/) would contain delta files representing a table in your "Stage" database in Databricks.
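To tie a folder such as stage/table1/ to a table in the "Stage" database, one option is an external Delta table pointing at that path. The sketch below only builds the SQL string; the storage account name and the helper `create_table_sql` are assumptions for illustration, not something from your setup.

```python
def create_table_sql(database: str, table: str, location: str) -> str:
    """Return a Databricks SQL statement registering a Delta table at a path."""
    return (
        f"CREATE TABLE IF NOT EXISTS {database}.{table} "
        f"USING DELTA LOCATION '{location}'"
    )


# Hypothetical storage account name; the folder matches stage/table1/ above.
location = (
    "abfss://adls-gen2-container@mystorageaccount.dfs.core.windows.net"
    "/stage/table1/"
)
stmt = create_table_sql("stage", "table1", location)
print(stmt)
```

In a Databricks notebook the resulting string could then be run with spark.sql(stmt).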
Finally, the DWH layer is where you find more refined, clean and denormalized data that is ready for analytical querying.
adls-gen2-container/
├── dw/
│   ├── fact-table1/
│   ├── dim-table1/
│   └── ...
Similar to the staging layer, each folder here represents a table in your DWH.
Don't forget to follow these best practices:
When working with Delta Lake on ADLS Gen2, consider partitioning your data. This can improve query performance when queries filter on the partition column, because folders that don't match can be skipped. The folder structure would reflect these partitions. For instance, if you're partitioning by date, your folders might look like table1/date=2023-09-20/.
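To make the partition layout concrete, here is a small sketch that builds the folder name for one partition value. The helper `partition_folder` is hypothetical; it just mirrors the column=value naming shown above.

```python
from datetime import date


def partition_folder(table: str, column: str, value: date) -> str:
    """Return the column=value partition folder for one partition value."""
    return f"{table}/{column}={value.isoformat()}/"


folder = partition_folder("table1", "date", date(2023, 9, 20))
print(folder)  # table1/date=2023-09-20/

# With PySpark, writing a DataFrame with
#   df.write.format("delta").partitionBy("date").save(path)
# produces folders in this layout under the table path.
```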
With ADLS Gen2, you can manage fine-grained access control at the folder and file level. Ensure that permissions are appropriately set so that only authorized users and applications can access or modify the data.
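ADLS Gen2 access control lists are POSIX-style entries of the form [default:]type:id:permissions, joined with commas. The sketch below only formats such a string; the helper `acl_entry` and the all-zero object ID are placeholders, so treat it as an illustration rather than a ready-made policy.

```python
def acl_entry(kind: str, principal_id: str, perms: str, default: bool = False) -> str:
    """Format one POSIX-style ACL entry: [default:]type:id:permissions."""
    if kind not in {"user", "group", "other", "mask"}:
        raise ValueError(f"unknown ACL entry type: {kind}")
    # Each position must be its letter or a dash, e.g. 'r-x' or '---'.
    if len(perms) != 3 or any(c not in ok for c, ok in zip(perms, ("r-", "w-", "x-"))):
        raise ValueError(f"permissions must look like 'r-x', got {perms!r}")
    prefix = "default:" if default else ""
    return f"{prefix}{kind}:{principal_id}:{perms}"


# Placeholder object ID of a reader group; replace with a real one.
READERS = "00000000-0000-0000-0000-000000000000"

acl = ",".join([
    acl_entry("user", "", "rwx"),        # owning user
    acl_entry("group", "", "r-x"),       # owning group
    acl_entry("group", READERS, "r-x"),  # named group: read and list only
    acl_entry("mask", "", "r-x"),        # upper bound for named entries
    acl_entry("other", "", "---"),       # everyone else: no access
])
print(acl)
```

As far as I know, a string like this can be passed as the acl argument of set_access_control on a directory client from the azure-storage-file-datalake package, but check the SDK docs for your version before relying on it.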
As data grows, you might want to archive or delete old data. Consider setting up lifecycle management policies.
Try to establish a consistent naming convention for your folders and files to ensure clarity and avoid confusion.
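A naming convention is easier to keep if something checks it. Here is a minimal sketch that assumes the convention is lowercase letters and digits with single hyphens between words, as in project-a or fact-table1; the regex and the helper `is_valid_folder_name` are my own assumptions, so adjust them to whatever convention you settle on.

```python
import re

# Assumed convention: lowercase letters and digits, words joined by single hyphens.
NAME_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_folder_name(name: str) -> bool:
    """Return True if the whole name follows the assumed convention."""
    return NAME_RE.fullmatch(name) is not None


names = ["project-a", "fact-table1", "Dim_Table1", "table 2"]
invalid = [n for n in names if not is_valid_folder_name(n)]
print(invalid)  # ['Dim_Table1', 'table 2']
```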
And don't forget to document your folder structure, naming conventions, data sources, and any transformation logic.