What is the folder structure we create for databricks project in ADLS Gen2

Question

manish verma 516

Hi All,

We are not experts in setting up a folder structure for a Databricks project.

We receive data from several projects, for example projects a, b, c, d, and e. We will keep each project's data in a RAW folder, unchanged from the source.

Delta table location (Stage DB): we will load the RAW data into a Databricks database named “Stage”, so we need to specify the database location as well as the location of each Delta table in ADLS Gen2.

Delta table location (DW): what should the folder structure be for the DW database location and for the Delta table locations in ADLS Gen2?

2 answers

Answer 1

Amira Bedhiafi 33,071 Volunteer Moderator

For the raw data layer, I recommend a structure like this:

adls-gen2-container/
├── raw/
    ├── project-a/
    ├── project-b/
    ├── project-c/
    ├── project-d/
    ├── project-e/

Each project folder (like project-a/) would contain the raw data files from the respective project.

For the staging layer, this is where you'd store the data after some level of initial processing or cleansing, but before it's loaded into your data warehouse.

adls-gen2-container/
├── stage/
    ├── table1/
    ├── table2/
    ├── ...

Each table folder (like table1/) would contain delta files representing a table in your "Stage" database in Databricks.

Finally, the DWH layer is where you keep more refined, clean, and denormalized data that is ready for analytical querying.

adls-gen2-container/
├── dw/
    ├── fact-table1/
    ├── dim-table1/
    ├── ...

Similar to the staging layer, each folder here represents a table in your DWH.
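To tie the layout above back to the original question (database location and Delta table location), here is a minimal Python sketch that builds the `abfss://` locations for each layer and the matching SQL statements. The storage account name `mystorageaccount` and the table name `dim_table1` are placeholders I made up for illustration, and the SQL assumes a Hive-metastore-style setup where Delta files already exist at the table location; a Unity Catalog workspace would need a different approach.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lake:
    """One ADLS Gen2 container (the names used below are placeholders)."""
    container: str
    account: str

    def uri(self, *parts: str) -> str:
        # Format: abfss://<container>@<account>.dfs.core.windows.net/<path>
        path = "/".join(p.strip("/") for p in parts if p)
        return f"abfss://{self.container}@{self.account}.dfs.core.windows.net/{path}"


def create_database_sql(lake: Lake, layer: str) -> str:
    # The database location is the layer folder, e.g. .../stage
    return f"CREATE DATABASE IF NOT EXISTS {layer} LOCATION '{lake.uri(layer)}'"


def create_table_sql(lake: Lake, layer: str, table: str) -> str:
    # Each Delta table gets its own folder under the layer, e.g. .../stage/table1
    return (
        f"CREATE TABLE IF NOT EXISTS {layer}.{table} "
        f"USING DELTA LOCATION '{lake.uri(layer, table)}'"
    )


# Hypothetical account name; replace with your own.
lake = Lake(container="adls-gen2-container", account="mystorageaccount")
print(lake.uri("raw", "project-a"))
for layer, table in [("stage", "table1"), ("dw", "dim_table1")]:
    print(create_database_sql(lake, layer))
    print(create_table_sql(lake, layer, table))
```

I used underscores in the table name because a hyphenated name such as fact-table1 would need backtick quoting in SQL; the folder name itself can still contain hyphens.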

Don't forget to follow these best practices:

When working with Delta Lake on ADLS Gen2, consider partitioning your data. This can greatly improve query performance. The folder structure would reflect these partitions. For instance, if you're partitioning by date, your folders might look like table1/date=2023-09-20/.
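As a small illustration of the partition folders described above, this sketch builds the `column=value` folder name for a table. The helper `partition_path` is a name I made up; in practice Spark creates these folders for you when you write with something like `df.write.format("delta").partitionBy("date")`, so treat this only as a way to see what the layout looks like.

```python
from datetime import date


def partition_path(table_root: str, **partitions) -> str:
    """Return the folder for one partition, one 'column=value' level per column."""
    segments = [f"{column}={value}" for column, value in partitions.items()]
    return "/".join([table_root.rstrip("/"), *segments])


print(partition_path("stage/table1", date=date(2023, 9, 20)))
# stage/table1/date=2023-09-20
```

Partitioning by more than one column simply nests the folders in the order the columns are given.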

With ADLS Gen2, you can manage fine-grained access control at the folder and file level. Ensure that permissions are appropriately set so that only authorized users and applications can access or modify the data.
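To make the access control point a bit more concrete, here is a hedged sketch that builds ACL entry strings in the `type:id:permissions` form that ADLS Gen2 uses, for example to give a readers group read and execute access on the dw/ folder. The helper names and the all-zero object ID are placeholders I invented; the sketch only builds the strings and does not call any Azure API.

```python
import re

# Three characters: read, write, execute, each either set or '-'.
_PERMS = re.compile(r"^[r-][w-][x-]$")


def acl_entry(kind: str, principal_id: str, perms: str, default: bool = False) -> str:
    """Build one ACL entry such as 'group:<object-id>:r-x'."""
    if kind not in ("user", "group"):
        raise ValueError(f"unsupported entry type: {kind!r}")
    if not _PERMS.match(perms):
        raise ValueError(f"permissions must look like 'r-x', got {perms!r}")
    # A 'default:' entry on a folder is inherited by new children created under it.
    prefix = "default:" if default else ""
    return f"{prefix}{kind}:{principal_id}:{perms}"


def acl_spec(*entries: str) -> str:
    """Join entries into the comma-separated form used for a full ACL."""
    return ",".join(entries)


# Hypothetical Microsoft Entra group object ID; replace with a real one.
READERS = "00000000-0000-0000-0000-000000000000"
print(acl_spec(
    acl_entry("group", READERS, "r-x"),
    acl_entry("group", READERS, "r-x", default=True),
))
```

Keep in mind that a principal also needs execute permission on every parent folder in order to reach dw/, so the ACLs on the folders above it matter too.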

As data grows, you might want to archive or delete old data. Consider setting up lifecycle management policies.

Try to establish a consistent naming convention for your folders and files to ensure clarity and avoid confusion.
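For the lifecycle management point above, this is a rough sketch of what a policy document for the storage account could look like, built as a Python dictionary and printed as JSON. The rule name, the 30 and 365 day thresholds, and the choice of raw/project-a are assumptions for illustration only. I scoped it to the raw folder on purpose: deleting files underneath a Delta table folder by age can break the table, so please check the current Azure documentation before applying anything like this.

```python
import json


def raw_archive_rule(container: str, project: str,
                     cool_after_days: int, delete_after_days: int) -> dict:
    """One lifecycle rule limited to a single raw project folder."""
    return {
        "enabled": True,
        "name": f"age-out-raw-{project}",  # hypothetical rule name
        "type": "Lifecycle",
        "definition": {
            "filters": {
                "blobTypes": ["blockBlob"],
                # The prefix starts with the container name.
                "prefixMatch": [f"{container}/raw/{project}/"],
            },
            "actions": {
                "baseBlob": {
                    "tierToCool": {"daysAfterModificationGreaterThan": cool_after_days},
                    "delete": {"daysAfterModificationGreaterThan": delete_after_days},
                }
            },
        },
    }


policy = {"rules": [raw_archive_rule("adls-gen2-container", "project-a", 30, 365)]}
print(json.dumps(policy, indent=2))
```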

And don't forget to document your folder structure, naming conventions, data sources, and any transformation logic.

manish verma 516 Reputation points

2023-09-20T14:50:04.7733333+00:00

Hi, thanks a lot for your time and effort, but I am confused: the Microsoft reference site below does not present such a simplified structure. I hope you will see what I mean when you look at this link.

https://learn.microsoft.com/en-us/azure/cloud-adoption-framework/scenarios/cloud-scale-analytics/best-practices/data-lake-zones
Amira Bedhiafi 33,071 Reputation points Volunteer Moderator

2023-09-20T15:02:07.4033333+00:00

What exactly is your confusion?
manish verma 516 Reputation points

2023-09-20T16:37:30.9733333+00:00

We need to follow best practices so that we can avoid rework in the future.
Amira Bedhiafi 33,071 Reputation points Volunteer Moderator

2023-09-20T19:13:00.02+00:00

Can you please be more precise?
manish verma 516 Reputation points

2023-09-21T08:24:44.6866667+00:00

Hi, I have gone through the link below: https://www.mssqltips.com/sqlservertip/6807/design-azure-data-lake-store-gen2/

We can see some more examples of folder names and arrangements there. I request that someone from the Microsoft ADLS team comment on this.
Amira Bedhiafi 33,071 Reputation points Volunteer Moderator

2023-09-21T08:37:28.63+00:00

Yes, I totally agree, but you need to understand that you can adapt it to your needs.

My answer was based on the details you provided. :)
manish verma 516 Reputation points

2023-09-21T10:34:33.0966667+00:00

Hi, we need to understand what Microsoft recommends; if it is not clear, we will ask. If we were experts in designing a data lake on ADLS Gen2 with Azure Databricks, we would not be asking this question. Thanks a lot for your time and effort.
manish verma 516 Reputation points

2023-09-23T12:27:13.8533333+00:00

Please close this question, as I didn't get an expert answer.

Answer 2

manish verma 516

I got the answer from https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-best-practices

Thanks

Bhargava-MSFT 31,261 Reputation points Microsoft Employee Moderator

2023-10-09T17:05:15.22+00:00

Hello manish verma,

I am glad to know that you were able to find the answer from the Microsoft document and thanks for sharing the link as it helps the other community members looking for answers to similar questions.
