What folder structure should we create for a Databricks project in ADLS Gen2?

manish verma 516 Reputation points
2023-09-20T04:59:41.61+00:00

Hi All,

We are not experts at setting up a folder structure for a Databricks project.

We have several projects. For example, we receive data from projects a, b, c, d, and e, and we will keep this data in a RAW folder exactly as it arrives from the source.

Delta Table Location (Stage DB): We will load the RAW data into a Databricks database named "Stage", so we need to specify the database location as well as the location of each Delta table in ADLS Gen2.

Delta Table Location (DW): What should the folder structure be for the DW database location, and for the location of each Delta table, in ADLS Gen2?

Azure Data Lake Storage
Azure Databricks

2 answers

  1. Amira Bedhiafi 33,071 Reputation points Volunteer Moderator
    2023-09-20T14:36:51.22+00:00

    For the raw data layer, the recommended structure is:

       adls-gen2-container/
       ├── raw/
           ├── project-a/
           ├── project-b/
           ├── project-c/
           ├── project-d/
           ├── project-e/
    

    Each project folder (like project-a/) would contain the raw data files from the respective project.
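
    As a minimal sketch of how these raw folders map to full ADLS Gen2 paths, the snippet below builds one `abfss://` URI per project. The storage account name `mystorageaccount` and the helper `abfss_path` are hypothetical and only for illustration; the container and folder names are taken from the tree above.

```python
# Hypothetical names: replace with your own storage account and container.
STORAGE_ACCOUNT = "mystorageaccount"
CONTAINER = "adls-gen2-container"


def abfss_path(*parts: str) -> str:
    """Build an abfss:// URI for a folder inside the container."""
    base = f"abfss://{CONTAINER}@{STORAGE_ACCOUNT}.dfs.core.windows.net"
    # Drop empty segments and stray slashes so the parts join cleanly.
    clean = [p.strip("/") for p in parts if p.strip("/")]
    return "/".join([base, *clean]) + "/"


projects = ["project-a", "project-b", "project-c", "project-d", "project-e"]

# One raw folder per source project, mirroring the tree above.
raw_paths = {project: abfss_path("raw", project) for project in projects}

for project, path in raw_paths.items():
    print(project, "->", path)
```

    Keeping the path logic in one helper means the container or account name only has to change in one place.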

    For the staging layer, this is where you'd store the data after some level of initial processing or cleansing, but before it's loaded into your data warehouse.

       adls-gen2-container/
       ├── stage/
           ├── table1/
           ├── table2/
           ├── ...
    

    Each table folder (like table1/) would contain delta files representing a table in your "Stage" database in Databricks.

    Finally, the DWH layer is where you keep the more refined, clean, and denormalized data that is ready for analytical querying.

    adls-gen2-container/
    ├── dw/
        ├── fact-table1/
        ├── dim-table1/
        ├── ...
    

    Similar to the staging layer, each folder here represents a table in your DWH.
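
    To connect this layout to the database and table locations asked about in the question, here is a rough sketch that generates the SQL for both layers. It assumes a Hive-metastore style setup where `CREATE DATABASE ... LOCATION` and `CREATE TABLE ... USING DELTA LOCATION` point at external paths; the storage account name, the helper functions, and the column definitions are hypothetical. Table names use underscores because hyphens are not valid in unquoted SQL identifiers, while the folder names keep the hyphens shown above.

```python
# Hypothetical base path: replace the account and container with your own.
BASE = "abfss://adls-gen2-container@mystorageaccount.dfs.core.windows.net"


def create_database_sql(db: str, folder: str) -> str:
    """SQL that places a database at <BASE>/<folder>/."""
    return f"CREATE DATABASE IF NOT EXISTS {db} LOCATION '{BASE}/{folder}/'"


def create_table_sql(db: str, table: str, db_folder: str,
                     table_folder: str, columns: str) -> str:
    """SQL that places a Delta table at <BASE>/<db_folder>/<table_folder>/."""
    location = f"{BASE}/{db_folder}/{table_folder}/"
    return (
        f"CREATE TABLE IF NOT EXISTS {db}.{table} ({columns}) "
        f"USING DELTA LOCATION '{location}'"
    )


statements = [
    create_database_sql("stage", "stage"),
    create_table_sql("stage", "table1", "stage", "table1", "id INT, name STRING"),
    create_database_sql("dw", "dw"),
    create_table_sql("dw", "fact_table1", "dw", "fact-table1", "id INT, amount DOUBLE"),
    create_table_sql("dw", "dim_table1", "dw", "dim-table1", "id INT, label STRING"),
]

for stmt in statements:
    print(stmt)  # In a Databricks notebook you would run spark.sql(stmt) instead.
```

    With this arrangement, each table folder sits directly under its database folder, so the ADLS Gen2 layout mirrors the database and table hierarchy.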

    Also consider these best practices:

    When working with Delta Lake on ADLS Gen2, consider partitioning large tables by a column that queries commonly filter on, such as a date. This can improve query performance, because a query that filters on the partition column only has to read the matching folders. The folder structure reflects the partitions: if you partition by date, the folders look like table1/date=2023-09-20/.
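
    To illustrate the partition folder naming, the small sketch below builds the `column=value` folder name for a given date. The helper `partition_folder` is hypothetical; Spark creates these folders itself when a table is written with a partition column, so this only shows what to expect in the storage browser.

```python
from datetime import date


def partition_folder(table: str, column: str, value) -> str:
    """Return the folder that holds one partition, in column=value form."""
    return f"{table}/{column}={value}/"


folder = partition_folder("table1", "date", date(2023, 9, 20))
print(folder)  # table1/date=2023-09-20/

# In PySpark, a partitioned Delta write would look roughly like this
# (df and target_path are placeholders, not defined here):
#   df.write.format("delta").partitionBy("date").save(target_path)
```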

    With ADLS Gen2, you can manage fine-grained access control at the folder and file level. Ensure that permissions are appropriately set so that only authorized users and applications can access or modify the data.

    As data grows, you might want to archive or delete old data. Consider setting up lifecycle management policies.

    Try to establish a consistent naming convention for your folders and files to ensure clarity and avoid confusion.
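
    For the lifecycle management suggestion, a policy on an Azure storage account is a JSON document of rules. The sketch below builds one such rule in Python and scopes it to the raw folder only. The rule name and the day thresholds are assumptions chosen for illustration. Limiting the rule to `raw/` is a deliberate choice here: files inside Delta table folders are tracked by the Delta transaction log, so cleaning those up is usually better left to Delta itself (for example `VACUUM`) than to storage-level deletion.

```python
import json

# Prefix is "<container>/<folder>"; container name taken from the tree above.
RAW_PREFIX = "adls-gen2-container/raw/"

lifecycle_policy = {
    "rules": [
        {
            "enabled": True,
            "name": "age-out-old-raw-files",  # hypothetical rule name
            "type": "Lifecycle",
            "definition": {
                "filters": {
                    "blobTypes": ["blockBlob"],
                    "prefixMatch": [RAW_PREFIX],
                },
                "actions": {
                    "baseBlob": {
                        # Thresholds are illustrative assumptions.
                        "tierToCool": {"daysAfterModificationGreaterThan": 90},
                        "delete": {"daysAfterModificationGreaterThan": 730},
                    }
                },
            },
        }
    ]
}

policy_json = json.dumps(lifecycle_policy, indent=2)
print(policy_json)
```

    The resulting JSON can then be applied to the storage account through the Azure portal or your usual deployment tooling.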

    And don't forget to document your folder structure, naming conventions, data sources, and any transformation logic.


  2. manish verma 516 Reputation points
    2023-10-07T17:12:21.9466667+00:00
