How to model thousands of files from Azure Data Lake Gen 2 into a single dataset for analysis?

Ramanathan Dakshina Murthy 21 Reputation points
2021-09-21T21:30:45.83+00:00

Hi,

I have an initial set of thousands of delimited files in an Azure Data Lake Gen 2 storage account. I need to read all of these files and combine them into a single dataset for analysis. This dataset must be preserved so future files can be added to it. After the initial files are processed, only a few new files will arrive each day; these need to be read and their data appended to the existing dataset. Business users will use this single modeled dataset for analysis.

Currently it takes a lot of time to query all these files. We want a faster and more cost-effective approach.

Tags: Azure Data Lake Storage, Azure Synapse Analytics, Azure HDInsight, Azure Databricks, Azure Data Factory

1 answer

  1. Thomas Boersma 806 Reputation points
    2021-09-22T08:40:45.313+00:00

    Hi @Ramanathan Dakshina Murthy ,

    A possible solution is to create a staging folder inside your Azure Data Lake Gen2 (ADLS Gen2) account. Place the delimited files that need to be processed / transformed in this staging folder. Processing can be done with a Data Flow in Azure Synapse or Azure Data Factory; if the transformations are too complex, you could use a Databricks notebook or the integrated Synapse notebooks instead. The source of the process should import all the delimited files, for example with a wildcard such as staging_folder/*.csv, as in the sketch below.
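    A minimal PySpark sketch of that wildcard import, assuming a Synapse or Databricks notebook; the storage account, container, and folder names are placeholders, not values from the question:

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Placeholder ADLS Gen2 path; every .csv file in the staging folder is picked up.
    staging_path = "abfss://data@<storage-account>.dfs.core.windows.net/staging_folder/*.csv"

    # Read all delimited files in the staging folder into a single DataFrame.
    raw_df = (
        spark.read
        .option("header", "true")      # adjust to the actual file layout
        .option("delimiter", ",")
        .csv(staging_path)
    )
    ```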
    Once the data is loaded into the data flow or data frame, you can process it. After processing, store the result as a Delta Lake table in your ADLS Gen2 account. The delimited files should be deleted from the staging folder at that point, so they are not processed a second time. Finally, create a Synapse database view (using the Synapse serverless pool) on top of the Delta Lake table in ADLS Gen2 for retrieving the data. I added an image that shows the steps I described at a high level.
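    As a rough sketch of the Delta Lake write, assuming the processing above produced a DataFrame named processed_df and that the curated path is again a placeholder:

    ```python
    # Placeholder path for the curated Delta Lake table in ADLS Gen2.
    delta_path = "abfss://data@<storage-account>.dfs.core.windows.net/curated/single_dataset"

    (
        processed_df.write
        .format("delta")
        .mode("append")   # the first run creates the table, later runs append the daily files
        .save(delta_path)
    )
    ```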
    A benefit of Delta Lake is that you can partition your data, which makes queries more performant and cost-effective if the partitions are chosen well. It also stores the files as Parquet, which is far more compact than CSV. This works for hundreds of millions of rows, but if your dataset grows beyond a few billion rows I would recommend replacing the Delta Lake table with a Synapse dedicated SQL pool. A dedicated pool is much more expensive than the serverless pool, but also a lot faster.
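    A hypothetical illustration of such a partitioned write, assuming a load_date column is added at load time (the column name is an assumption, not something from the original question):

    ```python
    from pyspark.sql import functions as F

    # Tag each batch with its load date so the table can be partitioned by it.
    partitioned_df = processed_df.withColumn("load_date", F.current_date())

    (
        partitioned_df.write
        .format("delta")
        .mode("append")
        .partitionBy("load_date")   # queries filtering on load_date scan only matching folders
        .save(delta_path)
    )
    ```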

    Hope this helps.

    [134209-image.png: diagram of the staging, processing, and Delta Lake steps described above]

    1 person found this answer helpful.