Read only specific partition in Synapse Data Flow from a delta source

Mathias Opland 150 Reputation points
2024-01-04T13:58:23.1433333+00:00

Hi,

I have a large Delta file in an Azure Gen 2 Storage Account that is partitioned on the date. I want to preform an aggregate job on data for the current date in an Azure Data Flow, however I can not find where to specify which partition to query in the source component. As a result, the source flow is my longest running activity, as I have to read 100x the amount of data that I want and filter out the relevant data in the next activity. How can I avoid the source reading the whole delta file, and only the relevant partition(s)? This will reduce the cost and time of running the Data Flow.

Azure Synapse Analytics
Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.
5,176 questions
{count} votes

1 answer

Sort by: Most helpful
  1. Amira Bedhiafi 28,381 Reputation points
    2024-01-04T19:58:37.2766667+00:00

    In your Synapse Data Flow, start by setting up your source to connect to the Azure Gen 2 Storage Account where your Delta file is stored. You can use parameters in your Data Flow to dynamically filter data so create a parameter, like CurrentDate, to hold the value of the date you want to process.

    When configuring your source in the Data Flow, you can use a SQL-like query to select data from the partition that matches your CurrentDate parameter, an example :

    SELECT * FROM your_delta_table WHERE date_column = @CurrentDate

    Azure Synapse has native support for Delta Lake, which should allow you to efficiently query specific partitions.

    As a backup, you can add a Filter transformation after the source to further ensure only data from the desired partition is used. However, I can say it is more efficient if done at the source query level.

    Don't forget to think about setting up the incremental load.

    0 comments No comments

Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.