Read only specific partition in Synapse Data Flow from a delta source

Question

Mathias Opland 155

Hi,

I have a large Delta file in an Azure Gen 2 Storage Account that is partitioned by date. I want to perform an aggregation on the data for the current date in an Azure Data Flow, but I cannot find where to specify which partition to query in the source component. As a result, the source is my longest-running activity: it reads 100x the amount of data I need, and I can only filter out the relevant data in the next activity. How can I make the source read only the relevant partition(s) instead of the whole Delta file? This would reduce the cost and run time of the Data Flow.

Mathias Opland 155 Reputation points

2024-01-05T08:35:12.33+00:00

Hi Amira,

Thanks for your response! What you are explaining is exactly what I'm looking for; however, I cannot find where to use the SQL-like query in the source activity. I'm adding some pictures of the source settings and options.

When you talk about incremental load, do you mean something like change data capture? As I understand it, this is not supported with Delta files at the moment, which is why I've constructed something similar with partitions. However, it's not as good as CDC.
Amira Bedhiafi 41,121 Reputation points Volunteer Moderator

2024-01-05T21:53:38.07+00:00

https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-use-sql

https://learn.microsoft.com/en-us/azure/data-factory/how-to-change-data-capture-resource
Smaran Thoomu 32,535 Reputation points Microsoft External Staff Moderator

2024-01-08T13:35:20.97+00:00

@Mathias Opland Just checking in to see if the below answer provided by @Amira Bedhiafi helped.

If this answers your query, please click Accept Answer and Yes for "Was this answer helpful". If you have any further queries, do let us know.
Mathias Opland 155 Reputation points

2024-01-08T14:15:50.0033333+00:00

Hi, I've already checked, and Change Data Capture (CDC) is not available for delta files according to Microsoft (https://learn.microsoft.com/en-us/answers/questions/1414384/change-data-capture-in-synapse).

Regarding Synapse SQL, I could not find any information in the link you provided saying that it is available in an Azure Data Flow, which is where I want to query a specific partition.

Regards,
Mathias
Mathias Opland 155 Reputation points

2024-01-08T14:17:32.5+00:00

Unfortunately, I've not been able to find an answer to my question yet.
Mathias Opland 155 Reputation points

2024-01-08T14:18:32.34+00:00

Unfortunately, I've not been able to find an answer to my question yet.
Smaran Thoomu 32,535 Reputation points Microsoft External Staff Moderator

2024-01-09T09:33:31.16+00:00
Hi @Mathias Opland

Thank you for reaching out to the Azure community forum with your query. To read only a specific partition in Synapse Data Flow from a delta source, you can use the "partitionBy" option in the source settings of the data flow. This option allows you to specify the partition column and the partition value to read only the relevant partition(s) from the delta file.

Here's an example of how to use the "partitionBy" option in the source settings:

1. Open your Synapse Data Flow and select the source component that reads from the delta file.

2. In the source settings, scroll down to the "partitionBy" option and click on the "+" button to add a new partition.

3. In the "Column" field, enter the name of the partition column in your delta file (e.g., "date").

4. In the "Value" field, enter the partition value for the current date (e.g., "2022-01-01").

5. Save the changes and run the data flow.

This will ensure that only the relevant partition(s) are read from the delta file, reducing the cost and time of running the data flow.
Mathias Opland 155 Reputation points

2024-01-09T09:51:04.91+00:00

Hi @Smaran Thoomu

Thanks for your reply. I may have misunderstood, but I cannot for the life of me find the "partitionBy" option when using delta files. I will provide pictures of what I see in the delta source activity in Synapse Data Flow.

Regards,
Mathias
Smaran Thoomu 32,535 Reputation points Microsoft External Staff Moderator

2024-01-12T02:31:43.25+00:00

@Mathias Opland For a deeper investigation and immediate assistance with this issue, you may file a support ticket if you have a support plan. If you don't have a support plan, I will enable a one-time free support request for your subscription. Please let us know.

Answer 1

In your Synapse Data Flow, start by setting up the source to connect to the Azure Gen 2 Storage Account where your Delta file is stored. You can use Data Flow parameters to filter data dynamically, so create a parameter, such as CurrentDate, to hold the date you want to process.

When configuring the source in the Data Flow, you can use a SQL-like query to select data from the partition that matches your CurrentDate parameter, for example:

SELECT * FROM your_delta_table WHERE date_column = @CurrentDate

Azure Synapse has native support for Delta Lake, which should allow you to efficiently query specific partitions.

As a backup, you can add a Filter transformation after the source to make sure that only data from the desired partition is used. However, filtering at the source query level is more efficient, because less data has to be read in the first place.

Don't forget to think about setting up the incremental load.
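
To illustrate why filtering at the source matters, here is a minimal Python sketch of partition pruning. It is not Data Flow code and does not use any Synapse or Delta Lake API. It only models the idea that a table partitioned by date consists of files tagged with a partition value (the Delta transaction log records such values per file), so a reader that knows the wanted date can skip every other file without opening it. The `DataFile` class, the `prune_by_partition` helper, the file names, and the sizes are all hypothetical and chosen only to mirror the 100x figure from the question. Whether the Delta source in a Data Flow applies this kind of pruning for a given setup is not confirmed in this thread.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class DataFile:
    """One data file of a partitioned table (hypothetical, simplified model)."""
    path: str
    partition_values: dict  # e.g. {"date": "2024-01-05"}
    size_bytes: int


def prune_by_partition(files, column, value):
    """Keep only files whose partition value matches; no file is opened."""
    return [f for f in files if f.partition_values.get(column) == value]


def total_bytes(files):
    """Sum of file sizes, i.e. how much a reader would have to scan."""
    return sum(f.size_bytes for f in files)


# Hypothetical table: 100 daily partitions with one 1 MiB file each.
start = date(2024, 1, 1)
days = [(start + timedelta(days=i)).isoformat() for i in range(100)]
files = [
    DataFile(
        path=f"date={day}/part-0000.parquet",  # assumed Hive-style layout
        partition_values={"date": day},
        size_bytes=1024 * 1024,
    )
    for day in days
]

current_date = "2024-01-05"  # stands in for the CurrentDate parameter
selected = prune_by_partition(files, "date", current_date)

print(f"full scan:   {total_bytes(files)} bytes in {len(files)} files")
print(f"pruned scan: {total_bytes(selected)} bytes in {len(selected)} file(s)")
```

In this toy setup the full scan touches 100 files and the pruned scan touches one, which is the same ratio as the 100x over-read described in the question. A Filter transformation placed after a source that has already read everything removes the unwanted rows but not the cost of reading them.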
