Dynamic Partition Pruning support in ADF/Synapse mapping data flows

Question

We have built a proof-of-concept mapping data flow to determine whether Dynamic Partition Pruning is supported in Synapse mapping data flows. Our tests indicate that it is not supported for our use case, but we would be grateful for confirmation.

We are using two sources, both parquet files in Azure Data Lake Storage, both expressed in the mapping flow source as a folder containing sub-folders as partitions, with "Key=Value" naming. We have used the source options to set the partition root path, so that the partitions can be leveraged as columns. We have then joined both sources in a join operation, including the partition value as part of the join condition, as well as a guid column contained within the files.

The left-side source includes only 1 partition, the right-side source includes 3 partitions, only one of which matches the partition value available on the left-side.

When running this flow, all three partitions on the right-side appear to be read into the mapping flow, indicating that no dynamic pruning of the partitions read has taken place.

Is there another way to leverage dynamic partition pruning within mapping data flows?


Accepted Answer

Hi Bill Wood,

Thanks for reaching out to Microsoft Q&A.

Currently, Synapse and ADF mapping data flows do not explicitly support dynamic partition pruning in the same way that Spark or some other database engines do. Without this feature, a join can still scan every partition of a source, even when the partition column is part of the join condition.

Recommendations for Leveraging Partitions

Review Join Conditions: Ensure that the join conditions are correctly defined and that the partition column is included in the join. This is crucial for DPP (Dynamic Partition Pruning) to potentially take effect.
Optimize Partitioning Settings: Use the Optimize tab in your mapping data flow to configure the partitioning scheme appropriately. You can experiment with different partitioning strategies like Hash or Key partitioning to see if that affects how partitions are read.
Data Flow Performance Tuning: Consult the performance tuning guide for mapping data flows. It provides insights into how to manage partitioning and optimize data flow performance, which may help in your scenario.
Testing with Different Configurations: If possible, create a simplified version of your data flow to isolate the issue. Test with different partitioning configurations to see if any adjustments lead to the expected DPP behavior.
Partitioned Views: If feasible, create partitioned views or separate datasets for each partition that can be selected dynamically based on the join condition. This would involve some pre-processing to determine the appropriate partition before the data flow execution.
Data Flow Expressions: Experiment with data flow expressions to apply more granular filtering within the flow, though this might not prevent the full read of partitions at the source.
Alternative Approaches: If DPP is critical for your use case and not functioning as expected, consider using Synapse SQL pools or other querying methods that may better support partition pruning.

Unfortunately, without built-in support for DPP, these workarounds involve extra processing or logic outside the mapping data flow itself. If partition pruning is critical to your use case, you might want to consider other ETL tools or query engines that offer more robust support for this feature.
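
As a rough illustration of the kind of pre-processing mentioned above, here is a minimal Python sketch that works out which right-side partition folders match the left side before the data flow runs. It is not part of ADF or Synapse. The partition key "Region", the folder listings and the root path "data/right" are all hypothetical, and how the folder names are listed from the lake is left out.

```python
from pathlib import PurePosixPath


def parse_partition(folder_name, key):
    """Return the value of a 'Key=Value' folder name, or None if the key does not match."""
    name, sep, value = folder_name.partition("=")
    if sep and name == key:
        return value
    return None


def prune_partitions(left_folders, right_folders, key):
    """Keep only the right-side folders whose partition value also exists on the left."""
    left_values = {parse_partition(f, key) for f in left_folders}
    left_values.discard(None)  # ignore folders that are not 'Key=Value' partitions
    return [f for f in right_folders if parse_partition(f, key) in left_values]


def build_paths(root, folders):
    """Join a root folder with each surviving partition folder."""
    return [str(PurePosixPath(root) / f) for f in folders]


# Hypothetical listings that mirror the question: 1 partition on the left, 3 on the right.
left = ["Region=EU"]
right = ["Region=EU", "Region=US", "Region=APAC"]

matching = prune_partitions(left, right, "Region")
paths = build_paths("data/right", matching)
print(paths)  # ['data/right/Region=EU']
```

The resulting list could then be handed to the data flow, for example through a parameter, so that the right-side source only points at the matching folder. Treat that wiring as an assumption to verify in your own environment.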

