Delta to Parquet mapping data flow resulting in one empty partition of 2

Erp, Wessel van 21 Reputation points
2024-05-31T13:29:25.1766667+00:00

Hi,

I've been working on something, but I can't get it to work. I seem to have found the issue, but I'm not able to fix it. I've isolated it in a single run, which I'll explain below:

I have a Delta table on ADLSv2 (based on one partition) that I want to convert to partitioned Parquet. I run this, but somehow I get one partition that has data and one that has none. I don't use "Set partitioning", and it seems ADF splits the data in two. The source Delta table has only one Parquet file in it.

Why is this happening?

For reference the pipeline run id: f5d4fede-285f-4e23-ac8c-454ff6c479fb

Hopefully someone can explain what's going wrong!

Thank you in advance.

Wessel

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

Accepted answer
  1. Smaran Thoomu 11,370 Reputation points Microsoft Vendor
    2024-06-05T09:31:21.4066667+00:00

    Hi @Erp, Wessel van

    I'm glad that you were able to resolve your issue, and thank you for posting your solution so that others experiencing the same thing can easily reference it! Since the Microsoft Q&A community has a policy that "The question author cannot accept their own answer. They can only accept answers by others", I'll repost your solution in case you'd like to accept the answer.

    Issue: I've been working on something, but I can't get it to work. I seem to have found the issue, but I'm not able to fix it. I've isolated it in a single run, which I'll explain below:

    I have a Delta table on ADLSv2 (based on one partition) that I want to convert to partitioned Parquet. I run this, but somehow I get one partition that has data and one that has none. I don't use "Set partitioning", and it seems ADF splits the data in two. The source Delta table has only one Parquet file in it.

    Why is this happening?

    For reference the pipeline run id: f5d4fede-285f-4e23-ac8c-454ff6c479fb

    Hopefully someone can explain what's going wrong!

    Solution: This question can be closed. I fixed it myself by adding a Select transformation between the source and the sink. A Select gives you more options for setting partitions.

    Not the most ideal solution and I certainly see the cause as one of the many bugs in mapping data flows, but it is what it is.

    If I missed anything, please let me know and I'd be happy to add it to my answer, or feel free to comment below with any additional information.

    Hope this helps. Do let us know if you have any further queries.


    If this answers your query, please click Accept Answer and Yes for "Was this answer helpful". If you have any further queries, do let us know.


1 additional answer

  1. Amira Bedhiafi 17,791 Reputation points
    2024-06-01T09:18:52.72+00:00

    By default, ADF might create partitions even if you haven't specified any partitioning scheme. This can lead to scenarios where the data is unevenly distributed, resulting in some partitions being empty.

    You can explicitly specify the partitioning scheme in your ADF copy activity.

    "partitionOptions": {
        "partitionOption": "None"
    }
    

    The data might inherently be skewed, causing one partition to have all the data and the other to be empty.
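To illustrate how skew can leave a partition empty, here is a minimal toy sketch in Python. It is not how ADF distributes rows internally; the `hash_partition` helper, the `region` column, and the CRC32 hash are all assumptions made up for the example. It only shows the general effect: if every row has the same key value and the rows are hashed into two buckets, one bucket gets everything and the other stays empty.

```python
import zlib


def hash_partition(rows, key, num_partitions):
    """Toy model: assign each row to a bucket by hashing one column."""
    buckets = [[] for _ in range(num_partitions)]
    for row in rows:
        # crc32 is stable across runs, unlike Python's built-in hash() for str
        idx = zlib.crc32(str(row[key]).encode("utf-8")) % num_partitions
        buckets[idx].append(row)
    return buckets


# Hypothetical data: every row carries the same key value,
# similar to a source table that has only a single partition.
rows = [{"region": "EU", "amount": i} for i in range(5)]

buckets = hash_partition(rows, "region", 2)
sizes = [len(b) for b in buckets]

# One bucket holds all 5 rows, the other holds none.
print(sizes)
```

If something like this is what happens in the data flow, forcing a single partition before the sink (as the Select workaround in the accepted answer does) would avoid writing the empty file.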

    ADF might be using parallel copy, which can sometimes lead to uneven data distribution.

    In a copy activity, the degree of parallelism is controlled by the parallelCopies property. Setting it to 1 makes the copy run without parallel threads, which can help avoid the creation of empty partitions:

    "parallelCopies": 1