Failures on partitioning data

Pravalika-randstad 240 Reputation points
2023-10-25T19:49:33.0266667+00:00

I have a data flow that generates a 10GB CSV file and saves it to a blob. Currently, I’ve set the sink to “Single partition,” but this approach is causing long execution times and occasional failures. If I switch to the “Default Partitioning” option, it doesn’t allow me to customize the file names as “part1.csv,” “part2.csv,” etc., which is a requirement for my scenario. Additionally, there is no column suitable for key-based partitioning. Can you provide guidance on how to achieve custom file naming and partitioning without using the default options? Your assistance is greatly appreciated.

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

Answer accepted by question author
  1. Anonymous
    2023-10-26T05:57:14.6033333+00:00

    @Pravalika-randstad

    Thanks for using Microsoft Q&A

    • Create a new Data Flow

    You are going to create a very simple Data Flow just to leverage file partitioning. There will not be any column or row transformations; just a Source and a Sink that take a large file and produce smaller part files.


    • Add a Source file
    • Add a Sink folder

    For the Sink dataset, choose the type of output files you would like to produce. 
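
    Every data flow built in the designer is backed by a data flow script, which you can view with the Script button at the top right of the canvas. A minimal sketch of this source-to-sink flow, with placeholder transformation names, looks roughly like this:

    ```
    source(allowSchemaDrift: true,
        validateSchema: false) ~> LargeCsvSource
    LargeCsvSource sink(allowSchemaDrift: true,
        validateSchema: false) ~> PartFilesSink
    ```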


    • In the Optimize tab of the Sink transformation, select the "Set Partitioning" radio button. You will be presented with a series of options to define the partitioning.


    • This is where you define how you would like the partitioned files to be generated. For equal distribution across the output files, use the Round robin partition type.
    • Set the output file names using the "Pattern" file name option in the sink settings, e.g. part[n].csv, where [n] is replaced with the partition number to produce "part1.csv", "part2.csv", etc. (the script sketch after this list shows these sink settings together)


    • Notice I’ve also set “Clear the folder”. This will ask ADF to wipe the contents of the destination folder clean before loading new part files.
    • Save your data flow and create a new pipeline.
    • Add an Execute Data Flow activity and select your new file split data flow.
    • Execute the pipeline using the pipeline debug button.
    • You must execute data flows from a pipeline in order to generate file output. Debugging from Data Flow does not write any data.
    • After execution, you should now see the files that resulted from round robin partitioning of your large source file. You're done.
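
    Tying the settings above together: the round robin partitioning, the part[n].csv name pattern, and "Clear the folder" all surface as sink properties in the data flow script. The sketch below shows roughly how, reusing the placeholder transformation names from earlier; the partition count of 20 is just an example, and you should compare property names against the script generated for your own flow (truncate corresponds to "Clear the folder"):

    ```
    LargeCsvSource sink(allowSchemaDrift: true,
        validateSchema: false,
        filePattern: 'part[n].csv',
        truncate: true,
        partitionBy('roundRobin', 20)) ~> PartFilesSink
    ```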

    In the output of your pipeline debug run, you'll see the execution results of the data flow activity. Click on the eyeglasses icon to show the details of your data flow execution. You'll see the statistics of the distribution of records across your partitioned files.
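
    For completeness, the pipeline wrapping the data flow is just an Execute Data Flow activity. A minimal JSON sketch, with placeholder names and an example compute size, might look like this:

    ```json
    {
        "name": "SplitLargeCsvPipeline",
        "properties": {
            "activities": [
                {
                    "name": "Split large CSV",
                    "type": "ExecuteDataFlow",
                    "typeProperties": {
                        "dataFlow": {
                            "referenceName": "FileSplitDataFlow",
                            "type": "DataFlowReference"
                        },
                        "compute": {
                            "computeType": "General",
                            "coreCount": 8
                        }
                    }
                }
            ]
        }
    }
    ```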

     

    Reference: Sink performance and best practices in mapping data flow - Azure Data Factory & Azure Synapse | Microsoft Learn

    Hope this helps. If this answers your query, please click "Accept Answer" and "Yes" for "Was this answer helpful". If you have any further queries, do let us know.


0 additional answers
