Failures on partitioning data

Pravalika-randstad 240 Reputation points
2023-10-25T19:49:33.0266667+00:00

I have a data flow that generates a 10GB CSV file and saves it to a blob. Currently, I’ve set the sink to “Single partition,” but this approach is causing long execution times and occasional failures. If I switch to the “Default Partitioning” option, it doesn’t allow me to customize the file names as “part1.csv,” “part2.csv,” etc., which is a requirement for my scenario. Additionally, there is no column suitable for key-based partitioning. Can you provide guidance on how to achieve custom file naming and partitioning without using the default options? Your assistance is greatly appreciated.

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

Answer accepted by question author
  1. Anonymous
    2023-10-26T05:57:14.6033333+00:00

    @Pravalika-randstad

    Thanks for using Microsoft Q&A

    • Create a new Data Flow

    You are going to create a very simple Data Flow just to leverage file partitioning. There will not be any column or row transformations; just a Source and a Sink that take a large file and produce smaller part files.


    • Add a Source file
    • Add a Sink folder

    For the Sink dataset, choose the type of output files you would like to produce. 
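
    Every data flow built in the designer is backed by a data flow script, which you can view with the Script button at the top right of the canvas. A minimal sketch of this source-to-sink flow, with placeholder transformation names, looks roughly like this:

    ```
    source(allowSchemaDrift: true,
        validateSchema: false) ~> LargeCsvSource
    LargeCsvSource sink(allowSchemaDrift: true,
        validateSchema: false) ~> PartFilesSink
    ```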


    • In the Optimize tab of the Sink transformation, select the "Set Partitioning" radio button. You will be presented with a series of options to define the partitioning.


    • This is where you define how you would like the partitioned files to be generated. For equal distribution across the output files, use the Round robin partition type.
    • Set the output file names using the "Pattern" file name option in the sink settings, e.g. part[n].csv, where [n] is replaced with the partition number to produce "part1.csv", "part2.csv", etc. (the script sketch after this list shows these sink settings together)


    • Notice I’ve also set “Clear the folder”. This will ask ADF to wipe the contents of the destination folder clean before loading new part files.
    • Save your data flow and create a new pipeline.
    • Add an Execute Data Flow activity and select your new file split data flow.
    • Execute the pipeline using the pipeline debug button.
    • You must execute data flows from a pipeline in order to generate file output. Debugging from Data Flow does not write any data.
    • After execution, you should now see the files that resulted from round robin partitioning of your large source file. You're done.
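
    Tying the settings above together: the round robin partitioning, the part[n].csv name pattern, and "Clear the folder" all surface as sink properties in the data flow script. The sketch below shows roughly how, reusing the placeholder transformation names from earlier; the partition count of 20 is just an example, and you should compare property names against the script generated for your own flow (truncate corresponds to "Clear the folder"):

    ```
    LargeCsvSource sink(allowSchemaDrift: true,
        validateSchema: false,
        filePattern: 'part[n].csv',
        truncate: true,
        partitionBy('roundRobin', 20)) ~> PartFilesSink
    ```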

    In the output of your pipeline debug run, you'll see the execution results of the data flow activity. Click on the eyeglasses icon to show the details of your data flow execution. You'll see the statistics of the distribution of records across your partitioned files.
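
    For completeness, the pipeline wrapping the data flow is just an Execute Data Flow activity. A minimal JSON sketch, with placeholder names and an example compute size, might look like this:

    ```json
    {
        "name": "SplitLargeCsvPipeline",
        "properties": {
            "activities": [
                {
                    "name": "Split large CSV",
                    "type": "ExecuteDataFlow",
                    "typeProperties": {
                        "dataFlow": {
                            "referenceName": "FileSplitDataFlow",
                            "type": "DataFlowReference"
                        },
                        "compute": {
                            "computeType": "General",
                            "coreCount": 8
                        }
                    }
                }
            ]
        }
    }
    ```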

     

    Reference: Sink performance and best practices in mapping data flow - Azure Data Factory & Azure Synapse | Microsoft Learn

    Hope this helps. If this answers your query, please click "Accept Answer" and "Yes" for "Was this answer helpful". If you have any further queries, do let us know.


0 additional answers
