One CSV to multiple Parquet files

Ryan Abbey
We have one large CSV file that we are looking to convert into Parquet and, based on the recommended guideline of Parquet files up to roughly 1 GB, split across a few files. However, we are running into a couple of issues:
- If we don't specify a file name within the Parquet dataset definition and instead set, e.g., 10,000,000 rows per file, the copy activity auto-generates a subfolder based on the input file name, which we don't want.
- If we extend option 1 to specify a "File name prefix", we get the error FileNamePrefixNotSupportFileBasedSource (I note the info box does say you can't specify a prefix with file-based sources).
So how do we stop it from generating a subfolder based on the source file name? It seems pretty restrictive and illogical to force an unwanted subfolder (an MS trait that hasn't stopped over the years!)
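For reference, a minimal sketch of the copy activity sink settings being described, assuming an ADLS Gen2 sink and a placeholder prefix value (the property names come from the copy activity's Parquet write settings; everything else here is illustrative):

```json
"sink": {
  "type": "ParquetSink",
  "storeSettings": {
    "type": "AzureBlobFSWriteSettings"
  },
  "formatSettings": {
    "type": "ParquetWriteSettings",
    "maxRowsPerFile": 10000000,
    "fileNamePrefix": "split_"
  }
}
```

With only maxRowsPerFile set, the split output lands in an auto-generated subfolder named after the source file; adding fileNamePrefix is what raises FileNamePrefixNotSupportFileBasedSource when the source is itself file-based.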
Hi @Ryan Abbey,
Could you please share details on the clarifications requested above? That will help us understand the issue better and provide a detailed resolution.
Apologies, forgot all about the questions as we moved on...
Hopefully the images actually show...
Hi @Ryan Abbey,
Thank you for reframing your ask. A few small clarifications here.
Kindly share details on the above clarifications; that will help us provide a detailed resolution. Thank you.
Hi @Ryan Abbey,
Just checking whether the answer provided below helps you. If yes, please Accept Answer. Accepting the answer will help the community. Thank you.
Hi @Ryan Abbey,
Following up to check whether the answer provided below helps you. If yes, please Accept Answer. Accepting the answer will help the community. Thank you.
1 answer
Hi @Ryan Abbey,
Please check the detailed example below, which copies the file to a folder whose name is created dynamically, as you requested above (iri_FCT_yyyyMMdd).
Step 1: Create a variable in your pipeline to hold the current date. Use a Set Variable activity to set its value.
Step 2: Use a Copy activity to copy the zip file. The source and sink dataset types should be Binary. In the sink dataset, create a parameter that dynamically supplies the target folder name as "iri_FCT_yyyyMMdd" (see the sketch after these steps).
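A minimal sketch of these two steps in pipeline and dataset JSON, assuming a placeholder variable name (currentDate), a placeholder Binary sink dataset (SinkBinary) on ADLS Gen2, and a placeholder linked service (AdlsLinkedService):

```json
{
  "name": "SetCurrentDate",
  "type": "SetVariable",
  "typeProperties": {
    "variableName": "currentDate",
    "value": {
      "value": "@formatDateTime(utcNow(), 'yyyyMMdd')",
      "type": "Expression"
    }
  }
}
```

```json
{
  "name": "SinkBinary",
  "properties": {
    "type": "Binary",
    "linkedServiceName": {
      "referenceName": "AdlsLinkedService",
      "type": "LinkedServiceReference"
    },
    "parameters": {
      "folderName": { "type": "string" }
    },
    "typeProperties": {
      "location": {
        "type": "AzureBlobFSLocation",
        "fileSystem": "output",
        "folderPath": {
          "value": "@dataset().folderName",
          "type": "Expression"
        }
      }
    }
  }
}
```

In the Copy activity's sink dataset reference, pass folderName as @concat('iri_FCT_', variables('currentDate')) so the target folder resolves to iri_FCT_yyyyMMdd.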
Hope this will help.
----------------------------------
Please accept an answer if correct. Original posters help the community find answers faster by identifying the correct answer. Here is how.
That would copy the zip file to a directory of the same name? We are trying to convert to Parquet and split into multiple files at the same time.
Hi @Ryan Abbey,
In a data flow's sink transformation you can use partitioning to split your data and save it as separate files.
To learn more about partitioning in data flows, please check the link below:
https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-performance#optimize-tab
In the example below I am partitioning the file into 2 partitions.
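As a rough sketch only, assuming an inline Parquet dataset and placeholder stream names (sourceCsv, sinkParquet), the sink fragment in data flow script terms would look something like this, with partitionBy reflecting 2 round-robin partitions chosen on the sink's Optimize tab:

```
sourceCsv sink(allowSchemaDrift: true,
    validateSchema: false,
    format: 'parquet',
    partitionBy('roundRobin', 2)) ~> sinkParquet
```

With round-robin partitioning across N partitions, the sink writes N roughly equal Parquet files into the target folder.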
Hope this helps.
------------------------------------------
Please Accept Answer if this helps. Thank you.
Hi @Ryan Abbey,
Just checking whether the provided answer helps you. If yes, please Accept Answer. Accepting the answer will help the community. Thank you.
You've gone to a lot of effort, which is appreciated! However, data flow mapping is already known about and not what we're after... really, because the process is clearly capable of splitting files as well as creating folders, it should not be so difficult to split files and put them where the user requires them... since it currently isn't capable, maybe a feature request... (can't remember if I did that or not)
Hi @Ryan Abbey,
Thank you.
Please feel free to share your feedback at the link below. The Azure Data Factory product team closely monitors feedback there and considers it for future releases. Thank you.
https://feedback.azure.com/forums/270578-data-factory
Please Accept Answer. Accepting the answer will help the community too.