empty parquet files not being persisted

Question

I have a Synapse data flow that takes files from an Inline Common Data Model source and parses them to parquet format.
The CDM does a great job of flattening the JSON files and pointing to the data in CSV format to populate the parquet files with, when the data exists.
My ADLS container has a list of JSON files, which are schema definitions.
In a "folder" called 'data' in the same root container as the json files there are "subfolders" named after the json files.

These subfolders hold the corresponding data in CSV format.

The process takes these data points and creates the parquet files as expected, in another container named 'parquet'.

The problem is that the process works fine for the subfolders that DO have data. There are also elements that DO NOT have data but still have a schema. The pipeline creates the parquet folders for them, but because there is no data to put in them, the folders end up empty on ADLS and are immediately deleted.
What I want is for at least 1 default row of data to be written for each file, whether or not there is data, so that the folders are not deleted by ADLS.
So my first attempt was to add a Derived Column transformation with a column named 'column1' and a static value of 'test'. My hope was that even if the other columns are null (and they are), it would still create the parquet file with 1 row of data in which all columns are null except 'column1', which has the value 'test'. I need this because the parquet files will have data later (they just don't right now), and the structure has to be there in parquet because these files will be used for data analysis downstream.

But this attempt did not work. The static data seems not to persist, as evidenced by the fact the file still gets deleted as empty.

So, how do I get these parquet files to persist even with just 1 row of mock data?

Answer

Hello @TerriblyVexed,
Thanks for the question and for using the Microsoft Q&A platform.
As I understand it, you want to add a row on the sink side (in this case a parquet file). I do not have a CDM setup, so I am using CSV to implement this logic.

Let's assume the actual data is in main.csv.
Create a CSV file (any format will do) with dummy data. I have added a column named Sort order; if you are in a situation where you cannot add a new field, take any existing field of type int and that will work.
Use a data flow and add the Union transformation: one source is the actual data (main.csv) and the other is the dummy CSV.
Add a Sort transformation to sort the data so the dummy row is always on top.
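
To make the union-and-sort idea concrete, here is a minimal sketch of the same logic in plain Python. It is not Synapse data flow code; it only illustrates what the Union and Sort steps do to the rows. The function name `union_with_dummy`, the column name `SortOrder`, and the sample columns `id` and `name` are made up for this illustration.

```python
import csv
import io


def union_with_dummy(main_csv_text, dummy_row, sort_key="SortOrder"):
    """Union the real rows with one dummy row and sort the dummy row to the top.

    main_csv_text: contents of the real CSV (may contain only a header line).
    dummy_row:     dict of placeholder values for the dummy row.
    sort_key:      hypothetical integer column used only for ordering.
    """
    reader = csv.DictReader(io.StringIO(main_csv_text))
    columns = list(reader.fieldnames or [])
    if sort_key not in columns:
        columns.append(sort_key)

    # Real rows get sort order 1 so they land after the dummy row (0).
    real_rows = [{**row, sort_key: 1} for row in reader]

    # Columns the dummy row does not mention are left empty.
    placeholder = {col: dummy_row.get(col, "") for col in columns}
    placeholder[sort_key] = 0

    # "Union" the two inputs, then "Sort" on the integer column.
    combined = [placeholder] + real_rows
    combined.sort(key=lambda r: r[sort_key])
    return columns, combined


if __name__ == "__main__":
    # A source with a schema (header) but no data still yields one row.
    cols, rows = union_with_dummy("id,name\n", {"name": "test"})
    print(cols, rows)
```

Because the dummy row is always present, the combined output is never empty, even when the real source has a header and no rows. Whether this is enough to keep the parquet output in place in your CDM setup is something you would need to verify in the data flow itself.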

Please let me know how it goes.
Thanks
Himanshu

