
empty parquet files not being persisted

TerriblyVexed 51 Reputation points
Nov 12, 2021, 7:46 PM

I have a Synapse data flow that reads files from an inline Common Data Model (CDM) source and writes them out in parquet format.
The CDM source does a great job of flattening the json files AND pointing to the data in csv format used to populate the parquet file, WHEN THE DATA EXIST.
My ADLS container has a list of json files

148976-image.png

which are schema definitions.
In a "folder" called 'data' in the same root container as the json files there are "subfolders" named after the json files.

148897-image.png

In these subfolders are the correlating data in csv format.

148962-image.png

The process takes these data points and creates the parquet files as expected, in another container named 'parquet'.

148907-image.png

The problem is that the process works fine over the subfolders that DO have data. There are also elements that DO NOT have data but still have a schema. The pipeline creates the parquet folders, but because there are no data to put in them, the folders wind up empty on ADLS and are immediately deleted.
What I want is for at least 1 default row of data to be written for each file, regardless of whether there is data, so the folders are not deleted by ADLS.
My first attempt was to add a Derived Column transformation with a column named 'column1' and a static value of 'test'. My hope was that even if the other columns are null (and they are), it would still create the parquet file with 1 row of data in which every column is null except 'column1', which has the value 'test'. The parquet files will have data later (they just don't right now), and I need the structure to be there in parquet because these files will be used for data analysis downstream.

148908-image.png

But this attempt did not work. The static data seems not to persist, as evidenced by the fact the file still gets deleted as empty.
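One possible explanation, offered as an assumption rather than confirmed behavior of the data flow: a derived column is computed once per incoming row, so a source with zero rows still produces zero rows no matter what static value the column holds. The plain-Python sketch below illustrates that idea only; `add_derived_column` is a hypothetical helper, not a Synapse API.

```python
def add_derived_column(rows, name, value):
    """Return a copy of rows with a constant column added to every row."""
    return [{**row, name: value} for row in rows]


# Hypothetical inputs: one entity with data, one with a schema but no rows.
populated = [{"id": 1, "name": "a"}]
empty = []

with_data = add_derived_column(populated, "column1", "test")
without_data = add_derived_column(empty, "column1", "test")

print(len(with_data))     # 1
print(len(without_data))  # 0: the new column never creates a row
```

If that assumption holds, the placeholder row has to come from a second input rather than from a column added to the existing stream.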

So, how do I get these parquet files to persist even with just 1 row of mock data?

Azure Data Lake Storage
An Azure service that provides an enterprise-wide hyper-scale repository for big data analytic workloads and is integrated with Azure Blob Storage.
Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.

1 answer

  1. HimanshuSinha-msft 19,476 Reputation points Microsoft Employee
    Nov 15, 2021, 10:42 PM

    Hello @TerriblyVexed,
    Thanks for the question and for using the Microsoft Q&A platform.
    As I understand it, you want to add a row on the sink side (in this case a parquet file). I do not have a CDM setup, so I am using CSV to implement this logic.

    1. Let's assume the actual data is in main.csv.
    2. Create a CSV file (any format will do) with dummy data. I have added a column named Sort order; if you are in a situation where you cannot add a new field, use any existing field of type int instead.
    3. In the data flow, add a Union transformation: one source is the actual data (main.csv) and the other is the dummy CSV.
    4. Add a Sort transformation to sort the data so the dummy row is always on top.
      149553-2021-11-15-14-29-16-maincsv-microsoft-azure.png
      149535-2021-11-15-14-38-08-greenshot-image-editor.png
      149536-2021-11-15-14-30-04-analytics-moveandtransform-v1.png
      149537-2021-11-15-14-36-34-analytics-moveandtransform-v1.png
      Please let me know how it goes.
      Thanks
      Himanshu
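
The union-then-sort steps above can be sketched outside the data flow in plain Python. This is a minimal illustration under stated assumptions, not the data flow itself: the CSV contents, the `sort_order` column name, and the helpers `read_rows` and `union_with_dummy` are all hypothetical, and real rows are assumed to carry a `sort_order` greater than the dummy row's 0.

```python
import csv
import io

# Hypothetical file contents; in the data flow these would be two sources.
MAIN_EMPTY = "id,name,sort_order\n"               # header only, no data rows
MAIN_WITH_DATA = "id,name,sort_order\n7,widget,1\n"
DUMMY = "id,name,sort_order\n,,0\n"               # one placeholder row


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def union_with_dummy(main_rows, dummy_rows, sort_key="sort_order"):
    """Union both inputs, then sort so the lowest sort_key comes first."""
    combined = main_rows + dummy_rows
    return sorted(combined, key=lambda row: int(row[sort_key]))


only_dummy = union_with_dummy(read_rows(MAIN_EMPTY), read_rows(DUMMY))
dummy_first = union_with_dummy(read_rows(MAIN_WITH_DATA), read_rows(DUMMY))

print(len(only_dummy))       # 1: the output is never empty
print(dummy_first[0]["id"])  # empty string: the dummy row sorts to the top
```

Because the dummy input always contributes one row, the sink receives at least one row even when the main source has none. Downstream consumers would need to filter that row out, for example on `sort_order` equal to 0.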

