Hi @arkiboys ,
Thank you for using the Microsoft Q&A platform and posting your query.
As I understand your query, you want to know the significance of the different metadata files generated after the pipeline execution completes, and also why the output Parquet files are split into multiple small files. Could you please confirm that your pipeline contains an activity that runs on a Spark cluster, e.g. a Data flow, Notebook, or Databricks activity?
The _committed, _started, and _SUCCESS files are transactional files created by a commit protocol when a Spark job runs. They are used to implement fault tolerance in Apache Spark.
To better understand why job commit is necessary, let’s compare two different failure scenarios if Spark were to not use a commit protocol:
- If a task fails, it could leave partially written data on S3 (or other storage). The Spark scheduler will re-attempt the task, which could result in duplicated output data.
- If an entire job fails, it could leave partial results from individual tasks on S3.
Either of these scenarios can be extremely detrimental to a business. To avoid these data corruption scenarios, Spark relies on commit protocol classes from Hadoop that first stage task output into temporary locations, only moving the data to its final location upon task or job completion.
When DBIO transactional commit is enabled, metadata files named _started_&lt;id&gt; and _committed_&lt;id&gt; will accompany the data files created by Apache Spark jobs. Generally, you shouldn't alter these files directly; instead, use the VACUUM command to clean them up.
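For illustration, here is a minimal PySpark sketch of what this looks like on a Databricks cluster. The mount path `/mnt/datalake/output/demo` is a hypothetical placeholder, `spark` and `dbutils` are the objects a Databricks notebook provides, and the exact VACUUM behavior can vary by runtime version:

```python
# Minimal sketch, assuming a Databricks notebook (spark/dbutils are predefined)
# and a hypothetical mount point /mnt/datalake/output/demo.

# Write a small DataFrame as Parquet. With DBIO transactional commit enabled,
# each task's output is staged and only committed once the task/job succeeds.
df = spark.range(1000)
df.write.mode("overwrite").parquet("/mnt/datalake/output/demo")

# Listing the folder shows the part-*.parquet data files alongside the
# commit-protocol metadata files (_started_<id>, _committed_<id>, _SUCCESS).
for f in dbutils.fs.ls("/mnt/datalake/output/demo"):
    print(f.name)

# Instead of deleting the metadata files by hand, clean up uncommitted files
# with VACUUM (the retention window is expressed in hours).
spark.sql("VACUUM '/mnt/datalake/output/demo' RETAIN 168 HOURS")
```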
You can refer to the following documents and video for more details:
Transactional Writes to Cloud Storage on Databricks
Transactional writes to cloud storage with DBIO
https://youtu.be/w1_aOPj5ILw?t=533
Coming to the output Parquet files that are split into small chunks: since the file name in your sink dataset is empty, ADF automatically creates multiple output files with auto-generated file names for the partitioned output.
In the copy activity sink settings, you can specify the maximum number of rows allowed per partitioned file, and you can also provide a file name prefix, which creates files in this format: &lt;fileNamePrefix&gt;_00000.parquet. If not specified, the prefix is auto-generated.
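As a rough sketch, the relevant part of the copy activity sink JSON would look something like the below (the values are placeholders, and the store/format setting types assume a Parquet sink on ADLS Gen2; adjust them to your own connector):

```json
"sink": {
    "type": "ParquetSink",
    "storeSettings": {
        "type": "AzureBlobFSWriteSettings"
    },
    "formatSettings": {
        "type": "ParquetWriteSettings",
        "maxRowsPerFile": 1000000,
        "fileNamePrefix": "myOutput"
    }
}
```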
If you want your output in a single file instead of partitioned files, you need to provide the fileName in your sink dataset, which is currently empty in your case.
Hope this helps. Please let us know if you have any further queries.
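For example, a sink Parquet dataset pointing to a single file might look roughly like this (the dataset, linked service, container, folder, and file names below are hypothetical placeholders):

```json
{
    "name": "SinkParquetDataset",
    "properties": {
        "type": "Parquet",
        "linkedServiceName": {
            "referenceName": "AzureDataLakeStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobFSLocation",
                "fileSystem": "output",
                "folderPath": "curated",
                "fileName": "result.parquet"
            },
            "compressionCodec": "snappy"
        }
    }
}
```

With fileName populated, the copy activity writes to that single file instead of generating multiple partitioned files (for very large outputs, the partitioned layout may still be preferable for performance).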
------------------------------
- Please don't forget to click on Accept Answer or upvote whenever the information provided helps you.