delta parquet files in sink dataset

arkiboys 9,706 Reputation points
2022-04-06T10:51:47.733+00:00

Hello,
I am using a sink dataset in ADF as follows:
dataset1
file path: containername/foldername/subfoldername/
Note that the file box is blank

Once the ADF job completes, the storage folder contains files such as:
_committed...
_started...
_success
part-0000-...parquet
part-0007...parquet
...

Question:
Are these what is referred to as Delta files?
I ask because, as you can see in the screenshots, I am not using an Inline sink type or a Delta dataset in the sink.

Thank you

[Screenshots attached: 190516-image.png, 190503-image.png, 190517-image.png, 190572-image.png]

Azure Data Factory

Accepted answer
  1. AnnuKumari-MSFT 34,561 Reputation points Microsoft Employee Moderator
    2022-04-07T08:10:59.437+00:00

    Hi @arkiboys,
    Thank you for using the Microsoft Q&A platform and posting your query.

    As I understand your query, you want to know the significance of the different metadata files generated after the pipeline execution completes, and why the output Parquet data is split across several small files. Could you please confirm that your pipeline contains an activity that runs on a Spark cluster, e.g. a Data Flow, Notebook, or Databricks activity?

    _committed, _started and _success are transactional files created by a commit protocol when a Spark job runs. They are generally used to implement fault tolerance in Apache Spark.

    To better understand why job commit is necessary, consider two failure scenarios that could occur if Spark did not use a commit protocol:

    • If a task fails, it could leave partially written data on S3 (or other storage). The Spark scheduler will re-attempt the task, which could result in duplicated output data.
    • If an entire job fails, it could leave partial results from individual tasks on S3.

    Either of these scenarios can be extremely detrimental to a business. To avoid these data corruption scenarios, Spark relies on commit protocol classes from Hadoop that first stage task output into temporary locations, only moving the data to its final location upon task or job completion.

    When DBIO transactional commit is enabled, metadata files starting with _started_<id> and _committed_<id> accompany the data files created by Apache Spark jobs. Generally you shouldn’t alter these files directly; rather, use the VACUUM command to clean them up.
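
    For illustration, here is a minimal PySpark sketch (the storage path, sample data and session setup are placeholders, not taken from your pipeline) of the kind of write a Spark-backed activity performs behind the scenes; it is this write that leaves the part-*.parquet files and the commit-marker files in your folder:

    ```python
    # Minimal sketch: a plain DataFrame write; the abfss:// path below is a placeholder.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-sink-demo").getOrCreate()

    # Sample data standing in for the pipeline's output
    df = spark.range(0, 1_000_000).withColumnRenamed("id", "row_id")

    # Each Spark partition is written by a separate task, so the target folder ends up
    # with one part-XXXXX-....parquet file per partition rather than a single file,
    # plus _SUCCESS (and _started_<id>/_committed_<id> when DBIO commit is enabled).
    output_path = "abfss://containername@yourstorageaccount.dfs.core.windows.net/foldername/subfoldername/"  # placeholder
    df.repartition(8).write.mode("overwrite").parquet(output_path)
    ```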

    You can refer to the following documents and video for more details:
    Transactional Writes to Cloud Storage on Databricks
    Transactional writes to cloud storage with DBIO
    https://youtu.be/w1_aOPj5ILw?t=533

    Coming to the output Parquet files that are partitioned into small chunks: since the file name in your sink dataset is empty, ADF automatically creates multiple output files with auto-generated names for the partitioned files.

    In the Copy activity sink settings, you can specify the maximum number of rows you want to allow per partitioned file, and you can also provide a file name prefix, which creates files in the format <fileNamePrefix>_00000.parquet. If not specified, the prefix is auto-generated.

    [Screenshot attached: 190779-image.png]
    If you want to see your output in a single file instead of partitioned files, provide the fileName in your sink dataset, which is currently empty in your case.
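
    If you want to double-check that this output is plain partitioned Parquet rather than a Delta table, a quick sketch (assuming the folder is reachable at a hypothetical local or mounted path) is to look for a _delta_log directory, which only Delta tables contain, and to read the part files back as one logical dataset:

    ```python
    # Sketch only: the mounted path below is hypothetical.
    import os
    import pyarrow.parquet as pq

    folder = "/mnt/containername/foldername/subfoldername"  # placeholder path

    # A Delta table always carries a _delta_log/ directory of JSON transaction files;
    # plain partitioned Parquet output like yours does not.
    print("Delta table?", os.path.isdir(os.path.join(folder, "_delta_log")))

    # pyarrow reads every part-*.parquet file in the folder as a single table
    # (files starting with "_", such as _SUCCESS, are ignored by default).
    table = pq.read_table(folder)
    print(table.num_rows, "rows,", table.num_columns, "columns")
    ```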

    Hope this helps. Please let us know if you have any further queries.
