Do files in ADLS Gen2 appear atomically, or is a partially written file a possibility?

Ravi 20 Reputation points
2023-05-17T17:09:02.5966667+00:00

I am using Structured Streaming on Azure with Databricks. I have two pipelines that run sequentially.

Pipeline 1: Reads from Kafka and writes Parquet files to an output directory in ADLS Gen2 using Structured Streaming.

Pipeline 2: Picks up the available Parquet files from ADLS Gen2, processes them, and loads them into a Delta table using Structured Streaming.

Q1) I have read in many forums that partially written files (the output of pipeline 1) can be picked up by Spark before they have finished being written. In case of a failure, will Spark Structured Streaming resume processing from the partially processed file?

Q2) I am aware that on Amazon S3, objects normally appear only once they are fully written. Is it the same with ADLS Gen2? Do files become available for processing only atomically?

Q3) Is Auto Loader the preferred option in this scenario, considering that I am using Databricks to build the pipelines?

Q4) Does a Structured Streaming write produce only immutable files?

My goal is to ensure that Structured Streaming does not miss any data when reading from the file system.

Thanks!

Ravi

Azure Data Lake Storage
An Azure service that provides an enterprise-wide hyper-scale repository for big data analytic workloads and is integrated with Azure Blob Storage.
Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.

Accepted answer
  1. ShaikMaheer-MSFT 37,896 Reputation points Microsoft Employee
    2023-05-22T06:07:29.41+00:00

    Hi Ravi,

    Thank you for posting your query on the Microsoft Q&A platform.

    If I am not wrong, data should not be missed in your case. Files will appear in ADLS only after the load is complete.

    So you should hopefully be fine here. Still, I would recommend implementing a validation process as part of your testing to confirm that all data has been moved.
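
    A validation step of this kind could be as simple as comparing per-file row counts between the two pipelines. The following is only a minimal sketch in plain Python, not a Databricks or Spark API: the names `reconcile` and `Mismatch` are hypothetical, and it assumes you have already collected a mapping of file name to row count on each side (how those counts are gathered in your pipelines is left open).

```python
from typing import Dict, List, NamedTuple


class Mismatch(NamedTuple):
    """Hypothetical record describing one file whose counts disagree."""
    file: str
    expected: int  # rows pipeline 1 reported writing for this file
    actual: int    # rows pipeline 2 reported loading from this file


def reconcile(written: Dict[str, int], loaded: Dict[str, int]) -> List[Mismatch]:
    """Compare per-file row counts from both pipelines.

    A file missing on either side is treated as having a count of 0,
    so it is reported as a mismatch as well.
    """
    mismatches = []
    for name in sorted(set(written) | set(loaded)):
        expected = written.get(name, 0)
        actual = loaded.get(name, 0)
        if expected != actual:
            mismatches.append(Mismatch(name, expected, actual))
    return mismatches


if __name__ == "__main__":
    # Illustrative counts only; real values would come from your own pipelines.
    written = {"part-0000.parquet": 100, "part-0001.parquet": 250}
    loaded = {"part-0000.parquet": 100, "part-0001.parquet": 200}
    for m in reconcile(written, loaded):
        print(f"{m.file}: expected {m.expected}, loaded {m.actual}")
```

    An empty result means every file written by pipeline 1 was loaded with the same number of rows by pipeline 2. Any entry returned points to a file worth investigating.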

    Are you seeing any data misses? Please let me know how it goes. Thank you.

