I am using Structured Streaming on Databricks in an Azure setup. I have two pipelines that run sequentially.
Pipeline 1: Sources from Kafka and writes Parquet files to an output directory in ADLS Gen 2 using Structured Streaming.
Pipeline 2: Picks up the available Parquet files from ADLS Gen 2, processes them, and loads them into a Delta table using Structured Streaming.
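To make the setup concrete, here is roughly what the two pipelines look like in PySpark (the broker, topic, paths, schema, and table name below are placeholders, not my real values):

```python
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType

# Pipeline 1: Kafka -> Parquet files in ADLS Gen 2.
# "spark" is the session Databricks provides; broker, topic, and paths are placeholders.
kafka_df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "events")
    .load()
)

(kafka_df
    .select(col("value").cast("string").alias("value"))
    .writeStream
    .format("parquet")
    .option("path", "abfss://data@myaccount.dfs.core.windows.net/raw/")
    .option("checkpointLocation", "abfss://data@myaccount.dfs.core.windows.net/chk/pipeline1/")
    .start())

# Pipeline 2: Parquet files in ADLS Gen 2 -> Delta table.
# File-source streams require an explicit schema; this one matches the write above.
input_schema = StructType([StructField("value", StringType())])

parquet_df = (
    spark.readStream
    .format("parquet")
    .schema(input_schema)
    .load("abfss://data@myaccount.dfs.core.windows.net/raw/")
)

(parquet_df
    .writeStream
    .format("delta")
    .option("checkpointLocation", "abfss://data@myaccount.dfs.core.windows.net/chk/pipeline2/")
    .toTable("bronze.events"))
```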
Q1) I have read in many forums that partially written files (the output of pipeline 1) can be picked up by Spark before the writes are finished. In case of a failure, will Spark Structured Streaming ensure that processing resumes correctly from the partially processed file?
Q2) I am aware that on Amazon S3, objects normally only become visible once they are fully written. Is that also the case with ADLS Gen 2? Are files made visible atomically, only after they are completely written?
Q3) Is Auto Loader the preferred option in this scenario, considering that I am using Databricks for pipeline creation? (A sketch of what I have in mind is below the questions.)
Q4) Does Structured Streaming's file sink produce only immutable files (i.e., files that are never modified after being written)?
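For Q3, this is the kind of Auto Loader variant of pipeline 2 I have in mind (the cloudFiles options are Databricks-specific; paths and table name are again placeholders):

```python
# Auto Loader variant of pipeline 2: "cloudFiles" is a Databricks-only source
# that tracks which files have already been ingested.
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    .option("cloudFiles.schemaLocation", "abfss://data@myaccount.dfs.core.windows.net/schemas/pipeline2/")
    .load("abfss://data@myaccount.dfs.core.windows.net/raw/")
)

(df.writeStream
    .format("delta")
    .option("checkpointLocation", "abfss://data@myaccount.dfs.core.windows.net/chk/pipeline2/")
    .toTable("bronze.events"))
```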
My overall goal is to ensure that the file-based read in Structured Streaming does not miss any data.
Thanks!
Ravi