I am using Structured Streaming on Databricks in an Azure setup. I have two pipelines that run sequentially.
Pipeline 1: Sources from Kafka and writes Parquet files to an output directory in ADLS Gen 2 using Structured Streaming.
Pipeline 2: Picks up the available Parquet files from ADLS Gen 2, processes them, and loads them into a Delta table using Structured Streaming.
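To make the setup concrete, here is roughly what the two pipelines look like in PySpark (the broker, topic, paths, schema, and table name below are placeholders, not my real values):

```python
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType

# Pipeline 1: Kafka -> Parquet files in ADLS Gen 2.
# "spark" is the session Databricks provides; broker, topic, and paths are placeholders.
kafka_df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "events")
    .load()
)

(kafka_df
    .select(col("value").cast("string").alias("value"))
    .writeStream
    .format("parquet")
    .option("path", "abfss://data@myaccount.dfs.core.windows.net/raw/")
    .option("checkpointLocation", "abfss://data@myaccount.dfs.core.windows.net/chk/pipeline1/")
    .start())

# Pipeline 2: Parquet files in ADLS Gen 2 -> Delta table.
# File-source streams require an explicit schema; this one matches the write above.
input_schema = StructType([StructField("value", StringType())])

parquet_df = (
    spark.readStream
    .format("parquet")
    .schema(input_schema)
    .load("abfss://data@myaccount.dfs.core.windows.net/raw/")
)

(parquet_df
    .writeStream
    .format("delta")
    .option("checkpointLocation", "abfss://data@myaccount.dfs.core.windows.net/chk/pipeline2/")
    .toTable("bronze.events"))
```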
Q1) I have read in many forums that partially written files (the output of pipeline 1) can be picked up by Spark before the writes are finished. In case of a failure, will Spark Structured Streaming ensure that processing resumes correctly from the partially processed file?
Q2) I am aware that on Amazon S3, objects normally only become visible once they are fully written. Is that also the case with ADLS Gen 2? Are files made visible atomically, only after they are completely written?
Q3) Is Auto Loader the preferred option in this scenario, considering that I am using Databricks for pipeline creation? (A sketch of what I have in mind is below the questions.)
Q4) Does Structured Streaming's file sink produce only immutable files (i.e., files that are never modified after being written)?
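For Q3, this is the kind of Auto Loader variant of pipeline 2 I have in mind (the cloudFiles options are Databricks-specific; paths and table name are again placeholders):

```python
# Auto Loader variant of pipeline 2: "cloudFiles" is a Databricks-only source
# that tracks which files have already been ingested.
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    .option("cloudFiles.schemaLocation", "abfss://data@myaccount.dfs.core.windows.net/schemas/pipeline2/")
    .load("abfss://data@myaccount.dfs.core.windows.net/raw/")
)

(df.writeStream
    .format("delta")
    .option("checkpointLocation", "abfss://data@myaccount.dfs.core.windows.net/chk/pipeline2/")
    .toTable("bronze.events"))
```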
My overall goal is to ensure that the file-based read in Structured Streaming does not miss any data.
Thanks!
Ravi