How to write multiple parquet files from blob storage into a data lake table using Databricks

Gulhasan.Siddiquee 101 Reputation points
2023-04-05T06:57:36.6+00:00

I have many parquet files (300 files; all contain the same number of columns and the same column order) in a blob container. I want to write/copy all of these files into a single table in the data lake using Databricks. The table already exists in the data lake.

Tags: Azure SQL Database, Azure Synapse Analytics, Azure Databricks, Azure Data Factory

2 answers

  1. Vidhya Sagar Karthikeyan 396 Reputation points
    2023-04-05T10:12:56.9966667+00:00

    @Gulhasan.Siddiquee You can use repartition or coalesce to write it back as a single file; sample code is below, with a coalesce variant sketched after it. Just keep in mind that when you write it out as a single file, you lose the ability to do parallel reads. If the resulting file is small this is fine, but if it ends up too big, Spark cannot read it in parallel, which can slow down read queries.

    # Read from a folder where you have multiple files with same schema
    df = spark.read.parquet("blob_source_address")
    
    # Write it back as a single file
    df.repartition(1).write.parquet("blob_destination_address")
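
    The coalesce variant mentioned above is sketched here; it is a minimal example that reuses the same placeholder paths, and coalesce(1) merges existing partitions instead of triggering a full shuffle:

    # Read the same source folder of parquet files
    df = spark.read.parquet("blob_source_address")

    # coalesce(1) collapses the existing partitions into one without a full shuffle,
    # which is usually cheaper than repartition(1) for a one-off write
    df.coalesce(1).write.parquet("blob_destination_address")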
    

    Mark as answer if this helps you


  2. ShaikMaheer-MSFT 38,546 Reputation points Microsoft Employee Moderator
    2023-04-13T16:39:36.52+00:00

    Hi Gulhasan.Siddiquee, thank you for posting your query on the Microsoft Q&A platform.

    You can load the data from all of the parquet files into a PySpark DataFrame and then use the DataFrame write API (df.write) to write it to the Delta table.

    In the code below, delta_table_path holds the path of your Delta table, and df is the DataFrame into which all of the parquet files were loaded (an append variant is sketched after the code).

    # Path of the existing Delta table in the data lake
    delta_table_path = "path/to/delta/table"
    # Read every parquet file in the folder into a single DataFrame
    df = spark.read.parquet("<folder path of your parquet files>")
    # overwrite replaces the table's current contents with the new data
    df.write.format("delta").mode("overwrite").save(delta_table_path)
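
    Since the table already exists in the data lake, a minimal sketch of appending instead of overwriting is shown below; lakehouse_db.my_table is a hypothetical table name used only for illustration, assuming the Delta table is registered in the metastore (otherwise write to delta_table_path with mode("append") and save()):

    # Load all 300 parquet files from the blob folder in one read
    df = spark.read.parquet("<folder path of your parquet files>")
    # Append to the existing Delta table instead of replacing its contents;
    # "lakehouse_db.my_table" is a placeholder for your registered table name
    df.write.format("delta").mode("append").saveAsTable("lakehouse_db.my_table")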
    
    

    Hope this helps. Please let me know if you have any further queries.


    Please consider hitting the Accept Answer button. Accepted answers help the community as well.

