How to write multiple parquet files from blob storage into a data lake table using Databricks

Gulhasan.Siddiquee 101 Reputation points
2023-04-05T06:57:36.6+00:00

I have many parquet files (300 files; all contain the same number of columns and the same column order) in a blob container. I want to write/copy all of these files into a single table in the data lake using Databricks. The table already exists in the data lake.

Tags: Azure SQL Database, Azure Synapse Analytics, Azure Databricks, Azure Data Factory

2 answers

  1. Vidhya Sagar Karthikeyan 396 Reputation points
    2023-04-05T10:12:56.9966667+00:00

    @Gulhasan.Siddiquee You can use repartition or coalesce to write it back as a single file; sample code is below, with a coalesce variant sketched after it. Just keep in mind that when you write it out as a single file, you lose the ability to do parallel reads. If the resulting file is small this is fine, but if it ends up too big, Spark cannot read it in parallel, which can slow down read queries.

    # Read from a folder where you have multiple files with same schema
    df = spark.read.parquet("blob_source_address")
    
    # Write it back as a single file
    df.repartition(1).write.parquet("blob_destination_address")
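
    The coalesce variant mentioned above is sketched here; it is a minimal example that reuses the same placeholder paths, and coalesce(1) merges existing partitions instead of triggering a full shuffle:

    # Read the same source folder of parquet files
    df = spark.read.parquet("blob_source_address")

    # coalesce(1) collapses the existing partitions into one without a full shuffle,
    # which is usually cheaper than repartition(1) for a one-off write
    df.coalesce(1).write.parquet("blob_destination_address")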
    

    Mark as answer if this helps you


  2. ShaikMaheer-MSFT 38,546 Reputation points Microsoft Employee Moderator
    2023-04-13T16:39:36.52+00:00

    Hi Gulhasan.Siddiquee, thank you for posting your query on the Microsoft Q&A platform.

    You can load the data from all of the parquet files into a PySpark DataFrame and then use the DataFrame write API (df.write) to write it to the Delta table.

    In the code below, delta_table_path holds the path of your Delta table, and df is the DataFrame into which all of the parquet files were loaded (an append variant is sketched after the code).

    # Path of the existing Delta table in the data lake
    delta_table_path = "path/to/delta/table"
    # Read every parquet file in the folder into a single DataFrame
    df = spark.read.parquet("<folder path of your parquet files>")
    # overwrite replaces the table's current contents with the new data
    df.write.format("delta").mode("overwrite").save(delta_table_path)
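
    Since the table already exists in the data lake, a minimal sketch of appending instead of overwriting is shown below; lakehouse_db.my_table is a hypothetical table name used only for illustration, assuming the Delta table is registered in the metastore (otherwise write to delta_table_path with mode("append") and save()):

    # Load all 300 parquet files from the blob folder in one read
    df = spark.read.parquet("<folder path of your parquet files>")
    # Append to the existing Delta table instead of replacing its contents;
    # "lakehouse_db.my_table" is a placeholder for your registered table name
    df.write.format("delta").mode("append").saveAsTable("lakehouse_db.my_table")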
    
    

    Hope this helps. Please let me know if you have any further queries.


    Please consider hitting the Accept Answer button. Accepted answers help the community as well.

