Best practise to copy updating parquet files into sql database

Jani Hämäläinen 80 Reputation points
2024-09-05T08:52:05.1166667+00:00

I have to copy over 300 million rows from parquet files to Azure SQL Database. (The Parquet files come from Synapselink for dataverse.) I have tried using copy activity but because the parquet files update constantly, it feels funky. I have also tried using data flow but they have been failing in insert side. What is the best practise to use in this case?

Azure SQL Database
Azure Synapse Analytics
Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.
4,874 questions
{count} votes

Accepted answer
  1. AnnuKumari-MSFT 32,821 Reputation points Microsoft Employee
    2024-09-06T06:19:18.79+00:00

    Hi Jani Hämäläinen ,

    Thankyou for using Microsoft Q&A platform and thanks for posting your query here.

    As per my understanding you are trying to copy data from large volume parquet files to Azure SQL Db using ADF/synapse pipelines.

    I would like to understand what it means when you say parquet files are getting updated constantly. My understanding is that the files would be getting replaced with the new updated files.

    To perform incremental load for parquet files as the source, copy activity would not be suffice, you can either use dataflow or write pyspark code in synapse notebook to identify the incremental data and load it in the Azure SQL DB.

    In case there is any possibility to treat Synapse SQL pool as the target DB, then you can use external tables to load the data from parquet to the Synapse DB. Or use external tables in Synapse DB as the intermediate staging layer before Azure SQL.

    Hope it helps . Kindly let us know how it goes. Thankyou

    0 comments No comments

0 additional answers

Sort by: Most helpful

Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.