Data flow activity using Synapse serverless pool vs. copy activity in Synapse dedicated pool

Das, Mahasweta 20 Reputation points
2023-08-17T15:29:03.5966667+00:00

Hi All,

My requirement is to process source files (.dat format; most are larger than 30 GB, and some are as large as 86 GB) from the source into a Synapse data mart schema. Below are two possible approaches:

1- Convert the source files to Parquet format, process them in the serverless pool using a data flow activity, and load the processed data into a Synapse table.

2- Use a copy activity to load the source files into a Synapse staging schema, then transform the data with a stored procedure and load it into the data mart schema in Synapse.

Now I need to know which one will be faster and more cost-effective.

Thanks in advance.

Mahasweta Das


Accepted answer
  1. AnnuKumari-MSFT 34,556 Reputation points Microsoft Employee Moderator
    2023-08-18T06:29:31.3233333+00:00

    Hi Das, Mahasweta,

    Welcome to Microsoft Q&A platform and thanks for posting your question here.

    It seems that you want to determine the most efficient and cost-effective way to process large source files in Synapse Analytics and load the processed data.

    Both options you have mentioned have advantages and disadvantages, and the right choice depends on your specific requirements and constraints.

    Here are some factors to consider:

    1. Data volume: For large data volumes, the copy activity may be faster than a data flow because it loads data into Synapse using bulk insert or PolyBase (recommended). A data flow, however, can handle complex transformations and may be more flexible in some cases.
    2. Cost: The cost of a data flow depends on the compute size of the integration runtime and the amount of data processed. The cost of a copy activity depends on the number of copy operations and the volume of data copied. Estimate both for your workload and pick the more cost-effective option.
    3. Processing complexity: If you need complex data transformations, the data flow activity may be more suitable because it provides a visual interface for building transformation logic. If your transformations are simple, a copy activity followed by a stored procedure can load data into Synapse; see the T-SQL sketch after this list.
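
    To make option 2 concrete, here is a minimal T-SQL sketch for a dedicated SQL pool. This is illustrative only: the storage URL, delimiter, schema, table, and column names are hypothetical placeholders for your own objects.

    ```sql
    -- Illustrative only: bulk-load a delimited .dat file into a staging table.
    -- COPY INTO is the recommended high-throughput ingestion path for
    -- dedicated SQL pools (alongside PolyBase external tables).
    COPY INTO stg.SalesRaw
    FROM 'https://<storageaccount>.blob.core.windows.net/source/sales.dat'
    WITH (
        FILE_TYPE = 'CSV',      -- delimited text; .dat files are read as CSV
        FIELDTERMINATOR = '|',  -- assumption: pipe-delimited source
        ROWTERMINATOR = '0x0A',
        FIRSTROW = 2            -- skip a header row, if one exists
    );

    -- Illustrative stored procedure: reshape staged rows into the data mart
    -- with CTAS, the high-throughput transformation pattern in dedicated pools.
    CREATE PROCEDURE mart.usp_LoadSales
    AS
    BEGIN
        IF OBJECT_ID('mart.FactSales') IS NOT NULL
            DROP TABLE mart.FactSales;

        CREATE TABLE mart.FactSales
        WITH (DISTRIBUTION = HASH(SaleId), CLUSTERED COLUMNSTORE INDEX)
        AS
        SELECT SaleId, CustomerId, CAST(Amount AS DECIMAL(18,2)) AS Amount, SaleDate
        FROM stg.SalesRaw;
    END;
    ```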

    Based on the above factors, choose the option that best meets your requirements: if you have large data volumes and need to load them quickly, the copy activity is likely the better option; if you need complex data transformations, the data flow activity may be more suitable.
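
    For the serverless side of option 1, keep in mind that a serverless SQL pool is billed per TB of data processed (roughly USD 5 per TB at the time of writing; check the Azure pricing page for current rates), so converting the .dat files to Parquet first makes every scan smaller and cheaper. Here is a minimal sketch of querying the processed Parquet output from a serverless pool, again with hypothetical paths and column names:

    ```sql
    -- Illustrative only: read Parquet files with a serverless SQL pool.
    -- Serverless billing is based on data processed, so scanning columnar
    -- Parquet is far cheaper than scanning the raw multi-GB .dat text files.
    SELECT SaleId, CustomerId, Amount, SaleDate
    FROM OPENROWSET(
        BULK 'https://<storageaccount>.dfs.core.windows.net/processed/sales/*.parquet',
        FORMAT = 'PARQUET'
    ) AS rows;
    ```

    Whether serverless or dedicated works out cheaper ultimately depends on how often you scan the data versus how heavily you use the dedicated pool you are already paying for.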

    Hope it helps. Kindly accept the answer by clicking on the Accept answer button. Thank you.


1 additional answer

  1. Das, Mahasweta 20 Reputation points
    2023-08-24T05:14:09.6933333+00:00

    Dear Annu,

    Thanks for your answer.

    Yes, it helped me. But could you please give me the cost calculation for serverless vs. dedicated?

    Thanks,

    Mahasweta

