Data flow activity using Synapse serverless pool vs. copy activity in Synapse dedicated pool

Das, Mahasweta 20 Reputation points
2023-08-17T15:29:03.5966667+00:00

Hi All,

My requirement is to process source files (.dat format; most are larger than 30 GB, and some are as large as 86 GB) from the source into a Synapse data mart schema. Below are two possible approaches:

1- Convert the source files to Parquet format, process them in the serverless pool using a data flow activity, and load the processed data into a Synapse table.

2- Use a copy activity to load the source files into a Synapse staging schema, then transform the data with a stored procedure and load it into the data mart schema in Synapse.

Now I need to know which one will be faster and more cost-effective.

Thanks in advance.

Mahasweta Das


Accepted answer
  1. AnnuKumari-MSFT 34,556 Reputation points Microsoft Employee Moderator
    2023-08-18T06:29:31.3233333+00:00

    Hi Das, Mahasweta,

    Welcome to Microsoft Q&A platform and thanks for posting your question here.

    It seems that you want to determine the most efficient and cost-effective way to process large source files in Synapse Analytics and load the processed data.

    Both options you have mentioned have advantages and disadvantages, and the right choice depends on your specific requirements and constraints.

    Here are some factors to consider:

    1. Data volume: For large data volumes, the copy activity may be faster than a data flow because it loads data into Synapse using bulk insert or PolyBase (recommended). A data flow, however, can handle complex transformations and may be more flexible in some cases.
    2. Cost: The cost of a data flow depends on the compute size of the integration runtime and the amount of data processed. The cost of a copy activity depends on the number of copy operations and the volume of data copied. Estimate both for your workload and pick the more cost-effective option.
    3. Processing complexity: If you need complex data transformations, the data flow activity may be more suitable because it provides a visual interface for building transformation logic. If your transformations are simple, a copy activity followed by a stored procedure can load data into Synapse; see the T-SQL sketch after this list.
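
    To make option 2 concrete, here is a minimal T-SQL sketch for a dedicated SQL pool. This is illustrative only: the storage URL, delimiter, schema, table, and column names are hypothetical placeholders for your own objects.

    ```sql
    -- Illustrative only: bulk-load a delimited .dat file into a staging table.
    -- COPY INTO is the recommended high-throughput ingestion path for
    -- dedicated SQL pools (alongside PolyBase external tables).
    COPY INTO stg.SalesRaw
    FROM 'https://<storageaccount>.blob.core.windows.net/source/sales.dat'
    WITH (
        FILE_TYPE = 'CSV',      -- delimited text; .dat files are read as CSV
        FIELDTERMINATOR = '|',  -- assumption: pipe-delimited source
        ROWTERMINATOR = '0x0A',
        FIRSTROW = 2            -- skip a header row, if one exists
    );

    -- Illustrative stored procedure: reshape staged rows into the data mart
    -- with CTAS, the high-throughput transformation pattern in dedicated pools.
    CREATE PROCEDURE mart.usp_LoadSales
    AS
    BEGIN
        IF OBJECT_ID('mart.FactSales') IS NOT NULL
            DROP TABLE mart.FactSales;

        CREATE TABLE mart.FactSales
        WITH (DISTRIBUTION = HASH(SaleId), CLUSTERED COLUMNSTORE INDEX)
        AS
        SELECT SaleId, CustomerId, CAST(Amount AS DECIMAL(18,2)) AS Amount, SaleDate
        FROM stg.SalesRaw;
    END;
    ```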

    Based on the above factors, choose the option that best meets your requirements: if you have large data volumes and need to load them quickly, the copy activity is likely the better option; if you need complex data transformations, the data flow activity may be more suitable.
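
    For the serverless side of option 1, keep in mind that a serverless SQL pool is billed per TB of data processed (roughly USD 5 per TB at the time of writing; check the Azure pricing page for current rates), so converting the .dat files to Parquet first makes every scan smaller and cheaper. Here is a minimal sketch of querying the processed Parquet output from a serverless pool, again with hypothetical paths and column names:

    ```sql
    -- Illustrative only: read Parquet files with a serverless SQL pool.
    -- Serverless billing is based on data processed, so scanning columnar
    -- Parquet is far cheaper than scanning the raw multi-GB .dat text files.
    SELECT SaleId, CustomerId, Amount, SaleDate
    FROM OPENROWSET(
        BULK 'https://<storageaccount>.dfs.core.windows.net/processed/sales/*.parquet',
        FORMAT = 'PARQUET'
    ) AS rows;
    ```

    Whether serverless or dedicated works out cheaper ultimately depends on how often you scan the data versus how heavily you use the dedicated pool you are already paying for.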

    Hope it helps. Kindly accept the answer by clicking on the Accept answer button. Thank you.


1 additional answer

  1. Das, Mahasweta 20 Reputation points
    2023-08-24T05:14:09.6933333+00:00

    Dear Annu,

    Thanks for your answer.

    Yes, it helped me. But could you please give me the cost calculation for serverless vs. dedicated?

    Thanks,

    Mahasweta

