Why does my pipeline with a copy/sink task read a huge amount of data?

Iben Buus Tricker 5 Reputation points
2023-11-10T12:34:46.6766667+00:00

I have an ADF pipeline with a copy/sink task.
It reads from Parquet files containing a total of 7 million rows, but as shown in the screenshot the task reports a Data Read of 228 GB. Why? What am I doing wrong, and how do I lower this?

[Screenshot: Copy activity monitoring details showing a Data Read of 228 GB]

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

1 answer

  1. Smaran Thoomu 24,110 Reputation points Microsoft External Staff Moderator
    2023-11-10T14:05:50.0166667+00:00

    Hi @Iben Buus Tricker

    Thank you for posting query in Microsoft Q&A Platform.

    Based on the information you provided, it seems your ADF pipeline is reading much more data than you expect. This is usually explained by how the Copy activity reads Parquet files.

    The Copy activity does not push filters down into Parquet files: it reads every row group and every mapped column of each file, not just the rows you ultimately need. In addition, Parquet is compressed on disk, and the Data Read metric can reflect the data after it is decompressed and deserialized, so it may be far larger than the on-disk size of the files.

    Therefore, if your Parquet files contain many columns, or columns with large values (long strings, nested structures), the reported Data Read can be very large.
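    As a quick sanity check (a rough sketch; the 228 GB and 7 million row figures come from the question, and GB is taken here as 10^9 bytes), the implied bytes per row show why decompression and wide columns matter:

    ```python
    # Back-of-envelope check: how many bytes per row does the
    # reported Data Read imply?
    data_read_bytes = 228e9   # 228 GB reported by the Copy activity
    rows = 7_000_000          # total rows across the Parquet files

    per_row = data_read_bytes / rows
    print(f"{per_row:,.0f} bytes per row")  # roughly 32,571 bytes per row

    # ~32 KB per row is far more than a typical compressed Parquet row,
    # which is consistent with Data Read counting decompressed or
    # deserialized data, and with wide rows (many columns, large values).
    ```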

    To reduce the amount of data read by ADF, you can try the following:

    1. Filter the data before copying it. Copy only the columns the sink needs (via the Copy activity's column mapping), and if you need row-level filtering, apply it at the source, for example with a source query or a Mapping Data Flow filter transformation. This reduces the amount of data that ADF needs to read and improves the performance of the copy operation.
    2. Use parallel copy to split the data into multiple partitions and copy them to SQL Server in parallel. This mainly improves throughput; the same total amount of data is still read, but each partition is smaller and processed faster.
    3. Use a staging area, such as Azure Blob Storage or Azure Data Lake Storage, to store the data temporarily before copying it into SQL Server. This is especially useful when you are migrating a large dataset.
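    As an illustration, parallel copy and interim staging can be configured on the Copy activity itself. The sketch below assumes a Parquet source and an Azure SQL sink; `StagingBlobStorage` and the staging path are placeholder names you would replace with your own, while `parallelCopies`, `enableStaging`, and `stagingSettings` are standard Copy activity properties:

    ```json
    {
      "name": "CopyParquetToSql",
      "type": "Copy",
      "typeProperties": {
        "source": { "type": "ParquetSource" },
        "sink": { "type": "AzureSqlSink" },
        "parallelCopies": 8,
        "enableStaging": true,
        "stagingSettings": {
          "linkedServiceName": {
            "referenceName": "StagingBlobStorage",
            "type": "LinkedServiceReference"
          },
          "path": "stagingcontainer/staging"
        }
      }
    }
    ```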


    I hope this helps! Let me know if you have any further questions.

