Why does my pipeline with a copy/sink task read a huge amount of data?

Iben Buus Tricker 5 Reputation points
2023-11-10T12:34:46.6766667+00:00

I have an ADF pipeline with a copy/sink task.
It reads from Parquet files containing a total of 7 million rows, but as shown in the screenshot the task reports a Data Read of 228 GB. Why? What am I doing wrong, and how do I lower this?

[Screenshot: Copy activity monitoring details showing a Data Read of 228 GB]

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

1 answer

  1. Smaran Thoomu 24,110 Reputation points Microsoft External Staff Moderator
    2023-11-10T14:05:50.0166667+00:00

    Hi @Iben Buus Tricker

    Thank you for posting query in Microsoft Q&A Platform.

    Based on the information you provided, it seems your ADF pipeline is reading much more data than you expect. This is usually explained by how the Copy activity reads Parquet files.

    The Copy activity does not push filters down into Parquet files: it reads every row group and every mapped column of each file, not just the rows you ultimately need. In addition, Parquet is compressed on disk, and the Data Read metric can reflect the data after it is decompressed and deserialized, so it may be far larger than the on-disk size of the files.

    Therefore, if your Parquet files contain many columns, or columns with large values (long strings, nested structures), the reported Data Read can be very large.
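    As a quick sanity check (a rough sketch; the 228 GB and 7 million row figures come from the question, and GB is taken here as 10^9 bytes), the implied bytes per row show why decompression and wide columns matter:

    ```python
    # Back-of-envelope check: how many bytes per row does the
    # reported Data Read imply?
    data_read_bytes = 228e9   # 228 GB reported by the Copy activity
    rows = 7_000_000          # total rows across the Parquet files

    per_row = data_read_bytes / rows
    print(f"{per_row:,.0f} bytes per row")  # roughly 32,571 bytes per row

    # ~32 KB per row is far more than a typical compressed Parquet row,
    # which is consistent with Data Read counting decompressed or
    # deserialized data, and with wide rows (many columns, large values).
    ```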

    To reduce the amount of data read by ADF, you can try the following:

    1. Filter the data before copying it. Copy only the columns the sink needs (via the Copy activity's column mapping), and if you need row-level filtering, apply it at the source, for example with a source query or a Mapping Data Flow filter transformation. This reduces the amount of data that ADF needs to read and improves the performance of the copy operation.
    2. Use parallel copy to split the data into multiple partitions and copy them to SQL Server in parallel. This mainly improves throughput; the same total amount of data is still read, but each partition is smaller and processed faster.
    3. Use a staging area, such as Azure Blob Storage or Azure Data Lake Storage, to store the data temporarily before copying it into SQL Server. This is especially useful when you are migrating a large dataset.
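    As an illustration, parallel copy and interim staging can be configured on the Copy activity itself. The sketch below assumes a Parquet source and an Azure SQL sink; `StagingBlobStorage` and the staging path are placeholder names you would replace with your own, while `parallelCopies`, `enableStaging`, and `stagingSettings` are standard Copy activity properties:

    ```json
    {
      "name": "CopyParquetToSql",
      "type": "Copy",
      "typeProperties": {
        "source": { "type": "ParquetSource" },
        "sink": { "type": "AzureSqlSink" },
        "parallelCopies": 8,
        "enableStaging": true,
        "stagingSettings": {
          "linkedServiceName": {
            "referenceName": "StagingBlobStorage",
            "type": "LinkedServiceReference"
          },
          "path": "stagingcontainer/staging"
        }
      }
    }
    ```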


    I hope this helps! Let me know if you have any further questions.

