Why is the "Data read" size in an ADF copy activity so much larger than the original size of the source data?
I ran a copy activity in Azure Data Factory to transfer data from Google Cloud Storage to Azure Data Lake Storage: approximately 3,000 files in snappy.parquet format, with a combined size of roughly 30 GB.
When the copy completed, Azure Data Factory reported the "Data read" metric as 5.8 TB. This unexpected figure raises concerns about the associated cost on the Google Cloud Platform side.
To understand this substantial increase, I have searched resources such as Microsoft Learn and the Azure documentation, but I could not find an explanation of how or why the reported read volume reached 5.8 TB. I have attached an image below with more details about the copy activity run.
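For context, here is a minimal sketch of how the actual on-GCS size could be cross-checked against the "Data read" value that ADF reports (this assumes Python with the google-cloud-storage client; the bucket name and prefix below are placeholders, not my actual paths):

```python
# Minimal sketch: sum the stored size of the source parquet files in GCS
# so it can be compared with the "Data read" metric in the copy activity output.
# Assumes application-default credentials; bucket/prefix are placeholders.
from google.cloud import storage

client = storage.Client()
bucket_name = "my-source-bucket"   # placeholder
prefix = "exports/parquet/"        # placeholder

total_bytes = 0
file_count = 0
for blob in client.list_blobs(bucket_name, prefix=prefix):
    if blob.name.endswith(".snappy.parquet"):
        total_bytes += blob.size   # size as stored (compressed) in GCS
        file_count += 1

print(f"{file_count} files, {total_bytes / (1024 ** 3):.2f} GiB stored in GCS")
```

Running a tally like this confirms the source is on the order of 30 GB as stored, which is what makes the 5.8 TB figure so puzzling.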
Resolving this question is important to us, as it directly affects our decision on adopting Azure as our preferred multicloud solution.
Thank you in advance for your answer.
Cheers
Azure Data Lake Storage
Azure Data Factory