Data written size increase with COPY COMMAND in Copy activity

rajendar erabathini 616 Reputation points
2023-02-14T11:50:43.1533333+00:00

Hi - I have loaded a very large single PARQUET file from ADLS into a Synapse Analytics SQL pool table using Data Factory. I used the Copy activity with the copy method set to "COPY command". The copy was successful and all rows were copied, but the data read size and the data written size do not match (12.229 GB vs. 151.113 GB); the difference is huge. Please see the details captured from the activity output below. What could be the reason for the increased data written size? Thanks.

Data read: 12.229 GB

Rows read: 413,187,900

Peak connections: 1

Data written: 151.113 GB

Rows written: 413,187,900
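
For reference, a quick back-of-the-envelope check on these numbers (a minimal sketch using only the figures above, and assuming the activity output reports decimal gigabytes):

```python
# Figures copied from the Copy activity output above.
GB = 10**9  # assuming decimal gigabytes

rows = 413_187_900
data_read_gb = 12.229
data_written_gb = 151.113

print(f"Bytes read per row:    {data_read_gb * GB / rows:.1f}")        # ~29.6
print(f"Bytes written per row: {data_written_gb * GB / rows:.1f}")     # ~365.7
print(f"Expansion factor:      {data_written_gb / data_read_gb:.1f}x") # ~12.4x
```

So each row occupies roughly twelve times more space written than read.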

Tags: Azure Synapse Analytics, Azure Data Factory

1 answer

  1. Bhargava-MSFT 31,261 Reputation points Microsoft Employee Moderator
    2023-02-17T00:11:36.35+00:00

    Hello @rajendar erabathini,

    Welcome to the MS Q&A platform.

    There could be many reasons for the increased data written size when loading into a Synapse SQL pool. Here are some that come to mind:

    • The data in the PARQUET file may be compressed, reducing its size on disk. When the data is loaded into Synapse Analytics, it is decompressed, which can increase its size substantially (see the sketch after this list for a way to compare the file's compressed and uncompressed sizes).
    • PARQUET stores data in a compact, encoded binary columnar format. When the data is loaded into Synapse Analytics, it is converted to the database's internal storage format, which is typically less compact than PARQUET's encoding, increasing the data's size.
    • The data types in the PARQUET file may differ from those of the target table in Synapse Analytics. For example, if a column stored as a UTF-8 string in the PARQUET file is loaded into an NVARCHAR column, its size roughly doubles; widening numeric conversions have a similar effect.
    • If the PARQUET file contains null values, the data written size may be larger than the data read size because null values require additional storage space. Similarly, complex data types such as arrays or maps require additional storage space once loaded.
    • The default block size for a SQL pool table in Synapse Analytics is 1 MB. If the rows in the PARQUET file are small relative to the block size, the data written size will be larger than the data read size, as each block carries additional metadata.
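
    To check the first point concretely, here is a minimal sketch that reads the PARQUET file's footer metadata with pyarrow and totals the compressed versus uncompressed column-chunk sizes (the file path is a placeholder for your actual file):

    ```python
    import pyarrow.parquet as pq

    # Placeholder path; point this at the PARQUET file used in the Copy activity.
    pf = pq.ParquetFile("big_file.parquet")

    compressed = uncompressed = 0
    for rg in range(pf.metadata.num_row_groups):
        row_group = pf.metadata.row_group(rg)
        for col in range(row_group.num_columns):
            chunk = row_group.column(col)  # per-column-chunk metadata from the footer
            compressed += chunk.total_compressed_size
            uncompressed += chunk.total_uncompressed_size

    print(f"Rows:              {pf.metadata.num_rows:,}")
    print(f"Compressed size:   {compressed / 10**9:.3f} GB")
    print(f"Uncompressed size: {uncompressed / 10**9:.3f} GB")
    print(pf.schema_arrow)  # column names/types, to compare against the target table
    ```

    Note that the "uncompressed" figure is still the encoded columnar size; the database's internal row representation can be larger again, so the data written size can exceed even that number.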

    I hope this helps. Please let me know if you have any further questions.

