Data written size increase with COPY COMMAND in Copy activity

rajendar erabathini 616 Reputation points
2023-02-14T11:50:43.1533333+00:00

Hi - I have loaded a very large single PARQUET file from ADLS into a Synapse Analytics SQL pool table using Data Factory. I used the Copy activity with the copy method set to "COPY command". The copy was successful and all rows were copied, but the data read size and the data written size do not match (12.229 GB vs. 151.113 GB); the difference is huge. Please see the details captured from the activity output below. What could be the reason for the increased data written size? Thanks.

Data read: 12.229 GB

Rows read: 413,187,900

Peak connections: 1

Data written: 151.113 GB

Rows written: 413,187,900
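
For reference, a quick back-of-the-envelope check on these numbers (a minimal sketch using only the figures above, and assuming the activity output reports decimal gigabytes):

```python
# Figures copied from the Copy activity output above.
GB = 10**9  # assuming decimal gigabytes

rows = 413_187_900
data_read_gb = 12.229
data_written_gb = 151.113

print(f"Bytes read per row:    {data_read_gb * GB / rows:.1f}")        # ~29.6
print(f"Bytes written per row: {data_written_gb * GB / rows:.1f}")     # ~365.7
print(f"Expansion factor:      {data_written_gb / data_read_gb:.1f}x") # ~12.4x
```

So each row occupies roughly twelve times more space written than read.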

Tags: Azure Synapse Analytics, Azure Data Factory

1 answer

  1. Bhargava-MSFT 31,261 Reputation points Microsoft Employee Moderator
    2023-02-17T00:11:36.35+00:00

    Hello @rajendar erabathini,

    Welcome to the MS Q&A platform.

    There could be many reasons for the increased data written size when loading into a Synapse SQL pool. Here are some that come to mind:

    • The data in the PARQUET file may be compressed, reducing its size on disk. When the data is loaded into Synapse Analytics, it is decompressed, which can increase its size substantially (see the sketch after this list for a way to compare the file's compressed and uncompressed sizes).
    • PARQUET stores data in a compact, encoded binary columnar format. When the data is loaded into Synapse Analytics, it is converted to the database's internal storage format, which is typically less compact than PARQUET's encoding, increasing the data's size.
    • The data types in the PARQUET file may differ from those of the target table in Synapse Analytics. For example, if a column stored as a UTF-8 string in the PARQUET file is loaded into an NVARCHAR column, its size roughly doubles; widening numeric conversions have a similar effect.
    • If the PARQUET file contains null values, the data written size may be larger than the data read size because null values require additional storage space. Similarly, complex data types such as arrays or maps require additional storage space once loaded.
    • The default block size for a SQL pool table in Synapse Analytics is 1 MB. If the rows in the PARQUET file are small relative to the block size, the data written size will be larger than the data read size, as each block carries additional metadata.
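
    To check the first point concretely, here is a minimal sketch that reads the PARQUET file's footer metadata with pyarrow and totals the compressed versus uncompressed column-chunk sizes (the file path is a placeholder for your actual file):

    ```python
    import pyarrow.parquet as pq

    # Placeholder path; point this at the PARQUET file used in the Copy activity.
    pf = pq.ParquetFile("big_file.parquet")

    compressed = uncompressed = 0
    for rg in range(pf.metadata.num_row_groups):
        row_group = pf.metadata.row_group(rg)
        for col in range(row_group.num_columns):
            chunk = row_group.column(col)  # per-column-chunk metadata from the footer
            compressed += chunk.total_compressed_size
            uncompressed += chunk.total_uncompressed_size

    print(f"Rows:              {pf.metadata.num_rows:,}")
    print(f"Compressed size:   {compressed / 10**9:.3f} GB")
    print(f"Uncompressed size: {uncompressed / 10**9:.3f} GB")
    print(pf.schema_arrow)  # column names/types, to compare against the target table
    ```

    Note that the "uncompressed" figure is still the encoded columnar size; the database's internal row representation can be larger again, so the data written size can exceed even that number.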

    I hope this helps. Please let me know if you have any further questions.

