Why is the "Data Read" size in an ADF copy activity much bigger than the table size in BigQuery?

Sheldon Wong 0 Reputation points
2025-06-25T02:30:45.58+00:00

I am currently using Azure Data Factory (ADF) to copy data from a BigQuery table to Azure Blob Storage. However, I noticed a discrepancy where the "Data Read" metric in the copy activity is significantly larger than the actual size of the source table in BigQuery.

For example, when copying data for June 18, 2025, the BigQuery table size was much smaller, yet ADF reported 71.56 GB of data read.

In BigQuery, the logical storage size of the table is 9.72 GB.

Could you kindly help clarify why this difference occurs and if there are any optimizations to reduce the data transfer volume?

Thank you in advance for your answer

Azure Data Factory

1 answer

  1. Smaran Thoomu 24,110 Reputation points Microsoft External Staff Moderator
    2025-06-25T12:12:47.01+00:00

    Hi @Sheldon Wong

    Thank you for the details - this helps clarify the situation.

    Summary of what we’re seeing:

    • BigQuery table size (logical): 9.72 GB
    • Rows read: 14.56 million
    • ADF Copy Activity - Data Read: 71.56 GB
    • Data Written to Blob: 36.56 GB

    Why is ADF “Data Read” much larger?

    This behavior is expected due to the way ADF pulls data from BigQuery:

    Data materialized during transfer

    BigQuery stores data in a compressed, columnar format. When ADF reads the data, it converts it into a row-based, JSON-like structure via the BigQuery Storage API. This conversion leads to a significant expansion in size, especially with:

      • Long text fields
      • Repeated or nested fields
      • Wide schemas
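
    To make the expansion concrete, here is a rough, purely illustrative sketch of what a single row might look like once it has been materialized into a row-based, JSON-like record (the schema and field names are invented for this example):

    ```json
    {
      "order_id": "a3f9c2e1-7b4d-4f1e-9c2a-0d8e6b5a4c31",
      "order_date": "2025-06-18T09:42:17.000Z",
      "customer_comment": "Please leave the parcel with the concierge if nobody answers the door.",
      "line_items": [
        { "sku": "SKU-001", "description": "Stainless steel water bottle, 750 ml", "quantity": 2 },
        { "sku": "SKU-014", "description": "Bamboo travel cutlery set", "quantity": 1 }
      ]
    }
    ```

    In columnar storage, the field names and nested-array structure are encoded once per column and heavily compressed; in the row-based form above they are spelled out for every single row, which is a large part of why the bytes read can be several times the logical table size.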

    Serialization overhead

    When the data is serialized for transfer, it carries:

    • Field names, delimiters, and formatting overhead (especially in text-based formats such as JSON)
    • UTF-8 encoding of strings, which can further inflate text-heavy data

    Data written is smaller because of compression or a more efficient sink format

    In your case, the written size (36.56 GB) is about half of what was read - this suggests either:

    • Compression was applied on write to Azure Blob
    • Or the sink format is more storage-efficient
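
    As a point of reference, sink-side compression is configured on the sink dataset. Below is a minimal sketch of a Parquet dataset on Azure Blob Storage with Snappy compression; the dataset name, linked service name, container, and folder path are placeholders, so adjust them to your environment:

    ```json
    {
      "name": "ParquetOutput",
      "properties": {
        "type": "Parquet",
        "linkedServiceName": {
          "referenceName": "AzureBlobStorageLS",
          "type": "LinkedServiceReference"
        },
        "typeProperties": {
          "location": {
            "type": "AzureBlobStorageLocation",
            "container": "exports",
            "folderPath": "bigquery/orders/2025-06-18"
          },
          "compressionCodec": "snappy"
        },
        "schema": []
      }
    }
    ```

    Parquet is itself columnar and compressed, which is why the bytes written to Blob can end up well below the bytes read from BigQuery.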

    Optimization Tips

    To minimize inflated read size from BigQuery:

    • Avoid SELECT *: Specify only the required columns in your ADF source query.
    • Flatten Nested Fields: If your schema has nested structures or arrays, flattening them can reduce transfer overhead.
    • Push the Query Down to the Source: Use a custom SQL query in the copy activity source to filter rows and limit the data early.
    • Use Avro or Parquet in Sink: These formats are more compact and reduce Blob storage size and write times. (A combined sketch of these tips is shown after this list.)
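
    Putting these tips together, here is a minimal sketch of a copy activity whose source pushes column selection and a date filter down to BigQuery and whose sink writes to the Parquet dataset sketched above. Everything in it is illustrative: the dataset names (BigQueryOrders, ParquetOutput), the project/dataset/table path, and the column list are placeholders, and the exact source type name (GoogleBigQuerySource here) depends on which version of the Google BigQuery connector you are using:

    ```json
    {
      "name": "CopyBigQueryToBlob",
      "type": "Copy",
      "inputs": [ { "referenceName": "BigQueryOrders", "type": "DatasetReference" } ],
      "outputs": [ { "referenceName": "ParquetOutput", "type": "DatasetReference" } ],
      "typeProperties": {
        "source": {
          "type": "GoogleBigQuerySource",
          "query": "SELECT order_id, order_date, total_amount FROM `my_project.sales.orders` WHERE order_date = '2025-06-18'"
        },
        "sink": {
          "type": "ParquetSink"
        }
      }
    }
    ```

    With a query like this, only the selected columns and rows ever leave BigQuery, which directly reduces the "Data Read" figure, while the Parquet sink keeps the written size compact.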

    I hope this information helps. Please do let us know if you have any further queries.

    Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.

    1 person found this answer helpful.
