Thank you for the details - this helps clarify the situation.
Summary of what we’re seeing:
- BigQuery table size (logical): 9.72 GB
- Rows read: 14.56 million
- ADF Copy Activity - Data Read: 71.56 GB
- Data Written to Blob: 36.56 GB
Why is the ADF “Data Read” so much larger than the table size?
This behavior is expected due to the way ADF pulls data from BigQuery:
Data is materialized during transfer
BigQuery stores data in a compressed, columnar format. When ADF reads the table, it:
- Decompresses the data and converts it into row-based records via the BigQuery Storage API
- This leads to a significant expansion in size, especially with:
  - Long text fields
  - Repeated or nested fields
  - Wide schemas
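If you want to see how much of the gap is down to BigQuery's columnar compression, you can compare the table's logical (uncompressed) size with its physical (compressed) bytes. A minimal sketch, assuming the google-cloud-bigquery Python client with default credentials; the project, region, and table names are placeholders:

```python
# Sketch: compare logical (uncompressed) vs. physical (compressed, columnar)
# storage for a table. The project, region, and table name are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # assumes default credentials

sql = """
SELECT
  table_name,
  total_logical_bytes  / POW(1024, 3) AS logical_gb,   -- uncompressed size (what the BigQuery UI reports)
  total_physical_bytes / POW(1024, 3) AS physical_gb   -- compressed bytes actually stored
FROM `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
WHERE table_name = 'my_table'
"""

for row in client.query(sql).result():
    print(f"{row.table_name}: logical = {row.logical_gb:.2f} GB, physical = {row.physical_gb:.2f} GB")
```

The 9.72 GB you see in BigQuery is the logical size; the 71.56 GB counted by ADF is the size of the materialized, serialized rows on the wire.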
Serialization overhead
When the data is transmitted, it also carries:
- Field names, delimiters, and formatting overhead (especially in text formats such as JSON)
- UTF-8 encoding, which can further inflate strings that contain non-ASCII characters
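As a rough illustration of that per-row cost (the schema and row count below are invented, not your data), serializing the same values as JSON rows repeats the field names and punctuation on every row:

```python
# Sketch: per-row overhead of a text serialization. Every JSON row repeats the
# field names and adds quotes, braces, and commas on top of the actual values.
# The schema here is made up purely for illustration.
import json

rows = [
    {"customer_id": i, "country": "DE", "comment": "free-form text " * 10}
    for i in range(100_000)
]

values_only = sum(len(str(v)) for r in rows for v in r.values())
json_rows = sum(len(json.dumps(r)) for r in rows)

print(f"values only : {values_only / 1e6:6.1f} MB")
print(f"JSON rows   : {json_rows / 1e6:6.1f} MB  (+ field names, quotes, delimiters)")
```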
Data written is smaller because of compression or a more compact sink format
In your case, the written size (36.56 GB) is roughly half of what was read, which suggests either:
- Compression was applied on write to Azure Blob
- Or the sink format is more storage-efficient than the serialized transfer format
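On the write side, a quick sketch of how much a generic compressor shrinks repetitive serialized text; the actual ratio depends entirely on your data and on the format/compression configured on the Blob sink dataset:

```python
# Sketch: gzip-compressing text-serialized rows. Real ratios depend on your data
# and on the sink format/compression chosen in the ADF dataset (e.g. Parquet,
# or delimited text with gzip). The sample rows are invented.
import gzip
import json

lines = "\n".join(
    json.dumps({"id": i, "status": "ACTIVE", "note": "lorem ipsum dolor sit amet"})
    for i in range(50_000)
).encode("utf-8")

packed = gzip.compress(lines)
print(f"serialized: {len(lines) / 1e6:.1f} MB, gzipped: {len(packed) / 1e6:.1f} MB")
```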
Optimization Tips
To minimize inflated read size from BigQuery:
- Avoid SELECT *: Specify only the required columns in your ADF source query or dataset.
- Flatten Nested Fields: If your schema has nested structures or arrays, flattening them can reduce transfer overhead.
- Push the Query to the Source: Use a custom SQL query in the source dataset to filter rows and prune columns early, before the data is serialized (see the sample query after this list).
- Use Avro or Parquet in the Sink: These formats are more compact than delimited text or JSON and reduce Blob storage size and write times.
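For the first three tips, a source-side query along these lines (all table and column names here are hypothetical) lets BigQuery prune columns, flatten the repeated field, and filter rows before anything is serialized and counted as “Data Read”:

```python
# Sketch of a source-side query for the ADF copy source. The table and column
# names are hypothetical; adapt them to your schema. Selecting only the needed
# columns, flattening the repeated field with UNNEST, and filtering early all
# happen inside BigQuery, before the rows are serialized and sent to ADF.
source_query = """
SELECT
  o.order_id,
  o.order_date,
  item.sku,
  item.quantity
FROM `my-project.my_dataset.orders` AS o,
     UNNEST(o.line_items) AS item
WHERE o.order_date >= '2024-01-01'
"""
```

You would paste that SQL into the query option of the ADF Google BigQuery source (instead of pointing the source at the whole table) and pair it with a Parquet sink dataset for the last tip.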
I hope this information helps. Please do let us know if you have any further queries.
Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.