How does Azure Synapse on-demand (serverless) calculate data processed?

mim 26 Reputation points
2020-07-26T03:53:31.203+00:00

For testing purposes I am using a small parquet file of only 1.2 MB, but when I check the Data Processed metric I see numbers like 12 MB.

My question is: does the on-demand (serverless) mode of Azure Synapse charge for the compressed data or the uncompressed data?
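
For reference, a test of this kind against a small parquet file could look like the query below; the storage account, container, and file name are placeholders rather than the actual paths used.

```sql
-- Read every column and every row of a small parquet file with the
-- serverless (on-demand) SQL pool; placeholders stand in for real paths.
SELECT *
FROM OPENROWSET(
    BULK 'https://<storageaccount>.dfs.core.windows.net/<container>/small_file.parquet',
    FORMAT = 'PARQUET'
) AS [result];
```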

Azure Synapse Analytics

2 answers

  1. HarithaMaddi-MSFT 10,136 Reputation points
    2020-08-10T11:33:08.577+00:00

    Hi @mim ,

    Thanks for your patience. The product team confirmed that Synapse SQL serverless billing is based on data processed, which is the amount of data handled internally while executing your query. It consists of the data read (compressed data plus metadata reads) and the intermediate results (shuffled data, which is always in uncompressed format). Your query read all columns and all rows, so data processed = compressed (data read + metadata reads) + uncompressed (data shuffled to your endpoint), plus a few extras such as auto-statistics and read-ahead. For aggregated queries, data processed is roughly equivalent to the compressed file size, because the only additions on top of it are metadata reads and the shuffling of the aggregate result (for example, the single value returned by SUM), which add insignificant overhead compared to the actual data read.

    The product team is working on updating the pricing page with a better explanation and samples. Hope this helps!
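
    To illustrate the two cases above, here is a sketch of a full-scan query versus an aggregated query over the same parquet file; the storage path and column name are placeholders, and the last query assumes the sys.dm_external_data_processed DMV is available in the workspace for cross-checking the portal metric.

    ```sql
    -- Full scan: all columns and rows are read and shuffled, so data processed
    -- covers the compressed reads plus the uncompressed intermediate results.
    SELECT *
    FROM OPENROWSET(
        BULK 'https://<storageaccount>.dfs.core.windows.net/<container>/small_file.parquet',
        FORMAT = 'PARQUET'
    ) AS [r];

    -- Aggregation: only the column needed for SUM is read, and the shuffled
    -- intermediate result is a single value, so data processed stays close to
    -- the compressed file size plus metadata reads.
    SELECT SUM([some_numeric_column]) AS total
    FROM OPENROWSET(
        BULK 'https://<storageaccount>.dfs.core.windows.net/<container>/small_file.parquet',
        FORMAT = 'PARQUET'
    ) AS [r];

    -- Aggregated data-processed counters kept by the serverless pool itself,
    -- useful for comparing against the portal metric.
    SELECT *
    FROM sys.dm_external_data_processed;
    ```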


  2. HarithaMaddi-MSFT 10,136 Reputation points
    2020-07-29T11:11:47.023+00:00

    Hi @mim ,

    Thanks for your valuable insights. I reproduced the issue and observed the discrepancy between the "Data Processed" metric in the Azure Portal and the file size used in the Synapse query, as shown in the snapshots below. I have shared the details with the product team; they are investigating the root cause of the amplification in the metric and its impact on billing. However, the product team confirmed that the charge applies only to the compressed data for a parquet file, and they will fix it if, after further investigation, the current product does not behave that way. I will work closely with the product team and get back to you once I hear more.

    Stay tuned!

    [Screenshots: parquet files in the blob storage container and the "Data Processed" metric shown in the Azure Portal]

