Error while reading Parquet files that are constantly changing from external storage using Synapse Analytics

Shlomi Lanton 76

Hello,

I have a dataset stored in blob storage as Parquet files. Each file has ~600 columns, and in total I have ~500K files.

The files hierarchy is:

STORAGE_ACCOUNT/CONTAINER/item_id=xxxx/month=yy/partition_version=last/part-0.parquet

A service runs continuously and updates the dataset by rewriting the relevant month file. Most changes are in the current month's file, so when I try to query data from the current month I get the following error:

Error handling external file: 'IO request completed with an error. ERROR = 0x0000000C'. File/External table name: '/item_id=xxxx/month=6/partition_version=last/part-0.parquet'.

If I limit my query to not read the current month I can get results.

I saw the ALLOW_INCONSISTENT_READS option (link), but as I understand it, this is only available for datasets based on CSV.
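
For reference, this is roughly how the CSV-only option mentioned above is passed to OPENROWSET in a serverless SQL pool. It is a minimal sketch: the data source name MyDataSource and the path csv-folder/*.csv are hypothetical placeholders, and the option is documented for CSV files, not Parquet.

```sql
-- Sketch only: MyDataSource and the BULK path are hypothetical placeholders.
-- READ_OPTIONS / ALLOW_INCONSISTENT_READS is documented for CSV files only.
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'csv-folder/*.csv',
    DATA_SOURCE = 'MyDataSource',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE,
    ROWSET_OPTIONS = '{"READ_OPTIONS":["ALLOW_INCONSISTENT_READS"]}'
) AS r;
```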

Is there any way to prevent this error?

Thanks

Answer 1

QuantumCache 20,366 Moderator

Hello @Shlomi Lanton,

You could try any of the suggestions below.

Use partitioning: You can partition your data based on the month or any other relevant column, so that each partition contains a subset of the data. This can help reduce the number of files that are updated at any given time, and therefore reduce the probability of inconsistent reads.
Use external tables: You can create an external table that references the Parquet files, and use the external table to query the data. External tables provide a layer of abstraction between the data and the query engine, and can help reduce the impact of file updates on query execution.
Use a different storage tier: You can consider using a different storage tier, such as hot or cool, instead of the default tier. This can help reduce the cost of storage, but may also impact the performance of read and write operations.

CREATE EXTERNAL TABLE MyExternalTable (
  col1 VARCHAR(100),
  col2 INT,
  ...
)
WITH (
  -- Path is relative to the root of MyDataSource
  LOCATION = 'item_id=xxxx/month=yy/partition_version=last/',
  DATA_SOURCE = MyDataSource,
  FILE_FORMAT = MyFileFormat,
  -- Serverless SQL pool syntax; documented for CSV files only
  TABLE_OPTIONS = N'{"READ_OPTIONS":["ALLOW_INCONSISTENT_READS"]}'
);

The "ALLOW_INCONSISTENT_READS" option allows queries to read data from files that are currently being written to. The statement also specifies the location of the files, the data source, and the file format.

Shlomi Lanton 76 Reputation points

2023-06-27T14:25:34.67+00:00
Thanks for the reply. I'm not sure what you mean by storage tier; we are using "Premium" performance. As for your other suggestions, we are already partitioning by month.

To read the data we use an external file format, an external data source, and a view.

Are you saying that we can use an external table instead of the view for better results?

Is there a flag that we can use (like "ALLOW_INCONSISTENT_READS" in the CSV case) to keep the query from failing in this case?

Thanks

QuantumCache 20,366 Reputation points Moderator

2023-07-05T22:06:10.92+00:00

Hello @Shlomi Lanton, sorry for the delayed response.

Yes, you may try creating an external table in this scenario.
Using an external table can provide better performance, scalability, and security compared to querying data directly from Parquet files in blob storage. However, it does require some additional setup and configuration, so you should evaluate whether the benefits outweigh the costs for your specific use case.

Shlomi Lanton 76 Reputation points

2023-07-06T13:13:14.8233333+00:00

Thanks for the reply. Looking at https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/create-use-external-tables I see the note below. Does that mean that if I use an external table, my query may not fail, but I will lose the ability to leverage the partitions in the data?

The table is created on partitioned folder structure, but you cannot leverage some partition elimination. If you want to get better performance by skipping the files that do not satisfy some criterion (like specific year or month in this case), use views on external data.
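
The "views on external data" approach in the quoted note can be sketched as follows for the folder layout in the question. This is an illustration under assumptions, not a confirmed fix for the 0x0000000C error: the view name dbo.ItemsByMonth is hypothetical, MyDataSource is the data source name used in the answer above, and the month value '5' is just an example. The filepath(n) function returns the part of the path matched by the n-th wildcard, so filtering on it lets the serverless SQL pool skip folders such as the constantly rewritten current month.

```sql
-- Sketch only: dbo.ItemsByMonth is a hypothetical name; MyDataSource is
-- assumed to point at the container root.
CREATE VIEW dbo.ItemsByMonth AS
SELECT
    r.*,
    r.filepath(1) AS item_id,   -- value matched by the first wildcard
    r.filepath(2) AS [month]    -- value matched by the second wildcard
FROM OPENROWSET(
    BULK 'item_id=*/month=*/partition_version=last/*.parquet',
    DATA_SOURCE = 'MyDataSource',
    FORMAT = 'PARQUET'
) AS r;
GO

-- Read only a month that is no longer being rewritten (example value).
SELECT COUNT(*) AS row_count
FROM dbo.ItemsByMonth
WHERE [month] = '5';
```

The partition values come back as strings taken from the folder names, which is why the filter compares against '5' rather than the integer 5.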
