Error while reading parquet files that are constantly changing from extrenal storage using Synapse Analytics

Shlomi Lanton 76 Reputation points
2023-06-22T14:27:01.55+00:00

Hello,

I have a dataset stored in blob storage in parquet files, each file has ~600 columns and in total I have ~500K files.

The files hierarchy is:

STORAGE_ACCOUNT/CONTAINER/item_id=xxxx/month=yy/partition_version=last/part-0.parquet

At any moment a service is running that is updating the dataset (re-writing the relevant month file), most changes are in the current month file, as a result when I try to query data from the current month I'm getting the following error:

Error handling external file: 'IO request completed with an error. ERROR = 0x0000000C'. File/External table name: '/item_id=xxxx/month=6/partition_version=last/part-0.parquet'.

If I limit my query to not read the current month I can get results.

I saw the ALLOW_INCONSISTENT_READS option (link) but as I understand this is only available for datasets based on CSV.

Is there any way to prevent this error?

Thanks

Azure Synapse Analytics
Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.
5,378 questions
0 comments No comments
{count} votes

1 answer

Sort by: Most helpful
  1. QuantumCache 20,366 Reputation points Moderator
    2023-06-23T16:24:22.6733333+00:00

    Hello @Shlomi Lanton,

    You could use any of the below suggestions!

    1. Use partitioning: You can partition your data based on the month or any other relevant column, so that each partition contains a subset of the data. This can help reduce the number of files that are updated at any given time, and therefore reduce the probability of inconsistent reads.
    2. Use external tables: You can create an external table that references the Parquet files, and use the external table to query the data. External tables provide a layer of abstraction between the data and the query engine, and can help reduce the impact of file updates on query execution.
    3. Use a different storage tier: You can consider using a different storage tier, such as hot or cool, instead of the default tier. This can help reduce the cost of storage, but may also impact the performance of read and write operations.
    CREATE EXTERNAL TABLE MyExternalTable (
      col1 string,
      col2 int,
      ...
    )
    WITH (
      LOCATION = 'wasbs://STORAGE_ACCOUNT@CONTAINER/item_id=xxxx/month=yy/partition_version=last/',
      DATA_SOURCE = MyDataSource,
      FILE_FORMAT = MyFileFormat,
      REJECT_TYPE = VALUE,
      REJECT_VALUE = 0,
      ALLOW_INCONSISTENT_READS = ON
    );
    

    "ALLOW_INCONSISTENT_READS" option is set to "ON" to allow queries to read data from files that are currently being written to. You can also specify other options like the location of the files, the data source, and the file format.


Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.