Hello @Shlomi Lanton,
You could try any of the suggestions below:
- Use partitioning: Partition your data by month or another relevant column so that each partition holds a subset of the data. Writers then touch fewer files at any given time, which reduces the chance of an inconsistent read.
- Use external tables: Create an external table that references the Parquet files and query through it. The external table adds a layer of abstraction between the data and the query engine, which can reduce the impact of file updates on query execution.
- Use a different storage tier: Consider moving the files to a different access tier (hot or cool). A cooler tier lowers storage cost, but reads and writes become slower and more expensive per operation, so weigh this against your query latency requirements.
```sql
CREATE EXTERNAL TABLE MyExternalTable (
    col1 varchar(100),  -- T-SQL has no 'string' type; use varchar/nvarchar
    col2 int,
    ...
)
WITH (
    LOCATION = 'wasbs://STORAGE_ACCOUNT@CONTAINER/item_id=xxxx/month=yy/partition_version=last/',
    DATA_SOURCE = MyDataSource,
    FILE_FORMAT = MyFileFormat,
    TABLE_OPTIONS = N'{"READ_OPTIONS":["ALLOW_INCONSISTENT_READS"]}'
);
```

Note that `ALLOW_INCONSISTENT_READS` is not a standalone `WITH` option: in Synapse serverless SQL pools it is passed through the `TABLE_OPTIONS` clause as shown, and it lets queries read files that are being modified while the query runs. `LOCATION`, `DATA_SOURCE`, and `FILE_FORMAT` point the table at your files; `DATA_SOURCE` and `FILE_FORMAT` must reference an existing `EXTERNAL DATA SOURCE` and `EXTERNAL FILE FORMAT`. (The `REJECT_TYPE`/`REJECT_VALUE` options belong to Hadoop external tables in dedicated SQL pools and are omitted here, since they do not combine with `TABLE_OPTIONS`.)