Slow performance reading Parquet files from Azure Data Lake Gen2
I'm using Azure Data Lake Gen2 to store data in Parquet format, partitioned by year, month and day so that reads can benefit from partition filtering (Hive-style layout, sketched after the query below). I'm reading the data directly from Python (no Spark or Synapse involved) using the pyarrowfs_adlgen2 library, as suggested in other Q&As on this forum. However, performance is much worse than what I get locally (storing and reading the data from my local file system). The following query takes 0.2 s locally vs. 11.7 s against Azure Data Lake Gen2:
import azure.identity
import pandas as pd
import pyarrow.fs
import pyarrowfs_adlgen2
# account_name holds the storage account name (defined earlier)
handler = pyarrowfs_adlgen2.AccountHandler.from_account_name(
    account_name, azure.identity.DefaultAzureCredential()
)
fs = pyarrow.fs.PyFileSystem(handler)
# Partition filters: only year=2020, month=1, days 9-11 should be read
filters = [('year', '=', 2020), ('month', '=', 1), ('day', 'in', (9, 10, 11))]
df = pd.read_parquet('container/data', filesystem=fs, engine='pyarrow', dtype_backend='pyarrow', filters=filters)
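The directory layout on the container is the usual Hive-style partitioning (file names illustrative):

container/data/
    year=2020/
        month=1/
            day=9/
                part-0.parquet
            day=10/
                part-0.parquet
            ...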
Is there something I'm missing? How can I improve the performance when reading from Azure Storage?
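For reference, the local baseline is essentially the same call against a copy of the dataset on my local file system; the path and timing code below are just a sketch of how I measured:

import time

import pandas as pd

filters = [('year', '=', 2020), ('month', '=', 1), ('day', 'in', (9, 10, 11))]

start = time.perf_counter()
# Same query, but reading the local copy of the partitioned dataset
df = pd.read_parquet('data', engine='pyarrow', dtype_backend='pyarrow', filters=filters)
print(f'local read took {time.perf_counter() - start:.1f}s')  # ~0.2s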