Slow performance reading parquet files from Data Lake Gen2

Jorge Lopez
2024-02-15T14:36:23.51+00:00

I'm using Azure Data Lake Storage Gen2 to store data in Parquet format. I have partitioned the data by year, month, and day to benefit from partition filtering. I'm reading the data directly from Python (no Spark or Synapse involved) using the pyarrowfs_adlgen2 library, as suggested in other Q&As on this forum. However, performance is much worse than what I get locally (storing and reading the data on my local file system). The following query takes 0.2 s locally vs. 11.7 s against Azure Data Lake Gen2:

import azure.identity
import pandas as pd
import pyarrow.fs
import pyarrowfs_adlgen2

# Expose the ADLS Gen2 account as a pyarrow filesystem
# (account_name holds the storage account name)
handler = pyarrowfs_adlgen2.AccountHandler.from_account_name(
    account_name, azure.identity.DefaultAzureCredential())
fs = pyarrow.fs.PyFileSystem(handler)

# Partition filters: only year=2020/month=1, days 9-11 should be read
filters = [('year', '=', 2020), ('month', '=', 1), ('day', 'in', (9, 10, 11))]
df = pd.read_parquet('container/data', filesystem=fs, engine='pyarrow',
                     dtype_backend='pyarrow', filters=filters)
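
For reference, the dataset uses hive-style partition directories, i.e. paths like container/data/year=2020/month=1/day=9/<file>.parquet. As a sketch of the kind of writer call that produces that layout (toy table, illustrative only; the real data has more columns and many rows per partition):

import pyarrow as pa
import pyarrow.parquet as pq

# Illustrative only: write_to_dataset creates one directory level per
# partition column, e.g. container/data/year=2020/month=1/day=9/
table = pa.table({
    'year': [2020, 2020, 2020],
    'month': [1, 1, 1],
    'day': [9, 10, 11],
    'value': [1.0, 2.0, 3.0],
})
pq.write_to_dataset(table, 'container/data',
                    partition_cols=['year', 'month', 'day'], filesystem=fs)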

Is there something I'm missing? How can I improve the read performance against Azure Storage?
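
For anyone wanting to reproduce or narrow this down, here is a minimal sketch that separates dataset discovery (listing the partition directories) from the actual read, using pyarrow.dataset with the same fs handler as above; the expression mirrors the filters list:

import time

import pyarrow.compute as pc
import pyarrow.dataset as ds

t0 = time.perf_counter()
# Discovery: walks container/data and parses the hive partition paths
dataset = ds.dataset('container/data', filesystem=fs, partitioning='hive')
t1 = time.perf_counter()
# Read: should only touch fragments under year=2020/month=1/day={9,10,11}
expr = ((pc.field('year') == 2020) & (pc.field('month') == 1)
        & pc.field('day').isin([9, 10, 11]))
table = dataset.to_table(filter=expr)
t2 = time.perf_counter()
print(f'discovery: {t1 - t0:.1f}s, read: {t2 - t1:.1f}s, rows: {table.num_rows}')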

Azure Data Lake Storage
