Hello randeep gurjar,
Greetings! Welcome to Microsoft Q&A Platform.
To read the data in Blob storage, you can use the Azure Blob Storage REST API via the Azure Storage SDK for Python together with pyarrow. The section "Reading a Parquet File from Azure Blob storage" in the pyarrow documentation Reading and Writing the Apache Parquet Format describes this approach: list the blob names that share a prefix such as dataset_name using the SDK method list_blob_names(container_name, prefix=None, num_results=None, include=None, delimiter=None, marker=None, timeout=None), then read those blobs one by one.
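A minimal sketch of this flow, assuming the legacy azure-storage SDK's BlockBlobService (which exposes list_blob_names) and using hypothetical account, container, and prefix names:

```python
import io

import pyarrow.parquet as pq
from azure.storage.blob import BlockBlobService  # legacy azure-storage SDK

# Hypothetical account credentials and container/prefix names.
service = BlockBlobService(account_name="myaccount", account_key="<account-key>")
container_name = "my-container"

# List every blob whose name starts with the dataset prefix.
blob_names = service.list_blob_names(container_name, prefix="dataset_name/")

# Read the blobs one by one and parse each Parquet file with pyarrow.
tables = []
for name in blob_names:
    blob = service.get_blob_to_bytes(container_name, name)
    tables.append(pq.read_table(io.BytesIO(blob.content)))
```

If the blobs make up one logical dataset, the resulting tables can then be combined with pyarrow.concat_tables(tables).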
There are a few things you can consider for better performance:
1. If your workloads require low, consistent latency and/or a high number of input/output operations per second (IOPS), consider using a premium block blob storage account. This type of account makes data available via high-performance hardware. Data is stored on solid-state drives (SSDs), which are optimized for low latency and provide higher throughput than traditional hard drives. The storage costs of premium performance are higher, but transaction costs are lower, so if your workloads execute a large number of transactions, a premium block blob account can be economical. https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-best-practices#consider-premium
2. To achieve the best performance, use all available throughput by performing as many reads and writes in parallel as possible (see the sketch after this list).
3. Larger files lead to better performance and reduced costs.
4. File format, file size, and directory structure can all impact performance and cost.
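As an illustration of item 2, below is a minimal sketch that parallelizes the per-blob downloads with a thread pool; the account, container, and prefix names are the same hypothetical ones used in the earlier sketch:

```python
import io
from concurrent.futures import ThreadPoolExecutor

import pyarrow.parquet as pq
from azure.storage.blob import BlockBlobService  # legacy azure-storage SDK

# Hypothetical credentials and names, as in the earlier sketch.
service = BlockBlobService(account_name="myaccount", account_key="<account-key>")
container_name = "my-container"

def read_parquet_blob(name):
    # Download one blob and parse it as a Parquet table.
    blob = service.get_blob_to_bytes(container_name, name)
    return pq.read_table(io.BytesIO(blob.content))

blob_names = list(service.list_blob_names(container_name, prefix="dataset_name/"))

# Download and parse the blobs in parallel to use the available throughput.
with ThreadPoolExecutor(max_workers=8) as pool:
    tables = list(pool.map(read_parquet_blob, blob_names))
```

A thread pool fits here because the work is I/O-bound; tuning max_workers against your account's throughput limits is left to experimentation.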
Refer to the detailed documentation here: https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-best-practices
Hope this answer helps! Please let us know if you have any further queries. I'm happy to assist you further.
Please "Accept the answer" and "up-vote" wherever the information provided helps you, as this can be beneficial to other community members.