Read Parquet directly from ADLS blob storage in a Python Azure Function.

Javier Guerrero 61 Reputation points
2024-02-02T17:17:13.2733333+00:00

Hi, I have a PyArrow table (Parquet file) in an ADLS storage account. From a Python Azure Function, I need to query that Parquet file and return a value. The obvious approach is to download the file and then apply the filter. However, there seems to be a way to read it directly from the storage account using pyarrowfs-adlgen2. The example at https://pypi.org/project/pyarrowfs-adlgen2/ uses 'azure.identity.DefaultAzureCredential()', which I do not have; I would like to use a connection string instead. Is this possible?


Accepted answer
  1. ShaikMaheer-MSFT 38,546 Reputation points Microsoft Employee Moderator
    2024-02-05T11:22:03.1333333+00:00

    Hi Javier Guerrero, Thank you for posting your query on the Microsoft Q&A platform.

    It looks like the pyarrowfs-adlgen2 library cannot take a connection string directly; it uses azure.identity.DefaultAzureCredential, which looks for credentials in several places, including environment variables (see the DefaultAzureCredential documentation for details). Since you want to work with a connection string only, I would suggest using azure.storage.filedatalake.DataLakeFileClient together with pandas. Below is the code.

    import io
    from azure.storage.filedatalake import DataLakeFileClient
    import pandas as pd

    # Replace these values with your own
    account_url = "https://accountName.dfs.core.windows.net"
    file_system_name = "containerName"
    file_path = "folder/sample.parquet"
    credential = "accountkey"  # storage account access key

    # Create a DataLakeFileClient for the specified file
    file_client = DataLakeFileClient(
        account_url=account_url,
        file_system_name=file_system_name,
        file_path=file_path,
        credential=credential,
    )

    # Download the parquet file into memory as a stream of bytes
    stream = file_client.download_file()
    data = stream.readall()

    # Read the parquet bytes into a pandas DataFrame
    df = pd.read_parquet(io.BytesIO(data), engine='pyarrow')
    print(df)

    In the above code, the file is not downloaded to local disk; it is read into an in-memory stream and loaded into a DataFrame from there.
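
    Since your requirement is specifically a connection string: DataLakeFileClient also has a from_connection_string constructor, so the same approach works without handling the account key separately. Below is a minimal sketch of that variant, assuming placeholder values for the connection string, container and path, and a hypothetical 'id'/'value' column pair just to illustrate filtering down to a single result.

    import io
    from azure.storage.filedatalake import DataLakeFileClient
    import pandas as pd

    # Placeholder values - replace with your own. The connection string comes
    # from the storage account's "Access keys" blade.
    connection_string = "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>;EndpointSuffix=core.windows.net"
    file_system_name = "containerName"
    file_path = "folder/sample.parquet"

    # Build the file client from the connection string instead of the account key
    file_client = DataLakeFileClient.from_connection_string(
        connection_string,
        file_system_name=file_system_name,
        file_path=file_path,
    )

    # Download the parquet bytes and load only the columns needed for the query
    # ('id' and 'value' are hypothetical column names)
    data = file_client.download_file().readall()
    df = pd.read_parquet(io.BytesIO(data), engine='pyarrow', columns=['id', 'value'])

    # Filter to the row of interest and return a single value
    result = df.loc[df['id'] == 42, 'value'].iloc[0]
    print(result)

    Note that this still transfers the whole file; the columns argument only limits what pandas decodes. If the files grow large and you need real predicate pushdown, a pyarrow-compatible ADLS filesystem (such as pyarrowfs-adlgen2 or adlfs) would be the way to avoid downloading the entire blob.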
    Hope this helps. Please let me know how it goes.


    Please consider hitting the Accept Answer button. Accepted answers help the community as well.

    2 people found this answer helpful.

0 additional answers
