Read Parquet directly from ADLS blob storage in a Python Azure Function.

Javier Guerrero 61 Reputation points
2024-02-02T17:17:13.2733333+00:00

Hi, I have a PyArrow table (Parquet file) in an ADLS storage account. From a Python Azure Function, I need to query that Parquet file and return a value. The obvious approach is to download the file and then apply the filter. However, there seems to be a way to read it directly from the storage account using pyarrowfs-adlgen2. The example at https://pypi.org/project/pyarrowfs-adlgen2/ uses 'azure.identity.DefaultAzureCredential()', which I do not have; I would like to use a connection string instead. Is this possible?


Accepted answer
  1. ShaikMaheer-MSFT 38,546 Reputation points Microsoft Employee Moderator
    2024-02-05T11:22:03.1333333+00:00

    Hi Javier Guerrero, Thank you for posting your query on the Microsoft Q&A platform.

    It looks like the pyarrowfs-adlgen2 library cannot take a connection string directly; it uses azure.identity.DefaultAzureCredential, which looks for credentials in several places, including environment variables (see the DefaultAzureCredential documentation for details). Since you want to work with a connection string only, I would suggest using azure.storage.filedatalake.DataLakeFileClient together with pandas. Below is the code.

    import io
    from azure.storage.filedatalake import DataLakeFileClient
    import pandas as pd

    # Replace these values with your own
    account_url = "https://accountName.dfs.core.windows.net"
    file_system_name = "containerName"
    file_path = "folder/sample.parquet"
    credential = "accountkey"  # storage account access key

    # Create a DataLakeFileClient for the specified file
    file_client = DataLakeFileClient(
        account_url=account_url,
        file_system_name=file_system_name,
        file_path=file_path,
        credential=credential,
    )

    # Download the parquet file into memory as a stream of bytes
    stream = file_client.download_file()
    data = stream.readall()

    # Read the parquet bytes into a pandas DataFrame
    df = pd.read_parquet(io.BytesIO(data), engine='pyarrow')
    print(df)

    In the above code, the file is not downloaded to local disk; it is read into an in-memory stream and loaded into a DataFrame from there.
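
    Since your requirement is specifically a connection string: DataLakeFileClient also has a from_connection_string constructor, so the same approach works without handling the account key separately. Below is a minimal sketch of that variant, assuming placeholder values for the connection string, container and path, and a hypothetical 'id'/'value' column pair just to illustrate filtering down to a single result.

    import io
    from azure.storage.filedatalake import DataLakeFileClient
    import pandas as pd

    # Placeholder values - replace with your own. The connection string comes
    # from the storage account's "Access keys" blade.
    connection_string = "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>;EndpointSuffix=core.windows.net"
    file_system_name = "containerName"
    file_path = "folder/sample.parquet"

    # Build the file client from the connection string instead of the account key
    file_client = DataLakeFileClient.from_connection_string(
        connection_string,
        file_system_name=file_system_name,
        file_path=file_path,
    )

    # Download the parquet bytes and load only the columns needed for the query
    # ('id' and 'value' are hypothetical column names)
    data = file_client.download_file().readall()
    df = pd.read_parquet(io.BytesIO(data), engine='pyarrow', columns=['id', 'value'])

    # Filter to the row of interest and return a single value
    result = df.loc[df['id'] == 42, 'value'].iloc[0]
    print(result)

    Note that this still transfers the whole file; the columns argument only limits what pandas decodes. If the files grow large and you need real predicate pushdown, a pyarrow-compatible ADLS filesystem (such as pyarrowfs-adlgen2 or adlfs) would be the way to avoid downloading the entire blob.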
    Hope this helps. Please let me know how it goes.


    Please consider hitting the Accept Answer button. Accepted answers help the community as well.

    2 people found this answer helpful.

0 additional answers
