Cannot read Parquet from Blob Storage in a Python Azure Function?

Marti Kevin 31 Reputation points
2022-07-06T13:59:26.307+00:00

Hey guys, I want to read a small Parquet file from Azure Blob Storage in a Python Azure Function. It could look something like this:

import logging  
from io import BytesIO  
import azure.functions as func  
import pandas as pd  
  
  
def main(req: func.HttpRequest, inputBlob: func.InputStream) -> func.HttpResponse:  
  
    # Read the blob as bytes  
    try:  
        blob_bytes = inputBlob.read()  
        blob_to_read = BytesIO(blob_bytes)  
        df = pd.read_parquet(blob_to_read, engine='pyarrow')  
        logging.info("Length of the parquet file:" + str(len(df.index)))  
      
    except Exception as e:  
        logging.error("Error reading: " + str(e))  
  
    finally:  
        return func.HttpResponse(  
                "finished",  
                status_code=200  
        )  

I always get an error and the data does not load correctly. What am I doing wrong?

Azure Functions
An Azure service that provides an event-driven serverless compute platform.

1 answer

  1. Marti Kevin 31 Reputation points
    2022-07-13T06:07:38.19+00:00

    I did, and I got an error message saying that the data was invalid. I think it has something to do with the encoding of the func.InputStream bytes.
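
One way to check that suspicion, as a rough sketch: an unencrypted Parquet file starts and ends with the 4-byte marker `PAR1`, so the raw bytes can be tested before they ever reach pandas. The helper name `looks_like_parquet` is made up for this illustration and is not part of any library.

```python
PARQUET_MAGIC = b"PAR1"  # marker at both ends of an unencrypted Parquet file


def looks_like_parquet(data: bytes) -> bool:
    """Return True if data starts and ends with the Parquet magic bytes."""
    # Smallest possible file: header magic + 4-byte footer length + footer magic.
    if len(data) < 12:
        return False
    return data[:4] == PARQUET_MAGIC and data[-4:] == PARQUET_MAGIC
```

Logging `looks_like_parquet(inputBlob.read())` inside the function would show whether the binding delivers the file unchanged. A `False` result would suggest the bytes were altered or truncated before pandas saw them.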

    I solved it by using the BlobServiceClient class and dropping all the bindings.

    import os  
    from azure.storage.blob import BlobServiceClient  
    import pandas as pd  
    from io import BytesIO  
    import logging  
    import azure.functions as func  
      
    def read_parquet_from_blob_to_pandas_df(connection_str, container, blob_path):  
        blob_service_client = BlobServiceClient.from_connection_string(connection_str)  
        blob_client = blob_service_client.get_blob_client(container = container, blob = blob_path)  
        stream_downloader = blob_client.download_blob()  
        stream = BytesIO()  
        stream_downloader.readinto(stream)  
        df = pd.read_parquet(stream, engine = 'pyarrow')  
      
        return df  
      
      
      
    def main(req: func.HttpRequest) -> func.HttpResponse:  
      
        # Name of the app setting that holds the storage connection string
        connect_str = os.environ['<your_storage_connection_string>']  
        container = '<your_container>'  
        blob_path = '<your_blob_path>'  
      
      
        df = read_parquet_from_blob_to_pandas_df(connect_str, container, blob_path)  
        logging.info(str(df.head()))  
      
      
        return func.HttpResponse("This approach works!")  
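
A small variation on the download step, offered only as a sketch: after `readinto` the buffer's position sits at the end of the data, so rewinding it with `seek(0)` before handing it to any reader is a safe habit. `download_to_buffer` is a hypothetical name, and the `downloader` argument stands for whatever `blob_client.download_blob()` returns.

```python
from io import BytesIO


def download_to_buffer(downloader) -> BytesIO:
    """Copy a blob download into memory and rewind the buffer."""
    buffer = BytesIO()
    downloader.readinto(buffer)  # writes the blob's bytes; position ends up at EOF
    buffer.seek(0)               # rewind so the next read starts at byte 0
    return buffer
```

With this in place, the helper could call `pd.read_parquet(download_to_buffer(blob_client.download_blob()), engine='pyarrow')`.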
    
    2 people found this answer helpful.
