cannot read parquet from blob storage in python azure function?

Question

Marti Kevin 31

Hey guys, I want to read a small Parquet file from Azure Blob Storage in a Python Azure Function. It could look something like this:

import logging
from io import BytesIO
import azure.functions as func
import pandas as pd


def main(req: func.HttpRequest, inputBlob: func.InputStream) -> func.HttpResponse:

    # Read the blob as bytes
    try:
        blob_bytes = inputBlob.read()
        blob_to_read = BytesIO(blob_bytes)
        df = pd.read_parquet(blob_to_read, engine='pyarrow')
        logging.info("Length of the parquet file:" + str(len(df.index)))

    except Exception as e:
        logging.error("Error reading" + str(e))

    finally:
        return func.HttpResponse(
            "finished",
            status_code=200
        )

I always get an error and the data fails to load correctly. What am I doing wrong?

Carlos Solís Salazar 18,191 Reputation points MVP Volunteer Moderator

2022-07-12T00:29:55.167+00:00

Thank you for asking this question on the **Microsoft Q&A Platform**.

You have not received answers or comments to your question because it may be ambiguous or confusing.

I recommend you visit "How to write a quality question" and verify that your question meets some of the recommendations.

Hope this helps,
Carlos Solís Salazar

----------

NOTE: To answer you as quickly as possible, please mention me in your reply.

MughundhanRaveendran-MSFT 12,506 Reputation points

2022-07-13T04:41:05.003+00:00

@Marti Kevin, have you installed pyarrow?
pip install pyarrow

Also, are you getting any data in blob_to_read? Please check whether it has a value or is null.
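
The two checks suggested above could be sketched roughly like this, assuming the blob has already been read into a bytes object (blob_bytes in the question). The sketch relies on Parquet files normally beginning and ending with the 4-byte magic number PAR1. The helper names pyarrow_available and describe_parquet_bytes are hypothetical, not part of any Azure or pandas API.

```python
import importlib.util

PARQUET_MAGIC = b"PAR1"  # Parquet files normally start and end with these 4 bytes


def pyarrow_available():
    """Return True if the pyarrow package can be found in this environment."""
    return importlib.util.find_spec("pyarrow") is not None


def describe_parquet_bytes(data):
    """Return a short diagnosis of whether `data` looks like a Parquet file."""
    if data is None:
        return "no data: the blob was read as None"
    if len(data) == 0:
        return "no data: the blob is empty"
    if len(data) < 12:  # header magic (4) + footer length (4) + footer magic (4)
        return "too short to be a Parquet file"
    if not data.startswith(PARQUET_MAGIC):
        return "missing PAR1 header: bytes may be altered or not Parquet"
    if not data.endswith(PARQUET_MAGIC):
        return "missing PAR1 footer: file may be truncated"
    return "looks like Parquet"
```

Logging the result of describe_parquet_bytes(blob_bytes) before calling pd.read_parquet would show whether the function received the file intact or whether the bytes were changed on the way in.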

Answer 1

I did. I got an error message saying that the data was invalid. I think it has something to do with the encoding of the func.InputStream bytes.

I solved it by using the BlobServiceClient class and dropping all the bindings.

import logging
import os
from io import BytesIO

import azure.functions as func
import pandas as pd
from azure.storage.blob import BlobServiceClient


def read_parquet_from_blob_to_pandas_df(connection_str, container, blob_path):
    # Connect to the storage account and point at the target blob
    blob_service_client = BlobServiceClient.from_connection_string(connection_str)
    blob_client = blob_service_client.get_blob_client(container=container, blob=blob_path)

    # Download the blob into an in-memory buffer
    stream_downloader = blob_client.download_blob()
    stream = BytesIO()
    stream_downloader.readinto(stream)
    stream.seek(0)  # rewind so the buffer is read from the start

    df = pd.read_parquet(stream, engine='pyarrow')

    return df


def main(req: func.HttpRequest) -> func.HttpResponse:

    # Name of the app setting that holds the storage connection string
    connect_str = os.environ['<your_storage_connection_string>']
    container = '<your_container>'
    blob_path = '<your_blob_path>'

    df = read_parquet_from_blob_to_pandas_df(connect_str, container, blob_path)
    logging.info(str(df.head()))

    return func.HttpResponse("This approach works!")
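
One detail of the BytesIO pattern above, shown as a stdlib-only sketch: after data is written into a buffer (which is what readinto does to stream), the buffer's cursor sits at the end. As far as I know, Parquet readers seek to the footer on their own, so pd.read_parquet is probably unaffected, but a plain read() returns nothing until the buffer is rewound. fake_download is a hypothetical stand-in for the blob download, not an Azure API, and the payload is made-up example bytes.

```python
from io import BytesIO


def fake_download(stream, payload):
    """Hypothetical stand-in for a downloader that writes bytes into `stream`."""
    stream.write(payload)


stream = BytesIO()
fake_download(stream, b"PAR1...example bytes...PAR1")

at_end = stream.read()   # cursor is at the end of the buffer, so nothing is read
stream.seek(0)           # rewind to the start of the buffer
rewound = stream.read()  # the full payload is available again
```

Calling stream.seek(0) after the download is cheap, so it seems a reasonable habit whenever the buffer is handed to another reader.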

MughundhanRaveendran-MSFT 12,506 Reputation points

2022-07-14T05:11:13.89+00:00

@Marti Kevin, thanks for the update.
