How to read multiple parquet.gzip files incrementally into pandas from Azure blob storage?

Samyak 41 Reputation points
2022-05-11T07:08:28.69+00:00

Hi all,

I want to read multiple parquet.gzip files incrementally from my blob storage into a pandas DataFrame, manipulate them, and store the results, all using Python. How can this be done efficiently?
Note: I tried reading them directly with pd.read_parquet, but I guess it doesn't work that way in Azure.
Could you help me out with a code snippet?

Azure Blob Storage
An Azure service that stores unstructured data in the cloud as blobs.
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

1 answer

  1. PRADEEPCHEEKATLA 90,641 Reputation points Moderator
    2022-05-12T08:01:46.747+00:00

    Hello @Samyak,

    Thanks for the question and for using the Microsoft Q&A platform.

    When I tried to read multiple parquet.gzip files using pandas, I got this error message: OSError: Could not open parquet input source '<Buffer>': Invalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

    201279-image.png
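
One possible cause of this error, assuming the files were gzipped as a whole after being written, is that the reader sees a gzip stream instead of the `PAR1` marker that an unencrypted Parquet file carries in its first and last four bytes. (A file written with `df.to_parquet(..., compression='gzip')` is different: it is an ordinary Parquet file with compressed pages and needs no unzipping.) The helper below is a minimal sketch of that check; the name `to_parquet_bytes` is hypothetical and not part of pandas or pyarrow.

```python
import gzip

PARQUET_MAGIC = b"PAR1"   # first and last 4 bytes of an unencrypted Parquet file
GZIP_MAGIC = b"\x1f\x8b"  # first 2 bytes of a gzip stream


def to_parquet_bytes(raw: bytes) -> bytes:
    """Return bytes that pass the Parquet magic-byte check.

    If the payload is a gzip stream, it is decompressed first.
    Raises ValueError when the result still does not look like Parquet.
    """
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    if raw[:4] != PARQUET_MAGIC or raw[-4:] != PARQUET_MAGIC:
        raise ValueError("not a Parquet file: magic bytes missing")
    return raw
```

The returned bytes can then be wrapped in `io.BytesIO` and handed to `pd.read_parquet`.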

    After a bit of research, I found this document, Azure Databricks - Zip Files, which explains how to unzip the files and then load them directly.

    You can use the Azure Databricks %sh magic command to unzip the file, and then read it with pandas, as shown below:

    201289-image.png
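
Outside Databricks, the incremental part of the question could be approached with the `azure-storage-blob` package rather than `%sh`. The sketch below is one possible shape, not the method from the screenshot: `iter_parquet_frames` and its parameters are hypothetical names, the `.parquet.gzip` suffix is taken from the question, and the default reader assumes pandas with pyarrow or fastparquet installed. It downloads one blob at a time, so only a single file is held in memory per iteration.

```python
import io


def iter_parquet_frames(container, prefix="", suffix=".parquet.gzip", reader=None):
    """Yield (blob_name, frame) pairs, one blob at a time.

    container: an azure.storage.blob.ContainerClient, or any object that
    exposes list_blobs(name_starts_with=...) and download_blob(name) alike.
    reader: callable taking a file-like object; defaults to pandas.read_parquet.
    """
    if reader is None:
        import pandas as pd  # imported lazily so the helper itself has no hard dependency
        reader = pd.read_parquet
    for blob in container.list_blobs(name_starts_with=prefix):
        if not blob.name.endswith(suffix):
            continue  # skip blobs that are not the files we want
        data = container.download_blob(blob.name).readall()
        yield blob.name, reader(io.BytesIO(data))


# Usage sketch (placeholders are assumptions, fill in your own values):
# from azure.storage.blob import ContainerClient
# container = ContainerClient.from_connection_string("<connection-string>", "<container>")
# for name, df in iter_parquet_frames(container, prefix="<folder>/"):
#     ...  # manipulate df, then write it back, e.g. with container.upload_blob
```

If the blobs turn out to be gzip-wrapped as a whole, the downloaded bytes would need to be decompressed before they are passed to the reader.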

    Hope this helps. Please let us know if you have any further queries.

