How to read multiple parquet.gzip files incrementally into pandas from Azure blob storage?

Question

Samyak 41

Hi all,

I want to read multiple parquet.gzip files incrementally from my blob storage into a pandas dataframe, manipulate them, and store the results, all using Python. How can this be done effectively?
Note: I tried to read them directly using pd.read_parquet, but I guess it doesn't work that way in Azure.
Could you help me out with a code snippet?

PRADEEPCHEEKATLA 90,641 Reputation points Moderator

2022-06-01T05:39:07.83+00:00
Hello @Samyak ,

Following up to see if the suggestion below was helpful. If you have any further queries, do let us know.

------------------------------

Please don't forget to accept or upvote the answer whenever the information provided helps you.

Answer 1

PRADEEPCHEEKATLA 90,641 Moderator

Hello @Samyak ,

Thanks for the question and for using the MS Q&A platform.

When I tried to read multiple parquet.gzip files using pandas, I got this error message: OSError: Could not open parquet input source '<Buffer>': Invalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
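The error says the reader did not find the Parquet marker (the four bytes PAR1) at the end of the data. One possible cause, which is an assumption since the thread does not confirm it, is that the whole file was wrapped in gzip after it was written. A minimal sketch for checking what a file actually contains is below; sniff_format is a hypothetical helper name, not part of pandas or pyarrow.

```python
GZIP_MAGIC = b"\x1f\x8b"   # first two bytes of any gzip stream
PARQUET_MAGIC = b"PAR1"    # first and last four bytes of a Parquet file


def sniff_format(data: bytes) -> str:
    """Classify raw file bytes as 'gzip', 'parquet', or 'unknown'."""
    if data[:2] == GZIP_MAGIC:
        return "gzip"
    if data[:4] == PARQUET_MAGIC and data[-4:] == PARQUET_MAGIC:
        return "parquet"
    return "unknown"
```

If this reports "parquet", the file is a regular Parquet file (possibly with gzip compression applied internally to its columns) and needs no unzip step. If it reports "gzip", the file has to be decompressed before pandas can read it.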

After a bit of research, I found this document, Azure Databricks - Zip Files, which explains how to unzip the files and then load them directly.

You can invoke the Azure Databricks %sh magic command to unzip the file and then read it using pandas.
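Outside Databricks, the same decompress-then-read idea can be sketched in plain Python. This is not the %sh route described above but a swapped-in alternative, and it assumes the files are gzip-wrapped rather than zip archives, that the azure-storage-blob package is installed, and that pandas has pyarrow or fastparquet available. The function names, the connection string, the container name, and the ".parquet.gzip" suffix filter are all illustrative assumptions.

```python
import gzip
import io
from typing import Callable, Iterable, Iterator, Optional, Tuple


def gunzip_if_needed(raw: bytes) -> bytes:
    """Decompress bytes that start with the gzip magic number, else return them unchanged."""
    return gzip.decompress(raw) if raw[:2] == b"\x1f\x8b" else raw


def iter_blobs(connection_string: str, container: str,
               prefix: str = "") -> Iterator[Tuple[str, bytes]]:
    """Yield (blob name, raw bytes) for each matching blob, one at a time."""
    # Imported lazily so the rest of the sketch works without the Azure SDK.
    from azure.storage.blob import BlobServiceClient

    service = BlobServiceClient.from_connection_string(connection_string)
    client = service.get_container_client(container)
    for blob in client.list_blobs(name_starts_with=prefix or None):
        if blob.name.endswith(".parquet.gzip"):  # assumed naming convention
            yield blob.name, client.download_blob(blob.name).readall()


def iter_frames(blobs: Iterable[Tuple[str, bytes]],
                reader: Optional[Callable] = None) -> Iterator[Tuple[str, object]]:
    """Yield (blob name, dataframe) pairs, holding only one file in memory at a time."""
    if reader is None:
        import pandas as pd  # needs pyarrow or fastparquet as the Parquet engine
        reader = pd.read_parquet
    for name, raw in blobs:
        yield name, reader(io.BytesIO(gunzip_if_needed(raw)))
```

A loop such as `for name, df in iter_frames(iter_blobs(conn_str, "my-container")):` would then process each file in turn, where conn_str and "my-container" are placeholders. Because both functions are generators, each dataframe can be manipulated and written back before the next blob is downloaded.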

Hope this helps. Please let us know if you have any further queries.

Samyak 41 Reputation points

2022-05-12T11:54:15.343+00:00

Hi @PRADEEPCHEEKATLA
Thanks for the response. What I'm actually doing is uploading a Python script along with a parquet dataset to blob storage, then triggering that script from ADF, with the Batch service doing the computation. When I trigger the script and it reads the parquet file, it requires pyarrow. This is where I am stuck and need help. Also, if I need to install other dependencies in the future, I would like to know how that is done.

PRADEEPCHEEKATLA 90,641 Reputation points Moderator

2022-05-17T05:01:16.77+00:00

Hello @Samyak ,

Thanks for sharing additional details.

Could you please share the Python script you are using, along with the error message you are seeing?

PRADEEPCHEEKATLA 90,641 Reputation points Moderator

2022-05-30T05:08:53.043+00:00

Hello @Samyak ,

Just checking in to see if you have had a chance to look at the previous response. We need the following information to understand/investigate this issue further.
