How to download all partitions of a parquet file in Python from Azure Data Lake?

Manash 51 Reputation points
2022-10-18T11:06:49.473+00:00

I am able to read a parquet file from Azure Blob Storage that was generated using Python. This file does not have any partition structure.
Example: containername/project/year/month/File.parquet

Code
from io import BytesIO
import pandas as pd

blob_name = f'{file_path}.parquet'
blob_client = container.get_blob_client(blob=blob_name)
stream_downloader = blob_client.download_blob()
stream = BytesIO()
stream_downloader.readinto(stream)  # download into an in-memory buffer
stream.seek(0)  # rewind the buffer before handing it to pandas
file_data = pd.read_parquet(stream, engine='pyarrow')

But if the parquet file is generated by a Spark engine, then the file has partitions in it. I am not able to read this kind of parquet file using the Python module. I tried to look for resources in the Azure Python SDK but was unable to find any.

I found an example from Apache Arrow here, but it is similar to the above example for an unpartitioned parquet file.

Is it possible to download a parquet file that has partitions in it from ADLS using Python?
Is it possible to read it as a stream without downloading the blob?

Azure Data Lake Storage
An Azure service that provides an enterprise-wide hyper-scale repository for big data analytic workloads and is integrated with Azure Blob Storage.

3 answers

  1. Manash 51 Reputation points
    2022-10-28T05:15:45.783+00:00

    Unfortunately, the direct way of reading the parquet file did not work for me, so I had to take one of the following approaches.

    1. Repartition my parquet file into a single partition.
    2. Use the container client to get the list of blobs under the specified path.
    3. Call list_blobs with the prefix "part-", which returns my single-partition parquet file (filtering out everything but the parquet partition).
    4. Read the parquet file from step 3.

    or

    1. Use the container client to get the list of blobs under the specified path.
    2. Call list_blobs with the prefix "part-", which returns one partition at a time (filtering out everything but the parquet partitions).
    3. Read the parquet file from step 2.
    4. Repeat steps 2 and 3, appending the partitions to form a complete dataframe (see the sketch below).
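
    A minimal sketch of the second approach, assuming a ContainerClient and the folder layout from the question; the account URL, container name, and path are placeholders:

    from io import BytesIO

    import pandas as pd
    from azure.identity import DefaultAzureCredential
    from azure.storage.blob import ContainerClient

    container = ContainerClient(
        account_url="https://<account_name>.blob.core.windows.net",  # placeholder
        container_name="<container_name>",  # placeholder
        credential=DefaultAzureCredential(),
    )

    # Step 2: list only the data partitions under the dataset folder.
    prefix = "project/year/month/File.parquet/part-"
    frames = []
    for blob in container.list_blobs(name_starts_with=prefix):
        stream = BytesIO()
        container.download_blob(blob.name).readinto(stream)
        stream.seek(0)  # rewind before handing the buffer to pandas
        frames.append(pd.read_parquet(stream, engine="pyarrow"))  # step 3

    # Step 4: append the partitions to form the complete dataframe.
    df = pd.concat(frames, ignore_index=True)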
    1 person found this answer helpful.

  2. PRADEEPCHEEKATLA-MSFT 85,586 Reputation points Microsoft Employee
    2022-10-19T05:03:26.03+00:00

    Hello @Manash ,

    Thanks for the question and using MS Q&A platform.

    Use pyarrowfs-adlgen2, which is an implementation of a pyarrow filesystem for Azure Data Lake Gen2.

    Note: It allows you to use pyarrow and pandas to read parquet datasets directly from Azure without the need to copy files to local storage first.
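
    A minimal sketch of that approach, with a placeholder account name and dataset path; with an AccountHandler, the first path segment is the container (filesystem) name:

    import azure.identity
    import pyarrow.dataset
    import pyarrow.fs
    import pyarrowfs_adlgen2

    # '<account_name>' is a placeholder for the storage account.
    handler = pyarrowfs_adlgen2.AccountHandler.from_account_name(
        "<account_name>", azure.identity.DefaultAzureCredential()
    )
    fs = pyarrow.fs.PyFileSystem(handler)

    # Point the dataset at the folder of part files and read it in one call;
    # nothing is copied to local storage first.
    ds = pyarrow.dataset.dataset(
        "<container>/project/year/month/File.parquet", filesystem=fs
    )
    df = ds.to_table().to_pandas()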

    Also check out the "Reading a Parquet File from Azure Blob storage" section of the pyarrow document "Reading and Writing the Apache Parquet Format": manually list the blob names with a prefix such as dataset_name using the list_blob_names(container_name, prefix=None, num_results=None, include=None, delimiter=None, marker=None, timeout=None) API of the Azure Storage SDK for Python, then read those blobs one by one, as sketched below.
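
    A rough sketch of that manual listing; the list_blob_names signature above is from the legacy azure-storage SDK (pre-v12 azure-storage-blob, which exposes BlockBlobService), and the account, key, and path values are placeholders:

    from azure.storage.blob import BlockBlobService  # legacy SDK (< v12)

    service = BlockBlobService(account_name="<account>", account_key="<key>")

    # Enumerate only the parquet partition files under the dataset folder.
    names = service.list_blob_names(
        "<container>", prefix="project/year/month/File.parquet/part-"
    )
    for name in names:
        blob = service.get_blob_to_bytes("<container>", name)  # read one by one
        print(name, len(blob.content))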

    Hope this will help. Please let us know if you have any further queries.


  3. Manash 51 Reputation points
    2022-10-19T07:53:40.52+00:00

    For some reason it's not allowing me to post the below as a comment even though it's within the 1600 character limit.

    Thanks for your suggestion @PRADEEPCHEEKATLA-MSFT .
    My file is present at: Container_name/Project_name/Year/Month/File.parquet/
    _SUCCESS
    _committed_1387639002
    _started_1387639002
    part-00000-tid-3006757700744507944-2dececd1-5fa7-429a-9fbe-3b21c1415734-16-1-c000.snappy.parquet

    After importing all libraries,

    handler = pyarrowfs_adlgen2.AccountHandler.from_account_name('acc_name', azure.identity.DefaultAzureCredential())
    fs = pyarrow.fs.PyFileSystem(handler)
    print("Path is {}".format("Container_name/Project_name/Year/Month/File.parquet/"))
    ds = pyarrow.dataset.dataset("Container_name/Project_name/Year/Month/File.parquet/", filesystem=fs)
    data_from_blob = ds.to_table()
    

    Error message
    azure.core.exceptions.HttpResponseError: (InvalidResourceName) The specifed resource name contains invalid characters.
    Code: InvalidResourceName
    Message: The specifed resource name contains invalid characters.

    Then I tried this,

    ds = pyarrow.dataset.dataset("Container_name/Project_name/Year/Month/File.parquet/part-00000-tid-3006757700744507944-2dececd1-5fa7-429a-9fbe-3b21c1415734-16-1-c000.snappy.parquet", filesystem=fs)
    

    Same error message as above

    Then I renamed the partition file:

    ds = pyarrow.dataset.dataset("Container_name/Project_name/Year/Month/File.parquet/renamed.parquet", filesystem=fs)
    # or
    ds = pyarrow.dataset.dataset("Container_name/Project_name/Year/Month/File.parquet/renamed.snappy.parquet", filesystem=fs)
    

    Same error as above.
    Is there anything wrong with what I am doing?