How to download all partitions of a parquet file in Python from Azure Data Lake?

Manash 51 Reputation points
2022-10-18T11:06:49.473+00:00

I am able to read a parquet file from Azure Blob Storage that was generated using Python. This file does not have any partition structure.
Example: containername/project/year/month/File.parquet

Code
from io import BytesIO
import pandas as pd

blob_name = f'{file_path}.parquet'
blob_client = container.get_blob_client(blob=blob_name)
stream_downloader = blob_client.download_blob()
stream = BytesIO()
stream_downloader.readinto(stream)  # download into an in-memory buffer
stream.seek(0)  # rewind the buffer before handing it to pandas
file_data = pd.read_parquet(stream, engine='pyarrow')

But if the parquet file is generated by a Spark engine, then the file has partitions in it. I am not able to read this kind of parquet file using the Python module. I tried to look for resources in the Azure Python SDK but was unable to find any.

I found an example from Apache Arrow here, but it is similar to the above example for an unpartitioned parquet file.

Is it possible to download a parquet file that has partitions in it from ADLS using Python?
Is it possible to read it as a stream without downloading the blob?

Azure Data Lake Storage
An Azure service that provides an enterprise-wide hyper-scale repository for big data analytic workloads and is integrated with Azure Blob Storage.

3 answers

  1. Manash 51 Reputation points
    2022-10-28T05:15:45.783+00:00

    Unfortunately, the direct way of reading the parquet file did not work for me, so I had to take one of the following approaches.

    1. Repartition my parquet file into a single partition.
    2. Use the container client to get the list of blobs under the specified path.
    3. Call list_blobs with the prefix "part-", which returns my single-partition parquet file (filtering out everything but the parquet partition).
    4. Read the parquet file from step 3.

    or

    1. Use the container client to get the list of blobs under the specified path.
    2. Call list_blobs with the prefix "part-", which returns one partition at a time (filtering out everything but the parquet partitions).
    3. Read the parquet file from step 2.
    4. Repeat steps 2 and 3, appending the partitions to form a complete dataframe (see the sketch below).
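
    A minimal sketch of the second approach, assuming a ContainerClient and the folder layout from the question; the account URL, container name, and path are placeholders:

    from io import BytesIO

    import pandas as pd
    from azure.identity import DefaultAzureCredential
    from azure.storage.blob import ContainerClient

    container = ContainerClient(
        account_url="https://<account_name>.blob.core.windows.net",  # placeholder
        container_name="<container_name>",  # placeholder
        credential=DefaultAzureCredential(),
    )

    # Step 2: list only the data partitions under the dataset folder.
    prefix = "project/year/month/File.parquet/part-"
    frames = []
    for blob in container.list_blobs(name_starts_with=prefix):
        stream = BytesIO()
        container.download_blob(blob.name).readinto(stream)
        stream.seek(0)  # rewind before handing the buffer to pandas
        frames.append(pd.read_parquet(stream, engine="pyarrow"))  # step 3

    # Step 4: append the partitions to form the complete dataframe.
    df = pd.concat(frames, ignore_index=True)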
    1 person found this answer helpful.

  2. PRADEEPCHEEKATLA-MSFT 85,586 Reputation points Microsoft Employee
    2022-10-19T05:03:26.03+00:00

    Hello @Manash ,

    Thanks for the question and using MS Q&A platform.

    Use pyarrowfs-adlgen2, which is an implementation of a pyarrow filesystem for Azure Data Lake Gen2.

    Note: It allows you to use pyarrow and pandas to read parquet datasets directly from Azure without the need to copy files to local storage first.
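
    A minimal sketch of that approach, with a placeholder account name and dataset path; with an AccountHandler, the first path segment is the container (filesystem) name:

    import azure.identity
    import pyarrow.dataset
    import pyarrow.fs
    import pyarrowfs_adlgen2

    # '<account_name>' is a placeholder for the storage account.
    handler = pyarrowfs_adlgen2.AccountHandler.from_account_name(
        "<account_name>", azure.identity.DefaultAzureCredential()
    )
    fs = pyarrow.fs.PyFileSystem(handler)

    # Point the dataset at the folder of part files and read it in one call;
    # nothing is copied to local storage first.
    ds = pyarrow.dataset.dataset(
        "<container>/project/year/month/File.parquet", filesystem=fs
    )
    df = ds.to_table().to_pandas()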

    Also check out the "Reading a Parquet File from Azure Blob storage" section of the pyarrow document "Reading and Writing the Apache Parquet Format": manually list the blob names with a prefix such as dataset_name using the list_blob_names(container_name, prefix=None, num_results=None, include=None, delimiter=None, marker=None, timeout=None) API of the Azure Storage SDK for Python, then read those blobs one by one, as sketched below.
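
    A rough sketch of that manual listing; the list_blob_names signature above is from the legacy azure-storage SDK (pre-v12 azure-storage-blob, which exposes BlockBlobService), and the account, key, and path values are placeholders:

    from azure.storage.blob import BlockBlobService  # legacy SDK (< v12)

    service = BlockBlobService(account_name="<account>", account_key="<key>")

    # Enumerate only the parquet partition files under the dataset folder.
    names = service.list_blob_names(
        "<container>", prefix="project/year/month/File.parquet/part-"
    )
    for name in names:
        blob = service.get_blob_to_bytes("<container>", name)  # read one by one
        print(name, len(blob.content))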

    Hope this will help. Please let us know if you have any further queries.


  3. Manash 51 Reputation points
    2022-10-19T07:53:40.52+00:00

    For some reason it's not allowing me to post the below as a comment even though it's within the 1600 character limit.

    Thanks for your suggestion @PRADEEPCHEEKATLA-MSFT .
    My file is present at: Container_name/Project_name/Year/Month/File.parquet/
    _SUCCESS
    _committed_1387639002
    _started_1387639002
    part-00000-tid-3006757700744507944-2dececd1-5fa7-429a-9fbe-3b21c1415734-16-1-c000.snappy.parquet

    After importing all libraries,

    handler = pyarrowfs_adlgen2.AccountHandler.from_account_name('acc_name', azure.identity.DefaultAzureCredential())
    fs = pyarrow.fs.PyFileSystem(handler)
    print("Path is {}".format("Container_name/Project_name/Year/Month/File.parquet/"))
    ds = pyarrow.dataset.dataset("Container_name/Project_name/Year/Month/File.parquet/", filesystem=fs)
    data_from_blob = ds.to_table()
    

    Error message
    azure.core.exceptions.HttpResponseError: (InvalidResourceName) The specifed resource name contains invalid characters.
    Code: InvalidResourceName
    Message: The specifed resource name contains invalid characters.

    Then I tried this,

    ds = pyarrow.dataset.dataset("Container_name/Project_name/Year/Month/File.parquet/part-00000-tid-3006757700744507944-2dececd1-5fa7-429a-9fbe-3b21c1415734-16-1-c000.snappy.parquet", filesystem=fs)
    

    Same error message as above

    Then I renamed the partition file:

    ds = pyarrow.dataset.dataset("Container_name/Project_name/Year/Month/File.parquet/renamed.parquet", filesystem=fs)
    # or
    ds = pyarrow.dataset.dataset("Container_name/Project_name/Year/Month/File.parquet/renamed.snappy.parquet", filesystem=fs)
    

    Same error as above.
    Is there anything wrong with what I am doing?