Resolving FileNotFoundError When Reading Parquet Files in Synapse Notebook

Clover J 200 Reputation points
2024-05-30T12:32:12.17+00:00

In my Synapse Notebook, I aimed to read Parquet files. However, I encountered a 'FileNotFoundError' when attempting to use a wildcard. The folder structure I intend to access is as follows: 'test/year={yyyy}/month={MM}/day={dd}/*.parquet'. Here's the code snippet I executed:

df = pd.read_parquet('abfss://xxx@xxx.dfs.core.windows.net/test/*/*/*/*.parquet', storage_options='')

Any insights on resolving this issue would be appreciated.

Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.

Accepted answer
  1. Smaran Thoomu 11,535 Reputation points Microsoft Vendor
    2024-05-30T13:43:50.5066667+00:00

    Hi @Clover J

    Thanks for the question and for using the Microsoft Q&A platform.

    As far as I can see, pd.read_parquet() does not support the wildcard character * in the path, which is why you are getting the FileNotFoundError. Instead, you can use spark.read.parquet(), which accepts wildcard patterns, to read all the files under the specified folder into a Spark DataFrame.

    Here's the corrected code snippet:

    df = spark.read.parquet('abfss://xxx@xxx.dfs.core.windows.net/test/*/*/*/*') 
    df.show()
    

    This code reads every Parquet file located three folder levels below the test folder, which matches the structure year={yyyy}/month={MM}/day={dd}. The df.show() call prints the first rows of the DataFrame (20 by default).
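One caveat with the wildcard form: Spark only infers partition columns from folders below the paths it is given, so with a glob that already spans the year=/month=/day= folders the resulting DataFrame may lack the year, month and day columns. Below is a minimal sketch of one way to keep them, using Spark's basePath reader option. The helper names partition_glob and read_partitions are hypothetical, the account and container stay elided as in the question, and zero-padded month and day folders are assumed from the {MM} and {dd} placeholders.

```python
from datetime import date
from typing import Optional

# Elided exactly as in the question; substitute your own container and account.
BASE_PATH = "abfss://xxx@xxx.dfs.core.windows.net/test"


def partition_glob(base_path: str, day: Optional[date] = None) -> str:
    """Build a glob for one day's partition, or for every partition if day is None.

    Hypothetical helper: assumes the year={yyyy}/month={MM}/day={dd} layout
    from the question, with zero-padded month and day.
    """
    if day is None:
        return f"{base_path}/year=*/month=*/day=*/*.parquet"
    return f"{base_path}/year={day:%Y}/month={day:%m}/day={day:%d}/*.parquet"


def read_partitions(spark, base_path: str, day: Optional[date] = None):
    """Read the matching Parquet files as a Spark DataFrame.

    Setting basePath tells Spark where partition discovery starts, so the
    year, month and day folder names can be exposed as columns.
    """
    return (
        spark.read
        .option("basePath", base_path)
        .parquet(partition_glob(base_path, day))
    )


# In a Synapse notebook, where `spark` is already defined:
# df = read_partitions(spark, BASE_PATH, date(2024, 5, 30))
# df.show()
```

If a pandas DataFrame is needed afterwards, df.toPandas() converts the result, provided the selected data fits in driver memory. Passing a specific date rather than reading everything keeps that conversion small.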

    Hope this helps. Do let us know if you have any further queries.



