Read Parquet files from folders in a Data Flow

Kumar, Arun 236 Reputation points


I have Parquet files generated in the blob folder staging/ABC/XYZ/data_1.parquet, as in the screenshot below. The entries data_1.parquet, data_2.parquet, etc. are folders, and the actual Parquet files sit inside them. I need to read those Parquet files from inside the folders into a Data Flow. Can a Data Flow pick up the Parquet files automatically if we point it at these folders?

[Screenshot: folder structure in blob storage]

Tags: Azure Synapse Analytics, Azure Data Factory

Accepted answer
  1. PRADEEPCHEEKATLA-MSFT 69,771 Reputation points Microsoft Employee

    @Kumar, Arun - Thanks for the question and using MS Q&A platform.

    To read Parquet files dynamically in a Data Flow in Azure Data Factory, you can use the "Wildcard file path" option in the source settings. This allows you to specify a pattern for the file names or folder names that you want to read.

    In your case, note that the pattern staging/ABC/XYZ/*.parquet matches only entries directly inside the "staging/ABC/XYZ" folder. Since your Parquet files sit inside subfolders (data_1.parquet, data_2.parquet, etc.), use the recursive pattern staging/ABC/XYZ/**/*.parquet, where ** matches any level of folder nesting.

    This pattern will match every file with a ".parquet" extension inside the "staging/ABC/XYZ" folder and its subfolders.
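    The difference between the flat and recursive patterns can be illustrated locally with Python's `pathlib` globbing, which follows similar conventions (the part-file names below are illustrative stand-ins, not what ADF produces in your account):

    ```python
    import tempfile
    from pathlib import Path

    # Recreate the layout from the question: folders named data_N.parquet
    # that each contain the actual Parquet part files.
    root = Path(tempfile.mkdtemp())
    for folder in ("data_1.parquet", "data_2.parquet"):
        part_dir = root / "staging" / "ABC" / "XYZ" / folder
        part_dir.mkdir(parents=True)
        (part_dir / "part-0000.snappy.parquet").touch()

    xyz = root / "staging" / "ABC" / "XYZ"

    # "*.parquet" matches entries directly under XYZ -- here that is the
    # two *folders*, not the files inside them.
    flat = sorted(p.name for p in xyz.glob("*.parquet"))
    print(flat)  # ['data_1.parquet', 'data_2.parquet'] (both are directories)

    # "**/*.parquet" recurses into the subfolders and reaches the part files.
    deep = sorted(
        p.relative_to(xyz).as_posix()
        for p in xyz.glob("**/*.parquet")
        if p.is_file()
    )
    print(deep)  # the two part files inside the data_N.parquet folders
    ```

    The same idea carries over to the Data Flow source: a single-level wildcard stops at the folder names, while the recursive form reaches the files inside them.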

    To use this pattern in a Data Flow source, follow these steps:

    In the source transformation, you can read from a container, folder, or individual file in Azure Blob Storage. Use the Source options tab to manage how the files are read.

    Screenshot of source options tab in mapping data flow source transformation.

    Once you have configured the source settings, you can use the source in your Data Flow to read the Parquet files dynamically. The Data Flow will automatically detect all the Parquet files that match the wildcard pattern and read them into the Data Flow.
    Here are the complete steps to read the Parquet files from inside the folders in Azure Blob Storage:

    Step 1: Create three Parquet files in the blob folder, for example staging/ABC/XYZ/userdata1.parquet:

    [Screenshot: Parquet files in the blob folder]

    Step 2: Create a dataset with the Parquet file format, select the linked service, and under File path select only the container name (staging), as shown:
    [Screenshot: dataset file path settings]

    Step 3: Create a data flow, add a source, select the dataset, and under Source options set the wildcard path to /ABC/XYZ/*.parquet, as shown: [Screenshot: source options with wildcard path]

    Step 4: Click Data preview to see the data from all the Parquet files, as shown:
    [Screenshot: data preview]

    Note: To add dynamic content, use `/ABC/XYZ/*.parquet`, as shown below: [Screenshot: dynamic content]

    Dataflow expression builder:
    [Screenshot: expression builder]
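    Since the data_N.parquet entries in the question are folders rather than files, it can help to check locally which paths are genuine Parquet files: a Parquet file starts and ends with the 4-byte magic marker `PAR1`. A minimal stdlib sketch (the helper name is ours, not part of any ADF or Parquet API, and the stand-in file below is not a readable Parquet file, just the magic bytes):

    ```python
    import tempfile
    from pathlib import Path

    PARQUET_MAGIC = b"PAR1"  # Parquet files begin and end with these 4 bytes

    def is_parquet_file(path: Path) -> bool:
        """Return True only for a regular file carrying the Parquet magic bytes."""
        if not path.is_file() or path.stat().st_size < 8:
            return False
        with path.open("rb") as f:
            head = f.read(4)
            f.seek(-4, 2)  # jump to 4 bytes before end-of-file
            tail = f.read(4)
        return head == PARQUET_MAGIC and tail == PARQUET_MAGIC

    # A directory named data_1.parquet is not a Parquet file:
    folder = Path(tempfile.mkdtemp()) / "data_1.parquet"
    folder.mkdir()
    print(is_parquet_file(folder))   # False

    # A minimal stand-in with the right leading/trailing bytes passes the check:
    fake = folder / "part-0000.parquet"
    fake.write_bytes(b"PAR1" + b"\x00" * 8 + b"PAR1")
    print(is_parquet_file(fake))     # True
    ```

    This kind of check clarifies why the wildcard in the source must reach one level below the data_N.parquet folders: only the files inside them are actual Parquet data.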

    For more details, refer to Wildcards in Data Flow and the Stack Overflow thread addressing a similar issue.

    Hope this helps. If this answers your query, do click Accept Answer and Yes for "was this answer helpful". And if you have any further queries, do let us know.

    1 person found this answer helpful.
