Read Parquet files from the folder in Dataflow

Question

Kumar, Arun 336

Hi,

I have Parquet output in the blob folder staging/ABC/XYZ, as shown in the screenshot below. Entries such as data_1.parquet and data_2.parquet are actually folders, and the Parquet files sit inside them. I need to read those Parquet files from inside the folders (data_1.parquet, data_2.parquet, etc.) into a Data Flow. Can Data Flow pick up the Parquet files automatically if we point it at these folders?

User's image

Accepted answer

Answer 1

@Kumar, Arun - Thanks for the question and for using the MS Q&A platform.

To read Parquet files dynamically in a Data Flow in Azure Data Factory, you can use the "Wildcard file path" option in the source settings. This allows you to specify a pattern for the file names or folder names that you want to read.

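In your case, you can use the following wildcard pattern to read the Parquet data under the "staging/ABC/XYZ" folder: staging/ABC/XYZ/*.parquet

This pattern matches every entry whose name ends in ".parquet" directly under the "staging/ABC/XYZ" folder, which covers the data_1.parquet, data_2.parquet, etc. entries in your layout. A single * stays within one folder level. If files sit deeper in nested folders, the Data Flow wildcard syntax also supports ** for recursive directory nesting, for example staging/ABC/XYZ/**/*.parquet.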
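To make the wildcard behaviour concrete, here is a minimal Python sketch of glob-style matching on path strings. It is not how Data Flow implements wildcards. It assumes that * stays within one path segment, that ** spans folder levels, and that "**/" may also match zero folders (a convention borrowed from common glob tools). The blob names other than userdata1.parquet are hypothetical, and the sketch only compares strings: it does not model how Data Flow reads a matched folder.

```python
import re


def wildcard_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a glob-style wildcard into a regex (assumed semantics, see above)."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append("(?:.*/)?")   # zero or more whole folder levels
            i += 3
        elif pattern.startswith("**", i):
            parts.append(".*")         # anything, including '/'
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")      # anything within one path segment
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")       # one character within a segment
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def matches(pattern: str, path: str) -> bool:
    """Return True when the whole path matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(path) is not None


# Hypothetical entries under the container, for illustration only.
entries = [
    "staging/ABC/XYZ/userdata1.parquet",
    "staging/ABC/XYZ/data_1.parquet",                    # a folder named like a file
    "staging/ABC/XYZ/data_1.parquet/part-0000.parquet",  # a file inside that folder
    "staging/ABC/XYZ/notes.txt",
]

flat = "staging/ABC/XYZ/*.parquet"
nested = "staging/ABC/XYZ/**/*.parquet"

for entry in entries:
    print(entry, matches(flat, entry), matches(nested, entry))
```

Under these assumptions the flat pattern matches the first two entries only, while the ** pattern also reaches the file inside the data_1.parquet folder.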

To use this pattern in a Data Flow source, follow these steps:

In the source transformation, you can read from a container, folder, or individual file in Azure Blob Storage. Use the Source options tab to manage how the files are read.

Screenshot of source options tab in mapping data flow source transformation.

Once you have configured the source settings, you can use the source in your Data Flow to read the Parquet files dynamically. The Data Flow will automatically detect all the Parquet files that match the wildcard pattern and read them into the Data Flow.

Here are the complete steps to read the Parquet files from inside the folders in Azure Blob Storage:

Step 1: I created three Parquet files in the blob folder staging/ABC/XYZ, for example staging/ABC/XYZ/userdata1.parquet

User's image

Step 2: Create a dataset with the Parquet file format, select the linked service, and under File path enter only the container name, staging, as shown:
User's image

Step 3: Create a data flow, add a source, and select the linked service. Under Source options, set the wildcard path to /ABC/XYZ/*.parquet as shown:
User's image

Step 4: Click Data preview to see the data from all the Parquet files, as shown:
User's image

Note: To add the path as dynamic content, use `/ABC/XYZ/*.parquet` as shown below:
User's image

Dataflow expression builder:
User's image
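The full pattern given earlier (staging/ABC/XYZ/*.parquet) includes the container, while the steps put the container in the dataset and only /ABC/XYZ/*.parquet in the wildcard path. As a rough illustration of that split, here is a small Python sketch. The helper names split_blob_path and build_wildcard_path are hypothetical and are not part of Azure Data Factory or any Azure SDK. They only manipulate strings.

```python
from typing import Tuple


def split_blob_path(full_path: str) -> Tuple[str, str]:
    """Split 'container/rest/of/path' into (container, '/rest/of/path')."""
    container, _, rest = full_path.strip("/").partition("/")
    return container, "/" + rest


def build_wildcard_path(*folders: str, pattern: str = "*.parquet") -> str:
    """Join folder names and a file pattern into a wildcard path with a leading slash."""
    cleaned = [f.strip("/") for f in folders if f.strip("/")]
    return "/" + "/".join(cleaned + [pattern])


# The container goes in the dataset, the remainder in the wildcard path.
container, wildcard_path = split_blob_path("staging/ABC/XYZ/*.parquet")
print(container)      # staging
print(wildcard_path)  # /ABC/XYZ/*.parquet

# Building the same wildcard path from folder names, e.g. pipeline parameters.
print(build_wildcard_path("ABC", "XYZ"))  # /ABC/XYZ/*.parquet
```

Keeping the folder names separate from the file pattern makes it easier to supply them as parameters when the path is set through dynamic content.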

For more details, refer to Wildcards in Data Flow and the SO thread addressing a similar issue.

Hope this helps. Do let us know if you have any further queries.

If this answers your query, please click Accept Answer and Yes for "Was this answer helpful".

Kumar, Arun 336 Reputation points

2023-11-21T22:26:53.0566667+00:00

@PRADEEP CHEEKATLA Thank you for the very detailed explanation. It helped.

PRADEEPCHEEKATLA 90,651 Reputation points Moderator

2023-11-22T02:50:19.64+00:00

@Kumar, Arun - Glad to know it helped. Please do continue to use the MS Q&A platform for any questions related to Azure!
