How to build a pipeline that takes inputs from multiple folders, uses different notebooks for transforming, and outputs the transformed files to different folders?

Yunyi Huang
2023-04-19T08:32:13.42+00:00

Hi, I'm working on a project that involves an Azure Data Factory pipeline. Here is my current pipeline: [screenshot of the current pipeline]

It works this way: in the Notebook block, the Databricks notebook takes the data file from input folder A, transforms that data, and puts the processed data into output folder A. Then, in the Copy data block, the processed data is copied from output folder A into a designated database table A.

However, there is more than one type of data (not data types, but categories). For example, all data of category A is stored in input folder A, data of category B in input folder B, and data of category C in input folder C. There are also Notebooks A, B, and C for processing, and the results should be inserted into tables A, B, and C in my database, respectively. In this situation, I am not sure how to enable this pipeline to handle multiple input and output folders. I am a newbie in Azure, so I have watched a lot of tutorials, but most of them only cover a single input folder. I also read this thread: https://learn.microsoft.com/en-us/answers/questions/90530/copying-data-from-multiple-locations-to-multiple-d. It was somewhat helpful, but it does not seem applicable to my case since mine includes the processing step. Ideally, I want to build something like this (the workflow, not necessarily what the pipeline should look like): [diagram of the desired workflow]

I think parameterization can be used here, but I am really not sure how to apply it to my case. Could someone help me? It would be appreciated if someone could share a tutorial video or a step-by-step guide. Thank you in advance.

Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

1 answer

  1. ShaikMaheer-MSFT, Microsoft Employee, Moderator
    2023-04-20T16:54:25.26+00:00

    Hi Yunyi Huang,

    Thank you for posting your query on the Microsoft Q&A platform.

    Yes, you are correct: parameterization is what you need here. Create pipeline parameters for the notebook path, the input folder path, the output folder path, and the database table name, and supply their values from the trigger. You can then create three triggers, A, B, and C, one per data category. Pass these pipeline parameters into the Notebook activity and into the parameterized datasets used by the Copy data activity.
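
    To make this concrete, below is a minimal sketch of what the Databricks notebook side could look like once parameterized. The parameter names input_path, output_path, and category are only illustrative assumptions; they would be supplied through the Notebook activity's base parameters, which in turn reference the pipeline parameters set by each trigger.

    ```python
    # Minimal sketch of a parameterized Databricks notebook.
    # Assumed parameter names: input_path, output_path, category
    # (passed in via the ADF Notebook activity's base parameters).

    # Declare widgets so the notebook can also be run interactively with defaults.
    dbutils.widgets.text("input_path", "")
    dbutils.widgets.text("output_path", "")
    dbutils.widgets.text("category", "A")

    input_path = dbutils.widgets.get("input_path")
    output_path = dbutils.widgets.get("output_path")
    category = dbutils.widgets.get("category")

    # Read the raw files for this category from the input folder.
    df = spark.read.option("header", "true").csv(input_path)

    # Category-specific transformation goes here (placeholder).
    transformed = df

    # Write the processed data to the output folder for the Copy data activity to pick up.
    transformed.write.mode("overwrite").parquet(output_path)
    ```

    On the pipeline side, each trigger (A, B, or C) would set the pipeline parameters to the matching folder paths and table name, and the Copy data activity's parameterized source and sink datasets would reference the same values with expressions such as @pipeline().parameters.outputPath.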

    Please check the videos below on parameterization. They may help you understand parameterization better.

    Parameterize Datasets in Azure Data Factory

    Parameterize Linked Services in Azure Data Factory

    Parameterize Pipelines in Azure Data Factory

    Hope this helps. Please let me know if you have any further queries.


    Please consider hitting the Accept Answer button. Accepted answers help the community as well.

