How to build a pipeline that takes inputs from multiple folders, uses different notebooks for transforming, and outputs the transformed files to different folders?

Yunyi Huang
2023-04-19T08:32:13.42+00:00

Hi, I'm working on a project that involves an Azure Data Factory pipeline. Here is my current pipeline: [screenshot of the current pipeline]

It works this way: in the Notebook block, the Databricks notebook takes the data file from input folder A, transforms that data, and puts the processed data into output folder A. Then, in the Copy data block, the processed data is copied from output folder A into a designated database table A.

However, there is more than one type of data (not data types, but categories). For example, all data of category A is stored in input folder A, data of category B in input folder B, and data of category C in input folder C. There are also Notebooks A, B, and C for processing, and the results should be inserted into tables A, B, and C in my database, respectively. In this situation, I am not sure how to enable this pipeline to handle multiple input and output folders. I am a newbie in Azure, so I have watched a lot of tutorials, but most of them only cover a single input folder. I also read this thread: https://learn.microsoft.com/en-us/answers/questions/90530/copying-data-from-multiple-locations-to-multiple-d. It was somewhat helpful, but it does not seem applicable to my case since mine includes the processing step. Ideally, I want to build something like this (the workflow, not necessarily what the pipeline should look like): [diagram of the desired workflow]

I think parameterization can be used here, but I am really not sure how to apply it to my case. Could someone help me? It would be appreciated if someone could share a tutorial video or a step-by-step guide. Thank you in advance.

Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

1 answer

  1. ShaikMaheer-MSFT, Microsoft Employee, Moderator
    2023-04-20T16:54:25.26+00:00

    Hi Yunyi Huang,

    Thank you for posting your query on the Microsoft Q&A platform.

    Yes, you are correct: parameterization is what you need here. Create pipeline parameters for the notebook path, the input folder path, the output folder path, and the database table name, and supply their values from the trigger. You can then create three triggers, A, B, and C, one per data category. Pass these pipeline parameters into the Notebook activity and into the parameterized datasets used by the Copy data activity.
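
    To make this concrete, below is a minimal sketch of what the Databricks notebook side could look like once parameterized. The parameter names input_path, output_path, and category are only illustrative assumptions; they would be supplied through the Notebook activity's base parameters, which in turn reference the pipeline parameters set by each trigger.

    ```python
    # Minimal sketch of a parameterized Databricks notebook.
    # Assumed parameter names: input_path, output_path, category
    # (passed in via the ADF Notebook activity's base parameters).

    # Declare widgets so the notebook can also be run interactively with defaults.
    dbutils.widgets.text("input_path", "")
    dbutils.widgets.text("output_path", "")
    dbutils.widgets.text("category", "A")

    input_path = dbutils.widgets.get("input_path")
    output_path = dbutils.widgets.get("output_path")
    category = dbutils.widgets.get("category")

    # Read the raw files for this category from the input folder.
    df = spark.read.option("header", "true").csv(input_path)

    # Category-specific transformation goes here (placeholder).
    transformed = df

    # Write the processed data to the output folder for the Copy data activity to pick up.
    transformed.write.mode("overwrite").parquet(output_path)
    ```

    On the pipeline side, each trigger (A, B, or C) would set the pipeline parameters to the matching folder paths and table name, and the Copy data activity's parameterized source and sink datasets would reference the same values with expressions such as @pipeline().parameters.outputPath.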

    Please check the videos below on parameterization. They may help you understand parameterization better.

    Parameterize Datasets in Azure Data Factory

    Parameterize Linked Services in Azure Data Factory

    Parameterize Pipelines in Azure Data Factory

    Hope this helps. Please let me know if you have any further queries.


    Please consider hitting the Accept Answer button. Accepted answers help the community as well.

