How to Implement a Metadata-Driven ADF Pipeline to Dynamically Process Latest Files from ADLS?

YERNAIDU SIRAPARAPU
2025-06-20T09:57:32.63+00:00

Hello Community,

I need help designing a single, reusable, and metadata-driven Azure Data Factory (ADF) pipeline for a sales order ingestion use case involving multiple source systems like SAP and Senior.

Business Scenario

We receive Excel files from two systems — SAP and Senior — at different frequencies during the day. Each source drops files into its own folder in Azure Data Lake Storage (ADLS).

Each file:

Has a unique and consistent naming pattern (e.g., SAP_Sales_*.xlsx)

Needs to be processed only once

Should be loaded into a Raw table, then transformed into a Clean (unified) table

We want to automate this using scheduled ADF triggers that invoke a pipeline multiple times per day per source.

High-Level Requirements

We want to build a single ADF pipeline that:

Accepts a Source System parameter (e.g., "SAP" or "Senior").

Uses metadata stored in a SQL table (like folder path, file pattern, raw table, clean table) to drive the logic.

Dynamically connects to the right ADLS folder.

Identifies the latest file based on the LastModified timestamp (not file name).

Before processing, checks a log table to see if the file was already processed.

If new:

Loads the data into a source-specific Raw table

Transforms it into a unified Clean Sales Orders table

Logs the file details and status

If already processed, skips and optionally logs that the file was ignored.

Triggering Strategy

Each source has its own trigger schedule:

Trigger Name          Time       Parameter
Trigger_SAP_8AM       8:00 AM    SAP
Trigger_SAP_6PM       6:00 PM    SAP
Trigger_Senior_6AM    6:00 AM    Senior
Trigger_Senior_12PM   12:00 PM   Senior
Trigger_Senior_8PM    8:00 PM    Senior

Each trigger only passes the SourceSystem parameter to the pipeline.

Desired Solution Characteristics

Only one pipeline should handle all source systems.

Logic must be fully metadata-driven via SQL config table (no hardcoded values).

File detection must be based on LastModified timestamp (not sorted filenames).

Should log all processing activities and prevent reprocessing of already-handled files.

Must support easy onboarding of new sources by adding rows to the metadata table.

Reusable, scalable, and maintainable.

My Ask:

Can anyone guide me on how to implement this end-to-end using ADF features like:

Dynamic datasets

Parameterized pipelines

Metadata-based design

LastModified file detection

Conditional flows (to skip already processed files)

Logging mechanisms

I would appreciate any best practices, sample architectures, or community resources (videos, blogs, GitHub samples) that can help me get started.

Thanks in advance for your support!


Accepted answer
  1. J N S S Kasyap (Microsoft External Staff, Moderator)
    2025-06-20T10:50:02.8533333+00:00

    Hi @Yernaidu Siraparapu

    Steps to Implement a Metadata-Driven File Processing Pipeline in ADF

    Start by creating a SQL metadata table to store configuration details for each source system. This table should include columns like SourceSystem, FolderPath, FilePattern, RawTable, and CleanTable. This allows the pipeline to drive its logic based on metadata instead of hardcoding values. 
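    A minimal sketch of such a config table, with illustrative (not prescribed) table, schema, and column names; the Senior file pattern below is an assumption:

        CREATE TABLE dbo.SourceConfig (
            SourceSystem VARCHAR(50)  NOT NULL PRIMARY KEY,  -- e.g. 'SAP', 'Senior'
            FolderPath   VARCHAR(500) NOT NULL,              -- ADLS folder for this source
            FilePattern  VARCHAR(200) NOT NULL,              -- e.g. 'SAP_Sales_*.xlsx'
            RawTable     VARCHAR(200) NOT NULL,              -- source-specific raw table
            CleanTable   VARCHAR(200) NOT NULL               -- unified clean table
        );

        INSERT INTO dbo.SourceConfig VALUES
        ('SAP',    'sales/sap/incoming',    'SAP_Sales_*.xlsx',    'raw.SAP_SalesOrders',    'clean.SalesOrders'),
        ('Senior', 'sales/senior/incoming', 'Senior_Sales_*.xlsx', 'raw.Senior_SalesOrders', 'clean.SalesOrders');

    Onboarding a new source then only requires adding another row to this table.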

    Next, define a pipeline parameter in Azure Data Factory called SourceSystem. This parameter will be passed by triggers for each source system (e.g., "SAP", "Senior") to control the pipeline’s behavior dynamically. 

    Use a Lookup activity to query the metadata table using the SourceSystem parameter. This fetches dynamic values such as the folder path in ADLS, the file name pattern, and the target table names. 
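    For example, assuming the dbo.SourceConfig table sketched above and a pipeline parameter named SourceSystem, the Lookup query (entered as dynamic content, with "First row only" enabled) could be:

        SELECT FolderPath, FilePattern, RawTable, CleanTable
        FROM dbo.SourceConfig
        WHERE SourceSystem = '@{pipeline().parameters.SourceSystem}'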

    To detect available files, add a Get Metadata activity to list all files in the source folder (retrieved from metadata). Then, apply a Filter activity to keep only files that match the desired pattern (e.g., SAP_Sales_*.xlsx). 
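    As an illustration, if that Get Metadata activity is named GetFileList and returns Child items, the Filter activity could be configured as follows (the activity name and the hardcoded prefix are assumptions; the Filter condition has no wildcard matching, so in practice you would store the prefix and extension in the metadata table and read them from the Lookup output instead):

        Items:     @activity('GetFileList').output.childItems
        Condition: @and(startswith(item().name, 'SAP_Sales_'), endswith(item().name, '.xlsx'))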

    Loop through the filtered file list using ForEach, and within the loop, use another Get Metadata activity to fetch the LastModified timestamp for each file. Store file names and timestamps in an array variable. 

    After collecting the timestamps, use pipeline variables and expressions (or a Data Flow) to identify the most recently modified file, as sketched below. This ensures the pipeline always processes the latest file, regardless of how the file names sort.
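    One way to do this with pipeline variables alone (a sketch; the variable and activity names are assumptions): initialize LatestModified to '1900-01-01T00:00:00Z' and LatestFileName to an empty string, run the ForEach in sequential mode, and inside the loop add an If Condition such as

        @greater(ticks(activity('GetFileDetails').output.lastModified), ticks(variables('LatestModified')))

    whose True branch uses two Set Variable activities to store item().name into LatestFileName and the GetFileDetails lastModified output into LatestModified. After the loop, the two variables describe the newest matching file.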

    Before processing, check if the file has already been handled. Use another Lookup activity to query a log table in SQL that records all previously processed files. If the file exists in the log, skip it. 
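    A possible existence check, assuming a log table like the dbo.FileProcessLog shown further below and the LatestFileName variable from the previous step (run as a Lookup with "First row only"):

        SELECT COUNT(*) AS AlreadyProcessed
        FROM dbo.FileProcessLog
        WHERE SourceSystem = '@{pipeline().parameters.SourceSystem}'
          AND FileName = '@{variables('LatestFileName')}'
          AND Status = 'Success'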

    Use an If Condition activity to control the flow. If the file is new, proceed to process it. If it's already logged, skip processing and record a "Skipped" status in the log table. 
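    With the check above, the If Condition expression can be as simple as this (assuming the log-check Lookup is named CheckProcessedLog):

        @equals(activity('CheckProcessedLog').output.firstRow.AlreadyProcessed, 0)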

    For new files, use a Copy Activity with a dynamic dataset to load data into the Raw table. The dataset should accept folderPath and fileName as parameters based on earlier metadata lookup. 
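    As a sketch, using the names from the earlier steps (and assuming the metadata Lookup activity is named LookupConfig): give the ADLS dataset two parameters, reference them in its path settings, and supply their values on the Copy source.

        Dataset directory:  @dataset().folderPath
        Dataset file name:  @dataset().fileName
        Copy source values: folderPath = @activity('LookupConfig').output.firstRow.FolderPath
                            fileName   = @variables('LatestFileName')

    Note that an Excel dataset also requires a sheet name (or index), which can likewise be parameterized or stored in the metadata table, and the SQL sink dataset can take its table name from the RawTable value returned by the Lookup.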

    After loading into Raw, use a Mapping Data Flow or a Stored Procedure activity to transform and load the data into the Clean unified sales table, as defined in your metadata table. 
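    If you take the stored-procedure route, the transformation can be a straightforward INSERT ... SELECT from the Raw table into the Clean table. The sketch below is purely illustrative: the column names are assumptions, not from the original post.

        CREATE PROCEDURE dbo.usp_LoadCleanSalesOrders
            @SourceSystem VARCHAR(50)
        AS
        BEGIN
            -- Illustrative columns; map each source's raw columns to the unified shape
            INSERT INTO clean.SalesOrders (SourceSystem, OrderId, OrderDate, CustomerCode, Amount)
            SELECT @SourceSystem, OrderId, OrderDate, CustomerCode, Amount
            FROM raw.SAP_SalesOrders;  -- in practice: one procedure per source, or dynamic SQL driven by the RawTable metadata value
        END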

    Finally, log the processing status using a Stored Procedure, recording details like file name, source system, timestamp, and status (Success, Skipped, or Failed) into a centralized audit table. 
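    A minimal audit table and logging procedure might look like this (names, columns, and data types are illustrative):

        CREATE TABLE dbo.FileProcessLog (
            LogId        INT IDENTITY(1,1) PRIMARY KEY,
            SourceSystem VARCHAR(50)  NOT NULL,
            FileName     VARCHAR(500) NOT NULL,
            FileModified DATETIME2    NULL,
            Status       VARCHAR(20)  NOT NULL,   -- 'Success', 'Skipped', or 'Failed'
            LoggedAt     DATETIME2    NOT NULL DEFAULT SYSUTCDATETIME()
        );

        CREATE PROCEDURE dbo.usp_LogFileProcess
            @SourceSystem VARCHAR(50),
            @FileName     VARCHAR(500),
            @FileModified DATETIME2,
            @Status       VARCHAR(20)
        AS
        BEGIN
            INSERT INTO dbo.FileProcessLog (SourceSystem, FileName, FileModified, Status)
            VALUES (@SourceSystem, @FileName, @FileModified, @Status);
        END

    The Stored Procedure activity can pass the pipeline parameter, the LatestFileName and LatestModified variables, and a literal status value for each branch.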

    To automate this, create multiple time-based triggers, each configured for a specific source system and time. Each trigger will pass the appropriate SourceSystem value to the pipeline for scheduled execution. 

    I hope this helps! Let me know if you have any other questions or need further assistance.


