
Azure Data Factory Cost and Optimization

Sudhakar P 165 Reputation points
2025-07-25T05:12:35.9566667+00:00

Hello,

We have an Azure Data Factory (ADF) pipeline that runs daily at 7 AM and performs the following steps:

  1. Ingests data from an on-premises SQL Server into the Landing Zone (ADLS Gen2).
  2. Uses a Lookup activity to fetch active tables (based on a watermark table). Some tables are incremental, others are full load.
  3. Triggers a Databricks notebook to load data to the Bronze layer.
  4. Sequentially runs Databricks notebooks for Bronze to Silver, and then Silver to Gold (for both dimension and fact tables).
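For reference, the per-table load logic driven by the watermark table (step 2) looks roughly like this; the metadata schema and table names are simplified illustrations, not our exact setup:

```python
def build_source_query(table: dict) -> str:
    """Build the extraction query for one table from Lookup metadata.

    `table` mimics one row returned by the Lookup activity, e.g.
    {"name": ..., "load_type": "incremental" or "full",
     "watermark_column": ..., "last_watermark": ...}
    (illustrative schema - adapt to the real watermark table).
    """
    if table.get("load_type") == "incremental":
        # Incremental: only pull rows changed since the last successful run.
        return (
            f"SELECT * FROM {table['name']} "
            f"WHERE {table['watermark_column']} > '{table['last_watermark']}'"
        )
    # Full load: re-extract the entire table.
    return f"SELECT * FROM {table['name']}"
```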

[User's screenshot of the ADF pipeline]

The total pipeline run time is about 2 hours and 30 minutes daily.

We want to explore parallelizing these notebook activities to reduce the overall duration. However, we're also cautious about the cost impact, especially considering ADF activity costs.

Could you please suggest:

Best practices for optimizing pipeline performance through parallel execution.

Guidelines to manage and monitor cost impact when running multiple notebook activities in parallel.

Any recommended architectural patterns or configurations for improving performance in ADF-Databricks integrated workflows?

Thank you in advance for your help!

Azure Data Factory

An Azure service for ingesting, preparing, and transforming data at scale.


1 answer

  1. Chandra Boorla 15,475 Reputation points Microsoft External Staff Moderator
    2025-07-25T05:26:17.0733333+00:00

    Hi Sudhakar P,

    Thanks for the detailed overview of your Azure Data Factory (ADF) and Databricks pipeline. It's great that you're looking to optimize both performance and cost. Based on your current setup and the screenshot you shared, here are some suggestions:

    Performance Optimization with Parallel Execution

    • You are already on the right track using a ForEach loop with Sequential unchecked, which enables parallel execution.
    • To avoid overwhelming the Databricks cluster or exceeding ADF concurrency limits, I recommend setting a controlled Batch count - for example, start with 5 or 10, depending on cluster size and table workload.
    • You can dynamically process tables in parallel by grouping them logically (e.g., dimensions vs. facts, or Bronze vs. Silver vs. Gold layers) and creating separate ForEach loops for each group.
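    If it helps, the relevant ForEach settings in the pipeline JSON look roughly like this (activity names and the batch count are placeholders to adapt):

```json
{
  "name": "ForEachActiveTable",
  "type": "ForEach",
  "typeProperties": {
    "isSequential": false,
    "batchCount": 5,
    "items": {
      "value": "@activity('LookupActiveTables').output.value",
      "type": "Expression"
    },
    "activities": [
      { "name": "RunBronzeNotebook", "type": "DatabricksNotebook" }
    ]
  }
}
```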

    Cost Management Tips

    • Each Databricks Notebook activity in ADF incurs compute and activity costs; running too many in parallel can spike spend.
    • Use ADF's cost monitoring (via Azure Cost Management or Log Analytics) to track how activity parallelism impacts daily spend.
    • To optimize Databricks usage:

    • Reuse existing clusters with auto-scaling and auto-termination enabled.
    • Avoid spinning up separate clusters per notebook if not required.
    • Use Job Clusters only when isolation is needed.
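    As a reference point, auto-scaling and auto-termination are set on the cluster definition itself; a minimal sketch (node type, worker counts, and the Spark version are illustrative values, not a recommendation):

```json
{
  "cluster_name": "adf-shared-etl",
  "spark_version": "15.4.x-scala2.12",
  "node_type_id": "Standard_DS3_v2",
  "autoscale": { "min_workers": 2, "max_workers": 8 },
  "autotermination_minutes": 20
}
```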

    Architecture & Design Best Practices

    • Consider breaking the pipeline into modular stages: e.g., one pipeline for the Bronze load, one for Silver, and one for Gold. Trigger each stage conditionally or sequentially.
    • Use pipeline parameters and a metadata-driven design so the logic remains dynamic and maintainable.
    • For high-volume tables, isolate them into their own ForEach loop or child pipeline to prevent bottlenecks.
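    The "one ForEach per logical group" pattern can be sketched outside ADF as bounded parallel dispatch; this is a hedged illustration only (the notebook call is stubbed, and batch_count plays the role of the ForEach Batch count):

```python
from concurrent.futures import ThreadPoolExecutor

def run_notebook(table: str) -> str:
    # Stub - in a real pipeline this would be the Databricks Notebook
    # activity (or a Jobs API call) for one table.
    return f"loaded {table}"

def run_group(tables: list[str], batch_count: int = 5) -> list[str]:
    """Run one logical group (e.g. all dimensions) with bounded parallelism,
    like a ForEach with Sequential unchecked and a controlled Batch count."""
    with ThreadPoolExecutor(max_workers=batch_count) as pool:
        # map preserves input order, one result per table.
        return list(pool.map(run_notebook, tables))

# Groups run sequentially (dimensions before facts);
# tables within a group run in parallel.
for group in (["DimCustomer", "DimProduct", "DimDate"], ["FactSales"]):
    run_group(group)
```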

    I hope this information helps. Please do let us know if you have any further queries.

    Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.

    Thank you.


