Azure Data Factory Pipelines - calling Synapse SQL Serverless uses lots of DIUs?

Ian Coetzer 66 Reputation points


I have created SQL Serverless databases in my Synapse Analytics workspace.

The SQL serverless database contains a view which performs a BULK openrowset operation to open parquet files stored in a container in Azure Storage Account Gen 2.

When I execute this view in Synapse Analytics it scans and returns several hundred MB of records.

The cost of this is low, since Synapse Analytics charge 5 USD per TB data processed.

However, when I call the same view from Azure Data Factory using a simple Copy Activity it costs much more to read the results and store them in another parquet sink?

I noticed that the Azure Data Factory pipeline uses a lot of DIU hours.

Why is this?

How can I automate the calling of a view defined in Synapse serverless without incurring additional azure data factory costs?

Azure Synapse Analytics
Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.
4,672 questions
Azure Data Factory
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.
10,136 questions
{count} votes

1 answer

Sort by: Most helpful
  1. phemanth 8,645 Reputation points Microsoft Vendor

    @Ian Coetzer

    Thanks For Reaching out MS Q&A

    The cost difference you’re seeing between Synapse Analytics and Azure Data Factory (ADF) is due to the way these services are billed.

    In Synapse Analytics, you’re charged based on the amount of data processed, which is $5 per TB1. However, in ADF, you’re charged based on the number of Data Integration Units (DIUs) you use and the duration they run. For example, if you use 10 DIUs for 2 hours, and each DIU costs $1 per hour, your cost would be: 10 DIUs x 2 hours x $1 per hour = $202. Even for short-running activities, the minimum amount of billable DIU-hours is 1/60th of an hour, or 0.0166667 hours3. This means that if an activity runs for less than 1 minute, it will still be billed for 0.0166667 hours of DIU usage. 

    To automate the calling of a view defined in Synapse serverless without incurring additional ADF costs, you could consider the following options

    Dynamically Create SQL Serverless Views: You can dynamically update SQL Serverless views from within a Synapse Pipeline so that they’re always up-to-date with the data in your data lake. This approach allows you to maintain the schema of your SQL Serverless views and ensure they match the data in your data lake.

    Use Partitioned Views: If you have a set of files that is partitioned in the hierarchical folder structure, you can describe the partition pattern using the wildcards in the file path. Use the FILEPATH function to expose parts of the folder path as partitioning columns. Partitioned views can improve the performance of your queries by performing partition elimination when you query them with filters on the partitioning columns.

    Remember, these are just a few options and the best solution would depend on your specific use case and requirements. Always consider the trade-offs in terms of cost, performance, and complexity when choosing a solution.

    Hope this helps. Do let us know if you any further queries.

    0 comments No comments