Azure Data Factory - Best Practices

Aks 0 Reputation points
2023-05-12T15:01:06.2233333+00:00

Hello,

I have a few data ingestion pipelines using ADF.

The overall structure of the pipelines is:

  1. Fetch the initial snapshot from SQL Server tables.
  2. Fetch incremental data from SQL Server using SQL Change Data Capture (CDC) for subsequent loads. The process loops through each table in the dataset config and fetches data based on the LSN. The loop is set to run sequentially, and the Batch count property is left blank (so the default of 20 is used).
  3. Upload the data to Azure Blob Storage in JSON format.

Can you please advise on the best practices to adopt to speed up the pipeline?

Please let me know if you need more information in order to assist with this.

Thank you

Kind regards,

Aks


2 answers

  1. VasimTamboli 4,410 Reputation points
    2023-05-12T17:23:52.7633333+00:00

    To optimize and improve the performance of your data ingestion pipelines in Azure Data Factory (ADF), you can follow these best practices:

    Use Parallel Execution: Instead of running your loop sequentially, consider using parallel execution to process multiple tables simultaneously. This can significantly improve overall performance. You can achieve parallel execution by using the "ForEach" activity in ADF and configuring it to process tables in parallel (note that the Batch count property only takes effect when the Sequential option is unchecked).
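
    As an illustration, here is a minimal sketch of how those ForEach settings might look in the pipeline JSON. The activity names and the tableList parameter are placeholders, and the inner Copy activity is trimmed down to its name:

    ```json
    {
        "name": "ForEachTable",
        "type": "ForEach",
        "typeProperties": {
            "isSequential": false,
            "batchCount": 10,
            "items": {
                "value": "@pipeline().parameters.tableList",
                "type": "Expression"
            },
            "activities": [
                {
                    "name": "CopyTableToBlob",
                    "type": "Copy",
                    "description": "Trimmed for brevity; a real Copy activity also needs typeProperties, inputs, and outputs."
                }
            ]
        }
    }
    ```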

    Use Incremental Loading Techniques: Incremental loading reduces the amount of data processed by only capturing and processing the changes since the last run. Since you're already using SQL Change Data Capture (CDC), make sure to leverage it effectively to fetch only the incremental data instead of processing the entire dataset. This can greatly reduce processing time and improve pipeline performance.
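
    For example, a Copy activity source can call the CDC table-valued function directly. In this sketch the capture instance dbo_Orders and the lastLsn field on the per-table config item are placeholders for whatever your dataset config actually stores:

    ```json
    {
        "source": {
            "type": "SqlServerSource",
            "sqlReaderQuery": {
                "value": "SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_Orders(CONVERT(binary(10), '@{item().lastLsn}', 1), sys.fn_cdc_get_max_lsn(), 'all')",
                "type": "Expression"
            }
        }
    }
    ```

    After each successful run, persist the LSN returned by sys.fn_cdc_get_max_lsn() back into the config so the next run picks up from there.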

    Optimize Batch Count: The ForEach Batch count (the batchCount property in the sketch above) defaults to 20 and can be set as high as 50. Increasing it lets more tables load concurrently, but a value that is too high can overload the source database. Test different values to find the most efficient setting for your specific workload.

    Utilize Data Compression: If your data allows for it, consider compressing the output data in Azure Blob storage. Compressed data takes up less storage space and reduces the time required for data transfer and processing. For JSON output, ADF supports codecs such as gzip, bzip2, and deflate (Snappy applies to columnar formats such as Parquet and ORC).
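
    For instance, a JSON sink dataset can declare gzip compression directly. The dataset, linked service, container, and folder names below are placeholders:

    ```json
    {
        "name": "CompressedJsonSink",
        "properties": {
            "type": "Json",
            "linkedServiceName": {
                "referenceName": "AzureBlobStorageLS",
                "type": "LinkedServiceReference"
            },
            "typeProperties": {
                "location": {
                    "type": "AzureBlobStorageLocation",
                    "container": "ingestion",
                    "folderPath": "cdc-output"
                },
                "compression": {
                    "type": "gzip",
                    "level": "Optimal"
                }
            }
        }
    }
    ```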

    Monitor and Optimize Performance: Regularly monitor the performance of your pipeline using Azure Monitor and ADF pipeline monitoring tools. Identify any bottlenecks, long-running activities, or data processing issues. Optimize your pipeline accordingly based on the insights gathered from monitoring data.

    Use Optimized Data Movement: When moving data between different sources and sinks, utilize the optimized connectors and integration runtimes provided by ADF. For example, PolyBase (or the COPY statement) speeds up loads into Azure Synapse Analytics; for a Blob storage sink like yours, the built-in connector together with parallel copies is the main lever for faster data movement.

    Implement Data Partitioning and Filtering: If possible, partition your data based on relevant columns during extraction, such as date or region. This allows for more efficient data processing, as you can filter and process smaller subsets of data at a time.
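
    For the initial snapshot of large tables, the SQL Server source also supports built-in partitioned reads. A sketch, assuming a hypothetical OrderDate column with a known range:

    ```json
    {
        "source": {
            "type": "SqlServerSource",
            "partitionOption": "DynamicRange",
            "partitionSettings": {
                "partitionColumnName": "OrderDate",
                "partitionLowerBound": "2020-01-01",
                "partitionUpperBound": "2023-05-01"
            }
        }
    }
    ```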

    Provision Adequate Resources: Ensure that you have provisioned enough resources for your ADF pipeline. This includes the appropriate number of data integration units (DIUs) and integration runtime (IR) nodes based on your workload and data volume. Scaling up or out your ADF resources can improve performance and reduce processing time.
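
    Both settings live on the Copy activity itself. A sketch with illustrative values (note that dataIntegrationUnits applies only when the copy runs on an Azure integration runtime; with a self-hosted IR, which is typical for on-premises SQL Server sources, you scale by adding nodes instead):

    ```json
    {
        "name": "CopyTableToBlob",
        "type": "Copy",
        "description": "Source, sink, inputs, and outputs omitted for brevity.",
        "typeProperties": {
            "dataIntegrationUnits": 16,
            "parallelCopies": 4
        }
    }
    ```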

    By following these best practices, you can optimize the performance of your data ingestion pipelines in Azure Data Factory and achieve faster data processing and transfer times.

    1 person found this answer helpful.

  2. ShaikMaheer-MSFT 37,896 Reputation points Microsoft Employee
    2023-05-17T17:01:14.7933333+00:00

    Hi Aks,

    Thank you for posting your query on the Microsoft Q&A platform.

    You can consider using the ADF CDC feature as well. Kindly check the video below for a better idea.

    CDC(change data capture) for SQL Source in Mapping data flows in Azure Data Factory or Synapse

    The CDC resource can also be considered. Kindly check the video below for a better idea.

    CDC (change data capture) Resource in Azure Data Factory

    Hope this helps. Please let me know if you have any further queries.


    Please consider hitting the Accept Answer button. Accepted answers help the community as well.

    1 person found this answer helpful.