Azure Data Factory - Best Practices

Aks 0 Reputation points
2023-05-12T15:01:06.2233333+00:00

Hello,

I have a few data ingestion pipelines using ADF.

The overall structure of the pipelines is:

  1. Fetch the initial snapshot from the SQL Server tables.
  2. Fetch incremental data from SQL Server using SQL Change Data Capture (CDC) for subsequent loads. The process loops through each table in the dataset config and fetches data based on LSN. The loop is set to run sequentially, and the Batch count property is left blank (so the default of 20 is used).
  3. Upload the data to Azure Blob storage in JSON format.

Could you please advise on best practices to speed up the pipeline?

Please let me know if you need more information in order to assist with this.

Thank you

Kind regards,

Aks

Azure Data Factory

2 answers

  1. VasimTamboli 5,095 Reputation points
    2023-05-12T17:23:52.7633333+00:00

    To optimize and improve the performance of your data ingestion pipelines in Azure Data Factory (ADF), you can follow these best practices:

    Use Parallel Execution: Instead of running the loop sequentially, consider processing multiple tables simultaneously; this can significantly improve overall performance. You achieve this by unchecking the Sequential option on the "ForEach" activity, so that iterations run concurrently up to the Batch count. Note this is only safe if each iteration handles an independent table, including its own LSN watermark.
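
    A minimal sketch of the relevant activity JSON, assuming a pipeline parameter `tableList` whose items carry a `name` field and a child pipeline `IngestSingleTable` (all hypothetical names):

    ```json
    {
      "name": "ForEachTable",
      "type": "ForEach",
      "typeProperties": {
        "isSequential": false,
        "batchCount": 10,
        "items": {
          "value": "@pipeline().parameters.tableList",
          "type": "Expression"
        },
        "activities": [
          {
            "name": "IngestOneTable",
            "type": "ExecutePipeline",
            "typeProperties": {
              "pipeline": {
                "referenceName": "IngestSingleTable",
                "type": "PipelineReference"
              },
              "waitOnCompletion": true,
              "parameters": {
                "tableName": {
                  "value": "@item().name",
                  "type": "Expression"
                }
              }
            }
          }
        ]
      }
    }
    ```

    Wrapping the per-table work in a child pipeline keeps the loop body simple and makes a single table easy to rerun on its own.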

    Use Incremental Loading Techniques: Incremental loading reduces the amount of data processed by only capturing and processing the changes since the last run. Since you're already using SQL Change Data Capture (CDC), make sure to leverage it effectively to fetch only the incremental data instead of processing the entire dataset. This can greatly reduce processing time and improve pipeline performance.
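
    For illustration only (the capture instance `dbo_MyTable` and the pipeline variable `fromLsn`, holding the last processed LSN as a hex literal such as `0x0000002B000001A80003`, are assumptions, not your actual names), the copy activity source can restrict each read to the change window:

    ```json
    {
      "source": {
        "type": "SqlServerSource",
        "sqlReaderQuery": {
          "value": "SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_MyTable(@{variables('fromLsn')}, sys.fn_cdc_get_max_lsn(), 'all')",
          "type": "Expression"
        }
      }
    }
    ```

    After a successful copy, persist the new maximum LSN (for example in a watermark table read back with a Lookup activity) so the next run starts where this one ended.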

    Optimize Batch Size: When the ForEach Batch count is left blank, it defaults to 20 (the maximum is 50). Increasing it raises the number of loop iterations that run concurrently, which can improve throughput, but setting it too high can overload the source. Test different batch counts to find the most efficient value for your specific workload.

    Utilize Data Compression: If your data allows for it, consider compressing the output data in Azure Blob storage. Compressed data takes up less storage space and reduces the time required for data transfer. For JSON output you can apply a codec such as Gzip on the sink dataset; Snappy becomes an option if you switch to a columnar format such as Parquet.
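
    A sketch of a sink dataset that writes gzipped JSON, assuming a linked service `AzureBlobStorageLS` and a container `ingest` (both hypothetical):

    ```json
    {
      "name": "BlobJsonGzip",
      "properties": {
        "type": "Json",
        "linkedServiceName": {
          "referenceName": "AzureBlobStorageLS",
          "type": "LinkedServiceReference"
        },
        "typeProperties": {
          "location": {
            "type": "AzureBlobStorageLocation",
            "container": "ingest",
            "folderPath": "cdc"
          },
          "compression": {
            "type": "gzip",
            "level": "Optimal"
          }
        }
      }
    }
    ```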

    Monitor and Optimize Performance: Regularly monitor the performance of your pipeline using Azure Monitor and ADF pipeline monitoring tools. Identify any bottlenecks, long-running activities, or data processing issues. Optimize your pipeline accordingly based on the insights gathered from monitoring data.

    Use Optimized Data Movement: When moving data between different sources and sinks, use the optimized connectors and integration runtimes provided by ADF, and size the self-hosted integration runtime appropriately if your SQL Server is on-premises. PolyBase, for example, speeds up copies whose sink is Azure Synapse Analytics (it does not apply to a Blob sink).
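
    If the sink were Azure Synapse Analytics, enabling PolyBase would look roughly like this (a sketch only; not applicable to your Blob scenario):

    ```json
    {
      "sink": {
        "type": "SqlDWSink",
        "allowPolyBase": true,
        "polyBaseSettings": {
          "rejectType": "percentage",
          "rejectValue": 10.0,
          "rejectSampleValue": 100,
          "useTypeDefault": true
        }
      }
    }
    ```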

    Implement Data Partitioning and Filtering: If possible, partition your data based on relevant columns during extraction, such as date or region. This allows for more efficient data processing, as you can filter and process smaller subsets of data at a time.
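
    For SQL sources the copy activity also offers built-in partitioning; a sketch assuming a date column `ModifiedDate` (hypothetical) as the partition key:

    ```json
    {
      "source": {
        "type": "SqlServerSource",
        "partitionOption": "DynamicRange",
        "partitionSettings": {
          "partitionColumnName": "ModifiedDate",
          "partitionLowerBound": "2023-01-01 00:00:00",
          "partitionUpperBound": "2023-05-12 00:00:00"
        }
      }
    }
    ```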

    Provision Adequate Resources: Ensure that you have provisioned enough resources for your ADF pipeline. This includes the appropriate number of data integration units (DIUs) and integration runtime (IR) nodes based on your workload and data volume. Scaling up or out your ADF resources can improve performance and reduce processing time.
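
    Both knobs sit directly on the copy activity; a sketch with illustrative values to tune rather than recommendations:

    ```json
    {
      "name": "CopyTableToBlob",
      "type": "Copy",
      "typeProperties": {
        "source": {
          "type": "SqlServerSource"
        },
        "sink": {
          "type": "JsonSink",
          "storeSettings": {
            "type": "AzureBlobStorageWriteSettings"
          },
          "formatSettings": {
            "type": "JsonWriteSettings"
          }
        },
        "dataIntegrationUnits": 16,
        "parallelCopies": 4
      }
    }
    ```

    Note that data integration units apply only to copies running on the Azure integration runtime; on a self-hosted IR you scale by adding nodes instead.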

    By following these best practices, you can optimize the performance of your data ingestion pipelines in Azure Data Factory and achieve faster data processing and transfer times.

    2 people found this answer helpful.

  2. ShaikMaheer-MSFT 38,466 Reputation points Microsoft Employee
    2023-05-17T17:01:14.7933333+00:00

    Hi Aks,

    Thank you for posting query in Microsoft Q&A Platform.

    You can consider using the ADF CDC feature as well. Kindly check the video below for a better idea.

    CDC(change data capture) for SQL Source in Mapping data flows in Azure Data Factory or Synapse

    The CDC resource can also be considered. Kindly check the video below for a better idea.

    CDC (change data capture) Resource in Azure Data Factory

    Hope this helps. Please let me know if you have any further queries.


    Please consider hitting the Accept Answer button. Accepted answers help the community as well.

    1 person found this answer helpful.
