Low throughput and connection failures in Azure Synapse Copy Data Activity when transferring large data volumes from an on-premises SQL Server can stem from various factors, such as network limitations, integration runtime constraints, and SQL Server performance bottlenecks.
How does network bandwidth impact the Copy Data activity in Synapse?
Network bandwidth directly influences the speed and reliability of data transfer between your on-premises SQL Server and Azure Data Lake Storage (ADLS). Insufficient bandwidth can lead to slow data movement and increase the likelihood of connection timeouts or failures. Ensuring adequate network capacity is essential for optimal performance.
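As a rough illustration of why bandwidth matters, the minimum transfer time can be estimated from the data volume and the link speed. This sketch assumes the whole link is available to the copy and ignores protocol and compression overhead:

```python
# Rough estimate of transfer time for a given data volume and link speed.
# Assumes the full link is available to the copy; ignores protocol overhead.

def transfer_hours(data_gb: float, bandwidth_gbps: float) -> float:
    data_gigabits = data_gb * 8              # convert gigabytes to gigabits
    seconds = data_gigabits / bandwidth_gbps
    return seconds / 3600

# Example: moving 1 TB (1000 GB) over links of different speeds
for gbps in (0.1, 1, 10):
    print(f"{gbps:>4} Gbps -> ~{transfer_hours(1000, gbps):.1f} h")
# 0.1 Gbps -> ~22.2 h, 1 Gbps -> ~2.2 h, 10 Gbps -> ~0.2 h
```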
Is throughput dependent on the on-premises network or Integration Runtime (IR) bandwidth?
Yes, both matter. The self-hosted IR runs on a machine in your on-premises environment, so that machine's CPU, memory, and network interface, together with the bandwidth of the network path from your on-premises network to Azure, set the effective ceiling for the copy.
If using ExpressRoute, are there any data transfer limitations in Synapse?
Azure ExpressRoute has no hard data transfer limits, but your circuit's bandwidth tier (e.g., 1 Gbps, 10 Gbps) caps the maximum transfer speed; higher bandwidth tiers enable faster data movement and improve Synapse copy performance. Refer to the ExpressRoute documentation and pricing pages for the available bandwidth tiers and rates.
What are the best practices to improve throughput in Synapse pipelines?
Should batch size and parallelism be increased?
Optimizing batch size and parallelism can enhance data movement performance. Batch size is a trade-off: larger write batch sizes generally increase throughput but consume more memory on the integration runtime and the sink, while smaller batches reduce memory pressure at the cost of more round trips. Increasing the degree of parallelism lets multiple partitions of the data be copied simultaneously, which usually speeds up data movement.
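For reference, here is a minimal sketch of the relevant settings, written as a Python dict that mirrors the copy activity JSON. parallelCopies and writeBatchSize are documented copy activity and SQL sink settings; the activity name and the specific values are placeholders to tune for your workload, not recommendations.

```python
# Minimal sketch of a copy activity fragment (Python dict mirroring the activity JSON).
# "parallelCopies" sets how many copy sessions run in parallel; "writeBatchSize"
# applies when the sink is a database such as a dedicated SQL pool.
# All names and values below are illustrative placeholders.
copy_activity = {
    "name": "CopyOnPremSqlToSynapse",        # hypothetical activity name
    "type": "Copy",
    "typeProperties": {
        "source": {"type": "SqlServerSource"},
        "sink": {
            "type": "SqlDWSink",
            "writeBatchSize": 100000,         # rows written to the sink per batch
        },
        "parallelCopies": 8,                  # parallel copy sessions
    },
}
```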
Would adjusting the Integration Runtime configuration help?
Tuning the IR configuration can significantly impact performance. Allocating more Data Integration Units (DIUs) can increase throughput when the copy runs on an Azure IR; the allowed DIUs for a copy activity range from 2 to 256. When a self-hosted IR is used (as it must be for an on-premises SQL Server source), DIUs do not apply; instead, ensure the IR machine has sufficient CPU, memory, and network bandwidth, and consider scaling out to multiple IR nodes for higher concurrency.
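A minimal sketch of the DIU setting, again as a Python dict mirroring the copy activity JSON; the value shown is a placeholder:

```python
# "dataIntegrationUnits" only takes effect when the copy runs on an Azure IR.
# With a self-hosted IR (required for an on-premises SQL Server source), scale
# up the host machine or scale out to additional nodes instead.
copy_type_properties = {
    "dataIntegrationUnits": 32,   # allowed range per copy activity: 2-256
}
```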
Should staging be enabled for better performance?
Enabling a staged copy can improve performance, especially for large datasets. When the sink is Azure Synapse Analytics (dedicated SQL pool), staging the data in Blob storage or ADLS Gen2 and loading it with PolyBase or the COPY statement bulk-loads the data instead of inserting it row by row, which drastically reduces load times.
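A minimal sketch of a staged copy into a dedicated SQL pool, again as a Python dict mirroring the copy activity JSON; "MyStagingStore" and the staging path are hypothetical placeholders:

```python
# Minimal staged-copy sketch (Python dict mirroring the copy activity JSON).
# "allowCopyCommand" switches the SqlDWSink load to the COPY statement
# ("allowPolyBase" is the PolyBase equivalent). Staging names are placeholders.
copy_type_properties = {
    "sink": {
        "type": "SqlDWSink",
        "allowCopyCommand": True,            # bulk load via COPY instead of row-by-row inserts
    },
    "enableStaging": True,
    "stagingSettings": {
        "linkedServiceName": {
            "referenceName": "MyStagingStore",   # hypothetical Blob/ADLS Gen2 linked service
            "type": "LinkedServiceReference",
        },
        "path": "staging-container/copy-staging",
    },
}
```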
Are there Synapse-specific optimizations for handling large-scale data movement?
For SQL sources (on-premises SQL Server as well as Azure SQL Database), enabling source partitioning can improve read times by opening parallel connections against the source. Choose a partition column with high cardinality, such as an identity or date column; the degree of parallelism on the copy activity controls how many partitions are read concurrently.
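A minimal sketch of a partitioned SQL source, as a Python dict mirroring the source JSON; "OrderId" and the bounds are hypothetical placeholders for a high-cardinality column in your table:

```python
# Minimal partitioned-source sketch (Python dict mirroring the source JSON).
# "DynamicRange" splits reads by ranges of the partition column;
# "PhysicalPartitionsOfTable" is an alternative for physically partitioned tables.
copy_source = {
    "type": "SqlServerSource",
    "partitionOption": "DynamicRange",
    "partitionSettings": {
        "partitionColumnName": "OrderId",    # hypothetical high-cardinality column
        "partitionLowerBound": "1",
        "partitionUpperBound": "10000000",
    },
}
```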
On the consumption side, set up a cache rather than returning millions of records directly to a dashboard, and use scheduled refreshes so that parallel ad hoc queries do not drain serverless SQL pool resources.
Please refer to the Microsoft Docs below:
https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-performance
https://learn.microsoft.com/en-us/azure/expressroute/expressroute-faqs
https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-performance#troubleshoot-copy-activity-performance
Hope this helps. If this answers your query, do click Accept Answer and Yes for "Was this answer helpful". And if you have any further queries, do let us know.