How to improve extraction performance of SAP HANA Azure Data Factory Connector

Question

Luca Campeti (ICONSULTING) 0 Reputation points

Hi Experts,

With Azure Data Factory, I am running tests that read data from an SAP HANA database and store it in Synapse tables, but I am seeing disappointing performance. An example: using PolyBase, 3,492,246 rows x 184 columns are transferred end-to-end in 16 minutes, 12 of which are spent just pulling from SAP HANA and writing to the staging repository.
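To put those figures in perspective, here is a minimal sketch that turns them into average throughput. It uses only the numbers quoted above; the variable and function names are illustrative. It works out to roughly 4,850 rows per second for the extraction phase, which takes up 75% of the end-to-end time.

```python
# Figures quoted in the question above.
ROWS = 3_492_246
COLUMNS = 184
TOTAL_MINUTES = 16      # end-to-end copy duration
EXTRACT_MINUTES = 12    # SAP HANA -> staging phase only


def rows_per_second(rows: int, minutes: float) -> float:
    """Average row throughput for a phase that lasted `minutes`."""
    return rows / (minutes * 60)


extract_rate = rows_per_second(ROWS, EXTRACT_MINUTES)
end_to_end_rate = rows_per_second(ROWS, TOTAL_MINUTES)
extract_share = EXTRACT_MINUTES / TOTAL_MINUTES
cells_per_second = extract_rate * COLUMNS  # column values moved per second

print(f"extraction:  {extract_rate:,.0f} rows/s ({cells_per_second:,.0f} values/s)")
print(f"end-to-end:  {end_to_end_rate:,.0f} rows/s")
print(f"extraction share of total time: {extract_share:.0%}")
```

Tracking a rate like this per test run makes it easier to compare parameter combinations than comparing wall-clock times alone.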

Note that the "Physical partitions of table" option is enabled and the SHIR runs on a fully dedicated Standard D4s v5 VM (4 vCPUs, 16 GiB memory), with a limit of 16 concurrent jobs.

I tried many parameter combinations:

Increase Packet size (KB) up to 20960
Increase the maximum data integration units
Increase the degree of copy parallelism
Increase the SHIR concurrent jobs limit
Disable performance metrics analytics

but the final result is almost always the same; in fact, sometimes it gets worse.

I also noticed that the maximum number of open connections to SAP HANA is always 4. Likewise, the maximum number of used DIUs in the "Blob Storage -> Synapse Analytics" transfer is always 2, and the number of used parallel copies is always 1.

Do you have any idea what could cause such poor performance? Could it depend on the SHIR VM (although I have never seen it struggle in terms of CPU or RAM during the flows)? What can I try in order to investigate further? Or are my expectations simply too high?

Thank you very much in advance for your feedback

Luca

phemanth 15,765 Reputation points Microsoft External Staff Moderator

2024-01-23T10:17:56.5633333+00:00
@Luca Campeti (ICONSULTING)

Thanks for the question and for using the MS Q&A platform.

Thank you for sharing your considerations. It’s good to know that you are comparing the performance of Azure Data Factory with SAP Data Services jobs. Based on your description, the ADF time for the first extraction phase alone (SAP HANA source --> staging container in Azure Data Lake Gen 2) is much longer than the entire DS job elapsed time (SAP HANA source --> SAP HANA destination). One suggestion is to try increasing the parallelism of the copy activity. By default, the parallel copy is set to 4 when copying data from partition-option-enabled data stores like SAP HANA. You can also try scaling the SHIR out (more nodes) or up (a larger VM size). However, note that too many parallel copies may even hurt performance, so tuning the parallel copies gradually may be a better approach.

Regarding the SHIR VM size, you can refer to the Microsoft documentation on performance and troubleshooting for SAP data extraction. It suggests recording, for each test cycle, the SHIR VM size, the degree of copy parallelism, and the number of partitions. Observe the performance of the SHIR VM, the performance of the source SAP system, and the desired vs. actual degree of parallelism, and use an iterative process to identify the optimum settings and the ideal SHIR VM size. Regarding the Copy Activity config, you can try the following:

Increase the packet size (KB) up to 20960.

Increase the maximum data integration units.

Disable performance metrics analytics.

Gradually tune the parallel copies.
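As a rough illustration of where these settings live, the sketch below builds a Copy activity definition as a Python dict and prints it as JSON. The property names (`SapHanaSource`, `partitionOption`, `packetSize`, `parallelCopies`, `dataIntegrationUnits`, `enableStaging`, `SqlDWSink`, `allowPolyBase`) are recalled from the ADF connector documentation and should be checked against it. The activity, dataset, and linked service names are hypothetical placeholders. The "performance metrics analytics" toggle is not modeled here.

```python
import json
from typing import Optional


def build_copy_activity(parallel_copies: int,
                        packet_size_kb: int,
                        data_integration_units: Optional[int] = None) -> dict:
    """Sketch of a staged SAP HANA -> Synapse Copy activity (names are placeholders)."""
    type_properties = {
        "source": {
            "type": "SapHanaSource",
            # Matches the "Physical partitions of table" option in the UI.
            "partitionOption": "PhysicalPartitionsOfTable",
            "packetSize": packet_size_kb,  # network packet size in KB
        },
        "sink": {"type": "SqlDWSink", "allowPolyBase": True},
        "enableStaging": True,
        "stagingSettings": {
            "linkedServiceName": {
                "referenceName": "StagingAdlsGen2",  # hypothetical linked service
                "type": "LinkedServiceReference",
            },
            "path": "staging",  # hypothetical container/path
        },
        "parallelCopies": parallel_copies,
    }
    if data_integration_units is not None:
        # Left out when None so the service keeps its "Auto" behaviour.
        type_properties["dataIntegrationUnits"] = data_integration_units
    return {
        "name": "CopySapHanaToSynapse",  # hypothetical activity name
        "type": "Copy",
        "inputs": [{"referenceName": "SapHanaTable", "type": "DatasetReference"}],
        "outputs": [{"referenceName": "SynapseTable", "type": "DatasetReference"}],
        "typeProperties": type_properties,
    }


if __name__ == "__main__":
    # Example: start from the default parallelism of 4 and double it for the next test.
    print(json.dumps(build_copy_activity(parallel_copies=8, packet_size_kb=20960), indent=2))
```

Generating the definition from code like this is only one way to keep test runs reproducible; editing the same properties in the pipeline JSON view would have the same effect.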

I hope this helps you investigate further and improve the performance of your data transfer. Let us know if you have any other questions or concerns.
phemanth 15,765 Reputation points Microsoft External Staff Moderator

2024-01-24T09:05:14.76+00:00

@Luca Campeti (ICONSULTING) We haven’t heard from you since the last response and are just checking back to see whether you have a resolution yet. If you do, please share it with the community, as it can be helpful to others. Otherwise, reply with more details and we will try to help.

1 answer

Answer 1

Amira Bedhiafi 34,101 Reputation points Volunteer Moderator

Investigate whether the SAP HANA database itself is a bottleneck. This could be due to query performance, indexes, or how data is structured and stored in SAP HANA. Optimizing queries or adding indexes might help.

The Standard D4s v5 VM used for the SHIR might not be sufficient for your workload. Even though you have not observed CPU or RAM issues, it's possible that the VM is not powerful enough for the data volume and complexity. Consider upgrading to a more powerful VM.

You mentioned that the maximum number of open connections to SAP HANA is always 4 and the maximum number of used DIUs is 2. This limitation could be due to configuration settings in Azure Data Factory or limits within SAP HANA. Investigate whether any configurations or limits can be adjusted to increase these numbers.

The copy activity's performance in Azure Data Factory can be optimized by fine-tuning settings such as batch size, parallel copy operations, and retry policies. Experimenting with different configurations might yield better results.

Increasing DIUs is a good approach, but it needs to be balanced against the capabilities of the source and destination systems. Overallocating DIUs can lead to underutilization and bottlenecks elsewhere.

Luca Campeti (ICONSULTING) 0 Reputation points

2024-01-22T09:08:22.12+00:00
Hi Amira,

thank you for your reply, let me share some considerations below:

I am comparing the Azure Data Factory extraction performance against the SAP Data Services jobs that the customer currently has in place. I am connected to the same data sources (tables) and extracting the same amount of data, but DS writes to another SAP HANA database (different from the source one), while Azure Data Factory stores data in a Synapse Dedicated SQL Pool. Even so, the ADF time for the first extraction phase alone (SAP HANA source --> staging container in the Azure Data Lake Gen 2) is consistently about 6x the elapsed time of the entire DS job (SAP HANA source --> SAP HANA destination).

OK, I will try a more powerful SHIR VM. Do you have any suggestions for identifying an adequate size? So far, to limit costs for the customer, I have only changed the number of concurrent SHIR jobs, but I can ask for a resize.

I started my tests with the default Copy Activity values and the "Auto" selection for the DIUs and the copy parallelism. Then I adjusted all the parameters I already mentioned above, running a large number of individual tests to understand how to increase performance, but the situation practically never changed: the times stayed the same or even got worse. What do you suggest modifying in the Copy Activity config?

Thank you so much for your time.

Have a nice day!
Amira Bedhiafi 34,101 Reputation points Volunteer Moderator

2024-01-23T10:47:51.01+00:00

To identify an appropriate SHIR VM size, consider the volume and complexity of the data you're processing. A good starting point is to analyze the current performance metrics of your SHIR VM, especially during peak loads. If CPU or memory usage is consistently high, a more powerful VM may be needed. Azure offers a range of VM sizes, and you may want to consider one with more CPU and memory resources. For data-intensive tasks, memory-optimized or compute-optimized VMs might be more suitable. Also keep in mind the network bandwidth of the VM, as this can be a limiting factor in data transfer tasks.

Since the 'Auto' selection didn't bring any improvement, try adjusting the DIUs and parallel copy operations manually. Just be careful: increasing these settings can lead to resource contention if the source or destination systems cannot handle the increased load.

When it comes to batch sizes, smaller batches can sometimes be processed more efficiently, especially if there are network or memory constraints.
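One way to keep the iterative tuning suggested in this thread comparable from run to run is to log each test cycle with the same fields and rank the runs by throughput. This is a minimal sketch: the `TuningRun` record and `rank_runs` helper are hypothetical names, and the only run shown is the one reported in the question (fields that the thread does not state are left as `None`).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TuningRun:
    """One test cycle of the copy activity (illustrative structure)."""
    shir_vm_size: str
    rows: int
    extract_seconds: float
    requested_parallel_copies: Optional[int] = None  # desired parallelism
    observed_connections: Optional[int] = None       # actual parallelism seen on SAP HANA
    partitions: Optional[int] = None                 # physical partitions of the table

    @property
    def rows_per_second(self) -> float:
        return self.rows / self.extract_seconds


def rank_runs(runs: List[TuningRun]) -> List[TuningRun]:
    """Return the runs ordered from fastest to slowest extraction."""
    return sorted(runs, key=lambda run: run.rows_per_second, reverse=True)


# Baseline taken from the question: 3,492,246 rows extracted in 12 minutes
# on a Standard D4s v5 SHIR, with 4 open connections observed.
baseline = TuningRun(
    shir_vm_size="Standard D4s v5",
    rows=3_492_246,
    extract_seconds=12 * 60,
    observed_connections=4,
)

if __name__ == "__main__":
    for run in rank_runs([baseline]):
        print(f"{run.shir_vm_size}: {run.rows_per_second:,.0f} rows/s, "
              f"connections={run.observed_connections}")
```

Appending one `TuningRun` per test, changing a single setting at a time, shows whether a requested increase in parallelism is actually reflected in the observed connections.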
