ADF throughput increase from Oracle database in AWS

Sabyasachi Dash 21 Reputation points
2020-12-01T09:52:50.973+00:00

Help needed! We want to increase the speed of data transfer between an Oracle database on AWS and Azure Data Lake Storage (ADLS). We have an Oracle database hosted in an AWS private cloud with a large amount of data, and we need to transfer that data to ADLS in the Azure cloud. A 1 Gbps virtual network gateway connects the two cloud environments, and we are using ADF pipelines to copy the data from source to sink. The transfer speed we are getting is between 400 KB/s and 5 MB/s, which is low considering the 1 Gbps connection. We need to increase the transfer speed to shorten the time the copy takes.


2 answers

  1. MartinJaffer-MSFT 26,086 Reputation points
    2020-12-01T20:44:45.603+00:00

    Hello @Sabyasachi Dash and welcome to Microsoft Q&A.

    There are a number of things that can be done to increase throughput. In your case, I suspect the low speed is caused by a low degree of parallelism (1) and/or not using a partitioned copy.

    I am not overly familiar with the AWS ecosystem, but it sounds like you have an established connection and so are using the Azure Integration Runtime rather than a Self-Hosted Integration Runtime. Please correct me if this is not the case.
    If you are using a Self-Hosted Integration Runtime, you could try adding more nodes and/or scaling up the nodes, in addition to the following recommendations.

    There are optimizations which are specific to Oracle, and there are generic optimizations. There is some interplay between them.

    First, the generic optimizations. In the Copy Activity > Settings you can find Data Integration Units and Degree of copy parallelism. Data Integration Units are a measure of how much compute "power" is assigned to the task. Degree of copy parallelism is how many processes are allowed to read/write simultaneously. More of either generally means higher throughput.
    [Screenshot: Copy Activity Settings tab showing the Data Integration Unit and Degree of copy parallelism options]
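
    To make these concrete, here is a minimal sketch of where the two settings live in the copy activity JSON definition (the values are illustrative, not recommendations):

    ```json
    {
        "name": "CopyOracleToADLS",
        "type": "Copy",
        "typeProperties": {
            "source": { "type": "OracleSource" },
            "sink": { "type": "DelimitedTextSink" },
            "parallelCopies": 8,
            "dataIntegrationUnits": 32
        }
    }
    ```

    Note that dataIntegrationUnits applies only when the copy runs on an Azure Integration Runtime; on a Self-Hosted Integration Runtime, throughput is governed by the node's own resources and by parallelCopies.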

    In the Copy Data > Sink (for both Azure Data Lake Storage Gen1 and Gen2) there is Max concurrent connections, which is how many simultaneous writers you will allow. More is better, though the default is already 'no limit', so this usually only matters if a limit has been set.
    Depending upon the type of dataset, more options may be available (e.g. delimited text vs. binary).
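
    If you do need to set it explicitly, Max concurrent connections lives under the sink's store settings; a sketch for an ADLS Gen2 sink (the value is illustrative):

    ```json
    "sink": {
        "type": "DelimitedTextSink",
        "storeSettings": {
            "type": "AzureBlobFSWriteSettings",
            "maxConcurrentConnections": 16
        }
    }
    ```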

    For the optimizations specific to Oracle, there are Partition options. If you are currently using None, that would explain the low throughput. Whether you should select Physical partitions of table or Dynamic range depends on the particulars of your situation. Please see the article Parallel copy from Oracle for details on determining which is right for you; it also explains how the aforementioned Degree of copy parallelism interacts with the partition options.
    [Screenshot: Oracle source Partition options setting (None, Physical partitions of table, Dynamic range)]
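
    As a sketch, a dynamic-range partitioned Oracle source looks roughly like the following in the copy activity JSON; MYTABLE and the ID column and bounds are placeholders you would replace with your own table and partition column:

    ```json
    "source": {
        "type": "OracleSource",
        "query": "SELECT * FROM MYTABLE WHERE ?AdfRangePartitionColumnName >= ?AdfRangePartitionLowbound AND ?AdfRangePartitionColumnName <= ?AdfRangePartitionUpbound",
        "partitionOption": "DynamicRange",
        "partitionSettings": {
            "partitionColumnName": "ID",
            "partitionLowerBound": "1",
            "partitionUpperBound": "1000000"
        }
    }
    ```

    At run time the service substitutes the ?AdfRangePartition… placeholders for each range and issues the resulting queries in parallel, up to the Degree of copy parallelism.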

    Please let me know if this helps you at all. Thank you for your patience.
    Martin


  2. Sabyasachi Dash 21 Reputation points
    2020-12-02T09:49:00.403+00:00

    Hey @MartinJaffer-MSFT,

    1. We are using a Self-Hosted IR.
    2. Initially our pipelines used no partition option (about 1.1 MB/s throughput); later we tested the physical and dynamic range partition options to try to increase throughput. The result was no change in speed for physical partitioning and roughly a doubling of throughput for dynamic range (approximately 2.2 MB/s). The physical partitioning (3 partitions in total) is based on geography, and it did not help because one partition alone contains 90% of the data (see the sketch after this list).
    3. We are using a 1 Gbps virtual network gateway, and the VM is a DS12 v2 (4 vCPUs, 28 GiB memory). We tested network bandwidth with the Microsoft team and the source team using iperf; the throughput obtained for a single thread was about 35 MB/s. So we would expect more bandwidth given the network and VM configuration.
    4. We tried Degree of copy parallelism and it did not yield any improvement.
    5. DIU is not applicable for us because of the Self-Hosted IR.
    6. We tried Max concurrent connections as well; it did not help.
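
    For reference, our dynamic range test used a source definition roughly like the one below (table and column names anonymized, bounds illustrative). As we understand it, with Dynamic range the parallelCopies setting controls how many ranges are read from Oracle at once, so the single large geography partition should be split across parallel readers:

    ```json
    "typeProperties": {
        "source": {
            "type": "OracleSource",
            "partitionOption": "DynamicRange",
            "partitionSettings": {
                "partitionColumnName": "RECORD_ID",
                "partitionLowerBound": "1",
                "partitionUpperBound": "50000000"
            }
        },
        "sink": { "type": "DelimitedTextSink" },
        "parallelCopies": 16
    }
    ```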

    When we look at the statistics for the copy activity, it is clearly visible that reading from the source is the time-consuming part. For 50 GB of Oracle source data at 1.1 MB/s, it takes around 14 hours.

    Our goal is to increase the throughput of a single copy activity to 30+ MB/s so that we can copy this huge amount of data within the stipulated timeframe.

