Data ingestion using Copy data activity (from Oracle to ADLS Gen2)

Anandhakumar Cholendran 45 Reputation points
2023-07-28T07:47:47.67+00:00

Hello,

I am using a query to ingest data from Oracle into ADLS Gen2 (as Parquet files) with the Copy activity. The query returns about 5.2 billion rows. When I run the Copy activity, the query keeps executing in Oracle for more than 10 hours, and no data is ingested into the sink (ADLS Gen2). Can you please suggest what approach should be followed to handle data at this scale?

Thanks in advance :)

Azure Data Lake Storage
Azure Data Factory

Accepted answer
  1. KranthiPakala-MSFT 46,642 Reputation points Microsoft Employee Moderator
    2023-07-28T22:19:31.08+00:00

    @Anandhakumar Cholendran Welcome to Microsoft Q&A forum and thanks for reaching out here.

    Seems like you are trying to copy a very large amount of data, which is why it is taking so long. I'm not sure how your Oracle source connector is configured, but the ADF Oracle connector provides built-in data partitioning to copy data from Oracle in parallel. This is the recommended way to copy large amounts of data from Oracle efficiently, as it runs parallel queries against your Oracle source and loads the data partition by partition. You can find these data partitioning options on the Source tab of the Copy activity, as shown below.

    Screenshot of partition options
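
    For instance, if your Oracle table already has physical partitions, the source query can reference the connector's built-in partition-name parameter so that each parallel copy reads one partition. Below is a minimal sketch; `?AdfTabularSourceTablePartitionName` is the parameter documented for the Oracle connector's "Physical partitions of table" option, while the table name and filter are hypothetical placeholders:

    ```sql
    -- ADF substitutes each physical partition name into this query and runs
    -- the resulting per-partition queries in parallel.
    SELECT *
    FROM   MY_SOURCE_TABLE PARTITION("?AdfTabularSourceTablePartitionName")
    WHERE  LOAD_DATE >= DATE '2023-01-01'  -- any additional filter you already apply
    ```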

    The following are suggested configurations for different scenarios. When copying data into a file-based data store, it is recommended to write to a folder as multiple files (specify only a folder name, not a file name), since this performs better than writing to a single file.

    Screenshot of suggested partition configurations for different scenarios

    When copying data from a non-partitioned table, you can use the "Dynamic range" partition option to partition on an integer column. If your source data doesn't have such a column, you can use the ORA_HASH function in the source query to generate one and use it as the partition column, as sketched below.
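
    A minimal sketch of that ORA_HASH approach follows; the `?AdfRangePartition...` parameters are the ones documented for the Oracle connector's dynamic range option, and the table name, column alias, and bucket count are hypothetical placeholders you would adapt:

    ```sql
    -- ORA_HASH maps every row to an integer bucket 0..49. Configure the copy
    -- activity's dynamic range partition column as ADF_PART_KEY with lower
    -- bound 0 and upper bound 49, so each parallel copy reads a bucket range.
    SELECT t.*
    FROM (
        SELECT s.*,
               ORA_HASH(s.ROWID, 49) AS ADF_PART_KEY
        FROM   MY_SOURCE_TABLE s
    ) t
    WHERE ?AdfRangePartitionColumnName >= ?AdfRangePartitionLowboundValue
      AND ?AdfRangePartitionColumnName <= ?AdfRangePartitionUpboundValue
    ```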

    In addition, I suggest going through the copy activity performance optimization guide to improve the performance of your Copy activity: Copy activity performance optimization features


    If you find that Copy activity performance is still slow, I encourage you to explore Azure Data Factory mapping data flows, which are executed as activities within Azure Data Factory pipelines on scaled-out Apache Spark clusters. Because they run on Spark clusters, they can be very performant compared to regular pipeline activities.

    Using a mapping data flow, you can configure data partitioning both when reading the data and when writing it to your sink store.

    Screenshot of mapping data flow partition settings

    Hope this info helps. Do let me know if you have any questions.


    Please don’t forget to "Accept Answer" and mark "Yes" for "was this answer helpful" wherever the information provided helps you; this can be beneficial to other community members.


0 additional answers
