Data migration optimization from Azure data lake to SQL Databases

Samuel Pastrana Bedoya 146 Reputation points
2022-08-03T13:27:40.193+00:00

I have about fifteen million JSON records saved as individual files in a single folder. I would like to know the best configuration for the Copy data activity I am using, in order to optimize this migration. From what I have read, the configuration can vary depending on the number of files to process and migrate, so I don't know what the best configuration would be for fifteen million records.
[Screenshot: Copy data activity configuration]

This is the configuration I have for the Copy data activity. If I can change something to optimize the process, I would appreciate any advice or suggestions on what to modify. Processing is from Data Lake to SQL and from Data Lake to Cosmos DB.

Thank you very much.

Azure SQL Database
Azure Data Lake Storage
An Azure service that provides an enterprise-wide hyper-scale repository for big data analytic workloads and is integrated with Azure Blob Storage.
Azure Cosmos DB
An Azure NoSQL database service for app development.
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

Answer accepted by question author
  1. AnnuKumari-MSFT 34,566 Reputation points Microsoft Employee Moderator
    2022-08-04T07:06:17.923+00:00

    Hi @Samuel Pastrana Bedoya,

    Thank you for using the Microsoft Q&A platform and thanks for posting your question.

    As I understand your ask, you want to optimize the process of data migration from ADLS to Azure SQL Database, for which you are using the Copy activity in an Azure Data Factory pipeline. Please let me know if my understanding is incorrect.

    The allowed range of DIUs (Data Integration Units) to empower a copy activity run is 2 to 256. However, ADF dynamically applies the optimal DIU setting based on your source-sink pair and data pattern.

    You can consider increasing the DIU setting up to 256. However, the effective DIUs are decided at runtime based on the data stores involved. It depends on various factors, such as whether the copy involves a single file or multiple files, and whether it runs between a file store and a non-file store. Depending on all these factors, DIUs will be assigned.
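
    As a quick sketch of where this setting lives, dataIntegrationUnits goes under the copy activity's typeProperties in the pipeline JSON. The activity and dataset names below (CopyJsonToSql, AdlsJsonFiles, SqlTargetTable) are hypothetical placeholders, and the example assumes an ADLS Gen2 JSON source and an Azure SQL Database sink:

    ```json
    {
        "name": "CopyJsonToSql",
        "type": "Copy",
        "inputs": [ { "referenceName": "AdlsJsonFiles", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "SqlTargetTable", "type": "DatasetReference" } ],
        "typeProperties": {
            "source": {
                "type": "JsonSource",
                "storeSettings": {
                    "type": "AzureBlobFSReadSettings",
                    "recursive": true,
                    "wildcardFileName": "*.json"
                }
            },
            "sink": { "type": "AzureSqlSink" },
            "dataIntegrationUnits": 256
        }
    }
    ```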


    For more details, check Data Integration Units.

    Now coming to the degree of parallelism: when you specify a value for the parallelCopies property, take the load increase on your source and sink data stores into account. Also consider the load increase on the self-hosted integration runtime if the copy activity is empowered by one.

    This load increase happens especially when you have multiple activities or concurrent runs of the same activities that run against the same data store.

    If you notice that either the data store or the self-hosted integration runtime is overwhelmed with the load, decrease the parallelCopies value to relieve the load.
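
    For illustration, parallelCopies sits alongside dataIntegrationUnits in the same typeProperties block. The values below (32 parallel copies, a writeBatchSize of 10000 on the SQL sink) are assumptions chosen to show the shape of the settings, not tuned recommendations; omitting parallelCopies lets the service determine it automatically:

    ```json
    "typeProperties": {
        "source": { "type": "JsonSource" },
        "sink": {
            "type": "AzureSqlSink",
            "writeBatchSize": 10000
        },
        "dataIntegrationUnits": 256,
        "parallelCopies": 32
    }
    ```

    If the SQL sink turns out to be the bottleneck, lowering parallelCopies or the batch size usually helps more than raising DIUs further.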

    For more information, please check: Performance tuning steps

    Hope this helps. Please let us know if you have any further queries.

