Best tool to copy a huge amount of data from on-prem SMB or NFS to Azure Blob using SHIR

Anonymous
2024-04-03T21:07:24.01+00:00

I would like to copy a huge amount of data (the initial load starts at 222 GB) from an on-prem SMB or NFS source to Azure Blob. I ran a test using ADF with a SHIR and the copy took one hour and 40 minutes. The same test using AzCopy took 21 minutes.
I know ADF and AzCopy are different tools, and that ADF is optimized not for raw speed but for resiliency and data transformation. I would like to know what the best approach would be. Should I try Databricks? Or some specific configuration for SMB or NFS?
Thanks in advance.


Accepted answer
    Amira Bedhiafi 33,866 Reputation points Volunteer Moderator
    2024-04-03T23:42:21.8666667+00:00

    I believe there are multiple approaches you can consider to optimize the process, because each tool or service has its strengths, and the choice often depends on specific requirements such as speed, cost, ease of setup, and features like data transformation or integration capabilities.

    1st case: Azure Data Factory (ADF) with SHIR

    You can optimize data transfer performance in ADF by tweaking the parallel copies setting and batch sizes. Note that DIUs only take effect when the copy runs on the Azure integration runtime; with a SHIR, throughput is largely bound by the SHIR machine itself, so consider dedicating a well-sized machine to it. The relevant settings are sketched below.
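
    As an illustration, here is a minimal sketch of the throughput-related Copy activity settings, written as a Python dict mirroring the pipeline JSON. The property names (`parallelCopies`, `dataIntegrationUnits`) are real Copy activity settings; the activity name and values are placeholders, not recommendations.

    ```python
    # Sketch of the throughput-related settings on an ADF Copy activity,
    # expressed as a Python dict mirroring the pipeline JSON.
    copy_activity = {
        "name": "CopyOnPremToBlob",  # hypothetical activity name
        "type": "Copy",
        "typeProperties": {
            "source": {"type": "FileSystemSource", "recursive": True},
            "sink": {"type": "BlobSink"},
            # Number of parallel reads/writes within one copy run; raising
            # this often helps with many small files over SMB/NFS.
            "parallelCopies": 16,
            # DIUs only apply on the Azure integration runtime; with a SHIR,
            # size the SHIR machine (CPU, memory, NIC) instead.
            "dataIntegrationUnits": 8,
        },
    }
    ```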

    For ongoing synchronization beyond the initial load, consider implementing incremental loads in ADF, where you transfer only new or changed data after the initial full load. This reduces the volume of data to be moved and the time required for subsequent transfers; the watermark pattern is sketched below.
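
    For illustration, a minimal sketch of the watermark pattern, assuming the SMB/NFS share is mounted locally; the paths, container name, and watermark file are placeholders. (Inside ADF itself you would keep the watermark in a control table or variable and filter files by modified time instead.)

    ```python
    import os
    import json
    from datetime import datetime, timezone
    from azure.storage.blob import BlobServiceClient

    SOURCE_ROOT = "/mnt/onprem_share"   # hypothetical local mount of the share
    WATERMARK_FILE = "watermark.json"   # stores the last successful run time
    CONTAINER = "landing"               # hypothetical destination container

    def load_watermark() -> float:
        try:
            with open(WATERMARK_FILE) as f:
                return json.load(f)["last_run_epoch"]
        except FileNotFoundError:
            return 0.0  # first run: copy everything

    def run_incremental(conn_str: str) -> None:
        service = BlobServiceClient.from_connection_string(conn_str)
        container = service.get_container_client(CONTAINER)
        watermark = load_watermark()
        run_started = datetime.now(timezone.utc).timestamp()

        for dirpath, _, filenames in os.walk(SOURCE_ROOT):
            for name in filenames:
                path = os.path.join(dirpath, name)
                # Only upload files modified since the previous run.
                if os.path.getmtime(path) > watermark:
                    blob_name = os.path.relpath(path, SOURCE_ROOT)
                    with open(path, "rb") as data:
                        container.upload_blob(blob_name, data, overwrite=True)

        with open(WATERMARK_FILE, "w") as f:
            json.dump({"last_run_epoch": run_started}, f)
    ```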

    2nd case: AzCopy

    As you've noted, AzCopy is faster for moving data from point A to point B, simply because it is optimized for speed and efficiency in data transfer scenarios; it is a great choice for bulk migrations that do not need transformation.

    AzCopy can also be integrated into scripts for automation, giving you scheduling and incremental copy operations (via azcopy sync), as sketched below.
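
    A minimal sketch of wrapping AzCopy in a script, assuming azcopy is on PATH; the source path and SAS URL are placeholders. `azcopy sync` transfers only new or changed files, which covers the incremental case.

    ```python
    import subprocess

    # Minimal sketch: drive AzCopy from a script for repeatable runs.
    def sync_share_to_blob(source_dir: str, dest_sas_url: str) -> None:
        subprocess.run(
            [
                "azcopy", "sync",
                source_dir,      # e.g. a locally mounted SMB/NFS path
                dest_sas_url,    # container URL with a SAS token appended
                "--recursive",
            ],
            check=True,  # raise if azcopy exits non-zero
        )

    # Concurrency can be tuned with the AZCOPY_CONCURRENCY_VALUE
    # environment variable if the defaults do not saturate your link.
    ```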

    3rd case: Azure Databricks

    If your data transfer needs include transformation, cleansing, or other forms of processing, Azure Databricks could be a powerful tool. Databricks can handle large datasets efficiently and can write data into Azure Blob Storage quickly.

    Also, it offers massive scalability, which can be a strong point if your data processing needs grow or fluctuate over time.
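
    For illustration, a minimal PySpark sketch of the transform-then-write pattern, assuming the data has already landed in a staging container Spark can read; the account, container, and column names are placeholders.

    ```python
    # Minimal Databricks sketch: read staged data, apply example cleansing,
    # write Parquet to Blob/ADLS. `spark` is predefined in a Databricks
    # notebook; account/container names and the "id" column are placeholders.
    df = (
        spark.read
        .option("header", "true")
        .csv("abfss://staging@<account>.dfs.core.windows.net/raw/")
    )

    # Example cleansing step: drop duplicates and rows missing a key column.
    cleaned = df.dropDuplicates().na.drop(subset=["id"])

    (
        cleaned.write
        .mode("overwrite")
        .parquet("abfss://curated@<account>.dfs.core.windows.net/dataset/")
    )
    ```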

    4th case: Specific Configurations for SMB/NFS

    First, verify that your on-premises network is optimized for the transfer. This might involve configuring your network for high throughput or ensuring a direct, fast connection to Azure.

    Regardless of the tool, running parallel copies (where possible) can significantly reduce transfer time, though this requires careful planning to avoid overloading your network or source file systems; a sketch follows.
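
    As an illustration, a minimal sketch of client-side parallel uploads with a thread pool, assuming a locally mounted share; MAX_WORKERS is the knob to balance against network and source load, and all names are placeholders.

    ```python
    import os
    from concurrent.futures import ThreadPoolExecutor
    from azure.storage.blob import BlobServiceClient

    MAX_WORKERS = 8  # tune against network and source file system load

    def upload_one(container, root: str, path: str) -> None:
        # Preserve the directory layout as the blob name.
        blob_name = os.path.relpath(path, root)
        with open(path, "rb") as data:
            container.upload_blob(blob_name, data, overwrite=True)

    def parallel_upload(conn_str: str, root: str, container_name: str) -> None:
        service = BlobServiceClient.from_connection_string(conn_str)
        container = service.get_container_client(container_name)
        files = [
            os.path.join(dirpath, name)
            for dirpath, _, names in os.walk(root)
            for name in names
        ]
        # Azure SDK clients are thread-safe, so one container client is shared.
        with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
            futures = [pool.submit(upload_one, container, root, f) for f in files]
            for fut in futures:
                fut.result()  # surface any upload errors
    ```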

    Why not consider Azure Files?

    For a hybrid approach, Azure File Sync can synchronize your on-premises file servers with Azure Files (and you can subsequently move the data to Blob Storage if needed). It offers a good balance of cloud integration and local performance/accessibility.
