Migrate data from on-prem SQL Server 2012 to Data Lake

Burra, Ashok 21 Reputation points
2021-01-03T03:14:57.86+00:00

Hi Team,

We have a single table of about 6 TB (500 million XML messages) with data in the form of two columns: (XML, int).

We are in the process of migrating our application and database to Microsoft Azure. One line of thinking is to migrate the database to an elastic pool. However, one particular table is more than 5.5 TB in size, with 500 million XML messages.

So we decided to move that table to ADLS and move the rest of the relational data to Azure SQL Database (database as a service) in an elastic pool of databases.

Our major concern is moving the data.
We followed the approach of dumping the XML message data to local disk with a custom C# program, as individual XML files named by message ID, and moving them to ADLS (Azure Data Lake Storage) by zipping the files and unzipping them in Azure.

Our custom program takes almost 6 hours to dump 1 million messages to local disk, zip them, and move them to the Azure environment.
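For illustration, a minimal sketch of this kind of dump loop might look like the following; the table, column, ID range, and output path (dbo.Messages, MessageId, MessageXml, D:\xml-export) are hypothetical placeholders rather than our actual schema:

```csharp
// Sketch only: export one batch of messages to local disk as individual XML files.
using System;
using System.IO;
using Microsoft.Data.SqlClient;

class XmlDumper
{
    static void Main()
    {
        var connectionString = Environment.GetEnvironmentVariable("SOURCE_SQL_CONNECTION");
        var outputRoot = @"D:\xml-export";          // placeholder local staging folder
        Directory.CreateDirectory(outputRoot);

        using var connection = new SqlConnection(connectionString);
        connection.Open();

        // Hypothetical table dbo.Messages(MessageId int, MessageXml xml); the XML column
        // is converted to nvarchar(max) so it can be read back as a plain string.
        using var command = new SqlCommand(
            @"SELECT MessageId, CONVERT(nvarchar(max), MessageXml)
              FROM dbo.Messages
              WHERE MessageId BETWEEN @from AND @to", connection);
        command.Parameters.AddWithValue("@from", 1);
        command.Parameters.AddWithValue("@to", 1_000_000);

        using var reader = command.ExecuteReader();
        while (reader.Read())
        {
            var id = reader.GetInt32(0);
            var xml = reader.GetString(1);
            // One file per message, named by its message ID.
            File.WriteAllText(Path.Combine(outputRoot, $"{id}.xml"), xml);
        }
    }
}
```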

However, we are unable to unzip the files in Azure: our unzip module runs as an Azure Function and times out before fully unzipping.

Even if we fine-tune the process to reduce the zip size and produce multiple comfortably sized zips, the whole migration to ADLS is estimated to take close to 360 days.

There are operational overheads in using the Azure Data Box Disk service, so we are left with moving the data via a migration program only.

Any inputs and directions on moving data faster from on-prem SQL Server to ADLS would be highly appreciated.

Regards


Accepted answer
  1. JayaC-MSFT 5,606 Reputation points
    2021-01-12T06:28:32.097+00:00

    Hello @Burra, Ashok, I believe you will be executing an Azure Function to extract data from the SQL database, process it as per your requirement, and write it to ADLS (please correct me if I haven't understood it correctly).
    Here I am sharing a couple of examples:

    1. To extract data:

    https://learn.microsoft.com/en-us/azure/azure-functions/functions-scenario-database-table-cleanup

    https://randypaulo.com/2018/12/19/querying-azure-sql-database-using-azure-functions-2-0-to-return-json-data/

    2. To write data to ADLS:

    https://github.com/Azure/azure-functions-datalake-extension

    https://github.com/Azure/azure-functions-datalake-extension/blob/master/samples/DataLakeExtensionSamples/OutputFromBlob.cs
    [These are samples; you may need to modify your code based on your requirement with the help of these documents.]
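    For illustration, here is a minimal combined sketch of one such function: it reads a batch of rows from Azure SQL with Microsoft.Data.SqlClient and writes each XML payload as a file in ADLS Gen2 using the Azure.Storage.Files.DataLake SDK (rather than the binding extension linked above). The table, columns, and app setting names (SqlConnection, DataLakeConnection, DataLakeFileSystem) are hypothetical placeholders, so you would adapt them to your schema.

    ```csharp
    using System;
    using System.IO;
    using System.Text;
    using System.Threading.Tasks;
    using Azure.Storage.Files.DataLake;
    using Microsoft.AspNetCore.Http;
    using Microsoft.AspNetCore.Mvc;
    using Microsoft.Azure.WebJobs;
    using Microsoft.Azure.WebJobs.Extensions.Http;
    using Microsoft.Data.SqlClient;

    public static class ExportBatch
    {
        [FunctionName("ExportBatch")]
        public static async Task<IActionResult> Run(
            [HttpTrigger(AuthorizationLevel.Function, "post")] HttpRequest req)
        {
            // ADLS Gen2 file system client built from app settings (placeholder names).
            var fileSystem = new DataLakeServiceClient(
                    Environment.GetEnvironmentVariable("DataLakeConnection"))
                .GetFileSystemClient(Environment.GetEnvironmentVariable("DataLakeFileSystem"));

            using var connection = new SqlConnection(
                Environment.GetEnvironmentVariable("SqlConnection"));
            await connection.OpenAsync();

            // Hypothetical table dbo.Messages(MessageId bigint, MessageXml xml); the XML
            // column is converted to nvarchar(max) so it can be read back as a string.
            using var command = new SqlCommand(
                @"SELECT TOP (10000) MessageId, CONVERT(nvarchar(max), MessageXml)
                  FROM dbo.Messages ORDER BY MessageId", connection);

            using var reader = await command.ExecuteReaderAsync();
            var count = 0;
            while (await reader.ReadAsync())
            {
                var id = reader.GetInt64(0);
                var xml = reader.GetString(1);

                // One file per message, named by its message ID.
                var file = fileSystem.GetFileClient($"messages/{id}.xml");
                using var payload = new MemoryStream(Encoding.UTF8.GetBytes(xml));
                await file.UploadAsync(payload, overwrite: true);
                count++;
            }

            return new OkObjectResult($"Exported {count} messages.");
        }
    }
    ```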

    Please let me know if this helps or if you need anything specific in a particular language.


1 additional answer

  1. Burra, Ashok 21 Reputation points
    2021-01-12T14:26:57.763+00:00

    Hi @JayaC-MSFT ,

    To answer your query: yes, we were looking at Azure Functions or WebJobs to read the data from the DB and populate ADLS. The samples you provided are helpful and give us an implementation model to refer to.

    However, given the 30-minute execution time limit of Azure Functions, we will have to work on options that can read all the records of the DB and dump them to ADLS in a defined directory hierarchy (yyyy -> mm -> dd -> hh -> files).

    Alternatively, is there an out-of-the-box solution to bulk transfer from Azure SQL Database to ADLS?

    Our proposed entity structure of the partitioned DB will be:
    Long id, --[The name of the file]
    XML message, --[Data that will be written to a file on ADLS]
    Date messageDateTime --[To build the hierarchy in ADLS]

    With the alternatives at hand, we are close to drawing a holistic solution.
    Step 1: Partition the 6 TB of data into 1 TB chunks and move them to a staging DB (on-prem).
    Step 2: Create a BACPAC/DACPAC and import it into Azure SQL Database.
    Step 3: Trigger the Azure Function/WebJob to read the data and dump it to ADLS (we need parallel execution here to iterate the DB, or a bulk copy of the DB contents to ADLS; see the fan-out sketch after these steps).
    The above step is the most time-consuming, and we need approaches to get it done faster.

    Step 4: Iterate from Step 1 to Step 3 until all 6 TB of data is moved.
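    For Step 3, a rough fan-out sketch is below: it partitions the message ID range into chunks and runs a bounded number of export workers concurrently, and it builds the yyyy/MM/dd/HH file path from messageDateTime. Everything here is a placeholder (bounds, chunk size, worker count), and exportRangeAsync stands in for the "read a chunk from SQL, write one file per message to ADLS" logic sketched in the accepted answer; given the Azure Functions time limit, this loop would more likely run as a WebJob or console app.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Threading;
    using System.Threading.Tasks;

    public static class ParallelExport
    {
        // Placeholder bounds: one staging slice of message IDs, split into fixed-size chunks.
        const long RangeStart = 1;
        const long RangeEnd = 80_000_000;
        const long ChunkSize = 100_000;
        const int MaxParallelWorkers = 16;

        // yyyy/MM/dd/HH/<id>.xml hierarchy built from messageDateTime, as proposed above.
        public static string BuildAdlsPath(DateTime ts, long id) =>
            $"{ts:yyyy}/{ts:MM}/{ts:dd}/{ts:HH}/{id}.xml";

        // exportRangeAsync is a hypothetical delegate wrapping the per-chunk
        // "query SQL, upload files to ADLS" work.
        public static async Task RunAsync(Func<long, long, Task> exportRangeAsync)
        {
            var throttle = new SemaphoreSlim(MaxParallelWorkers);
            var tasks = new List<Task>();

            for (var start = RangeStart; start <= RangeEnd; start += ChunkSize)
            {
                var chunkStart = start; // copy: the for-loop variable is shared across closures
                var chunkEnd = Math.Min(chunkStart + ChunkSize - 1, RangeEnd);

                await throttle.WaitAsync(); // cap the number of concurrent workers
                tasks.Add(Task.Run(async () =>
                {
                    try { await exportRangeAsync(chunkStart, chunkEnd); }
                    finally { throttle.Release(); }
                }));
            }

            await Task.WhenAll(tasks);
        }
    }
    ```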

    Regards
    Ashok.

