Move data from a SQL Server database to SQL Database with Azure Data Factory
This article shows how to move data from a SQL Server database to Azure SQL Database via Azure Blob Storage using Azure Data Factory (ADF). This method is a supported legacy approach that has the advantage of a replicated staging copy, though we suggest that you review our data migration page for the latest options.
For a table that summarizes various options for moving data to an Azure SQL Database, see Move data to an Azure SQL Database for Azure Machine Learning.
Introduction: What is ADF and when should it be used to migrate data?
Azure Data Factory is a fully managed cloud-based data integration service that orchestrates and automates the movement and transformation of data. The key concept in the ADF model is the pipeline. A pipeline is a logical grouping of Activities, each of which defines the actions to perform on the data contained in Datasets. Linked services define the information needed for Data Factory to connect to the data resources.
With ADF, existing data processing services can be composed into data pipelines that are highly available and managed in the cloud. These data pipelines can be scheduled to ingest, prepare, transform, analyze, and publish data, and ADF manages and orchestrates the complex data and processing dependencies. Solutions can be quickly built and deployed in the cloud, connecting a growing number of on-premises and cloud data sources.
Consider using ADF:
- when data needs to be continually migrated in a hybrid scenario that accesses both on-premises and cloud resources
- when the data needs to be transformed or have business logic added to it as it is migrated.
ADF allows for the scheduling and monitoring of jobs using simple JSON scripts that manage the movement of data on a periodic basis. ADF also has other capabilities, such as support for complex operations. For more information on ADF, see the Azure Data Factory (ADF) documentation.
The Scenario
We set up an ADF pipeline that composes two data migration activities. Together they move data on a daily basis between a SQL Server database and Azure SQL Database. The two activities are:
- Copy data from a SQL Server database to an Azure Blob Storage account
- Copy data from the Azure Blob Storage account to Azure SQL Database.
Note
The steps shown here have been adapted from the more detailed tutorial provided by the ADF team: Copy data from a SQL Server database to Azure Blob storage. References to the relevant sections of that topic are provided where appropriate.
Prerequisites
This tutorial assumes you have:
- An Azure subscription. If you do not have a subscription, you can sign up for a free trial.
- An Azure storage account. You use an Azure storage account for storing the data in this tutorial. If you don't have an Azure storage account, see the Create a storage account article. After you have created the storage account, you need to obtain the account key used to access the storage (a PowerShell sketch for this follows the list); see Manage storage account access keys.
- Access to an Azure SQL Database. If you need to set up an Azure SQL Database, the topic Getting Started with Microsoft Azure SQL Database explains how to provision a new instance.
- Azure PowerShell installed and configured locally. For instructions, see How to install and configure Azure PowerShell.
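For example, once Azure PowerShell is installed, you can sign in and retrieve the storage account key from the command line. The following is a minimal sketch using the classic Azure PowerShell cmdlets; the storage account name is a placeholder:

# Sign in to Azure with the classic Azure PowerShell module used in this walkthrough.
Add-AzureAccount

# Retrieve the access keys for the storage account.
# "yourstorageaccount" is a placeholder; substitute your own account name.
$storageKeys = Get-AzureStorageKey -StorageAccountName "yourstorageaccount"
$storageKeys.Primary   # the primary access key used when connecting to blob storage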
Note
This procedure uses the Azure portal.
Upload the data to your SQL Server instance
We use the NYC Taxi dataset to demonstrate the migration process. The NYC Taxi dataset is available on Azure Blob Storage: NYC Taxi Data. The data consists of two files: the trip_data.csv file, which contains trip details, and the trip_fare.csv file, which contains details of the fare paid for each trip. A sample and description of these files are provided in NYC Taxi Trips Dataset Description.
You can either adapt the procedure provided here to a set of your own data or follow the steps as described by using the NYC Taxi dataset. To upload the NYC Taxi dataset into your SQL Server database, follow the procedure outlined in Bulk Import Data into SQL Server database.
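If you load the NYC Taxi CSV files yourself, one option is the bcp command-line utility. The following sketch assumes the target table nyctaxi_data already exists; the server name, database name, and file path are placeholders, and the delimiters should be adjusted to match your data:

# Bulk-load the trip data CSV into the nyctaxi_data table with bcp.
# -T uses Windows authentication, -c uses character mode, -t/-r set the field and row terminators,
# and -F 2 skips the header row. Server, database, and file path are placeholders for illustration.
bcp MyTaxiDB.dbo.nyctaxi_data in "C:\temp\trip_data.csv" -S "yourserver" -T -c -t "," -r "\n" -F 2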
Create an Azure Data Factory
The instructions for creating a new Azure Data Factory and a resource group in the Azure portal are provided in Create an Azure Data Factory. Name the new ADF instance adfdsp and name the new resource group adfdsprg.
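If you prefer to script this step rather than use the Azure portal, the classic Azure PowerShell cmdlets can create both the resource group and the data factory; the region below is only an example:

# Switch to Resource Manager mode (needed for the Data Factory cmdlets in the classic module).
Switch-AzureMode AzureResourceManager

# Create the resource group and the data factory; "West US" is an example region.
New-AzureResourceGroup -Name adfdsprg -Location "West US"
New-AzureDataFactory -ResourceGroupName adfdsprg -Name adfdsp -Location "West US"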
Install and configure Azure Data Factory Integration Runtime
The Integration Runtime is a customer-managed data integration infrastructure used by Azure Data Factory to provide data integration capabilities across different network environments. This runtime was formerly called "Data Management Gateway".
To set it up, follow the instructions for creating a pipeline.
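If you script the gateway registration instead, the classic cmdlet shown below creates the gateway entry in the data factory; the gateway name is an assumption, and the registration key it returns is used when installing the gateway on the machine that hosts the SQL Server database:

# Register a gateway (formerly Data Management Gateway) in the data factory.
# "adfdspgateway" is an illustrative name; the registration key in the output is
# entered during the gateway installation on the SQL Server machine.
New-AzureDataFactoryGateway -ResourceGroupName adfdsprg -DataFactoryName adfdsp -Name adfdspgateway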
Create linked services to connect to the data resources
A linked service defines the information needed for Azure Data Factory to connect to a data resource. We have three resources in this scenario for which linked services are needed:
- On-premises SQL Server
- Azure Blob Storage
- Azure SQL Database
The step-by-step procedure for creating linked services is provided in Create linked services.
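If you script this step, each linked service is defined in its own JSON file and registered with the New-AzureDataFactoryLinkedService cmdlet. The file names below are assumptions, mirroring the table-creation commands used later in this article:

# Register the three linked services from their JSON definition files (file names are illustrative).
New-AzureDataFactoryLinkedService -ResourceGroupName adfdsprg -DataFactoryName adfdsp -File C:\temp\onpremsqllinkedservice.json
New-AzureDataFactoryLinkedService -ResourceGroupName adfdsprg -DataFactoryName adfdsp -File C:\temp\blobstoragelinkedservice.json
New-AzureDataFactoryLinkedService -ResourceGroupName adfdsprg -DataFactoryName adfdsp -File C:\temp\azuresqllinkedservice.json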
Define and create tables to specify how to access the datasets
Create tables that specify the structure, location, and availability of the datasets with the following script-based procedures. JSON files are used to define the tables. For more information on the structure of these files, see Datasets.
Note
You should execute the Add-AzureAccount cmdlet before executing the New-AzureDataFactoryTable cmdlet to confirm that the right Azure subscription is selected for the command execution. For documentation of this cmdlet, see Add-AzureAccount.
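A minimal sketch of this check (the subscription name is a placeholder):

# Sign in and make sure the intended subscription is selected before creating ADF objects.
Add-AzureAccount
Select-AzureSubscription -SubscriptionName "Your Subscription Name"
Get-AzureSubscription -Current   # verify which subscription is active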
The JSON-based definitions in the tables use the following names:
- the table name in the SQL Server is nyctaxi_data
- the container name in the Azure Blob Storage account is containername
Three table definitions are needed for this ADF pipeline:
Note
These procedures use Azure PowerShell to define and create the ADF activities. But these tasks can also be accomplished using the Azure portal. For details, see Create datasets.
SQL on-premises Table
The table definition for the SQL Server is specified in the following JSON file:
{
    "name": "OnPremSQLTable",
    "properties":
    {
        "location":
        {
            "type": "OnPremisesSqlServerTableLocation",
            "tableName": "nyctaxi_data",
            "linkedServiceName": "adfonpremsql"
        },
        "availability":
        {
            "frequency": "Day",
            "interval": 1,
            "waitOnExternal":
            {
                "retryInterval": "00:01:00",
                "retryTimeout": "00:10:00",
                "maximumRetry": 3
            }
        }
    }
}
The column names are not included in this definition. You can subselect on specific columns by including them here (for details, see the ADF documentation).
Copy the JSON definition of the table into a file called onpremtabledef.json and save it to a known location (here assumed to be C:\temp\onpremtabledef.json). Create the table in ADF with the following Azure PowerShell cmdlet:
New-AzureDataFactoryTable -ResourceGroupName adfdsprg -DataFactoryName adfdsp -File C:\temp\onpremtabledef.json
Blob Table
The definition for the table for the output blob location follows (this table maps the data ingested from on-premises to the Azure blob):
{
    "name": "OutputBlobTable",
    "properties":
    {
        "location":
        {
            "type": "AzureBlobLocation",
            "folderPath": "containername",
            "format":
            {
                "type": "TextFormat",
                "columnDelimiter": "\t"
            },
            "linkedServiceName": "adfds"
        },
        "availability":
        {
            "frequency": "Day",
            "interval": 1
        }
    }
}
Copy the JSON definition of the table into a file called bloboutputtabledef.json and save it to a known location (here assumed to be C:\temp\bloboutputtabledef.json). Create the table in ADF with the following Azure PowerShell cmdlet:
New-AzureDataFactoryTable -ResourceGroupName adfdsprg -DataFactoryName adfdsp -File C:\temp\bloboutputtabledef.json
SQL Azure Table
The definition for the table for the SQL Azure output follows (this schema maps the data coming from the blob):
{
    "name": "OutputSQLAzureTable",
    "properties":
    {
        "structure":
        [
            { "name": "column1", "type": "String"},
            { "name": "column2", "type": "String"}
        ],
        "location":
        {
            "type": "AzureSqlTableLocation",
            "tableName": "your_db_name",
            "linkedServiceName": "adfdssqlazure_linked_servicename"
        },
        "availability":
        {
            "frequency": "Day",
            "interval": 1
        }
    }
}
Copy the JSON definition of the table into a file called AzureSqlTable.json and save it to a known location (here assumed to be C:\temp\AzureSqlTable.json). Create the table in ADF with the following Azure PowerShell cmdlet:
New-AzureDataFactoryTable -ResourceGroupName adfdsprg -DataFactoryName adfdsp -File C:\temp\AzureSqlTable.json
Define and create the pipeline
Specify the activities that belong to the pipeline and create the pipeline with the following script-based procedures. A JSON file is used to define the pipeline properties.
- The script assumes that the pipeline name is AMLDSProcessPipeline.
- Also note that we set the periodicity of the pipeline so that it is executed on a daily basis, and use the default execution time for the job (12 AM UTC).
Note
The following procedures use Azure PowerShell to define and create the ADF pipeline. But this task can also be accomplished using the Azure portal. For details, see Create pipeline.
Using the table definitions provided previously, the pipeline definition for the ADF is specified as follows:
{
    "name": "AMLDSProcessPipeline",
    "properties":
    {
        "description": "This pipeline has two activities: the first one copies data from SQL Server to Azure Blob, and the second one copies from Azure Blob to Azure Database Table",
        "activities":
        [
            {
                "name": "CopyFromSQLtoBlob",
                "description": "Copy data from SQL Server to blob",
                "type": "CopyActivity",
                "inputs": [ {"name": "OnPremSQLTable"} ],
                "outputs": [ {"name": "OutputBlobTable"} ],
                "transformation":
                {
                    "source":
                    {
                        "type": "SqlSource",
                        "sqlReaderQuery": "select * from nyctaxi_data"
                    },
                    "sink":
                    {
                        "type": "BlobSink"
                    }
                },
                "Policy":
                {
                    "concurrency": 3,
                    "executionPriorityOrder": "NewestFirst",
                    "style": "StartOfInterval",
                    "retry": 0,
                    "timeout": "01:00:00"
                }
            },
            {
                "name": "CopyFromBlobtoSQLAzure",
                "description": "Push data to Sql Azure",
                "type": "CopyActivity",
                "inputs": [ {"name": "OutputBlobTable"} ],
                "outputs": [ {"name": "OutputSQLAzureTable"} ],
                "transformation":
                {
                    "source":
                    {
                        "type": "BlobSource"
                    },
                    "sink":
                    {
                        "type": "SqlSink",
                        "WriteBatchTimeout": "00:05:00"
                    }
                },
                "Policy":
                {
                    "concurrency": 3,
                    "executionPriorityOrder": "NewestFirst",
                    "style": "StartOfInterval",
                    "retry": 2,
                    "timeout": "02:00:00"
                }
            }
        ]
    }
}
Copy this JSON definition of the pipeline into a file called pipelinedef.json and save it to a known location (here assumed to be C:\temp\pipelinedef.json). Create the pipeline in ADF with the following Azure PowerShell cmdlet:
New-AzureDataFactoryPipeline -ResourceGroupName adfdsprg -DataFactoryName adfdsp -File C:\temp\pipelinedef.json
Start the Pipeline
The pipeline can now be run using the following command:
Set-AzureDataFactoryPipelineActivePeriod -ResourceGroupName adfdsprg -DataFactoryName adfdsp -StartDateTime startdateZ -EndDateTime enddateZ -Name AMLDSProcessPipeline
The startdate and enddate parameter values need to be replaced with the actual dates between which you want the pipeline to run.
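For example, to have the pipeline run daily during the first week of January 2016 (dates chosen purely for illustration):

# Activate the pipeline for a specific period (UTC); the dates below are illustrative.
Set-AzureDataFactoryPipelineActivePeriod -ResourceGroupName adfdsprg -DataFactoryName adfdsp -StartDateTime 2016-01-01Z -EndDateTime 2016-01-07Z -Name AMLDSProcessPipeline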
Once the pipeline executes, you should see the data appear in the container selected for the blob, with one file per day.
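You can also check the status of the individual daily slices from PowerShell. The following sketch uses the classic slice cmdlet and assumes the same start date used when activating the pipeline:

# List the slices produced for the blob output table from the start of the active period (date illustrative).
Get-AzureDataFactorySlice -ResourceGroupName adfdsprg -DataFactoryName adfdsp -TableName OutputBlobTable -StartDateTime 2016-01-01Z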
We have not leveraged the functionality provided by ADF to pipe data incrementally. For more information on how to do this and other capabilities provided by ADF, see the ADF documentation.
Contributors
This article is maintained by Microsoft. It was originally written by the following contributors.
Principal author:
- Mark Tabladillo | Senior Cloud Solution Architect