Move data from MySQL using Azure Data Factory
Note
This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see MySQL connector in V2.
This article explains how to use the Copy Activity in Azure Data Factory to move data from an on-premises MySQL database. It builds on the Data Movement Activities article, which presents a general overview of data movement with the copy activity.
You can copy data from an on-premises MySQL data store to any supported sink data store. For a list of data stores supported as sinks by the copy activity, see the Supported data stores table. Data Factory currently supports only moving data from a MySQL data store to other data stores; it does not support moving data from other data stores to a MySQL data store.
Prerequisites
The Data Factory service supports connecting to on-premises MySQL sources by using the Data Management Gateway. See the moving data between on-premises locations and cloud article to learn about the Data Management Gateway and for step-by-step instructions on setting up the gateway.
The gateway is required even if the MySQL database is hosted in an Azure IaaS virtual machine (VM). You can install the gateway on the same VM as the data store or on a different VM, as long as the gateway can connect to the database.
Note
See Troubleshoot gateway issues for tips on troubleshooting connection/gateway related issues.
Supported versions and installation
For the Data Management Gateway to connect to the MySQL database, you need to install MySQL Connector/NET for Microsoft Windows (a version between 6.6.5 and 6.10.7) on the same system as the Data Management Gateway. This 32-bit driver is compatible with the 64-bit Data Management Gateway. MySQL version 5.1 and above is supported.
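Version strings such as 6.10.7 do not sort correctly as plain text, so a check of an installed driver version against the supported range is easiest with numeric tuples. The following is a minimal Python sketch. The function names are hypothetical, and treating both bounds as inclusive is an assumption, because the text above only says "between".

```python
# Hypothetical helper, not part of any Data Factory or MySQL tooling.
# Assumption: both ends of the supported range are inclusive.
MIN_VERSION = (6, 6, 5)
MAX_VERSION = (6, 10, 7)


def parse_version(text):
    """Turn '6.10.7' into (6, 10, 7) so the comparison is numeric."""
    return tuple(int(part) for part in text.strip().split("."))


def is_supported_connector(text):
    """Return True if the version lies within [MIN_VERSION, MAX_VERSION]."""
    return MIN_VERSION <= parse_version(text) <= MAX_VERSION


# Small demonstration with a few made-up candidate versions.
for candidate in ("6.6.4", "6.9.12", "6.10.7", "8.0.33"):
    print(candidate, is_supported_connector(candidate))
```

Comparing the raw strings would place "6.10.7" before "6.6.5", which is why the sketch converts each part to an integer first.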
Tip
If you hit the error "Authentication failed because the remote party has closed the transport stream.", consider upgrading MySQL Connector/NET to a higher version.
Getting started
You can create a pipeline with a copy activity that moves data from an on-premises MySQL data store by using different tools/APIs.
- The easiest way to create a pipeline is to use the Copy Wizard. See Tutorial: Create a pipeline using Copy Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard.
- You can also use the following tools to create a pipeline: Visual Studio, Azure PowerShell, Azure Resource Manager template, .NET API, and REST API. See Copy activity tutorial for step-by-step instructions to create a pipeline with a copy activity.
Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store:
- Create linked services to link input and output data stores to your data factory.
- Create datasets to represent input and output data for the copy operation.
- Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output.
When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data Factory entities by using the JSON format. For a sample with JSON definitions for Data Factory entities that are used to copy data from an on-premises MySQL data store, see JSON example: Copy data from MySQL to Azure Blob section of this article.
The following sections provide details about JSON properties that are used to define Data Factory entities specific to a MySQL data store:
Linked service properties
The following table describes the JSON elements specific to the MySQL linked service.
Property | Description | Required |
---|---|---|
type | The type property must be set to: OnPremisesMySql | Yes |
server | Name of the MySQL server. | Yes |
database | Name of the MySQL database. | Yes |
schema | Name of the schema in the database. | No |
authenticationType | Type of authentication used to connect to the MySQL database. Possible values are: Basic. | Yes |
userName | Specify user name to connect to the MySQL database. | Yes |
password | Specify password for the user account you specified. | Yes |
gatewayName | Name of the gateway that the Data Factory service should use to connect to the on-premises MySQL database. | Yes |
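As an illustration of the table above, the following Python sketch assembles a linked service definition and rejects it when a property marked as required is missing. The helper name `build_mysql_linked_service` is hypothetical and is not part of any Data Factory SDK; only the property names and the `OnPremisesMySql` type come from this article.

```python
import json

# Properties marked "Yes" in the Required column above; "schema" is optional.
REQUIRED = ("server", "database", "authenticationType",
            "userName", "password", "gatewayName")


def build_mysql_linked_service(name, **type_properties):
    """Return a linked service definition, rejecting missing required keys."""
    missing = [key for key in REQUIRED if not type_properties.get(key)]
    if missing:
        raise ValueError("missing required properties: " + ", ".join(missing))
    return {
        "name": name,
        "properties": {
            "type": "OnPremisesMySql",
            "typeProperties": dict(type_properties),
        },
    }


# Placeholder values only; replace them with your own settings.
definition = build_mysql_linked_service(
    "OnPremMySqlLinkedService",
    server="<server name>",
    database="<database name>",
    authenticationType="Basic",
    userName="<user name>",
    password="<password>",
    gatewayName="<gateway>",
)
print(json.dumps(definition, indent=2))
```

The printed JSON has the same shape as the linked service sample later in this article, minus the optional schema property.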
Dataset properties
For a full list of sections & properties available for defining datasets, see the Creating datasets article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure blob, Azure table, etc.).
The typeProperties section is different for each type of dataset and provides information about the location of the data in the data store. The typeProperties section for a dataset of type RelationalTable (which includes the MySQL dataset) has the following properties:
Property | Description | Required |
---|---|---|
tableName | Name of the table in the MySQL Database instance that linked service refers to. | No (if query of RelationalSource is specified) |
Copy activity properties
For a full list of sections & properties available for defining activities, see the Creating Pipelines article. Properties such as name, description, input and output tables, and policies are available for all types of activities.
In contrast, the properties available in the typeProperties section of the activity vary with each activity type. For Copy activity, they vary depending on the types of sources and sinks.
When the source in a copy activity is of type RelationalSource (which includes MySQL), the following properties are available in the typeProperties section:
Property | Description | Allowed values | Required |
---|---|---|---|
query | Use the custom query to read data. | SQL query string. For example: select * from MyTable. | No (if tableName of dataset is specified) |
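The two tables above make tableName and query optional only when the other one is specified, so at least one of them should be present. The following Python sketch checks that condition before a definition is deployed. The function name `check_read_target` is hypothetical, and the sketch deliberately says nothing about which property wins when both are set, because this article does not state it.

```python
def check_read_target(dataset, source):
    """Raise ValueError unless tableName or query is specified.

    `dataset` is a dataset definition and `source` is the "source" object
    from the copy activity's typeProperties, both as plain dictionaries.
    """
    type_properties = dataset["properties"].get("typeProperties", {})
    table_name = type_properties.get("tableName")
    query = source.get("query")
    if not table_name and not query:
        raise ValueError(
            "specify tableName on the dataset or query on the source")
    return {"tableName": bool(table_name), "query": bool(query)}


# A dataset with empty typeProperties is acceptable when the source has a query.
dataset = {"properties": {"type": "RelationalTable", "typeProperties": {}}}
source = {"type": "RelationalSource", "query": "select * from MyTable"}
print(check_read_target(dataset, source))
```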
JSON example: Copy data from MySQL to Azure Blob
This example provides sample JSON definitions that you can use to create a pipeline by using Visual Studio or Azure PowerShell. It shows how to copy data from an on-premises MySQL database to Azure Blob storage. However, data can be copied to any of the supported sinks by using the Copy Activity in Azure Data Factory.
Important
This sample provides JSON snippets. It does not include step-by-step instructions for creating the data factory. See moving data between on-premises locations and cloud article for step-by-step instructions.
The sample has the following data factory entities:
- A linked service of type OnPremisesMySql.
- A linked service of type AzureStorage.
- An input dataset of type RelationalTable.
- An output dataset of type AzureBlob.
- A pipeline with Copy Activity that uses RelationalSource and BlobSink.
The sample copies data from a query result in the MySQL database to a blob hourly. The JSON properties used in these samples are described in the sections preceding the samples.
As a first step, set up the Data Management Gateway. The instructions are in the moving data between on-premises locations and cloud article.
MySQL linked service:
{
    "name": "OnPremMySqlLinkedService",
    "properties": {
        "type": "OnPremisesMySql",
        "typeProperties": {
            "server": "<server name>",
            "database": "<database name>",
            "schema": "<schema name>",
            "authenticationType": "<authentication type>",
            "userName": "<user name>",
            "password": "<password>",
            "gatewayName": "<gateway>"
        }
    }
}
Azure Storage linked service:
{
    "name": "AzureStorageLinkedService",
    "properties": {
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
        }
    }
}
MySQL input dataset:
The sample assumes you have created a table "MyTable" in MySQL and it contains a column called "timestampcolumn" for time series data.
Setting "external": true informs the Data Factory service that the table is external to the data factory and is not produced by an activity in the data factory.
{
    "name": "MySqlDataSet",
    "properties": {
        "published": false,
        "type": "RelationalTable",
        "linkedServiceName": "OnPremMySqlLinkedService",
        "typeProperties": {},
        "availability": {
            "frequency": "Hour",
            "interval": 1
        },
        "external": true,
        "policy": {
            "externalData": {
                "retryInterval": "00:01:00",
                "retryTimeout": "00:10:00",
                "maximumRetry": 3
            }
        }
    }
}
Azure Blob output dataset:
Data is written to a new blob every hour (frequency: hour, interval: 1). The folder path for the blob is dynamically evaluated based on the start time of the slice that is being processed. The folder path uses year, month, day, and hours parts of the start time.
{
    "name": "AzureBlobMySqlDataSet",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": "AzureStorageLinkedService",
        "typeProperties": {
            "folderPath": "mycontainer/mysql/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}",
            "format": {
                "type": "TextFormat",
                "rowDelimiter": "\n",
                "columnDelimiter": "\t"
            },
            "partitionedBy": [
                {
                    "name": "Year",
                    "value": {
                        "type": "DateTime",
                        "date": "SliceStart",
                        "format": "yyyy"
                    }
                },
                {
                    "name": "Month",
                    "value": {
                        "type": "DateTime",
                        "date": "SliceStart",
                        "format": "MM"
                    }
                },
                {
                    "name": "Day",
                    "value": {
                        "type": "DateTime",
                        "date": "SliceStart",
                        "format": "dd"
                    }
                },
                {
                    "name": "Hour",
                    "value": {
                        "type": "DateTime",
                        "date": "SliceStart",
                        "format": "HH"
                    }
                }
            ]
        },
        "availability": {
            "frequency": "Hour",
            "interval": 1
        }
    }
}
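To make the dynamic folder path concrete, the following Python sketch approximates how the {Year}, {Month}, {Day}, and {Hour} placeholders could resolve for a given slice start time. It is an illustration only and not the service's implementation: the function name `resolve_folder_path` is hypothetical, and the format mapping covers only the four format strings that appear in the sample.

```python
from datetime import datetime

# Maps the .NET-style format strings used in the sample to strftime codes.
# Assumption: only these four formats need to be handled.
DOTNET_TO_STRFTIME = {"yyyy": "%Y", "MM": "%m", "dd": "%d", "HH": "%H"}


def resolve_folder_path(folder_path, partitioned_by, slice_start):
    """Substitute each {Name} placeholder using the slice start time."""
    resolved = folder_path
    for partition in partitioned_by:
        value = partition["value"]
        if value["type"] != "DateTime" or value["date"] != "SliceStart":
            raise ValueError("this sketch only handles DateTime/SliceStart")
        text = slice_start.strftime(DOTNET_TO_STRFTIME[value["format"]])
        resolved = resolved.replace("{" + partition["name"] + "}", text)
    return resolved


# The partitionedBy entries from the sample, written compactly.
PARTITIONED_BY = [
    {"name": name, "value": {"type": "DateTime", "date": "SliceStart", "format": fmt}}
    for name, fmt in (("Year", "yyyy"), ("Month", "MM"), ("Day", "dd"), ("Hour", "HH"))
]
FOLDER_PATH = "mycontainer/mysql/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}"

print(resolve_folder_path(FOLDER_PATH, PARTITIONED_BY, datetime(2014, 6, 1, 18)))
```

For a slice that starts at 2014-06-01 18:00, the sketch yields mycontainer/mysql/yearno=2014/monthno=06/dayno=01/hourno=18, so each hourly slice lands in its own folder.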
Pipeline with Copy activity:
The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the source type is set to RelationalSource and sink type is set to BlobSink. The SQL query specified for the query property selects the data in the past hour to copy.
{
    "name": "CopyMySqlToBlob",
    "properties": {
        "description": "pipeline for copy activity",
        "activities": [
            {
                "type": "Copy",
                "typeProperties": {
                    "source": {
                        "type": "RelationalSource",
                        "query": "$$Text.Format('select * from MyTable where timestampcolumn >= \\'{0:yyyy-MM-ddTHH:mm:ss}\\' AND timestampcolumn < \\'{1:yyyy-MM-ddTHH:mm:ss}\\'', WindowStart, WindowEnd)"
                    },
                    "sink": {
                        "type": "BlobSink",
                        "writeBatchSize": 0,
                        "writeBatchTimeout": "00:00:00"
                    }
                },
                "inputs": [
                    {
                        "name": "MySqlDataSet"
                    }
                ],
                "outputs": [
                    {
                        "name": "AzureBlobMySqlDataSet"
                    }
                ],
                "policy": {
                    "timeout": "01:00:00",
                    "concurrency": 1
                },
                "scheduler": {
                    "frequency": "Hour",
                    "interval": 1
                },
                "name": "MySqlToBlob"
            }
        ],
        "start": "2014-06-01T18:00:00Z",
        "end": "2014-06-01T19:00:00Z"
    }
}
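The query in the pipeline is a template that is filled in with the window start and end for each slice. The following Python sketch shows roughly what the resulting SQL text could look like for the one-hour window in the sample. The function name `render_window_query` is hypothetical, the column name is a parameter whose default follows the "timestampcolumn" column that the sample says MyTable contains, and the date formatting only approximates the yyyy-MM-ddTHH:mm:ss format string.

```python
from datetime import datetime, timedelta

# strftime equivalent of the yyyy-MM-ddTHH:mm:ss format used in the template.
WINDOW_FORMAT = "%Y-%m-%dT%H:%M:%S"


def render_window_query(window_start, window_end,
                        table="MyTable", column="timestampcolumn"):
    """Build the SQL text for one slice using a half-open time window."""
    return (
        "select * from {table} where {column} >= '{start}' "
        "AND {column} < '{end}'"
    ).format(
        table=table,
        column=column,
        start=window_start.strftime(WINDOW_FORMAT),
        end=window_end.strftime(WINDOW_FORMAT),
    )


start = datetime(2014, 6, 1, 18)
end = start + timedelta(hours=1)
print(render_window_query(start, end))
```

Because the lower bound is inclusive and the upper bound is exclusive, consecutive hourly windows do not overlap, and rerunning a slice reads the same time range each time.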
Type mapping for MySQL
As mentioned in the data movement activities article, Copy activity performs automatic type conversions from source types to sink types with the following two-step approach:
- Convert from native source types to .NET type
- Convert from .NET type to native sink type
When moving data from MySQL, the following mappings are used from MySQL types to .NET types.
MySQL Database type | .NET Framework type |
---|---|
bigint unsigned | Decimal |
bigint | Int64 |
bit | Decimal |
blob | Byte[] |
bool | Boolean |
char | String |
date | DateTime |
datetime | DateTime |
decimal | Decimal |
double precision | Double |
double | Double |
enum | String |
float | Single |
int unsigned | Int64 |
int | Int32 |
integer unsigned | Int64 |
integer | Int32 |
long varbinary | Byte[] |
long varchar | String |
longblob | Byte[] |
longtext | String |
mediumblob | Byte[] |
mediumint unsigned | Int64 |
mediumint | Int32 |
mediumtext | String |
numeric | Decimal |
real | Double |
set | String |
smallint unsigned | Int32 |
smallint | Int16 |
text | String |
time | TimeSpan |
timestamp | DateTime |
tinyblob | Byte[] |
tinyint unsigned | Int16 |
tinyint | Int16 |
tinytext | String |
varchar | String |
year | Int |
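The table can be read as a simple lookup from a MySQL column type to the interim .NET type. The following Python sketch encodes a subset of the rows above and looks up a column type as it might appear in a table definition. The function name `dotnet_type_for` is hypothetical, and stripping a length suffix such as (255) before the lookup is an assumption made for convenience; it is not described in this article.

```python
import re

# A subset of the mapping table above; extend it with the remaining rows as needed.
MYSQL_TO_DOTNET = {
    "bigint unsigned": "Decimal",
    "bigint": "Int64",
    "int unsigned": "Int64",
    "int": "Int32",
    "smallint unsigned": "Int32",
    "smallint": "Int16",
    "tinyint unsigned": "Int16",
    "tinyint": "Int16",
    "varchar": "String",
    "blob": "Byte[]",
    "time": "TimeSpan",
}


def dotnet_type_for(mysql_type):
    """Look up the interim .NET type, ignoring case, spacing, and (n) suffixes."""
    without_length = re.sub(r"\(.*?\)", "", mysql_type)
    key = " ".join(without_length.lower().split())
    return MYSQL_TO_DOTNET[key]


for column_type in ("INT(10) UNSIGNED", "varchar(255)", "tinyint"):
    print(column_type, "->", dotnet_type_for(column_type))
```

Note how most unsigned integer types map to a wider .NET type than their signed counterparts, for example int to Int32 but int unsigned to Int64.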
Map source to sink columns
To learn about mapping columns in source dataset to columns in sink dataset, see Mapping dataset columns in Azure Data Factory.
Repeatable read from relational sources
When copying data from relational data stores, keep repeatability in mind to avoid unintended outcomes. In Azure Data Factory, you can rerun a slice manually. You can also configure a retry policy for a dataset so that a slice is rerun when a failure occurs. When a slice is rerun either way, you need to make sure that the same data is read no matter how many times the slice is run. See Repeatable read from relational sources.
Performance and Tuning
See Copy Activity Performance & Tuning Guide to learn about key factors that impact performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it.