Create a Data Factory Pipeline with Copy and Hive activities
This sample creates a data factory with a pipeline that contains three activities:
Pipeline: Copy Activity -> HDInsight Activity -> Copy Activity
- The first Copy Activity copies the input file from an FTP server to an Azure blob.
- The HDInsight Activity processes the data from the input file by running a Hive script on an Azure HDInsight cluster to produce an output file in Azure Blob Storage. The script combines first names and last names from the input file and creates an output file with full names.
- The second Copy Activity copies the output of the Hive processing to a table in an Azure SQL database (a sketch of the activity chain appears after this list).
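A minimal sketch of how the activity chain might look in Data Factory (version 1) pipeline JSON is shown below. The activity and dataset names (FTPToBlobCopy, FTPFileInput, HiveInputBlob, HiveOutputBlob, BlobToSqlCopy, SqlOutput, and the linked service names) are placeholders invented for illustration, not the names used by the template; only the activity, source, and sink types (Copy, HDInsightHive, FileSystemSource, BlobSink, BlobSource, SqlSink) come from this sample.

```json
{
    "name": "CopyHiveCopyPipeline",
    "properties": {
        "activities": [
            {
                "type": "Copy",
                "name": "FTPToBlobCopy",
                "inputs": [ { "name": "FTPFileInput" } ],
                "outputs": [ { "name": "HiveInputBlob" } ],
                "typeProperties": {
                    "source": { "type": "FileSystemSource" },
                    "sink": { "type": "BlobSink" }
                }
            },
            {
                "type": "HDInsightHive",
                "name": "RunHiveScript",
                "inputs": [ { "name": "HiveInputBlob" } ],
                "outputs": [ { "name": "HiveOutputBlob" } ],
                "linkedServiceName": "HDInsightOnDemandLinkedService",
                "typeProperties": {
                    "scriptPath": "adftutorial/script/combinefirstandlast.hql",
                    "scriptLinkedService": "AzureStorageLinkedService"
                }
            },
            {
                "type": "Copy",
                "name": "BlobToSqlCopy",
                "inputs": [ { "name": "HiveOutputBlob" } ],
                "outputs": [ { "name": "SqlOutput" } ],
                "typeProperties": {
                    "source": { "type": "BlobSource" },
                    "sink": { "type": "SqlSink" }
                }
            }
        ]
    }
}
```

Because each activity's output dataset is the next activity's input dataset, Data Factory runs them in order: Copy, then Hive, then Copy.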
Prerequisites
- An Azure subscription.
- Upload the combinefirstandlast.hql file to a folder named script in the adftutorial container in your Azure Blob storage.
- Run the createemployeestable.sql script from the script folder to create a table named Employees in your Azure SQL database.
- Create an Azure virtual machine and configure it to be an FTP server. See this article for details. You can use an existing FTP server instead.
- Upload the input.txt file from the script folder to a folder named incomingfiles on the FTP server (this location is referenced in the dataset sketch after this list).
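To show how the FTP prerequisites map onto the template, here is a rough sketch of a Data Factory (version 1) FileShare dataset that points at the uploaded file. The dataset and linked service names (FTPFileInput, FtpLinkedService) are assumptions for illustration and the template's own definitions may differ; the folder and file names come from the prerequisites above.

```json
{
    "name": "FTPFileInput",
    "properties": {
        "type": "FileShare",
        "linkedServiceName": "FtpLinkedService",
        "typeProperties": {
            "folderPath": "incomingfiles",
            "fileName": "input.txt"
        },
        "external": true,
        "availability": {
            "frequency": "Hour",
            "interval": 1
        }
    }
}
```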
Specify values for the parameters in the azuredeploy.parameters.json file:
"ftpHost": { "value": "<your FTP server name or IP address>" }, "ftpUser": { "value": "<FTP user name>" }, "ftpPassword": { "value": "<FTP password>" }, "ftpFolderName": { "value": "incomingfiles" }, "ftpFileName": { "value": "input.txt" }, "storageAccountResourceGroupName": { "value": "<Resource group of your Azure Storae account>" }, "storageAccountName": { "<Azure Storage account name>": "" }, "storageAccountKey": { "value": "<Azure Storage access key>" }, "blobContainer": { "value": "adftutorial" }, "inputBlobFolder": { "value": "inputdata" }, "inputBlobName": { "value": "input.txt" }, "outputBlobFolder": { "value": "outputdata" }, "hiveScriptFolder": { "value": "script" }, "hiveScriptFile": { "value": "combinefirstandlast.hql" }, "sqlServerName": { "value": "<Name of Azure SQL server>" }, "sqlDatabaseName": { "value": "<Name of Azure SQL database>" }, "sqlServerUserName": { "value": "<Name of user who has access to the SQL server>" }, "sqlServerPassword": { "value": "<Password for Azuer SQL user>" }, "targetSQLTable": { "value": "Employees" }
When you deploy this Azure Resource Manager template, a data factory is created with the following entities:
- Linked services
  - FTP
  - Azure Storage
  - Azure HDInsight
  - Azure SQL Database
- Datasets
  - FTP (FileShare)
  - Azure Blob (2 of them)
  - Azure SQL
- Pipeline with three activities: Copy, HDInsight Hive, and another Copy (a hedged sketch of the FTP linked service definition follows this list).
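As an example of what one of those linked services looks like, a Data Factory (version 1) FtpServer linked service fed by the ftpHost, ftpUser, and ftpPassword parameters could be defined roughly as follows. The name FtpLinkedService is an assumption for illustration, and the template's actual definition may differ.

```json
{
    "name": "FtpLinkedService",
    "properties": {
        "type": "FtpServer",
        "typeProperties": {
            "host": "[parameters('ftpHost')]",
            "authenticationType": "Basic",
            "username": "[parameters('ftpUser')]",
            "password": "[parameters('ftpPassword')]"
        }
    }
}
```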
In this tutorial, the input file on the FTP server has the following data:
```
Doe, John
Doe, Jane
Gates, Bill
Allen, Paul
```
This file is copied from the FTP server to the inputdata folder in the Azure blob container. The HDInsight Hive activity processes this file and creates an output file like this:
```
John Doe
Jane Doe
Bill Gates
Paul Allen
```
The second Copy Activity copies this data to the Employees table in the Azure SQL database:
```
FullName
--------
John Doe
Jane Doe
Bill Gates
Paul Allen
```
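The sink for that second copy is an AzureSqlTable dataset that points at the Employees table. A minimal sketch is shown below; the dataset and linked service names (EmployeesSqlOutput, AzureSqlLinkedService) are illustrative assumptions, while the table name comes from the targetSQLTable parameter.

```json
{
    "name": "EmployeesSqlOutput",
    "properties": {
        "type": "AzureSqlTable",
        "linkedServiceName": "AzureSqlLinkedService",
        "typeProperties": {
            "tableName": "Employees"
        },
        "availability": {
            "frequency": "Hour",
            "interval": 1
        }
    }
}
```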
For more information, see the Overview and prerequisites article.
See the Tutorial: Create a pipeline using a Resource Manager template article for a detailed walkthrough with step-by-step instructions.
Tags: Microsoft.DataFactory/datafactories, linkedservices, FtpServer, AzureStorage, HDInsightOnDemand, AzureSqlDatabase, datasets, FileShare, AzureBlob, TextFormat, AzureSqlTable, datapipelines, Copy, FileSystemSource, BlobSink, HDInsightHive, BlobSource, SqlSink