Build a data pipeline by using Azure Data Factory, DevOps, and machine learning
Azure DevOps Services
Get started building a data pipeline with data ingestion, data transformation, and model training.
Learn how to ingest data from a CSV (comma-separated values) file and save it to Azure Blob Storage. Transform the data and save it to a staging area. Then train a machine learning model by using the transformed data, and write the model to Blob Storage as a Python pickle file.
Prerequisites
Before you begin, you need:
- An Azure account that has an active subscription. Create an account for free.
- An active Azure DevOps organization. Sign up for Azure Pipelines.
- The Administrator role for service connections in your Azure DevOps project. Learn how to add the Administrator role.
- Data from sample.csv.
- Access to the data pipeline solution in GitHub.
- The DevOps for Azure Databricks extension.
Provision Azure resources
Sign in to the Azure portal.
From the menu, select the Cloud Shell button. When you're prompted, select the Bash experience.
Note
You'll need an Azure Storage resource to persist any files that you create in Azure Cloud Shell. When you first open Cloud Shell, you're prompted to create a resource group, storage account, and Azure Files share. This setup is automatically used for all future Cloud Shell sessions.
Select an Azure region
A region is one or more Azure datacenters within a geographic location. East US, West US, and North Europe are examples of regions. Every Azure resource, including the resources that you create in this article, is assigned a region.
To make commands easier to run, start by selecting a default region. After you specify the default region, later commands use that region unless you specify a different region.
In Cloud Shell, run the following az account list-locations command to list the regions that are available from your Azure subscription.
az account list-locations \
    --query "[].{Name: name, DisplayName: displayName}" \
    --output table
From the Name column in the output, choose a region that's close to you. For example, choose asiapacific or westus2.
Run az config set to set your default region. In the following example, replace <REGION> with the name of the region you chose.
az config set defaults.location=<REGION>
The following example sets westus2 as the default region.
az config set defaults.location=westus2
Create Bash variables
In Cloud Shell, generate a random number. You'll use this number to create globally unique names for certain services in the next step.
resourceSuffix=$RANDOM
Create globally unique names for your storage account and key vault. The following commands use double quotation marks, which instruct Bash to interpolate the variables by using the inline syntax.
storageName="datacicd${resourceSuffix}" keyVault="keyvault${resourceSuffix}"
Create two more Bash variables to store the name of your resource group and your region. In the following example, replace <REGION> with the region that you chose for the default region.
rgName='data-pipeline-cicd-rg'
region='<REGION>'
Create Bash variables for the names of your Azure Data Factory and Azure Databricks instances.
datafactorydev='data-factory-cicd-dev'
datafactorytest='data-factory-cicd-test'
databricksname='databricks-cicd-ws'
Create Azure resources
Run the following az group create command to create a resource group by using rgName.
az group create --name $rgName
Run the following az storage account create command to create a new storage account.
az storage account create \
    --name $storageName \
    --resource-group $rgName \
    --sku Standard_RAGRS \
    --kind StorageV2
Run the following az storage container create commands to create two containers, rawdata and prepareddata.
az storage container create -n rawdata --account-name $storageName
az storage container create -n prepareddata --account-name $storageName
Run the following az keyvault create command to create a new key vault.
az keyvault create \
    --name $keyVault \
    --resource-group $rgName
Create a new data factory by using the portal UI or Azure CLI:
- Name: data-factory-cicd-dev
- Version: V2
- Resource group: data-pipeline-cicd-rg
- Location: Your closest location
- Clear the selection for Enable Git.
Add the Azure Data Factory extension.
az extension add --name datafactory
Run the following az datafactory create command to create a new data factory.
az datafactory create \
    --name data-factory-cicd-dev \
    --resource-group $rgName
Copy the subscription ID. Your data factory uses this ID later.
Create a second data factory by using the portal UI or the Azure CLI. You use this data factory for testing.
- Name: data-factory-cicd-test
- Version: V2
- Resource group: data-pipeline-cicd-rg
- Location: Your closest location
- Clear the selection for Enable Git.
Run the following az datafactory create command to create a new data factory for testing.
az datafactory create \
    --name data-factory-cicd-test \
    --resource-group $rgName
Copy the subscription ID. Your data factory uses this ID later.
Add a new Azure Databricks service:
- Resource group: data-pipeline-cicd-rg
- Workspace name: databricks-cicd-ws
- Location: Your closest location
Add the Azure Databricks extension if it's not already installed.
az extension add --name databricks
Run the following az databricks workspace create command to create a new workspace.
az databricks workspace create \
    --resource-group $rgName \
    --name databricks-cicd-ws \
    --location eastus2 \
    --sku trial
Copy the subscription ID. Your Databricks service uses this ID later.
Upload data to your storage container
- In the Azure portal, open your storage account in the data-pipeline-cicd-rg resource group.
- Go to Blob Service > Containers.
- Open the prepareddata container.
- Upload the sample.csv file. (If you prefer the CLI, see the sketch after this list.)
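You can also script the upload from Cloud Shell instead of using the portal. This is a minimal sketch; it assumes sample.csv is in your current directory and that the Bash variables from the earlier steps are still set, and --auth-mode key lets the CLI look up the account key for you.
# Upload sample.csv to the prepareddata container.
az storage blob upload \
    --account-name $storageName \
    --container-name prepareddata \
    --name sample.csv \
    --file sample.csv \
    --auth-mode key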
Set up Key Vault
You use Azure Key Vault to store all connection information for your Azure services.
Create a Databricks personal access token
- In the Azure portal, go to your Azure Databricks service and then open your workspace.
- In the Azure Databricks UI, create and copy a personal access token.
Copy the account key and connection string for your storage account
- Go to your storage account.
- Open Access keys.
- Copy the first key and connection string.
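You can also read the same values from Cloud Shell instead of the portal. A minimal sketch, assuming the Bash variables from earlier are still set:
# Show the first account key.
az storage account keys list \
    --account-name $storageName \
    --resource-group $rgName \
    --query "[0].value" --output tsv
# Show the connection string.
az storage account show-connection-string \
    --name $storageName \
    --resource-group $rgName \
    --query connectionString --output tsv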
Save values to Key Vault
Create three secrets:
- databricks-token: your-databricks-pat
- StorageKey: your-storage-key
- StorageConnectString: your-storage-connection
Run the following az keyvault secret set commands to add the secrets to your key vault.
az keyvault secret set --vault-name "$keyVault" --name "databricks-token" --value "your-databricks-pat"
az keyvault secret set --vault-name "$keyVault" --name "StorageKey" --value "your-storage-key"
az keyvault secret set --vault-name "$keyVault" --name "StorageConnectString" --value "your-storage-connection"
Import the data pipeline solution
- Sign in to your Azure DevOps organization and then go to your project.
- Go to Repos and then import your forked version of the GitHub repository. For more information, see Import a Git repo into your project.
Add an Azure Resource Manager service connection
- Create an Azure Resource Manager service connection.
- Select Service Principal (automatic).
- Choose the data-pipeline-cicd-rg resource group.
- Name the service connection azure_rm_connection.
- Select Grant access permission to all pipelines. You need to have the Service Connections Administrator role to select this option.
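If you'd rather create the service connection from the command line, the Azure DevOps CLI extension supports it. The following is only a sketch: it assumes you've already created a service principal yourself and filled in the placeholder IDs, and it creates a subscription-scoped connection rather than one scoped to the resource group.
# The service principal's secret is read from this environment variable.
export AZURE_DEVOPS_EXT_AZURE_RM_SERVICE_PRINCIPAL_KEY='<service-principal-secret>'
az devops service-endpoint azurerm create \
    --name azure_rm_connection \
    --azure-rm-service-principal-id <service-principal-app-id> \
    --azure-rm-subscription-id <subscription-id> \
    --azure-rm-subscription-name '<subscription-name>' \
    --azure-rm-tenant-id <tenant-id> \
    --organization https://dev.azure.com/<yourorganizationname> \
    --project <yourazuredevopsprojectname>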
Add pipeline variables
Create a new variable group named datapipeline-vg.
Add the Azure DevOps extension if it isn't already installed.
az extension add --name azure-devops
Sign in to your Azure DevOps organization.
az devops login --org https://dev.azure.com/<yourorganizationname>
Run the following az pipelines variable-group create command to create the variable group and add its variables. In the following example, replace <yourazuredevopsprojectname> with the name of your project.
az pipelines variable-group create --name datapipeline-vg -p <yourazuredevopsprojectname> --variables \
    "LOCATION=$region" \
    "RESOURCE_GROUP=$rgName" \
    "DATA_FACTORY_NAME=$datafactorydev" \
    "DATA_FACTORY_DEV_NAME=$datafactorydev" \
    "DATA_FACTORY_TEST_NAME=$datafactorytest" \
    "ADF_PIPELINE_NAME=DataPipeline" \
    "DATABRICKS_NAME=$databricksname" \
    "AZURE_RM_CONNECTION=azure_rm_connection" \
    "DATABRICKS_URL=<URL copied from Databricks in Azure portal>" \
    "STORAGE_ACCOUNT_NAME=$storageName" \
    "STORAGE_CONTAINER_NAME=rawdata"
Create a second variable group named keys-vg. This group pulls data variables from Key Vault.
Select Link secrets from an Azure key vault as variables. For more information, see Link a variable group to secrets in Azure Key Vault.
Authorize the Azure subscription.
Choose all of the available secrets to add as variables (databricks-token, StorageConnectString, StorageKey).
Configure Azure Databricks and Azure Data Factory
Follow the steps in the next sections to set up Azure Databricks and Azure Data Factory.
Create testscope in Azure Databricks
- In the Azure portal, go to Key vault > Properties.
- Copy the DNS Name and Resource ID.
- In your Azure Databricks workspace, create a secret scope named testscope that's backed by Azure Key Vault, using the DNS Name and Resource ID that you copied.
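If you manage the workspace with the legacy Databricks CLI rather than the UI, a Key Vault-backed scope can be created roughly as follows. This is only a sketch: <dns-name> and <resource-id> are the values you copied from the key vault's Properties page, and the CLI must be authenticated with an Azure Active Directory token for this call.
databricks secrets create-scope \
    --scope testscope \
    --scope-backend-type AZURE_KEYVAULT \
    --resource-id <resource-id> \
    --dns-name <dns-name>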
Add a new cluster in Azure Databricks
- In the Azure Databricks workspace, go to Clusters.
- Select Create Cluster.
- Name and save your new cluster.
- Select your new cluster name.
- In the URL string, copy the content between /clusters/ and /configuration. For example, in the string clusters/0306-152107-daft561/configuration, you would copy 0306-152107-daft561.
- Save this string to use later.
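As an alternative to reading the URL, the Databricks CLI can list cluster IDs directly. A sketch, assuming the CLI is already configured against this workspace:
# Returns each cluster's ID, name, and state; copy the ID of the cluster you created.
databricks clusters list --output JSON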
Set up your code repository in Azure Data Factory
- In Azure Data Factory, go to Author & Monitor. For more information, see Create a data factory.
- Select Set up code repository and then connect your repo.
- Repository type: Azure DevOps Git
- Azure DevOps organization: Your active account
- Project name: Your Azure DevOps data pipeline project
- Git repository name: Use existing.
- Select the main branch for collaboration.
- Set /azure-data-pipeline/factorydata as the root folder.
- Branch to import resource into: Select Use existing and main.
Link Azure Data Factory to your key vault
- In the Azure portal UI, open the key vault.
- Select Access policies.
- Select Add Access Policy.
- For Configure from template, select Key & Secret Management.
- In Select principal, search for the name of your development data factory and add it.
- Select Add to add your access policies.
- Repeat these steps to add an access policy for the test data factory.
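You can grant the same access from Cloud Shell instead of the portal. A minimal sketch, assuming each data factory has a system-assigned managed identity (if the principalId query returns nothing, enable the factory's managed identity first):
# Look up the development data factory's managed identity.
adfDevPrincipalId=$(az datafactory show \
    --name data-factory-cicd-dev \
    --resource-group $rgName \
    --query identity.principalId --output tsv)
# Allow it to read secrets from the key vault.
az keyvault set-policy \
    --name $keyVault \
    --object-id $adfDevPrincipalId \
    --secret-permissions get list
# Repeat both commands for data-factory-cicd-test.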
Update the key vault linked service in Azure Data Factory
- Go to Manage > Linked services.
- Update the Azure key vault to connect to your subscription.
Update the storage linked service in Azure Data Factory
- Go to Manage > Linked services.
- Update the Azure Blob Storage value to connect to your subscription.
Update the Azure Databricks linked service in Azure Data Factory
- Go to Manage > Linked services.
- Update the Azure Databricks value to connect to your subscription.
- For the Existing Cluster ID, enter the cluster value you saved earlier.
Test and publish the data factory
- In Azure Data Factory, go to Edit.
- Open DataPipeline.
- Select Variables.
- Verify that the storage_account_name refers to your storage account in the Azure portal. Update the default value if necessary. Save your changes.
- Select Validate to verify DataPipeline.
- Select Publish to publish data-factory assets to the adf_publish branch of your repository.
Run the CI/CD pipeline
Follow these steps to run the continuous integration and continuous delivery (CI/CD) pipeline:
- Go to the Pipelines page and then select New pipeline.
- Select Azure Repos Git as the location of your source code.
- When the list of repositories appears, select your repository.
- As you set up your pipeline, select Existing Azure Pipelines YAML file. Choose the YAML file: /azure-data-pipeline/data_pipeline_ci_cd.yml.
- Run the pipeline. When running your pipeline for the first time, you might need to give permission to access a resource during the run.
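The same setup can be scripted with the Azure DevOps CLI extension. A rough sketch; the pipeline name and repository name here are placeholders, not values from this article:
# Create the pipeline from the YAML file in your Azure Repos Git repository.
az pipelines create \
    --name data-pipeline-ci-cd \
    --repository <yourrepositoryname> \
    --repository-type tfsgit \
    --branch main \
    --yml-path azure-data-pipeline/data_pipeline_ci_cd.yml \
    --project <yourazuredevopsprojectname> \
    --skip-first-run true
# Queue a run of the new pipeline.
az pipelines run --name data-pipeline-ci-cd --project <yourazuredevopsprojectname>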
Clean up resources
If you're not going to continue to use this application, delete your data pipeline by following these steps:
- Delete the data-pipeline-cicd-rg resource group.
- Delete your Azure DevOps project.
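Both cleanup steps can also be done from the CLI. A minimal sketch:
# Delete the resource group and everything in it.
az group delete --name data-pipeline-cicd-rg --yes --no-wait
# Delete the Azure DevOps project by its ID.
az devops project delete \
    --id "$(az devops project show --project <yourazuredevopsprojectname> --query id --output tsv)" \
    --yes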