Build a data pipeline by using Azure Data Factory, DevOps, and machine learning

Article
06/20/2023

Azure DevOps Services

Get started building a data pipeline with data ingestion, data transformation, and model training.

Learn how to grab data from a CSV (comma-separated values) file and save the data to Azure Blob Storage. Transform the data and save it to a staging area. Then train a machine learning model by using the transformed data. Write the model to blob storage as a Python pickle file.

Prerequisites

Before you begin, you need:

An Azure account that has an active subscription. Create an account for free.
An active Azure DevOps organization. Sign up for Azure Pipelines.
The Administrator role for service connections in your Azure DevOps project. Learn how to add the Administrator role.
Data from sample.csv.
Access to the data pipeline solution in GitHub.
DevOps for Azure Databricks.

Provision Azure resources

Sign in to the Azure portal.
From the menu, select the Cloud Shell button. When you're prompted, select the Bash experience.

Note

You'll need an Azure Storage resource to persist any files that you create in Azure Cloud Shell. When you first open Cloud Shell, you're prompted to create a resource group, storage account, and Azure Files share. This setup is automatically used for all future Cloud Shell sessions.

Select an Azure region

A region is one or more Azure datacenters within a geographic location. East US, West US, and North Europe are examples of regions. Every Azure resource, including an App Service instance, is assigned a region.

To make commands easier to run, start by selecting a default region. After you specify the default region, later commands use that region unless you specify a different region.

In Cloud Shell, run the following az account list-locations command to list the regions that are available from your Azure subscription.
```
az account list-locations \
  --query "[].{Name: name, DisplayName: displayName}" \
  --output table
```
From the Name column in the output, choose a region that's close to you. For example, choose eastasia or westus2.
Run az config to set your default region. In the following example, replace <REGION> with the name of the region you chose.
```
az config set defaults.location=<REGION>
```
The following example sets westus2 as the default region.
```
az config set defaults.location=westus2
```
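Before you set the default, it can help to confirm that the name you typed matches a value in the Name column exactly. The following is a minimal sketch, not part of the Azure CLI: `region_in_list` is a hypothetical helper that reads the candidate names from standard input.

```shell
# Hypothetical helper: succeeds when the region name in $1 exactly matches
# one of the newline-separated names read from standard input.
region_in_list() {
  grep -qxF -- "$1"
}

# Example use in Cloud Shell (requires a signed-in Azure CLI):
#   az account list-locations --query "[].name" --output tsv \
#     | region_in_list westus2 && az config set defaults.location=westus2
```

Because the match is exact, a partial name such as westus doesn't match westus2.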

Create Bash variables

In Cloud Shell, generate a random number. You'll use this number to create globally unique names for certain services in the next step.
```
resourceSuffix=$RANDOM
```
Create globally unique names for your storage account and key vault. The following commands use double quotation marks, which instruct Bash to interpolate the variables by using the inline syntax.
```
storageName="datacicd${resourceSuffix}"
keyVault="keyvault${resourceSuffix}"
```
Create two more Bash variables to store the name and the region of your resource group. In the following example, replace <REGION> with the region that you chose for the default region.
```
rgName='data-pipeline-cicd-rg'
region='<REGION>'
```

Create variable names for your Azure Data Factory and Azure Databricks instances.

```
datafactorydev='data-factory-cicd-dev'
datafactorytest='data-factory-cicd-test'
databricksname='databricks-cicd-ws'
```
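Storage account and key vault names have length and character restrictions, so it can be useful to check the generated names before you create anything. The following sketch assumes the commonly documented rules: 3 to 24 lowercase letters and digits for storage accounts, and 3 to 24 letters, digits, and hyphens for key vaults. Both function names are hypothetical, and the key vault check doesn't reject consecutive hyphens.

```shell
# Hypothetical helper: checks a candidate storage account name
# (assumed rule: 3-24 characters, lowercase letters and digits only).
is_valid_storage_name() {
  printf '%s\n' "$1" | grep -Eq '^[a-z0-9]{3,24}$'
}

# Hypothetical helper: checks a candidate key vault name
# (assumed rule: 3-24 characters, letters, digits, and hyphens,
# starting with a letter and ending with a letter or digit).
is_valid_keyvault_name() {
  printf '%s\n' "$1" | grep -Eq '^[A-Za-z][A-Za-z0-9-]{1,22}[A-Za-z0-9]$'
}

# Example:
#   is_valid_storage_name "$storageName" || echo "Pick a different storage name"
#   is_valid_keyvault_name "$keyVault"   || echo "Pick a different key vault name"
```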

Create Azure resources

Run the following az group create command to create a resource group by using rgName.
```
az group create --name $rgName
```

Run the following az storage account create command to create a new storage account.

```
az storage account create \
    --name $storageName \
    --resource-group $rgName \
    --sku Standard_RAGRS \
    --kind StorageV2
```

Run the following az storage container create command to create two containers, rawdata and prepareddata.

```
az storage container create -n rawdata --account-name $storageName
az storage container create -n prepareddata --account-name $storageName
```

Run the following az keyvault create command to create a new key vault.

```
az keyvault create \
    --name $keyVault \
    --resource-group $rgName
```

Create a new data factory by using the portal UI or Azure CLI:
- Name: data-factory-cicd-dev
- Version: V2
- Resource group: data-pipeline-cicd-rg
- Location: Your closest location
- Clear the selection for Enable Git.
1. Add the Azure Data Factory extension.
```
az extension add --name datafactory
```
2. Run the following az datafactory create command to create a new data factory.
```
 az datafactory create \
     --name data-factory-cicd-dev \
     --resource-group $rgName
```
3. Copy the subscription ID. Your data factory will use this ID later.
Create a second data factory by using the portal UI or the Azure CLI. You'll use this data factory for testing.
- Name: data-factory-cicd-test
- Version: V2
- Resource group: data-pipeline-cicd-rg
- Location: Your closest location
- Clear the selection for Enable Git.
1. Run the following az datafactory create command to create a new data factory for testing.
```
 az datafactory create \
     --name data-factory-cicd-test \
     --resource-group $rgName
```
2. Copy the subscription ID. Your data factory will use this ID later.
Add a new Azure Databricks service:
- Resource group: data-pipeline-cicd-rg
- Workspace name: databricks-cicd-ws
- Location: Your closest location
1. Add the Azure Databricks extension if it's not already installed.
```
 az extension add --name databricks
```
2. Run the following az databricks workspace create command to create a new workspace.
```
az databricks workspace create \
    --resource-group $rgName \
    --name databricks-cicd-ws \
    --location eastus2 \
    --sku trial
```
3. Copy the subscription ID. Your Databricks service will use this ID later.
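The CLI steps above type each data factory name literally. As a sketch, the same commands can reuse the Bash variables created earlier, and `az account show` can print the subscription ID that the steps ask you to copy. The function names are hypothetical, and the sketch assumes `rgName`, `datafactorydev`, and `datafactorytest` are still set in your Cloud Shell session.

```shell
# Hypothetical wrapper: creates the development and test data factories
# from the variables defined in "Create Bash variables".
create_data_factories() {
  for factory in "$datafactorydev" "$datafactorytest"; do
    az datafactory create \
      --name "$factory" \
      --resource-group "$rgName" || return 1
  done
}

# Hypothetical wrapper: prints the ID of the subscription the CLI is using.
show_subscription_id() {
  az account show --query id --output tsv
}

# Example:
#   create_data_factories
#   show_subscription_id
```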

Upload data to your storage container

In the Azure portal, open your storage account in the data-pipeline-cicd-rg resource group.
Go to Blob Service > Containers.
Open the prepareddata container.
Upload the sample.csv file.
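If you prefer the command line to the portal, the upload can be sketched with `az storage blob upload`. The wrapper name `upload_sample` is hypothetical. The sketch assumes that `storageName` is still set, that sample.csv is present in your Cloud Shell storage, and that your signed-in account is allowed to write blobs to the storage account.

```shell
# Hypothetical wrapper: uploads the local file given as $1 to the
# prepareddata container, using the file's base name as the blob name.
upload_sample() {
  az storage blob upload \
    --account-name "$storageName" \
    --container-name prepareddata \
    --name "$(basename "$1")" \
    --file "$1"
}

# Example:
#   upload_sample ./sample.csv
```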

Set up Key Vault

You'll use Azure Key Vault to store all connection information for your Azure services.

Create a Databricks personal access token

In the Azure portal, go to Databricks and then open your workspace.
In the Azure Databricks UI, create and copy a personal access token.

Copy the account key and connection string for your storage account

Go to your storage account.
Open Access keys.
Copy the first key and connection string.
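The same two values can also be read with the Azure CLI instead of being copied from the portal. The following is a sketch under the assumption that `storageName` and `rgName` are still set. The function names are hypothetical.

```shell
# Hypothetical wrapper: prints the first access key of the storage account.
get_storage_key() {
  az storage account keys list \
    --account-name "$storageName" \
    --resource-group "$rgName" \
    --query "[0].value" \
    --output tsv
}

# Hypothetical wrapper: prints the storage account's connection string.
get_storage_connection_string() {
  az storage account show-connection-string \
    --name "$storageName" \
    --resource-group "$rgName" \
    --query connectionString \
    --output tsv
}

# Example (keeps the values in shell variables instead of on screen):
#   storageKey=$(get_storage_key)
#   storageConnection=$(get_storage_connection_string)
```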

Save values to Key Vault

Create three secrets:
- databricks-token: your-databricks-pat
- StorageKey: your-storage-key
- StorageConnectString: your-storage-connection

Run the following az keyvault secret set command to add secrets to your key vault.

```
az keyvault secret set --vault-name "$keyVault" --name "databricks-token" --value "your-databricks-pat"
az keyvault secret set --vault-name "$keyVault" --name "StorageKey" --value "your-storage-key"
az keyvault secret set --vault-name "$keyVault" --name "StorageConnectString" --value "your-storage-connection"
```
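After you save the secrets, you might want to confirm that all three names exist, because later steps refer to them by name. The following sketch uses a hypothetical helper, `missing_secrets`, that reads secret names from standard input and prints any expected name that it doesn't find.

```shell
# Hypothetical helper: reads secret names from standard input (one per line)
# and prints each of the three expected names that is missing.
missing_secrets() {
  names=$(cat)
  for expected in databricks-token StorageKey StorageConnectString; do
    printf '%s\n' "$names" | grep -qxF -- "$expected" \
      || printf '%s\n' "$expected"
  done
}

# Example (empty output means that all three secrets are present):
#   az keyvault secret list --vault-name "$keyVault" \
#     --query "[].name" --output tsv | missing_secrets
```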

Import the data pipeline solution

Sign in to your Azure DevOps organization and then go to your project.
Go to Repos and then import your forked version of the GitHub repository. For more information, see Import a Git repo into your project.

Add an Azure Resource Manager service connection

Create an Azure Resource Manager service connection.
Select Service Principal (automatic).
Choose the data-pipeline-cicd-rg resource group.
Name the service connection azure_rm_connection.
Select Grant access permission to all pipelines. You'll need to have the Service Connections Administrator role to select this option.

Add pipeline variables

Create a new variable group named datapipeline-vg.
Add the Azure DevOps extension if it isn't already installed.
```
az extension add --name azure-devops 
```

Sign in to your Azure DevOps organization. Replace <yourorganizationname> with the name of your organization.
```
az devops login --org https://dev.azure.com/<yourorganizationname>
```
Create the variable group. Replace <yourazuredevopsprojectname> with the name of your project, and set DATABRICKS_URL to the URL of your Databricks workspace.
```
az pipelines variable-group create --name datapipeline-vg -p <yourazuredevopsprojectname> --variables \
                                    "LOCATION=$region" \
                                    "RESOURCE_GROUP=$rgName" \
                                    "DATA_FACTORY_NAME=$datafactorydev" \
                                    "DATA_FACTORY_DEV_NAME=$datafactorydev" \
                                    "DATA_FACTORY_TEST_NAME=$datafactorytest" \
                                    "ADF_PIPELINE_NAME=DataPipeline" \
                                    "DATABRICKS_NAME=$databricksname" \
                                    "AZURE_RM_CONNECTION=azure_rm_connection" \
                                    "DATABRICKS_URL=<URL copied from Databricks in Azure portal>" \
                                    "STORAGE_ACCOUNT_NAME=$storageName" \
                                    "STORAGE_CONTAINER_NAME=rawdata"
```

Create a second variable group named keys-vg. This group will pull data variables from Key Vault.
Select Link secrets from an Azure key vault as variables. For more information, see Link secrets from an Azure key vault.
Authorize the Azure subscription.
Choose all of the available secrets to add as variables (databricks-token, StorageConnectString, StorageKey).
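If you run several Azure DevOps CLI commands in a row, storing the organization and project as defaults avoids repeating them on every command. The following is a sketch: the wrapper names are hypothetical, and the organization URL and project name are placeholders that you supply.

```shell
# Hypothetical wrapper: stores the organization URL ($1) and project
# name ($2) as defaults for later Azure DevOps CLI commands.
set_devops_defaults() {
  az devops configure --defaults "organization=$1" "project=$2"
}

# Hypothetical wrapper: lists variable groups whose name matches $1,
# so that you can confirm the group was created.
show_variable_group() {
  az pipelines variable-group list --group-name "$1" --output table
}

# Example:
#   set_devops_defaults https://dev.azure.com/<yourorganizationname> <yourazuredevopsprojectname>
#   show_variable_group datapipeline-vg
```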

Configure Azure Databricks and Azure Data Factory

Follow the steps in the next sections to set up Azure Databricks and Azure Data Factory.

Create testscope in Azure Databricks

In the Azure portal, go to Key vault > Properties.
Copy the DNS Name and Resource ID.
In your Azure Databricks workspace, create a Key Vault-backed secret scope named testscope. Enter the DNS name and resource ID that you copied.
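The DNS name and resource ID can also be read with the Azure CLI. The following sketch assumes that `keyVault` is still set. The wrapper name and the output keys `dnsName` and `resourceId` are chosen here for illustration only.

```shell
# Hypothetical wrapper: prints the key vault's URI (shown as DNS Name in
# the portal) and its resource ID as a small JSON object.
show_keyvault_ids() {
  az keyvault show \
    --name "$keyVault" \
    --query "{dnsName: properties.vaultUri, resourceId: id}" \
    --output json
}

# Example:
#   show_keyvault_ids
```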

Add a new cluster in Azure Databricks

In the Azure Databricks workspace, go to Clusters.
Select Create Cluster.
Name and save your new cluster.
Select your new cluster name.
In the URL string, copy the content between /clusters/ and /configuration. For example, in the string clusters/0306-152107-daft561/configuration, you would copy 0306-152107-daft561.
Save this string to use later.
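Copying the cluster ID by hand is easy to get wrong. As a sketch, Bash parameter expansion can pull the ID out of a pasted URL. `extract_cluster_id` is a hypothetical helper, and it assumes the ID is the path segment that directly follows clusters/.

```shell
# Hypothetical helper: prints the path segment that follows "clusters/"
# in the string given as $1, or fails if "clusters/" isn't present.
extract_cluster_id() {
  # Drop everything up to and including the first "clusters/".
  rest=${1#*clusters/}
  # If nothing was stripped, the input didn't contain "clusters/".
  [ "$rest" != "$1" ] || return 1
  # Drop the first "/" and everything after it.
  printf '%s\n' "${rest%%/*}"
}

# Example:
#   extract_cluster_id 'clusters/0306-152107-daft561/configuration'
#   prints 0306-152107-daft561
```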

Set up your code repository in Azure Data Factory

In Azure Data Factory, go to Author & Monitor. For more information, see Create a data factory.
Select Set up code repository and then connect your repo.
- Repository type: Azure DevOps Git
- Azure DevOps organization: Your active account
- Project name: Your Azure DevOps data pipeline project
- Git repository name: Use existing.
  - Select the main branch for collaboration.
  - Set /azure-data-pipeline/factorydata as the root folder.
- Branch to import resource into: Select Use existing and main.

Link Azure Data Factory to your key vault

In the Azure portal UI, open the key vault.
Select Access policies.
Select Add Access Policy.
For Configure from template, select Key & Secret Management.
In Select principal, search for the name of your development data factory and add it.
Select Add to add your access policies.
Repeat these steps to add an access policy for the test data factory.
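For reference, a similar grant can be sketched with the Azure CLI. This is not an exact equivalent of the portal steps: it assumes that each data factory has a system-assigned managed identity, that the key vault uses access policies rather than role-based access control, and it grants only the get and list secret permissions, which is narrower than the Key & Secret Management template. The function name is hypothetical.

```shell
# Hypothetical wrapper: looks up the managed identity of the data factory
# named in $1 and lets that identity read secrets in the key vault.
grant_factory_secret_access() {
  principal_id=$(az datafactory show \
    --name "$1" \
    --resource-group "$rgName" \
    --query identity.principalId \
    --output tsv) || return 1
  az keyvault set-policy \
    --name "$keyVault" \
    --object-id "$principal_id" \
    --secret-permissions get list
}

# Example:
#   grant_factory_secret_access "$datafactorydev"
#   grant_factory_secret_access "$datafactorytest"
```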

Update the key vault linked service in Azure Data Factory

Go to Manage > Linked services.
Update the Azure key vault to connect to your subscription.

Update the storage linked service in Azure Data Factory

Go to Manage > Linked services.
Update the Azure Blob Storage value to connect to your subscription.

Update the Azure Databricks linked service in Azure Data Factory

Go to Manage > Linked services.
Update the Azure Databricks value to connect to your subscription.
For the Existing Cluster ID, enter the cluster value you saved earlier.

Test and publish the data factory

In Azure Data Factory, go to Edit.
Open DataPipeline.
Select Variables.
Verify that the storage_account_name variable refers to your storage account in the Azure portal. Update the default value if necessary. Save your changes.
Select Validate to verify DataPipeline.
Select Publish to publish data-factory assets to the adf_publish branch of your repository.

Run the CI/CD pipeline

Follow these steps to run the continuous integration and continuous delivery (CI/CD) pipeline:

Go to the Pipelines page. Then choose the action to create a new pipeline.
Select Azure Repos Git as the location of your source code.
When the list of repositories appears, select your repository.
As you set up your pipeline, select Existing Azure Pipelines YAML file. Choose the YAML file: /azure-data-pipeline/data_pipeline_ci_cd.yml.
Run the pipeline. If your pipeline hasn't been run before, you might need to give permission to access a resource during the run.

Clean up resources

If you're not going to continue to use this application, delete your data pipeline by following these steps:

Delete the data-pipeline-cicd-rg resource group.
Delete your Azure DevOps project.
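The resource group can also be deleted from Cloud Shell. The following sketch assumes that `rgName` still holds data-pipeline-cicd-rg. The wrapper name is hypothetical. Deleting the resource group removes every resource in it, and the `--yes` flag skips the confirmation prompt, so double-check the name first.

```shell
# Hypothetical wrapper: deletes the resource group named in $rgName
# without prompting and without waiting for the deletion to finish.
delete_pipeline_resource_group() {
  az group delete --name "$rgName" --yes --no-wait
}

# Example:
#   echo "About to delete: $rgName"
#   delete_pipeline_resource_group
```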

Next steps

Learn more about data in Azure Data Factory