Continuous integration and delivery on Azure Databricks using Azure DevOps

Note

This article covers Azure DevOps, which is neither provided nor supported by Databricks. To contact the provider, see Azure DevOps Services support.

Continuous integration and continuous delivery (CI/CD) refers to the process of developing and delivering software in short, frequent cycles through the use of automation pipelines.

Continuous integration begins with the practice of committing your code frequently to a branch within a source code repository. Each commit is then merged with the commits from other developers to ensure that no conflicts were introduced. Changes are further validated by creating a build and running automated tests against that build. This process ultimately results in an artifact, or deployment bundle, that is eventually deployed to a target environment, in this article’s case an Azure Databricks workspace.

Overview of a typical Azure Databricks CI/CD pipeline

Though it can vary based on your needs, a typical configuration for an Azure Databricks pipeline includes the following steps:

Continuous integration:

  1. Code
    1. Develop code and unit tests in an Azure Databricks notebook or using an external IDE.
    2. Manually run tests.
    3. Commit code and tests to a git branch.
  2. Build
    1. Gather new and updated code and tests.
    2. Run automated tests.
    3. Build libraries and non-notebook Apache Spark code.
  3. Release: Generate a release artifact.

Continuous delivery:

  1. Deploy
    1. Deploy notebooks.
    2. Deploy libraries.
  2. Test: Run automated tests and report results.
  3. Operate: Programmatically schedule data engineering, analytics, and machine learning workflows.

Develop and commit your code

One of the first steps in designing a CI/CD pipeline is deciding on a code commit and branching strategy to manage the development and integration of new and updated code without adversely affecting the code currently in production. Part of this decision involves choosing a version control system to contain your code and facilitate the promotion of that code. Azure Databricks supports integrations with various Git providers, which allow you to commit code and notebooks to a Git repository.

If your version control system is not among those supported through direct notebook integration, or if you want more flexibility and control than the self-service Git provider integration offers, you can use the Databricks CLI to export notebooks and commit them from your local machine, as the script later in this section demonstrates. Run that script from within a local Git repository that is set up to sync with the appropriate remote repository. When executed, the script should:

  1. Check out the desired branch.
  2. Pull new changes from the remote branch.
  3. Export code and notebooks from the Azure Databricks workspace using the Azure Databricks workspace CLI.
  4. Prompt the user for a commit message or use the default if one is not provided.
  5. Commit the updated code and notebooks to the local branch.
  6. Push the changes to the remote branch.

The following script performs these steps:

git checkout <branch>
git pull
databricks workspace export-dir <workspace-directory-path> <local-directory-path> --overwrite

dt=`date '+%Y-%m-%d %H:%M:%S'`
msg_default="DB export on $dt"
read -p "Enter the commit comment [$msg_default]: " msg
msg=${msg:-$msg_default}
echo $msg

git add .
git commit -m "$msg"
git push

The preceding Databricks CLI command applies to Databricks CLI versions 0.205 and above.

If you prefer to develop in an IDE rather than in Azure Databricks notebooks, you can use the Git provider integration features built into modern IDEs or the Git CLI to commit your code.

Azure Databricks provides Databricks Connect, which connects IDEs to Azure Databricks clusters. This is especially useful when developing libraries, as it allows you to run and unit test your code on Azure Databricks clusters without having to deploy that code. See Databricks Connect limitations to determine whether your use case is supported.
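
For example, after Databricks Connect is installed and configured (the build pipeline later in this article runs databricks-connect configure to do this), a SparkSession that you create locally is backed by your remote cluster. The following sketch is illustrative only and assumes that databricks-connect==10.4.* is installed and configured on your machine to point at your workspace and cluster:

# A quick local check that Databricks Connect reaches your remote cluster.
# Assumes databricks-connect==10.4.* is installed and configured locally.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("pete", "pan")], ["first_name", "last_name"])
print(df.count())  # The count is computed on the remote Azure Databricks cluster.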

Depending on your branching strategy and promotion process, the point at which a CI/CD pipeline will initiate a build will vary. However, committed code from various contributors will eventually be merged into a designated branch to be built and deployed. Branch management steps run outside of Azure Databricks, using the interfaces provided by the version control system.

There are numerous CI/CD tools you can use to manage and execute your pipeline. This article illustrates how to use Azure DevOps. CI/CD is a design pattern, so the steps and stages outlined in this article’s example should transfer with a few changes to the pipeline definition language in each tool. Furthermore, much of the code in this example pipeline runs standard Python code, which you can invoke in other tools.

Tip

For information about using Jenkins with Azure Databricks instead of Azure DevOps, see CI/CD with Jenkins on Azure Databricks.

The rest of this article describes a pair of example pipelines in Azure DevOps that you can adapt to your own needs for Azure Databricks.

About the example

This article’s example uses two pipelines to build and release example Python code, an example Python notebook, and related build and release settings files, all of which are stored in a remote Git repository.

The first pipeline, known as the build pipeline, prepares build artifacts for the second pipeline, known as the release pipeline. Separating the build pipeline from the release pipeline allows you to create a build without deploying it, or to deploy artifacts from multiple builds at one time.

In this example, you create the build and release pipelines, which do the following:

  1. Creates an Azure virtual machine for the build pipeline. This virtual machine uses the correct version of Python to match the one on your remote Azure Databricks cluster.
  2. Installs Python tools on the virtual machine for testing and packaging the example Python code.
  3. Installs and configures a version of Databricks Connect on the virtual machine to match the one on your remote cluster.
  4. Copies the files from your Git repository to the virtual machine.
  5. Runs unit tests on the Python code and publishes the test results.
  6. If the unit tests pass, packages the Python code into a Python wheel and then creates a gzip’ed tar file that contains the Python wheel and related release settings files.
  7. Copies the gzip’ed tar file as a zip file into a location for the release pipeline to access.
  8. Creates another Azure virtual machine for the release pipeline. This virtual machine also uses the correct version of Python to match the one on your remote Azure Databricks cluster.
  9. Gets the zip file from the build pipeline’s location and then unpackages the zip file to get the Python wheel and related release settings files.
  10. Deploys the example Python notebook to your remote Azure Databricks workspace.
  11. Deploys the Python wheel and related release settings files to your remote Azure Databricks workspace.
  12. Installs the deployed Python wheel into your remote Azure Databricks cluster.
  13. Runs integration tests on the deployed Python notebook, which calls a function in the deployed Python wheel, and then publishes the test results.

Before you begin

To use this article’s example, you must have:

  • An existing Azure DevOps project. If you do not yet have a project, create a project in Azure DevOps.
  • An existing repository with a Git provider that Azure DevOps supports. You will add the Python example code, the example Python notebook, and related release settings files to this repository. If you do not yet have a repository, create one by following your Git provider’s instructions. Then connect your Azure DevOps project to your existing repository, if you have not done so already; for instructions, follow the links in Supported source repositories.
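
For reference, by the end of this article your repository will contain the following files and folders, which the build and release pipelines expect at these paths (a summary of the layout that the later steps build up):

azure-pipelines.yml
cicd-scripts/
  installWhlLibrary.py
  executenotebook.py
  evaluatenotebookruns.py
libraries/
  python/
    dbxdemo/
      __init__.py
      addcol.py
      setup.py
      test-addcol.py
notebooks/
  dbxdemo-notebook.py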

Step 1: Define the build pipeline

Azure DevOps provides a cloud hosted interface for defining the stages of your CI/CD pipeline using YAML. For more information on Azure DevOps and pipelines, see the Azure DevOps documentation.

In this step, you define the build pipeline, which runs unit tests and builds a deployment artifact. To deploy the code to an Azure Databricks workspace, you specify this build pipeline as a deployment artifact in a release pipeline. You define this release pipeline later in Step 5.

To run build pipelines, Azure DevOps provides cloud-hosted, on-demand execution agents that support deployments to Kubernetes, VMs, Azure Functions, Azure Web Apps, and many more targets. In this example, you use an on-demand agent to automate the deployment of code to the target Azure Databricks workspace. Tools or packages required by the build pipeline must be defined in the build pipeline script and installed on the agent at run time. This example defines and installs tools and packages on the agent that match those on the target Azure Databricks cluster, which in this example runs Databricks Runtime 10.4 LTS and includes Python 3.8.

Now, define your build pipeline as follows:

  1. Sign in to Azure DevOps and open your Azure DevOps project.

  2. Click Pipelines in the sidebar, and then click Pipelines on the Pipelines menu.

    Azure DevOps Pipeline menu

  3. Click the Create Pipeline button to open the pipeline editor, where you will define your build pipeline script in the azure-pipelines.yml file that is displayed. If the pipeline editor is not visible after you click the Create Pipeline button, then select the build pipeline’s name and then click Edit.

    You can use the Git branch selector to customize the build process for each branch in your Git repository. It is a CI/CD best practice to not do production work directly in your repository’s main branch; this example assumes that a branch named release exists in the repository and is used instead.

    Azure DevOps Pipeline editor

    The azure-pipelines.yml build pipeline script is stored by default in the root directory of the remote Git repository that you associated with the pipeline.

  4. Configure environment variables that the build pipeline references by clicking the Variables button.

    For this example, set the following five environment variables, making sure to click Save after you set them:

    • DATABRICKS_ADDRESS, which represents the per-workspace URL of your Azure Databricks workspace, beginning with https://, for example https://adb-<workspace-id>.<random-number>.azuredatabricks.net. Do not include the trailing / after .net.

    • DATABRICKS_API_TOKEN, which represents your Azure Databricks personal access token or Azure Active Directory (AD) token.

      Note

      As a security best practice, when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use personal access tokens belonging to service principals instead of workspace users. To create tokens for service principals, see Manage tokens for a service principal.

    • DATABRICKS_CLUSTER_ID, which represents the Azure Databricks cluster ID in your workspace. This pipeline assumes that you are using Databricks Runtime 10.4 LTS on the cluster.

    • DATABRICKS_ORG_ID, which is the workspace ID of your workspace.

    • DATABRICKS_PORT, which represents the port that Databricks Connect uses. This value is typically 15001.

  5. Overwrite your pipeline’s azure-pipelines.yml file’s starter contents with the following definition, and then click Save.

    # Specify the trigger event to start the build pipeline.
    # In this case, new code merged into the release branch initiates a new build.
    trigger:
    - release
    
    # Specify the operating system for the agent that runs on the Azure virtual
    # machine for the build pipeline (known as the build agent). The virtual
    # machine image should match the one on the Azure Databricks cluster as
    # closely as possible. For example, Databricks Runtime 10.4 LTS runs
    # Ubuntu 20.04.4 LTS, which maps to the Ubuntu 20.04 virtual machine
    # image in the Azure Pipeline agent pool. See
    # https://learn.microsoft.com/azure/devops/pipelines/agents/hosted#software
    pool:
      vmImage: ubuntu-20.04
    
    # Install Python. The version of Python must match the version on the
    # Azure Databricks cluster. This pipeline assumes that you are using
    # Databricks Runtime 10.4 LTS on the cluster.
    steps:
    - task: UsePythonVersion@0
      displayName: 'Use Python 3.8'
      inputs:
        versionSpec: 3.8
    
    # Install required Python modules and their dependencies. These
    # include pytest, which is needed to run unit tests on a cluster,
    # and setuptools, which is needed to create a Python wheel. Also
    # install the version of Databricks Connect that is compatible
    # with Databricks Runtime 10.4 LTS on the cluster.
    - script: |
        pip install pytest requests setuptools wheel
        pip install -U databricks-connect==10.4.*
      displayName: 'Load Python dependencies'
    
    # Use environment variables to pass Azure Databricks workspace and cluster
    # information to the Databricks Connect configuration function.
    - script: |
        echo "y
        $(DATABRICKS_ADDRESS)
        $(DATABRICKS_API_TOKEN)
        $(DATABRICKS_CLUSTER_ID)
        $(DATABRICKS_ORG_ID)
        $(DATABRICKS_PORT)" | databricks-connect configure
      displayName: 'Configure Databricks Connect'
    
    # Download the files from the designated branch in the Git remote repository
    # onto the build agent.
    - checkout: self
      persistCredentials: true
      clean: true
    
    # For library code developed outside of an Azure Databricks notebook, the
    # process is like traditional software development practices. You write a
    # unit test using a testing framework, such as the Python pytest module, and
    # you use JUnit-formatted XML files to store the test results.
    - script: |
        python -m pytest --junit-xml=$(Build.Repository.LocalPath)/logs/TEST-LOCAL.xml $(Build.Repository.LocalPath)/libraries/python/dbxdemo/test*.py || true
      displayName: 'Run Python unit tests for library code'
    
    # Publishes the test results to Azure DevOps. This lets you visualize
    # reports and dashboards related to the status of the build process.
    - task: PublishTestResults@2
      inputs:
        testResultsFiles: '**/TEST-*.xml'
        failTaskOnFailedTests: true
        publishRunAttachments: true
    
    # Package the example Python code into a Python wheel.
    - script: |
        cd $(Build.Repository.LocalPath)/libraries/python/dbxdemo
        python3 setup.py sdist bdist_wheel
        ls dist/
      displayName: 'Build Python Wheel for Libs'
    
    # Generate the deployment artifacts. To do this, the build agent gathers
    # all the new or updated code to be deployed to the Azure Databricks
    # environment, including the sample Python notebook, the Python wheel
    # library that was generated by the build process, related release settings
    # files, and the result summary of the tests for archiving purposes.
    # Use git diff to flag files that were added in the most recent Git merge.
    # Then add the Python wheel file that you just created along with utility
    # scripts used by the release pipeline.
    # The implementation in your pipeline will likely be different.
    # The objective here is to add all files intended for the current release.
    - script: |
        git diff --name-only --diff-filter=AMR HEAD^1 HEAD | xargs -I '{}' cp --parents -r '{}' $(Build.BinariesDirectory)
        mkdir -p $(Build.BinariesDirectory)/libraries/python/libs
        cp $(Build.Repository.LocalPath)/libraries/python/dbxdemo/dist/*.* $(Build.BinariesDirectory)/libraries/python/libs
        mkdir -p $(Build.BinariesDirectory)/cicd-scripts
        cp $(Build.Repository.LocalPath)/cicd-scripts/*.* $(Build.BinariesDirectory)/cicd-scripts
        mkdir -p $(Build.BinariesDirectory)/notebooks
        cp $(Build.Repository.LocalPath)/notebooks/*.* $(Build.BinariesDirectory)/notebooks
      displayName: 'Get Changes'
    
    # Create the deployment artifact and then publish it to the
    # artifact repository.
    - task: ArchiveFiles@2
      inputs:
        rootFolderOrFile: '$(Build.BinariesDirectory)'
        includeRootFolder: false
        archiveType: 'zip'
        archiveFile: '$(Build.ArtifactStagingDirectory)/$(Build.BuildId).zip'
        replaceExistingArchive: true
    
    - task: PublishBuildArtifacts@1
      inputs:
        ArtifactName: 'DatabricksBuild'
    

Step 2: Add the unit test source files to the repository

To enable the build agent to run the unit tests, add the following three files, addcol.py, test-addcol.py, and __init__.py, to a libraries/python/dbxdemo folder path that you create in the root of your associated remote Git repository.

The first file, addcol.py, represents a library function that might be installed on an Azure Databricks cluster. This simple function adds a new column, populated by a literal, to an Apache Spark DataFrame.

# addcol.py
import pyspark.sql.functions as F

def with_status(df):
  return df.withColumn("status", F.lit("checked"))

The second file, test-addcol.py, tests the addcol.py file’s code by passing a mock DataFrame object to the preceding with_status function. The result is then compared to a DataFrame object containing the expected values. If the values match, the test passes.

# test-addcol.py
import pytest

from pyspark.sql import SparkSession
from .addcol import with_status

@pytest.fixture
def spark() -> SparkSession:
  return SparkSession.builder.getOrCreate()

def test_with_status(spark):
  source_data = [
    ("pete", "pan", "peter.pan@databricks.com"),
    ("jason", "argonaut", "jason.argonaut@databricks.com")
  ]
  source_df = spark.createDataFrame(
    source_data,
    ["first_name", "last_name", "email"]
  )

  actual_df = with_status(source_df)

  expected_data = [
    ("pete", "pan", "peter.pan@databricks.com", "checked"),
    ("jason", "argonaut", "jason.argonaut@databricks.com", "checked")
  ]

  expected_df = spark.createDataFrame(
    expected_data,
    ["first_name", "last_name", "email", "status"]
  )

  assert(expected_df.collect() == actual_df.collect())

The third file, __init__.py, must be blank and must also exist in the libraries/python/dbxdemo folder path. This file enables the test-addcol.py file to load the addcol.py file as a library.
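
To sanity-check these files before committing them, you can mirror the pytest invocation that the build agent runs later. The following sketch assumes that you run it from the root of your local clone and that pytest plus a configured copy of Databricks Connect (or a local PySpark installation) are available so that SparkSession.builder.getOrCreate() can create a session; the run_local_tests.py helper name is hypothetical and not part of the example repository:

# run_local_tests.py (hypothetical helper; not part of the example repository)
import os
import sys

import pytest

# Write JUnit-formatted results, mirroring the build pipeline's pytest step.
os.makedirs("logs", exist_ok=True)

exit_code = pytest.main([
  "--junit-xml=logs/TEST-LOCAL.xml",
  "libraries/python/dbxdemo",
])
sys.exit(int(exit_code))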

Step 3: Add the Python wheel packaging script to the repository

To enable the build agent to use Python Setuptools to package the Python wheel to give to the release pipeline, add a minimal version of the following setup.py file to the libraries/python/dbxdemo folder path in your associated remote Git repository:

# setup.py
from setuptools import setup, find_packages

setup(
  name = 'dbxdemo',
  version = '0.1.0',
  packages = ['.']
)

Step 4: Add the Python notebook to the repository

To enable the build agent to give the sample Python notebook to the release pipeline, add the following dbxdemo-notebook.py file to a notebooks folder that you create in the root of your associated remote Git repository:

# Databricks notebook source
import sys
sys.path.append("/databricks/python3/lib/python3.8/site-packages")

# COMMAND ----------

import unittest
from addcol import *

class TestNotebook(unittest.TestCase):

  def test_with_status(self):
    source_data = [
      ("pete", "pan", "peter.pan@databricks.com"),
      ("jason", "argonaut", "jason.argonaut@databricks.com")
    ]

    source_df = spark.createDataFrame(
      source_data,
      ["first_name", "last_name", "email"]
    )

    actual_df = with_status(source_df)

    expected_data = [
      ("pete", "pan", "peter.pan@databricks.com", "checked"),
      ("jason", "argonaut", "jason.argonaut@databricks.com", "checked")
    ]

    expected_df = spark.createDataFrame(
      expected_data,
      ["first_name", "last_name", "email", "status"]
    )

    self.assertEqual(expected_df.collect(), actual_df.collect())

unittest.main(argv = [''], verbosity = 2, exit = False)

Step 5: Define the release pipeline

The release pipeline deploys the build artifacts to an Azure Databricks environment. Separating the release pipeline in this step from the build pipeline in the preceding steps allows you to create a build without deploying it, or to deploy artifacts from multiple builds at one time.

  1. In your Azure DevOps project, on the Pipelines menu in the sidebar, click Releases.

    Azure DevOps Releases

  2. Click New pipeline.

  3. On the side of the screen is a list of featured templates for common deployment patterns. For this release pipeline, click Empty job.

    Azure DevOps release pipeline 1

  4. In the Artifacts box on the side of the screen, click Add. In the Add an artifact pane, for Source (build pipeline), select the build pipeline that you created earlier. Then click Add.

    Azure DevOps release pipeline 2

  5. You can configure how the pipeline is triggered by clicking the lightning bolt icon, which displays triggering options on the side of the screen. If you want a release to be initiated automatically based on build artifact availability or after a pull request workflow, enable the appropriate trigger. For this example, in the last step of this article you manually trigger the build pipeline and then the release pipeline.

    Azure DevOps release pipeline stage 1

  6. Click Save > OK.

Step 6: Define environment variables for the release pipeline

Your release pipeline relies on the following three environment variables, which you can add by clicking Add in the Pipeline variables section on the Variables tab, with a Scope of Stage 1:

  • DATABRICKS_HOST, which represents the per-workspace URL of your Azure Databricks workspace, beginning with https://, for example https://adb-<workspace-id>.<random-number>.azuredatabricks.net. Do not include the trailing / after .net. This should be the same value as DATABRICKS_ADDRESS that you set earlier in the build pipeline. (Databricks Connect in the build pipeline expects to find a DATABRICKS_ADDRESS environment variable, while the Databricks CLI in the release pipeline expects this environment variable to be named DATABRICKS_HOST instead.)

  • DATABRICKS_TOKEN, which represents your Azure Databricks personal access token or Azure Active Directory (AD) token. This should be the same value as DATABRICKS_API_TOKEN that you set earlier in the build pipeline. (In the build pipeline, Databricks Connect expects to find a DATABRICKS_API_TOKEN environment variable. In the release pipeline, Databricks CLI expects this environment variable to be named DATABRICKS_TOKEN.)

    Note

    As a security best practice, when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use personal access tokens belonging to service principals instead of workspace users. To create tokens for service principals, see Manage tokens for a service principal.

  • DATABRICKS_CLUSTER_ID, which represents the Azure Databricks cluster ID in your workspace. This should be the same value as the DATABRICKS_CLUSTER_ID environment variable that you set earlier in the build pipeline.

Azure DevOps release pipeline environment variables

Step 7: Configure the release agent for the release pipeline

  1. Click the 1 job, 0 task link within the Stage 1 object.

    Azure DevOps release pipeline add stage

  2. On the Tasks tab, click Agent job.

  3. In the Agent selection section, for Agent pool, select Azure Pipelines.

  4. For Agent Specification, select the same agent as you specified for the build agent earlier, in this example ubuntu-20.04.

    Azure DevOps release pipeline agent job definition

  5. Click Save > OK.

Step 8: Set the Python version for the release agent

  1. Click the plus sign in the Agent job section, indicated by the red arrow in the following figure. A searchable list of available tasks appears. There is also a Marketplace tab for third-party plug-ins that can be used to supplement the standard Azure DevOps tasks. You will add several tasks to the release agent during the next several steps.

    Azure DevOps add task

  2. The first task you add is Use Python version, located on the Tool tab. If you cannot find this task, use the Search box to look for it. When you find it, select it and then click the Add button next to the Use Python version task.

    Azure DevOps set python version 1

  3. As with the build pipeline, you want to make sure that the Python version is compatible with the scripts called in subsequent tasks. In this case, click the Use Python 3.x task next to Agent job, and then set Version spec to 3.8. Also set Display name to Use Python 3.8. This pipeline assumes that you are using Databricks Runtime 10.4 LTS on the cluster, which also uses Python 3.8.

    Azure DevOps set python version 2

  4. Click Save > OK.

Step 9: Unpackage the build artifact from the build pipeline

  1. Next, have the release agent extract the Python wheel, related release settings files, and the sample Python notebook from the zip file by using the Extract files task: click the plus sign in the Agent job section, select the Extract files task on the Utility tab, and then click Add.

  2. Click the Extract files task next to Agent job, set Archive file patterns to **/*.zip, and set the Destination folder to $(Release.PrimaryArtifactSourceAlias)/Databricks.

    Note

    $(Release.PrimaryArtifactSourceAlias) represents an Azure DevOps-generated alias to identify the primary artifact source location on the release agent, for example _<your-github-alias>.<your-github-repo-name>. The release pipeline sets this value as the environment variable RELEASE_PRIMARYARTIFACTSOURCEALIAS in the Initialize job phase for the release agent. See Classic release and artifacts variables.

  3. Set Display name to Extract build pipeline artifact.

    Azure DevOps unpackage

  4. Click Save > OK.

Step 10. Install the Databricks CLI and unittest XML reporting

  1. Next, install the Databricks CLI and the unittest XML reporting package on the release agent, as the release agent will call the Databricks CLI and unittest in the next few tasks. To do this, use the Bash task: click the plus sign again in the Agent job section, select the Bash task on the Utility tab, and then click Add.

  2. Click the Bash Script task next to Agent job.

  3. For Type, select Inline.

  4. Replace the contents of Script with the following commands, which install the Databricks CLI and the unittest XML reporting package:

    pip install databricks-cli
    pip install unittest-xml-reporting
    
  5. Set Display name to Install Databricks CLI and unittest XML reporting.

    Azure DevOps release pipeline install packages

  6. Click Save > OK.

Step 11: Deploy the notebook to the workspace

  1. Next, have the release agent use the Databricks CLI to deploy the sample Python notebook to the Azure Databricks workspace by using another Bash task: click the plus sign again in the Agent job section, select the Bash task on the Utility tab, and then click Add.

  2. Click the Bash Script task next to Agent job.

  3. For Type, select Inline.

  4. Replace the contents of Script with the following command, which runs the databricks workspace import subcommand to copy the Python notebook from the release agent to your Azure Databricks workspace:

    databricks workspace import --language=PYTHON --format=SOURCE --overwrite "$(System.ArtifactsDirectory)/$(Release.PrimaryArtifactSourceAlias)/Databricks/notebooks/dbxdemo-notebook.py" "/Shared/dbxdemo-notebook.py"
    

    Note

    $(System.ArtifactsDirectory) represents the directory from which artifacts are downloaded during deployment of a release, for example /home/vsts/work/r1/a. The release pipeline sets this value as the environment variable SYSTEM_ARTIFACTSDIRECTORY in the Initialize job phase for the release agent. See Classic release and artifacts variables.

  5. Set Display name to Copy notebook to workspace.

    Azure DevOps release pipeline copy notebook to workspace

  6. Click Save > OK.

Step 12: Deploy the library to DBFS

  1. Next, have the release agent use the Databricks CLI to deploy the Python library to a DBFS location within the Azure Databricks workspace by using another Bash task: click the plus sign again in the Agent job section, select the Bash task on the Utility tab, and then click Add.

  2. Click the Bash Script task next to Agent job.

  3. For Type, select Inline.

  4. Replace the contents of Script with the following command, which runs the databricks fs cp subcommand to copy the Python library from the release agent to your Azure Databricks workspace:

    databricks fs cp --overwrite "$(System.ArtifactsDirectory)/$(Release.PrimaryArtifactSourceAlias)/Databricks/libraries/python/libs/dbxdemo-0.1.0-py3-none-any.whl" "dbfs:/libraries/python/libs/dbxdemo-0.1.0-py3-none-any.whl"
    
  5. Set Display name to Copy Python wheel to workspace.

    Azure DevOps release pipeline copy wheel to workspace

  6. Click Save > OK.

Step 13: Install the library on the cluster

  1. Next, have the release agent install the library that was just copied over into your workspace onto a specific cluster within that workspace. To do this, you create a Python script task: click the plus sign again in the Agent job section, select the Python script task on the Utility tab, and then click Add.

  2. Click the Run a Python script task next to Agent job.

  3. Set the Script path to $(Release.PrimaryArtifactSourceAlias)/Databricks/cicd-scripts/installWhlLibrary.py. The Python script, installWhlLibrary.py, is in the artifact created by the build pipeline. The installWhlLibrary.py script takes five arguments, which you will set in this task as follows:

    • shard - The URL for the target workspace (for example, https://<region>.azuredatabricks.net). This maps to the environment variable DATABRICKS_HOST that you set earlier for the release pipeline. This URL must not include the trailing / after .net.

    • token - An Azure Databricks personal access token or Azure AD token for workspace. This maps to the environment variable DATABRICKS_TOKEN that you set earlier for the release pipeline.

      Note

      As a security best practice, when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use personal access tokens belonging to service principals instead of workspace users. To create tokens for service principals, see Manage tokens for a service principal.

    • clusterid - The ID for the cluster on which to install the library. This maps to the environment variable DATABRICKS_CLUSTER_ID that you set earlier for the release pipeline.

    • libs - The extracted directory containing the libraries. This maps to the path in the release agent that contains the Python wheel. The path must include the trailing /.

    • dbfspath - The path within the DBFS file system to retrieve the libraries. The path must not include the beginning dbfs: but must include the beginning /. Also the path must not include the trailing /.

  4. Set Arguments to the following:

    --shard=$(DATABRICKS_HOST) --token=$(DATABRICKS_TOKEN) --clusterid=$(DATABRICKS_CLUSTER_ID) --libs=$(System.ArtifactsDirectory)/$(Release.PrimaryArtifactSourceAlias)/Databricks/libraries/python/libs/ --dbfspath=/libraries/python/libs
    

    Azure DevOps install library

  5. Set Display name to Install Python wheel on cluster.

  6. Click Save > OK.

The installWhlLibrary.py script referenced in the preceding task makes sure that, before a new version of a library is installed on an Azure Databricks cluster, any existing version of that library is uninstalled first. To do this, installWhlLibrary.py calls the Databricks REST API to perform the following steps:

  1. Check if the library is installed.
  2. If installed, uninstall the library.
  3. Restart the cluster if any uninstalls were performed.
  4. Wait until the cluster is running again before proceeding.
  5. Install the library.

The installWhlLibrary.py file’s contents are as follows. To have the release pipeline run this script, create a folder named cicd-scripts in the root of your Git repository, and then add this installWhlLibrary.py file to the cicd-scripts folder:

# installWhlLibrary.py
#!/usr/bin/python3
import json
import requests
import sys
import getopt
import time
import os

def main():
  shard = ''
  token = ''
  clusterid = ''
  libspath = ''
  dbfspath = ''

  try:
    opts, args = getopt.getopt(sys.argv[1:], 'hs:t:c:l:d:',
      ['shard=', 'token=', 'clusterid=', 'libs=', 'dbfspath='])
  except getopt.GetoptError:
    print(
      'installWhlLibrary.py -s <shard> -t <token> -c <clusterid> -l <libs> -d <dbfspath>')
    sys.exit(2)

  for opt, arg in opts:
    if opt == '-h':
      print(
        'installWhlLibrary.py -s <shard> -t <token> -c <clusterid> -l <libs> -d <dbfspath>')
      sys.exit()
    elif opt in ('-s', '--shard'):
      shard = arg
    elif opt in ('-t', '--token'):
      token = arg
    elif opt in ('-c', '--clusterid'):
      clusterid = arg
    elif opt in ('-l', '--libs'):
      libspath=arg
    elif opt in ('-d', '--dbfspath'):
      dbfspath=arg

  print('-s is ' + shard)
  print('-t is ' + token)
  print('-c is ' + clusterid)
  print('-l is ' + libspath)
  print('-d is ' + dbfspath)

  # Generate the list of files from walking the local path.
  libslist = []
  for path, subdirs, files in os.walk(libspath):
    for name in files:

      name, file_extension = os.path.splitext(name)
      if file_extension.lower() in ['.whl']:
        print('Adding ' + name + file_extension.lower() + ' to the list of .whl files to evaluate.')
        libslist.append(name + file_extension.lower())

  for lib in libslist:
    dbfslib = 'dbfs:' + dbfspath + '/' + lib
    print('Evaluating whether ' + dbfslib + ' must be installed, or uninstalled and reinstalled.')

    # Call the REST API once per library and reuse the returned status.
    libstatus = getLibStatus(shard, token, clusterid, dbfslib)
    if libstatus is not None:
      print(dbfslib + ' status: ' + libstatus)
      if libstatus == "not found":
        print(dbfslib + ' not found. Installing.')
        installLib(shard, token, clusterid, dbfslib)
      else:
        print(dbfslib + ' found. Uninstalling.')
        uninstallLib(shard, token, clusterid, dbfslib)
        print("Restarting cluster: " + clusterid)
        restartCluster(shard, token, clusterid)
        print('Installing ' + dbfslib + '.')
        installLib(shard, token, clusterid, dbfslib)

def uninstallLib(shard, token, clusterid, dbfslib):
  values = {'cluster_id': clusterid, 'libraries': [{'whl': dbfslib}]}
  requests.post(shard + '/api/2.0/libraries/uninstall', data=json.dumps(values), auth=("token", token))

def restartCluster(shard, token, clusterid):
  values = {'cluster_id': clusterid}
  requests.post(shard + '/api/2.0/clusters/restart', data=json.dumps(values), auth=("token", token))

  waiting = True
  p = 0
  while waiting:
    time.sleep(30)
    clusterresp = requests.get(shard + '/api/2.0/clusters/get?cluster_id=' + clusterid,
      auth=("token", token))
    clusterjson = clusterresp.text
    jsonout = json.loads(clusterjson)
    current_state = jsonout['state']
    print(clusterid + " state: " + current_state)
    if current_state in ['TERMINATED', 'RUNNING', 'INTERNAL_ERROR', 'SKIPPED'] or p >= 10:
      break
    p = p + 1

def installLib(shard, token, clusterid, dbfslib):
  values = {'cluster_id': clusterid, 'libraries': [{'whl': dbfslib}]}
  requests.post(shard + '/api/2.0/libraries/install', data=json.dumps(values), auth=("token", token))

def getLibStatus(shard, token, clusterid, dbfslib):

  resp = requests.get(shard + '/api/2.0/libraries/cluster-status?cluster_id='+ clusterid, auth=("token", token))
  libjson = resp.text
  d = json.loads(libjson)
  if (d.get('library_statuses')):
    statuses = d['library_statuses']

    for status in statuses:
      if (status['library'].get('whl')):
        if (status['library']['whl'] == dbfslib):
          return status['status']

    # The specified library was not found among the installed libraries.
    return "not found"
  else:
    # No libraries found.
    return "not found"

if __name__ == '__main__':
  main()

Step 14: Run integration tests on the Python notebook

You can also run tests directly from notebooks containing asserts by using unittest. In this case, the notebook runs the same tests that you used in the earlier unit tests, but now it imports the installed addcol library from the Python wheel that you just installed on the cluster.

To automate these tests and include them in the CI/CD pipeline, use the Databricks REST API to execute the notebook from the CI/CD server. This allows you to check whether the notebook execution passed or failed using unittest. Any assert failures appear in the JSON output returned by the REST API and in the JUnit test results.

  1. Add a Command line task to the release pipeline: click the plus sign again in the Agent job section, select the Command line task on the Utility tab, and then click Add.

  2. Click the Command Line Script task next to Agent job.

  3. Replace the Script box’s contents with the following script. These commands create directories for the notebook execution logs and the test summaries. These commands also include a pip command to install the required pytest and requests modules.

    mkdir -p $(System.ArtifactsDirectory)/$(Release.PrimaryArtifactSourceAlias)/Databricks/logs/json
    mkdir -p $(System.ArtifactsDirectory)/$(Release.PrimaryArtifactSourceAlias)/Databricks/logs/xml
    pip install pytest requests
    
  4. Set Display name to Create integration tests directories.

    Azure DevOps configure test environment

  5. Click Save > OK.

Step 15: Run the notebook

  1. Create a Python script task: click the plus sign again in the Agent job section, select the Python script task on the Utility tab, and then click Add.

  2. Click the Run a Python script task next to Agent job.

  3. With File path selected, set the Script path to $(Release.PrimaryArtifactSourceAlias)/Databricks/cicd-scripts/executenotebook.py. The Python script, executenotebook.py, is in the artifact created by the build pipeline. The executenotebook.py script takes six arguments, which you will set in this task as follows:

    • shard - The URL for the target workspace (for example, https://<region>.azuredatabricks.net). This maps to the environment variable DATABRICKS_HOST that you set earlier for the release pipeline. This URL must not include the trailing / after .net.

    • token - An Azure Databricks personal access token or Azure AD token for workspace. This maps to the environment variable DATABRICKS_TOKEN that you set earlier for the release pipeline.

      Note

      As a security best practice, when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use personal access tokens belonging to service principals instead of workspace users. To create tokens for service principals, see Manage tokens for a service principal.

    • clusterid - The ID for the cluster on which to install the library. This maps to the environment variable DATABRICKS_CLUSTER_ID that you set earlier for the release pipeline.

    • localpath - The extracted directory containing the test notebooks. The path must not include the trailing /.

    • workspacepath - The path within the workspace to which the test notebooks were deployed. The path must include the beginning /. Also the path must not include the trailing /.

    • outfilepath - The path you created to store the JSON output returned by the REST API.

  4. Set Arguments to the following:

    --shard=$(DATABRICKS_HOST) --token=$(DATABRICKS_TOKEN) --clusterid=$(DATABRICKS_CLUSTER_ID) --localpath=$(System.ArtifactsDirectory)/$(Release.PrimaryArtifactSourceAlias)/Databricks/notebooks --workspacepath=/Shared --outfilepath=$(System.ArtifactsDirectory)/$(Release.PrimaryArtifactSourceAlias)/Databricks/logs/json
    
  5. Set Display name to Run notebook.

    Azure DevOps execute notebooks

  6. Click Save > OK.

The following executenotebook.py script runs the notebook by using the jobs runs submit endpoint, which submits an anonymous job. Because this endpoint is asynchronous, the script uses the run ID that the initial REST call returns to poll for the status of the run. After the run completes, the JSON output is saved to the path specified by the arguments passed at invocation.

To have the release pipeline run this script, add this executenotebook.py file to the cicd-scripts folder in the root of your Git repository, where you added the installWhlLibrary.py file earlier:

# executenotebook.py
#!/usr/bin/python3
import json
import requests
import os
import sys
import getopt
import time

def main():
  shard = ''
  token = ''
  clusterid = ''
  localpath = ''
  workspacepath = ''
  outfilepath = ''

  try:
    opts, args = getopt.getopt(sys.argv[1:], 'hs:t:c:l:w:o:',
      ['shard=', 'token=', 'clusterid=', 'localpath=', 'workspacepath=', 'outfilepath='])
  except getopt.GetoptError:
    print(
      'executenotebook.py -s <shard> -t <token>  -c <clusterid> -l <localpath> -w <workspacepath> -o <outfilepath>)')
    sys.exit(2)

  for opt, arg in opts:
    if opt == '-h':
      print(
        'executenotebook.py -s <shard> -t <token> -c <clusterid> -l <localpath> -w <workspacepath> -o <outfilepath>')
      sys.exit()
    elif opt in ('-s', '--shard'):
        shard = arg
    elif opt in ('-t', '--token'):
        token = arg
    elif opt in ('-c', '--clusterid'):
        clusterid = arg
    elif opt in ('-l', '--localpath'):
        localpath = arg
    elif opt in ('-w', '--workspacepath'):
        workspacepath = arg
    elif opt in ('-o', '--outfilepath'):
        outfilepath = arg

  print('-s is ' + shard)
  print('-t is ' + token)
  print('-c is ' + clusterid)
  print('-l is ' + localpath)
  print('-w is ' + workspacepath)
  print('-o is ' + outfilepath)

  # Generate the list of notebooks from walking the local path.
  notebooks = []
  for path, subdirs, files in os.walk(localpath):
    for name in files:
      fullpath = path + '/' + name
      # Remove the localpath to the repo but keep the workspace path.
      fullworkspacepath = workspacepath + path.replace(localpath, '')

      name, file_extension = os.path.splitext(fullpath)
      if file_extension.lower() in ['.scala', '.sql', '.r', '.py']:
        row = [fullpath, fullworkspacepath, 1]
        notebooks.append(row)

  # Run each notebook in the list.
  for notebook in notebooks:
    nameonly = os.path.basename(notebook[0])
    workspacepath = notebook[1]

    name, file_extension = os.path.splitext(nameonly)

    # workspacepath removes the extension, so now add it back.
    fullworkspacepath = workspacepath + '/' + name + file_extension

    print('Running job for: ' + fullworkspacepath)
    values = {'run_name': name, 'existing_cluster_id': clusterid, 'timeout_seconds': 3600, 'notebook_task': {'notebook_path': fullworkspacepath}}

    resp = requests.post(shard + '/api/2.0/jobs/runs/submit',
      data=json.dumps(values), auth=("token", token))
    runjson = resp.text
    print("runjson: " + runjson)
    d = json.loads(runjson)
    runid = d['run_id']

    i=0
    waiting = True
    while waiting:
      time.sleep(10)
      jobresp = requests.get(shard + '/api/2.0/jobs/runs/get?run_id='+str(runid),
        data=json.dumps(values), auth=("token", token))
      jobjson = jobresp.text
      print("jobjson: " + jobjson)
      j = json.loads(jobjson)
      current_state = j['state']['life_cycle_state']
      runid = j['run_id']
      if current_state in ['TERMINATED', 'INTERNAL_ERROR', 'SKIPPED'] or i >= 12:
        break
      i=i+1

    if outfilepath != '':
      file = open(outfilepath + '/' +  str(runid) + '.json', 'w')
      file.write(json.dumps(j))
      file.close()

if __name__ == '__main__':
  main()

Step 16: Generate and evaluate test results

This task runs a Python script that uses unittest to evaluate the JSON output from the notebook runs and determine whether the asserts in the test notebooks passed or failed.

  1. Add a Python script task to the release pipeline: click the plus sign again in the Agent job section, select the Python script task on the Utility tab, and then click Add.

  2. Click the Run a Python script task next to Agent job.

  3. With File path selected, set the Script path to $(Release.PrimaryArtifactSourceAlias)/Databricks/cicd-scripts/evaluatenotebookruns.py.

  4. Set Display name to Create and evaluate notebook test results.

    Azure DevOps generate test results

  5. Click Save > OK.

The script evaluatenotebookruns.py defines the test_job_run function, which parses and evaluates the JSON generated by the previous task. Another test, test_performance, looks for tests that run longer than expected.

To have the release pipeline run this script, add this evaluatenotebookruns.py file to the cicd-scripts folder in the root of your Git repository, where you added the installWhlLibrary.py and executenotebook.py files earlier:

# evaluatenotebookruns.py
#!/usr/bin/python3
import io
import xmlrunner
from xmlrunner.extra.xunit_plugin import transform
import unittest
import json
import glob
import os

class TestJobOutput(unittest.TestCase):

  test_output_path = '<path-to-json-logs-on-release-agent>'

  def test_performance(self):
    path = self.test_output_path
    statuses = []

    for filename in glob.glob(os.path.join(path, '*.json')):
      print('Evaluating: ' + filename)
      data = json.load(open(filename))
      duration = data['execution_duration']
      if duration > 100000:
        status = 'FAILED'
      else:
        status = 'SUCCESS'

      statuses.append(status)

    self.assertFalse('FAILED' in statuses)

  def test_job_run(self):
    path = self.test_output_path
    statuses = []

    for filename in glob.glob(os.path.join(path, '*.json')):
      print('Evaluating: ' + filename)
      data = json.load(open(filename))
      status = data['state']['result_state']
      statuses.append(status)

    self.assertFalse('FAILED' in statuses)

if __name__ == '__main__':
  out = io.BytesIO()

  unittest.main(testRunner=xmlrunner.XMLTestRunner(output=out),
    failfast=False, buffer=False, catchbreak=False, exit=False)

  with open('TEST-report.xml', 'wb') as report:
    report.write(transform(out.getvalue()))

In the preceding script, replace <path-to-json-logs-on-release-agent> with the full absolute path to the Databricks/logs/json/ folder on the release agent. For example, this could be something like /home/vsts/work/r1/a/_<your-github-alias>.<your-github-repo-name>/Databricks/logs/json/. See the earlier discussion about $(System.ArtifactsDirectory) and $(Release.PrimaryArtifactSourceAlias) in Steps 9 and 11.

Step 17: Publish test results

Use the Publish Test Results task to archive the JSON results and publish the test results to Azure DevOps Test Hub. This enables you to visualize reports and dashboards related to the status of the test runs.

  1. Add a Publish Test Results task to the release pipeline: click the plus sign again in the Agent job section, select the Publish Test Results task on the Test tab, and then click Add.

  2. Click the Publish Test Results **/TEST-*.xml task next to Agent job.

  3. Leave all of the default settings unchanged.

    Azure DevOps publish test results

    Note

    $(System.DefaultWorkingDirectory) represents the local path on the agent where your source code files are downloaded, for example /home/vsts/work/r1/a. The release pipeline sets this value as the environment variable SYSTEM_DEFAULTWORKINGDIRECTORY in the Initialize job phase for the release agent. See Use predefined variables.

  4. Click Save > OK.

At this point, you have completed an integration and deployment cycle using the CI/CD pipeline. By automating this process, you ensure that your code is tested and deployed by an efficient, consistent, and repeatable process.

Step 18: Run the build and release pipelines

In this step, you will run the build and release pipelines manually. To learn how to set this up later to run the pipelines automatically, see the commentary earlier in Step 5.

Run the build pipeline:

  1. On the Pipelines menu in the sidebar, click Pipelines.
  2. Click your pipeline’s name, and then click Run pipeline.
  3. For Branch/tag, select the name of the branch in your Git repository that contains all of the source code that you added. This example assumes that this is in the release branch.
  4. Click Run. The build pipeline’s run page appears.
  5. To see the build pipeline’s progress and to view the related logs, click the spinning icon next to Job.

Run the release pipeline:

  1. After the build pipeline runs successfully (all of the icons in the Job details list show check marks), on the Pipelines menu in the sidebar, click Releases.
  2. Click your release pipeline’s name, and then click Create release.
  3. Click Create.
  4. To see the release pipeline’s progress, click the latest release on the Releases tab.
  5. Click the Stage 1 box.
  6. Click View logs.

View test run results for the build and release pipelines:

  1. On the Test Plans menu in the sidebar, click Runs.
  2. In the Recent test runs section, on the Test runs tab, double-click the latest test run entry in the list.