What is CI/CD on Azure Databricks?
This article is an introduction to CI/CD on Databricks. Continuous integration and continuous delivery (CI/CD) refers to the process of developing and delivering software in short, frequent cycles through the use of automation pipelines. CI/CD is common in software development and is becoming increasingly necessary for data engineering and data science. By automating the building, testing, and deployment of code, development teams can deliver releases more reliably than with the manual processes still common on data engineering and data science teams.
Azure Databricks recommends using Databricks Asset Bundles for CI/CD, which enable the development and deployment of complex data, analytics, and ML projects for the Azure Databricks platform. Bundles allow you to easily manage many custom configurations and automate builds, tests, and deployments of your projects to Azure Databricks development, staging, and production workspaces.
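As a minimal sketch, a bundle is defined by a `databricks.yml` file at the project root that names the bundle and its deployment targets. The project name, target names, and workspace URL below are hypothetical placeholders:

```yaml
# databricks.yml — minimal bundle configuration (names and URL are illustrative)
bundle:
  name: my_project

targets:
  dev:
    # Development mode prefixes deployed resources with your username
    # and pauses schedules so test deployments don't run on their own.
    mode: development
    default: true
    workspace:
      host: https://adb-0000000000000000.0.azuredatabricks.net  # placeholder
  prod:
    mode: production
    workspace:
      host: https://adb-0000000000000000.0.azuredatabricks.net  # placeholder
```

Separate `dev`, `staging`, and `prod` targets let the same project definition deploy to different workspaces with per-target overrides.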
For an overview of CI/CD for machine learning projects on Azure Databricks, see How does Databricks support CI/CD for machine learning?.
What’s in a CI/CD pipeline on Azure Databricks?
You can use Databricks Asset Bundles to define and programmatically manage your Azure Databricks CI/CD implementation, which usually includes:
- Notebooks: Azure Databricks notebooks are often a key part of data engineering and data science workflows. You can use version control for notebooks, and also validate and test them as part of a CI/CD pipeline. You can run automated tests against notebooks to check whether they are functioning as expected.
- Libraries: Manage the library dependencies required to run your deployed code. Use version control on libraries and include them in automated testing and validation.
- Workflows: Databricks Jobs let you schedule and run automated workflows composed of tasks, such as notebooks or Spark jobs.
- Data pipelines: You can also include data pipelines in CI/CD automation, using Delta Live Tables, the declarative framework in Databricks for building data pipelines.
- Infrastructure: Infrastructure configuration includes definitions and provisioning information for clusters, workspaces, and storage for target environments. Infrastructure changes can be validated and tested as part of a CI/CD pipeline, ensuring that they are consistent and error-free.
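The workflow and data pipeline components above are declared as bundle resources. A hedged sketch, in which the resource keys, notebook paths, and names are all hypothetical:

```yaml
# Inside databricks.yml (or a file included from it): one job and one
# Delta Live Tables pipeline. All keys, names, and paths are illustrative.
resources:
  jobs:
    nightly_etl:
      name: nightly_etl
      tasks:
        - task_key: ingest
          notebook_task:
            notebook_path: ./src/ingest.ipynb
  pipelines:
    my_dlt_pipeline:
      name: my_dlt_pipeline
      libraries:
        - notebook:
            path: ./src/dlt_pipeline.ipynb
```

Because these definitions live in version control alongside the notebooks they reference, the same CI/CD pipeline that tests the code can also validate and deploy the resources.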
Steps for CI/CD on Azure Databricks
A typical flow for an Azure Databricks CI/CD pipeline includes the following steps:
- Store: Store your Azure Databricks code and notebooks in a version control system like Git. This allows you to track changes over time and collaborate with other team members. See CI/CD techniques with Git and Databricks Git folders (Repos) and bundle Git settings.
- Code: Develop code and unit tests in an Azure Databricks notebook in the workspace or locally using an external IDE. Azure Databricks provides a Visual Studio Code extension that makes it easy to develop and deploy changes to Azure Databricks workspaces.
- Build: Use Databricks Asset Bundles settings to automatically build certain artifacts during deployments. See artifacts. In addition, Pylint extended with the Databricks Labs pylint plugin helps enforce coding standards and detect bugs in your Databricks notebooks and application code.
- Deploy: Deploy changes to the Azure Databricks workspace using Databricks Asset Bundles in conjunction with tools like Azure DevOps, Jenkins, or GitHub Actions. See Databricks Asset Bundle deployment modes.
- Test: Develop and run automated tests to validate your code changes using tools like pytest. To test your integrations with workspace APIs, the Databricks Labs pytest plugin allows you to create workspace objects and clean them up after tests finish.
- Run: Use the Databricks CLI in conjunction with Databricks Asset Bundles to automate runs in your Azure Databricks workspaces. See Run a bundle.
- Monitor: Monitor the performance of your code and workflows in Azure Databricks using tools like Azure Monitor or Datadog. This helps you identify and resolve any issues that arise in your production environment.
- Iterate: Make small, frequent iterations to improve and update your data engineering or data science project. Small changes are easier to roll back than large ones.
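The Test step above can be sketched with pytest-style unit tests. A common pattern is to factor notebook logic into plain Python functions so they run in CI without a cluster; the function and field names below are hypothetical:

```python
# test_transforms.py — unit tests for notebook logic factored into a plain
# Python function, runnable under pytest in a CI pipeline.

def add_revenue_column(rows):
    """Hypothetical transformation: revenue = units * unit_price."""
    return [{**row, "revenue": row["units"] * row["unit_price"]} for row in rows]


def test_add_revenue_column_computes_product():
    rows = [{"units": 3, "unit_price": 2.5}]
    result = add_revenue_column(rows)
    assert result[0]["revenue"] == 7.5


def test_add_revenue_column_preserves_existing_fields():
    rows = [{"units": 1, "unit_price": 4.0, "region": "EU"}]
    assert add_revenue_column(rows)[0]["region"] == "EU"
```

In a typical pipeline, tests like these run before the bundle is deployed, so a failing test blocks the deployment.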
Related links
For more information on managing the lifecycle of Azure Databricks assets and data, see the following documentation about CI/CD and data pipeline tools.
| Area | Use these tools when you want to… |
|---|---|
| Databricks Asset Bundles | Programmatically define, deploy, and run Azure Databricks jobs, Delta Live Tables pipelines, and MLOps Stacks by using CI/CD best practices and workflows. |
| Databricks Terraform provider | Provision and manage Databricks infrastructure and resources using Terraform. |
| CI/CD workflows with Git and Databricks Git folders | Use GitHub and Databricks Git folders for source control and CI/CD workflows. |
| Authenticate with Azure DevOps on Databricks | Authenticate with Azure DevOps. |
| Use a Microsoft Entra service principal to authenticate access to Azure Databricks Git folders | Use a Microsoft Entra service principal to authenticate access to Databricks Git folders. |
| Continuous integration and delivery on Azure Databricks using Azure DevOps | Develop a CI/CD pipeline for Azure Databricks that uses Azure DevOps. |
| Continuous integration and delivery using GitHub Actions | Develop a CI/CD workflow on GitHub that uses GitHub Actions developed for Azure Databricks. |
| CI/CD with Jenkins on Azure Databricks | Develop a CI/CD pipeline for Azure Databricks that uses Jenkins. |
| Orchestrate Azure Databricks jobs with Apache Airflow | Manage and schedule a data pipeline that uses Apache Airflow. |
| Service principals for CI/CD | Use service principals, instead of users, with CI/CD systems. |