Git integration with Databricks Repos
Databricks Repos is a visual Git client and API in Azure Databricks. It supports common Git operations such as cloning a repository, committing and pushing, pulling, branch management, and visual comparison of diffs when committing.
Within Repos you can develop code in notebooks or other files and follow data science and engineering code development best practices using Git for version control, collaboration, and CI/CD.
What can you do with Databricks Repos?
Databricks Repos provides source control for data and AI projects by integrating with Git providers.
In Databricks Repos, you can use Git functionality to:
- Clone, push to, and pull from a remote Git repository.
- Create and manage branches for development work, including merging, rebasing, and resolving conflicts.
- Create notebooks (including IPYNB notebooks) and edit them and other files.
- Visually compare differences upon commit and resolve merge conflicts.
For step-by-step instructions, see Run Git operations on Databricks Repos.
Databricks Repos also has an API that you can integrate with your CI/CD pipeline. For example, you can programmatically update a Databricks repo so that it always has the most recent version of the code. For information about best practices for code development using Databricks Repos, see CI/CD techniques with Git and Databricks Repos.
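As a minimal sketch of the "always has the most recent version" scenario, the snippet below builds the HTTP request a CI/CD job might send to move a repo to the head of a branch. It assumes the Repos REST API exposes `PATCH /api/2.0/repos/{repo_id}` with a JSON body containing `branch`; the helper name `build_repo_update_request`, the workspace URL, the repo ID, and the token placeholder are all hypothetical, so check the endpoint and fields against the Repos API reference for your workspace before relying on them.

```python
import json
import urllib.request


def build_repo_update_request(
    host: str, token: str, repo_id: int, branch: str
) -> urllib.request.Request:
    """Build (but do not send) a request that updates a repo to the head of `branch`.

    Assumption: the Repos API accepts PATCH /api/2.0/repos/{repo_id}
    with a JSON body such as {"branch": "main"}.
    """
    url = f"{host.rstrip('/')}/api/2.0/repos/{repo_id}"
    body = json.dumps({"branch": branch}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    # Hypothetical workspace URL, repo ID, and token placeholder.
    request = build_repo_update_request(
        "https://adb-1234567890123456.7.azuredatabricks.net",
        "<personal-access-token>",
        123,
        "main",
    )
    print(request.get_method(), request.full_url)
    # To actually send the request from a pipeline step:
    # with urllib.request.urlopen(request) as response:
    #     print(response.status)
```

Building the request separately from sending it keeps the example safe to run without a workspace and makes the URL and payload easy to inspect in a pipeline log.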
For information on the kinds of notebooks supported in Azure Databricks, see Export and import Databricks notebooks.
Supported Git providers
Databricks Repos are backed by an integrated Git repository. The repository can be hosted by any of the cloud and enterprise Git providers listed in the sections below.
What is a “Git provider”?
A “Git provider” is the specific (named) service that hosts a Git-based source control system. Git-based source control platforms are hosted in two ways: as a cloud service run by the company that develops it, or as an on-premises service installed and managed by your own company on its own hardware. Many Git providers, such as GitHub, Microsoft, GitLab, and Atlassian, offer both cloud-based SaaS and on-premises (sometimes called “self-managed”) Git services.
When choosing your Git provider during configuration, you must be aware of the differences between cloud (SaaS) and on-premises Git providers. On-premises solutions are typically hosted behind a company VPN and might not be accessible from the internet. Usually, the on-premises Git providers have a name ending in “Server” or “Self-Managed”, but if you are uncertain, contact your company admins or review the Git provider’s documentation.
If you are using “GitHub” as a provider and are still uncertain if you are using the cloud or on-premises version, see About GitHub Enterprise Server in the GitHub docs.
Cloud Git providers supported by Databricks
- GitHub, GitHub AE, and GitHub Enterprise Cloud
- Atlassian Bitbucket Cloud
- GitLab and GitLab EE
- Microsoft Azure DevOps (Azure Repos)
On-premises Git providers supported by Databricks
- GitHub Enterprise Server
- Atlassian Bitbucket Server and Data Center
- GitLab Self-Managed
- Microsoft Azure DevOps Server
If you are integrating an on-premises Git repo that is not accessible from the internet, a proxy for Git authentication requests must also be installed within your company’s VPN. For more details, see Git Server Proxy for Databricks Repos.
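The cloud versus on-premises distinction above can be sketched as a small lookup. This is only an illustration: the helper `classify_git_provider` and its return labels are hypothetical, the name sets are copied from the two lists above, and the suffix check encodes the "Server" or "Self-Managed" naming rule of thumb, which is a heuristic and not a guarantee.

```python
# Provider names taken from the two lists above, stored in lowercase.
CLOUD_PROVIDERS = {
    "github",
    "github ae",
    "github enterprise cloud",
    "bitbucket cloud",
    "gitlab",
    "gitlab ee",
    "azure devops",
}

ON_PREMISES_PROVIDERS = {
    "github enterprise server",
    "bitbucket server",
    "bitbucket data center",
    "gitlab self-managed",
    "azure devops server",
}

# Vendor prefixes that appear in the lists but are not part of the product name.
VENDOR_PREFIXES = ("atlassian ", "microsoft ")


def classify_git_provider(name: str) -> str:
    """Return 'cloud', 'on-premises', 'probably on-premises', or 'unknown'."""
    normalized = " ".join(name.split()).casefold()
    for prefix in VENDOR_PREFIXES:
        if normalized.startswith(prefix):
            normalized = normalized[len(prefix):]
    if normalized in ON_PREMISES_PROVIDERS:
        return "on-premises"
    if normalized in CLOUD_PROVIDERS:
        return "cloud"
    # Rule of thumb from the text: on-premises names usually end this way.
    if normalized.endswith(("server", "self-managed")):
        return "probably on-premises"
    return "unknown"


if __name__ == "__main__":
    for provider in ("GitHub", "Atlassian BitBucket Server", "GitLab Self-Managed"):
        print(provider, "->", classify_git_provider(provider))
```

An "on-premises" result only indicates which kind of provider is in use; whether a Git server proxy is needed still depends on whether the repository is reachable from the internet.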
To learn how to use access tokens with your Git provider, see Configure Git credentials & connect a remote repo to Azure Databricks.
Resources for Git integration
Use the Databricks CLI 2.0 for Git integration with Azure Databricks:
Read the following reference docs: