This page covers size limits, supported features, security considerations, and CI/CD behavior for Databricks Git folders. For general Databricks resource limits, see Resource limits. To learn about supported asset types in Git folders, see Supported asset types in Git folders.
File and repo limits
Azure Databricks doesn't enforce a limit on repository size. However, the following limits apply:
- Working branches are limited to 1 GB.
- You can't view files larger than 10 MB in the Azure Databricks UI.
- Each Git operation supports up to 2 GB of memory and 4 GB of disk writes.
- Individual workspace files have separate size limits. See Limitations.
Databricks recommends keeping the total number of workspace assets and files under 20,000.
Because limits apply per operation, cloning a 5 GB repository fails, but cloning a 3 GB repository and later adding 2 GB succeeds. If your repository exceeds these limits, you might receive an error or a timeout during cloning, though the operation might still complete in the background.
To work with larger repositories, try sparse checkout or Git CLI commands. To write temporary files that don't persist after cluster shutdown, use $TEMPDIR. This avoids exceeding branch size limits and offers better performance than writing to a working directory (CWD) in the workspace filesystem. See Where should I write temporary files on Azure Databricks?.
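A minimal sketch of the temporary-file recommendation, assuming a POSIX shell. On Databricks clusters `$TEMPDIR` is set for you; the `/tmp` fallback and the file name here are only so the sketch runs anywhere:

```shell
# Write scratch output to cluster-local temporary storage instead of the
# workspace working directory. On Databricks, $TEMPDIR points at local disk
# that disappears at cluster shutdown; the /tmp fallback is only for running
# this sketch outside Databricks. The file name is made up.
scratch="${TEMPDIR:-/tmp}"
out="$scratch/intermediate_results.csv"
printf 'id,value\n1,42\n' > "$out"
echo "wrote $out"
```

Files written this way never count against the 1 GB working-branch limit, because they live outside the workspace filesystem.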
Local branches can remain in the associated Git folder for up to 30 days after the remote branch is deleted. To remove a local branch completely, delete the repository.
Reduce repository size
If your repository exceeds size limits due to large files, adding them to .gitignore won't reduce repository size. Files already committed to Git remain in the repository history even when added to .gitignore.
To reduce repository size:
- Use Git tools like git filter-repo or BFG Repo-Cleaner to remove large files from commit history. This rewrites history and requires force-pushing to your remote repository.
- Clone only specific directories. See Configure sparse checkout mode.
- Move unrelated code to separate repositories.
For more information, see Removing sensitive data from a repository in the GitHub documentation.
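The history rewrite can be sketched end to end against a throwaway local repository. git filter-repo or BFG Repo-Cleaner are the recommended tools; this sketch uses git's built-in filter-branch only so it runs with a stock git install, and all file names are made up:

```shell
# Sketch: remove a large file from commit history in a throwaway repo.
set -e
repo="$(mktemp -d)" && cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo
dd if=/dev/zero of=big.bin bs=1024 count=256 2>/dev/null   # fake large file
git add big.bin && git commit -qm "add big.bin"
git rm -q big.bin && git commit -qm "remove big.bin"
# Deleting the file in a later commit still leaves its blob in history:
before=$(git rev-list --objects --all | grep -c big.bin || true)
# Rewrite every commit to drop it (with git filter-repo this would be:
#   git filter-repo --path big.bin --invert-paths)
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -f \
  --index-filter 'git rm --cached --ignore-unmatch big.bin' -- --all >/dev/null 2>&1
rm -rf .git/refs/original
git reflog expire --expire=now --all
git gc -q --prune=now
after=$(git rev-list --objects --all | grep -c big.bin || true)
echo "blobs named big.bin: before=$before after=$after"
# A rewritten history must then be force-pushed: git push --force --all
```

Only after the rewrite and garbage collection does the blob disappear from the object store, which is what actually shrinks the repository.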
Monorepo support
Databricks recommends against creating Git folders backed by monorepos—large, single-organization Git repositories with thousands of files across many projects. Cloning a monorepo can exceed Git folder memory and disk limits and slow Git operations. If your repository contains multiple projects, consider splitting it or using sparse checkout to limit which directories are cloned. See Configure sparse checkout mode.
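On the Git CLI side, sparse checkout looks roughly like the following sketch, which uses a throwaway local repository as a stand-in for the monorepo (git 2.25 or later; project names are made up). In Databricks you configure sparse checkout through the Git folder UI or API rather than these commands:

```shell
# Sketch: check out only one project directory from a multi-project repo.
set -e
work="$(mktemp -d)"
# Build a fake "monorepo" with two independent projects.
git init -q "$work/mono"
cd "$work/mono"
git config user.email demo@example.com
git config user.name demo
mkdir project-a project-b
echo "a" > project-a/main.py
echo "b" > project-b/main.py
git add . && git commit -qm "two projects"
# Clone without checking out, then restrict the checkout to one project.
cd "$work"
git clone -q --no-checkout mono clone
cd clone
git sparse-checkout init --cone
git sparse-checkout set project-a
git checkout -q "$(git symbolic-ref --short HEAD)"
ls
```

Only project-a lands in the working tree, so the checkout stays well under the per-branch and per-operation limits even when the full repository would not.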
Configuration
Not all standard Git features work in Git folders, and content is stored differently than in a local clone. The following topics explain how storage works, which servers are supported, and how features like .gitignore and submodules behave.
Repository content storage
Azure Databricks temporarily clones repository contents to disk in the control plane. Notebook files are stored in the control plane database, just as they are in the main workspace. Non-notebook files are stored on disk for up to 30 days.
On-premises and self-hosted Git servers
Databricks Git folders support GitHub Enterprise, Bitbucket Server, Azure DevOps Server, and GitLab Self-managed if the server is internet-accessible. See Git Proxy Server for Git folders for on-premises integration.
To integrate with a Bitbucket Server, GitHub Enterprise Server, or GitLab self-managed instance that isn't internet-accessible, contact your Azure Databricks account team.
Supported asset types
For details on supported asset types, see Supported asset types in Git folders.
.gitignore file support
Git folders support .gitignore files. To prevent Git from tracking a file, add the filename (including extension) to a .gitignore file. Either create one or use an existing file cloned from your remote repository.
.gitignore works only for untracked files. Adding an already-committed file to .gitignore doesn't remove it from Git history or reduce repository size. To remove committed files, see Reduce repository size.
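The distinction can be sketched with a throwaway repository; the file names are made up:

```shell
# Sketch: .gitignore stops tracking going forward, but blobs already
# committed stay in history.
set -e
repo="$(mktemp -d)" && cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo
echo "secret" > creds.txt
git add creds.txt && git commit -qm "accidentally commit creds.txt"
echo "creds.txt" > .gitignore
git rm -q --cached creds.txt          # stop tracking; keep the local copy
git add .gitignore && git commit -qm "ignore creds.txt"
# The working tree is now clean, yet the blob is still reachable in history:
in_history=$(git rev-list --objects --all | grep -c creds.txt || true)
echo "blobs still in history: $in_history"
```

Because the blob remains reachable from the first commit, the repository stays the same size until the history itself is rewritten.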
Git submodule support
Standard Git folders don't support Git submodules, but Git folders with Git CLI access can use them. See Use Git CLI commands (Beta).
Azure Data Factory support
Azure Data Factory (ADF) supports Git folders.
Source management
A few operations work differently in Git folders than in a standard Git workflow, particularly around notebooks and branch deletion.
Notebook dashboards and branch changes
Azure Databricks source format notebooks don't store dashboard information.
To preserve dashboards, change the notebook format to .ipynb (Jupyter format), which supports dashboard and visualization definitions by default. To preserve visualization data, commit the notebook with outputs.
See Manage IPYNB notebook output commits.
Branch merging support
Git folders support branch merging. You can also create a pull request and merge through your Git provider.
Deleting branches
To delete a branch, you must work in your Git provider.
Python dependency precedence
Python libraries in a Git folder take precedence over libraries stored elsewhere. For example, if a library is installed on your Databricks compute and a library with the same name exists in a Git folder, the Git folder library is imported. See Python library precedence.
Security, authentication, and tokens
Azure Databricks stores Git credentials in the control plane, not in your local environment. The following topics cover how Git folder contents are encrypted, how tokens are stored and audited, and what to do if you run into authentication issues.
Issue with a conditional access policy (CAP) for Microsoft Entra ID
You might get a "denied access" error when cloning a repository if:
- Your Azure Databricks workspace uses Azure DevOps with Microsoft Entra ID authentication.
- You've enabled a conditional access policy in Azure DevOps and a Microsoft Entra ID conditional access policy.
To resolve this, add an exclusion to the conditional access policy (CAP) for Azure Databricks IP addresses or users.
For more information, see Conditional access policies.
Allowlist with Microsoft Entra ID tokens
If you use Microsoft Entra ID for authenticating with Azure DevOps, the default allowlist restricts Git URLs to:
- dev.azure.com
- visualstudio.com
For more information, see Git URL allowlists.
Git folder encryption
Azure Databricks encrypts Git folder contents using a default key. Customer-managed keys are only supported for encrypting Git credentials.
GitHub token storage and access
- The Azure Databricks control plane stores authentication tokens. Databricks employees can access them only through audited temporary credentials.
- Azure Databricks logs token creation and deletion, but not usage. Git operation logging lets you audit token usage by the Azure Databricks application.
- GitHub Enterprise audits token usage. Other Git services might also offer server auditing.
GPG commit signing
Git folders don't support GPG signing of commits.
SSH support
Git folders support only HTTPS, not SSH.
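If an existing clone uses an SSH remote, point it at the HTTPS equivalent before connecting it as a Git folder. A minimal sketch, with placeholder org and repo names:

```shell
# Sketch: switch a remote from SSH to the HTTPS form Git folders require.
set -e
repo="$(mktemp -d)" && cd "$repo"
git init -q
git remote add origin git@github.com:example-org/example-repo.git   # SSH form
git remote set-url origin https://github.com/example-org/example-repo.git
git remote get-url origin
```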
Azure DevOps cross-tenancy errors
When connecting to Azure DevOps in a separate tenancy, you might see the error "Unable to parse credentials from Azure Active Directory account." If the Azure DevOps project is in a different Microsoft Entra ID tenancy than Azure Databricks, use an Azure DevOps access token. See Personal access token.
CI/CD and MLOps
If you run jobs against files in a Git folder, be aware of how Git operations can affect notebook state and MLflow experiments in ways that might not be obvious.
Incoming changes clear the notebook state
Git operations that alter notebook source code result in loss of notebook state, including cell outputs, comments, version history, and widgets. For example, git pull can change notebook source code, requiring Git folders to overwrite the existing notebook. Operations like git commit, push, or creating a new branch don't affect source code and preserve notebook state.
Important
MLflow experiments don't work in Git folders with Databricks Runtime 14.x or earlier.
MLflow experiments in Git folders
There are two types of MLflow experiments: workspace and notebook. See Organize training runs with MLflow experiments.
Workspace experiments: You can't create workspace MLflow experiments in Git folders. Log MLflow runs to an experiment created in a regular workspace folder. For multi-user collaboration, use a shared workspace folder.
Notebook experiments: You can create notebook experiments in a Databricks Git folder. If you check your notebook into source control as an .ipynb file, MLflow runs log to an automatically created experiment. Source control doesn't check in the experiment or its runs. See Create notebook experiment.
Prevent data loss in MLflow experiments
Notebook MLflow experiments created using Lakeflow Jobs with source code in a remote repository are stored in temporary storage. These experiments persist initially after workflow execution, but risk deletion during scheduled cleanup. Databricks recommends using workspace MLflow experiments with Jobs and remote Git sources.
Warning
Switching to a branch that doesn't contain the notebook risks losing the associated MLflow experiment data. This loss becomes permanent if you don't visit the prior branch within 30 days.
To recover missing experiment data before the 30-day expiry, restore the original notebook name, open the notebook, and click the experiment icon in the right pane. This triggers mlflow.get_experiment_by_name() and recovers the experiment and runs. After 30 days, Azure Databricks purges orphaned MLflow experiments for GDPR compliance.
To prevent data loss, avoid renaming notebooks in a repository. If you rename a notebook, immediately click the experiment icon in the right pane.
Running jobs during Git operations
During a Git operation, some notebooks might be updated while others aren't yet, causing unpredictable behavior.
For example, if notebook A calls notebook Z using %run and a job starts during a Git operation, the job might run the latest notebook A with an older notebook Z. The job might fail or run notebooks from different commits.
To avoid this, configure job tasks to use your Git provider as the source instead of a workspace path. See Use Git with Lakeflow Jobs.
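In the Jobs API, a job whose tasks check out source from the provider looks roughly like the following fragment. The job name, URL, and notebook path are placeholders, and cluster settings are omitted; the point is that git_source pins every task in a run to one consistent commit of the named branch:

```json
{
  "name": "nightly-etl",
  "git_source": {
    "git_url": "https://github.com/example-org/example-repo",
    "git_provider": "gitHub",
    "git_branch": "main"
  },
  "tasks": [
    {
      "task_key": "run-notebook",
      "notebook_task": {
        "notebook_path": "notebooks/etl",
        "source": "GIT"
      }
    }
  ]
}
```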
Next steps
- Troubleshoot Git folder errors
- Create and manage Git folders
- Set up Git integration for Git folders