Set up private Git connectivity for Azure Databricks Git folders

If you host a private Git server (such as GitHub Enterprise Server, Bitbucket Server, or GitLab self-managed) or your Git server is behind a firewall, you can use the Git server proxy to connect Databricks Git folders to your private repositories. The proxy routes Git commands from your Azure Databricks workspace through a compute resource to your private Git server.

About Git server proxy

Databricks Git server proxy for Git folders lets you proxy Git commands from your Azure Databricks workspace to a private Git server that isn't accessible on the internet.

Databricks Git folders represent your connected Git repositories as folders. The contents of these folders are version-controlled by syncing to the connected Git repository. By default, Git folders can only synchronize with repositories accessible on the internet. If you host a private Git server or your Git server is behind a firewall, you must use Git server proxy with Git folders. Your Git server must be accessible from your Azure Databricks compute plane.

How Git server proxy works

Git server proxy for Databricks Git folders proxies Git commands from the Databricks control plane to a proxy cluster running in your workspace's compute plane. The proxy cluster is configured to run a proxy service that receives Git commands from the Databricks control plane and forwards them to your Git server. Proxying doesn't affect the security architecture of your Databricks control plane.

The following illustrates the overall system architecture:

Diagram that shows how Git Server Proxy for Databricks Git folders is configured to run from a customer's compute plane

Important

Databricks provides an enablement notebook to configure your Git server instance to proxy commands for Databricks Git folders. Get the enablement notebook on GitHub. The Databricks Git server proxy is designed to work with the Databricks Runtime version included in the configuration notebook. Don't update the Databricks Runtime version of the proxy cluster.

Set up Git Server Proxy

To enable private Git connectivity for Databricks Git folders, prepare your Git server instance, run the enablement notebook to create the proxy, and validate your configuration.

To set up the Git server proxy:

  1. Prepare your Git server instance with static IP addresses and HTTPS transport.
  2. Run the enablement notebook to create the proxy cluster.
  3. Validate your configuration by cloning a repository.
  4. Have users configure their Git credentials, and then create proxy-enabled Git repositories.

Prerequisites

Before you enable the proxy, verify the following:

  • Your Git server instance is accessible from your Azure Databricks workspace's compute plane Azure virtual network (VNet), and has both HTTPS and personal access tokens (PATs) enabled.

Note

Git server proxy for Databricks works in all regions supported by your VNet.

Step 1: Prepare your Git server instance

Important

To complete this task, you must be a workspace admin with access rights to create a compute resource.

Configure your Git server to accept connections from the proxy cluster and enable HTTPS transport.

Your enterprise Git server typically has an allowlist of IP addresses from which access is permitted. To allow the proxy cluster driver node to access your Git server, associate a static outbound IP address for traffic originating from your proxy cluster and add it to your Git server's allowlist.

  1. Associate a static outbound IP address for traffic originating from your proxy cluster by using Azure Firewall or an egress appliance.
  2. Add the IP address from the previous step to your Git server's allowlist.

Then, configure your Git server instance to allow HTTPS transport:

  • GitHub Enterprise: See Which remote URL should I use in the GitHub Enterprise help.
  • Bitbucket Server: On the Bitbucket server administration page, click Server settings and select HTTP(S) enabled.
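
Once the allowlist and HTTPS transport are in place, a quick reachability check can help confirm that the Git server accepts connections from the compute plane. The following is a minimal sketch, assuming you run it in a notebook attached to compute in the same VNet. The function name `can_connect` and the hostname in the usage comment are illustrative and not part of any Databricks tooling.

```python
import socket


def can_connect(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the hostname and opens a TCP socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


# Example usage (hypothetical hostname; replace with your Git server):
#   can_connect("git.company.com", 443)
```

A `False` result only shows that the TCP connection failed. It doesn't distinguish between a DNS problem, a firewall rule, and a missing allowlist entry, so check each of those in turn.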

Step 2: Run the enablement notebook

To enable the proxy:

  1. Log into your Azure Databricks workspace as a workspace admin with access rights to create a cluster.

  2. Import this notebook, which chooses the smallest instance type available from your cloud provider to run the Git proxy:

    Notebook: Enable Git server proxy for Databricks Git folders for private Git server connectivity in Git folders.

  3. Click Run All to run the notebook, which performs the following tasks:

    • Creates a single-node compute resource named "Databricks Git Proxy" that doesn't auto-terminate. This proxy service processes and forwards Git commands from your Azure Databricks workspace to your private Git server.
    • Enables a feature flag that controls whether Git requests in Databricks Git folders are proxied through the compute instance.

    As a best practice, create a job to run the Git proxy compute resource on a regular schedule. This keeps the Git proxy service available for your users.

Note

Running an additional long-running compute resource incurs extra Databricks Units (DBUs). To minimize costs, the notebook configures the proxy to use a single-node compute resource with an inexpensive node type. Modify the compute options to suit your needs. For pricing information, see the Databricks pricing calculator.

Step 3: Validate your Git server configuration

To validate your Git server configuration, clone a repository hosted on your private Git server through the proxy cluster. A successful clone confirms that the Git server proxy is working for your workspace.
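
If you prefer to script the validation clone, one option is the Repos REST API (`POST /api/2.0/repos`). The sketch below only builds the request so that you can inspect it before sending. The helper name `build_create_repo_request`, the workspace URL, the token, and the repository path are placeholders, and the `provider` value must match your Git server type (for example `gitHubEnterprise` or `bitbucketServer`).

```python
import json
import urllib.request


def build_create_repo_request(workspace_url, token, git_url, provider, path):
    """Build (but do not send) a request that clones `git_url` into a Git folder at `path`."""
    body = json.dumps({"url": git_url, "provider": provider, "path": path}).encode("utf-8")
    return urllib.request.Request(
        url=f"{workspace_url.rstrip('/')}/api/2.0/repos",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Example usage (all values are placeholders):
#   req = build_create_repo_request(
#       "https://<workspace-url>", "<token>",
#       "https://git.company.com/org/repo-name.git",
#       "gitHubEnterprise", "/Repos/<user>/repo-name")
#   urllib.request.urlopen(req)  # raises urllib.error.HTTPError if the clone fails
```

If the call fails, the proxy log on the cluster is the first place to look.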

Step 4: Create proxy-enabled Git repositories

After users configure their Git credentials, no further steps are required to create or synchronize repositories. To configure credentials and access repositories programmatically, see Configure Git credentials & connect a remote repo to Azure Databricks.
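
As a rough illustration of the programmatic route, the sketch below assembles the JSON body for the Git credentials REST API (`POST /api/2.0/git-credentials`). The helper name `git_credential_payload` is hypothetical, and the provider string shown in the usage comment is one example. Treat this as a starting point and check the API reference for the values that apply to your Git server.

```python
def git_credential_payload(git_provider: str, git_username: str, personal_access_token: str) -> dict:
    """Return the JSON body for creating a Git credential; reject empty values early."""
    fields = {
        "git_provider": git_provider,
        "git_username": git_username,
        "personal_access_token": personal_access_token,
    }
    missing = [name for name, value in fields.items() if not value]
    if missing:
        raise ValueError(f"Missing required field(s): {', '.join(missing)}")
    return fields


# Example usage (placeholder values):
#   payload = git_credential_payload("gitHubEnterprise", "<username>", "<personal-access-token>")
#   Send `payload` as JSON to https://<workspace-url>/api/2.0/git-credentials
#   with an "Authorization: Bearer <token>" header.
```

Avoid hard-coding the personal access token in a notebook. Read it from a secret store instead.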

Remove global CAN ATTACH TO permissions

The Git server proxy doesn't require CAN ATTACH TO permission for any user. To prevent users from running arbitrary workloads on the proxy cluster, restrict cluster access control list (ACL) permissions on the proxy cluster:

  1. Click Compute in the sidebar, and then click the kebab menu next to the compute entry for the Git server proxy you're running.

  2. Click Edit permissions.

  3. From the dialog, remove the Can Attach To entry for All Users.
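
The same change can be sketched against the Permissions REST API. The helper below is hypothetical. It assumes the ACL shape returned by `GET /api/2.0/permissions/clusters/<cluster-id>` (entries with `all_permissions` lists) and assumes that the All Users group appears as `group_name: "users"`. It produces the list you would send back in the `access_control_list` field of a `PUT` request to the same endpoint, with the All Users CAN_ATTACH_TO grant removed.

```python
PRINCIPAL_KEYS = ("user_name", "group_name", "service_principal_name")


def acl_without_all_users_attach(current_acl):
    """Convert a GET-style ACL into a PUT-style ACL, dropping CAN_ATTACH_TO for 'users'."""
    new_acl = []
    for entry in current_acl:
        principal = {k: entry[k] for k in PRINCIPAL_KEYS if k in entry}
        for perm in entry.get("all_permissions", []):
            if perm.get("inherited"):
                # Inherited grants are not set on the cluster itself, so skip them.
                continue
            level = perm["permission_level"]
            if principal.get("group_name") == "users" and level == "CAN_ATTACH_TO":
                # This is the grant to remove.
                continue
            new_acl.append({**principal, "permission_level": level})
    return new_acl
```

Because a `PUT` replaces the direct permissions on the cluster, review the resulting list before you send it.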

Troubleshooting

This section covers common issues and how to diagnose them.

Checklist for common problems

Before you start diagnosing an error, confirm the following:

Run the rest of the debug notebook and capture the results. If you can't resolve the issue or don't see any failures reported, Databricks support can review the results. Export and send the debug notebook as a DBC archive if requested.

Change your Git proxy configuration

If your Git proxy service isn't working with the default configuration, set environment variables to support your network infrastructure.

Use the following environment variables to update the configuration for your Git proxy service:

| Environment variable | Format | Description |
| --- | --- | --- |
| GIT_PROXY_ENABLE_SSL_VERIFICATION | true/false | Set this to false if you are using a self-signed certificate for your private Git server. |
| GIT_PROXY_CA_CERT_PATH | File path (string) | Set this to the path to a CA certificate file used for SSL verification. Example: /FileStore/myCA.pem |
| GIT_PROXY_HTTP_PROXY | https://<hostname>:<port #> | Set this to the HTTPS URL for your network's firewall proxy for HTTP traffic. |
| GIT_PROXY_CUSTOM_HTTP_PORT | Port number (integer) | Set this to the port number assigned to your Git server's HTTP port. |

To set these environment variables:

  1. Go to the Compute tab in your Azure Databricks workspace.
  2. Select the compute configuration for your Git proxy service.
  3. At the bottom of the Configuration pane, expand Advanced and select the Spark tab.
  4. Add environment variables to the Environment variables field.
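
To reduce typos when you fill in the Environment variables field, you can generate the `KEY=VALUE` lines with a small helper. This is a minimal sketch. The function names are hypothetical, and the only variable names used are the ones listed in the table above.

```python
def git_proxy_env_vars(verify_ssl=None, ca_cert_path=None, http_proxy=None, custom_port=None):
    """Return a dict of Git proxy environment variables for the options that were provided."""
    env = {}
    if verify_ssl is not None:
        env["GIT_PROXY_ENABLE_SSL_VERIFICATION"] = "true" if verify_ssl else "false"
    if ca_cert_path is not None:
        env["GIT_PROXY_CA_CERT_PATH"] = ca_cert_path
    if http_proxy is not None:
        env["GIT_PROXY_HTTP_PROXY"] = http_proxy
    if custom_port is not None:
        port = int(custom_port)
        if not 1 <= port <= 65535:
            raise ValueError(f"Invalid port number: {custom_port}")
        env["GIT_PROXY_CUSTOM_HTTP_PORT"] = str(port)
    return env


def as_env_field(env):
    """Render the dict as one KEY=VALUE pair per line, sorted by key."""
    return "\n".join(f"{key}={value}" for key, value in sorted(env.items()))


# Example usage:
#   print(as_env_field(git_proxy_env_vars(ca_cert_path="/FileStore/myCA.pem", custom_port=8443)))
```

Paste the rendered lines into the Environment variables field, and then restart the proxy cluster so that the service picks them up.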

Inspect logs on the proxy cluster

The file at /databricks/git-proxy/git-proxy.log on the proxy cluster contains logs that are useful for debugging purposes.

The log file should start with Data-plane proxy server binding to ('', 8000)…. If it doesn't, the proxy server didn't start properly. Restart the cluster, or delete the cluster and run the enablement notebook again.

If the log file starts with this line, review the log statements that follow for each Git request initiated by Git operations in Databricks Git folders.

For example:

  do_GET: https://server-address/path/to/repo/info/refs?service=git-upload-pack 10.139.0.25 - - [09/Jun/2021 06:53:02] /
  "GET /server-address/path/to/repo/info/refs?service=git-upload-pack HTTP/1.1" 200

Error logs written to this file can be useful to help you or Databricks Support debug issues.
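
For longer logs, a short script can run the startup check and pull out failed requests. The sketch below assumes that request lines follow the pattern in the example above (a quoted `"METHOD path HTTP/x.y"` followed by a three-digit status code). The function names are illustrative, and the real log may contain other line formats that this pattern ignores.

```python
import re

STARTUP_PREFIX = "Data-plane proxy server binding to"
# Matches e.g.: "GET /server-address/path/info/refs?service=git-upload-pack HTTP/1.1" 200
REQUEST_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3})')


def proxy_started(log_text: str) -> bool:
    """Return True if the first non-empty log line is the expected startup message."""
    first = next((line for line in log_text.splitlines() if line.strip()), "")
    return first.strip().startswith(STARTUP_PREFIX)


def failed_requests(log_text: str):
    """Return (method, path, status) tuples for requests with a status code of 400 or higher."""
    return [
        (m["method"], m["path"], int(m["status"]))
        for m in REQUEST_RE.finditer(log_text)
        if int(m["status"]) >= 400
    ]


# Example usage on the proxy cluster:
#   with open("/databricks/git-proxy/git-proxy.log") as f:
#       text = f.read()
#   print(proxy_started(text), failed_requests(text))
```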

SSL certificate errors

You might see the following error:

  https://git.consult-prodigy.com/Prodigy/databricks_test: Secure connection to https://git.consult-prodigy.com/Prodigy/databricks_test could not be established because of SSL problems

This often means you're using a repository that requires special SSL certificates. Check the /databricks/git-proxy/git-proxy.log file on the proxy cluster. If certificate validation failed, add the certificate authority to the system certificate chain:

  1. Extract the root certificate using your browser or another method, and upload it to the Databricks File System (DBFS).
  2. Edit the Git proxy cluster configuration to set the GIT_PROXY_CA_CERT_PATH environment variable to the path of the root certificate file. See Environment variables.

After you complete these steps, restart the cluster.

Frequently asked questions

The following are common questions about Git server proxy configuration and usage.

How do I check if the Git proxy is running?

Import and run the Git proxy debug notebook. The results show if there are issues with the Git proxy service.

Can workspaces share proxy clusters?

Each Azure Databricks workspace requires its own proxy cluster. You can't share a proxy cluster across multiple workspaces, and each workspace can have only one Git proxy server cluster.

Can I route only some Git traffic through the proxy?

All Databricks Git folders-related traffic routes through the proxy cluster, even for public Git repositories. Your Azure Databricks workspace doesn't differentiate between proxied and non-proxied repositories.

Which Git providers are supported?

Databricks Git folders support GitHub Enterprise, Bitbucket Server, Azure DevOps Server, and GitLab self-managed. Other enterprise Git server providers should also work if they conform to common Git specifications.

Is GNU Privacy Guard (GPG) commit signing supported?

No.

Is SSH transport supported?

No. Only HTTPS is supported.

Can I use a non-default HTTPS port?

The enablement notebook assumes your Git server uses the default HTTPS port 443. Set the environment variable GIT_PROXY_CUSTOM_HTTP_PORT to use a different port.

Do users need to change Git URLs for the proxy?

No. Users enter the normal Git repository URL, such as https://git.company.com/org/repo-name.git. All Git traffic for Databricks Git folders routes through the proxy transparently.

How does authentication work with the proxy?

The proxy uses the user's Git credential to authenticate to the Git server. Access is restricted by the permissions specified in that credential.