Databricks SSH tunnel

Important

The Databricks SSH tunnel is in Beta.

The Databricks SSH tunnel allows you to connect your IDE to your Databricks compute. It is simple to set up, enables you to run and debug code interactively on the cluster, reduces environment mismatches, and keeps all code and data secure within your Databricks workspace.

Requirements

To use the SSH tunnel, you must have:

  • The Databricks CLI version 0.269 or higher installed on your local machine and authentication configured. See Install.
  • Compute in your Databricks workspace with dedicated (single user) access mode. See Dedicated compute overview.
    • The compute must be running Databricks Runtime 17.0 or above.
    • Unity Catalog must be enabled.
    • If a compute policy applies to the cluster, it must not prohibit running jobs.

Set up the SSH tunnel

First, set up the SSH tunnel using the databricks ssh setup command. Replace <connection-name> with a name for the tunnel, for example, my-tunnel.

databricks ssh setup --name <connection-name>

The CLI prompts you to choose a cluster, or you can provide a cluster ID by passing --cluster <cluster-id>.

Note

For IntelliJ, Databricks recommends that you include --auto-start-cluster=false in the setup command. Starting a JetBrains IDE automatically starts all clusters that have tunnels configured, which can result in unintended compute costs. If you set this option, you must start the cluster in the workspace before the SSH tunnel can connect.

Connect to Databricks

Next, connect to Databricks using an IDE or terminal.

Connect using Visual Studio Code or Cursor

  1. For Visual Studio Code, install the Remote - SSH extension. Cursor includes a remote SSH extension.

  2. In the IDE main menu, click View > Command Palette. Select Remote-SSH: Settings. Alternatively, select Preferences: Open User Settings (JSON) to modify settings.json directly.

  3. Under Remote.SSH: Default Extensions (or remote.SSH.defaultExtensions in settings.json), add ms-python.python and ms-toolsai.jupyter.

    If you are modifying settings.json:

    "remote.SSH.defaultExtensions": [
        "ms-Python.Python",
        "ms-toolsai.jupyter"
    ]
    

    Note

    Optionally, increase the value of Remote.SSH: Connect Timeout (or remote.SSH.connectTimeout in settings.json) to further reduce the chance of timeout errors. The default timeout is 360 seconds.

  4. In the Command Palette, select Remote-SSH: Connect to Host.

  5. From the dropdown, select the tunnel you set up in the first step. The IDE proceeds to connect in a new window.

    Note

    If the compute is not running, it will be started. However, if it takes longer than the timeout for the compute to start, the SSH connection attempt will fail.

  6. Select Linux when prompted for the platform of the remote host.

Connect using IntelliJ IDEs

  1. Follow the Remote development tutorial to get set up.

  2. On the new connection screen enter the following:

    Username: root
    Host: <connection-name>

Connect using terminal

To connect to Databricks from the command line, pass the name of your connection to the ssh command, for example:

ssh my-tunnel

Open projects

  1. The initial connection opens an empty IDE window without any open folder. In Visual Studio Code, use the Open Folder command from the Command Palette to open your project.
  2. Use the workspace mount (/Workspace/Users/<your-username>) for persistent storage.

Run code (Visual Studio Code)

  • If you open a Python project, the Python extension can automatically detect virtual environments, but you still need to manually select the right one. Run the Python: Select Interpreter command from the Command Palette and choose the pythonEnv-xxx environment. This environment has access to all built-in Databricks Runtime libraries, as well as anything you’ve installed globally on the cluster.
  • In some cases the Python extension can’t automatically detect virtual environments (venv), such as when you open a folder that can’t be recognized as a Python project. To fix this, open a terminal and execute echo $DATABRICKS_VIRTUAL_ENV, then copy the path and use it in the Python: Select Interpreter command.

After the venv is selected, Python files or notebooks can be executed with normal run or debug actions provided by the Python or Jupyter extensions.
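
To confirm that the right interpreter is active, you can run a quick check from a Python file or notebook cell. This is a minimal sketch; it only assumes that pyspark ships with the Databricks Runtime environment:

    import sys
    import pyspark

    print(sys.executable)       # should resolve inside the pythonEnv-xxx virtual environment
    print(pyspark.__version__)  # the PySpark version bundled with the Databricks Runtime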

Manage Python dependencies

The simplest way to install required dependencies is using the workspace UI. See Compute-scoped libraries. With this approach, you install dependencies globally for the cluster. You don't need to reinstall libraries each time the cluster is restarted.

However, for a more programmatic setup that is scoped to a specific project, use a notebook-scoped installation.

Project-specific setup notebook

To manage dependencies for a specific project:

  1. Create a setup.ipynb file in your project.

  2. The ssh CLI creates a Python environment (pythonEnv-xxx), which already includes the built-in Databricks Runtime libraries and any compute-scoped libraries. Attach the notebook to this pythonEnv-xxx environment.

  3. Use %pip install commands to install your dependencies:

    • %pip install . if you have a pyproject.toml (%pip install ".[<group>]" to install only a specific optional dependency group)
    • %pip install -r dependencies.txt if you have dependencies.txt
    • %pip install /Volumes/your/wheel.whl (or /Workspace paths) if you built and uploaded a custom library as a wheel

    %pip commands have Databricks-specific logic with additional guardrails. The logic also ensures that dependencies are available to all Spark executor nodes, not just the driver node that you are connected to. This enables user-defined functions (UDFs) with custom dependencies, as illustrated in the sketch below.

    For more usage examples, see Manage libraries with %pip commands.
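
For illustration, the following is a minimal sketch in which a hypothetical third-party package (unidecode, standing in for whatever you installed with %pip in the setup notebook) is used inside a UDF. Because %pip makes the package available on the executors, the import inside the function succeeds there as well:

    from databricks.sdk.runtime import spark
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    @udf(returnType=StringType())
    def strip_accents(s):
        # Imported inside the function so the module is resolved on the executor.
        from unidecode import unidecode
        return unidecode(s) if s else None

    # Purely illustrative data; the UDF runs on executor nodes, not the driver.
    spark.createDataFrame([("café",)], ["text"]).select(strip_accents("text")).show()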

Run this notebook every time you establish a new SSH session. You don’t need to reinstall dependencies if a dropped SSH session reconnects to the cluster within 10 minutes. (The time is configurable with the -shutdown-delay=10m option in your local SSH config.)

Note

If you have multiple ssh sessions connected to the same cluster at the same time, they use the same virtual environment.

Limitations

The Databricks SSH tunnel has the following limitations:

  • The Databricks extension for Visual Studio Code and the Databricks SSH tunnel are not yet compatible and should not be used together.
  • Any Git folder you created in your workspace through the Databricks workspace UI will not be recognized as a git repository by the git CLI and IDE git integrations, as these folders lack .git metadata. To work around this, see How do I use Git with the SSH Tunnel?.
  • The home and root directories on the cluster you connect to are ephemeral. Content stored there (outside /Workspace, /Volumes, and /dbfs) is not preserved when the cluster is restarted.

Databricks Notebooks differences

There are some differences in notebooks when using the SSH tunnel:

  • Python files don’t define any Databricks globals (like spark or dbutils). You must import them explicitly with from databricks.sdk.runtime import spark (see the sketch after this list).
  • For ipynb notebooks, these features are available:
    • Databricks globals: display, displayHTML, dbutils, table, sql, udf, getArgument, sc, sqlContext, spark
    • %sql magic command to execute SQL cells
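
For example, a minimal sketch of a plain .py file that uses Spark on the cluster. The import line is the one shown above; the query itself is only an illustration:

    # Plain .py files don't define Databricks globals; import them explicitly.
    from databricks.sdk.runtime import spark, dbutils

    # spark (and dbutils) now behave as they do in a notebook.
    spark.sql("SELECT current_catalog() AS catalog").show()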

To work with Python source “notebooks” (an example file follows these settings):

  • Search for jupyter.interactiveWindow.cellMarker.codeRegex and set it to:

    ^# COMMAND ----------|^# Databricks notebook source|^(#\\s*%%|#\\s*\\<codecell\\>|#\\s*In\\[\\d*?\\]|#\\s*In\\[ \\])
    
  • Search for jupyter.interactiveWindow.cellMarker.default and set it to:

    # COMMAND ----------
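
With these settings, a Python source notebook might look like the following sketch (cell contents are placeholders; the explicit import follows the guidance above):

    # Databricks notebook source
    from databricks.sdk.runtime import spark

    # COMMAND ----------
    # The marker above starts a new cell in the Jupyter interactive window.
    spark.range(5).show()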
    

Troubleshooting

This section contains information about resolving common issues.

SSH connection fails or times out

  • Make sure your cluster is Running in the Databricks UI and is not terminated or still starting.
  • Check outbound port 22 is open and allowed on your laptop/network/VPN.
  • Increase SSH connect timeout in your IDE. See Connect using Visual Studio Code or Cursor.
  • If you see public or private key mismatch errors, try deleting the ~/.databricks/ssh-tunnel-keys folder.
  • If you see "remote host identification has changed” errors, check the ~/.ssh/known_hosts file and delete the entries related to your cluster.
  • If the SSH session is dropped after 1 hour, this is a known limitation. See Limitations.
  • No more than 10 ssh connections are allowed to a single cluster.

CLI authentication errors

  • Confirm your Databricks CLI profile is valid and authenticated (databricks auth login).
  • Make sure you have proper cluster permissions, such as CAN MANAGE.

Files disappear or environment resets after cluster restart

  • Only /Workspace, /Volumes, and /dbfs mounts are persistent. All data in /home, /root, etc. is erased after a restart.
  • Use cluster library management for persistent dependencies. Automate reinstalls using init scripts if needed. See What are init scripts?.

"Not a git repository" error or missing git features in IDE

Git works only if you clone into /Workspace/Users/<your-username> using the terminal. Web-created folders don’t have .git metadata. See How do I use Git with the SSH Tunnel?.

My code doesn’t work

  • Make sure you select the right Python interpreter that has access to all Databricks Runtime dependencies.
    • If you open a Python project, the Python extension can automatically detect virtual environments, but you still need to manually select the right one. Run the Python: Select Interpreter command and choose the pythonEnv-xxx environment, which has access to all built-in Databricks Runtime libraries and anything you’ve installed globally on the cluster.
    • In some cases the Python extension can’t automatically detect virtual environments, such as when you open a folder that can’t be recognized as a Python project. You can open a terminal and execute echo $DATABRICKS_VIRTUAL_ENV, then copy the path and use it in the Python: Select Interpreter command.
  • IPYNB notebooks and *.py Databricks notebooks have access to Databricks globals, but Python *.py files don’t. See Databricks Notebooks differences.

Can’t set up an SSH connection on Windows under WSL

Databricks recommends performing the SSH setup directly on Windows. If you set it up inside WSL but then use the Windows version of Visual Studio Code, the IDE won’t find the necessary SSH configuration.

FAQ

How are my code and data secured?

All code runs within your Databricks workspace's virtual private cloud (VPC). No data or code leaves your secure environment. SSH traffic is fully encrypted.

Which IDEs are supported?

Visual Studio Code and Cursor are officially supported, but the Databricks SSH tunnel is compatible with any IDE with SSH capabilities.

Are all Databricks notebook features available from the IDE?

Some features such as display(), dbutils, and %sql are available with limitations or manual setup. See Databricks Notebooks differences.

Can multiple users develop on the same cluster at once?

No. The SSH tunnel requires compute with dedicated (single user) access mode, so only one user can use a given cluster at a time.

Will my cluster start automatically when I connect via SSH Tunnel?

Yes, but if it takes longer to start the cluster than the connect timeout, the connection attempt will fail.

How do I know if my cluster is running?

Navigate to Compute in the Databricks workspace UI, and check the status of the cluster. The cluster must show Running for SSH tunnel connections to work.

How do I disconnect my SSH/IDE session?

You can disconnect a session by closing your IDE window, using the Disconnect option in your IDE, closing your SSH terminal, or running the exit command in the terminal.

Does disconnecting SSH automatically stop my cluster?

No. The SSH server has a configurable shutdown delay and continues running in the background for the specified amount of time (10 minutes by default; change it by modifying the -shutdown-delay option in the ProxyCommand of your SSH config). After this timeout the server terminates, which triggers the cluster idle timeout that you configure during cluster creation.

How do I stop the cluster to avoid unnecessary charges?

Navigate to Compute in the Databricks workspace UI, find your cluster, and click Terminate or Stop.

How should I handle persistent dependencies?

Dependencies installed during a session are lost after cluster restart. Use persistent storage (/Workspace/Users/<your-username>) for requirements and setup scripts. Use cluster libraries or init scripts for automation.
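
For example, a setup notebook cell can install from a requirements file kept in your persistent workspace folder (the path below is a placeholder):

    %pip install -r /Workspace/Users/<your-username>/<project>/requirements.txt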

What authentication methods are supported?

Authentication uses the Databricks CLI and your ~/.databrickscfg profiles file. SSH keys are handled by the Databricks SSH tunnel.

Can I connect to external databases or services from the cluster?

Yes, as long as your cluster networking allows outbound connections and you have the necessary libraries.

Can I use additional IDE extensions?

Most extensions work when installed within your remote SSH session, depending on your IDE and cluster. Visual Studio Code by default doesn’t install local extensions on remote hosts. You can manually install them by opening the extensions panel and enabling your local extensions on the remote host. You can also configure Visual Studio Code to always install certain extensions remotely. See Connect to Databricks.

How do I use Git with the SSH Tunnel?

Currently Git folders created using the Databricks workspace UI are not recognized as git repositories in IDEs. To work around this, clone repositories using the git CLI from your SSH session into your persistent workspace folder:

  1. Open a terminal and navigate to the desired parent directory (for example, cd /Workspace/Users/<your-username>).
  2. Clone your repository in that directory.
  3. In Visual Studio Code, open this folder in a new window by running code <repo-name> or open the folder in a new window using the UI.