dbx by Databricks Labs

Important

This documentation has been retired and might not be updated.

Databricks recommends that you use Databricks Asset Bundles instead of dbx by Databricks Labs. See What are Databricks Asset Bundles? and Migrate from dbx to bundles.

Note

This article covers dbx by Databricks Labs, which is provided as-is and is not supported by Databricks through customer technical support channels. Questions and feature requests can be communicated through the Issues page of the databrickslabs/dbx repo on GitHub.

dbx by Databricks Labs is an open source tool designed to extend the legacy Databricks command-line interface (Databricks CLI) and to provide functionality for rapid development lifecycles and continuous integration and continuous delivery/deployment (CI/CD) on the Azure Databricks platform.

dbx simplifies job launch and deployment across multiple environments. It also helps package your project and deliver it to your Azure Databricks environment in a versioned fashion. Designed in a CLI-first manner, it is built to be used both inside CI/CD pipelines and as part of local tooling, such as IDEs including Visual Studio Code and PyCharm.
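
For orientation, the following is a minimal sketch of the core dbx commands that this article walks through. The workflow name dbx-demo-job and the cluster ID are placeholders taken from the examples later in this article; the exact options depend on your project.

dbx configure --environment default                                         # Create the .dbx project metadata for an environment.
dbx execute --cluster-id=<existing-cluster-id> dbx-demo-job --no-package    # Batch run code on an existing all-purpose cluster.
dbx deploy --no-package                                                     # Deploy workflow definitions and artifacts to the workspace.
dbx launch dbx-demo-job                                                     # Run the deployed workflow on a jobs cluster.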

The typical development workflow with dbx is:

  1. Create a remote repository with a Git provider Databricks supports, if you do not have a remote repo available already.

  2. Clone your remote repo into your Azure Databricks workspace.

  3. Create or move an Azure Databricks notebook into the cloned repo in your Azure Databricks workspace. Use this notebook to begin prototyping the code that you want your Azure Databricks clusters to run.

  4. To enhance and modularize your notebook code by adding separate helper classes and functions, configuration files, and tests, switch over to using a local development machine with dbx, your preferred IDE, and Git installed.

  5. Clone your remote repo to your local development machine.

  6. Move your code out of your notebook into one or more local code files.

  7. As you code locally, push your work from your local repo to your remote repo. Also, sync your remote repo with your Azure Databricks workspace.

    Tip

    Alternatively, you can use dbx sync to automatically synchronize local file changes with corresponding files in your workspace, in real time.

  8. Keep using the notebook in your Azure Databricks workspace for rapid prototyping, and keep moving validated code from your notebook to your local machine. Keep using your local IDE for tasks such as code modularization, code completion, linting, unit testing, and step-through debugging of code and objects that do not require a live connection to Azure Databricks.

  9. Use dbx to batch run your local code on your target clusters, as desired. (This is similar to running the spark-submit script in Spark’s bin directory to launch applications on a Spark cluster.)

  10. When you are ready for production, use a CI/CD platform such as GitHub Actions, Azure DevOps, or GitLab to automate running your remote repo’s code on your clusters.

Requirements

To use dbx, you must have the following installed on your local development machine, regardless of whether your code uses Python, Scala, or Java:

  • Python version 3.8 or above.

    If your code uses Python, you should use a version of Python that matches the one that is installed on your target clusters. To get the version of Python that is installed on an existing cluster, you can use the cluster’s web terminal to run the python --version command. See also the “System environment” section in the Databricks Runtime release notes versions and compatibility for the Databricks Runtime version for your target clusters.

  • pip.

  • If your code uses Python, a method to create Python virtual environments to ensure you are using the correct versions of Python and package dependencies in your dbx projects. This article covers pipenv.

  • dbx version 0.8.0 or above. You can install this package from the Python Package Index (PyPI) by running pip install dbx.

    To confirm that dbx is installed, run the following command:

    dbx --version
    

    If the version number is returned, dbx is installed.

    If the version number is below 0.8.0, upgrade dbx by running the following command, and then check the version number again:

    pip install dbx --upgrade
    dbx --version
    
    # Or ...
    python -m pip install dbx --upgrade
    dbx --version
    
  • The Databricks CLI version 0.18 or below, set up with authentication. The legacy Databricks CLI (Databricks CLI version 0.17) is automatically installed when you install dbx. This authentication can be set up on your local development machine in one or both of the following locations:

    • Within the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables (starting with legacy Databricks CLI version 0.8.0).
    • In an Azure Databricks configuration profile within your .databrickscfg file.

    dbx looks for authentication credentials in these two locations, in that order, and uses only the first set of matching credentials that it finds. (For a sketch of both approaches, see the example after this list.)

    Note

    dbx does not support the use of a .netrc file for authentication, beginning with legacy Databricks CLI version 0.17.2. To check your installed legacy Databricks CLI version, run the command databricks --version.

  • git for pushing and syncing local and remote code changes.
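
For example, a minimal sketch of the two authentication setups, assuming a hypothetical workspace URL and a placeholder for an Azure Databricks personal access token:

# Option 1: environment variables (Linux and macOS shell syntax shown).
export DATABRICKS_HOST="https://adb-1234567890123456.7.azuredatabricks.net"
export DATABRICKS_TOKEN="<personal-access-token>"

# Option 2: a DEFAULT configuration profile in ~/.databrickscfg, with contents such as:
# [DEFAULT]
# host  = https://adb-1234567890123456.7.azuredatabricks.net
# token = <personal-access-token>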

Continue with the instructions for one of the following IDEs:

  • Visual Studio Code
  • PyCharm
  • IntelliJ IDEA
  • Eclipse

Note

Databricks has validated usage of the preceding IDEs with dbx; however, dbx should work with any IDE. You can also use No IDE (terminal only).

dbx is optimized to work with single-file Python code files and compiled Scala and Java JAR files. dbx does not work with single-file R code files or compiled R code packages. This is because dbx works with the Jobs API 2.0 and 2.1, and these APIs cannot run single-file R code files or compiled R code packages as jobs.

Visual Studio Code

Complete the following instructions to begin using Visual Studio Code and Python with dbx.

On your local development machine, you must have Visual Studio Code and the Python extension for Visual Studio Code installed in addition to the general requirements.

Follow these steps to begin setting up your dbx project structure:

  1. From your terminal, create a blank folder. These instructions use a folder named dbx-demo. You can give your dbx project’s root folder any name you want. If you use a different name, replace the name throughout these steps. After you create the folder, switch to it, and then start Visual Studio Code from that folder.

    For Linux and macOS:

    mkdir dbx-demo
    cd dbx-demo
    code .
    

    Tip

    If command not found: code displays after you run code ., see Launching from the command line on the Microsoft website.

    For Windows:

    md dbx-demo
    cd dbx-demo
    code .
    
  2. In Visual Studio Code, create a Python virtual environment for this project:

    1. On the menu bar, click View > Terminal.

    2. From the root of the dbx-demo folder, run the pipenv command with the following option, where <version> is the target version of Python that you already have installed locally (and, ideally, a version that matches your target clusters’ version of Python), for example 3.8.14.

      pipenv --python <version>
      

      Make a note of the Virtualenv location value in the output of the pipenv command, as you will need it in the next step.

  3. Select the target Python interpreter, and then activate the Python virtual environment:

    1. On the menu bar, click View > Command Palette, type Python: Select, and then click Python: Select Interpreter.
    2. Select the Python interpreter within the path to the Python virtual environment that you just created. (This path is listed as the Virtualenv location value in the output of the pipenv command.)
    3. On the menu bar, click View > Command Palette, type Terminal: Create, and then click Terminal: Create New Terminal.

    For more information, see Using Python environments in VS Code in the Visual Studio Code documentation.

  4. Continue with Create a dbx project.

PyCharm

Complete the following instructions to begin using PyCharm and Python with dbx.

On your local development machine, you must have PyCharm installed in addition to the general requirements.

Follow these steps to begin setting up your dbx project structure:

  1. In PyCharm, on the menu bar, click File > New Project.
  2. In the Create Project dialog, choose a location for your new project.
  3. Expand Python interpreter: New Pipenv environment.
  4. Select New environment using, if it is not already selected, and then select Pipenv from the drop-down list.
  5. For Base interpreter, select the location that contains the Python interpreter for the target version of Python that you already have installed locally (and, ideally, a version that matches your target clusters’ version of Python).
  6. For Pipenv executable, select the location that contains your local installation of pipenv, if it is not already auto-detected.
  7. If you want to create a minimal dbx project, and you want to use the main.py file with that minimal dbx project, then select the Create a main.py welcome script box. Otherwise, clear this box.
  8. Click Create.
  9. In the Project tool window, right-click the project’s root folder, and then click Open in > Terminal.
  10. Continue with Create a dbx project.

IntelliJ IDEA

Complete the following instructions to begin using IntelliJ IDEA and Scala with dbx. These instructions create a minimal sbt-based Scala project that you can use to start a dbx project.

On your local development machine, you must have the following installed in addition to the general requirements:

  • IntelliJ IDEA.
  • The Scala plugin for IntelliJ IDEA. For more information, see Discover IntelliJ IDEA for Scala in the IntelliJ IDEA documentation.
  • Java Runtime Environment (JRE) 8. While any edition of JRE 8 should work, Databricks has so far only validated usage of dbx and IntelliJ IDEA with the OpenJDK 8 JRE. Databricks has not yet validated usage of dbx with IntelliJ IDEA and Java 11. For more information, see Java Development Kit (JDK) in the IntelliJ IDEA documentation.

Follow these steps to begin setting up your dbx project structure:

Step 1: Create an sbt-based Scala project

  1. In IntelliJ IDEA, depending on your view, click Projects > New Project or File > New > Project.
  2. In the New Project dialog, click Scala, click sbt, and then click Next.
  3. Enter a project name and a location for the project.
  4. For JDK, select your installation of the OpenJDK 8 JRE.
  5. For sbt, choose the highest available version of sbt that is listed.
  6. For Scala, ideally, choose the version of Scala that matches your target clusters’ version of Scala. See the “System environment” section in the Databricks Runtime release notes versions and compatibility for the Databricks Runtime version for your target clusters.
  7. Next to Scala, select the Sources box if it is not already selected.
  8. Add a package prefix to Package Prefix. These steps use the package prefix com.example.demo. If you specify a different package prefix, replace the package prefix throughout these steps.
  9. Click Finish.

Step 2: Add an object to the package

You can add any required objects to your package. This package contains a single object named SampleApp.

  1. In the Project tool window (View > Tool Windows > Project), right-click the project-name > src > main > scala folder, and then click New > Scala Class.

  2. Choose Object, type the object’s name, and then press Enter. For example, type SampleApp. If you enter a different object name here, be sure to replace the name throughout these steps.

  3. Replace the contents of the SampleApp.scala file with the following code:

    package com.example.demo
    
    object SampleApp {
      def main(args: Array[String]) {
      }
    }
    

Step 3: Build the project

Add any required build settings and dependencies to your project. This step assumes that you are building the project that was set up in the previous steps and that it depends only on the following libraries.

  1. Replace the contents of the project’s build.sbt file with the following content:

    ThisBuild / version := "0.1.0-SNAPSHOT"
    
    ThisBuild / scalaVersion := "2.12.14"
    
    val sparkVersion = "3.2.1"
    
    lazy val root = (project in file("."))
      .settings(
        name := "dbx-demo",
        idePackagePrefix := Some("com.example.demo"),
        libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion withSources(),
        libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkVersion withSources(),
        libraryDependencies += "org.apache.spark" %% "spark-hive" % sparkVersion withSources()
      )
    

    In the preceding file, replace:

    • 2.12.14 with the version of Scala that you chose earlier for this project.
    • 3.2.1 with the version of Spark that you chose earlier for this project.
    • dbx-demo with the name of your project.
    • com.example.demo with the name of your package prefix.
  2. On the menu bar, click View > Tool Windows > sbt.

  3. In the sbt tool window, right-click the name of your project, and click Reload sbt Project. Wait until sbt finishes downloading the project’s dependencies from an Internet artifact store (through Coursier or Ivy by default, depending on your version of sbt). You can watch the download progress in the status bar. If you add or change any dependencies in this project, you must repeat this reloading step for each set of dependencies that you add or change.

  4. On the menu bar, click IntelliJ IDEA > Preferences.

  5. In the Preferences dialog, click Build, Execution, Deployment > Build Tools > sbt.

  6. In JVM, for JRE, select your installation of the OpenJDK 8 JRE.

  7. In sbt projects, select the name of your project.

  8. In sbt shell, select builds.

  9. Click OK.

  10. On the menu bar, click Build > Build Project. The build’s results appear in the sbt shell tool window (View > Tool Windows > sbt shell).

Step 4: Add code to the project

Add any required code to your project. This step assumes that you only want to add code to the SampleApp.scala file in the example package.

In the project’s src > main > scala > SampleApp.scala file, add the code that you want dbx to batch run on your target clusters. For basic testing, use the example Scala code in the section Code example.

Step 5: Run the project

  1. On the menu bar, click Run > Edit Configurations.
  2. In the Run/Debug Configurations dialog, click the + (Add New Configuration) icon, or Add new, or Add new run configuration.
  3. In the drop-down, click sbt Task.
  4. For Name, enter a name for the configuration, for example, Run the program.
  5. For Tasks, enter ~run.
  6. Select Use sbt shell.
  7. Click OK.
  8. On the menu bar, click Run > Run ‘Run the program’. The run’s results appear in the sbt shell tool window.

Step 6: Build the project as a JAR

You can add any JAR build settings to your project that you want. This step assumes that you only want to build a JAR that is based on the project that was set up in the previous steps.

  1. On the menu bar, click File > Project Structure.
  2. In the Project Structure dialog, click Project Settings > Artifacts.
  3. Click the + (Add) icon.
  4. In the drop-down list, select JAR > From modules with dependencies.
  5. In the Create JAR from Modules dialog, for Module, select the name of your project.
  6. For Main Class, click the folder icon.
  7. In the Select Main Class dialog, on the Search by Name tab, select SampleApp, and then click OK.
  8. For JAR files from libraries, select copy to the output directory and link via manifest.
  9. Click OK to close the Create JAR from Modules dialog.
  10. Click OK to close the Project Structure dialog.
  11. On the menu bar, click Build > Build Artifacts.
  12. In the context menu that appears, select project-name:jar > Build. Wait while sbt builds your JAR. The build’s results appear in the Build Output tool window (View > Tool Windows > Build).

The JAR is built to the project’s out > artifacts > <project-name>_jar folder. The JAR’s name is <project-name>.jar.

Step 7: Display the terminal in the IDE

With your dbx project structure now in place, you are ready to create your dbx project.

Display the IntelliJ IDEA terminal by clicking View > Tool Windows > Terminal on the menu bar, and then continue with Create a dbx project.

Eclipse

Complete the following instructions to begin using Eclipse and Java with dbx. These instructions create a minimal Maven-based Java project that you can use to start a dbx project.

On your local development machine, you must have the following installed in addition to the general requirements:

  • A version of Eclipse. These instructions use the Eclipse IDE for Java Developers edition of the Eclipse IDE.
  • An edition of the Java Runtime Environment (JRE) or Java Development Kit (JDK) 11, depending on your local machine’s operating system. While any edition of JRE or JDK 11 should work, Databricks has so far only validated usage of dbx and the Eclipse IDE for Java Developers with Eclipse 2022-03 R, which includes AdoptOpenJDK 11.

Follow these steps to begin setting up your dbx project structure:

Step 1: Create a Maven-based Java project

  1. In Eclipse, click File > New > Project.
  2. In the New Project dialog, expand Maven, select Maven Project, and click Next.
  3. In the New Maven Project dialog, select Create a simple project (skip archetype selection), and click Next.
  4. For Group Id, enter a group ID that conforms to Java’s package name rules. These steps use the package name of com.example.demo. If you enter a different group ID, substitute it throughout these steps.
  5. For Artifact Id, enter a name for the JAR file without the version number. These steps use the JAR name of dbx-demo. If you enter a different name for the JAR file, substitute it throughout these steps.
  6. Click Finish.

Step 2: Add a class to the package

You can add any classes to your package that you want. This package will contain a single class named SampleApp.

  1. In the Project Explorer view (Window > Show View > Project Explorer), select the project-name project icon, and then click File > New > Class.
  2. In the New Java Class dialog, for Package, enter com.example.demo.
  3. For Name, enter SampleApp.
  4. For Modifiers, select public.
  5. Leave Superclass blank.
  6. For Which method stubs would you like to create, select public static void main(String[] args).
  7. Click Finish.

Step 3: Add dependencies to the project

  1. In the Project Explorer view, double-click project-name > pom.xml.

  2. Add the following dependencies as a child element of the <project> element, and then save the file:

    <dependencies>
      <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.12</artifactId>
        <version>3.2.1</version>
        <scope>provided</scope>
      </dependency>
      <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.12</artifactId>
        <version>3.2.1</version>
      </dependency>
      <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-hive_2.12</artifactId>
        <version>3.2.1</version>
        <scope>provided</scope>
      </dependency>
    </dependencies>
    

    Replace:

    • 2.12 with your target clusters’ version of Scala.
    • 3.2.1 with your target clusters’ version of Spark.

    See the “System environment” section in the Databricks Runtime release notes versions and compatibility for the Databricks Runtime version for your target clusters.

Step 4: Compile the project

  1. In the project’s pom.xml file, add the following Maven compiler properties as a child element of the <project> element, and then save the file:

    <properties>
      <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
      <maven.compiler.source>1.6</maven.compiler.source>
      <maven.compiler.target>1.6</maven.compiler.target>
    </properties>
    
  2. In the Project Explorer view, right-click the project-name project icon, and then click Run As > Run Configurations.

  3. In the Run Configurations dialog, click Maven Build.

  4. Click the New launch configuration icon.

  5. Enter a name for this launch configuration, for example clean compile.

  6. For Base directory, click Workspace, choose your project’s directory, and click OK.

  7. For Goals, enter clean compile.

  8. Click Run. The run’s output appears in the Console view (Window > Show View > Console).

Step 5: Add code to the project

You can add any code to your project that you want. This step assumes that you only want to add code to a file named SampleApp.java for a package named com.example.demo.

In the project’s src/main/java > com.example.demo > SampleApp.java file, add the code that you want dbx to batch run on your target clusters. (If you do not have any code handy, you can use the Java code in the Code example, listed toward the end of this article.)

Step 6: Run the project

  1. In the Project Explorer view, right-click the project-name project icon, and then click Run As > Run Configurations.
  2. In the Run Configurations dialog, expand Java Application, and then click App.
  3. Click Run. The run’s output appears in the Console view.

Step 7: Build the project as a JAR

  1. In the Project Explorer view, right-click the project-name project icon, and then click Run As > Run Configurations.
  2. In the Run Configurations dialog, click Maven Build.
  3. Click the New launch configuration icon.
  4. Enter a name for this launch configuration, for example clean package.
  5. For Base directory, click Workspace, choose your project’s directory, and click OK.
  6. For Goals, enter clean package.
  7. Click Run. The run’s output appears in the Console view.

The JAR is built to the <project-name> > target folder. The JAR’s name is <project-name>-0.0.1-SNAPSHOT.jar.

Note

If the JAR does not appear in the target folder in the Project Explorer view at first, you can try to display it by right-clicking the project-name project icon and then clicking Refresh.

Step 8: Display the terminal in the IDE

With your dbx project structure now in place, you are ready to create your dbx project. To start, set the Project Explorer view to show the hidden files (files starting with a dot (.)) that dbx generates, as follows:

  1. In the Project Explorer view, click the ellipses (View Menu) filter icon, and then click Filters and Customization.
  2. In the Filters and Customization dialog, on the Pre-set filters tab, clear the .* resources box.
  3. Click OK.

Next, display the Eclipse terminal as follows:

  1. Click Window > Show View > Terminal on the menu bar.
  2. If the terminal’s command prompt does not appear, in the Terminal view, click the Open a Terminal icon.
  3. Use the cd command to switch to your project’s root directory.
  4. Continue with Create a dbx project.

No IDE (terminal only)

Complete the following instructions to begin using a terminal and Python with dbx.

Follow these steps to use a terminal to begin setting up your dbx project structure:

  1. From your terminal, create a blank folder. These instructions use a folder named dbx-demo (but you can give your dbx project’s root folder any name you want). After you create the folder, switch to it.

    For Linux and macOS:

    mkdir dbx-demo
    cd dbx-demo
    

    For Windows:

    md dbx-demo
    cd dbx-demo
    
  2. Create a Python virtual environment for this project by running the pipenv command, with the following option, from the root of the dbx-demo folder, where <version> is the target version of Python that you already have installed locally, for example 3.8.14.

    pipenv --python <version>
    
  3. Activate your Python virtual environment by running pipenv shell.

    pipenv shell
    
  4. Continue with Create a dbx project.

Create a dbx project

With your dbx project structure in place from one of the previous sections, you are now ready to create one of the following types of projects:

  • A minimal dbx project for Python
  • A minimal dbx project for Scala or Java
  • A dbx templated project for Python with CI/CD support

Create a minimal dbx project for Python

The following minimal dbx project is the simplest and fastest approach to getting started with Python and dbx. It demonstrates batch running of a single Python code file on an existing Azure Databricks all-purpose cluster in your Azure Databricks workspace.

Note

To create a dbx templated project for Python that demonstrates batch running of code on all-purpose clusters and jobs clusters, remote code artifact deployments, and CI/CD platform setup, skip ahead to Create a dbx templated project for Python with CI/CD support.

To complete this procedure, you must have an existing all-purpose cluster in your workspace. (See View compute or Compute configuration reference.) Ideally (but not required), the version of Python in your Python virtual environment should match the version that is installed on this cluster. To identify the version of Python on the cluster, use the cluster’s web terminal to run the command python --version.

python --version

To create a minimal dbx project for Python:

  1. From your terminal, from your dbx project’s root folder, run the dbx configure command with the following option. This command creates a hidden .dbx folder within your dbx project’s root folder. This .dbx folder contains lock.json and project.json files.

    dbx configure --profile DEFAULT --environment default
    

    Note

    The project.json file defines an environment named default along with a reference to the DEFAULT profile within your .databrickscfg file. If you want dbx to use a different profile, replace --profile DEFAULT with --profile followed by your target profile’s name, in the dbx configure command.

    For example, if you have a profile named DEV within your .databrickscfg file and you want dbx to use it instead of the DEFAULT profile, your project.json file might look like this instead, in which case you would also replace --environment default with --environment dev in the dbx configure command:

    {
      "environments": {
        "default": {
          "profile": "DEFAULT",
          "storage_type": "mlflow",
          "properties": {
            "workspace_directory": "/Workspace/Shared/dbx/projects/<current-folder-name>",
            "artifact_location": "dbfs:/dbx/<current-folder-name>"
          }
        },
        "dev": {
          "profile": "DEV",
          "storage_type": "mlflow",
          "properties": {
            "workspace_directory": "/Workspace/Shared/dbx/projects/<some-other-folder-name>",
            "artifact_location": "dbfs:/dbx/<some-other-folder-name>"
          }
        }
      }
    }
    

    If you want dbx to use the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables instead of a profile in your .databrickscfg file, then leave out the --profile option altogether from the dbx configure command.

  2. Create a folder named conf within your dbx project’s root folder.

    For Linux and macOS:

    mkdir conf
    

    For Windows:

    md conf
    
  3. Add a file named deployment.yaml to the conf directory, with the following contents:

    build:
      no_build: true
    environments:
      default:
        workflows:
          - name: "dbx-demo-job"
            spark_python_task:
              python_file: "file://dbx-demo-job.py"
    

    Note

    The deployment.yaml file contains the lower-cased word default, which is a reference to the upper-cased DEFAULT profile within your .databrickscfg file. If you want dbx to use a different profile, replace default with your target profile’s name.

    For example, if you have a profile named DEV within your .databrickscfg file and you want dbx to use it instead of the DEFAULT profile, your deployment.yaml file might look like this instead:

    environments:
      default:
        workflows:
          - name: "dbx-demo-job"
            spark_python_task:
              python_file: "file://dbx-demo-job.py"
      dev:
        workflows:
          - name: "<some-other-job-name>"
            spark_python_task:
              python_file: "file://<some-other-filename>.py"
    

    If you want dbx to use the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables instead of a profile in your .databrickscfg file, then leave default in the deployment.yaml as is. dbx will use this reference by default.

    Tip

    To add Spark configuration key-value pairs to a job, use the spark_conf field, for example:

    environments:
      default:
        workflows:
          - name: "dbx-demo-job"
            spark_conf:
              spark.speculation: true
              spark.streaming.ui.retainedBatches: 5
              spark.driver.extraJavaOptions: "-verbose:gc -XX:+PrintGCDetails"
            # ...
    

    To add permissions to a job, use the access_control_list field, for example:

    environments:
      default:
        workflows:
          - name: "dbx-demo-job"
            access_control_list:
              - user_name: "someone@example.com"
                permission_level: "IS_OWNER"
              - group_name: "some-group"
                permission_level: "CAN_VIEW"
            # ...
    

    Note that the access_control_list field must be exhaustive, so the job’s owner must be added to the list along with any other user and group permissions.

  4. Add the code to run on the cluster to a file named dbx-demo-job.py and add the file to the root folder of your dbx project. (If you do not have any code handy, you can use the Python code in the Code example, listed toward the end of this article.)

    Note

    You do not have to name this file dbx-demo-job.py. If you choose a different file name, be sure to update the python_file field in the conf/deployment.yaml file to match.

  5. Run the dbx execute command with the following options. In this command, replace <existing-cluster-id> with the ID of the target cluster in your workspace. (To get the ID, see Cluster URL and ID.)

    dbx execute --cluster-id=<existing-cluster-id> dbx-demo-job --no-package
    
  6. To view the run’s results locally, see your terminal’s output. To view the run’s results on your cluster, go to the Standard output pane in the Driver logs tab for your cluster. (See Compute driver and worker logs.)

  7. Continue with Next steps.
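
For reference, after you complete these steps, the minimal Python project’s local layout should look roughly like the following, assuming the default names used in this section:

dbx-demo/
├── .dbx/
│   ├── lock.json
│   └── project.json
├── conf/
│   └── deployment.yaml
└── dbx-demo-job.py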

Create a minimal dbx project for Scala or Java

The following minimal dbx project is the simplest and fastest approach to getting started with dbx and Scala or Java. It demonstrates deploying a single Scala or Java JAR to your Azure Databricks workspace and then running that deployed JAR on an Azure Databricks jobs cluster in your Azure Databricks workspace.

Note

Azure Databricks limits how you can run Scala and Java code on clusters:

  • You cannot run a single Scala or Java file as a job on a cluster as you can with a single Python file. To run Scala or Java code, you must first build it into a JAR.
  • You can run a JAR as a job on an existing all-purpose cluster. However, you cannot reinstall any updates to that JAR on the same all-purpose cluster. In this case, you must use a jobs cluster instead. This section uses the jobs cluster approach.
  • You must first deploy the JAR to your Azure Databricks workspace before you can run that deployed JAR on any all-purpose cluster or jobs cluster in that workspace.

To create a minimal dbx project for Scala or Java:

  1. In your terminal, from your project’s root folder, run the dbx configure command with the following option. This command creates a hidden .dbx folder within your project’s root folder. This .dbx folder contains lock.json and project.json files.

    dbx configure --profile DEFAULT --environment default
    

    Note

    The project.json file defines an environment named default along with a reference to the DEFAULT profile within your .databrickscfg file. If you want dbx to use a different profile, replace --profile DEFAULT with --profile followed by your target profile’s name, in the dbx configure command.

    For example, if you have a profile named DEV within your .databrickscfg file and you want dbx to use it instead of the DEFAULT profile, your project.json file might look like this instead, in which case you would also replace --environment default with --environment dev in the dbx configure command:

    {
      "environments": {
        "default": {
          "profile": "DEFAULT",
          "storage_type": "mlflow",
          "properties": {
            "workspace_directory": "/Workspace/Shared/dbx/projects/<current-folder-name>",
            "artifact_location": "dbfs:/dbx/<current-folder-name>"
          }
        },
        "dev": {
          "profile": "DEV",
          "storage_type": "mlflow",
          "properties": {
            "workspace_directory": "/Workspace/Shared/dbx/projects/<some-other-folder-name>",
            "artifact_location": "dbfs:/dbx/<some-other-folder-name>"
          }
        }
      }
    }
    

    If you want dbx to use the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables instead of a profile in your .databrickscfg file, then leave out the --profile option altogether from the dbx configure command.

  2. Create a folder named conf within your project’s root folder.

    For Linux and macOS:

    mkdir conf
    

    For Windows:

    md conf
    
  3. Add a file named deployment.yaml to the conf directory, with the following minimal contents:

    build:
      no_build: true
    environments:
      default:
        workflows:
          - name: "dbx-demo-job"
            new_cluster:
              spark_version: "10.4.x-scala2.12"
              node_type_id: "Standard_DS3_v2"
              num_workers: 2
              instance_pool_id: "my-instance-pool"
            libraries:
              - jar: "file://out/artifacts/dbx_demo_jar/dbx-demo.jar"
            spark_jar_task:
              main_class_name: "com.example.demo.SampleApp"
    

    Replace:

    • The value of spark_version with the appropriate runtime version for your target jobs cluster.
    • The value of node_type_id with the appropriate Worker and driver node types for your target jobs cluster.
    • The value of instance_pool_id with the ID of an existing instance pool in your workspace, to enable faster running of jobs. If you do not have an existing instance pool available or you do not want to use an instance pool, remove this line altogether.
    • The value of jar with the path in the project to the JAR. For IntelliJ IDEA with Scala, it could be file://out/artifacts/dbx_demo_jar/dbx-demo.jar. For the Eclipse IDE with Java, it could be file://target/dbx-demo-0.0.1-SNAPSHOT.jar.
    • The value of main_class_name with the name of the main class in the JAR, for example com.example.demo.SampleApp.

    Note

    The deployment.yaml file contains the word default, which is a reference to the default environment in the .dbx/project.json file, which in turn is a reference to the DEFAULT profile within your .databrickscfg file. If you want dbx to use a different profile, replace default in this deployment.yaml file with the corresponding reference in the .dbx/project.json file, which in turn references the corresponding profile within your .databrickscfg file.

    For example, if you have a profile named DEV within your .databrickscfg file and you want dbx to use it instead of the DEFAULT profile, your deployment.yaml file might look like this instead:

    environments:
      default:
        workflows:
          - name: "dbx-demo-job"
            # ...
      dev:
        workflows:
          - name: "<some-other-job-name>"
            # ...
    

    If you want dbx to use the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables instead of a profile in your .databrickscfg file, then leave default in the deployment.yaml as is. dbx will use the default environment settings (except for the profile value) in the .dbx/project.json file by default.

    Tip

    To add Spark configuration key-value pairs to a job, use the spark_conf field, for example:

    environments:
      default:
        workflows:
          - name: "dbx-demo-job"
            spark_conf:
              spark.speculation: true
              spark.streaming.ui.retainedBatches: 5
              spark.driver.extraJavaOptions: "-verbose:gc -XX:+PrintGCDetails"
            # ...
    

    To add permissions to a job, use the access_control_list field, for example:

    environments:
      default:
        workflows:
          - name: "dbx-demo-job"
            access_control_list:
              - user_name: "someone@example.com"
                permission_level: "IS_OWNER"
              - group_name: "some-group"
                permission_level: "CAN_VIEW"
            # ...
    

    Note that the access_control_list field must be exhaustive, so the job’s owner must be added to the list along with any other user and group permissions.

  4. Run the dbx deploy command. dbx deploys the JAR to the location in the .dbx/project.json file’s artifact_location path for the matching environment. dbx also deploys the project’s files as part of an MLflow experiment, to the location listed in the .dbx/project.json file’s workspace_directory path for the matching environment.

    dbx deploy --no-package
    
  5. Run the dbx launch command with the following options. This command runs the job with the matching name in conf/deployment.yaml. To find the deployed JAR to run as part of the job, dbx references the location in the .dbx/project.json file’s artifact_location path for the matching environment. To determine which specific JAR to run, dbx references the MLflow experiment in the location listed in the .dbx/project.json file’s workspace_directory path for the matching environment.

    dbx launch dbx-demo-job
    
  6. To view the job run’s results on your jobs cluster, see View jobs.

  7. To view the experiment that the job referenced, see Organize training runs with MLflow experiments.

  8. Continue with Next steps.
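
Tip

If dbx deploy fails because it cannot find your JAR, confirm that the JAR exists at the exact path referenced by the jar field in conf/deployment.yaml. For example, a quick check using the example paths from this article (on Windows, use dir instead of ls):

# IntelliJ IDEA (Scala) artifact path used in this article:
ls out/artifacts/dbx_demo_jar/dbx-demo.jar

# Eclipse IDE (Java, Maven) artifact path used in this article:
ls target/dbx-demo-0.0.1-SNAPSHOT.jar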

Create a dbx templated project for Python with CI/CD support

The following dbx templated project for Python demonstrates support for batch running of Python code on Azure Databricks all-purpose clusters and jobs clusters in your Azure Databricks workspaces, remote code artifact deployments, and CI/CD platform setup. (To create a minimal dbx project for Python that only demonstrates batch running of a single Python code file on an existing all-purpose cluster, skip back to Create a minimal dbx project for Python.)

  1. From your terminal, in your dbx project’s root folder, run the dbx init command.

    dbx init
    
  2. For project_name, enter a name for your project, or press Enter to accept the default project name.

  3. For version, enter a starting version number for your project, or press Enter to accept the default project version.

  4. For cloud, select the number that corresponds to the Azure Databricks cloud version that you want your project to use, or press Enter to accept the default.

  5. For cicd_tool, select the number that corresponds to the supported CI/CD tool that you want your project to use, or press Enter to accept the default.

  6. For project_slug, enter a prefix that you want to use for resources in your project, or press Enter to accept the default.

  7. For workspace_directory, enter the local path to the workspace directory for your project, or press Enter to accept the default.

  8. For artifact_location, enter the path in your Azure Databricks workspace where your project’s artifacts will be written, or press Enter to accept the default.

  9. For profile, enter the name of the CLI authentication profile that you want your project to use, or press Enter to accept the default.

Tip

You can skip the preceding steps by running dbx init with hard-coded template parameters, for example:

dbx init --template="python_basic" \
-p "project_name=cicd-sample-project" \
-p "cloud=Azure" \
-p "cicd_tool=Azure DevOps" \
-p "profile=DEFAULT" \
--no-input

dbx calculates the parameters project_slug, workspace_directory, and artifact_location automatically. These three parameters are optional, and they are useful only for more advanced use cases.

See the init command in CLI Reference in the dbx documentation.

See also Next steps.

Code example

If you do not have any code readily available to batch run with dbx, you can experiment by having dbx batch run the following code. This code creates a small table in your workspace, queries the table, and then deletes the table.

Tip

If you want to leave the table in your workspace instead of deleting it, comment out the last line of code in this example before you batch run it with dbx.

Python

# For testing and debugging of local objects, run
# "pip install pyspark=X.Y.Z", where "X.Y.Z"
# matches the version of PySpark
# on your target clusters.
from pyspark.sql import SparkSession

from pyspark.sql.types import *
from datetime import date

spark = SparkSession.builder.appName("dbx-demo").getOrCreate()

# Create a DataFrame consisting of high and low temperatures
# by airport code and date.
schema = StructType([
   StructField('AirportCode', StringType(), False),
   StructField('Date', DateType(), False),
   StructField('TempHighF', IntegerType(), False),
   StructField('TempLowF', IntegerType(), False)
])

data = [
   [ 'BLI', date(2021, 4, 3), 52, 43],
   [ 'BLI', date(2021, 4, 2), 50, 38],
   [ 'BLI', date(2021, 4, 1), 52, 41],
   [ 'PDX', date(2021, 4, 3), 64, 45],
   [ 'PDX', date(2021, 4, 2), 61, 41],
   [ 'PDX', date(2021, 4, 1), 66, 39],
   [ 'SEA', date(2021, 4, 3), 57, 43],
   [ 'SEA', date(2021, 4, 2), 54, 39],
   [ 'SEA', date(2021, 4, 1), 56, 41]
]

temps = spark.createDataFrame(data, schema)

# Create a table on the cluster and then fill
# the table with the DataFrame's contents.
# If the table already exists from a previous run,
# delete it first.
spark.sql('USE default')
spark.sql('DROP TABLE IF EXISTS demo_temps_table')
temps.write.saveAsTable('demo_temps_table')

# Query the table on the cluster, returning rows
# where the airport code is not BLI and the date is later
# than 2021-04-01. Group the results and order by high
# temperature in descending order.
df_temps = spark.sql("SELECT * FROM demo_temps_table " \
   "WHERE AirportCode != 'BLI' AND Date > '2021-04-01' " \
   "GROUP BY AirportCode, Date, TempHighF, TempLowF " \
   "ORDER BY TempHighF DESC")
df_temps.show()

# Results:
#
# +-----------+----------+---------+--------+
# |AirportCode|      Date|TempHighF|TempLowF|
# +-----------+----------+---------+--------+
# |        PDX|2021-04-03|       64|      45|
# |        PDX|2021-04-02|       61|      41|
# |        SEA|2021-04-03|       57|      43|
# |        SEA|2021-04-02|       54|      39|
# +-----------+----------+---------+--------+

# Clean up by deleting the table from the cluster.
spark.sql('DROP TABLE demo_temps_table')

Scala

package com.example.demo

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import java.sql.Date

object SampleApp {
  def main(args: Array[String]) {
    val spark = SparkSession.builder().master("local").getOrCreate()

    val schema = StructType(Array(
      StructField("AirportCode", StringType, false),
      StructField("Date", DateType, false),
      StructField("TempHighF", IntegerType, false),
      StructField("TempLowF", IntegerType, false)
    ))

    val data = List(
      Row("BLI", Date.valueOf("2021-04-03"), 52, 43),
      Row("BLI", Date.valueOf("2021-04-02"), 50, 38),
      Row("BLI", Date.valueOf("2021-04-01"), 52, 41),
      Row("PDX", Date.valueOf("2021-04-03"), 64, 45),
      Row("PDX", Date.valueOf("2021-04-02"), 61, 41),
      Row("PDX", Date.valueOf("2021-04-01"), 66, 39),
      Row("SEA", Date.valueOf("2021-04-03"), 57, 43),
      Row("SEA", Date.valueOf("2021-04-02"), 54, 39),
      Row("SEA", Date.valueOf("2021-04-01"), 56, 41)
    )

    val rdd = spark.sparkContext.makeRDD(data)
    val temps = spark.createDataFrame(rdd, schema)

    // Create a table on the Databricks cluster and then fill
    // the table with the DataFrame's contents.
    // If the table already exists from a previous run,
    // delete it first.
    spark.sql("USE default")
    spark.sql("DROP TABLE IF EXISTS demo_temps_table")
    temps.write.saveAsTable("demo_temps_table")

    // Query the table on the Databricks cluster, returning rows
    // where the airport code is not BLI and the date is later
    // than 2021-04-01. Group the results and order by high
    // temperature in descending order.
    val df_temps = spark.sql("SELECT * FROM demo_temps_table " +
      "WHERE AirportCode != 'BLI' AND Date > '2021-04-01' " +
      "GROUP BY AirportCode, Date, TempHighF, TempLowF " +
      "ORDER BY TempHighF DESC")
    df_temps.show()

    // Results:
    //
    // +-----------+----------+---------+--------+
    // |AirportCode|      Date|TempHighF|TempLowF|
    // +-----------+----------+---------+--------+
    // |        PDX|2021-04-03|       64|      45|
    // |        PDX|2021-04-02|       61|      41|
    // |        SEA|2021-04-03|       57|      43|
    // |        SEA|2021-04-02|       54|      39|
    // +-----------+----------+---------+--------+

    // Clean up by deleting the table from the Databricks cluster.
    spark.sql("DROP TABLE demo_temps_table")
  }
}

Java

package com.example.demo;

import java.util.ArrayList;
import java.util.List;
import java.sql.Date;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.*;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.Dataset;

public class SampleApp {
  public static void main(String[] args) {
    SparkSession spark = SparkSession
      .builder()
      .appName("Temps Demo")
      .config("spark.master", "local")
      .getOrCreate();

    // Create a Spark DataFrame consisting of high and low temperatures
    // by airport code and date.
    StructType schema = new StructType(new StructField[] {
      new StructField("AirportCode", DataTypes.StringType, false, Metadata.empty()),
      new StructField("Date", DataTypes.DateType, false, Metadata.empty()),
      new StructField("TempHighF", DataTypes.IntegerType, false, Metadata.empty()),
      new StructField("TempLowF", DataTypes.IntegerType, false, Metadata.empty()),
    });

    List<Row> dataList = new ArrayList<Row>();
    dataList.add(RowFactory.create("BLI", Date.valueOf("2021-04-03"), 52, 43));
    dataList.add(RowFactory.create("BLI", Date.valueOf("2021-04-02"), 50, 38));
    dataList.add(RowFactory.create("BLI", Date.valueOf("2021-04-01"), 52, 41));
    dataList.add(RowFactory.create("PDX", Date.valueOf("2021-04-03"), 64, 45));
    dataList.add(RowFactory.create("PDX", Date.valueOf("2021-04-02"), 61, 41));
    dataList.add(RowFactory.create("PDX", Date.valueOf("2021-04-01"), 66, 39));
    dataList.add(RowFactory.create("SEA", Date.valueOf("2021-04-03"), 57, 43));
    dataList.add(RowFactory.create("SEA", Date.valueOf("2021-04-02"), 54, 39));
    dataList.add(RowFactory.create("SEA", Date.valueOf("2021-04-01"), 56, 41));

    Dataset<Row> temps = spark.createDataFrame(dataList, schema);

    // Create a table on the Databricks cluster and then fill
    // the table with the DataFrame's contents.
    // If the table already exists from a previous run,
    // delete it first.
    spark.sql("USE default");
    spark.sql("DROP TABLE IF EXISTS demo_temps_table");
    temps.write().saveAsTable("demo_temps_table");

    // Query the table on the Databricks cluster, returning rows
    // where the airport code is not BLI and the date is later
    // than 2021-04-01. Group the results and order by high
    // temperature in descending order.
    Dataset<Row> df_temps = spark.sql("SELECT * FROM demo_temps_table " +
      "WHERE AirportCode != 'BLI' AND Date > '2021-04-01' " +
      "GROUP BY AirportCode, Date, TempHighF, TempLowF " +
      "ORDER BY TempHighF DESC");
    df_temps.show();

    // Results:
    //
    // +-----------+----------+---------+--------+
    // |AirportCode|      Date|TempHighF|TempLowF|
    // +-----------+----------+---------+--------+
    // |        PDX|2021-04-03|       64|      45|
    // |        PDX|2021-04-02|       61|      41|
    // |        SEA|2021-04-03|       57|      43|
    // |        SEA|2021-04-02|       54|      39|
    // +-----------+----------+---------+--------+

    // Clean up by deleting the table from the Databricks cluster.
    spark.sql("DROP TABLE demo_temps_table");
  }
}

Next steps

Additional resources