Develop a job on Azure Databricks using Databricks Asset Bundles
Databricks Asset Bundles, also known simply as bundles, contain the artifacts you want to deploy and the settings for Azure Databricks resources such as jobs that you want to run, and enable you to programmatically validate, deploy, and run them. See What are Databricks Asset Bundles?.
This article describes how to create a bundle to programmatically manage a job. See Schedule and orchestrate workflows. The bundle is created using the Databricks Asset Bundles default bundle template for Python, which consists of a notebook paired with the definition of a job to run it. You then validate, deploy, and run the deployed job in your Azure Databricks workspace.
Tip
If you have existing jobs that were created using the Azure Databricks Jobs user interface or API that you want to move to bundles, you must define them in a bundle’s configuration files. Databricks recommends that you first create a bundle using the steps below and then validate whether the bundle works. You can then add more job definitions, notebooks, and other sources to the bundle. See Add an existing job definition to a bundle.
Requirements
- Databricks CLI version 0.218.0 or above. To check your installed version of the Databricks CLI, run the command databricks -v. To install the Databricks CLI, see Install or update the Databricks CLI.
- The remote Databricks workspace must have workspace files enabled. See What are workspace files?.
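The version requirement above can also be checked in a script. The following is a minimal sketch, not an official utility: the function names are hypothetical, and it assumes only that the text printed by databricks -v contains a dotted version number such as 0.218.0.

```python
import re

# Minimum Databricks CLI version stated in the requirements above.
MINIMUM_VERSION = (0, 218, 0)


def parse_cli_version(version_output: str) -> tuple[int, int, int]:
    """Extract (major, minor, patch) from the text printed by `databricks -v`.

    Assumption: the output contains a dotted version such as "0.218.0",
    optionally prefixed with "v". The wording around it is not relied on.
    """
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError(f"no version number found in {version_output!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor, patch)


def meets_requirement(
    version_output: str,
    minimum: tuple[int, int, int] = MINIMUM_VERSION,
) -> bool:
    """Return True when the reported version is at or above the minimum."""
    return parse_cli_version(version_output) >= minimum
```

Comparing integer tuples rather than strings matters here: as text, "0.99.0" sorts after "0.218.0", but as a version it is older.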
Create a bundle using a project template
First, create a bundle using the Databricks Asset Bundles default Python template. For more information about bundle templates, see Databricks Asset Bundle project templates.
If you want to create a bundle from scratch, see Create a bundle manually.
Step 1: Set up authentication
In this step, you set up authentication between the Databricks CLI on your development machine and your Azure Databricks workspace. This article assumes that you want to use OAuth user-to-machine (U2M) authentication and a corresponding Azure Databricks configuration profile named DEFAULT for authentication.
Note
U2M authentication is appropriate for trying out these steps in real time. For fully automated workflows, Databricks recommends that you use OAuth machine-to-machine (M2M) authentication instead. See the M2M authentication setup instructions in Authentication.
Use the Databricks CLI to initiate OAuth token management locally by running the following command for each target workspace.
In the following command, replace <workspace-url> with your Azure Databricks per-workspace URL, for example https://adb-1234567890123456.7.azuredatabricks.net.
databricks auth login --host <workspace-url>
The Databricks CLI prompts you to save the information that you entered as an Azure Databricks configuration profile. Press Enter to accept the suggested profile name, or enter the name of a new or existing profile. Any existing profile with the same name is overwritten with the information that you entered. You can use profiles to quickly switch your authentication context across multiple workspaces.
To get a list of any existing profiles, in a separate terminal or command prompt, use the Databricks CLI to run the command databricks auth profiles. To view a specific profile’s existing settings, run the command databricks auth env --profile <profile-name>.
In your web browser, complete the on-screen instructions to log in to your Azure Databricks workspace.
To view a profile’s current OAuth token value and the token’s upcoming expiration timestamp, run one of the following commands:
databricks auth token --host <workspace-url>
databricks auth token -p <profile-name>
databricks auth token --host <workspace-url> -p <profile-name>
If you have multiple profiles with the same --host value, you might need to specify the --host and -p options together to help the Databricks CLI find the correct matching OAuth token information.
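To see in advance whether several profiles share one host, you can inspect the profile file yourself. The sketch below is an illustration under stated assumptions: it assumes profiles are stored as INI-style sections with a host key (typically in a .databrickscfg file in your home directory), the function names are hypothetical, and the staging and other profiles and the second URL are invented sample data.

```python
import configparser
from collections import defaultdict

# Sample profile file contents. The profile names "staging" and "other"
# and the second workspace URL are hypothetical.
SAMPLE_CFG = """
[DEFAULT]
host = https://adb-1234567890123456.7.azuredatabricks.net
auth_type = databricks-cli

[staging]
host = https://adb-1234567890123456.7.azuredatabricks.net/
auth_type = databricks-cli

[other]
host = https://adb-6543210987654321.0.azuredatabricks.net
"""


def profiles_by_host(cfg_text: str) -> dict[str, list[str]]:
    """Group configuration profile names by their `host` value."""
    # configparser normally treats [DEFAULT] as a special defaults section.
    # Renaming the defaults section makes [DEFAULT] an ordinary profile.
    parser = configparser.ConfigParser(
        default_section="__unused__", interpolation=None
    )
    parser.read_string(cfg_text)
    grouped: dict[str, list[str]] = defaultdict(list)
    for profile in parser.sections():
        host = parser.get(profile, "host", fallback=None)
        if host:
            # Ignore a trailing slash so equivalent URLs group together.
            grouped[host.rstrip("/")].append(profile)
    return dict(grouped)


def ambiguous_hosts(cfg_text: str) -> dict[str, list[str]]:
    """Return only the hosts that more than one profile points at."""
    return {
        host: names
        for host, names in profiles_by_host(cfg_text).items()
        if len(names) > 1
    }
```

For the sample above, ambiguous_hosts(SAMPLE_CFG) reports that DEFAULT and staging point at the same workspace, which is the situation where passing --host and -p together is useful.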
Step 2: Initialize the bundle
Initialize a bundle using the default Python bundle project template.
Use your terminal or command prompt to switch to a directory on your local development machine that will contain the template’s generated bundle.
Use the Databricks CLI to run the bundle init command:
databricks bundle init
For "Template to use", leave the default value of default-python by pressing Enter.
For "Unique name for this project", leave the default value of my_project, or type a different value, and then press Enter. This determines the name of the root directory for this bundle. This root directory is created in your current working directory.
For "Include a stub (sample) notebook", select yes and press Enter.
For "Include a stub (sample) DLT pipeline", select no and press Enter. This instructs the Databricks CLI to not define a sample Delta Live Tables pipeline in your bundle.
For "Include a stub (sample) Python package", select no and press Enter. This instructs the Databricks CLI to not add sample Python wheel package files or related build instructions to your bundle.
Step 3: Explore the bundle
To view the files that the template generated, switch to the root directory of your newly created bundle. Files of particular interest include the following:
- databricks.yml: This file specifies the bundle’s programmatic name, includes a reference to the job definition, and specifies settings about the target workspace.
- resources/<project-name>_job.yml: This file specifies the job’s settings, including a default notebook task.
- src/notebook.ipynb: This file is a sample notebook that, when run, simply initializes an RDD that contains the numbers 1 through 10.
For customizing jobs, the mappings in a job declaration correspond to the request payload, expressed in YAML format, of the create job operation as documented in POST /api/2.1/jobs/create in the REST API reference.
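To make that correspondence concrete, the sketch below takes a job declaration in the form a YAML parser would produce (a Python dict) and renders it as JSON, the format of the create job request body. The job and task names are illustrative and the function name is hypothetical. It shows the shape only: in a bundle, notebook_path can be relative to the configuration file, whereas a request sent directly to the API would need a workspace path.

```python
import json

# A job declaration as it would sit under resources.jobs.<job-key> in a
# bundle, written as the dict a YAML parser would return.
# The names "hello-job" and "hello-task" are illustrative.
bundle_config = {
    "resources": {
        "jobs": {
            "hello-job": {
                "name": "hello-job",
                "tasks": [
                    {
                        "task_key": "hello-task",
                        "notebook_task": {
                            "notebook_path": "../src/hello.ipynb"
                        },
                    }
                ],
            }
        }
    }
}


def create_job_payload(config: dict, job_key: str) -> str:
    """Render one job declaration as JSON shaped like a create job request."""
    job = config["resources"]["jobs"][job_key]
    return json.dumps(job, indent=2, sort_keys=True)


if __name__ == "__main__":
    print(create_job_payload(bundle_config, "hello-job"))
```

The mapping keys under the job (name, tasks, task_key, notebook_task) carry over unchanged; only the serialization differs.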
Tip
You can define, combine, and override the settings for new job clusters in bundles by using the techniques described in Override cluster settings in Databricks Asset Bundles.
Step 4: Validate the project’s bundle configuration file
In this step, you check whether the bundle configuration is valid.
From the root directory, use the Databricks CLI to run the bundle validate command, as follows:
databricks bundle validate
If a summary of the bundle configuration is returned, then the validation succeeded. If any errors are returned, fix the errors, and then repeat this step.
If you make any changes to your bundle after this step, you should repeat this step to check whether your bundle configuration is still valid.
Step 5: Deploy the local project to the remote workspace
In this step, you deploy the local notebook to your remote Azure Databricks workspace and create the Azure Databricks job within your workspace.
From the bundle root, use the Databricks CLI to run the bundle deploy command as follows:
databricks bundle deploy -t dev
Check whether the local notebook was deployed: In your Azure Databricks workspace’s sidebar, click Workspace.
Click into the Users > <your-username> > .bundle > <project-name> > dev > files > src folder. The notebook should be in this folder.
Check whether the job was created: In your Azure Databricks workspace’s sidebar, click Workflows.
On the Jobs tab, click [dev <your-username>] <project-name>_job.
Click the Tasks tab. There should be one task: notebook_task.
If you make any changes to your bundle after this step, you should repeat steps 4-5 to check whether your bundle configuration is still valid and then redeploy the project.
Step 6: Run the deployed project
In this step, you trigger a run of the Azure Databricks job in your workspace from the command line.
From the root directory, use the Databricks CLI to run the bundle run command, as follows, replacing <project-name> with the name of your project from Step 2:
databricks bundle run -t dev <project-name>_job
Copy the value of Run URL that appears in your terminal and paste this value into your web browser to open your Azure Databricks workspace. See View and run a job created with a Databricks Asset Bundle.
In your Azure Databricks workspace, after the job task completes successfully and shows a green title bar, click the job task to see the results.
If you make any changes to your bundle after this step, you should repeat steps 4-6 to check whether your bundle configuration is still valid, redeploy the project, and run the redeployed project.
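Because steps 4 through 6 are repeated after every change, it can help to script them. The following is a minimal sketch rather than a recommended workflow: the function names are hypothetical, the commands are the ones shown in the steps above, and it assumes that the Databricks CLI returns a non-zero exit code when a command fails and that the script is started from the bundle root.

```python
import subprocess
from typing import Callable, Sequence


def bundle_cycle_commands(project_name: str, target: str = "dev") -> list[list[str]]:
    """Return the validate, deploy, and run commands from steps 4-6, in order."""
    return [
        ["databricks", "bundle", "validate"],
        ["databricks", "bundle", "deploy", "-t", target],
        ["databricks", "bundle", "run", "-t", target, f"{project_name}_job"],
    ]


def run_cycle(
    commands: Sequence[Sequence[str]],
    runner: Callable[..., object] = subprocess.run,
) -> None:
    """Run each command in turn, stopping at the first one that fails."""
    for argv in commands:
        # check=True makes subprocess.run raise CalledProcessError on a
        # non-zero exit code, so a failed validation prevents the deploy.
        runner(list(argv), check=True)


if __name__ == "__main__":
    # Run from the bundle root; "my_project" is the template's default name.
    run_cycle(bundle_cycle_commands("my_project"))
```

Passing the runner as a parameter keeps the command list testable without the CLI installed.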
Step 7: Clean up
In this step, you delete the deployed notebook and the job from your workspace.
From the root directory, use the Databricks CLI to run the bundle destroy command, as follows:
databricks bundle destroy -t dev
Confirm the job deletion request: When prompted to permanently destroy resources, type y and press Enter.
Confirm the notebook deletion request: When prompted to permanently destroy the previously deployed folder and all of its files, type y and press Enter.
If you also want to delete the bundle from your development machine, you can now delete the local directory from Step 2.
Add an existing job definition to a bundle
You can use an existing job as the basis to define a job in a bundle configuration file. To get an existing job definition, you can manually retrieve it using the UI, or you can generate it programmatically using the Databricks CLI.
For information about the job definition in bundles, see job.
Get an existing job definition using the UI
To get the YAML representation of an existing job definition from the Azure Databricks workspace UI:
In your Azure Databricks workspace’s sidebar, click Workflows.
On the Jobs tab, click your job’s Name link.
Next to the Run now button, click the kebab menu, and then click Switch to code (YAML).
Add the YAML that you copied to your bundle’s databricks.yml file, or create a configuration file for your job in the resources directory of your bundle project and reference it from your databricks.yml file. See the resources mapping in the bundle configuration settings (/dev-tools/bundles/settings.md#resources).
Download and add any Python files and notebooks that are referenced in the existing job to the bundle’s project source. Typically bundle artifacts are in the src directory in a bundle.
Tip
You can export an existing notebook from an Azure Databricks workspace into the .ipynb format by clicking File > Export > IPython Notebook from the Azure Databricks notebook user interface.
After you add your notebooks, Python files, and other artifacts to the bundle, make sure that your job definition properly references them. For example, for a notebook named hello.ipynb that is in the src directory of the bundle:
resources:
  jobs:
    hello-job:
      name: hello-job
      tasks:
        - task_key: hello-task
          notebook_task:
            notebook_path: ../src/hello.ipynb
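One way to catch a broken reference before deploying is to check that each notebook_path resolves to a file on disk. The sketch below is illustrative only: the function name is hypothetical, it works on a job definition that has already been parsed into a Python dict (the standard library has no YAML parser), and it assumes relative paths are resolved against the directory of the configuration file that declares the job, as in the example above.

```python
from pathlib import Path


def missing_notebook_paths(job: dict, config_dir: Path) -> list[str]:
    """List relative notebook_path values that do not resolve to a file.

    Assumption: relative paths are resolved against config_dir, the
    directory that holds the configuration file declaring the job.
    """
    missing = []
    for task in job.get("tasks", []):
        notebook_path = task.get("notebook_task", {}).get("notebook_path")
        if notebook_path is None:
            continue  # not a notebook task
        if notebook_path.startswith("/"):
            continue  # absolute paths are outside the scope of this sketch
        if not (config_dir / notebook_path).resolve().is_file():
            missing.append(notebook_path)
    return missing
```

For a job declared in resources/hello_job.yml, you would call missing_notebook_paths(job, Path("resources")) from the bundle root; an empty list means every relative notebook reference was found.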
Generate an existing job definition using the Databricks CLI
To programmatically generate bundle configuration for an existing job:
Retrieve the ID of the existing job from the Job details side panel for the job in the Jobs UI, or use the Databricks CLI databricks jobs list command.
Run the bundle generate job Databricks CLI command, setting the job ID:
databricks bundle generate job --existing-job-id 6565621249
This command creates a bundle configuration file for the job in the bundle’s resources folder and downloads any referenced artifacts to the src folder.
Tip
If you first use bundle deployment bind to bind a resource in a bundle to one in the workspace, the resource in the workspace is updated based on the configuration defined in the bundle it is bound to after the next bundle deploy. For information on bundle deployment bind, see Bind bundle resources.
Configure a job that uses serverless compute
The following examples demonstrate bundle configurations to create a job that uses serverless compute.
To use serverless compute to run a job that includes notebook tasks, omit the job_clusters configuration from the bundle configuration file.
# yaml-language-server: $schema=bundle_config_schema.json
bundle:
  name: baby-names
resources:
  jobs:
    retrieve-filter-baby-names-job-serverless:
      name: retrieve-filter-baby-names-job-serverless
      tasks:
        - task_key: retrieve-baby-names-task
          notebook_task:
            notebook_path: ./retrieve-baby-names.py
        - task_key: filter-baby-names-task
          depends_on:
            - task_key: retrieve-baby-names-task
          notebook_task:
            notebook_path: ./filter-baby-names.py
targets:
  development:
    workspace:
      host: <workspace-url>
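When converting an existing job to serverless compute, it is easy to leave a cluster setting behind. The sketch below scans a parsed job definition for such leftovers. It is an assumption-laden illustration: the function name is hypothetical, and while job_clusters comes from the text above, the task-level field names checked here (job_cluster_key, new_cluster, existing_cluster_id) are taken from the Jobs API task settings rather than from this article.

```python
# Task-level fields that select classic compute for a task.
# Assumption: these names follow the Jobs API task settings. The text
# above only states that job_clusters must be omitted.
CLASSIC_COMPUTE_FIELDS = ("job_cluster_key", "new_cluster", "existing_cluster_id")


def classic_compute_settings(job: dict) -> list[str]:
    """List the cluster-related settings still present in a job definition."""
    findings = []
    if "job_clusters" in job:
        findings.append("job_clusters")
    for task in job.get("tasks", []):
        for field in CLASSIC_COMPUTE_FIELDS:
            if field in task:
                findings.append(f"tasks[{task.get('task_key', '?')}].{field}")
    return findings


# The notebook job from the configuration above, as a parsed dict.
serverless_notebook_job = {
    "name": "retrieve-filter-baby-names-job-serverless",
    "tasks": [
        {
            "task_key": "retrieve-baby-names-task",
            "notebook_task": {"notebook_path": "./retrieve-baby-names.py"},
        },
        {
            "task_key": "filter-baby-names-task",
            "depends_on": [{"task_key": "retrieve-baby-names-task"}],
            "notebook_task": {"notebook_path": "./filter-baby-names.py"},
        },
    ],
}
```

An empty result from classic_compute_settings(serverless_notebook_job) indicates that nothing in the definition names a cluster.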
To use serverless compute to run a job that includes Python tasks, include the environments configuration.
# yaml-language-server: $schema=bundle_config_schema.json
bundle:
  name: serverless-python-tasks
resources:
  jobs:
    serverless-python-job:
      name: serverless-job-with-python-tasks
      tasks:
        - task_key: wheel-task-1
          python_wheel_task:
            entry_point: main
            package_name: wheel_package
          environment_key: Default
      environments:
        - environment_key: Default
          spec:
            client: "1"
            dependencies:
              - workflows_authoring_toolkit==0.0.1
targets:
  development:
    workspace:
      host: <workspace-url>
See Run your Azure Databricks job with serverless compute for workflows.
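In the Python task configuration above, each task names its environment through environment_key, and that key must match an entry in the job’s environments list. The following sketch checks that pairing on a parsed job definition. The function name is hypothetical, and the check covers only the key match, not the contents of spec.

```python
def undefined_environment_keys(job: dict) -> list[str]:
    """Return task keys whose environment_key has no matching environments entry."""
    defined = {env["environment_key"] for env in job.get("environments", [])}
    problems = []
    for task in job.get("tasks", []):
        key = task.get("environment_key")
        if key is not None and key not in defined:
            problems.append(task["task_key"])
    return problems


# The Python wheel job from the configuration above, as a parsed dict.
serverless_python_job = {
    "name": "serverless-job-with-python-tasks",
    "tasks": [
        {
            "task_key": "wheel-task-1",
            "python_wheel_task": {
                "entry_point": "main",
                "package_name": "wheel_package",
            },
            "environment_key": "Default",
        }
    ],
    "environments": [
        {
            "environment_key": "Default",
            "spec": {
                "client": "1",
                "dependencies": ["workflows_authoring_toolkit==0.0.1"],
            },
        }
    ],
}
```

For the job shown, undefined_environment_keys(serverless_python_job) returns an empty list; renaming or removing the Default entry under environments would make it report wheel-task-1.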