Develop a Python wheel file using Databricks Asset Bundles

08/30/2024

This article describes how to build, deploy, and run a Python wheel file as part of a Databricks Asset Bundle project. See What are Databricks Asset Bundles?

Requirements

Databricks CLI version 0.218.0 or above. To check your installed version of the Databricks CLI, run the command databricks -v. To install the Databricks CLI, see Install or update the Databricks CLI.
The remote workspace must have workspace files enabled. See What are workspace files?.
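
The version requirement above can be checked with a small script. The following is a minimal sketch, assuming that `databricks -v` prints a line containing a dotted version number such as `Databricks CLI v0.218.0` (the exact output format is an assumption). The helper names `parse_cli_version` and `meets_minimum` are hypothetical and are not part of the Databricks CLI.

```python
import re

# Minimum version taken from the Requirements list in this article.
MINIMUM_CLI_VERSION = (0, 218, 0)


def parse_cli_version(output: str) -> tuple:
    """Extract (major, minor, patch) from text such as 'Databricks CLI v0.218.0'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError(f"no version number found in {output!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(output: str, minimum: tuple = MINIMUM_CLI_VERSION) -> bool:
    """Return True if the reported version is at least the required minimum."""
    return parse_cli_version(output) >= minimum
```

To use it, pass the text that `databricks -v` prints on your machine to `meets_minimum`. Comparing integer tuples rather than strings avoids ordering mistakes such as treating 0.99.0 as newer than 0.218.0.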

Decision: Create the bundle manually or by using a template

Decide whether you want to create a starter bundle by using a template or to create the bundle manually. Creating the bundle by using a template is faster and easier, but the template might produce content that you do not need, and the bundle’s default settings must be further customized for real applications. Creating the bundle manually gives you full control over the bundle’s settings, but you must be familiar with how bundles work, because you do all of the work from the beginning. Choose one of the following sets of steps:

Create the bundle by using a template
Create the bundle manually

Create the bundle by using a template

In these steps, you create the bundle by using the Azure Databricks default bundle template for Python. These steps guide you to create a bundle that consists of files to build into a Python wheel file and the definition of an Azure Databricks job to build this Python wheel file. You then validate, deploy, and build the deployed files into a Python wheel file from the Python wheel job within your Azure Databricks workspace.

The Azure Databricks default bundle template for Python uses setuptools to build the Python wheel file. If you want to use Poetry to build the Python wheel file instead, follow the instructions later in this section to replace the setuptools implementation with a Poetry implementation.

Step 1: Set up authentication

In this step, you set up authentication between the Databricks CLI on your development machine and your Azure Databricks workspace. This article assumes that you want to use OAuth user-to-machine (U2M) authentication and a corresponding Azure Databricks configuration profile named DEFAULT for authentication.

Note

U2M authentication is appropriate for trying out these steps in real time. For fully automated workflows, Databricks recommends that you use OAuth machine-to-machine (M2M) authentication instead. See the M2M authentication setup instructions in Authentication.

Use the Databricks CLI to initiate OAuth token management locally by running the following command for each target workspace.

In the following command, replace <workspace-url> with your Azure Databricks per-workspace URL, for example https://adb-1234567890123456.7.azuredatabricks.net.
```
databricks auth login --host <workspace-url>
```
The Databricks CLI prompts you to save the information that you entered as an Azure Databricks configuration profile. Press Enter to accept the suggested profile name, or enter the name of a new or existing profile. Any existing profile with the same name is overwritten with the information that you entered. You can use profiles to quickly switch your authentication context across multiple workspaces.

To get a list of any existing profiles, in a separate terminal or command prompt, use the Databricks CLI to run the command databricks auth profiles. To view a specific profile’s existing settings, run the command databricks auth env --profile <profile-name>.
In your web browser, complete the on-screen instructions to log in to your Azure Databricks workspace.
To view a profile’s current OAuth token value and the token’s upcoming expiration timestamp, run one of the following commands:
- databricks auth token --host <workspace-url>
- databricks auth token -p <profile-name>
- databricks auth token --host <workspace-url> -p <profile-name>
If you have multiple profiles with the same --host value, you might need to specify the --host and -p options together to help the Databricks CLI find the correct matching OAuth token information.

Step 2: Create the bundle

A bundle contains the artifacts you want to deploy and the settings for the workflows you want to run.

Use your terminal or command prompt to switch to a directory on your local development machine that will contain the template’s generated bundle.
Use the Databricks CLI to run the bundle init command:
```
databricks bundle init
```
For Template to use, leave the default value of default-python by pressing Enter.
For Unique name for this project, leave the default value of my_project, or type a different value, and then press Enter. This determines the name of the root directory for this bundle. This root directory is created within your current working directory.
For Include a stub (sample) notebook, select no and press Enter. This instructs the Databricks CLI to not add a sample notebook to your bundle.
For Include a stub (sample) DLT pipeline, select no and press Enter. This instructs the Databricks CLI to not define a sample Delta Live Tables pipeline in your bundle.
For Include a stub (sample) Python package, leave the default value of yes by pressing Enter. This instructs the Databricks CLI to add sample Python wheel package files and related build instructions to your bundle.

Step 3: Explore the bundle

To view the files that the template generated, switch to the root directory of your newly created bundle and open this directory with your preferred IDE, for example Visual Studio Code. Files of particular interest include the following:

databricks.yml: This file specifies the bundle’s programmatic name, includes a reference to the Python wheel job definition, and specifies settings about the target workspace.
resources/<project-name>_job.yml: This file specifies the Python wheel job’s settings.
src/<project-name>: This directory includes the files that the Python wheel job uses to build the Python wheel file.

Note

If you want to install the Python wheel file on a target cluster that has Databricks Runtime 12.2 LTS or below installed, you must add the following top-level mapping to the databricks.yml file:

```
# Applies to all tasks of type python_wheel_task.
experimental:
  python_wheel_wrapper: true
```

This mapping instructs the Databricks CLI to do the following:

Deploy a copy of the Python wheel file in the background. This deployment path is typically ${workspace.artifact_path}/.internal/<random-id>/<wheel-filename>.whl.
Create a notebook in the background that contains instructions to install the preceding deployed Python wheel file on the target cluster. This notebook’s path is typically ${workspace.file_path}/.databricks/bundle/<target-name>/.internal/notebook_<job-name>_<task-key>.
When you run a job that contains a Python wheel task, and that task references the preceding Python wheel file, a job is created in the background that runs the preceding notebook.

You do not need to specify this mapping for target clusters with Databricks Runtime 13.1 or above installed, as Python wheel installations from the Azure Databricks workspace file system will install automatically on these target clusters.
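
The runtime rule in this note can be expressed as a small check. The following is a minimal sketch, assuming the cluster’s spark_version string starts with a major.minor number (for example 13.3.x-scala2.12). The note names 12.2 LTS or below as needing the wrapper and 13.1 or above as not needing it; treating 13.0 as needing the wrapper is an assumption, and `needs_wheel_wrapper` is a hypothetical helper name.

```python
import re


def needs_wheel_wrapper(spark_version: str) -> bool:
    """Return True if experimental.python_wheel_wrapper should be set to true.

    Assumption: any runtime below 13.1 (including 13.0) needs the wrapper.
    """
    match = re.match(r"(\d+)\.(\d+)", spark_version)
    if match is None:
        raise ValueError(f"unrecognized spark_version: {spark_version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) < (13, 1)
```

For example, `needs_wheel_wrapper("12.2.x-scala2.12")` evaluates to True, and `needs_wheel_wrapper("13.3.x-scala2.12")` evaluates to False.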

Step 4: Update the project’s bundle to use Poetry

By default, the bundle template specifies building the Python wheel file using setuptools along with the files setup.py and requirements-dev.txt. If you want to keep these defaults, then skip ahead to Step 5: Validate the project’s bundle configuration file.

To update the project’s bundle to use Poetry instead of setuptools, make sure that your local development machine meets the following requirements:

Poetry version 1.6 or above. To check your installed version of Poetry, run the command poetry -V or poetry --version. To install or upgrade Poetry, see Installation.
Python version 3.10 or above. To check your version of Python, run the command python -V or python --version.
Databricks CLI version 0.209.0 or above. To check your version of the Databricks CLI, run the command databricks -v or databricks --version. See Install or update the Databricks CLI.

Make the following changes to the project’s bundle:

From the bundle’s root directory, instruct poetry to initialize the Python wheel builds for Poetry, by running the following command:
```
poetry init
```
Poetry displays several prompts for you to complete. For the Python wheel builds, answer these prompts as follows to match the related default settings in the project’s bundle:
1. For Package name, type the name of the child folder under /src, and then press Enter. This should also be the package’s name value that is defined in the bundle’s setup.py file.
2. For Version, type 0.0.1 and press Enter. This matches the version number that is defined in the bundle’s src/<project-name>/__init__.py file.
3. For Description, type wheel file based on <project-name>/src (replacing <project-name> with the project’s name), and press Enter. This matches the description value that is defined in the template’s setup.py file.
4. For Author, press Enter. This default value matches the author that is defined in the template’s setup.py file.
5. For License, press Enter. There is no license defined in the template.
6. For Compatible Python versions, enter the Python version that matches the one on your target Azure Databricks clusters (for example, ^3.10), and press Enter.
7. For Would you like to define your main dependencies interactively? Type no and press Enter. You will define your dependencies later.
8. For Would you like to define your development dependencies interactively? Type no and press Enter. You will define your dependencies later.
9. For Do you confirm generation? Press Enter.
After you complete the prompts, Poetry adds a pyproject.toml file to the bundle’s project. For information about the pyproject.toml file, see The pyproject.toml file.
From the bundle’s root directory, instruct poetry to read the pyproject.toml file, resolve the dependencies and install them, create a poetry.lock file to lock the dependencies, and finally to create a virtual environment. To do this, run the following command:
```
poetry install
```
Add the following section at the end of the pyproject.toml file, replacing <project-name> with the name of the directory that contains the src/<project-name>/main.py file (for example, my_project):
```
[tool.poetry.scripts]
main = "<project-name>.main:main"
```
The section specifies the Python wheel’s entry point for the Python wheel job.
Add the following mapping at the top level of the bundle’s databricks.yml file:
```
artifacts:
  default:
    type: whl
    build: poetry build
    path: .
```
This mapping instructs the Databricks CLI to use Poetry to build a Python wheel file.
Delete the setup.py and requirements-dev.txt files from the bundle, as Poetry does not need them.
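
The entry point added to pyproject.toml in these steps has the form module:function. The following is a minimal sketch of how a string in that form is conventionally resolved to a Python callable. It illustrates the naming convention only and is not how the Databricks job itself loads the wheel. The helper name `resolve_entry_point` is hypothetical, and the standard-library target `json:dumps` stands in for `<project-name>.main:main` so that the example runs without the bundle installed.

```python
import importlib


def resolve_entry_point(spec: str):
    """Resolve a 'package.module:attribute' string to the object it names."""
    module_name, _, attr_path = spec.partition(":")
    if not module_name or not attr_path:
        raise ValueError(f"expected 'module:attribute', got {spec!r}")
    target = importlib.import_module(module_name)
    # Walk dotted attribute paths such as 'SomeClass.method'.
    for attr in attr_path.split("."):
        target = getattr(target, attr)
    return target


# 'json:dumps' is a stand-in for '<project-name>.main:main'.
dumps = resolve_entry_point("json:dumps")
print(dumps({"ok": True}))
```

If the module or function name in the entry point is misspelled, a lookup of this kind fails with ModuleNotFoundError or AttributeError, so a local check like this can catch typos before you deploy.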

Step 5: Validate the project’s bundle configuration file

In this step, you check whether the bundle configuration is valid.

From the root directory, use the Databricks CLI to run the bundle validate command, as follows:
```
databricks bundle validate
```
If a summary of the bundle configuration is returned, then the validation succeeded. If any errors are returned, fix the errors, and then repeat this step.

If you make any changes to your bundle after this step, you should repeat this step to check whether your bundle configuration is still valid.

Step 6: Build the Python wheel file and deploy the local project to the remote workspace

In this step, you build the Python wheel file, deploy the built Python wheel file to your remote Azure Databricks workspace, and create the Azure Databricks job within your workspace.

If you use setuptools, install the wheel and setuptools packages if you have not done so already, by running the following command:
```
pip3 install --upgrade wheel setuptools
```
In the Visual Studio Code terminal, use the Databricks CLI to run the bundle deploy command as follows:
```
databricks bundle deploy -t dev
```
If you want to check whether the locally built Python wheel file was deployed:
1. In your Azure Databricks workspace’s sidebar, click Workspace.
2. Click into the following folder: Workspace > Users > <your-username> > .bundle > <project-name> > dev > artifacts > .internal > <random-guid>.
The Python wheel file should be in this folder.
If you want to check whether the job was created:
1. In your Azure Databricks workspace’s sidebar, click Workflows.
2. On the Jobs tab, click [dev <your-username>] <project-name>_job.
3. Click the Tasks tab.
There should be one task: main_task.

If you make any changes to your bundle after this step, you should repeat steps 5-6 to check whether your bundle configuration is still valid and then redeploy the project.

Step 7: Run the deployed project

In this step, you run the Azure Databricks job in your workspace.

From the root directory, use the Databricks CLI to run the bundle run command, as follows, replacing <project-name> with the name of your project from Step 2:
```
databricks bundle run -t dev <project-name>_job
```
Copy the value of Run URL that appears in your terminal and paste this value into your web browser to open your Azure Databricks workspace.
In your Azure Databricks workspace, after the task completes successfully and shows a green title bar, click the main_task task to see the results.

If you make any changes to your bundle after this step, you should repeat steps 5-7 to check whether your bundle configuration is still valid, redeploy the project, and run the redeployed project.
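
The validate, deploy, and run cycle in steps 5-7 can be scripted. The following is a minimal sketch that only assembles the three commands shown in those steps and runs them in order, stopping at the first failure. The function names `bundle_commands` and `run_cycle` are hypothetical. Running `run_cycle` assumes that the Databricks CLI is installed, authenticated, and on your PATH.

```python
import subprocess


def bundle_commands(target: str, job_key: str) -> list:
    """Return the validate, deploy, and run commands used in this article."""
    return [
        ["databricks", "bundle", "validate"],
        ["databricks", "bundle", "deploy", "-t", target],
        ["databricks", "bundle", "run", "-t", target, job_key],
    ]


def run_cycle(target: str, job_key: str, bundle_root: str = ".") -> None:
    """Run each command from the bundle's root directory, in order."""
    for command in bundle_commands(target, job_key):
        # check=True raises CalledProcessError if a command exits non-zero,
        # so a failed validation prevents the deploy and run steps.
        subprocess.run(command, cwd=bundle_root, check=True)
```

For the template-based bundle in this section, the call would be `run_cycle("dev", "<project-name>_job")`, with `<project-name>` replaced by the name of your project from Step 2.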

You have reached the end of the steps for creating a bundle by using a template.

Create the bundle manually

In these steps, you create the bundle from the beginning by hand. These steps guide you to create a bundle that consists of files to build into a Python wheel file and the definition of a Databricks job to build this Python wheel file. You then validate, deploy, and build the deployed files into a Python wheel file from the Python wheel job within your Databricks workspace.

These steps include adding content to a YAML file. Optionally, you might want to use an integrated development environment (IDE) that provides automatic schema suggestions and actions when working with YAML files. The following steps use Visual Studio Code with the YAML extension installed from the Visual Studio Code Marketplace.

These steps assume that you already know:

How to create, build, and work with Python wheel files with Poetry or setuptools. For Poetry, see Basic usage. For setuptools, see the Python Packaging User Guide.
How to use Python wheel files as part of an Azure Databricks job. See Use a Python wheel file in an Azure Databricks job.

Follow these instructions to create a sample bundle that builds a Python wheel file with Poetry or setuptools, deploys the Python wheel file, and then runs the deployed Python wheel file.

If you have already built a Python wheel file and just want to deploy and run it, skip ahead to specifying the Python wheel settings in the bundle configuration file in Step 3: Create the bundle’s configuration file.

Step 1: Set up authentication

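In this step, you set up authentication between the Databricks CLI on your development machine and your Azure Databricks workspace. This article assumes that you want to use OAuth user-to-machine (U2M) authentication and a corresponding Azure Databricks configuration profile named DEFAULT for authentication.

Note

U2M authentication is appropriate for trying out these steps in real time. For fully automated workflows, Databricks recommends that you use OAuth machine-to-machine (M2M) authentication instead. See the M2M authentication setup instructions in Authentication.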

Use the Databricks CLI to initiate OAuth token management locally by running the following command for each target workspace.

In the following command, replace <workspace-url> with your Azure Databricks per-workspace URL, for example https://adb-1234567890123456.7.azuredatabricks.net.
```
databricks auth login --host <workspace-url>
```
The Databricks CLI prompts you to save the information that you entered as an Azure Databricks configuration profile. Press Enter to accept the suggested profile name, or enter the name of a new or existing profile. Any existing profile with the same name is overwritten with the information that you entered. You can use profiles to quickly switch your authentication context across multiple workspaces.

To get a list of any existing profiles, in a separate terminal or command prompt, use the Databricks CLI to run the command databricks auth profiles. To view a specific profile’s existing settings, run the command databricks auth env --profile <profile-name>.
In your web browser, complete the on-screen instructions to log in to your Azure Databricks workspace.
To view a profile’s current OAuth token value and the token’s upcoming expiration timestamp, run one of the following commands:
- databricks auth token --host <workspace-url>
- databricks auth token -p <profile-name>
- databricks auth token --host <workspace-url> -p <profile-name>
If you have multiple profiles with the same --host value, you might need to specify the --host and -p options together to help the Databricks CLI find the correct matching OAuth token information.

Step 2: Create the bundle

A bundle contains the artifacts you want to deploy and the settings for the workflows you want to run.

In your bundle’s root, create the following folders and files, depending on whether you use Poetry or setuptools for building Python wheel files:

Poetry

```
├── src
│     └── my_package
│           ├── __init__.py
│           ├── main.py
│           └── my_module.py
└── pyproject.toml
```

Setuptools

```
├── src
│     └── my_package
│           ├── __init__.py
│           ├── main.py
│           └── my_module.py
└── setup.py
```

Leave the __init__.py file empty.

Add the following code to the main.py file and then save the file:

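```
# Import the helpers explicitly instead of using a wildcard import.
from my_package.my_module import (
  add_two_numbers,
  divide_two_numbers,
  multiply_two_numbers,
  subtract_two_numbers,
)

def main():
  # Sample operands for the four arithmetic helpers.
  first = 200
  second = 400

  print(f"{first} + {second} = {add_two_numbers(first, second)}")
  print(f"{second} - {first} = {subtract_two_numbers(second, first)}")
  print(f"{first} * {second} = {multiply_two_numbers(first, second)}")
  print(f"{second} / {first} = {divide_two_numbers(second, first)}")

if __name__ == "__main__":
  main()
```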

Add the following code to the my_module.py file and then save the file:

```
def add_two_numbers(a, b):
  return a + b

def subtract_two_numbers(a, b):
  return a - b

def multiply_two_numbers(a, b):
  return a * b

def divide_two_numbers(a, b):
  return a / b
```

Add the following code to the pyproject.toml or setup.py file and then save the file:

Pyproject.toml

```
[tool.poetry]
name = "my_package"
version = "0.0.1"
description = "<my-package-description>"
authors = ["my-author-name <my-author-name>@<my-organization>"]

[tool.poetry.dependencies]
python = "^3.10"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

[tool.poetry.scripts]
main = "my_package.main:main"
```

Replace my-author-name with your organization’s primary contact name.
Replace my-author-name>@<my-organization with your organization’s primary email contact address.
Replace <my-package-description> with a display description for your Python wheel file.

Setup.py

```
from setuptools import setup, find_packages

import src

setup(
  name = "my_package",
  version = "0.0.1",
  author = "<my-author-name>",
  url = "https://<my-url>",
  author_email = "<my-author-name>@<my-organization>",
  description = "<my-package-description>",
  packages=find_packages(where='./src'),
  package_dir={'': 'src'},
  entry_points={
    "packages": [
      "main=my_package.main:main"
    ]
  },
  install_requires=[
    "setuptools"
  ]
)
```

Replace https://<my-url> with your organization’s URL.
Replace <my-author-name> with your organization’s primary contact name.
Replace <my-author-name>@<my-organization> with your organization’s primary email contact address.
Replace <my-package-description> with a display description for your Python wheel file.

Step 3: Create the bundle’s configuration file

A bundle configuration file describes the artifacts you want to deploy and the workflows you want to run.

In your bundle’s root, add a bundle configuration file named databricks.yml. Add the following code to this file:

Poetry

Note

If you have already built a Python wheel file and just want to deploy it, then modify the following bundle configuration file by omitting the artifacts mapping. The Databricks CLI will then assume that the Python wheel file is already built and will automatically deploy the files that are specified in the libraries array’s whl entries.
```
bundle:
  name: my-wheel-bundle

artifacts:
  default:
    type: whl
    build: poetry build
    path: .

resources:
  jobs:
    wheel-job:
      name: wheel-job
      tasks:
        - task_key: wheel-task
          new_cluster:
            spark_version: 13.3.x-scala2.12
            node_type_id: Standard_DS3_v2
            data_security_mode: USER_ISOLATION
            num_workers: 1
          python_wheel_task:
            entry_point: main
            package_name: my_package
          libraries:
            - whl: ./dist/*.whl

targets:
  dev:
    workspace:
      host: <workspace-url>
```
Setuptools
```
bundle:
  name: my-wheel-bundle

resources:
  jobs:
    wheel-job:
      name: wheel-job
      tasks:
        - task_key: wheel-task
          new_cluster:
            spark_version: 13.3.x-scala2.12
            node_type_id: Standard_DS3_v2
            data_security_mode: USER_ISOLATION
            num_workers: 1
          python_wheel_task:
            entry_point: main
            package_name: my_package
          libraries:
            - whl: ./dist/*.whl

targets:
  dev:
    workspace:
      host: <workspace-url>
```
Replace <workspace-url> with your per-workspace URL, for example https://adb-1234567890123456.7.azuredatabricks.net.

The artifacts mapping is required to build Python wheel files with Poetry and is optional to build Python wheel files with setuptools. The artifacts mapping contains one or more artifact definitions with the following mappings:
- The type mapping must be present and set to whl to specify that a Python wheel file is to be built. For setuptools, whl is the default if no artifact definitions are specified.
- The path mapping indicates the path to the pyproject.toml file for Poetry or to the setup.py file for setuptools. This path is relative to the databricks.yml file. For setuptools, this path is . (the same directory as the databricks.yml file) by default.
- The build mapping indicates any custom build commands to run to build the Python wheel file. For setuptools, this command is python3 setup.py bdist_wheel by default.
- The files mapping consists of one or more source mappings that specify any additional files to include in the Python wheel build. There is no default.
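
The defaults described in this list can be modeled in a few lines. The following is a minimal sketch of the documented rules only, not of the Databricks CLI’s actual implementation. The names `SETUPTOOLS_DEFAULT` and `effective_artifacts` are hypothetical, and the artifact name `default` mirrors the name used in the Poetry configuration in this step.

```python
from typing import Optional

# Defaults that this article describes for setuptools builds.
SETUPTOOLS_DEFAULT = {
    "type": "whl",
    "path": ".",
    "build": "python3 setup.py bdist_wheel",
}


def effective_artifacts(artifacts: Optional[dict], uses_poetry: bool) -> dict:
    """Return the artifact definitions that would apply to a bundle."""
    if not artifacts:
        if uses_poetry:
            # The artifacts mapping is required for Poetry builds.
            raise ValueError("Poetry builds require an explicit artifacts mapping")
        # For setuptools, the mapping is optional and the defaults apply.
        return {"default": dict(SETUPTOOLS_DEFAULT)}
    for name, definition in artifacts.items():
        # Each explicit definition must declare that it builds a wheel.
        if definition.get("type") != "whl":
            raise ValueError(f"artifact {name!r} must set type: whl")
    return artifacts
```

For example, `effective_artifacts(None, uses_poetry=False)` returns the setuptools defaults, while `effective_artifacts(None, uses_poetry=True)` raises ValueError because no build command is defined for Poetry.
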
Note

If you want to install the Python wheel file on a target cluster that has Databricks Runtime 12.2 LTS or below installed, you must add the following top-level mapping to the databricks.yml file:
```
# Applies to jobs with python_wheel_task and that use
# clusters with Databricks Runtime 13.0 or below installed.
experimental:
  python_wheel_wrapper: true
```
This mapping instructs the Databricks CLI to do the following:
- Deploy a copy of the Python wheel file in the background. This deployment path is typically ${workspace.artifact_path}/.internal/<random-id>/<wheel-filename>.whl.
- Create a notebook in the background that contains instructions to install the preceding deployed Python wheel file on the target cluster. This notebook’s path is typically ${workspace.file_path}/.databricks/bundle/<target-name>/.internal/notebook_<job-name>_<task-key>.
- When you run a job that contains a Python wheel task, and that task references the preceding Python wheel file, a job is created in the background that runs the preceding notebook.
You do not need to specify this mapping for target clusters with Databricks Runtime 13.1 or above installed, as Python wheel installations from the Azure Databricks workspace file system will install automatically on these target clusters.
If you use Poetry, do the following:
- Install Poetry, version 1.6 or above, if it is not already installed. To check your installed version of Poetry, run the command poetry -V or poetry --version.
- Make sure you have Python version 3.10 or above installed. To check your version of Python, run the command python -V or python --version.
- Make sure you have Databricks CLI version 0.209.0 or above. To check your version of the Databricks CLI, run the command databricks -v or databricks --version. See Install or update the Databricks CLI.
If you use setuptools, install the wheel and setuptools packages if they are not already installed, by running the following command:
```
pip3 install --upgrade wheel setuptools
```
If you intend to store this bundle with a Git provider, add a .gitignore file in the project’s root, and add the following entries to this file:

Poetry
```
.databricks
dist
```
Setuptools
```
.databricks
build
dist
src/my_package/my_package.egg-info
```

Step 4: Validate the project’s bundle configuration file

In this step, you check whether the bundle configuration is valid.

From the root directory, validate the bundle configuration file:
```
databricks bundle validate
```
If a summary of the bundle configuration is returned, then the validation succeeded. If any errors are returned, fix the errors, and then repeat this step.

If you make any changes to your bundle after this step, you should repeat this step to check whether your bundle configuration is still valid.

Step 5: Build the Python wheel file and deploy the local project to the remote workspace

Build the Python wheel file locally, deploy the built Python wheel file to your workspace, deploy the notebook to your workspace, and create the job in your workspace:

```
databricks bundle deploy -t dev
```

Step 6: Run the deployed project

Run the deployed job, which uses the deployed notebook to call the deployed Python wheel file:
```
databricks bundle run -t dev wheel-job
```
In the output, copy the Run URL and paste it into your web browser’s address bar.

In the job run’s Output page, the following results appear:

```
200 + 400 = 600
400 - 200 = 200
200 * 400 = 80000
400 / 200 = 2.0
```
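
These results can be reproduced locally before you deploy. The following is a minimal, self-contained sketch that repeats the four functions from my_module.py and builds the same lines that main.py prints. The helper name `expected_lines` is hypothetical and is not part of the bundle.

```python
def add_two_numbers(a, b):
    return a + b


def subtract_two_numbers(a, b):
    return a - b


def multiply_two_numbers(a, b):
    return a * b


def divide_two_numbers(a, b):
    return a / b


def expected_lines(first=200, second=400):
    """Build the lines that main() prints for the sample operands."""
    return [
        f"{first} + {second} = {add_two_numbers(first, second)}",
        f"{second} - {first} = {subtract_two_numbers(second, first)}",
        f"{first} * {second} = {multiply_two_numbers(first, second)}",
        f"{second} / {first} = {divide_two_numbers(second, first)}",
    ]


for line in expected_lines():
    print(line)
```

The last result is shown as 2.0 rather than 2 because the / operator in Python always returns a float.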

If you make any changes to your bundle after this step, you should repeat steps 4-6 to check whether your bundle configuration is still valid, redeploy the project, and run the redeployed project.

Build and install a Python wheel file for a job

To build a Python wheel file with Poetry or setuptools, and then use that Python wheel file in a job, you must add one or two mappings to your databricks.yml file.

If you use Poetry, you must include the following artifacts mapping in the databricks.yml file. This mapping runs the poetry build command and uses the pyproject.toml file that is in the same directory as the databricks.yml file:

```
artifacts:
  default:
    type: whl
    build: poetry build
    path: .
```

Note

The artifacts mapping is optional for setuptools. By default, for setuptools the Databricks CLI runs the command python3 setup.py bdist_wheel and uses the setup.py file that is in the same directory as the databricks.yml file. The Databricks CLI assumes that you have already run a command such as pip3 install --upgrade wheel setuptools to install the wheel and setuptools packages if they are not already installed.

Also, the job task’s libraries mapping must contain a whl value that specifies the path to the built Python wheel file relative to the configuration file in which it is declared. The following example shows this in a notebook task (the ellipsis indicates omitted content for brevity):

```
resources:
  jobs:
    my-notebook-job:
      name: my-notebook-job
      tasks:
        - task_key: my-notebook-job-notebook-task
          notebook_task:
            notebook_path: ./my_notebook.py
          libraries:
            - whl: ./dist/*.whl
          new_cluster:
            # ...
```
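
Because the whl value is a glob pattern that is relative to the configuration file, it can be useful to confirm locally that the pattern matches a built wheel. The following is a minimal sketch, assuming that the build wrote its output to a dist folder next to the configuration file. The helper name `find_built_wheels` is hypothetical, and the wheel file name in the demonstration is illustrative only.

```python
import tempfile
from pathlib import Path


def find_built_wheels(config_dir, whl_value="./dist/*.whl"):
    """List the files that a whl glob matches, relative to config_dir."""
    # Drop a leading './' so that the pattern is a plain relative glob.
    pattern = whl_value[2:] if whl_value.startswith("./") else whl_value
    return sorted(Path(config_dir).glob(pattern))


# Demonstration against a throwaway directory instead of a real bundle.
with tempfile.TemporaryDirectory() as bundle_root:
    dist = Path(bundle_root) / "dist"
    dist.mkdir()
    (dist / "my_package-0.0.1-py3-none-any.whl").touch()
    print([wheel.name for wheel in find_built_wheels(bundle_root)])
```

An empty result from `find_built_wheels` suggests that the wheel has not been built yet or that the pattern points to the wrong folder.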

Build and install a Python wheel file for a pipeline

To build a Python wheel file with Poetry or setuptools and then reference that Python wheel file in a Delta Live Tables pipeline, you must add a mapping to your databricks.yml file if you use Poetry, and you must add a %pip install command to your pipeline notebook, as follows.

```
artifacts:
  default:
    type: whl
    build: poetry build
    path: .
```

Note

Also, the related pipeline notebook must include a %pip install command to install the Python wheel file that is built. See Python libraries.
