Tutorial: Create your first custom Databricks Asset Bundle template

Article
10/09/2024

In this tutorial, you’ll create a custom Databricks Asset Bundle template for creating bundles that run a job with a specific Python task on a cluster using a specific Docker container image.

Before you start

Install the Databricks CLI version 0.218.0 or above. If you’ve already installed it, confirm the version is 0.218.0 or higher by running databricks -v from the command line.
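
If you want to script the version check, the following is a minimal sketch. It assumes the CLI prints its version in a form such as "Databricks CLI v0.218.0"; the exact output text may differ on your installation. The helper names parse_cli_version and meets_minimum are hypothetical and are not part of the Databricks CLI.

```python
import re

# Minimum CLI version required by this tutorial.
MINIMUM_VERSION = (0, 218, 0)


def parse_cli_version(output: str) -> tuple:
    """Extract (major, minor, patch) from the text printed by the version command."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError(f"no version number found in: {output!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(output: str, minimum: tuple = MINIMUM_VERSION) -> bool:
    """Return True if the reported version is at least the minimum version."""
    return parse_cli_version(output) >= minimum


# Assumed sample output; replace it with the text your own CLI prints.
sample_output = "Databricks CLI v0.218.0"
print(meets_minimum(sample_output))
```

Comparing tuples of integers avoids the pitfall of comparing version strings lexically, where "0.99.0" would sort after "0.218.0".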

Define user prompt variables

The first step in building a bundle template is to define the user prompt variables for databricks bundle init. From the command line:

Create an empty directory named dab-container-template:
```
mkdir dab-container-template
```
In the directory’s root, create a file named databricks_template_schema.json:
```
cd dab-container-template
touch databricks_template_schema.json
```

Add the following contents to databricks_template_schema.json and save the file. Each variable is translated to a user prompt during bundle creation.

{
  "properties": {
    "project_name": {
      "type": "string",
      "default": "project_name",
      "description": "Project name",
      "order": 1
    }
  }
}
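
To preview which prompts a schema defines before running the CLI, you can read the properties and sort them by their order value. This is a minimal sketch and not the CLI's own implementation: prompts_in_order is a hypothetical helper, the fallback values for missing keys are assumptions, and the actual prompt wording is decided by the CLI.

```python
import json

# Same contents as databricks_template_schema.json in this tutorial.
SCHEMA_TEXT = """
{
  "properties": {
    "project_name": {
      "type": "string",
      "default": "project_name",
      "description": "Project name",
      "order": 1
    }
  }
}
"""


def prompts_in_order(schema: dict) -> list:
    """Return (variable, description, default) tuples sorted by the 'order' key."""
    properties = schema.get("properties", {})
    ordered = sorted(properties.items(), key=lambda item: item[1].get("order", 0))
    return [
        (name, spec.get("description", ""), str(spec.get("default", "")))
        for name, spec in ordered
    ]


schema = json.loads(SCHEMA_TEXT)
for name, description, default in prompts_in_order(schema):
    # Show the description, the default value, and the variable it fills.
    print(f"{description} [{default}] -> {name}")
```

Loading the file with json.loads also catches syntax errors, such as a trailing comma, before you test the template.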

Create the bundle folder structure

Next, create a directory named template with subdirectories named resources and src. The template folder contains the directory structure for your generated bundles. Subdirectory and file names that are derived from user values use Go package template syntax, for example {{.project_name}}.

  mkdir -p "template/resources"
  mkdir -p "template/src"
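
To illustrate how a templated name resolves to a concrete one, here is a rough Python approximation. It handles only simple {{.name}} placeholders, not the full Go template language, and it assumes that the .tmpl suffix is dropped from generated files. The function render_path and the example paths are hypothetical.

```python
import re

# Matches simple Go-template placeholders such as {{.project_name}}.
PLACEHOLDER = re.compile(r"\{\{\s*\.(\w+)\s*\}\}")


def render_path(template_path: str, values: dict) -> str:
    """Substitute {{.name}} placeholders in a path with user-supplied values."""

    def substitute(match):
        key = match.group(1)
        if key not in values:
            raise KeyError(f"no value supplied for template variable {key!r}")
        return values[key]

    rendered = PLACEHOLDER.sub(substitute, template_path)
    # Assumption: generated files lose the .tmpl suffix.
    return rendered.removesuffix(".tmpl")


user_values = {"project_name": "my_project"}
print(render_path("src/{{.project_name}}/task.py", user_values))
print(render_path("resources/{{.project_name}}_job.yml.tmpl", user_values))
```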

Add YAML configuration templates

In the template directory, create a file named databricks.yml.tmpl and add the following YAML:

  touch template/databricks.yml.tmpl

  # This is a Databricks asset bundle definition for {{.project_name}}.
  # See https://docs.databricks.com/dev-tools/bundles/index.html for documentation.
  bundle:
    name: {{.project_name}}

  include:
    - resources/*.yml

  targets:
    # The 'dev' target, used for development purposes.
    # Whenever a developer deploys using 'dev', they get their own copy.
    dev:
      # We use 'mode: development' to make sure everything deployed to this target gets a prefix
      # like '[dev my_user_name]'. Setting this mode also disables any schedules and
      # automatic triggers for jobs and enables the 'development' mode for Delta Live Tables pipelines.
      mode: development
      default: true
      workspace:
        host: {{workspace_host}}

    # The 'prod' target, used for production deployment.
    prod:
      # For production deployments, we only have a single copy, so we override the
      # workspace.root_path default of
      # /Workspace/Users/${workspace.current_user.userName}/.bundle/${bundle.target}/${bundle.name}
      # to a path that is not specific to the current user.
      #
      # By making use of 'mode: production' we enable strict checks
      # to make sure we have correctly configured this target.
      mode: production
      workspace:
        host: {{workspace_host}}
        root_path: /Shared/.bundle/prod/${bundle.name}
      {{- if not is_service_principal}}
      run_as:
        # This runs as {{user_name}} in production. Alternatively,
        # a service principal could be used here using service_principal_name
        # (see Databricks documentation).
        user_name: {{user_name}}
      {{end -}}

Create another YAML file named {{.project_name}}_job.yml.tmpl and place it in the template/resources directory. This new YAML file splits the project job definitions from the rest of the bundle’s definition. Add the following YAML to this file to describe the template job, which contains a specific Python task to run on a job cluster using a specific Docker container image:

  touch template/resources/{{.project_name}}_job.yml.tmpl

  # The main job for {{.project_name}}
  resources:
    jobs:
      {{.project_name}}_job:
        name: {{.project_name}}_job
        tasks:
          - task_key: python_task
            job_cluster_key: job_cluster
            spark_python_task:
              python_file: ../src/{{.project_name}}/task.py
        job_clusters:
          - job_cluster_key: job_cluster
            new_cluster:
              docker_image:
                url: databricksruntime/python:10.4-LTS
              node_type_id: i3.xlarge
              spark_version: 13.3.x-scala2.12

In this example, you use a default Databricks base Docker container image, but you can specify your own custom image instead.

Add files referenced in your configuration

Next, create a template/src/{{.project_name}} directory and create the Python task file referenced by the job in the template:

  mkdir -p template/src/{{.project_name}}
  touch template/src/{{.project_name}}/task.py

Now, add the following to task.py:

  import pyspark
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.master('local[*]').appName('example').getOrCreate()

  print(f'Spark version: {spark.version}')

Verify the bundle template structure

Review the folder structure of your bundle template project. It should look like this:

  .
  ├── databricks_template_schema.json
  └── template
      ├── databricks.yml.tmpl
      ├── resources
      │   └── {{.project_name}}_job.yml.tmpl
      └── src
          └── {{.project_name}}
              └── task.py
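
If you prefer to check the layout with a script, the following minimal sketch reports any expected file that is missing. It assumes you run it from the dab-container-template directory; missing_paths is a hypothetical helper name.

```python
from pathlib import Path

# Files this tutorial expects, relative to the template project root.
EXPECTED_PATHS = [
    "databricks_template_schema.json",
    "template/databricks.yml.tmpl",
    "template/resources/{{.project_name}}_job.yml.tmpl",
    "template/src/{{.project_name}}/task.py",
]


def missing_paths(template_root: Path) -> list:
    """Return the expected relative paths that are not files under template_root."""
    return [rel for rel in EXPECTED_PATHS if not (template_root / rel).is_file()]


missing = missing_paths(Path("."))
if missing:
    print("Missing files:")
    for rel in missing:
        print(f"  {rel}")
else:
    print("Template structure looks complete.")
```

The placeholder names are checked literally, because the files on disk keep the {{.project_name}} text until a bundle is generated from the template.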

Test your template

Finally, test your bundle template. To generate a bundle based on your new custom template, use the databricks bundle init command, specifying the location of the template. From the root folder of your bundle projects (this example assumes that folder also contains the dab-container-template directory):

  mkdir my-new-container-bundle
  cd my-new-container-bundle
  databricks bundle init ../dab-container-template
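
For repeatable tests, you may want to skip the interactive prompts. The sketch below writes the prompt answers to a JSON file and builds a databricks bundle init command that uses the --config-file and --output-dir options; confirm both options with databricks bundle init --help for your CLI version. The helper names, the template path, and the file name init-config.json are assumptions for illustration. The script only prints the command and does not run it.

```python
import json
import shlex
from pathlib import Path


def write_config(values: dict, path: Path) -> Path:
    """Write answers for the template's prompt variables to a JSON file."""
    path.write_text(json.dumps(values, indent=2))
    return path


def build_init_command(template_path: str, output_dir: str, config_file: str) -> list:
    """Build the argument list for a non-interactive 'databricks bundle init'."""
    return [
        "databricks", "bundle", "init", str(template_path),
        "--output-dir", str(output_dir),
        "--config-file", str(config_file),
    ]


# Hypothetical locations; adjust them to where your template and bundle live.
command = build_init_command("../dab-container-template", ".", "init-config.json")
print(shlex.join(command))

# To execute it, first write the answers and then run the command, for example:
#   write_config({"project_name": "my_project"}, Path("init-config.json"))
#   subprocess.run(command, check=True)   # requires 'import subprocess'
```

Passing the arguments as a list rather than a single string avoids shell quoting problems if a path contains spaces.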

Next steps

Create a bundle that deploys a notebook to an Azure Databricks workspace and then runs that deployed notebook as an Azure Databricks job. See Develop a job on Azure Databricks using Databricks Asset Bundles.
Create a bundle that deploys a notebook to an Azure Databricks workspace and then runs that deployed notebook as a Delta Live Tables pipeline. See Develop Delta Live Tables pipelines with Databricks Asset Bundles.
Create a bundle that deploys and runs an MLOps Stack. See Databricks Asset Bundles for MLOps Stacks.
Add a bundle to a CI/CD (continuous integration/continuous deployment) workflow in GitHub. See Run a CI/CD workflow with a Databricks Asset Bundle and GitHub Actions.

Resources

Bundle examples repository in GitHub
