Add tasks to jobs in Databricks Asset Bundles

10/04/2024

This article provides examples of various types of tasks that you can add to Azure Databricks jobs in Databricks Asset Bundles. See What are Databricks Asset Bundles?.

Most job task types include task-specific parameters among their supported settings, but you can also define job parameters that are passed to tasks. Job parameters support dynamic value references, which let you pass values specific to the job run between tasks. See What is a dynamic value reference?.

Note

You can override job task settings. See Override job tasks settings in Databricks Asset Bundles.
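
As a rough illustration of such an override, the following sketch assumes a hypothetical job named my-job and a hypothetical target named prod. The task in the target is matched to the task in the top-level resources mapping by its task_key, and only the settings listed in the target are overridden. The cluster values are placeholders, not recommendations.

```yaml
# Sketch only: job, task, and target names are hypothetical.
resources:
  jobs:
    my-job:
      name: my-job
      tasks:
        - task_key: my-task
          notebook_task:
            notebook_path: ./my-notebook.ipynb
          new_cluster:
            spark_version: "13.3.x-scala2.12"
            node_type_id: "Standard_DS3_v2"
            num_workers: 1

targets:
  prod:
    resources:
      jobs:
        my-job:
          tasks:
            # Same task_key as above, so these settings are merged into that task.
            - task_key: my-task
              new_cluster:
                num_workers: 4
```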

Tip

To quickly generate resource configuration for an existing job using the Databricks CLI, you can use the bundle generate job command. See bundle commands.

Notebook task

You use this task to run a notebook.

The following example adds a notebook task to a job and sets a job parameter named my_job_run_id. The path for the notebook to deploy is relative to the configuration file in which this task is declared. The task gets the notebook from its deployed location in the Azure Databricks workspace. (Ellipses indicate omitted content, for brevity.)

# ...
resources:
  jobs:
    my-notebook-job:
      name: my-notebook-job
      # ...
      tasks:
        - task_key: my-notebook-task
          notebook_task:
            notebook_path: ./my-notebook.ipynb
      parameters:
        - name: my_job_run_id
          default: "{{job.run_id}}"
        # ...
# ...

For additional mappings that you can set for this task, see tasks > notebook_task in the create job operation’s request payload as defined in POST /api/2.1/jobs/create in the REST API reference, expressed in YAML format. See Notebook task for jobs.
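
As a sketch of the task-specific parameters mentioned earlier in this article, the following fragment sets base_parameters on a notebook task instead of defining job parameters. It assumes the notebook reads two hypothetical parameters named source_job_id and source_task_name; the job and task names are also illustrative.

```yaml
# Sketch only: parameter names and resource names are hypothetical.
resources:
  jobs:
    my-notebook-params-job:
      name: my-notebook-params-job
      tasks:
        - task_key: my-notebook-params-task
          notebook_task:
            notebook_path: ./my-notebook.ipynb
            # Task-level parameters; values can use dynamic value references.
            base_parameters:
              source_job_id: "{{job.id}}"
              source_task_name: "{{task.name}}"
```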

Python script task

You use this task to run a Python file.

The following example adds a Python script task to a job. The path for the Python file to deploy is relative to the configuration file in which this task is declared. The task gets the Python file from its deployed location in the Azure Databricks workspace. (Ellipses indicate omitted content, for brevity.)

# ...
resources:
  jobs:
    my-python-script-job:
      name: my-python-script-job
      # ...
      tasks:
        - task_key: my-python-script-task
          spark_python_task:
            python_file: ./my-script.py
          # ...
# ...

For additional mappings that you can set for this task, see tasks > spark_python_task in the create job operation’s request payload as defined in POST /api/2.1/jobs/create in the REST API reference, expressed in YAML format. See also Python script task for jobs.
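
For example, one of those additional mappings is parameters, a list of strings passed to the Python file as command-line arguments. The following is a minimal sketch; the argument names and values are hypothetical, and compute settings are omitted as in the example above.

```yaml
# Sketch only: the arguments shown are hypothetical.
resources:
  jobs:
    my-python-script-job:
      name: my-python-script-job
      tasks:
        - task_key: my-python-script-task
          spark_python_task:
            python_file: ./my-script.py
            # Passed to the script as command-line arguments, in order.
            parameters:
              - "--input-date"
              - "2024-01-01"
```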

Python wheel task

You use this task to run a Python wheel file.

The following example adds a Python wheel task to a job. The path for the Python wheel file to deploy is relative to the configuration file in which this task is declared. See Databricks Asset Bundles library dependencies. (Ellipses indicate omitted content, for brevity.)

# ...
resources:
  jobs:
    my-python-wheel-job:
      name: my-python-wheel-job
      # ...
      tasks:
        - task_key: my-python-wheel-task
          python_wheel_task:
            entry_point: run
            package_name: my_package
          libraries:
            - whl: ./my_package/dist/my_package-*.whl
          # ...
# ...

For additional mappings that you can set for this task, see tasks > python_wheel_task in the create job operation’s request payload as defined in POST /api/2.1/jobs/create in the REST API reference, expressed in YAML format. See also Develop a Python wheel file using Databricks Asset Bundles and Python Wheel task for jobs.
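
For example, python_wheel_task also accepts named_parameters, a mapping of keyword arguments passed to the entry point. The following is a minimal sketch; the parameter names and values are hypothetical.

```yaml
# Sketch only: parameter names and values are hypothetical.
resources:
  jobs:
    my-python-wheel-job:
      name: my-python-wheel-job
      tasks:
        - task_key: my-python-wheel-task
          python_wheel_task:
            entry_point: run
            package_name: my_package
            # Keyword arguments passed to the wheel's entry point.
            named_parameters:
              environment: "dev"
              run_id: "{{job.run_id}}"
          libraries:
            - whl: ./my_package/dist/my_package-*.whl
```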

JAR task

You use this task to run a JAR. You can reference local JAR libraries or those in a workspace, a Unity Catalog volume, or an external cloud storage location. See Databricks Asset Bundles library dependencies.

The following example adds a JAR task to a job. The JAR path points to the specified Unity Catalog volume location. (Ellipses indicate omitted content, for brevity.)

# ...
resources:
  jobs:
    my-jar-job:
      name: my-jar-job
      # ...
      tasks:
        - task_key: my-jar-task
          spark_jar_task:
            main_class_name: org.example.com.Main
          libraries:
            - jar: /Volumes/main/default/my-volume/my-project-0.1.0-SNAPSHOT.jar
          # ...
# ...

For additional mappings that you can set for this task, see tasks > spark_jar_task in the create job operation’s request payload as defined in POST /api/2.1/jobs/create in the REST API reference, expressed in YAML format. See JAR task for jobs.
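
For example, spark_jar_task also accepts parameters, a list of strings passed to the main method of the main class. The following is a minimal sketch; the argument values are hypothetical.

```yaml
# Sketch only: the arguments shown are hypothetical.
resources:
  jobs:
    my-jar-job:
      name: my-jar-job
      tasks:
        - task_key: my-jar-task
          spark_jar_task:
            main_class_name: org.example.com.Main
            # Passed to the main method as its argument array.
            parameters:
              - "--mode"
              - "incremental"
          libraries:
            - jar: /Volumes/main/default/my-volume/my-project-0.1.0-SNAPSHOT.jar
```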

SQL file task

You use this task to run a SQL file located in a workspace or a remote Git repository.

The following example adds a SQL file task to a job. This SQL file task uses the specified SQL warehouse to run the specified SQL file. (Ellipses indicate omitted content, for brevity.)

# ...
resources:
  jobs:
    my-sql-file-job:
      name: my-sql-file-job
      # ...
      tasks:
        - task_key: my-sql-file-task
          sql_task:
            file:
              path: /Users/someone@example.com/hello-world.sql
              source: WORKSPACE
            warehouse_id: 1a111111a1111aa1
          # ...
# ...

To get a SQL warehouse’s ID, open the SQL warehouse’s settings page, then copy the ID found in parentheses after the name of the warehouse in the Name field on the Overview tab.

For additional mappings that you can set for this task, see tasks > sql_task > file in the create job operation’s request payload as defined in POST /api/2.1/jobs/create in the REST API reference, expressed in YAML format. See SQL task for jobs.
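
The example above reads the SQL file from the workspace. As a sketch of the remote Git repository case, the following fragment assumes a hypothetical GitHub repository and a hypothetical file path relative to the repository root. The repository is declared once at the job level with git_source, and the task sets source to GIT.

```yaml
# Sketch only: the repository URL, branch, and file path are hypothetical.
resources:
  jobs:
    my-sql-file-git-job:
      name: my-sql-file-git-job
      git_source:
        git_url: https://github.com/example-org/example-repo
        git_provider: gitHub
        git_branch: main
      tasks:
        - task_key: my-sql-file-git-task
          sql_task:
            file:
              # Relative to the root of the Git repository.
              path: queries/hello-world.sql
              source: GIT
            warehouse_id: 1a111111a1111aa1
```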

Delta Live Tables pipeline task

You use this task to run a Delta Live Tables pipeline. See What is Delta Live Tables?.

The following example adds a Delta Live Tables pipeline task to a job. This Delta Live Tables pipeline task runs the specified pipeline. (Ellipses indicate omitted content, for brevity.)

# ...
resources:
  jobs:
    my-pipeline-job:
      name: my-pipeline-job
      # ...
      tasks:
        - task_key: my-pipeline-task
          pipeline_task:
            pipeline_id: 11111111-1111-1111-1111-111111111111
          # ...
# ...

You can get a pipeline’s ID by opening the pipeline in the workspace and copying the Pipeline ID value on the Pipeline details tab of the pipeline’s settings page.

For additional mappings that you can set for this task, see tasks > pipeline_task in the create job operation’s request payload as defined in POST /api/2.1/jobs/create in the REST API reference, expressed in YAML format. See Delta Live Tables pipeline task for jobs.
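
If the pipeline is defined in the same bundle, a substitution can supply its ID instead of a hard-coded value. The following sketch assumes a hypothetical pipeline resource named my-pipeline backed by a hypothetical notebook; full_refresh is shown with an explicit value only for illustration.

```yaml
# Sketch only: the pipeline resource and notebook path are hypothetical.
resources:
  pipelines:
    my-pipeline:
      name: my-pipeline
      libraries:
        - notebook:
            path: ./my-pipeline-notebook.ipynb
  jobs:
    my-pipeline-job:
      name: my-pipeline-job
      tasks:
        - task_key: my-pipeline-task
          pipeline_task:
            # Resolved to the deployed pipeline's ID at deployment time.
            pipeline_id: ${resources.pipelines.my-pipeline.id}
            full_refresh: false
```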

dbt task

You use this task to run one or more dbt commands. See Connect to dbt Cloud.

The following example adds a dbt task to a job. This dbt task uses the specified SQL warehouse to run the specified dbt commands.

# ...
resources:
  jobs:
    my-dbt-job:
      name: my-dbt-job
      # ...
      tasks:
        - task_key: my-dbt-task
          dbt_task:
            commands:
              - "dbt deps"
              - "dbt seed"
              - "dbt run"
            project_directory: /Users/someone@example.com/Testing
            warehouse_id: 1a111111a1111aa1
          libraries:
            - pypi:
                package: "dbt-databricks>=1.0.0,<2.0.0"
          # ...
# ...

To get a SQL warehouse’s ID, open the SQL warehouse’s settings page, then copy the ID found in parentheses after the name of the warehouse in the Name field on the Overview tab.

For additional mappings that you can set for this task, see tasks > dbt_task in the create job operation’s request payload as defined in POST /api/2.1/jobs/create in the REST API reference, expressed in YAML format. See dbt task for jobs.

Databricks Asset Bundles also includes a dbt-sql project template that defines a job with a dbt task, as well as dbt profiles for deployed dbt jobs. For information about Databricks Asset Bundles templates, see Use a default bundle template.

Run job task

You use this task to run another job.

The following example contains a run job task in the second job that runs the first job.

# ...
resources:
  jobs:
    my-first-job:
      name: my-first-job
      tasks:
        - task_key: my-first-job-task
          new_cluster:
            spark_version: "13.3.x-scala2.12"
            node_type_id: "Standard_DS3_v2"
            num_workers: 2
          notebook_task:
            notebook_path: ./src/test.py
    my_second_job:
      name: my-second-job
      tasks:
        - task_key: my-second-job-task
          run_job_task:
            job_id: ${resources.jobs.my-first-job.id}
  # ...

This example uses a substitution to retrieve the ID of the job to run. To get a job’s ID from the UI, open the job in the workspace and copy the Job ID value from the Job details tab of the job’s settings page.

For additional mappings that you can set for this task, see tasks > run_job_task in the create job operation’s request payload as defined in POST /api/2.1/jobs/create in the REST API reference, expressed in YAML format.
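
For example, run_job_task also accepts job_parameters, a mapping of values passed to the job being run. The following sketch assumes the first job declares a hypothetical job parameter named triggered_by; the first job's tasks are omitted for brevity.

```yaml
# Sketch only: the parameter name and value are hypothetical.
resources:
  jobs:
    my-first-job:
      name: my-first-job
      parameters:
        - name: triggered_by
          default: "manual"
      # ...
    my-second-job:
      name: my-second-job
      tasks:
        - task_key: my-second-job-task
          run_job_task:
            job_id: ${resources.jobs.my-first-job.id}
            # Overrides the default value of the first job's parameter.
            job_parameters:
              triggered_by: "my-second-job"
```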
