Updating from Jobs API 2.0 to 2.1
You can now orchestrate multiple tasks with Azure Databricks jobs. This article details the changes to the Jobs API that support jobs with multiple tasks and provides guidance to help you update your existing API clients to work with this new feature.
Databricks recommends Jobs API 2.1 for your API scripts and clients, particularly when using jobs with multiple tasks.
This article refers to jobs defined with a single task as single-task format and jobs defined with multiple tasks as multi-task format.
Jobs API 2.0 and 2.1 now support the update request. To change an existing job, use the update request instead of the reset request; this minimizes changes between single-task format jobs and multi-task format jobs.
The Jobs API now defines a TaskSettings object to capture settings for each task in a job. For multi-task format jobs, the tasks field, an array of TaskSettings data structures, is included in the JobSettings object. Some fields previously part of JobSettings are now part of the task settings for multi-task format jobs. JobSettings is also updated to include the format field. The format field indicates the format of the job and is a STRING value set to SINGLE_TASK or MULTI_TASK.
You need to update your existing API clients for these changes to JobSettings for multi-task format jobs. See the API client guide for more information on required changes.
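One way a client might absorb this change is to normalize both formats into a list of per-task dictionaries. The following is a minimal sketch, not part of the Jobs API: the helper name task_settings and the list of keys it treats as task-level are assumptions made for illustration, and a job without a format field is treated as single-task.

```python
# Hypothetical helper: return per-task settings for either job format.
# TASK_LEVEL_KEYS is illustrative, not an exhaustive list of task-level fields.
TASK_LEVEL_KEYS = (
    "notebook_task", "spark_jar_task", "spark_python_task", "spark_submit_task",
    "existing_cluster_id", "new_cluster", "libraries",
    "max_retries", "min_retry_interval_millis", "retry_on_timeout",
)


def task_settings(job):
    """Return a list of task-settings dicts for a Job structure."""
    settings = job.get("settings", {})
    if settings.get("format") == "MULTI_TASK":
        # Multi-task format: settings live in the tasks array.
        return list(settings.get("tasks", []))
    # Single-task format: task fields sit at the top level of JobSettings.
    return [{k: settings[k] for k in TASK_LEVEL_KEYS if k in settings}]
```

With a helper like this, downstream code can loop over tasks without branching on the format field each time.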
Jobs API 2.1 supports the multi-task format. All API 2.1 requests must conform to this format, and responses are structured in this format.
Jobs API 2.0 is updated with an additional field to support multi-task format jobs. Except where noted, the examples in this document use API 2.0. However, Databricks recommends API 2.1 for new and existing API scripts and clients.
An example JSON document representing a multi-task format job for API 2.0 and 2.1:
{
"job_id": 53,
"settings": {
"name": "A job with multiple tasks",
"email_notifications": {},
"timeout_seconds": 0,
"max_concurrent_runs": 1,
"tasks": [
{
"task_key": "clean_data",
"description": "Clean and prepare the data",
"notebook_task": {
"notebook_path": "/Users/user@databricks.com/clean-data"
},
"existing_cluster_id": "1201-my-cluster",
"max_retries": 3,
"min_retry_interval_millis": 0,
"retry_on_timeout": true,
"timeout_seconds": 3600,
"email_notifications": {}
},
{
"task_key": "analyze_data",
"description": "Perform an analysis of the data",
"notebook_task": {
"notebook_path": "/Users/user@databricks.com/analyze-data"
},
"depends_on": [
{
"task_key": "clean_data"
}
],
"existing_cluster_id": "1201-my-cluster",
"max_retries": 3,
"min_retry_interval_millis": 0,
"retry_on_timeout": true,
"timeout_seconds": 3600,
"email_notifications": {}
}
],
"format": "MULTI_TASK"
},
"created_time": 1625841911296,
"creator_user_name": "user@databricks.com",
"run_as_user_name": "user@databricks.com"
}
Jobs API 2.1 supports configuration of task-level clusters or one or more shared job clusters:
- A task-level cluster is created and started when a task starts and terminates when the task completes.
- A shared job cluster allows multiple tasks in the same job to use the cluster. The cluster is created and started when the first task using the cluster starts and terminates after the last task using the cluster completes. A shared job cluster is not terminated when idle but terminates only after all tasks using it are complete. Multiple non-dependent tasks sharing a cluster can start at the same time. If a shared job cluster fails or is terminated before all tasks have finished, a new cluster is created.
To configure shared job clusters, include a JobCluster array in the JobSettings object. You can specify a maximum of 100 clusters per job. The following is an example of an API 2.1 response for a job configured with two shared clusters:
Note
If a task has library dependencies, you must configure the libraries in the task field settings; libraries cannot be configured in a shared job cluster configuration. In the following example, the libraries field in the configuration of the ingest_orders task demonstrates specification of a library dependency.
{
"job_id": 53,
"settings": {
"name": "A job with multiple tasks",
"email_notifications": {},
"timeout_seconds": 0,
"max_concurrent_runs": 1,
"job_clusters": [
{
"job_cluster_key": "default_cluster",
"new_cluster": {
"spark_version": "7.3.x-scala2.12",
"node_type_id": "i3.xlarge",
"spark_conf": {
"spark.speculation": true
},
"aws_attributes": {
"availability": "SPOT",
"zone_id": "us-west-2a"
},
"autoscale": {
"min_workers": 2,
"max_workers": 8
}
}
},
{
"job_cluster_key": "data_processing_cluster",
"new_cluster": {
"spark_version": "7.3.x-scala2.12",
"node_type_id": "r4.2xlarge",
"spark_conf": {
"spark.speculation": true
},
"aws_attributes": {
"availability": "SPOT",
"zone_id": "us-west-2a"
},
"autoscale": {
"min_workers": 8,
"max_workers": 16
}
}
}
],
"tasks": [
{
"task_key": "ingest_orders",
"description": "Ingest order data",
"depends_on": [ ],
"job_cluster_key": "default_cluster",
"spark_jar_task": {
"main_class_name": "com.databricks.OrdersIngest",
"parameters": [
"--data",
"dbfs:/path/to/order-data.json"
]
},
"libraries": [
{
"jar": "dbfs:/mnt/databricks/OrderIngest.jar"
}
],
"timeout_seconds": 86400,
"max_retries": 3,
"min_retry_interval_millis": 2000,
"retry_on_timeout": false
},
{
"task_key": "clean_orders",
"description": "Clean and prepare the order data",
"notebook_task": {
"notebook_path": "/Users/user@databricks.com/clean-data"
},
"job_cluster_key": "default_cluster",
"max_retries": 3,
"min_retry_interval_millis": 0,
"retry_on_timeout": true,
"timeout_seconds": 3600,
"email_notifications": {}
},
{
"task_key": "analyze_orders",
"description": "Perform an analysis of the order data",
"notebook_task": {
"notebook_path": "/Users/user@databricks.com/analyze-data"
},
"depends_on": [
{
"task_key": "clean_orders"
}
],
"job_cluster_key": "data_processing_cluster",
"max_retries": 3,
"min_retry_interval_millis": 0,
"retry_on_timeout": true,
"timeout_seconds": 3600,
"email_notifications": {}
}
],
"format": "MULTI_TASK"
},
"created_time": 1625841911296,
"creator_user_name": "user@databricks.com",
"run_as_user_name": "user@databricks.com"
}
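Because tasks refer to shared clusters and to other tasks by key, a client may want to check those references locally before sending a job definition. The following is a minimal sketch under that assumption; find_reference_errors is a hypothetical helper, not an API call, and it makes no claim about which checks the service itself performs.

```python
def find_reference_errors(settings):
    """Return a list of problems found in a JobSettings dict (hypothetical check)."""
    cluster_keys = {c["job_cluster_key"] for c in settings.get("job_clusters", [])}
    task_keys = {t["task_key"] for t in settings.get("tasks", [])}
    errors = []
    for task in settings.get("tasks", []):
        # A task that names a shared cluster must name one defined in job_clusters.
        key = task.get("job_cluster_key")
        if key is not None and key not in cluster_keys:
            errors.append(f"{task['task_key']}: unknown job_cluster_key {key!r}")
        # Every dependency must name another task in the same job.
        for dep in task.get("depends_on", []):
            if dep["task_key"] not in task_keys:
                errors.append(f"{task['task_key']}: unknown dependency {dep['task_key']!r}")
    return errors
```

An empty list means every job_cluster_key and depends_on entry resolves to something defined in the same settings object.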
For single-task format jobs, the JobSettings data structure remains unchanged except for the addition of the format field. No TaskSettings array is included, and the task settings remain defined at the top level of the JobSettings data structure. You will not need to make changes to your existing API clients to process single-task format jobs.
An example JSON document representing a single-task format job for API 2.0:
{
"job_id": 27,
"settings": {
"name": "Example notebook",
"existing_cluster_id": "1201-my-cluster",
"libraries": [
{
"jar": "dbfs:/FileStore/jars/spark_examples.jar"
}
],
"email_notifications": {},
"timeout_seconds": 0,
"schedule": {
"quartz_cron_expression": "0 0 0 * * ?",
"timezone_id": "US/Pacific",
"pause_status": "UNPAUSED"
},
"notebook_task": {
"notebook_path": "/notebooks/example-notebook",
"revision_timestamp": 0
},
"max_concurrent_runs": 1,
"format": "SINGLE_TASK"
},
"created_time": 1504128821443,
"creator_user_name": "user@databricks.com"
}
This section provides guidelines, examples, and required changes for API calls affected by the new multi-task format feature.
To create a single-task format job through the Create a new job operation (POST /jobs/create) in the Jobs API, you do not need to change existing clients.
To create a multi-task format job, use the tasks field in JobSettings to specify settings for each task. The following example creates a job with two notebook tasks. This example is for API 2.0 and 2.1:
Note
A maximum of 100 tasks can be specified per job.
{
"name": "Multi-task-job",
"max_concurrent_runs": 1,
"tasks": [
{
"task_key": "clean_data",
"description": "Clean and prepare the data",
"notebook_task": {
"notebook_path": "/Users/user@databricks.com/clean-data"
},
"existing_cluster_id": "1201-my-cluster",
"timeout_seconds": 3600,
"max_retries": 3,
"retry_on_timeout": true
},
{
"task_key": "analyze_data",
"description": "Perform an analysis of the data",
"notebook_task": {
"notebook_path": "/Users/user@databricks.com/analyze-data"
},
"depends_on": [
{
"task_key": "clean_data"
}
],
"existing_cluster_id": "1201-my-cluster",
"timeout_seconds": 3600,
"max_retries": 3,
"retry_on_timeout": true
}
]
}
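A client could assemble this request with the Python standard library. The following is a minimal sketch rather than an official client: the host and token arguments are placeholders you supply, the function name build_create_request is hypothetical, and the request is only constructed here, not sent.

```python
import json
import urllib.request

MAX_TASKS = 100  # per-job task limit stated in the note above


def build_create_request(host, token, settings):
    """Build (but do not send) a POST /jobs/create request for API 2.1."""
    tasks = settings.get("tasks", [])
    if len(tasks) > MAX_TASKS:
        raise ValueError(f"a job can specify at most {MAX_TASKS} tasks")
    keys = [t["task_key"] for t in tasks]
    if len(keys) != len(set(keys)):
        raise ValueError("task_key values must be unique within a job")
    return urllib.request.Request(
        url=f"{host}/api/2.1/jobs/create",
        data=json.dumps(settings).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would send it; that step is left out so the sketch can be run without a workspace.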
To submit a one-time run of a single-task format job with the Create and trigger a one-time run operation (POST /runs/submit) in the Jobs API, you do not need to change existing clients.
To submit a one-time run of a multi-task format job, use the tasks field in JobSettings to specify settings for each task, including clusters. Clusters must be set at the task level when submitting a multi-task format job because the runs submit request does not support shared job clusters. See Create for an example JobSettings specifying multiple tasks.
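Since a runs submit request needs a cluster on every task, a client could check for that before submitting. This is a minimal sketch; the helper name is hypothetical, and it assumes a task-level cluster is expressed through existing_cluster_id or new_cluster, as in the examples in this article.

```python
def tasks_without_task_level_cluster(tasks):
    """Return the task_key of every task unsuitable for a runs submit request."""
    problems = []
    for task in tasks:
        has_cluster = "existing_cluster_id" in task or "new_cluster" in task
        # Shared job clusters are not supported by runs submit, so a
        # job_cluster_key is flagged even if a cluster is also present.
        if "job_cluster_key" in task or not has_cluster:
            problems.append(task["task_key"])
    return problems
```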
To update a single-task format job with the Partially update a job operation (POST /jobs/update) in the Jobs API, you do not need to change existing clients.
To update the settings of a multi-task format job, you must use the unique task_key field to identify new task settings. See Create for an example JobSettings specifying multiple tasks.
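As an illustration, an update request body for a multi-task format job might be assembled as follows. This is a sketch under assumptions: build_update_payload is a hypothetical helper, and it only enforces that each task carries a task_key; how the service applies the new task settings to existing tasks is not modeled here.

```python
def build_update_payload(job_id, task_changes):
    """Build the JSON body for POST /jobs/update with task-level changes."""
    for task in task_changes:
        # task_key identifies which task the new settings belong to.
        if not task.get("task_key"):
            raise ValueError("each task in an update needs a task_key")
    return {"job_id": job_id, "new_settings": {"tasks": task_changes}}
```

For example, build_update_payload(53, [{"task_key": "clean_data", "timeout_seconds": 7200}]) yields a body that names the job and the task whose settings are being supplied.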
To overwrite the settings of a single-task format job with the Overwrite all settings for a job operation (POST /jobs/reset) in the Jobs API, you do not need to change existing clients.
To overwrite the settings of a multi-task format job, specify a JobSettings data structure with an array of TaskSettings data structures. See Create for an example JobSettings specifying multiple tasks.
Use Update to change individual fields without switching from single-task to multi-task format.
For single-task format jobs, no client changes are required to process the response from the List all jobs operation (GET /jobs/list) in the Jobs API.
For multi-task format jobs, most settings are defined at the task level and not the job level. Cluster configuration may be set at the task or job level. To modify clients to access cluster or task settings for a multi-task format job returned in the Job structure:
- Parse the job_id field for the multi-task format job.
- Pass the job_id to the Get a job operation (GET /jobs/get) in the Jobs API to retrieve job details. See Get for an example response from the Get API call for a multi-task format job.
The following example shows a response containing single-task and multi-task format jobs. This example is for API 2.0:
{
"jobs": [
{
"job_id": 36,
"settings": {
"name": "A job with a single task",
"existing_cluster_id": "1201-my-cluster",
"email_notifications": {},
"timeout_seconds": 0,
"notebook_task": {
"notebook_path": "/Users/user@databricks.com/example-notebook",
"revision_timestamp": 0
},
"max_concurrent_runs": 1,
"format": "SINGLE_TASK"
},
"created_time": 1505427148390,
"creator_user_name": "user@databricks.com"
},
{
"job_id": 53,
"settings": {
"name": "A job with multiple tasks",
"email_notifications": {},
"timeout_seconds": 0,
"max_concurrent_runs": 1,
"format": "MULTI_TASK"
},
"created_time": 1625841911296,
"creator_user_name": "user@databricks.com"
}
]
}
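The two-step lookup for multi-task format jobs in a list response can be sketched as follows. The get_job argument is a caller-supplied function (an assumption of this sketch) that wraps GET /jobs/get and returns the parsed JSON for one job_id; keeping it as a parameter lets the logic run without a workspace.

```python
def fetch_multi_task_details(list_response, get_job):
    """Map job_id to full job details for each multi-task format job."""
    details = {}
    for job in list_response.get("jobs", []):
        if job.get("settings", {}).get("format") == "MULTI_TASK":
            # The list entry carries no task settings, so fetch the full job.
            details[job["job_id"]] = get_job(job["job_id"])
    return details
```

Single-task format entries are skipped because their settings are already complete in the list response.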
For single-task format jobs, no client changes are required to process the response from the Get a job operation (GET /jobs/get) in the Jobs API.
Multi-task format jobs return an array of task data structures containing task settings. If you require access to task-level details, you need to modify your clients to iterate through the tasks array and extract required fields.
The following shows an example response from the Get API call for a multi-task format job. This example is for API 2.0 and 2.1:
{
"job_id": 53,
"settings": {
"name": "A job with multiple tasks",
"email_notifications": {},
"timeout_seconds": 0,
"max_concurrent_runs": 1,
"tasks": [
{
"task_key": "clean_data",
"description": "Clean and prepare the data",
"notebook_task": {
"notebook_path": "/Users/user@databricks.com/clean-data"
},
"existing_cluster_id": "1201-my-cluster",
"max_retries": 3,
"min_retry_interval_millis": 0,
"retry_on_timeout": true,
"timeout_seconds": 3600,
"email_notifications": {}
},
{
"task_key": "analyze_data",
"description": "Perform an analysis of the data",
"notebook_task": {
"notebook_path": "/Users/user@databricks.com/analyze-data"
},
"depends_on": [
{
"task_key": "clean_data"
}
],
"existing_cluster_id": "1201-my-cluster",
"max_retries": 3,
"min_retry_interval_millis": 0,
"retry_on_timeout": true,
"timeout_seconds": 3600,
"email_notifications": {}
}
],
"format": "MULTI_TASK"
},
"created_time": 1625841911296,
"creator_user_name": "user@databricks.com",
"run_as_user_name": "user@databricks.com"
}
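Iterating through the tasks array to extract fields might look like the following. This is a minimal sketch that pulls out only the dependency information; the function name dependency_map is hypothetical, and the same loop could collect any other task-level field.

```python
def dependency_map(job):
    """Map each task_key to the list of task_keys it depends on."""
    tasks = job.get("settings", {}).get("tasks", [])
    return {
        task["task_key"]: [dep["task_key"] for dep in task.get("depends_on", [])]
        for task in tasks
    }
```

Applied to a job with the two tasks clean_data and analyze_data, where the second depends on the first, this returns a mapping from clean_data to an empty list and from analyze_data to a list containing clean_data.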
For single-task format jobs, no client changes are required to process the response from the Get a job run operation (GET /jobs/runs/get) in the Jobs API.
The response for a multi-task format job run contains an array of TaskSettings. To retrieve run results for each task:
- Iterate through each of the tasks.
- Parse the run_id for each task.
- Call the Get the output for a run operation (GET /jobs/runs/get-output) with the run_id to get details on the run for each task.
The following is an example response from the Get a job run request:
{
"job_id": 53,
"run_id": 759600,
"number_in_job": 7,
"original_attempt_run_id": 759600,
"state": {
"life_cycle_state": "TERMINATED",
"result_state": "SUCCESS",
"state_message": ""
},
"cluster_spec": {},
"start_time": 1595943854860,
"setup_duration": 0,
"execution_duration": 0,
"cleanup_duration": 0,
"trigger": "ONE_TIME",
"creator_user_name": "user@databricks.com",
"run_name": "Query logs",
"run_type": "JOB_RUN",
"tasks": [
{
"run_id": 759601,
"task_key": "query-logs",
"description": "Query session logs",
"notebook_task": {
"notebook_path": "/Users/user@databricks.com/log-query"
},
"existing_cluster_id": "1201-my-cluster",
"state": {
"life_cycle_state": "TERMINATED",
"result_state": "SUCCESS",
"state_message": ""
}
},
{
"run_id": 759602,
"task_key": "validate_output",
"description": "Validate query output",
"depends_on": [
{
"task_key": "query-logs"
}
],
"notebook_task": {
"notebook_path": "/Users/user@databricks.com/validate-query-results"
},
"existing_cluster_id": "1201-my-cluster",
"state": {
"life_cycle_state": "TERMINATED",
"result_state": "SUCCESS",
"state_message": ""
}
}
],
"format": "MULTI_TASK"
}
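Before requesting per-task output, a client might first scan the tasks array of a Get a job run response for tasks that did not succeed. The sketch below assumes the state fields shown in the example response; failed_tasks is a hypothetical helper.

```python
def failed_tasks(run):
    """Return (task_key, run_id) pairs for finished tasks that did not succeed."""
    failed = []
    for task in run.get("tasks", []):
        state = task.get("state", {})
        finished = state.get("life_cycle_state") == "TERMINATED"
        if finished and state.get("result_state") != "SUCCESS":
            failed.append((task["task_key"], task["run_id"]))
    return failed
```

The returned run_id values are the child run IDs, which are the ones to pass to the Get the output for a run operation.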
For single-task format jobs, no client changes are required to process the response from the Get the output for a run operation (GET /jobs/runs/get-output) in the Jobs API.
For multi-task format jobs, calling Runs get output on a parent run results in an error since run output is available only for individual tasks. To get the output and metadata for a multi-task format job:
- Call the Get a job run operation (GET /jobs/runs/get) on the parent run.
- Iterate over the child run_id fields in the response.
- Use the child run_id values to call Runs get output.
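The procedure for collecting output from a multi-task format job run can be sketched as one function. It assumes the child run IDs are read from a Get a job run response for the parent run; get_run and get_output are caller-supplied functions (assumptions of this sketch) that wrap GET /jobs/runs/get and GET /jobs/runs/get-output and return parsed JSON.

```python
def collect_task_outputs(parent_run_id, get_run, get_output):
    """Map each task_key to the output of its child run."""
    parent = get_run(parent_run_id)
    outputs = {}
    for task in parent.get("tasks", []):
        # Output is requested per child run, never for the parent run.
        outputs[task["task_key"]] = get_output(task["run_id"])
    return outputs
```

Injecting the two calls keeps the traversal logic separate from HTTP and authentication details, which vary between clients.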
For single-task format jobs, no client changes are required to process the response from the List runs for a job operation (GET /jobs/runs/list).
For multi-task format jobs, an empty tasks array is returned. Pass the run_id to the Get a job run operation (GET /jobs/runs/get) to retrieve the tasks. The following shows an example response from the Runs list API call for a multi-task format job:
{
"runs": [
{
"job_id": 53,
"run_id": 759600,
"number_in_job": 7,
"original_attempt_run_id": 759600,
"state": {
"life_cycle_state": "TERMINATED",
"result_state": "SUCCESS",
"state_message": ""
},
"cluster_spec": {},
"start_time": 1595943854860,
"setup_duration": 0,
"execution_duration": 0,
"cleanup_duration": 0,
"trigger": "ONE_TIME",
"creator_user_name": "user@databricks.com",
"run_name": "Query logs",
"run_type": "JOB_RUN",
"tasks": [],
"format": "MULTI_TASK"
}
],
"has_more": false
}
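A client that needs the tasks for every listed run could fill in the empty arrays as sketched below. As before, get_run is a caller-supplied function (an assumption of this sketch) wrapping GET /jobs/runs/get, and the helper name runs_with_tasks is hypothetical.

```python
def runs_with_tasks(list_response, get_run):
    """Return the listed runs, re-fetching multi-task runs whose tasks are empty."""
    expanded = []
    for run in list_response.get("runs", []):
        if run.get("format") == "MULTI_TASK" and not run.get("tasks"):
            # The list response omits task details; fetch the full run instead.
            run = get_run(run["run_id"])
        expanded.append(run)
    return expanded
```

This issues one extra request per multi-task run, so a client listing many runs may prefer to fetch details only for the runs it actually inspects.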