Create, run, and manage Azure Databricks Jobs
This article details how to create, edit, run, and monitor Azure Databricks Jobs using the Jobs UI. To learn about using the Databricks CLI to create and run jobs, see Jobs CLI. To learn about using the Jobs API, see Jobs API 2.1.
To create your first workflow with an Azure Databricks job, see the quickstart.
Important
- You can create jobs only in a Data Science & Engineering workspace or a Machine Learning workspace.
- A workspace is limited to 1000 concurrent task runs. A `429 Too Many Requests` response is returned when you request a run that cannot start immediately.
- The number of jobs a workspace can create in an hour is limited to 10000 (includes "runs submit"). This limit also affects jobs created by the REST API and notebook workflows.
Create a job
Do one of the following:
- Click Workflows in the sidebar and click Create Job.
- In the sidebar, click New and select Job.
The Tasks tab appears with the create task dialog.
Replace Add a name for your job… with your job name.
Enter a name for the task in the Task name field.
In the Type dropdown menu, select the type of task to run. See Task type options.
Configure the cluster where the task runs. In the Cluster dropdown menu, select either New job cluster or Existing All-Purpose Clusters.
- New Job Cluster: Click Edit in the Cluster dropdown menu and complete the cluster configuration.
- Existing All-Purpose Cluster: Select an existing cluster in the Cluster dropdown menu. To open the cluster in a new page, click the icon to the right of the cluster name and description.
To learn more about selecting and configuring clusters to run tasks, see Cluster configuration tips.
To add dependent libraries, click + Add next to Dependent libraries. See Dependent libraries.
You can pass parameters for your task. Each task type has different requirements for formatting and passing the parameters.
- Notebook: Click Add and specify the key and value of each parameter to pass to the task. You can override or add additional parameters when you manually run a task using the Run a job with different parameters option. Parameters set the value of the notebook widget specified by the key of the parameter. Use task parameter variables to pass a limited set of dynamic values as part of a parameter value.
- JAR: Use a JSON-formatted array of strings to specify parameters. These strings are passed as arguments to the main method of the main class. See Configure JAR job parameters.
- Spark Submit task: Parameters are specified as a JSON-formatted array of strings. Conforming to the Apache Spark spark-submit convention, parameters after the JAR path are passed to the main method of the main class.
- Python script: Use a JSON-formatted array of strings to specify parameters. These strings are passed as arguments that can be parsed using the argparse module in Python. (See the example following this list.)
- Python Wheel: In the Parameters dropdown menu, select Positional arguments to enter parameters as a JSON-formatted array of strings, or select Keyword arguments > Add to enter the key and value of each parameter. Both positional and keyword arguments are passed to the Python wheel task as command-line arguments.
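For example, a Python script task might be configured with the parameters `["--input", "/data/raw", "--output", "/data/clean"]`. The following minimal sketch shows how such a script could parse them with argparse; the script name, paths, and argument names are hypothetical:

```python
# example_task.py -- hypothetical Python script task.
# The task's JSON-formatted parameter array is delivered to the script as
# ordinary command-line arguments, so argparse can parse it directly.
import argparse


def main() -> None:
    parser = argparse.ArgumentParser(description="Example Python script task")
    parser.add_argument("--input", required=True, help="Input path")
    parser.add_argument("--output", required=True, help="Output path")
    args = parser.parse_args()
    print(f"Reading from {args.input} and writing to {args.output}")


if __name__ == "__main__":
    main()
```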
To optionally receive notifications for task start, success, or failure, click + Add next to Emails. Failure notifications are sent on initial task failure and any subsequent retries.
To optionally configure a retry policy for the task, click + Add next to Retries. See Retries.
To optionally configure a timeout for the task, click + Add next to Timeout in seconds. See Timeout.
Click Create.
After creating the first task, you can configure job-level settings such as notifications, job triggers, and permissions. See Edit a job.
To add another task, click the add task button in the DAG view. A shared cluster option is provided if you have configured a New Job Cluster for a previous task. You can also configure a cluster for each task when you create or edit a task. To learn more about selecting and configuring clusters to run tasks, see Cluster configuration tips.
Task type options
The following are the task types you can add to your Azure Databricks job and available options for the different task types:
Notebook: In the Source dropdown menu, select a location for the notebook: either Workspace for a notebook located in an Azure Databricks workspace folder or Git provider for a notebook located in a remote Git repository.
Workspace: Use the file browser to find the notebook, click the notebook name, and click Confirm.
Git provider: Click Edit and enter the Git repository information. See Use version controlled notebooks in an Azure Databricks job.
JAR: Specify the Main class. Use the fully qualified name of the class containing the main method, for example, `org.apache.spark.examples.SparkPi`. Then click Add under Dependent Libraries to add libraries required to run the task. One of these libraries must contain the main class. To learn more about JAR tasks, see JAR jobs.
Spark Submit: In the Parameters text box, specify the main class, the path to the library JAR, and all arguments, formatted as a JSON array of strings. The following example configures a spark-submit task to run the `DFSReadWriteTest` from the Apache Spark examples:

["--class","org.apache.spark.examples.DFSReadWriteTest","dbfs:/FileStore/libraries/spark_examples_2_12_3_1_1.jar","/dbfs/databricks-datasets/README.md","/FileStore/examples/output/"]
Important
There are several limitations for spark-submit tasks:
- You can run spark-submit tasks only on new clusters.
- Spark-submit does not support cluster autoscaling. To learn more about autoscaling, see Cluster autoscaling.
- Spark-submit does not support Databricks Utilities. To use Databricks Utilities, use JAR tasks instead.
- If you are using a Unity Catalog-enabled cluster, spark-submit is supported only if the cluster uses Single User access mode. Shared access mode is not supported.
Python script: In the Source drop-down, select a location for the Python script: either Workspace for a script in the local workspace, or DBFS / S3 for a script located on DBFS or cloud storage. In the Path textbox, enter the path to the Python script:

Workspace: In the Select Python File dialog, browse to the Python script and click Confirm. Your script must be in a Databricks repo.

DBFS: Enter the URI of a Python script on DBFS or cloud storage; for example, `dbfs:/FileStore/myscript.py`.

Delta Live Tables Pipeline: In the Pipeline dropdown menu, select an existing Delta Live Tables pipeline.
Important
You can use only triggered pipelines with the Pipeline task. Continuous pipelines are not supported as a job task. To learn more about triggered and continuous pipelines, see Continuous vs. triggered pipeline execution.
Python Wheel: In the Package name text box, enter the package to import, for example, `myWheel-1.0-py2.py3-none-any.whl`. In the Entry Point text box, enter the function to call when starting the wheel. Click Add under Dependent Libraries to add libraries required to run the task.

SQL: In the SQL task dropdown menu, select Query, Dashboard, or Alert.
Note
- The SQL task is in Public Preview.
- The SQL task requires Databricks SQL and a serverless or pro SQL warehouse.
Query: In the SQL query dropdown menu, select the query to execute when the task runs. In the SQL warehouse dropdown menu, select a serverless or pro SQL warehouse to run the task.
Dashboard: In the SQL dashboard dropdown menu, select a dashboard to be updated when the task runs. In the SQL warehouse dropdown menu, select a serverless or pro SQL warehouse to run the task.
Alert: In the SQL alert dropdown menu, select an alert to trigger for evaluation. In the SQL warehouse dropdown menu, select a serverless or pro SQL warehouse to run the task.
dbt: See Use dbt transformations in an Azure Databricks job for a detailed example of how to configure a dbt task.
Run a job
- Click Workflows in the sidebar.
- Select a job and click the Runs tab. You can run a job immediately or schedule the job to run later.
If one or more tasks in a job with multiple tasks are not successful, you can re-run the subset of unsuccessful tasks. See Re-run failed and skipped tasks.
Run a job immediately
To run the job immediately, click Run Now.
Tip
You can perform a test run of a job with a notebook task by clicking Run Now. If you need to make changes to the notebook, clicking Run Now again after editing the notebook will automatically run the new version of the notebook.
Run a job with different parameters
You can use Run Now with Different Parameters to re-run a job with different parameters or different values for existing parameters.
- Click the down arrow next to Run Now and select Run Now with Different Parameters or, in the Active Runs table, click Run Now with Different Parameters. Enter the new parameters depending on the type of task.
- Notebook: You can enter parameters as key-value pairs or a JSON object. The provided parameters are merged with the default parameters for the triggered run. You can use this dialog to set the values of widgets.
- JAR and spark-submit: You can enter a list of parameters or a JSON document. If you delete keys, the default parameters are used. You can also add task parameter variables for the run.
- Click Run.
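You can also trigger a run with overridden parameters programmatically. The following is a minimal sketch, assuming the Jobs API 2.1 run-now operation; the workspace URL, personal access token, job ID, and parameter names are placeholders:

```python
# Minimal sketch: trigger a run with overridden notebook parameters through
# the Jobs API (POST /api/2.1/jobs/run-now). All values below are placeholders.
import requests

response = requests.post(
    "https://<your-workspace>.azuredatabricks.net/api/2.1/jobs/run-now",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json={
        "job_id": 123,
        # Merged with the task's default parameters; values set notebook widgets.
        "notebook_params": {"run_date": "2023-01-15", "environment": "staging"},
    },
)
response.raise_for_status()
print(response.json()["run_id"])
```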
View task run history
To view the run history of a task, including successful and unsuccessful runs:
- Click on a task on the Job run details page. The Task run details page appears.
- Select the task run in the run history dropdown menu.
Schedule a job
To define a schedule for the job:
Click Add trigger in the Job details panel and select Scheduled in Trigger type.
Specify the period, starting time, and time zone. Optionally select the Show Cron Syntax checkbox to display and edit the schedule in Quartz Cron Syntax.
Note
- Azure Databricks enforces a minimum interval of 10 seconds between subsequent runs triggered by the schedule of a job regardless of the seconds configuration in the cron expression.
- You can choose a time zone that observes daylight saving time or UTC. If you select a zone that observes daylight saving time, an hourly job will be skipped or may appear to not fire for an hour or two when daylight saving time begins or ends. To run at every hour (absolute time), choose UTC.
- The job scheduler is not intended for low latency jobs. Due to network or cloud issues, job runs may occasionally be delayed up to several minutes. In these situations, scheduled jobs will run immediately upon service availability.
Click Save.
You can also schedule a notebook job directly in the notebook UI.
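A schedule can also be set programmatically. The following is a minimal sketch, assuming the Jobs API update operation; the Quartz cron expression runs the job daily at 02:30 UTC, and the workspace URL, token, and job ID are placeholders:

```python
# Minimal sketch: add a schedule to an existing job via the Jobs API
# (POST /api/2.1/jobs/update). All identifiers below are placeholders.
import requests

response = requests.post(
    "https://<your-workspace>.azuredatabricks.net/api/2.1/jobs/update",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json={
        "job_id": 123,
        "new_settings": {
            "schedule": {
                # Quartz cron fields: second, minute, hour, day of month, month, day of week.
                "quartz_cron_expression": "0 30 2 * * ?",
                "timezone_id": "UTC",
                "pause_status": "UNPAUSED",
            }
        },
    },
)
response.raise_for_status()
```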
Pause and resume a job schedule
To pause a job schedule, click Pause in the Job details panel.
To resume a paused job schedule, click Resume.
Run a continuous job
You can ensure there is always an active run of a job with the Continuous
trigger type. When you run your job with the continuous trigger, Azure Databricks Jobs ensures there is always one active run of the job. A new run of the job starts after the previous run completes successfully or with a failed status, or if there is no instance of the job currently running.
Note
- To prevent unnecessary resource usage and reduce cost, Azure Databricks automatically pauses a continuous job if there are more than five consecutive failures within a 24 hour period.
- There can be only one running instance of a continuous job.
- There is a small delay between a run finishing and a new run starting. This delay should be less than 60 seconds.
- You cannot use retry policies or task dependencies with a continuous job.
- Selecting Run now on a continuous job that is paused triggers a new job run. If the job is unpaused, an exception is thrown.
- To have your continuous job pick up a new job configuration, cancel the existing run. A new run will automatically start. You can also click Restart run to restart the job run with the updated configuration.
To run a job continuously, click Add trigger in the Job details panel, select Continuous in Trigger type, and click Save.
To stop a continuous job, click the down arrow next to Run Now and click Stop.
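The continuous trigger can also be configured in the job settings. The following is a minimal sketch, assuming the Jobs API supports a continuous field in job settings; the workspace URL, token, and job ID are placeholders:

```python
# Minimal sketch: set the continuous trigger on an existing job via the Jobs
# API (POST /api/2.1/jobs/update). All identifiers below are placeholders.
import requests

response = requests.post(
    "https://<your-workspace>.azuredatabricks.net/api/2.1/jobs/update",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json={
        "job_id": 123,
        # Set pause_status to "PAUSED" to pause the continuous job.
        "new_settings": {"continuous": {"pause_status": "UNPAUSED"}},
    },
)
response.raise_for_status()
```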
Run a job when new files arrive
To trigger a job run when new files arrive in an external location, use a file arrival trigger.
View jobs
Click Workflows in the sidebar. The Jobs list appears. The Jobs page lists all defined jobs, the cluster definition, the schedule, if any, and the result of the last run.
Note
If you have the increased jobs limit enabled for this workspace, only 25 jobs are displayed in the Jobs list to improve the page loading time. Use the left and right arrows to page through the full list of jobs.
You can filter jobs in the Jobs list:
- Using keywords. If you have the increased jobs limit feature enabled for this workspace, searching by keywords is supported only for the name, job ID, and job tag fields.
- Selecting only the jobs you own.
- Selecting all jobs you have permissions to access. Access to this filter requires that Jobs access control is enabled.
- Using tags. To search for a tag created with only a key, type the key into the search box. To search for a tag created with a key and value, you can search by the key, the value, or both the key and value. For example, for a tag with the key `department` and the value `finance`, you can search for `department` or `finance` to find matching jobs. To search by both the key and value, enter the key and value separated by a colon; for example, `department:finance`.
You can also click any column header to sort the list of jobs (either descending or ascending) by that column. When the increased jobs limit feature is enabled, you can sort only by `Name`, `Job ID`, or `Created by`. The default sorting is by `Name` in ascending order.
View runs for a job
You can view a list of currently running and recently completed runs for all jobs you have access to, including runs started by external orchestration tools such as Apache Airflow or Azure Data Factory. To view the list of recent job runs:
- Click Workflows in the sidebar.
- In the Name column, click a job name. The Runs tab appears with matrix and list views of active runs and completed runs.
The matrix view shows a history of runs for the job, including each job task.
The Run total duration row of the matrix displays the total duration of the run and the state of the run. To view details of the run, including the start time, duration, and status, hover over the bar in the Run total duration row.
Each cell in the Tasks row represents a task and the corresponding status of the task. To view details of each task, including the start time, duration, cluster, and status, hover over the cell for that task.
The job run and task run bars are color-coded to indicate the status of the run. Successful runs are green, unsuccessful runs are red, and skipped runs are pink. The height of the individual job run and task run bars provides a visual indication of the run duration.
The runs list view displays:
- The start time for the run.
- The run identifier.
- Whether the run was triggered by a job schedule or an API request, or was manually started.
- The time elapsed for a currently running job, or the total running time for a completed run.
- Links to the Spark logs.
- The status of the run, either `Pending`, `Running`, `Skipped`, `Succeeded`, `Failed`, `Terminating`, `Terminated`, `Internal Error`, `Timed Out`, `Canceled`, `Canceling`, or `Waiting for Retry`.
To change the columns displayed in the runs list view, click Columns and select or deselect columns.
To view details for a job run, click the link for the run in the Start time column in the runs list view. To view details for the most recent successful run of this job, click Go to the latest successful run.
Azure Databricks maintains a history of your job runs for up to 60 days. If you need to preserve job runs, Databricks recommends that you export results before they expire. For more information, see Export job run results.
View job run details
The job run details page contains job output and links to logs, including information about the success or failure of each task in the job run. You can access job run details from the Runs tab for the job. To view job run details from the Runs tab, click the link for the run in the Start time column in the runs list view. To return to the Runs tab for the job, click the Job ID value.
If the job contains multiple tasks, click a task to view task run details, including:
- the cluster that ran the task
- the Spark UI for the task
- logs for the task
- metrics for the task
Click the Job ID value to return to the Runs tab for the job.
View recent job runs
You can view a list of currently running and recently completed runs for all jobs in a workspace that you have access to, including runs started by external orchestration tools such as Apache Airflow or Azure Data Factory. To view the list of recent job runs:
- Click Workflows in the sidebar. The Jobs list appears.
- Click the Job runs tab to display the Job runs list.
The Job runs list displays:
- The start time for the run.
- The name of the job associated with the run.
- The user name that the job runs as.
- Whether the run was triggered by a job schedule or an API request, or was manually started.
- The time elapsed for a currently running job, or the total running time for a completed run.
- The status of the run, either `Pending`, `Running`, `Skipped`, `Succeeded`, `Failed`, `Terminating`, `Terminated`, `Internal Error`, `Timed Out`, `Canceled`, `Canceling`, or `Waiting for Retry`.
- Any parameters for the run.
To view job run details, click the link in the Start time column for the run. To view job details, click the job name in the Job column.
View lineage information for a job
If Unity Catalog is enabled in your workspace, you can view lineage information for any Unity Catalog tables in your workflow. If lineage information is available for your workflow, you will see a link with a count of upstream and downstream tables in the Job details panel for your job, the Job run details panel for a job run, or the Task run details panel for a task run. Click the link to show the list of tables. Click a table to see detailed information in Data Explorer.
Export job run results
You can export notebook run results and job run logs for all job types.
Export notebook run results
You can persist job runs by exporting their results. For notebook job runs, you can export a rendered notebook that can later be imported into your Azure Databricks workspace.
To export notebook run results for a job with a single task:
- On the job detail page, click the View Details link for the run in the Run column of the Completed Runs (past 60 days) table.
- Click Export to HTML.
To export notebook run results for a job with multiple tasks:
- On the job detail page, click the View Details link for the run in the Run column of the Completed Runs (past 60 days) table.
- Click the notebook task to export.
- Click Export to HTML.
Export job run logs
You can also export the logs for your job run. You can set up your job to automatically deliver logs to DBFS through the Jobs API. See the `new_cluster.cluster_log_conf` object in the request body passed to the Create a new job operation (`POST /jobs/create`) in the Jobs API.
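As a minimal sketch, a create request that delivers cluster logs to a DBFS location might look like the following; the notebook path, cluster settings, and log destination are placeholders:

```python
# Minimal sketch: create a job whose cluster delivers logs to DBFS via
# new_cluster.cluster_log_conf. All values below are placeholders.
import requests

response = requests.post(
    "https://<your-workspace>.azuredatabricks.net/api/2.1/jobs/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json={
        "name": "job-with-log-delivery",
        "tasks": [
            {
                "task_key": "main_task",
                "notebook_task": {"notebook_path": "/Users/someone@example.com/my_notebook"},
                "new_cluster": {
                    "spark_version": "11.3.x-scala2.12",
                    "node_type_id": "Standard_DS3_v2",
                    "num_workers": 2,
                    # Driver and executor logs are delivered to this location.
                    "cluster_log_conf": {"dbfs": {"destination": "dbfs:/cluster-logs"}},
                },
            }
        ],
    },
)
response.raise_for_status()
```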
Edit a job
Some configuration options are available on the job, and other options are available on individual tasks. For example, the maximum concurrent runs can be set on the job only, while parameters must be defined for each task.
To change the configuration for a job:
- Click Workflows in the sidebar.
- In the Name column, click the job name.
The side panel displays the Job details. You can change the trigger for the job, cluster configuration, notifications, maximum number of concurrent runs, and add or change tags. If job access control is enabled, you can also edit job permissions.
Tags
To add labels or key:value attributes to your job, you can add tags when you edit the job. You can use tags to filter jobs in the Jobs list; for example, you can use a `department` tag to filter all jobs that belong to a specific department.
Note
Because job tags are not designed to store sensitive information such as personally identifiable information or passwords, Databricks recommends using tags for non-sensitive values only.
Tags also propagate to job clusters created when a job is run, allowing you to use tags with your existing cluster monitoring.
To add or edit tags, click + Tag in the Job details side panel. You can add the tag as a key and value, or a label. To add a label, enter the label in the Key field and leave the Value field empty.
Clusters
To see tasks associated with a cluster, hover over the cluster in the side panel. To change the cluster configuration for all associated tasks, click Configure under the cluster. To configure a new cluster for all associated tasks, click Swap under the cluster.
Control access to jobs
Job access control enables job owners and administrators to grant fine-grained permissions on their jobs. Job owners can choose which other users or groups can view the results of the job. Owners can also choose who can manage their job runs (Run now and Cancel run permissions).
See Jobs access control for details.
Maximum concurrent runs
The maximum number of parallel runs for this job. Azure Databricks skips the run if the job has already reached its maximum number of active runs when attempting to start a new run. Set this value higher than the default of 1 to perform multiple runs of the same job concurrently. This is useful, for example, if you trigger your job on a frequent schedule and want to allow consecutive runs to overlap with each other, or you want to trigger multiple runs that differ by their input parameters.
Edit a task
To set task configuration options:
- Click Workflows in the sidebar.
- In the Name column, click the job name.
- Click the Tasks tab.
Task dependencies
You can define the order of execution of tasks in a job using the Depends on dropdown menu. You can set this field to one or more tasks in the job.
Note
Depends on is not visible if the job consists of only a single task.
Configuring task dependencies creates a Directed Acyclic Graph (DAG) of task execution, a common way of representing execution order in job schedulers. For example, consider the following job consisting of four tasks:
- Task 1 is the root task and does not depend on any other task.
- Task 2 and Task 3 depend on Task 1 completing first.
- Finally, Task 4 depends on Task 2 and Task 3 completing successfully.
Azure Databricks runs upstream tasks before running downstream tasks, running as many of them in parallel as possible. The following diagram illustrates the order of processing for these tasks:
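Expressed in a job specification, the same dependencies map to depends_on entries on each task. The following is a minimal sketch of the tasks array for this example; the task keys and notebook paths are hypothetical:

```python
# Minimal sketch: the four-task DAG above as the tasks array of a job
# specification. Task keys and notebook paths are hypothetical.
tasks = [
    {"task_key": "task_1",
     "notebook_task": {"notebook_path": "/Jobs/task_1"}},
    {"task_key": "task_2",
     "depends_on": [{"task_key": "task_1"}],
     "notebook_task": {"notebook_path": "/Jobs/task_2"}},
    {"task_key": "task_3",
     "depends_on": [{"task_key": "task_1"}],
     "notebook_task": {"notebook_path": "/Jobs/task_3"}},
    {"task_key": "task_4",
     "depends_on": [{"task_key": "task_2"}, {"task_key": "task_3"}],
     "notebook_task": {"notebook_path": "/Jobs/task_4"}},
]
```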
Individual task configuration options
Individual tasks have the following configuration options:
Cluster
To configure the cluster where a task runs, click the Cluster dropdown menu. You can edit a shared job cluster, but you cannot delete a shared cluster if it is still used by other tasks.
To learn more about selecting and configuring clusters to run tasks, see Cluster configuration tips.
Dependent libraries
Dependent libraries will be installed on the cluster before the task runs. You must set all task dependencies to ensure they are installed before the run starts. Follow the recommendations in Library dependencies for specifying dependencies.
Task parameter variables
You can pass templated variables into a job task as part of the task’s parameters. These variables are replaced with the appropriate values when the job task runs. You can use task parameter values to pass the context about a job run, such as the run ID or the job’s start time.
When a job runs, the task parameter variable surrounded by double curly braces is replaced and appended to an optional string value included as part of the value. For example, to pass a parameter named `MyJobID` with a value of `my-job-6` for any run of job ID 6, add the following task parameter:
{
"MyJobID": "my-job-{{job_id}}"
}
The contents of the double curly braces are not evaluated as expressions, so you cannot do operations or functions within double curly braces. Whitespace is not stripped inside the curly braces, so `{{ job_id }}` will not be evaluated.
The following task parameter variables are supported:
| Variable | Description | Example value |
|---|---|---|
| `{{job_id}}` | The unique identifier assigned to a job | `1276862` |
| `{{run_id}}` | The unique identifier assigned to a task run | `3447843` |
| `{{start_date}}` | The date a task run started. The format is yyyy-MM-dd in UTC timezone. | `2021-02-15` |
| `{{start_time}}` | The timestamp of the run’s start of execution after the cluster is created and ready. The format is milliseconds since UNIX epoch in UTC timezone, as returned by `System.currentTimeMillis()`. | `1551622063030` |
| `{{task_retry_count}}` | The number of retries that have been attempted to run a task if the first attempt fails. The value is 0 for the first attempt and increments with each retry. | `0` |
| `{{parent_run_id}}` | The unique identifier assigned to the run of a job with multiple tasks. | `3447835` |
| `{{task_key}}` | The unique name assigned to a task that’s part of a job with multiple tasks. | `"clean_raw_data"` |
You can set these variables with any task when you Create a job, Edit a job, or Run a job with different parameters.
You can also pass parameters between tasks in a job with task values. See Share information between tasks in an Azure Databricks job.
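For a notebook task, a task parameter such as `MyJobID` above arrives as a widget value, and task values offer a complementary channel between tasks. The following is a minimal sketch run inside notebook tasks; the parameter name, task key, and value key are hypothetical:

```python
# Minimal sketch, run inside notebook tasks in the same job run.
# "MyJobID", "record_count", and "clean_raw_data" are hypothetical names.

# Read a task parameter; for notebook tasks, parameters arrive as widget values.
my_job_id = dbutils.widgets.get("MyJobID")  # e.g. "my-job-1276862"

# In an upstream task: publish a value for downstream tasks.
dbutils.jobs.taskValues.set(key="record_count", value=42)

# In a downstream task: read the value by naming the upstream task key.
count = dbutils.jobs.taskValues.get(taskKey="clean_raw_data", key="record_count", default=0)
```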
Timeout
The maximum completion time for a job or task. If the job or task does not complete in this time, Azure Databricks sets its status to “Timed Out”.
Retries
A policy that determines when and how many times failed runs are retried. To set the retries for the task, click Advanced options and select Edit Retry Policy. The retry interval is calculated in milliseconds between the start of the failed run and the subsequent retry run.
Note
If you configure both Timeout and Retries, the timeout applies to each retry.
Clone a job
You can quickly create a new job by cloning an existing job. Cloning a job creates an identical copy of the job, except for the job ID. On the job’s page, click More … next to the job’s name and select Clone from the dropdown menu.
Clone a task
You can quickly create a new task by cloning an existing task:
- On the job’s page, click the Tasks tab.
- Select the task to clone.
- Click the vertical ellipsis and select Clone task.
Delete a job
To delete a job, on the job’s page, click More … next to the job’s name and select Delete from the dropdown menu.
Delete a task
To delete a task:
- Click the Tasks tab.
- Select the task to be deleted.
- Click the vertical ellipsis and select Remove task.
Copy a task path
To copy the path to a task, for example, a notebook path:
- Click the Tasks tab.
- Select the task containing the path to copy.
- Click the copy icon next to the task path to copy the path to the clipboard.
Best practices
Cluster configuration tips
Cluster configuration is important when you operationalize a job. The following provides general guidance on choosing and configuring job clusters, followed by recommendations for specific job types.
Use shared job clusters
To optimize resource usage with jobs that orchestrate multiple tasks, use shared job clusters. A shared job cluster allows multiple tasks in the same job run to reuse the cluster. You can use a single job cluster to run all tasks that are part of the job, or multiple job clusters optimized for specific workloads. To use a shared job cluster:
- Select New Job Clusters when you create a task and complete the cluster configuration.
- Select the new cluster when adding a task to the job, or create a new job cluster. Any cluster you configure when you select New Job Clusters is available to any task in the job.
A shared job cluster is scoped to a single job run, and cannot be used by other jobs or runs of the same job.
Libraries cannot be declared in a shared job cluster configuration. You must add dependent libraries in task settings.
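In a job specification, a shared job cluster is declared once in the job_clusters array and referenced from each task by job_cluster_key. The following is a minimal sketch; the cluster settings, task keys, and notebook paths are hypothetical:

```python
# Minimal sketch: two tasks sharing one job cluster. All values are placeholders.
job_settings = {
    "name": "shared-cluster-example",
    "job_clusters": [
        {
            "job_cluster_key": "shared_cluster",
            "new_cluster": {
                "spark_version": "11.3.x-scala2.12",
                "node_type_id": "Standard_DS3_v2",
                "num_workers": 4,
            },
        }
    ],
    "tasks": [
        {"task_key": "ingest",
         "job_cluster_key": "shared_cluster",
         "notebook_task": {"notebook_path": "/Jobs/ingest"}},
        {"task_key": "transform",
         "job_cluster_key": "shared_cluster",
         "depends_on": [{"task_key": "ingest"}],
         "notebook_task": {"notebook_path": "/Jobs/transform"}},
    ],
}
```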
Choose the correct cluster type for your job
- New Job Clusters are dedicated clusters for a job or task run. A shared job cluster is created and started when the first task using the cluster starts and terminates after the last task using the cluster completes. The cluster is not terminated when idle but terminates only after all tasks using it have completed. If a shared job cluster fails or is terminated before all tasks have finished, a new cluster is created. A cluster scoped to a single task is created and started when the task starts and terminates when the task completes. In production, Databricks recommends using new shared or task scoped clusters so that each job or task runs in a fully isolated environment.
- When you run a task on a new cluster, the task is treated as a data engineering (task) workload, subject to the task workload pricing. When you run a task on an existing all-purpose cluster, the task is treated as a data analytics (all-purpose) workload, subject to all-purpose workload pricing.
- If you select a terminated existing cluster and the job owner has Can Restart permission, Azure Databricks starts the cluster when the job is scheduled to run.
- Existing all-purpose clusters work best for tasks such as updating dashboards at regular intervals.
Use a pool to reduce cluster start times
To decrease new job cluster start time, create a pool and configure the job’s cluster to use the pool.
Notebook job tips
Total notebook cell output (the combined output of all notebook cells) is subject to a 20MB size limit. Additionally, individual cell output is subject to an 8MB size limit. If total cell output exceeds 20MB in size, or if the output of an individual cell is larger than 8MB, the run is canceled and marked as failed.
If you need help finding cells near or beyond the limit, run the notebook against an all-purpose cluster and use this notebook autosave technique.
Streaming tasks
Spark Streaming jobs should never have maximum concurrent runs set to greater than 1. Streaming jobs should be set to run using the cron expression `"* * * * * ?"` (every minute).
Since a streaming task runs continuously, it should always be the final task in a job.
JAR jobs
To learn more about packaging your code in a JAR and creating a job that uses the JAR, see Use a JAR in an Azure Databricks job.
When running a JAR job, keep in mind the following:
Output size limits
Job output, such as log output emitted to stdout, is subject to a 20MB size limit. If the total output has a larger size, the run is canceled and marked as failed.
To avoid encountering this limit, you can prevent stdout from being returned from the driver to Azure Databricks by setting the `spark.databricks.driver.disableScalaOutput` Spark configuration to `true`. By default, the flag value is `false`. The flag controls cell output for Scala JAR jobs and Scala notebooks. If the flag is enabled, Spark does not return job execution results to the client. The flag does not affect the data that is written in the cluster’s log files. Setting this flag is recommended only for job clusters for JAR jobs because it will disable notebook results.
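One way to apply this flag for a JAR job is through the job cluster's Spark configuration. The following is a minimal sketch of a new_cluster fragment; the Spark version and node type are placeholders:

```python
# Minimal sketch: a job cluster specification that stops Scala driver output
# from being returned to the client. Other settings are placeholders.
new_cluster = {
    "spark_version": "11.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 2,
    "spark_conf": {"spark.databricks.driver.disableScalaOutput": "true"},
}
```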
Use the shared SparkContext
Because Azure Databricks is a managed service, some code changes may be necessary to ensure that your Apache Spark jobs run correctly. JAR job programs must use the shared `SparkContext` API to get the `SparkContext`. Because Azure Databricks initializes the `SparkContext`, programs that invoke `new SparkContext()` will fail. To get the `SparkContext`, use only the shared `SparkContext` created by Azure Databricks:
val goodSparkContext = SparkContext.getOrCreate()
val goodSparkSession = SparkSession.builder().getOrCreate()
There are also several methods you should avoid when using the shared `SparkContext`.
- Do not call `SparkContext.stop()`.
- Do not call `System.exit(0)` or `sc.stop()` at the end of your `Main` program. This can cause undefined behavior.
Use `try-finally` blocks for job clean up
Consider a JAR that consists of two parts:
- `jobBody()`, which contains the main part of the job.
- `jobCleanup()`, which has to be executed after `jobBody()` whether that function succeeded or returned an exception.

As an example, `jobBody()` may create tables, and you can use `jobCleanup()` to drop these tables.
The safe way to ensure that the clean up method is called is to put a `try-finally` block in the code:
try {
jobBody()
} finally {
jobCleanup()
}
You should not try to clean up using `sys.addShutdownHook(jobCleanup)` or the following code:
val cleanupThread = new Thread { override def run = jobCleanup() }
Runtime.getRuntime.addShutdownHook(cleanupThread)
Due to the way the lifetime of Spark containers is managed in Azure Databricks, the shutdown hooks are not run reliably.
Configure JAR job parameters
You pass parameters to JAR jobs with a JSON string array. See the `spark_jar_task` object in the request body passed to the Create a new job operation (`POST /jobs/create`) in the Jobs API. To access these parameters, inspect the `String` array passed into your `main` function.
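When triggering the job programmatically, the same parameters can be supplied as a JSON array of strings. The following is a minimal sketch using the run-now operation; the workspace URL, token, job ID, and argument values are placeholders:

```python
# Minimal sketch: pass JAR job parameters at run time via the Jobs API
# (POST /api/2.1/jobs/run-now). The parameters arrive in the String array
# passed to the JAR's main function. All values below are placeholders.
import requests

response = requests.post(
    "https://<your-workspace>.azuredatabricks.net/api/2.1/jobs/run-now",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json={"job_id": 123, "jar_params": ["--date", "2023-01-15", "--mode", "full"]},
)
response.raise_for_status()
```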
Library dependencies
The Spark driver has certain library dependencies that cannot be overridden. These libraries take priority over any of your libraries that conflict with them.
To get the full list of the driver library dependencies, run the following command inside a notebook attached to a cluster of the same Spark version (or the cluster with the driver you want to examine).
%sh
ls /databricks/jars
Manage library dependencies
A good rule of thumb when dealing with library dependencies while creating JARs for jobs is to list Spark and Hadoop as `provided` dependencies. On Maven, add Spark and Hadoop as provided dependencies, as shown in the following example:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.3.0</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-core</artifactId>
<version>1.2.1</version>
<scope>provided</scope>
</dependency>
In `sbt`, add Spark and Hadoop as provided dependencies, as shown in the following example:
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.0" % "provided"
libraryDependencies += "org.apache.hadoop" % "hadoop-core" % "1.2.1" % "provided"
Tip
Specify the correct Scala version for your dependencies based on the version you are running.