Spark Submit (legacy)
The Spark Submit task type is a legacy pattern for configuring JARs as tasks. Databricks recommends using the JAR task. See JAR task for jobs.
Requirements
- You can run spark-submit tasks only on new clusters.
- You must upload your JAR file to a location or Maven repository compatible with your compute configuration. See Java and Scala library support.
- You cannot access JAR files stored in volumes.
- Spark-submit does not support cluster autoscaling. To learn more about autoscaling, see Cluster autoscaling.
- Spark-submit does not support Databricks Utilities (dbutils) references. To use Databricks Utilities, use JAR tasks instead.
- If you use a Unity Catalog-enabled cluster, spark-submit is supported only if the cluster uses the single user access mode. Shared access mode is not supported. See Access modes.
- Structured Streaming jobs should never have maximum concurrent runs set to greater than 1. Streaming jobs should be set to run using the cron expression `"* * * * * ?"` (every minute). Because a streaming task runs continuously, it should always be the final task in a job.
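If you define the job through the Jobs API rather than the UI, the same constraints apply. The following is a minimal sketch only, assuming the Jobs API 2.1 jobs/create payload shape; the job name, class name, JAR path, and cluster settings are hypothetical:

```json
{
  "name": "streaming-spark-submit-job",
  "max_concurrent_runs": 1,
  "schedule": {
    "quartz_cron_expression": "* * * * * ?",
    "timezone_id": "UTC",
    "pause_status": "UNPAUSED"
  },
  "tasks": [
    {
      "task_key": "streaming_task",
      "new_cluster": {
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 2
      },
      "spark_submit_task": {
        "parameters": [
          "--class", "org.example.StreamingMain",
          "dbfs:/FileStore/libraries/streaming_app.jar"
        ]
      }
    }
  ]
}
```

Here `max_concurrent_runs` is pinned to 1 and the quartz cron expression triggers every minute, matching the guidance above.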
Configure a Spark Submit task
Add a Spark Submit task from the Tasks tab in the Jobs UI by doing the following:
- In the Type drop-down menu, select Spark Submit.
- Use Compute to configure a cluster that supports the logic in your task.
- Use the Parameters text box to provide all arguments and configurations necessary to run your task as a JSON array of strings.
The first three arguments are used to identify the main class to run in a JAR at a specified path, as in the following example:
["--class", "org.apache.spark.mainClassName", "dbfs:/Filestore/libraries/jar_path.jar"]
- You cannot override the `master`, `deploy-mode`, and `executor-cores` settings configured by Azure Databricks.
- Use `--jars` and `--py-files` to add dependent Java, Scala, and Python libraries.
- Use `--conf` to set Spark configurations.
- The `--jars`, `--py-files`, and `--files` arguments support DBFS paths.
- By default, the Spark submit job uses all available memory, excluding memory reserved for Azure Databricks services. You can set `--driver-memory` and `--executor-memory` to smaller values to leave some room for off-heap usage. A combined example that uses these flags is sketched after these steps.
- Click Save task.
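For reference, a fuller Parameters array might combine the flags described above. This is a sketch only; the class name, DBFS paths, and values are hypothetical:

```json
[
  "--class", "org.example.MainClass",
  "--jars", "dbfs:/FileStore/libraries/dependency.jar",
  "--conf", "spark.sql.shuffle.partitions=200",
  "--driver-memory", "4g",
  "--executor-memory", "8g",
  "dbfs:/FileStore/libraries/app.jar"
]
```

Note that spark-submit options precede the application JAR path, which is the last element; any values placed after the JAR path are passed as arguments to the main class.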