Set Spark configuration properties on Azure Databricks

You can set Spark configuration properties (Spark confs) to customize settings in your compute environment.

Databricks generally recommends against configuring most Spark properties. Especially when migrating from open-source Apache Spark or upgrading Databricks Runtime versions, legacy Spark configurations can override new default behaviors that optimize workloads.

For many behaviors controlled by Spark properties, Azure Databricks also provides options to either enable the behavior at a table level or to configure custom behavior as part of a write operation. For example, schema evolution was previously controlled by a Spark property, but is now supported directly in SQL, Python, and Scala. See Schema evolution syntax for merge.

Configure Spark properties for notebooks and jobs

You can set Spark properties for notebooks and jobs. The scope of the configuration depends on how you set it.

| Properties configured | Applies to |
| --- | --- |
| Using compute configuration | All notebooks and jobs run with the compute resource. |
| Within a notebook | Only the SparkSession for the current notebook. |

For instructions on configuring Spark properties at the compute level, see Spark configuration.
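At the compute level, Spark properties are supplied as key-value string pairs in the cluster specification. The following is an illustrative sketch of the `spark_conf` field as it might appear in a Clusters API request payload built in Python; the cluster name, runtime version, and node type are placeholder values, not recommendations:

```python
# Illustrative cluster specification fragment. Only the spark_conf field is
# the point here; the other field values are example placeholders.
cluster_spec = {
    "cluster_name": "example-cluster",      # hypothetical name
    "spark_version": "15.4.x-scala2.12",    # example runtime version
    "node_type_id": "Standard_DS3_v2",      # example Azure node type
    "num_workers": 2,
    "spark_conf": {
        # Keys and values are plain strings.
        "spark.sql.ansi.enabled": "true",
        "spark.sql.session.timeZone": "Etc/UTC",
    },
}

# Each property applies to every notebook and job run on this compute resource.
for key, value in cluster_spec["spark_conf"].items():
    print(f"{key} = {value}")
```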

To set a Spark property in a notebook, use the following syntax:

SQL

SET spark.sql.ansi.enabled = true

Python

spark.conf.set("spark.sql.ansi.enabled", "true")

Scala

spark.conf.set("spark.sql.ansi.enabled", "true")

Configure Spark properties in Databricks SQL

Databricks SQL allows admins to configure Spark properties for data access in the workspace settings menu. See Data access configurations.

Other than data access configurations, Databricks SQL only allows a handful of Spark confs, which have been aliased to shorter names for simplicity. See Configuration parameters.

For most supported SQL configurations, you can override the global behavior in your current session. The following example turns off ANSI mode:

SET ANSI_MODE = false

Configure Spark properties for Lakeflow Spark Declarative Pipelines

Lakeflow Spark Declarative Pipelines lets you configure Spark properties for an entire pipeline, for a single compute resource configured for a pipeline, or for individual flows, materialized views, or streaming tables.

You can set pipeline and compute Spark properties using the UI or JSON. See Configure Pipelines.

Use the spark_conf option in Lakeflow Spark Declarative Pipelines decorator functions to configure Spark properties for flows, views, or tables. See Lakeflow Spark Declarative Pipelines Python language reference.
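As a sketch, the `spark_conf` option is passed as a decorator argument in the pipeline's Python source. This fragment only runs inside a pipeline, and the table name, source table, and property value below are illustrative:

```python
import dlt  # available only inside a pipeline's Python source

# Hypothetical streaming table; spark_conf applies to the flows that update
# this table, not to the rest of the pipeline.
@dlt.table(
    name="example_events",  # illustrative table name
    spark_conf={"spark.sql.shuffle.partitions": "200"},
)
def example_events():
    return spark.readStream.table("raw_events")  # illustrative source table
```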

Configure Spark properties for serverless notebooks and jobs

Serverless compute does not support setting most Spark properties for notebooks or jobs. The following are the properties you can configure:

| Property | Default | Description |
| --- | --- | --- |
| `spark.databricks.execution.timeout` | `9000` | The execution timeout, in seconds, for Spark Connect queries. The default value applies only to notebook queries. For jobs running on serverless compute (and jobs running on classic standard compute), there is no timeout unless this property is set. |
| `spark.sql.legacy.timeParserPolicy` | `CORRECTED` | The time parser policy. |
| `spark.sql.session.timeZone` | `Etc/UTC` | The ID of the session-local time zone, as either a region-based zone ID or a zone offset. |
| `spark.sql.shuffle.partitions` | `auto` | The default number of partitions to use when shuffling data for joins or aggregations. |
| `spark.sql.ansi.enabled` | `true` | When `true`, Spark SQL uses an ANSI-compliant dialect instead of being Hive compliant. |
| `spark.sql.files.maxPartitionBytes` | `134217728` (128 MB) | The maximum number of bytes to pack into a single partition when reading files. |
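Because serverless compute rejects most other properties, it can help to guard conf writes behind the allowed list. The following is a minimal pure-Python sketch; the helper name and the hard-coded set simply mirror the table above and are not part of any Databricks API:

```python
# Properties settable on serverless compute, per the table above.
SERVERLESS_ALLOWED_CONFS = {
    "spark.databricks.execution.timeout",
    "spark.sql.legacy.timeParserPolicy",
    "spark.sql.session.timeZone",
    "spark.sql.shuffle.partitions",
    "spark.sql.ansi.enabled",
    "spark.sql.files.maxPartitionBytes",
}

def set_conf_if_supported(conf, key, value):
    """Set a Spark property only if serverless compute supports it.

    `conf` is anything with a .set(key, value) method, such as spark.conf.
    This helper is illustrative, not a Databricks API.
    """
    if key not in SERVERLESS_ALLOWED_CONFS:
        raise ValueError(f"{key} cannot be set on serverless compute")
    conf.set(key, value)
```

In a notebook you would call `set_conf_if_supported(spark.conf, "spark.sql.ansi.enabled", "false")`; an unsupported key fails fast with a clear message instead of an opaque server-side error.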

Unsupported Spark properties

The following Spark configuration properties are not supported in Azure Databricks. Unsupported Spark properties are either ignored by Azure Databricks or may cause conflicts and failures when used simultaneously with Azure Databricks features. If you are migrating workloads to Azure Databricks, replace unsupported properties with the recommended alternatives.

| Unsupported Spark properties | Applies to | Databricks alternative |
| --- | --- | --- |
| `spark.dynamicAllocation.enabled`<br>`spark.dynamicAllocation.initialExecutors`<br>`spark.dynamicAllocation.minExecutors`<br>`spark.dynamicAllocation.maxExecutors`<br>`spark.dynamicAllocation.executorIdleTimeout` | Classic compute | Configure Azure Databricks autoscaling instead, which manages executor lifecycle at the platform level. See Enable autoscaling. |
| `spark.master`<br>`spark.driver.host`<br>`spark.driver.port` | Serverless compute and Lakeflow Spark Declarative Pipelines | The Azure Databricks serverless infrastructure manages these internal connection properties automatically. They cannot be set by users. Setting them on serverless compute or Lakeflow Spark Declarative Pipelines pipelines results in an error. |
| `spark.jars` | Serverless compute and Lakeflow Spark Declarative Pipelines | Azure Databricks does not support attaching JARs to serverless compute or Lakeflow Spark Declarative Pipelines pipelines using Spark configurations, but you can run serverless JAR tasks. See Configure environment for job tasks. |
| `spark.databricks.runtimeoptions.*` | Classic compute | Use the `runtime_options` attribute in the cluster configuration instead. Runtime options cannot be set as Spark configuration on any cluster type. Attempting to set these using Spark configurations results in an error. |

Get the current setting for a Spark configuration

Use the following syntax to review the current setting of a Spark configuration:

spark.conf.get("configuration_name")