Migrate from classic compute to serverless compute

Migrate your workloads from classic compute to serverless compute. Serverless compute handles provisioning, scaling, runtime upgrades, and optimization automatically.

Most classic workloads can migrate with minimal or no code changes, and this page focuses on those workloads. Some features, such as df.cache(), are not yet supported on serverless; when they become available, they will work without code changes. Workloads that depend on R or Scala notebooks require classic compute and cannot migrate to serverless. For a full list of current limitations, see Serverless compute limitations.

Migration steps

To migrate your workloads from classic compute to serverless compute, follow these steps:

  1. Check prerequisites: Verify that your workspace, networking, and cloud storage access meet the requirements. See Before you begin.
  2. Update code: Make any necessary code and configuration changes. See Update your code.
  3. Test your workloads: Validate compatibility and correctness before cutting over. See Test your workloads.
  4. Choose a performance mode: Select the performance mode that best matches your workload requirements. See Choose a performance mode.
  5. Migrate in phases: Roll out serverless incrementally, starting with new and low-risk workloads. See Migrate in phases.
  6. Monitor costs: Track serverless DBU consumption and set up alerts. See Monitor costs.

Before you begin

Before you begin migrating, you might need to update some legacy configurations in your workspace.

| Prerequisite | Action | Details |
| --- | --- | --- |
| Workspace is enabled for Unity Catalog | Migrate from Hive Metastore if needed | Upgrade an Azure Databricks workspace to Unity Catalog |
| Networking configured | Replace VPC peering with NCCs, Private Link, or firewall rules | Serverless compute plane networking |
| Cloud storage access | Replace legacy data access patterns with Unity Catalog external locations | Connect to cloud object storage using Unity Catalog |

Confirm your workspace is in a supported region.

Update your code

The following sections list the code and configuration changes required to make your workloads compatible with serverless.

Data access

Legacy data access patterns are not supported on serverless. Update your code to use Unity Catalog instead.

| Classic pattern | Serverless replacement | Details |
| --- | --- | --- |
| DBFS paths (dbfs:/...) | Unity Catalog volumes | What are Unity Catalog volumes? |
| Hive Metastore tables | Unity Catalog tables (or HMS Federation) | Upgrade an Azure Databricks workspace to Unity Catalog |
| Storage account credentials | Unity Catalog external locations | Connect to cloud object storage using Unity Catalog |
| Custom JDBC JARs | Lakehouse Federation | What is query federation? |

Warning

DBFS access is limited on serverless. Update all dbfs:/ paths to Unity Catalog volumes before migrating. For more information, see Migrate files stored in DBFS.

Example: Replace DBFS paths and Hive Metastore references
# Classic
df = spark.read.csv("dbfs:/mnt/datalake/data.csv", header=True)
df.write.parquet("dbfs:/mnt/output/results")
df = spark.table("my_database.my_table")

# Serverless
df = spark.read.csv("/Volumes/main/sales/raw_data/data.csv", header=True)
df.write.parquet("/Volumes/main/analytics/output/results")
df = spark.table("main.my_database.my_table")  # three-level namespace

APIs and code

Certain APIs and code patterns are not supported on serverless. Refer to the following table to determine whether your code needs to be updated.

| Classic pattern | Serverless replacement | Details |
| --- | --- | --- |
| RDD APIs (sc.parallelize, rdd.map) | DataFrame APIs | Compare Spark Connect to Spark Classic |
| df.cache(), df.persist() | Remove caching calls | Serverless compute limitations |
| spark.sparkContext, sqlContext | Use spark (SparkSession) directly | Compare Spark Connect to Spark Classic |
| Hive variables (${var}) | SQL DECLARE VARIABLE or Python f-strings | DECLARE VARIABLE |
| Unsupported Spark configs | Remove unsupported configs; serverless auto-tunes most settings | Configure Spark properties for serverless notebooks and jobs |

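For example, a query parameterized with Hive-style ${var} substitution can be rewritten with a Python f-string. The table name, variable name, and value below are illustrative; the spark.sql call is commented out because it needs an active session in a notebook or job:

```python
# Classic (Hive variable substitution, not supported on serverless):
#   spark.sql("SELECT * FROM main.db.sales WHERE region = '${region}'")

# Serverless: build the query with a Python f-string instead.
region = "EMEA"  # illustrative value
query = f"SELECT * FROM main.db.sales WHERE region = '{region}'"
# df = spark.sql(query)  # run with an active SparkSession

# The SQL-native alternative uses session variables:
#   DECLARE VARIABLE region STRING DEFAULT 'EMEA';
#   SELECT * FROM main.db.sales WHERE region = session.region;
```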
Example: Replace RDD operations with DataFrames
from pyspark.sql import functions as F

# sc.parallelize + rdd.map
# Classic:  rdd = sc.parallelize([1, 2, 3]); rdd.map(lambda x: x * 2).collect()
df = spark.createDataFrame([(1,), (2,), (3,)], ["value"])
result = df.select((F.col("value") * 2).alias("value")).collect()

# rdd.flatMap
# Classic:  sc.parallelize(["hello world"]).flatMap(lambda l: l.split(" ")).collect()
df = spark.createDataFrame([("hello world",)], ["line"])
words = df.select(F.explode(F.split("line", " ")).alias("word")).collect()

# rdd.groupByKey
# Classic:  rdd.groupByKey().mapValues(list).collect()
df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])
grouped = df.groupBy("key").agg(F.collect_list("value").alias("values")).collect()

# rdd.mapPartitions → applyInPandas
import pandas as pd
def process_group(pdf: pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame({"total": [pdf["id"].sum()]})
result = (spark.range(100).repartition(4)
    .groupBy(F.spark_partition_id())
    .applyInPandas(process_group, schema="total long").collect())

# sc.textFile → spark.read.text
df = spark.read.text("/Volumes/catalog/schema/volume/file.txt")
Example: Replace SparkContext and caching
from pyspark.sql.functions import broadcast

# sc.broadcast → broadcast join
result = main_df.join(broadcast(lookup_df), "key")

# sc.accumulator → DataFrame aggregation
total = df.agg(F.sum("amount")).collect()[0][0]

# sqlContext.sql → spark.sql
result = spark.sql("SELECT * FROM main.db.table")

# df.cache() → remove caching calls
# Materialize expensive intermediate results to Delta as a workaround:
df = spark.read.parquet(path)
expensive_df = df.filter("status = 'active'")
expensive_df.write.format("delta").mode("overwrite").saveAsTable("main.scratch.temp")
result = spark.table("main.scratch.temp")

Libraries and environments

You can manage libraries and environments at the workspace level using base environments and at the notebook level using the notebook's serverless environment.

| Classic pattern | Serverless replacement | Details |
| --- | --- | --- |
| Init scripts | Serverless environments | Configure the serverless environment |
| Cluster-scoped libraries | Notebook-scoped or environment libraries | Configure the serverless environment |
| Maven/JAR libraries | JAR task support for jobs; PyPI for notebooks | JAR task for jobs |
| Docker containers | Serverless environments for library needs | Configure the serverless environment |

Pin Python packages in requirements.txt for reproducible environments. See Best practices for serverless compute.
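For instance, a pinned requirements.txt for a serverless environment might look like the following (the packages and versions are illustrative; pin the versions your workloads actually use):

```
pandas==2.1.4
numpy==1.26.4
requests==2.32.3
```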

Streaming

Streaming workloads are supported on serverless, but certain triggers are not supported. Update your code to use the supported triggers.

| Spark trigger | Supported | Notes |
| --- | --- | --- |
| Trigger.AvailableNow() | Yes | Recommended |
| Trigger.Once() | Yes | Deprecated; use Trigger.AvailableNow() instead |
| Trigger.ProcessingTime(interval) | No | Returns INFINITE_STREAMING_TRIGGER_NOT_SUPPORTED |
| Trigger.Continuous(interval) | No | Use Lakeflow Spark Declarative Pipelines continuous mode instead |
| Default (not setting .trigger()) | No | Omitting .trigger() defaults to ProcessingTime("0 seconds"), which is not supported on serverless. Always set .trigger(availableNow=True) explicitly. |

For continuous streaming, migrate to Spark Declarative Pipelines in continuous mode or use continuous-schedule jobs with AvailableNow. For large sources, set maxFilesPerTrigger or maxBytesPerTrigger to prevent out-of-memory errors.

Example: Fix streaming triggers
# Classic (not supported on serverless — default trigger is ProcessingTime)
query = df.writeStream.format("delta").outputMode("append").start()

# Serverless (explicit AvailableNow trigger)
query = (df.writeStream.format("delta").outputMode("append")
    .trigger(availableNow=True)
    .option("checkpointLocation", checkpoint_path)
    .start(output_path))
query.awaitTermination()

# With OOM prevention for large sources
query = (spark.readStream.format("delta")
    .option("maxFilesPerTrigger", 100)
    .option("maxBytesPerTrigger", "10g")
    .load(input_path)
    .writeStream.format("delta")
    .trigger(availableNow=True)
    .option("checkpointLocation", checkpoint_path)
    .start(output_path))

Test your workloads

  1. Quick compatibility test: Run the workload on classic compute with Standard access mode and Databricks Runtime 14.3 or above. If the run succeeds, the workload can migrate to serverless without any code changes.
  2. A/B comparison (recommended for production): Run the same workload on classic (control) and serverless (experiment). Diff output tables and verify correctness. Iterate until outputs match.
  3. Temporary configs: You can temporarily set supported Spark configs during testing. Remove them once stable.
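As a sketch of the A/B comparison step, the following helper diffs two output tables as multisets using exceptAll, so duplicate rows count. The table names in the usage comment are placeholders; call it with the SparkSession from your notebook:

```python
def diff_outputs(spark, control_table: str, experiment_table: str):
    """Compare classic (control) and serverless (experiment) job outputs.

    Returns two DataFrames: rows only in the control table and rows only
    in the experiment table. Both are empty when the outputs match.
    """
    control = spark.table(control_table)
    experiment = spark.table(experiment_table)
    return control.exceptAll(experiment), experiment.exceptAll(control)

# Usage (in a notebook, with placeholder table names):
# missing, extra = diff_outputs(spark, "main.validation.classic_out",
#                               "main.validation.serverless_out")
# assert missing.isEmpty() and extra.isEmpty(), "Outputs differ"
```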

Choose a performance mode

Serverless jobs and pipelines support two performance modes: standard and performance-optimized. The performance mode you choose depends on your workload requirements.

| Mode | Availability | Startup | Best for |
| --- | --- | --- | --- |
| Standard | Jobs, Lakeflow Spark Declarative Pipelines | 4-6 minutes | Cost-sensitive batch |
| Performance-optimized | Notebooks, Jobs, Lakeflow Spark Declarative Pipelines | Seconds | Interactive, latency-sensitive |
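If you configure jobs through the Jobs API or Databricks Asset Bundles, the mode is controlled by the job-level performance_target setting. A minimal sketch of a job payload follows; the job name, task key, and notebook path are placeholders, and the exact field support may vary by API version:

```json
{
  "name": "nightly-batch",
  "performance_target": "STANDARD",
  "tasks": [
    {
      "task_key": "etl",
      "notebook_task": { "notebook_path": "/Workspace/etl/nightly" }
    }
  ]
}
```

Set performance_target to PERFORMANCE_OPTIMIZED for latency-sensitive jobs.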

Migrate in phases

  1. New workloads: Start all new notebooks and jobs on serverless.
  2. Low-risk workloads: Migrate PySpark/SQL workloads already on standard access mode and Databricks Runtime 14.3 or above.
  3. Complex workloads: Migrate workloads needing code changes (RDD rewrites, DBFS updates, trigger fixes).
  4. Remaining workloads: Review periodically as capabilities expand.

Monitor costs

Serverless billing is based on DBU consumption, not cluster uptime. Validate cost expectations with representative workloads before migrating at scale.
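As a starting point for tracking, serverless DBU consumption can be queried from the billing system table. This is a sketch: the LIKE filter on sku_name is an assumption, so adjust it to the SKU names that appear in your account, and run the commented spark.sql call in a workspace with system tables enabled:

```python
# Hypothetical cost-tracking query against the system.billing.usage table.
usage_query = """
SELECT usage_date,
       sku_name,
       SUM(usage_quantity) AS dbus
FROM system.billing.usage
WHERE sku_name LIKE '%SERVERLESS%'
GROUP BY usage_date, sku_name
ORDER BY usage_date DESC
"""
# df = spark.sql(usage_query)  # run with an active SparkSession
```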

Additional resources

You can also refer to the following blog posts for more information: