On-demand state repartitioning for stateful streaming queries

On-demand state repartitioning allows you to resize the number of partitions for a stateful Structured Streaming query without losing checkpoint state.

Without on-demand state repartitioning, the number of shuffle partitions is fixed when the checkpoint is created. If you later change spark.sql.shuffle.partitions, queries with existing checkpoints ignore the new value. To apply a new partition count, you must restart the query with a new checkpoint, which discards the existing state.

On-demand state repartitioning has the following benefits:

Tune queries by resizing the number of partitions without rebuilding the checkpoint.
Scale queries up or down to match workload changes.

Requirements

Databricks Runtime 18 and above.
The query must use the RocksDB state store provider. On Databricks Runtime 17.3 and above, RocksDB is the default state store provider. See Configure RocksDB state store on Azure Databricks.

Change the number of partitions

To change the number of shuffle and streaming state partitions, stop the query, set the Spark configuration spark.sql.streaming.stateStore.partitions, and restart the query:

Python

query.stop()
spark.conf.set("spark.sql.streaming.stateStore.partitions", "<numPartitions>")
query = df.writeStream.start()

Scala

// Stop the running query, set the new partition count, and restart
query.stop()
spark.conf.set("spark.sql.streaming.stateStore.partitions", "<numPartitions>")
val restartedQuery = df.writeStream.start()

For stateful queries, spark.sql.streaming.stateStore.partitions takes precedence over spark.sql.shuffle.partitions. After the query restarts and the last planned microbatch completes, the query runs a repartition operation to redistribute state data into the new number of partitions. After the repartition operation completes, the query resumes processing.
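The precedence rule can be modeled in a few lines of plain Python. This is a minimal sketch of the behavior described above, not the resolution logic Databricks runs internally. The function name effective_state_partitions and its checkpoint_partitions argument are hypothetical, and the configuration values are made up for illustration.

```python
from typing import Mapping

STATE_PARTITIONS_KEY = "spark.sql.streaming.stateStore.partitions"


def effective_state_partitions(
    session_conf: Mapping[str, str], checkpoint_partitions: int
) -> int:
    """Model which partition count a restarted stateful query ends up with.

    Hypothetical helper: `checkpoint_partitions` stands for the count that
    was fixed when the checkpoint was created.
    """
    requested = session_conf.get(STATE_PARTITIONS_KEY)
    if requested is None:
        # No repartition requested: the checkpoint's count stays in effect,
        # even if spark.sql.shuffle.partitions has changed since.
        return checkpoint_partitions
    # A requested state partition count takes precedence for stateful queries.
    return int(requested)


# Changing only spark.sql.shuffle.partitions has no effect on an existing checkpoint.
only_shuffle = effective_state_partitions(
    {"spark.sql.shuffle.partitions": "400"}, checkpoint_partitions=200
)

# Setting spark.sql.streaming.stateStore.partitions triggers a resize to 100.
with_state_conf = effective_state_partitions(
    {"spark.sql.shuffle.partitions": "400", STATE_PARTITIONS_KEY: "100"},
    checkpoint_partitions=200,
)

print(only_shuffle, with_state_conf)
```

In this model, the first call keeps the checkpoint's 200 partitions and the second call returns 100.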

Monitor repartition state

After the next microbatch completes, StreamingQueryProgress events include the duration of the repartition operation. In an event's durationMs metrics, controlBatch.REPARTITION reports the duration in milliseconds. Larger state sizes might increase the time to repartition. See Monitoring Structured Streaming queries on Azure Databricks.
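As a rough illustration, the following sketch reads the repartition duration from a progress event that has already been converted to a dictionary. It assumes that controlBatch.REPARTITION appears as a flat key inside durationMs. The helper name repartition_duration_ms is hypothetical, and the sample events contain made-up values rather than output from a real query.

```python
import json
from typing import Any, Mapping, Optional

# Assumption: the metric is a flat key inside the event's durationMs map.
REPARTITION_KEY = "controlBatch.REPARTITION"


def repartition_duration_ms(progress: Mapping[str, Any]) -> Optional[int]:
    """Return the repartition duration in milliseconds, or None if absent."""
    durations = progress.get("durationMs") or {}
    value = durations.get(REPARTITION_KEY)
    return int(value) if value is not None else None


# Illustrative events only; every number below is invented for the example.
after_repartition = json.loads(
    '{"batchId": 42, "durationMs": '
    '{"triggerExecution": 1500, "controlBatch.REPARTITION": 8200}}'
)
regular_batch = {"batchId": 43, "durationMs": {"triggerExecution": 900}}

print(repartition_duration_ms(after_repartition))
print(repartition_duration_ms(regular_batch))
```

Returning None for events without the metric lets a caller distinguish batches that did not follow a repartition from a repartition that took zero milliseconds.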

Structured Streaming example

The following example scales a query down from the default of 200 shuffle partitions to 100. Stop the query, set the new partition count, and restart the query with the same checkpoint location:

Python

from pyspark.sql.functions import window

# Start the query with the default partition count (200)
query = (df
  .withWatermark("event_time", "10 minutes")
  .groupBy(
    window("event_time", "5 minutes"),
    "id")
  .count()
  .writeStream
  .format("delta")
  .option("checkpointLocation", "/checkpoint/path")
  .outputMode("append")
  .start()
)

# Stop the query and scale down to 100 partitions
query.stop()

spark.conf.set("spark.sql.streaming.stateStore.partitions", "100")

# Restart the query with the same options
query = (df
  .withWatermark("event_time", "10 minutes")
  .groupBy(
    window("event_time", "5 minutes"),
    "id")
  .count()
  .writeStream
  .format("delta")
  .option("checkpointLocation", "/checkpoint/path")
  .outputMode("append")
  .start()
)

Scala

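import org.apache.spark.sql.functions.window
import spark.implicits._  // enables the $"column" syntax

// Start the query with the default partition count (200)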
val query = df
  .withWatermark("event_time", "10 minutes")
  .groupBy(
    window($"event_time", "5 minutes"),
    $"id")
  .count()
  .writeStream
  .format("delta")
  .option("checkpointLocation", "/checkpoint/path")
  .outputMode("append")
  .start()

// Stop the query and scale down to 100 partitions
query.stop()

spark.conf.set("spark.sql.streaming.stateStore.partitions", "100")

// Restart the query with the same options
val query2 = df
  .withWatermark("event_time", "10 minutes")
  .groupBy(
    window($"event_time", "5 minutes"),
    $"id")
  .count()
  .writeStream
  .format("delta")
  .option("checkpointLocation", "/checkpoint/path")
  .outputMode("append")
  .start()
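Because the query definition is repeated verbatim on restart, the stop, set, and restart sequence from the examples above could be wrapped in a small helper. The following is a sketch under stated assumptions, not a Databricks API. The name restart_with_partitions and the start_query callback are hypothetical, and the dry run uses stand-in objects instead of a real SparkSession and StreamingQuery so that the call order is visible without a cluster.

```python
from types import SimpleNamespace
from typing import Any, Callable

STATE_PARTITIONS_KEY = "spark.sql.streaming.stateStore.partitions"


def restart_with_partitions(
    spark: Any,
    query: Any,
    start_query: Callable[[], Any],
    num_partitions: int,
) -> Any:
    """Stop `query`, set the new state partition count, and start it again.

    `start_query` is a caller-supplied function that builds and starts the
    query with the same options and checkpoint location as before.
    """
    # Validate first so that a bad value does not leave the query stopped.
    if num_partitions <= 0:
        raise ValueError("num_partitions must be a positive integer")
    query.stop()
    spark.conf.set(STATE_PARTITIONS_KEY, str(num_partitions))
    return start_query()


# Dry run with stand-ins (not real Spark objects) that record each call.
calls = []
fake_spark = SimpleNamespace(
    conf=SimpleNamespace(set=lambda key, value: calls.append(("set", key, value)))
)
fake_query = SimpleNamespace(stop=lambda: calls.append(("stop",)))

restarted = restart_with_partitions(
    fake_spark, fake_query, lambda: "restarted query", 100
)
print(calls)
```

With a real query, start_query would be a function that contains the full writeStream definition shown above, so the same options and checkpoint path are reused on every restart.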

Lakeflow pipelines example

In Lakeflow pipelines, set spark.sql.streaming.stateStore.partitions using the spark_conf parameter on the @dp.table or @dp.append_flow decorator.

Set partitions on a flow:

from pyspark import pipelines as dp
from pyspark.sql import functions as F

source_path = "/databricks-datasets/iot-stream/data-device/"

dp.create_streaming_table("target_table")

@dp.append_flow(
  target="target_table",
  name="my_flow_1",
  spark_conf={"spark.sql.streaming.stateStore.partitions": "100"}
)
def my_flow_1():
  return (spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .load(source_path)
    .withColumn("timestamp", F.to_timestamp("timestamp"))
    .withWatermark("timestamp", "10 minutes")
    .groupBy(F.window("timestamp", "5 minutes"), "id")
    .count())

Set partitions at the table level for the default flow:

from pyspark import pipelines as dp
from pyspark.sql import functions as F

source_path = "/databricks-datasets/iot-stream/data-device/"

@dp.table(
  name="table_1",
  spark_conf={"spark.sql.streaming.stateStore.partitions": "100"}
)
def table_1():
  return (spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .load(source_path)
    .withColumn("timestamp", F.to_timestamp("timestamp"))
    .withWatermark("timestamp", "10 minutes")
    .groupBy(F.window("timestamp", "5 minutes"), "id")
    .count())

Last updated on 2026-07-10
