Use scheduler pools for multiple streaming workloads
To enable multiple streaming queries to execute jobs concurrently on a shared cluster, you can configure queries to execute in separate scheduler pools.
Note
You cannot specify thread pools when you work with Unity Catalog objects.
How do scheduler pools work?
By default, all queries started in a notebook run in the same fair scheduling pool. Jobs generated by triggers from all of the streaming queries in a notebook run one after another in first in, first out (FIFO) order. This can cause unnecessary delays in the queries, because they are not efficiently sharing the cluster resources.
Scheduler pools allow you to declare which Structured Streaming queries share compute resources.
The following example assigns query1
to a dedicated pool, while query2
and query3
share a scheduler pool.
# Run streaming query1 in scheduler pool1
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool1")
df.writeStream.queryName("query1").format("delta").start(path1)
# Run streaming query2 in scheduler pool2
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool2")
df.writeStream.queryName("query2").format("delta").start(path2)
# Run streaming query3 in scheduler pool2
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool2")
df.writeStream.queryName("query3").format("delta").start(path3)
Note
The local property configuration must be in the same notebook cell where you start your streaming query.
See Apache fair scheduler documentation for more details.
Feedback
https://aka.ms/ContentUserFeedback.
Coming soon: Throughout 2024 we will be phasing out GitHub Issues as the feedback mechanism for content and replacing it with a new feedback system. For more information see:Submit and view feedback for