Configure Structured Streaming trigger intervals

Apache Spark Structured Streaming processes data incrementally. By controlling the trigger interval for batch processing, you can use Structured Streaming for workloads that range from near real-time processing, to refreshing databases every 5 minutes or once per hour, to batch processing all new data for a day or week.

Because Databricks Auto Loader uses Structured Streaming to load data, understanding how triggers work provides you with the greatest flexibility to control costs while ingesting data with the desired frequency.

Specifying time-based trigger intervals

Structured Streaming refers to time-based trigger intervals as “fixed interval micro-batches”. Using the processingTime keyword, specify a time duration as a string, such as .trigger(processingTime='10 seconds').
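As a minimal sketch of a fixed interval micro-batch query, the following Python function applies a processingTime trigger to a streaming write. The function name start_fixed_interval_query, the built-in rate test source, and the console sink are illustrative assumptions and not part of the original guidance. The spark argument is assumed to be an existing SparkSession, such as the one predefined in a Databricks notebook.

```python
def start_fixed_interval_query(spark, interval="10 seconds"):
    """Start a streaming query that triggers one micro-batch per `interval`.

    `spark` is assumed to be an existing SparkSession. The source and sink
    below are placeholders chosen only to keep the sketch small.
    """
    # The built-in "rate" source generates rows continuously, for testing.
    events = (
        spark.readStream
        .format("rate")
        .option("rowsPerSecond", 5)
        .load()
    )

    # processingTime takes a duration string such as "10 seconds" or "5 minutes".
    return (
        events.writeStream
        .format("console")
        .trigger(processingTime=interval)
        .start()
    )
```

Calling start_fixed_interval_query(spark, "1 minute") returns the running StreamingQuery, which you can stop later with its stop() method.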

When you specify a trigger interval that is too small (less than tens of seconds), the system may perform unnecessary checks to see whether new data has arrived. Configure your processing time to balance latency requirements against the rate at which data arrives in the source.

Configuring incremental batch processing

Important

In Databricks Runtime 11.3 LTS and above, the Trigger.Once setting is deprecated. Databricks recommends you use Trigger.AvailableNow for all incremental batch processing workloads.

The AvailableNow trigger option consumes all available records as an incremental batch, and lets you configure batch size with options such as maxBytesPerTrigger (sizing options vary by data source).
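The following Python sketch shows one way an incremental batch with a size limit might look, assuming a Delta Lake table as both source and sink. The function name run_incremental_batch, the path parameters, and the default value "1g" are hypothetical, and the exact sizing option and accepted value formats depend on your data source.

```python
def run_incremental_batch(spark, source_path, target_path, checkpoint_path,
                          max_bytes="1g"):
    """Process all currently available records, then stop.

    `spark` is assumed to be an existing SparkSession. All paths and the
    `max_bytes` default are placeholders for illustration.
    """
    # maxBytesPerTrigger limits roughly how much data each micro-batch reads,
    # so the backlog is worked through in several smaller batches.
    source = (
        spark.readStream
        .format("delta")
        .option("maxBytesPerTrigger", max_bytes)
        .load(source_path)
    )

    query = (
        source.writeStream
        .format("delta")
        .option("checkpointLocation", checkpoint_path)
        .trigger(availableNow=True)
        .start(target_path)
    )

    # With availableNow the query ends on its own after the backlog is done.
    query.awaitTermination()
    return query
```

Because the query terminates once the available records are processed, a function like this fits a scheduled job that runs at whatever frequency your freshness requirements call for.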

Azure Databricks supports using Trigger.AvailableNow for incremental batch processing from many Structured Streaming sources. The following table includes the minimum supported Databricks Runtime version required for each data source:

Source	Minimum Databricks Runtime version
File sources (JSON, Parquet, etc.)	9.1 LTS
Delta Lake	10.4 LTS
Auto Loader	10.4 LTS
Apache Kafka	10.4 LTS
Kinesis	13.1

What is the default trigger interval?

Structured Streaming defaults to fixed interval micro-batches of 500 ms. Databricks recommends that you always specify a tailored trigger to minimize the costs associated with checking whether new data has arrived and with processing undersized batches.

Changing trigger intervals between runs

You can change the trigger interval between runs while using the same checkpoint.

If a Structured Streaming job stops while a micro-batch is being processed, that micro-batch must complete before the new trigger interval applies. As such, you might observe a micro-batch processing with the previously specified settings after changing the trigger interval.

When moving from a time-based interval to AvailableNow, a micro-batch might be processed with the previous settings before all available records are processed as an incremental batch.

When moving from AvailableNow to a time-based interval, the query might continue processing all records that were available when the last AvailableNow job triggered. This is expected behavior.
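To illustrate switching triggers between runs against the same checkpoint, here is a minimal Python sketch. The function name start_with_trigger and its parameters are hypothetical, and a Delta Lake sink is assumed only for the sake of the example. The df argument is assumed to be a streaming DataFrame.

```python
def start_with_trigger(df, checkpoint_path, target_path, interval=None):
    """Start a streaming write, choosing the trigger at launch time.

    `df` is assumed to be a streaming DataFrame. With `interval=None` the run
    uses AvailableNow; otherwise it uses fixed interval micro-batches.
    """
    # The checkpoint location stays the same no matter which trigger is used,
    # so each run resumes from where the previous run stopped.
    writer = (
        df.writeStream
        .format("delta")
        .option("checkpointLocation", checkpoint_path)
    )

    if interval is None:
        writer = writer.trigger(availableNow=True)
    else:
        writer = writer.trigger(processingTime=interval)

    return writer.start(target_path)
```

For example, one run might call start_with_trigger(df, checkpoint_path, target_path, interval="5 minutes"), and after that query stops, a later run might call the same function without interval to process the remaining backlog as an incremental batch.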

Note

If you are trying to recover from a query failure associated with an incremental batch, changing the trigger interval does not solve the problem, because the batch must still be completed. Databricks recommends scaling up the compute capacity used to process the batch to try to resolve the issue. In rare cases, you might need to restart the stream with a new checkpoint.

What is continuous processing mode?

Apache Spark supports an additional trigger mode known as Continuous Processing. This mode has been classified as experimental since Spark 2.3; consult with your Azure Databricks account team to make sure you understand the trade-offs of this processing model.
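For reference only, the sketch below shows how the continuous trigger is requested in open source Apache Spark. The function name start_continuous_query, the rate source, and the console sink are illustrative assumptions. In this mode the duration string is a checkpoint interval, not a micro-batch interval. Treat this as an experiment to discuss with your account team and not as a recommended configuration.

```python
def start_continuous_query(spark, checkpoint_interval="1 second"):
    """Start a query in Spark's experimental continuous processing mode.

    `spark` is assumed to be an existing SparkSession. The source and sink
    are placeholders chosen only to keep the sketch small.
    """
    events = (
        spark.readStream
        .format("rate")
        .option("rowsPerSecond", 5)
        .load()
    )

    # Here the duration controls how often progress is checkpointed.
    return (
        events.writeStream
        .format("console")
        .trigger(continuous=checkpoint_interval)
        .start()
    )
```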

Note that this continuous processing mode does not relate at all to continuous processing as applied in Lakeflow Spark Declarative Pipelines.

Last updated on 2024-03-01