Production considerations for Structured Streaming applications on Azure Databricks
You can configure production incremental processing workloads with Structured Streaming on Azure Databricks to meet latency and cost requirements for real-time or batch applications. Understanding key concepts of Structured Streaming on Azure Databricks can help you avoid common pitfalls as you scale up the volume and velocity of data and move from development to production.
Azure Databricks has introduced Delta Live Tables to reduce the complexities of managing production infrastructure for Structured Streaming workloads. Databricks recommends using Delta Live Tables for new Structured Streaming pipelines; see Delta Live Tables introduction.
Using notebooks for Structured Streaming workloads
Interactive development with Databricks notebooks requires you to attach your notebooks to a cluster in order to execute queries manually. You can schedule Databricks notebooks for automated deployment and automatic recovery from query failure using Workflows.
- Recover from Structured Streaming query failures
- Monitoring Structured Streaming queries on Azure Databricks
- Configure scheduler pools for multiple Structured Streaming workloads on a cluster
You can visualize Structured Streaming queries in notebooks during interactive development, or for interactive monitoring of production workloads. You should only visualize a Structured Streaming query in production if a human will regularly monitor the output of the notebook. While the trigger and checkpointLocation parameters are optional, as a best practice Databricks recommends that you always specify them in production.
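As a minimal sketch of that recommendation, the helper below sets both a checkpoint location and a trigger before starting a query. The function name, the one-minute interval, and the argument names are illustrative assumptions, not values taken from the Databricks documentation; `df` is assumed to be a streaming DataFrame obtained from `spark.readStream`.

```python
def start_production_query(df, checkpoint_path, target_table):
    """Start a streaming write with an explicit checkpoint and trigger.

    `df` is assumed to be a streaming DataFrame; the interval below is
    an arbitrary example value, not a recommendation.
    """
    return (
        df.writeStream
        # Persist progress (and any state) so the query can restart where it left off.
        .option("checkpointLocation", checkpoint_path)
        # State the micro-batch interval explicitly instead of relying on the default.
        .trigger(processingTime="1 minute")
        # toTable starts the query and returns a StreamingQuery handle.
        .toTable(target_table)
    )
```

Each streaming query should normally get its own checkpoint path, so a call might look like `start_production_query(events, "/path/to/checkpoints/events", "events_bronze")`, where both the path and the table name are placeholders.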
Controlling batch size and frequency for Structured Streaming on Azure Databricks
Structured Streaming on Azure Databricks provides enhanced options to help control costs and latency when streaming with Auto Loader and Delta Lake.
- Configure Structured Streaming batch size on Azure Databricks
- Configure Structured Streaming trigger intervals on Azure Databricks
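The two controls named above, batch size and trigger interval, might be combined as in the following sketch. It assumes an Auto Loader source reading JSON files; the function name, the paths, the file format, and the limit of 500 files are hypothetical example values. `cloudFiles.maxFilesPerTrigger` caps how many files each micro-batch reads, and `trigger(availableNow=True)` processes all available data in one or more batches and then stops, which suits scheduled, cost-sensitive jobs.

```python
def build_incremental_load(spark, source_path, schema_path, checkpoint_path, target_table):
    """Read with Auto Loader using a bounded batch size, then stop when caught up.

    All paths, the JSON format, and the 500-file limit are example assumptions.
    """
    stream = (
        spark.readStream
        .format("cloudFiles")                              # Auto Loader source (Databricks only)
        .option("cloudFiles.format", "json")               # format of the incoming files
        .option("cloudFiles.schemaLocation", schema_path)  # where the inferred schema is tracked
        .option("cloudFiles.maxFilesPerTrigger", 500)      # upper bound on files per micro-batch
        .load(source_path)
    )
    return (
        stream.writeStream
        .option("checkpointLocation", checkpoint_path)
        .trigger(availableNow=True)                        # drain the backlog in batches, then stop
        .toTable(target_table)
    )
```

For a continuously running, low-latency query, the same pipeline could instead use a `processingTime` trigger; the trade-off is that the cluster must stay up between batches.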
What is stateful streaming?
A stateful Structured Streaming query requires incremental updates to intermediate state information, whereas a stateless Structured Streaming query only tracks information about which rows have been processed from the source to the sink.
Stateful operations include streaming aggregation, streaming dropDuplicates, stream-stream joins, mapGroupsWithState, and flatMapGroupsWithState.
The intermediate state information required for stateful Structured Streaming queries can lead to unexpected latency and production problems if not configured properly.
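To make the distinction concrete, the sketch below contrasts a stateless transformation with a stateful one. The column names `event_type`, `event_id`, and `event_time`, and the ten-minute watermark, are hypothetical; `events` is assumed to be a streaming DataFrame.

```python
def stateless_filter(events):
    """Stateless: each micro-batch is filtered independently of earlier batches."""
    return events.filter("event_type = 'click'")


def stateful_dedup(events):
    """Stateful: previously seen keys are kept in the state store between batches.

    The watermark tells the engine how late data may arrive, so keys older than
    the threshold can be dropped from state instead of accumulating forever.
    """
    return (
        events
        .withWatermark("event_time", "10 minutes")
        # Including the event-time column in the key lets the watermark expire old state.
        .dropDuplicates(["event_id", "event_time"])
    )
```

Without the watermark, the deduplication state would grow for as long as the query runs, which is one way the latency and stability problems described above can arise.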
- Optimize performance of stateful Structured Streaming queries on Azure Databricks
- Configure RocksDB state store on Azure Databricks
- Enable asynchronous state checkpointing for Structured Streaming
- Control late data threshold for Structured Streaming with multiple watermark policy
- Specify initial state for Structured Streaming mapGroupsWithState
- Test state update function for Structured Streaming mapGroupsWithState
- Enable state rebalancing for Structured Streaming workloads
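Several of the topics listed above are controlled through Spark configuration. The following is a sketch only: the keys and values are given as recalled from the Databricks and Apache Spark documentation and should be verified against the linked articles for your runtime version. The helper name `apply_stateful_confs` is hypothetical, and the settings are assumed to be applied before the streaming query starts.

```python
# Configuration keys as recalled from the Databricks and Spark docs; verify before use.
STATEFUL_STREAMING_CONFS = {
    # Use RocksDB as the state store provider.
    "spark.sql.streaming.stateStore.providerClass":
        "com.databricks.sql.streaming.state.RocksDBStateStoreProvider",
    # Checkpoint state asynchronously to reduce micro-batch latency.
    "spark.databricks.streaming.statefulOperator.asyncCheckpoint.enabled": "true",
    # With several watermarks, follow the fastest stream ("max") rather than the
    # slowest ("min", the default). This can drop more late data from slower streams.
    "spark.sql.streaming.multipleWatermarkPolicy": "max",
}


def apply_stateful_confs(spark, confs=STATEFUL_STREAMING_CONFS):
    """Apply each setting to the session; call this before starting the query."""
    for key, value in confs.items():
        spark.conf.set(key, value)
```

Each setting changes behavior independently, so in practice you would likely pass only the subset your workload needs, for example `apply_stateful_confs(spark, {"spark.sql.streaming.multipleWatermarkPolicy": "max"})`.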