This page provides reference information for real-time mode in Structured Streaming, including supported environments, languages, sources, sinks, and operators. For known limitations, see Real-time mode limitations.
Supported languages
Real-time mode supports Scala, Java, and Python.
Compute types
Real-time mode supports the following compute types:
| Compute type | Supported |
|---|---|
| Dedicated (formerly: single user) | ✓ |
| Standard (formerly: shared) | ✓ (only Python) |
| Lakeflow Spark Declarative Pipelines Classic | Not supported |
| Lakeflow Spark Declarative Pipelines Serverless | Not supported |
| Serverless | Not supported |
Execution modes
Real-time mode supports update mode only:
| Execution mode | Supported |
|---|---|
| Update mode | ✓ |
| Append mode | Not supported |
| Complete mode | Not supported |
Sources and sinks
Real-time mode supports the following sources and sinks:
| Source or sink | As source | As sink |
|---|---|---|
| Apache Kafka | ✓ | ✓ |
| Event Hubs (using Kafka connector) | ✓ | ✓ |
| Kinesis | ✓ (only EFO mode) | Not supported |
| AWS MSK | ✓ | Not supported |
| Delta | Not supported | Not supported |
| Google Pub/Sub | Not supported | Not supported |
| Apache Pulsar | Not supported | Not supported |
| Arbitrary sinks (using forEachWriter) | Not applicable | ✓ |
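As a sketch, a Kafka source and sink might be wired up as follows. This is a configuration sketch under stated assumptions: the broker addresses, topic names, and checkpoint path are placeholders, and the real-time mode trigger/enablement itself is omitted because it is configured separately.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Kafka source: standard Structured Streaming options.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder
    .option("subscribe", "events-in")                   # placeholder topic
    .load()
)

# Kafka sink: requires a target topic and a checkpoint location.
query = (
    events.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("topic", "events-out")                            # placeholder
    .option("checkpointLocation", "/tmp/checkpoints/events")  # placeholder
    .start()
)
```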
Operators
Real-time mode supports most Structured Streaming operators:
Stateless operations
| Operator | Supported |
|---|---|
| Selection | ✓ |
| Projection | ✓ |
UDFs
| Operator | Supported |
|---|---|
| Scala UDF | ✓ (with some limitations) |
| Python UDF | ✓ (with some limitations) |
Aggregation
| Operator | Supported |
|---|---|
| sum | ✓ |
| count | ✓ |
| max | ✓ |
| min | ✓ |
| avg | ✓ |
| Aggregation functions | ✓ |
Windowing
| Operator | Supported |
|---|---|
| Tumbling | ✓ |
| Sliding | ✓ |
| Session | Not supported |
Deduplication
| Operator | Supported |
|---|---|
| dropDuplicates | ✓ (the state is unbounded) |
| dropDuplicatesWithinWatermark | Not supported |
Joins and other operators
| Operator | Supported |
|---|---|
| Broadcast table join (table should be small) | ✓ |
| Stream to stream join | Not supported |
| (flat)MapGroupsWithState | Not supported |
| transformWithState | ✓ (with some differences) |
| union | ✓ (with some limitations) |
| forEach | ✓ |
| forEachBatch | Not supported |
| mapPartitions | Not supported (see Real-time mode limitations) |
Special considerations
Some operators and features have specific considerations or differences when used in real-time mode.
transformWithState in real-time mode
For building custom stateful applications, Databricks supports transformWithState, an API in Apache Spark Structured Streaming. See Build a custom stateful application for more information about the API and code snippets.
However, there are some differences between how the API behaves in real-time mode and traditional streaming queries that leverage the micro-batch architecture.
- Real-time mode calls the `handleInputRows(key: String, inputRows: Iterator[T], timerValues: TimerValues)` method for each row, so the `inputRows` iterator returns a single value. Micro-batch mode calls the method once for each key, and the `inputRows` iterator returns all values for that key in the micro-batch. Account for this difference when writing your code.
- Event time timers are not supported in real-time mode.
- In real-time mode, timers are delayed in firing depending on data arrival:
- If a timer is scheduled for 10:00:00 but no data arrives, the timer doesn't fire immediately.
- If data arrives at 10:00:10, the timer fires with a 10-second delay.
- If no data arrives and the long-running batch is terminating, the timer fires before the batch terminates.
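The per-row calling convention can be modeled in plain Python. This is an illustrative sketch, not the Spark StatefulProcessor API; the class, method, and state names here are hypothetical. The point is that state updates should accumulate into stored state rather than assume the iterator holds all rows for a key:

```python
from typing import Iterator

class CountProcessor:
    """Toy stand-in for a stateful processor that counts rows per key."""

    def __init__(self):
        self.state = {}  # key -> running count, standing in for the state store

    def handle_input_rows(self, key: str, input_rows: Iterator) -> int:
        # Robust pattern: add whatever the iterator yields to the *stored*
        # count. This gives the same result whether the engine passes one
        # row per call (real-time mode) or all rows for the key at once
        # (micro-batch mode).
        self.state[key] = self.state.get(key, 0) + sum(1 for _ in input_rows)
        return self.state[key]

# Micro-batch style: one call per key with all of its rows.
mb = CountProcessor()
mb.handle_input_rows("sensor-1", iter([10, 20, 30]))

# Real-time style: one call per row; the final state is identical.
rt = CountProcessor()
for row in [10, 20, 30]:
    rt.handle_input_rows("sensor-1", iter([row]))

assert mb.state == rt.state == {"sensor-1": 3}
```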
Python UDFs in real-time mode
Databricks supports the majority of Python user-defined functions (UDFs) in real-time mode:
Stateless
| UDF type | Supported |
|---|---|
| Python scalar UDF (User-defined scalar functions - Python) | ✓ |
| Arrow scalar UDF | ✓ |
| Pandas scalar UDF (pandas user-defined functions) | ✓ |
| Arrow function (mapInArrow) | ✓ |
| Pandas function (mapInPandas) | ✓ |
Stateful grouping (UDAF)
| UDF type | Supported |
|---|---|
| transformWithState (only Row interface) | ✓ |
| applyInPandasWithState | Not supported |
Non-stateful grouping (UDAF)
| UDF type | Supported |
|---|---|
| apply | Not supported |
| applyInArrow | Not supported |
| applyInPandas | Not supported |
Table functions
| UDF type | Supported |
|---|---|
| UDTF (Python user-defined table functions (UDTFs)) | Not supported |
| UC UDF | Not supported |
There are several points to consider when using Python UDFs in real-time mode:
- To minimize latency, set the Arrow batch size (`spark.sql.execution.arrow.maxRecordsPerBatch`) to 1.
  - Trade-off: this configuration optimizes for latency at the expense of throughput. For most workloads, this setting is recommended.
- Increase the batch size only if a higher throughput is required to accommodate input volume, accepting the potential increase in latency.
- Pandas UDFs and functions do not perform well with an Arrow batch size of 1.
- If you use pandas UDFs or functions, set the Arrow batch size to a higher value (for example, 100 or higher).
- This implies higher latency. Databricks recommends using an Arrow UDF or function if possible.
- Due to the performance issues with pandas, `transformWithState` is only supported with the `Row` interface.
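The batch-size recommendations above amount to a one-line configuration change. A sketch follows; the session setup is illustrative, and the config key is the standard Spark setting named above:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Latency-first: one record per Arrow batch. Recommended for most
# real-time workloads using Arrow or scalar Python UDFs.
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "1")

# Throughput-first alternative when using pandas UDFs or functions,
# at the cost of higher latency:
# spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "100")
```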